mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2024-12-29 09:16:33 +00:00
Networking changes for 6.3.
Core
----

 - Add dedicated kmem_cache for typical/small skb->head, avoid having
   to access struct page at kfree time, and improve memory use.

 - Introduce sysctl to set default RPS configuration for new netdevs.

 - Define Netlink protocol specification format which can be used
   to describe messages used by each family and auto-generate parsers.
   Add tools for generating kernel data structures and uAPI headers.

 - Expose all net/core sysctls inside netns.

 - Remove 4s sleep in netpoll if carrier is instantly detected on boot.

 - Add configurable limit of MDB entries per port, and port-vlan.

 - Continue populating drop reasons throughout the stack.

 - Retire a handful of legacy Qdiscs and classifiers.

Protocols
---------

 - Support IPv4 big TCP (TSO frames larger than 64kB).

 - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
   on socket by socket basis.

 - Track and report in procfs number of MPTCP sockets used.

 - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
   path manager.

 - IPv6: don't check net.ipv6.route.max_size and rely on garbage
   collection to free memory (similarly to IPv4).

 - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

 - ICMP: add per-rate limit counters.

 - Add support for user scanning requests in ieee802154.

 - Remove static WEP support.

 - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
   reporting.

 - WiFi 7 EHT channel puncturing support (client & AP).

BPF
---

 - Add a rbtree data structure following the "next-gen data structure"
   precedent set by recently added linked list, that is, by using
   kfunc + kptr instead of adding a new BPF map type.

 - Expose XDP hints via kfuncs with initial support for RX hash
   and timestamp metadata.

 - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
   to better support decap on GRE tunnel devices not operating
   in collect metadata.

 - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

 - Remove the need for trace_printk_lock for bpf_trace_printk
   and bpf_trace_vprintk helpers.

 - Extend libbpf's bpf_tracing.h support for tracing arguments
   of kprobes/uprobes and syscall as a special case.

 - Significantly reduce the search time for module symbols
   by livepatch and BPF.

 - Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs in
   different time intervals.

 - Add support for BPF trampoline on s390x and riscv64.

 - Add capability to export the XDP features supported by the NIC.

 - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

 - Add cgroup.memory=nobpf kernel parameter option to disable BPF
   memory accounting for container environments.

Netfilter
---------

 - Remove the CLUSTERIP target. It has been marked as obsolete
   for years, and we still have WARN splats wrt. races of
   the out-of-band /proc interface installed by this target.

 - Add 'destroy' commands to nf_tables. They are identical to
   the existing 'delete' commands, but do not return an error if
   the referenced object (set, chain, rule...) did not exist.

Driver API
----------

 - Improve cpumask_local_spread() locality to help NICs set the right
   IRQ affinity on AMD platforms.

 - Separate C22 and C45 MDIO bus transactions more clearly.

 - Introduce new DCB table to control DSCP rewrite on egress.

 - Support configuration of Physical Layer Collision Avoidance (PLCA)
   Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
   shared medium Ethernet.

 - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
   preemption of low priority frames by high priority frames.

 - Add support for controlling MACSec offload using netlink SET.

 - Rework devlink instance refcounts to allow registration and
   de-registration under the instance lock. Split the code into
   multiple files, drop some of the unnecessarily granular locks and
   factor out common parts of netlink operation handling.

 - Add TX frame aggregation parameters (for USB drivers).

 - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
   messages with notifications for debug.

 - Allow offloading of UDP NEW connections via act_ct.

 - Add support for per action HW stats in TC.

 - Support hardware miss to TC action (continue processing in SW from
   a specific point in the action chain).

 - Warn if old Wireless Extension user space interface is used with
   modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
   for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
   interface instead.

 - Improve the CAN bit timing configuration. Use extack to return error
   messages directly to user space, update the SJW handling, including
   the definition of a new default value that will benefit CAN-FD
   controllers, by increasing their oscillator tolerance.

New hardware / drivers
----------------------

 - Ethernet:
   - nVidia BlueField-3 support (control traffic driver)
   - Ethernet support for imx93 SoCs
   - Motorcomm yt8531 gigabit Ethernet PHY
   - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
   - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
   - Amlogic gxl MDIO mux

 - WiFi:
   - RealTek RTL8188EU (rtl8xxxu)
   - Qualcomm Wi-Fi 7 devices (ath12k)

 - CAN:
   - Renesas R-Car V4H

Drivers
-------

 - Bluetooth:
   - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

 - Ethernet NICs:
   - Intel (1G, igc):
     - support TSN / Qbv / packet scheduling features of i226 model
   - Intel (100G, ice):
     - use GNSS subsystem instead of TTY
     - multi-buffer XDP support
     - extend support for GPIO pins to E823 devices
   - nVidia/Mellanox:
     - update the shared buffer configuration on PFC commands
     - implement PTP adjphase function for HW offset control
     - TC support for Geneve and GRE with VF tunnel offload
     - more efficient crypto key management method
     - multi-port eswitch support
   - Netronome/Corigine:
     - add DCB IEEE support
     - support IPsec offloading for NFP3800
   - Freescale/NXP (enetc):
     - enetc: support XDP_REDIRECT for XDP non-linear buffers
     - enetc: improve reconfig, avoid link flap and waiting for idle
     - enetc: support MAC Merge layer
   - Other NICs:
     - sfc/ef100: add basic devlink support for ef100
     - ionic: rx_push mode operation (writing descriptors via MMIO)
     - bnxt: use the auxiliary bus abstraction for RDMA
     - r8169: disable ASPM and reset bus in case of tx timeout
     - cpsw: support QSGMII mode for J721e CPSW9G
     - cpts: support pulse-per-second output
     - ngbe: add an mdio bus driver
     - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
     - r8152: handle devices with FW with NCM support
     - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
     - virtio-net: support multi buffer XDP
     - virtio/vsock: replace virtio_vsock_pkt with sk_buff
     - tsnep: XDP support

 - Ethernet high-speed switches:
   - nVidia/Mellanox (mlxsw):
     - add support for latency TLV (in FW control messages)
   - Microchip (sparx5):
     - separate explicit and implicit traffic forwarding rules, make
       the implicit rules always active
     - add support for egress DSCP rewrite
     - IS0 VCAP support (Ingress Classification)
     - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
     - ES2 VCAP support (Egress Access Control)
     - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)

 - Ethernet embedded switches:
   - Marvell (mv88e6xxx):
     - add MAB (port auth) offload support
     - enable PTP receive for mv88e6390
   - NXP (ocelot):
     - support MAC Merge layer
     - support for the vsc7512 internal copper phys
   - Microchip:
     - lan9303: convert to PHYLINK
     - lan966x: support TC flower filter statistics
     - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
     - lan937x: support Credit Based Shaper configuration
     - ksz9477: support Energy Efficient Ethernet
   - other:
     - qca8k: convert to regmap read/write API, use bulk operations
     - rswitch: Improve TX timestamp accuracy

 - Intel WiFi (iwlwifi):
   - EHT (Wi-Fi 7) rate reporting
   - STEP equalizer support: transfer some STEP (connection to radio
     on platforms with integrated wifi) related parameters from the
     BIOS to the firmware.

 - Qualcomm 802.11ax WiFi (ath11k):
   - IPQ5018 support
   - Fine Timing Measurement (FTM) responder role support
   - channel 177 support

 - MediaTek WiFi (mt76):
   - per-PHY LED support
   - mt7996: EHT (Wi-Fi 7) support
   - Wireless Ethernet Dispatch (WED) reset support
   - switch to using page pool allocator

 - RealTek WiFi (rtw89):
   - support new version of Bluetooth co-existence

 - Mobile:
   - rmnet: support TX aggregation.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski.

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
commit 5b7c4cabbb
19
Documentation/ABI/testing/sysfs-class-net-peak_usb
Normal file
@@ -0,0 +1,19 @@

What:		/sys/class/net/<iface>/peak_usb/can_channel_id
Date:		November 2022
KernelVersion:	6.2
Contact:	Stephane Grosjean <s.grosjean@peak-system.com>
Description:
		PEAK PCAN-USB devices support user-configurable CAN channel
		identifiers. Contrary to a USB serial number, these identifiers
		are writable and can be set per CAN interface. This means that
		if a USB device exports multiple CAN interfaces, each of them
		can be assigned a unique channel ID.
		This attribute provides read-only access to the currently
		configured value of the channel identifier. Depending on the
		device type, the identifier has a length of 8 or 32 bit. The
		value read from this attribute is always an 8 digit 32 bit
		hexadecimal value in big endian format. If the device only
		supports an 8 bit identifier, the upper 24 bit of the value are
		set to zero.
@@ -557,6 +557,7 @@
 Format: <string>
 nosocket -- Disable socket memory accounting.
 nokmem -- Disable kernel memory accounting.
+nobpf -- Disable BPF memory accounting.

 checkreqprot= [SELINUX] Set initial checkreqprot flag value.
 Format: { "0" | "1" }
@@ -215,6 +215,12 @@ rmem_max

 The maximum receive socket buffer size in bytes.

+rps_default_mask
+----------------
+
+The default RPS CPU mask used on newly created network devices. An empty
+mask means RPS disabled by default.
+
 tstamp_allow_data
 -----------------
 Allow processes to receive tx timestamps looped together with the original
@@ -208,6 +208,10 @@ data structures and compile with kernel internal headers. Both of these
 kernel internals are subject to change and can break with newer kernels
 such that the program needs to be adapted accordingly.

+New BPF functionality is generally added through the use of kfuncs instead of
+new helpers. Kfuncs are not considered part of the stable API, and have their own
+lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.
+
 Q: Are tracepoints part of the stable ABI?
 ------------------------------------------
 A: NO. Tracepoints are tied to internal implementation details hence they are
@@ -236,8 +240,8 @@ A: NO. Classic BPF programs are converted into extend BPF instructions.

 Q: Can BPF call arbitrary kernel functions?
 -------------------------------------------
-A: NO. BPF programs can only call a set of helper functions which
-is defined for every program type.
+A: NO. BPF programs can only call specific functions exposed as BPF helpers or
+kfuncs. The set of available functions is defined for every program type.

 Q: Can BPF overwrite arbitrary kernel memory?
 ---------------------------------------------
@@ -263,7 +267,12 @@ Q: New functionality via kernel modules?
 Q: Can BPF functionality such as new program or map types, new
 helpers, etc be added out of kernel module code?

-A: NO.
+A: Yes, through kfuncs and kptrs
+
+The core BPF functionality such as program types, maps and helpers cannot be
+added to by modules. However, modules can expose functionality to BPF programs
+by exporting kfuncs (which may return pointers to module-internal data
+structures as kptrs).

 Q: Directly calling kernel function is an ABI?
 ----------------------------------------------
@@ -278,7 +287,8 @@ kernel functions have already been used by other kernel tcp
 cc (congestion-control) implementations. If any of these kernel
 functions has changed, both the in-tree and out-of-tree kernel tcp cc
 implementations have to be changed. The same goes for the bpf
-programs and they have to be adjusted accordingly.
+programs and they have to be adjusted accordingly. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.

 Q: Attaching to arbitrary kernel functions is an ABI?
 -----------------------------------------------------
@@ -340,6 +350,7 @@ compatibility for these features?

 A: NO.

-Unlike map value types, there are no stability guarantees for this case. The
-whole API to work with allocated objects and any support for special fields
-inside them is unstable (since it is exposed through kfuncs).
+Unlike map value types, the API to work with allocated objects and any support
+for special fields inside them is exposed through kfuncs, and thus has the same
+lifecycle expectations as the kfuncs themselves. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.
393
Documentation/bpf/cpumasks.rst
Normal file
@@ -0,0 +1,393 @@
.. SPDX-License-Identifier: GPL-2.0

.. _cpumasks-header-label:

==================
BPF cpumask kfuncs
==================

1. Introduction
===============

``struct cpumask`` is a bitmap data structure in the kernel whose indices
reflect the CPUs on the system. Commonly, cpumasks are used to track which CPUs
a task is affinitized to, but they can also be used to e.g. track which cores
are associated with a scheduling domain, which cores on a machine are idle,
etc.

BPF provides programs with a set of :ref:`kfuncs-header-label` that can be
used to allocate, mutate, query, and free cpumasks.

2. BPF cpumask objects
======================

There are two different types of cpumasks that can be used by BPF programs.

2.1 ``struct bpf_cpumask *``
----------------------------

``struct bpf_cpumask *`` is a cpumask that is allocated by BPF, on behalf of a
BPF program, and whose lifecycle is entirely controlled by BPF. These cpumasks
are RCU-protected, can be mutated, can be used as kptrs, and can be safely cast
to a ``struct cpumask *``.

2.1.1 ``struct bpf_cpumask *`` lifecycle
----------------------------------------

A ``struct bpf_cpumask *`` is allocated, acquired, and released, using the
following functions:

.. kernel-doc:: kernel/bpf/cpumask.c
   :identifiers: bpf_cpumask_create

.. kernel-doc:: kernel/bpf/cpumask.c
   :identifiers: bpf_cpumask_acquire

.. kernel-doc:: kernel/bpf/cpumask.c
   :identifiers: bpf_cpumask_release

For example:

.. code-block:: c

	struct cpumask_map_value {
		struct bpf_cpumask __kptr_ref * cpumask;
	};

	struct array_map {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__type(key, int);
		__type(value, struct cpumask_map_value);
		__uint(max_entries, 65536);
	} cpumask_map SEC(".maps");

	static int cpumask_map_insert(struct bpf_cpumask *mask, u32 pid)
	{
		struct cpumask_map_value local, *v;
		long status;
		struct bpf_cpumask *old;
		u32 key = pid;

		local.cpumask = NULL;
		status = bpf_map_update_elem(&cpumask_map, &key, &local, 0);
		if (status) {
			bpf_cpumask_release(mask);
			return status;
		}

		v = bpf_map_lookup_elem(&cpumask_map, &key);
		if (!v) {
			bpf_cpumask_release(mask);
			return -ENOENT;
		}

		old = bpf_kptr_xchg(&v->cpumask, mask);
		if (old)
			bpf_cpumask_release(old);

		return 0;
	}

	/**
	 * A sample tracepoint showing how a task's cpumask can be queried and
	 * recorded as a kptr.
	 */
	SEC("tp_btf/task_newtask")
	int BPF_PROG(record_task_cpumask, struct task_struct *task, u64 clone_flags)
	{
		struct bpf_cpumask *cpumask;
		int ret;

		cpumask = bpf_cpumask_create();
		if (!cpumask)
			return -ENOMEM;

		if (!bpf_cpumask_full(task->cpus_ptr))
			bpf_printk("task %s has CPU affinity", task->comm);

		bpf_cpumask_copy(cpumask, task->cpus_ptr);
		return cpumask_map_insert(cpumask, task->pid);
	}

----

2.1.2 ``struct bpf_cpumask *`` as kptrs
---------------------------------------

As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
a map, the reference can be removed from the map with bpf_kptr_xchg(), or
opportunistically acquired with bpf_cpumask_kptr_get():

.. kernel-doc:: kernel/bpf/cpumask.c
   :identifiers: bpf_cpumask_kptr_get

Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:

.. code-block:: c

	/* struct containing the struct bpf_cpumask kptr which is stored in the map. */
	struct cpumasks_kfunc_map_value {
		struct bpf_cpumask __kptr_ref * bpf_cpumask;
	};

	/* The map containing struct cpumasks_kfunc_map_value entries. */
	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__type(key, int);
		__type(value, struct cpumasks_kfunc_map_value);
		__uint(max_entries, 1);
	} cpumasks_kfunc_map SEC(".maps");

	/* ... */

	/**
	 * A simple example tracepoint program showing how a
	 * struct bpf_cpumask * kptr that is stored in a map can
	 * be acquired using the bpf_cpumask_kptr_get() kfunc.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
	{
		struct bpf_cpumask *kptr;
		struct cpumasks_kfunc_map_value *v;
		u32 key = 0;

		/* Assume a bpf_cpumask * kptr was previously stored in the map. */
		v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
		if (!v)
			return -ENOENT;

		/* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
		kptr = bpf_cpumask_kptr_get(&v->bpf_cpumask);
		if (!kptr)
			/* If no bpf_cpumask was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
			 * above, and our call to bpf_cpumask_kptr_get().
			 * bpf_cpumask_kptr_get() internally safely handles this
			 * race, and will return NULL if the cpumask is no longer
			 * present in the map by the time we invoke the kfunc.
			 */
			return -EBUSY;

		/* Free the reference we just took above. Note that the
		 * original struct bpf_cpumask * kptr is still in the map. It will
		 * be freed either at a later time if another context deletes
		 * it from the map, or automatically by the BPF subsystem if
		 * it's still present when the map is destroyed.
		 */
		bpf_cpumask_release(kptr);

		return 0;
	}

----

2.2 ``struct cpumask``
----------------------

``struct cpumask`` is the object that actually contains the cpumask bitmap
being queried, mutated, etc. A ``struct bpf_cpumask`` wraps a ``struct
cpumask``, which is why it's safe to cast it as such (note however that it is
**not** safe to cast a ``struct cpumask *`` to a ``struct bpf_cpumask *``, and
the verifier will reject any program that tries to do so).

As we'll see below, any kfunc that mutates its cpumask argument will take a
``struct bpf_cpumask *`` as that argument. Any argument that simply queries the
cpumask will instead take a ``struct cpumask *``.

3. cpumask kfuncs
|
||||
=================
|
||||
|
||||
Above, we described the kfuncs that can be used to allocate, acquire, release,
|
||||
etc a ``struct bpf_cpumask *``. This section of the document will describe the
|
||||
kfuncs for mutating and querying cpumasks.
|
||||
|
||||
3.1 Mutating cpumasks
|
||||
---------------------
|
||||
|
||||
Some cpumask kfuncs are "read-only" in that they don't mutate any of their
|
||||
arguments, whereas others mutate at least one argument (which means that the
|
||||
argument must be a ``struct bpf_cpumask *``, as described above).
|
||||
|
||||
This section will describe all of the cpumask kfuncs which mutate at least one
|
||||
argument. :ref:`cpumasks-querying-label` below describes the read-only kfuncs.
|
||||
|
||||
3.1.1 Setting and clearing CPUs
|
||||
-------------------------------
|
||||
|
||||
bpf_cpumask_set_cpu() and bpf_cpumask_clear_cpu() can be used to set and clear
|
||||
a CPU in a ``struct bpf_cpumask`` respectively:
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_set_cpu bpf_cpumask_clear_cpu
|
||||
|
||||
These kfuncs are pretty straightforward, and can be used, for example, as
|
||||
follows:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    /**
     * A sample tracepoint showing how CPUs can be set and cleared in a
     * cpumask, and how a cpumask can be queried.
     */
    SEC("tp_btf/task_newtask")
    int BPF_PROG(test_set_clear_cpu, struct task_struct *task, u64 clone_flags)
    {
        struct bpf_cpumask *cpumask;

        cpumask = bpf_cpumask_create();
        if (!cpumask)
            return -ENOMEM;

        bpf_cpumask_set_cpu(0, cpumask);
        /* Query kfuncs take a struct cpumask *, so cast the bpf_cpumask. */
        if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
            /* Should never happen. */
            goto release_exit;

        bpf_cpumask_clear_cpu(0, cpumask);
        if (bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
            /* Should never happen. */
            goto release_exit;

        /* struct cpumask * pointers such as task->cpus_ptr can also be queried. */
        if (bpf_cpumask_test_cpu(0, task->cpus_ptr))
            bpf_printk("task %s can use CPU %d", task->comm, 0);

    release_exit:
        bpf_cpumask_release(cpumask);
        return 0;
    }
|
||||
|
||||
----
|
||||
|
||||
bpf_cpumask_test_and_set_cpu() and bpf_cpumask_test_and_clear_cpu() are
|
||||
complementary kfuncs that allow callers to atomically test and set (or clear)
|
||||
CPUs:
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_test_and_set_cpu bpf_cpumask_test_and_clear_cpu
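
The test-and-set and test-and-clear semantics can be modeled in ordinary userspace C. The following is only a sketch under stated assumptions: ``model_test_and_set_cpu()`` and ``model_test_and_clear_cpu()`` are hypothetical names, the mask is a single 64-bit word (so ``cpu`` must be below 64), and C11 atomics stand in for the kernel's bit operations. It is not the kernel implementation of the kfuncs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model: atomically set bit `cpu` in `mask` and
 * report whether it was already set. Assumes cpu < 64.
 */
static bool model_test_and_set_cpu(unsigned int cpu, _Atomic uint64_t *mask)
{
	uint64_t bit = UINT64_C(1) << cpu;

	/* atomic_fetch_or() returns the value held before the OR. */
	return (atomic_fetch_or(mask, bit) & bit) != 0;
}

/*
 * Hypothetical userspace model: atomically clear bit `cpu` in `mask` and
 * report whether it was set beforehand. Assumes cpu < 64.
 */
static bool model_test_and_clear_cpu(unsigned int cpu, _Atomic uint64_t *mask)
{
	uint64_t bit = UINT64_C(1) << cpu;

	/* atomic_fetch_and() returns the value held before the AND. */
	return (atomic_fetch_and(mask, ~bit) & bit) != 0;
}
```

In this model the return value always reflects the state of the bit before the call, which is what lets a caller both claim and test a CPU in one step without a separate query.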
|
||||
|
||||
----
|
||||
|
||||
We can also set and clear entire ``struct bpf_cpumask *`` objects in one
|
||||
operation using bpf_cpumask_setall() and bpf_cpumask_clear():
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_setall bpf_cpumask_clear
|
||||
|
||||
3.1.2 Operations between cpumasks
|
||||
---------------------------------
|
||||
|
||||
In addition to setting and clearing individual CPUs in a single cpumask,
|
||||
callers can also perform bitwise operations between multiple cpumasks using
|
||||
bpf_cpumask_and(), bpf_cpumask_or(), and bpf_cpumask_xor():
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_and bpf_cpumask_or bpf_cpumask_xor
|
||||
|
||||
The following is an example of how they may be used. Note that some of the
|
||||
kfuncs shown in this example will be covered in more detail below.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    /**
     * A sample tracepoint showing how a cpumask can be mutated using
     * bitwise operators (and queried).
     */
    SEC("tp_btf/task_newtask")
    int BPF_PROG(test_and_or_xor, struct task_struct *task, u64 clone_flags)
    {
        struct bpf_cpumask *mask1, *mask2, *dst1, *dst2;

        mask1 = bpf_cpumask_create();
        if (!mask1)
            return -ENOMEM;

        mask2 = bpf_cpumask_create();
        if (!mask2) {
            bpf_cpumask_release(mask1);
            return -ENOMEM;
        }

        /* ...Safely create the other two masks... */

        bpf_cpumask_set_cpu(0, mask1);
        bpf_cpumask_set_cpu(1, mask2);
        bpf_cpumask_and(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
        if (!bpf_cpumask_empty((const struct cpumask *)dst1))
            /* Should never happen. */
            goto release_exit;

        bpf_cpumask_or(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
        if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)dst1))
            /* Should never happen. */
            goto release_exit;

        if (!bpf_cpumask_test_cpu(1, (const struct cpumask *)dst1))
            /* Should never happen. */
            goto release_exit;

        bpf_cpumask_xor(dst2, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
        if (!bpf_cpumask_equal((const struct cpumask *)dst1,
                               (const struct cpumask *)dst2))
            /* Should never happen. */
            goto release_exit;

    release_exit:
        bpf_cpumask_release(mask1);
        bpf_cpumask_release(mask2);
        bpf_cpumask_release(dst1);
        bpf_cpumask_release(dst2);
        return 0;
    }
|
||||
|
||||
----
|
||||
|
||||
The contents of an entire cpumask may be copied to another using
|
||||
bpf_cpumask_copy():
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_copy
|
||||
|
||||
----
|
||||
|
||||
.. _cpumasks-querying-label:
|
||||
|
||||
3.2 Querying cpumasks
|
||||
---------------------
|
||||
|
||||
In addition to the above kfuncs, there is also a set of read-only kfuncs that
|
||||
can be used to query the contents of cpumasks.
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset
|
||||
bpf_cpumask_empty bpf_cpumask_full
|
||||
|
||||
.. kernel-doc:: kernel/bpf/cpumask.c
|
||||
:identifiers: bpf_cpumask_any bpf_cpumask_any_and
|
||||
|
||||
----
|
||||
|
||||
Some example usages of these querying kfuncs were shown above. We will not
|
||||
replicate those examples here. Note, however, that all of the aforementioned
|
||||
kfuncs are tested in `tools/testing/selftests/bpf/progs/cpumask_success.c`_, so
|
||||
please take a look there if you're looking for more examples of how they can be
|
||||
used.
|
||||
|
||||
.. _tools/testing/selftests/bpf/progs/cpumask_success.c:
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
|
||||
|
||||
|
||||
4. Adding BPF cpumask kfuncs
|
||||
============================
|
||||
|
||||
The set of supported BPF cpumask kfuncs is not (yet) a 1-1 match with the
|
||||
cpumask operations in include/linux/cpumask.h. Any of those cpumask operations
|
||||
could easily be encapsulated in a new kfunc if and when required. If you'd like
|
||||
to support a new cpumask operation, please feel free to submit a patch. If you
|
||||
do add a new cpumask kfunc, please document it here, and add any relevant
|
||||
selftest testcases to the cpumask selftest suite.
|
267
Documentation/bpf/graph_ds_impl.rst
Normal file
@ -0,0 +1,267 @@
|
||||
=========================
|
||||
BPF Graph Data Structures
|
||||
=========================
|
||||
|
||||
This document describes implementation details of new-style "graph" data
|
||||
structures (linked_list, rbtree), with particular focus on the verifier's
|
||||
implementation of semantics specific to those data structures.
|
||||
|
||||
Although no specific verifier code is referred to in this document, the document
|
||||
assumes that the reader has general knowledge of BPF verifier internals, BPF
|
||||
maps, and BPF program writing.
|
||||
|
||||
Note that the intent of this document is to describe the current state of
|
||||
these graph data structures. **No guarantees** of stability for either
|
||||
semantics or APIs are made or implied here.
|
||||
|
||||
.. contents::
|
||||
:local:
|
||||
:depth: 2
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
The BPF map API has historically been the main way to expose data structures
|
||||
of various types for use within BPF programs. Some data structures fit naturally
|
||||
with the map API (HASH, ARRAY), others less so. Consequently, programs
|
||||
interacting with the latter group of data structures can be hard to parse
|
||||
for kernel programmers without previous BPF experience.
|
||||
|
||||
Luckily, some restrictions which necessitated the use of BPF map semantics are
|
||||
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
|
||||
BPF allocator, it is now possible to implement BPF data structures whose API
|
||||
and semantics more closely match those exposed to the rest of the kernel.
|
||||
|
||||
Two such data structures - linked_list and rbtree - have many verification
|
||||
details in common. Because both have "root"s ("head" for linked_list) and
|
||||
"node"s, the verifier code and this document refer to common functionality
|
||||
as "graph_api", "graph_root", "graph_node", etc.
|
||||
|
||||
Unless otherwise stated, examples and semantics below apply to both graph data
|
||||
structures.
|
||||
|
||||
Unstable API
|
||||
------------
|
||||
|
||||
Data structures implemented using the BPF map API have historically used BPF
|
||||
helper functions - either standard map API helpers like ``bpf_map_update_elem``
|
||||
or map-specific helpers. The new-style graph data structures instead use kfuncs
|
||||
to define their manipulation helpers. Because there are no stability guarantees
|
||||
for kfuncs, the API and semantics for these data structures can be evolved in
|
||||
a way that breaks backwards compatibility if necessary.
|
||||
|
||||
Root and node types for the new data structures are opaquely defined in the
|
||||
``uapi/linux/bpf.h`` header.
|
||||
|
||||
Locking
|
||||
-------
|
||||
|
||||
The new-style data structures are intrusive and are defined similarly to their
|
||||
vanilla kernel counterparts:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    struct node_data {
        long key;
        long data;
        struct bpf_rb_node node;
    };

    struct bpf_spin_lock glock;
    struct bpf_rb_root groot __contains(node_data, node);
|
||||
|
||||
The "root" type for both linked_list and rbtree expects to be in a map_value
|
||||
which also contains a ``bpf_spin_lock`` - in the above example both global
|
||||
variables are placed in a single-value arraymap. The verifier considers this
|
||||
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
|
||||
the same map_value and will enforce that the correct lock is held when
|
||||
verifying BPF programs that manipulate the tree. Since this lock checking
|
||||
happens at verification time, there is no runtime penalty.
|
||||
|
||||
Non-owning references
|
||||
---------------------
|
||||
|
||||
**Motivation**
|
||||
|
||||
Consider the following BPF code:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

    bpf_spin_lock(&lock);

    bpf_rbtree_add(&tree, n); /* PASSED */

    bpf_spin_unlock(&lock);
|
||||
|
||||
From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
|
||||
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
|
||||
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
|
||||
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
|
||||
The BPF program must pass off ownership before exiting - either via
|
||||
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
|
||||
``bpf_rbtree_add``.
|
||||
|
||||
(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
|
||||
"ownership is acquired" and "ownership is passed", respectively)
|
||||
|
||||
What should the verifier do with ``n`` after ownership is passed off? If the
|
||||
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
|
||||
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
|
||||
the object is no longer valid. The underlying memory may have been reused for
|
||||
some other allocation, unmapped, etc.
|
||||
|
||||
When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
|
||||
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
|
||||
but that would result in programs with useful, common coding patterns being
|
||||
rejected, e.g.:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    int x;
    struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

    bpf_spin_lock(&lock);

    bpf_rbtree_add(&tree, n); /* PASSED */
    x = n->data;
    n->data = 42;

    bpf_spin_unlock(&lock);
|
||||
|
||||
Both the read from and write to ``n->data`` would be rejected. The verifier
|
||||
can do better, though, by taking advantage of two details:
|
||||
|
||||
* Graph data structure APIs can only be used when the ``bpf_spin_lock``
|
||||
associated with the graph root is held
|
||||
|
||||
* Both graph data structures have pointer stability
|
||||
|
||||
* Because graph nodes are allocated with ``bpf_obj_new`` and
|
||||
adding / removing from the root involves fiddling with the
|
||||
``bpf_{list,rb}_node`` field of the node struct, a graph node will
|
||||
remain at the same address after either operation.
|
||||
|
||||
Because the associated ``bpf_spin_lock`` must be held by any program adding
|
||||
or removing, if we're in the critical section bounded by that lock, we know
|
||||
that no other program can add or remove until the end of the critical section.
|
||||
This combined with pointer stability means that, until the critical section
|
||||
ends, we can safely access the graph node through ``n`` even after it was used
|
||||
to pass ownership.
|
||||
|
||||
The verifier considers such a reference a *non-owning reference*. The ref
|
||||
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
|
||||
Both terms currently only have meaning in the context of graph nodes and API.
|
||||
|
||||
**Details**
|
||||
|
||||
Let's enumerate the properties of both types of references.
|
||||
|
||||
*owning reference*
|
||||
|
||||
* This reference controls the lifetime of the pointee
|
||||
|
||||
* Ownership of pointee must be 'released' by passing it to some graph API
|
||||
kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
|
||||
|
||||
* If not released before program ends, verifier considers program invalid
|
||||
|
||||
* Access to the pointee's memory will not page fault
|
||||
|
||||
*non-owning reference*
|
||||
|
||||
* This reference does not own the pointee
|
||||
|
||||
* It cannot be used to add the graph node to a graph root, nor ``free``'d via
|
||||
``bpf_obj_drop``
|
||||
|
||||
* No explicit control of lifetime, but can infer valid lifetime based on
|
||||
non-owning ref existence (see explanation below)
|
||||
|
||||
* Access to the pointee's memory will not page fault
|
||||
|
||||
From the verifier's perspective non-owning references can only exist
|
||||
between spin_lock and spin_unlock. Why? After spin_unlock another program
|
||||
can do arbitrary operations on the data structure like removing and ``free``-ing
|
||||
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
|
||||
``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
|
||||
Or the memory could go away.
|
||||
|
||||
To prevent this logic violation all non-owning references are invalidated by the
|
||||
verifier after a critical section ends. This is necessary to ensure the "will
|
||||
not page fault" property of non-owning references. So if the verifier hasn't
|
||||
invalidated a non-owning ref, accessing it will not page fault.
|
||||
|
||||
Currently ``bpf_obj_drop`` is not allowed in the critical section, so
|
||||
if there's a valid non-owning ref, we must be in a critical section, and can
|
||||
conclude that the ref's memory hasn't been dropped-and- ``free``'d or
|
||||
dropped-and-reused.
|
||||
|
||||
Any reference to a node that is in an rbtree *must* be non-owning, since
|
||||
the tree has control of the pointee's lifetime. Similarly, any ref to a node
|
||||
that isn't in an rbtree *must* be owning. This results in a nice property:
|
||||
graph API add / remove implementations don't need to check if a node
|
||||
has already been added (or already removed), as the ownership model
|
||||
allows the verifier to prevent such a state from being valid by simply checking
|
||||
types.
|
||||
|
||||
However, pointer aliasing poses an issue for the above "nice property".
|
||||
Consider the following example:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
    struct node_data *n, *m, *o, *p;
    n = bpf_obj_new(typeof(*n));     /* 1 */

    bpf_spin_lock(&lock);

    bpf_rbtree_add(&tree, n);        /* 2 */
    m = bpf_rbtree_first(&tree);     /* 3 */

    o = bpf_rbtree_remove(&tree, n); /* 4 */
    p = bpf_rbtree_remove(&tree, m); /* 5 */

    bpf_spin_unlock(&lock);

    bpf_obj_drop(o);
    bpf_obj_drop(p); /* 6 */
|
||||
|
||||
Assume the tree is empty before this program runs. If we track verifier state
|
||||
changes here using numbers in above comments:
|
||||
|
||||
1) n is an owning reference
|
||||
|
||||
2) n is a non-owning reference, it's been added to the tree
|
||||
|
||||
3) n and m are non-owning references, they both point to the same node
|
||||
|
||||
4) o is an owning reference, n and m non-owning, all point to same node
|
||||
|
||||
5) o and p are owning, n and m non-owning, all point to the same node
|
||||
|
||||
6) a double-free has occurred, since o and p point to same node and o was
|
||||
``free``'d in previous statement
|
||||
|
||||
States 4 and 5 violate our "nice property", as there are non-owning refs to
|
||||
a node which is not in an rbtree. Statement 5 will try to remove a node which
|
||||
has already been removed as a result of this violation. State 6 is a dangerous
|
||||
double-free.
|
||||
|
||||
At a minimum we should prevent state 6 from being possible. If we can't also
|
||||
prevent state 5 then we must abandon our "nice property" and check whether a
|
||||
node has already been removed at runtime.
|
||||
|
||||
We prevent both by generalizing the "invalidate non-owning references" behavior
|
||||
of ``bpf_spin_unlock`` and doing similar invalidation after
|
||||
``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
|
||||
|
||||
* takes an arbitrary node argument
|
||||
|
||||
* removes it from the data structure
|
||||
|
||||
* returns an owning reference to the removed node
|
||||
|
||||
may result in a state where some other non-owning reference points to the same
|
||||
node. So ``remove``-type kfuncs must be considered a non-owning reference
|
||||
invalidation point as well.
|
@ -20,6 +20,7 @@ that goes into great technical depth about the BPF Architecture.
|
||||
syscall_api
|
||||
helpers
|
||||
kfuncs
|
||||
cpumasks
|
||||
programs
|
||||
maps
|
||||
bpf_prog_run
|
||||
|
@ -7,6 +7,11 @@ eBPF Instruction Set Specification, v1.0
|
||||
|
||||
This document specifies version 1.0 of the eBPF instruction set.
|
||||
|
||||
Documentation conventions
|
||||
=========================
|
||||
|
||||
For brevity, this document uses the type notation "u64", "u32", etc.
|
||||
to mean an unsigned integer whose width is the specified number of bits.
|
||||
|
||||
Registers and calling convention
|
||||
================================
|
||||
@ -30,20 +35,56 @@ Instruction encoding
|
||||
eBPF has two instruction encodings:
|
||||
|
||||
* the basic instruction encoding, which uses 64 bits to encode an instruction
|
||||
* the wide instruction encoding, which appends a second 64-bit immediate value
|
||||
(imm64) after the basic instruction for a total of 128 bits.
|
||||
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
|
||||
constant) value after the basic instruction for a total of 128 bits.
|
||||
|
||||
The basic instruction encoding looks as follows:
|
||||
The basic instruction encoding is as follows, where MSB and LSB mean the most significant
|
||||
bits and least significant bits, respectively:
|
||||
|
||||
============= ======= =============== ==================== ============
|
||||
32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
|
||||
============= ======= =============== ==================== ============
|
||||
immediate offset source register destination register opcode
|
||||
============= ======= =============== ==================== ============
|
||||
============= ======= ======= ======= ============
|
||||
32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
|
||||
============= ======= ======= ======= ============
|
||||
imm offset src_reg dst_reg opcode
|
||||
============= ======= ======= ======= ============
|
||||
|
||||
**imm**
|
||||
signed integer immediate value
|
||||
|
||||
**offset**
|
||||
signed integer offset used with pointer arithmetic
|
||||
|
||||
**src_reg**
|
||||
the source register number (0-10), except where otherwise specified
|
||||
(`64-bit immediate instructions`_ reuse this field for other purposes)
|
||||
|
||||
**dst_reg**
|
||||
destination register number (0-10)
|
||||
|
||||
**opcode**
|
||||
operation to perform
|
||||
|
||||
Note that most instructions do not use all of the fields.
|
||||
Unused fields shall be cleared to zero.
|
||||
|
||||
As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
|
||||
instruction uses a 64-bit immediate value that is constructed as follows.
|
||||
The 64 bits following the basic instruction contain a pseudo instruction
|
||||
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
|
||||
and imm containing the high 32 bits of the immediate value.
|
||||
|
||||
================= ==================
|
||||
64 bits (MSB) 64 bits (LSB)
|
||||
================= ==================
|
||||
basic instruction pseudo instruction
|
||||
================= ==================
|
||||
|
||||
Thus the 64-bit immediate value is constructed as follows:
|
||||
|
||||
imm64 = (next_imm << 32) | imm
|
||||
|
||||
where 'next_imm' refers to the imm value of the pseudo instruction
|
||||
following the basic instruction.
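
The construction of the 64-bit immediate can be sketched in plain C. This is an illustrative model, not code from the kernel: ``model_imm64()`` is a hypothetical name, and its two parameters stand for the ``imm`` fields of the basic instruction and of the pseudo instruction that follows it. Because ``imm`` is a signed 32-bit field, the sketch converts each half to an unsigned 32-bit value first so that a negative low half does not sign-extend into the high half.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine the imm field of the basic instruction
 * (low 32 bits) with the imm field of the following pseudo instruction
 * (high 32 bits), i.e. imm64 = (next_imm << 32) | imm.
 */
static uint64_t model_imm64(int32_t imm, int32_t next_imm)
{
	/* Go through uint32_t so neither half is sign-extended. */
	uint64_t lo = (uint64_t)(uint32_t)imm;
	uint64_t hi = (uint64_t)(uint32_t)next_imm;

	return (hi << 32) | lo;
}
```

For instance, under this model an ``imm`` of -1 with a ``next_imm`` of 0 yields 0x00000000ffffffff rather than an all-ones value.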
|
||||
|
||||
Instruction classes
|
||||
-------------------
|
||||
|
||||
@ -71,27 +112,32 @@ For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` an
|
||||
============== ====== =================
|
||||
4 bits (MSB) 1 bit 3 bits (LSB)
|
||||
============== ====== =================
|
||||
operation code source instruction class
|
||||
code source instruction class
|
||||
============== ====== =================
|
||||
|
||||
The 4th bit encodes the source operand:
|
||||
**code**
|
||||
the operation code, whose meaning varies by instruction class
|
||||
|
||||
====== ===== ========================================
|
||||
**source**
|
||||
the source operand location, which unless otherwise specified is one of:
|
||||
|
||||
====== ===== ==============================================
|
||||
source value description
|
||||
====== ===== ========================================
|
||||
BPF_K 0x00 use 32-bit immediate as source operand
|
||||
BPF_X 0x08 use 'src_reg' register as source operand
|
||||
====== ===== ========================================
|
||||
|
||||
The four MSB bits store the operation code.
|
||||
====== ===== ==============================================
|
||||
BPF_K 0x00 use 32-bit 'imm' value as source operand
|
||||
BPF_X 0x08 use 'src_reg' register value as source operand
|
||||
====== ===== ==============================================
|
||||
|
||||
**instruction class**
|
||||
the instruction class (see `Instruction classes`_)
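
The three-way split of the 8-bit opcode for arithmetic and jump instructions can be sketched as follows. The mask values follow directly from the field widths in the table above (4 bits, 1 bit, 3 bits); the names ``MODEL_*``, ``struct model_opcode`` and ``model_split_opcode()`` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Masks derived from the layout: 4-bit code, 1-bit source, 3-bit class. */
#define MODEL_CODE_MASK   0xf0
#define MODEL_SOURCE_MASK 0x08
#define MODEL_CLASS_MASK  0x07

/* Hypothetical container for the decoded opcode fields. */
struct model_opcode {
	uint8_t code;       /* operation code, e.g. 0xb0 for BPF_MOV */
	uint8_t source;     /* 0x00 for BPF_K, 0x08 for BPF_X */
	uint8_t insn_class; /* instruction class */
};

/* Split an arithmetic or jump opcode into its three fields. */
static struct model_opcode model_split_opcode(uint8_t opcode)
{
	struct model_opcode out = {
		.code = opcode & MODEL_CODE_MASK,
		.source = opcode & MODEL_SOURCE_MASK,
		.insn_class = opcode & MODEL_CLASS_MASK,
	};

	return out;
}
```

The fields are kept in their in-place bit positions rather than shifted down, so they compare directly against the values listed in the tables (for example ``source == 0x08`` for ``BPF_X``).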
|
||||
|
||||
Arithmetic instructions
|
||||
-----------------------
|
||||
|
||||
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
|
||||
otherwise identical operations.
|
||||
The 'code' field encodes the operation as below:
|
||||
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
|
||||
to the values of the source and destination registers, respectively.
|
||||
|
||||
======== ===== ==========================================================
|
||||
code value description
|
||||
@ -99,35 +145,49 @@ code value description
|
||||
BPF_ADD 0x00 dst += src
|
||||
BPF_SUB 0x10 dst -= src
|
||||
BPF_MUL 0x20 dst \*= src
|
||||
BPF_DIV 0x30 dst /= src
|
||||
BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0
|
||||
BPF_OR 0x40 dst \|= src
|
||||
BPF_AND 0x50 dst &= src
|
||||
BPF_LSH 0x60 dst <<= src
|
||||
BPF_RSH 0x70 dst >>= src
|
||||
BPF_NEG 0x80 dst = -dst
|
||||
BPF_MOD 0x90 dst %= src
|
||||
BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst
|
||||
BPF_XOR 0xa0 dst ^= src
|
||||
BPF_MOV 0xb0 dst = src
|
||||
BPF_ARSH 0xc0 sign extending shift right
|
||||
BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
|
||||
======== ===== ==========================================================
|
||||
|
||||
Underflow and overflow are allowed during arithmetic operations, meaning
|
||||
the 64-bit or 32-bit value will wrap. If eBPF program execution would
|
||||
result in division by zero, the destination register is instead set to zero.
|
||||
If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
|
||||
the destination register is unchanged whereas for ``BPF_ALU`` the upper
|
||||
32 bits of the destination register are zeroed.
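
The wrapping and division-by-zero rules can be modeled in userspace C. The sketch below is an illustration only: the ``model_*`` function names are hypothetical, registers are represented as ``uint64_t`` values, and the sketch assumes that the 32-bit variants test the low 32 bits of the source operand for zero.

```c
#include <stdint.h>

/* BPF_ALU64 division: unsigned, and division by zero yields 0. */
static uint64_t model_alu64_div(uint64_t dst, uint64_t src)
{
	return src != 0 ? dst / src : 0;
}

/* BPF_ALU64 modulo: unsigned, and modulo by zero leaves dst unchanged. */
static uint64_t model_alu64_mod(uint64_t dst, uint64_t src)
{
	return src != 0 ? dst % src : dst;
}

/* BPF_ALU division: operates on the low 32 bits, upper 32 bits zeroed. */
static uint64_t model_alu32_div(uint64_t dst, uint64_t src)
{
	uint32_t d = (uint32_t)dst, s = (uint32_t)src;

	return s != 0 ? d / s : 0;
}

/* BPF_ALU modulo: on modulo by zero only the upper 32 bits are zeroed. */
static uint64_t model_alu32_mod(uint64_t dst, uint64_t src)
{
	uint32_t d = (uint32_t)dst, s = (uint32_t)src;

	return s != 0 ? d % s : d;
}

/* BPF_ALU addition: the 32-bit result wraps and is zero-extended. */
static uint64_t model_alu32_add(uint64_t dst, uint64_t src)
{
	return (uint32_t)((uint32_t)dst + (uint32_t)src);
}
```

Unsigned C arithmetic wraps on overflow by definition, so the model needs no explicit overflow handling.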
|
||||
|
||||
``BPF_ADD | BPF_X | BPF_ALU`` means::
|
||||
|
||||
dst_reg = (u32) dst_reg + (u32) src_reg;
|
||||
dst = (u32) ((u32) dst + (u32) src)
|
||||
|
||||
where '(u32)' indicates that the upper 32 bits are zeroed.
|
||||
|
||||
``BPF_ADD | BPF_X | BPF_ALU64`` means::
|
||||
|
||||
dst_reg = dst_reg + src_reg
|
||||
dst = dst + src
|
||||
|
||||
``BPF_XOR | BPF_K | BPF_ALU`` means::
|
||||
|
||||
dst_reg = (u32) dst_reg ^ (u32) imm32
|
||||
dst = (u32) dst ^ (u32) imm32
|
||||
|
||||
``BPF_XOR | BPF_K | BPF_ALU64`` means::
|
||||
|
||||
dst_reg = dst_reg ^ imm32
|
||||
dst = dst ^ imm32
|
||||
|
||||
Also note that the division and modulo operations are unsigned. Thus, for
|
||||
``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas
|
||||
for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result
|
||||
interpreted as an unsigned 64-bit value. There are no instructions for
|
||||
signed division or modulo.
|
||||
|
||||
Byte swap instructions
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
@ -155,11 +215,11 @@ Examples:
|
||||
|
||||
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
|
||||
|
||||
dst_reg = htole16(dst_reg)
|
||||
dst = htole16(dst)
|
||||
|
||||
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
|
||||
|
||||
dst_reg = htobe64(dst_reg)
|
||||
dst = htobe64(dst)
|
||||
|
||||
Jump instructions
|
||||
-----------------
|
||||
@ -234,15 +294,15 @@ instructions that transfer data between a register and memory.
|
||||
|
||||
``BPF_MEM | <size> | BPF_STX`` means::
|
||||
|
||||
*(size *) (dst_reg + off) = src_reg
|
||||
*(size *) (dst + offset) = src
|
||||
|
||||
``BPF_MEM | <size> | BPF_ST`` means::
|
||||
|
||||
*(size *) (dst_reg + off) = imm32
|
||||
*(size *) (dst + offset) = imm32
|
||||
|
||||
``BPF_MEM | <size> | BPF_LDX`` means::
|
||||
|
||||
dst_reg = *(size *) (src_reg + off)
|
||||
dst = *(size *) (src + offset)
|
||||
|
||||
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
|
||||
|
||||
@ -276,11 +336,11 @@ BPF_XOR 0xa0 atomic xor
|
||||
|
||||
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
|
||||
|
||||
*(u32 *)(dst_reg + off16) += src_reg
|
||||
*(u32 *)(dst + offset) += src
|
||||
|
||||
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::
|
||||
|
||||
*(u64 *)(dst_reg + off16) += src_reg
|
||||
*(u64 *)(dst + offset) += src
|
||||
|
||||
In addition to the simple atomic operations, there also is a modifier and
|
||||
two complex atomic operations:
|
||||
@ -295,16 +355,16 @@ BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
|
||||
|
||||
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
|
||||
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
|
||||
is set, then the operation also overwrites ``src_reg`` with the value that
|
||||
is set, then the operation also overwrites ``src`` with the value that
|
||||
was in memory before it was modified.
|
||||
|
||||
The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
|
||||
addressed by ``dst_reg + off``.
|
||||
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
|
||||
addressed by ``dst + offset``.
|
||||
|
||||
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
|
||||
``dst_reg + off`` with ``R0``. If they match, the value addressed by
|
||||
``dst_reg + off`` is replaced with ``src_reg``. In either case, the
|
||||
value that was at ``dst_reg + off`` before the operation is zero-extended
|
||||
``dst + offset`` with ``R0``. If they match, the value addressed by
|
||||
``dst + offset`` is replaced with ``src``. In either case, the
|
||||
value that was at ``dst + offset`` before the operation is zero-extended
|
||||
and loaded back to ``R0``.
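
The compare-and-exchange behavior for a 32-bit operand can be modeled with C11 atomics. This is a userspace sketch under stated assumptions, not the kernel or JIT implementation: ``model_cmpxchg32()`` is a hypothetical name, ``addr`` stands for the memory at ``dst + offset``, and the returned value stands for the new contents of ``R0``.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical model of a 32-bit BPF_CMPXCHG: compare *addr with the low
 * 32 bits of r0; if they match, store src. Either way, return the value
 * that was in memory before the operation, zero-extended to 64 bits.
 */
static uint64_t model_cmpxchg32(_Atomic uint32_t *addr, uint64_t r0,
				uint32_t src)
{
	uint32_t expected = (uint32_t)r0;

	/*
	 * On success `expected` still equals the old memory value; on
	 * failure the call overwrites it with the current memory value.
	 */
	atomic_compare_exchange_strong(addr, &expected, src);

	return (uint64_t)expected;
}
```

A caller in this model can tell whether the exchange happened by comparing the returned value with the low 32 bits of the ``r0`` it passed in.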
|
||||
|
||||
64-bit immediate instructions
|
||||
@ -317,7 +377,7 @@ There is currently only one such instruction.
|
||||
|
||||
``BPF_LD | BPF_DW | BPF_IMM`` means::
|
||||
|
||||
dst_reg = imm64
|
||||
dst = imm64
|
||||
|
||||
|
||||
Legacy BPF Packet access instructions
|
||||
|
@ -1,3 +1,7 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
.. _kfuncs-header-label:
|
||||
|
||||
=============================
|
||||
BPF Kernel Functions (kfuncs)
|
||||
=============================
|
||||
@ -9,7 +13,7 @@ BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux
|
||||
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
|
||||
kfuncs do not have a stable interface and can change from one kernel release to
|
||||
another. Hence, BPF programs need to be updated in response to changes in the
|
||||
kernel.
|
||||
kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.
|
||||
|
||||
2. Defining a kfunc
|
||||
===================
|
||||
@ -37,7 +41,7 @@ An example is given below::
|
||||
__diag_ignore_all("-Wmissing-prototypes",
|
||||
"Global kfuncs as their definitions will be in BTF");
|
||||
|
||||
struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
|
||||
__bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
|
||||
{
|
||||
return find_get_task_by_vpid(nr);
|
||||
}
|
||||
@ -62,7 +66,7 @@ kfunc with a __tag, where tag may be one of the supported annotations.
|
||||
This annotation is used to indicate a memory and size pair in the argument list.
|
||||
An example is given below::
|
||||
|
||||
void bpf_memzero(void *mem, int mem__sz)
|
||||
__bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
|
||||
{
|
||||
...
|
||||
}
|
||||
@ -82,7 +86,7 @@ safety of the program.
|
||||
|
||||
An example is given below::
|
||||
|
||||
void *bpf_obj_new(u32 local_type_id__k, ...)
|
||||
__bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
|
||||
{
|
||||
...
|
||||
}
|
||||
@ -121,6 +125,20 @@ flags on a set of kfuncs as follows::
|
||||
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
|
||||
along with it. Of course, it is also allowed to specify no flags.
|
||||
|
||||
kfunc definitions should also always be annotated with the ``__bpf_kfunc``
|
||||
macro. This prevents issues such as the compiler inlining the kfunc if it's a
|
||||
static kernel function, or the function being elided in an LTO build as it's
|
||||
not used in the rest of the kernel. Developers should not manually add
|
||||
annotations to their kfunc to prevent these issues. If an annotation is
|
||||
required to prevent such an issue with your kfunc, it is a bug and should be
|
||||
added to the definition of the macro so that other kfuncs are similarly
|
||||
protected. An example is given below::
|
||||
|
||||
__bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
|
||||
{
|
||||
...
|
||||
}
|
||||
|
||||
2.4.1 KF_ACQUIRE flag
|
||||
---------------------
|
||||
|
||||
@ -163,7 +181,8 @@ KF_ACQUIRE and KF_RET_NULL flags.
|
||||
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
|
||||
indicates that all pointer arguments are valid, and that all pointers to
|
||||
BTF objects have been passed in their unmodified form (that is, at a zero
|
||||
offset, and without having been obtained from walking another pointer).
|
||||
offset, and without having been obtained from walking another pointer, with one
|
||||
exception described below).
|
||||
|
||||
There are two types of pointers to kernel objects which are considered "valid":
|
||||
|
||||
@ -176,6 +195,25 @@ KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
|
||||
The definition of "valid" pointers is subject to change at any time, and has
|
||||
absolutely no ABI stability guarantees.
|
||||
|
||||
As mentioned above, a nested pointer obtained from walking a trusted pointer is
|
||||
no longer trusted, with one exception. If a struct type has a field that is
|
||||
guaranteed to be valid as long as its parent pointer is trusted, the
|
||||
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
|
||||
follows:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
BTF_TYPE_SAFE_NESTED(struct task_struct) {
|
||||
const cpumask_t *cpus_ptr;
|
||||
};
|
||||
|
||||
In other words, you must:
|
||||
|
||||
1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.
|
||||
|
||||
2. Specify the type and name of the trusted nested field. This field must match
|
||||
the field in the original type definition exactly.
|
||||
|
||||
2.4.6 KF_SLEEPABLE flag
|
||||
-----------------------
|
||||
|
||||
@ -200,6 +238,28 @@ single argument which must be a trusted argument or a MEM_RCU pointer.
|
||||
The argument may have reference count of 0 and the kfunc must take this
|
||||
into consideration.
|
||||
|
||||
.. _KF_deprecated_flag:
|
||||
|
||||
2.4.9 KF_DEPRECATED flag
|
||||
------------------------
|
||||
|
||||
The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
|
||||
changed or removed in a subsequent kernel release. A kfunc that is
|
||||
marked with KF_DEPRECATED should also have any relevant information
|
||||
captured in its kernel doc. Such information typically includes the
|
||||
kfunc's expected remaining lifespan, a recommendation for new
|
||||
functionality that can replace it if any is available, and possibly a
|
||||
rationale for why it is being removed.
|
||||
|
||||
Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
|
||||
supported and have its KF_DEPRECATED flag removed, it is likely to be far more
|
||||
difficult to remove a KF_DEPRECATED flag after it's been added than it is to
|
||||
prevent it from being added in the first place. As described in
|
||||
:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
|
||||
encouraged to make their use-cases known as early as possible, and participate
|
||||
in upstream discussions regarding whether to keep, change, deprecate, or remove
|
||||
those kfuncs if and when such discussions occur.
|
||||
|
||||
2.5 Registering the kfuncs
|
||||
--------------------------
|
||||
|
||||
@ -223,14 +283,150 @@ type. An example is shown below::
|
||||
}
|
||||
late_initcall(init_subsystem);
|
||||
|
||||
3. Core kfuncs
|
||||
2.6 Specifying no-cast aliases with ___init
|
||||
--------------------------------------------
|
||||
|
||||
The verifier will always enforce that the BTF type of a pointer passed to a
|
||||
kfunc by a BPF program matches the type of pointer specified in the kfunc
|
||||
definition. The verifier does, however, allow types that are equivalent
|
||||
according to the C standard to be passed to the same kfunc arg, even if their
|
||||
BTF_IDs differ.
|
||||
|
||||
For example, for the following type definition:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct bpf_cpumask {
|
||||
cpumask_t cpumask;
|
||||
refcount_t usage;
|
||||
};
|
||||
|
||||
The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
|
||||
taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For
|
||||
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
|
||||
to bpf_cpumask_test_cpu().
|
||||
|
||||
In some cases, this type-aliasing behavior is not desired. ``struct
|
||||
nf_conn___init`` is one such example:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct nf_conn___init {
|
||||
struct nf_conn ct;
|
||||
};
|
||||
|
||||
The C standard would consider these types to be equivalent, but it would not
|
||||
always be safe to pass either type to a trusted kfunc. ``struct
|
||||
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
|
||||
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
|
||||
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
|
||||
nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).
|
||||
|
||||
In order to accommodate such requirements, the verifier will enforce strict
|
||||
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
|
||||
being suffixed with ``___init``.
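As a rough illustration of the naming rule above, the following user-space sketch models the comparison on plain strings. It is not the verifier's implementation: ``base_name_len()`` and ``is_nocast_alias()`` are hypothetical helpers invented for this example, and the real check operates on BTF types rather than bare names.

```c
#include <stdbool.h>
#include <string.h>

#define NOCAST_ALIAS_SUFFIX "___init"

/* Length of name with a trailing "___init" removed, if one is present. */
size_t base_name_len(const char *name)
{
	size_t len = strlen(name);
	size_t suffix_len = strlen(NOCAST_ALIAS_SUFFIX);

	if (len > suffix_len &&
	    strcmp(name + len - suffix_len, NOCAST_ALIAS_SUFFIX) == 0)
		return len - suffix_len;
	return len;
}

/*
 * True if exactly one of the two names carries the suffix and the names
 * are otherwise identical, i.e. the pair must be matched strictly.
 */
bool is_nocast_alias(const char *a, const char *b)
{
	size_t a_len = base_name_len(a);
	size_t b_len = base_name_len(b);
	bool a_suffixed = a_len != strlen(a);
	bool b_suffixed = b_len != strlen(b);

	if (a_suffixed == b_suffixed)
		return false;
	return a_len == b_len && strncmp(a, b, a_len) == 0;
}
```

Under this model, ``nf_conn`` and ``nf_conn___init`` form a no-cast pair, while ``cpumask`` and ``bpf_cpumask`` do not and so remain eligible for the aliasing described earlier.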
|
||||
|
||||
.. _BPF_kfunc_lifecycle_expectations:
|
||||
|
||||
3. kfunc lifecycle expectations
|
||||
===============================
|
||||
|
||||
kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
|
||||
strict stability restrictions associated with kernel <-> user UAPIs. This means
|
||||
they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
|
||||
modified or removed by a maintainer of the subsystem they're defined in when
|
||||
it's deemed necessary.
|
||||
|
||||
Like any other change to the kernel, maintainers will not change or remove a
|
||||
kfunc without having a reasonable justification. Whether or not they'll choose
|
||||
to change a kfunc will ultimately depend on a variety of factors, such as how
|
||||
widely used the kfunc is, how long the kfunc has been in the kernel, whether an
|
||||
alternative kfunc exists, what the norm is in terms of stability for the
|
||||
subsystem in question, and of course what the technical cost is of continuing
|
||||
to support the kfunc.
|
||||
|
||||
There are several implications of this:
|
||||
|
||||
a) kfuncs that are widely used or have been in the kernel for a long time will
|
||||
be more difficult to justify being changed or removed by a maintainer. In
|
||||
other words, kfuncs that are known to have a lot of users and provide
|
||||
significant value provide stronger incentives for maintainers to invest the
|
||||
time and complexity in supporting them. It is therefore important for
|
||||
developers that are using kfuncs in their BPF programs to communicate and
|
||||
explain how and why those kfuncs are being used, and to participate in
|
||||
discussions regarding those kfuncs when they occur upstream.
|
||||
|
||||
b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
|
||||
that call kfuncs are generally not part of the kernel tree. This means that
|
||||
refactoring cannot typically change callers in-place when a kfunc changes,
|
||||
as is done for e.g. an upstreamed driver being updated in place when a
|
||||
kernel symbol is changed.
|
||||
|
||||
Unlike with regular kernel symbols, this is expected behavior for BPF
|
||||
symbols, and out-of-tree BPF programs that use kfuncs should be considered
|
||||
relevant to discussions and decisions around modifying and removing those
|
||||
kfuncs. The BPF community will take an active role in participating in
|
||||
upstream discussions when necessary to ensure that the perspectives of such
|
||||
users are taken into account.
|
||||
|
||||
c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
|
||||
will not ever hard-block a change in the kernel purely for stability
|
||||
reasons. That being said, kfuncs are features that are meant to solve
|
||||
problems and provide value to users. The decision of whether to change or
|
||||
remove a kfunc is a multivariate technical decision that is made on a
|
||||
case-by-case basis, and which is informed by data points such as those
|
||||
mentioned above. It is expected that a kfunc being removed or changed with
|
||||
no warning will not be a common occurrence or take place without sound
|
||||
justification, but it is a possibility that must be accepted if one is to
|
||||
use kfuncs.
|
||||
|
||||
3.1 kfunc deprecation
|
||||
---------------------
|
||||
|
||||
As described above, while sometimes a maintainer may find that a kfunc must be
|
||||
changed or removed immediately to accommodate some changes in their subsystem,
|
||||
usually kfuncs will be able to accommodate a longer and more measured
|
||||
deprecation process. For example, if a new kfunc comes along which provides
|
||||
superior functionality to an existing kfunc, the existing kfunc may be
|
||||
deprecated for some period of time to allow users to migrate their BPF programs
|
||||
to use the new one. Or, if a kfunc has no known users, a decision may be made
|
||||
to remove the kfunc (without providing an alternative API) after some
|
||||
deprecation period so as to provide users with a window to notify the kfunc
|
||||
maintainer if it turns out that the kfunc is actually being used.
|
||||
|
||||
It's expected that the common case will be that kfuncs will go through a
|
||||
deprecation period rather than being changed or removed without warning. As
|
||||
described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
|
||||
KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
|
||||
deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
|
||||
procedure is followed for removal:
|
||||
|
||||
1. Any relevant information for deprecated kfuncs is documented in the kfunc's
|
||||
kernel docs. This documentation will typically include the kfunc's expected
|
||||
remaining lifespan, a recommendation for new functionality that can replace
|
||||
the usage of the deprecated function (or an explanation as to why no such
|
||||
replacement exists), etc.
|
||||
|
||||
2. The deprecated kfunc is kept in the kernel for some period of time after it
|
||||
was first marked as deprecated. This time period will be chosen on a
|
||||
case-by-case basis, and will typically depend on how widespread the use of
|
||||
the kfunc is, how long it has been in the kernel, and how hard it is to move
|
||||
to alternatives. This deprecation time period is "best effort", and as
|
||||
described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
|
||||
sometimes dictate that the kfunc be removed before the full intended
|
||||
deprecation period has elapsed.
|
||||
|
||||
3. After the deprecation period the kfunc will be removed. At this point, BPF
|
||||
programs calling the kfunc will be rejected by the verifier.
|
||||
|
||||
4. Core kfuncs
|
||||
==============
|
||||
|
||||
The BPF subsystem provides a number of "core" kfuncs that are potentially
|
||||
applicable to a wide variety of different possible use cases and programs.
|
||||
Those kfuncs are documented here.
|
||||
|
||||
3.1 struct task_struct * kfuncs
|
||||
4.1 struct task_struct * kfuncs
|
||||
-------------------------------
|
||||
|
||||
There are a number of kfuncs that allow ``struct task_struct *`` objects to be
|
||||
@ -306,7 +502,7 @@ Here is an example of it being used:
|
||||
return 0;
|
||||
}
|
||||
|
||||
3.2 struct cgroup * kfuncs
|
||||
4.2 struct cgroup * kfuncs
|
||||
--------------------------
|
||||
|
||||
``struct cgroup *`` objects also have acquire and release functions:
|
||||
@ -420,3 +616,10 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
|
||||
bpf_cgroup_release(parent);
|
||||
return 0;
|
||||
}
|
||||
|
||||
4.3 struct cpumask * kfuncs
|
||||
---------------------------
|
||||
|
||||
BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
|
||||
destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
|
||||
for more details.
|
||||
|
@ -83,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed
|
||||
to be a part of ABI what, in turn, improves both libbpf developer- and
|
||||
user-experiences.
|
||||
|
||||
ABI versionning
|
||||
---------------
|
||||
ABI versioning
|
||||
--------------
|
||||
|
||||
To make future ABI extensions possible libbpf ABI is versioned.
|
||||
Versioning is implemented by ``libbpf.map`` version script that is
|
||||
@ -148,7 +148,7 @@ API documentation convention
|
||||
The libbpf API is documented via comments above definitions in
|
||||
header files. These comments can be rendered by doxygen and sphinx
|
||||
for well organized html output. This section describes the
|
||||
convention in which these comments should be formated.
|
||||
convention in which these comments should be formatted.
|
||||
|
||||
Here is an example from btf.h:
|
||||
|
||||
|
498
Documentation/bpf/map_sockmap.rst
Normal file
@ -0,0 +1,498 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
.. Copyright Red Hat
|
||||
|
||||
==============================================
|
||||
BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH
|
||||
==============================================
|
||||
|
||||
.. note::
|
||||
- ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14
|
||||
- ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18
|
||||
|
||||
``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to
|
||||
redirect skbs between sockets or to apply policy at the socket level based on
|
||||
the result of a BPF (verdict) program with the help of the BPF helpers
|
||||
``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``,
|
||||
``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``.
|
||||
|
||||
``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the
|
||||
index to look up a reference to a ``struct sock``. The map values are socket
|
||||
descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that
|
||||
holds references to sockets via their socket descriptors.
|
||||
|
||||
.. note::
|
||||
The value type is either __u32 or __u64; the latter (__u64) is to support
|
||||
returning socket cookies to userspace. Returning the ``struct sock *`` that
|
||||
the map holds to user-space is neither safe nor useful.
|
||||
|
||||
These maps may have BPF programs attached to them, specifically a parser program
|
||||
and a verdict program. The parser program determines how much data has been
|
||||
parsed and therefore how much data needs to be queued to come to a verdict. The
|
||||
verdict program is essentially the redirect program and can return a verdict
|
||||
of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``.
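The relationship between the value a verdict program returns and the three verdicts listed above can be modelled roughly as follows. This is a user-space sketch, assuming that the program itself returns only a pass or drop value and that a pass becomes a redirect when a redirect helper has selected a target socket. All names in the block (``map_verdict()``, the ``MODEL_`` and ``VERDICT_`` constants) are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Values returned by the program in this model (pass or drop only). */
enum prog_ret_model {
	MODEL_SK_DROP = 0,
	MODEL_SK_PASS = 1,
};

/* The three verdicts named in the text above. */
enum verdict_model {
	VERDICT_DROP = 0,
	VERDICT_PASS,
	VERDICT_REDIRECT,
};

/*
 * Combine the program's return value with whether a redirect target was
 * selected. Anything other than a pass is treated as a drop.
 */
enum verdict_model map_verdict(int prog_ret, bool redirect_requested)
{
	if (prog_ret == MODEL_SK_PASS)
		return redirect_requested ? VERDICT_REDIRECT : VERDICT_PASS;
	return VERDICT_DROP;
}
```

Treating every unrecognised return value as a drop is a design choice of this model: it fails closed rather than forwarding data on an unexpected result.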
|
||||
|
||||
When a socket is inserted into one of these maps, its socket callbacks are
|
||||
replaced and a ``struct sk_psock`` is attached to it. Additionally, this
|
||||
``sk_psock`` inherits the programs that are attached to the map.
|
||||
|
||||
A sock object may be in multiple maps, but can only inherit a single
|
||||
parser or verdict program. If adding a sock object to a map would result
|
||||
in having multiple parser programs the update will return an EBUSY error.
|
||||
|
||||
The supported programs to attach to these maps are:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct sk_psock_progs {
|
||||
struct bpf_prog *msg_parser;
|
||||
struct bpf_prog *stream_parser;
|
||||
struct bpf_prog *stream_verdict;
|
||||
struct bpf_prog *skb_verdict;
|
||||
};
|
||||
|
||||
.. note::
|
||||
Users are not allowed to attach ``stream_verdict`` and ``skb_verdict``
|
||||
programs to the same map.
|
||||
|
||||
The attach types for the map programs are:
|
||||
|
||||
- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``.
|
||||
- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``.
|
||||
- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``.
|
||||
- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``.
|
||||
|
||||
There are additional helpers available to use with the parser and verdict
|
||||
programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With
|
||||
``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many
|
||||
bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()``
|
||||
handles a different case where a BPF program cannot reach a verdict on a msg
|
||||
until it receives more bytes AND the program doesn't want to forward the packet
|
||||
until it is known to be good.
|
||||
|
||||
Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are
|
||||
available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the
|
||||
start and end pointers to given values or to add metadata to the ``struct
|
||||
sk_msg_buff *msg``.
|
||||
|
||||
All these helpers will be described in more detail below.
|
||||
|
||||
Usage
|
||||
=====
|
||||
Kernel BPF
|
||||
----------
|
||||
bpf_msg_redirect_map()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
|
||||
|
||||
This helper is used in programs implementing policies at the socket level. If
|
||||
the message ``msg`` is allowed to pass (i.e., if the verdict BPF program
|
||||
returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
|
||||
``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
|
||||
can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
|
||||
to select the ingress path; otherwise the egress path is selected. This is the
|
||||
only flag supported for now.
|
||||
|
||||
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
|
||||
|
||||
bpf_sk_redirect_map()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
|
||||
|
||||
Redirect the packet to the socket referenced by ``map`` (of type
|
||||
``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
|
||||
can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
|
||||
to select the ingress path; otherwise the egress path is selected. This is the
|
||||
only flag supported for now.
|
||||
|
||||
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
|
||||
|
||||
bpf_map_lookup_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
|
||||
|
||||
Socket entries of type ``struct sock *`` can be retrieved using the
|
||||
``bpf_map_lookup_elem()`` helper.
|
||||
|
||||
bpf_sock_map_update()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
|
||||
|
||||
Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used
|
||||
as a new value for the entry associated to ``key``. The ``flags`` argument can
|
||||
be one of the following:
|
||||
|
||||
- ``BPF_ANY``: Create a new element or update an existing element.
|
||||
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
|
||||
- ``BPF_EXIST``: Update an existing element.
|
||||
|
||||
If the ``map`` has BPF programs (parser and verdict), those will be inherited
|
||||
by the socket being added. If the socket is already attached to BPF programs,
|
||||
this results in an error.
|
||||
|
||||
Returns 0 on success, or a negative error in case of failure.
|
||||
|
||||
bpf_sock_hash_update()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
|
||||
|
||||
Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops``
|
||||
is used as a new value for the entry associated to ``key``.
|
||||
|
||||
The ``flags`` argument can be one of the following:
|
||||
|
||||
- ``BPF_ANY``: Create a new element or update an existing element.
|
||||
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
|
||||
- ``BPF_EXIST``: Update an existing element.
|
||||
|
||||
If the ``map`` has BPF programs (parser and verdict), those will be inherited
|
||||
by the socket being added. If the socket is already attached to BPF programs,
|
||||
this results in an error.
|
||||
|
||||
Returns 0 on success, or a negative error in case of failure.
|
||||
|
||||
bpf_msg_redirect_hash()
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
|
||||
|
||||
This helper is used in programs implementing policies at the socket level. If
|
||||
the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns
|
||||
``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
|
||||
``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
|
||||
interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
|
||||
``flags`` is used to select the ingress path; otherwise the egress path is
|
||||
selected. This is the only flag supported for now.
|
||||
|
||||
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
|
||||
|
||||
bpf_sk_redirect_hash()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
|
||||
|
||||
This helper is used in programs implementing policies at the skb socket level.
|
||||
If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program
|
||||
returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
|
||||
``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
|
||||
interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
|
||||
``flags`` is used to select the ingress path; otherwise the egress path is
|
||||
selected. This is the only flag supported for now.
|
||||
|
||||
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
|
||||
|
||||
bpf_msg_apply_bytes()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
|
||||
|
||||
For socket policies, apply the verdict of the BPF program to the next (number
|
||||
of ``bytes``) of message ``msg``. For example, this helper can be used in the
|
||||
following cases:
|
||||
|
||||
- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple
|
||||
logical messages that the BPF program is supposed to read and for which it
|
||||
should apply a verdict.
|
||||
- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the
|
||||
message has a large payload, then setting up and calling the BPF program
|
||||
repeatedly for all bytes, even though the verdict is already known, would
|
||||
create unnecessary overhead.
|
||||
|
||||
Returns 0.
|
||||
|
||||
bpf_msg_cork_bytes()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
|
||||
|
||||
For socket policies, prevent the execution of the verdict BPF program for
|
||||
message ``msg`` until the number of ``bytes`` have been accumulated.
|
||||
|
||||
This can be used when one needs a specific number of bytes before a verdict can
|
||||
be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()``
|
||||
calls.
|
||||
|
||||
Returns 0.
|
||||
|
||||
bpf_msg_pull_data()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
|
||||
|
||||
For socket policies, pull in non-linear data from user space for ``msg`` and set
|
||||
pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` bytes
|
||||
offsets into ``msg``, respectively.
|
||||
|
||||
If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
|
||||
parse data that the (``data``, ``data_end``) pointers have already consumed.
|
||||
For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
|
||||
calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
|
||||
the range (**0**, **0**) because the data is shared with user space and by
|
||||
default the objective is to avoid allowing user space to modify data while (or
|
||||
after) BPF verdict is being decided. This helper can be used to pull in data
|
||||
and to set the start and end pointers to given values. Data will be copied if
|
||||
necessary (i.e., if data was not linear and if start and end pointers do not
|
||||
point to the same chunk).
|
||||
|
||||
A call to this helper may change the underlying packet buffer.
|
||||
Therefore, at load time, all checks on pointers previously done by the verifier
|
||||
are invalidated and must be performed again, if the helper is used in
|
||||
combination with direct packet access.
|
||||
|
||||
All values for ``flags`` are reserved for future usage, and must be left at
|
||||
zero.
|
||||
|
||||
Returns 0 on success, or a negative error in case of failure.
|
||||
|
||||
bpf_map_lookup_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
|
||||
|
||||
Look up a socket entry in the sockmap or sockhash map.
|
||||
|
||||
Returns the socket entry associated to ``key``, or NULL if no entry was found.
|
||||
|
||||
bpf_map_update_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
|
||||
|
||||
Add or update a socket entry in a sockmap or sockhash.
|
||||
|
||||
The flags argument can be one of the following:
|
||||
|
||||
- BPF_ANY: Create a new element or update an existing element.
|
||||
- BPF_NOEXIST: Create a new element only if it did not exist.
|
||||
- BPF_EXIST: Update an existing element.
|
||||
|
||||
Returns 0 on success, or a negative error in case of failure.
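The effect of the three flag values can be illustrated with a small user-space model of an array-backed map. This is only a sketch: ``struct model_map`` and ``model_update()`` are hypothetical, a slot holding 0 stands in for "no entry", and the specific error codes returned are assumptions based on common BPF map semantics rather than something stated in this document.

```c
#include <errno.h>

/* Flag values, assumed to match the BPF UAPI definitions. */
#define MODEL_BPF_ANY     0
#define MODEL_BPF_NOEXIST 1
#define MODEL_BPF_EXIST   2

#define MODEL_MAX_ENTRIES 4

/* A slot holding 0 is treated as empty; stored values must be non-zero. */
struct model_map {
	unsigned long long slots[MODEL_MAX_ENTRIES];
};

int model_update(struct model_map *map, unsigned int key,
		 unsigned long long value, unsigned long long flags)
{
	if (flags > MODEL_BPF_EXIST)
		return -EINVAL;		/* unknown flag value */
	if (key >= MODEL_MAX_ENTRIES)
		return -E2BIG;		/* index outside the array */
	if (map->slots[key] != 0 && flags == MODEL_BPF_NOEXIST)
		return -EEXIST;		/* create-only, but entry exists */
	if (map->slots[key] == 0 && flags == MODEL_BPF_EXIST)
		return -ENOENT;		/* update-only, but no entry */

	map->slots[key] = value;
	return 0;
}
```

In this model, updating an empty slot with ``MODEL_BPF_EXIST`` fails, creating it with ``MODEL_BPF_NOEXIST`` succeeds once and fails on the second attempt, and ``MODEL_BPF_ANY`` succeeds in both situations.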
|
||||
|
||||
bpf_map_delete_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
|
||||
|
||||
Delete a socket entry from a sockmap or a sockhash.
|
||||
|
||||
Returns 0 on success, or a negative error in case of failure.
|
||||
|
||||
User space
|
||||
----------
|
||||
bpf_map_update_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
|
||||
|
||||
Sockmap entries can be added or updated using the ``bpf_map_update_elem()``
|
||||
function. The ``key`` parameter is the index value of the sockmap array, and the
|
||||
``value`` parameter is the FD value of that socket.
|
||||
|
||||
Under the hood, the sockmap update function uses the socket FD value to
|
||||
retrieve the associated socket and its attached psock.
|
||||
|
||||
The flags argument can be one of the following:
|
||||
|
||||
- BPF_ANY: Create a new element or update an existing element.
|
||||
- BPF_NOEXIST: Create a new element only if it did not exist.
|
||||
- BPF_EXIST: Update an existing element.
|
||||
|
||||
bpf_map_lookup_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
int bpf_map_lookup_elem(int fd, const void *key, void *value)
|
||||
|
||||
Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function.
|
||||
|
||||
.. note::
|
||||
The entry returned is a socket cookie rather than a socket itself.
|
||||
|
||||
bpf_map_delete_elem()
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. code-block:: c
|
||||
|
||||
int bpf_map_delete_elem(int fd, const void *key)
|
||||
|
||||
Sockmap entries can be deleted using the ``bpf_map_delete_elem()``
|
||||
function.
|
||||
|
||||
Returns 0 on success, or negative error in case of failure.
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
Kernel BPF
|
||||
----------
|
||||
Several examples of the use of sockmap APIs can be found in:
|
||||
|
||||
- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_
|
||||
- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_
|
||||
- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_
|
||||
- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_
|
||||
- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_
|
||||
|
||||
The following code snippet shows how to declare a sockmap.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_SOCKMAP);
|
||||
__uint(max_entries, 1);
|
||||
__type(key, __u32);
|
||||
__type(value, __u64);
|
||||
} sock_map_rx SEC(".maps");
|
||||
|
||||
The following code snippet shows a sample parser program.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
SEC("sk_skb/stream_parser")
|
||||
int bpf_prog_parser(struct __sk_buff *skb)
|
||||
{
|
||||
return skb->len;
|
||||
}
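The sample parser above treats each skb as one complete message. A parser for a framed protocol would instead report the length of the next message so that enough data can be queued before a verdict is reached. The sketch below models that decision in user space for a hypothetical wire format with a 2-byte big-endian length header. The format, the name ``frame_len()``, and the convention of returning 0 to ask for more data are assumptions made for illustration and are not taken from the selftests.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 2

/*
 * Return the total length (header plus payload) of the first message in
 * buf, or 0 if fewer than HDR_LEN bytes are available and the length
 * cannot be determined yet.
 */
long frame_len(const uint8_t *buf, size_t avail)
{
	size_t payload;

	if (avail < HDR_LEN)
		return 0;

	/* The header is the payload length in network (big-endian) order. */
	payload = ((size_t)buf[0] << 8) | buf[1];
	return (long)(HDR_LEN + payload);
}
```

Note that the function reports the full message length even when fewer bytes than that are currently available; in this model the caller is responsible for queueing data until the whole message has arrived.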
|
||||
|
||||
The following code snippet shows a simple verdict program that interacts with a
|
||||
sockmap to redirect traffic to another socket based on the local port.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
SEC("sk_skb/stream_verdict")
|
||||
int bpf_prog_verdict(struct __sk_buff *skb)
|
||||
{
|
||||
__u32 lport = skb->local_port;
|
||||
__u32 idx = 0;
|
||||
|
||||
if (lport == 10000)
|
||||
return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0);
|
||||
|
||||
return SK_PASS;
|
||||
}
|
||||
|
||||
The following code snippet shows how to declare a sockhash map.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct socket_key {
|
||||
__u32 src_ip;
|
||||
__u32 dst_ip;
|
||||
__u32 src_port;
|
||||
__u32 dst_port;
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_SOCKHASH);
|
||||
__uint(max_entries, 1);
|
||||
__type(key, struct socket_key);
|
||||
__type(value, __u64);
|
||||
} sock_hash_rx SEC(".maps");
|
||||
|
||||
The following code snippet shows a simple verdict program that interacts with a
|
||||
sockhash to redirect traffic to another socket based on a hash of some of the
|
||||
skb parameters.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
static inline
|
||||
void extract_socket_key(struct __sk_buff *skb, struct socket_key *key)
|
||||
{
|
||||
key->src_ip = skb->remote_ip4;
|
||||
key->dst_ip = skb->local_ip4;
|
||||
key->src_port = skb->remote_port >> 16;
|
||||
key->dst_port = (bpf_htonl(skb->local_port)) >> 16;
|
||||
}
|
||||
|
||||
SEC("sk_skb/stream_verdict")
|
||||
int bpf_prog_verdict(struct __sk_buff *skb)
|
||||
{
|
||||
struct socket_key key;
|
||||
|
||||
extract_socket_key(skb, &key);
|
||||
|
||||
return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0);
|
||||
}
|
||||
|
||||
User space
|
||||
----------
|
||||
Several examples of the use of sockmap APIs can be found in:
|
||||
|
||||
- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_
|
||||
- `tools/testing/selftests/bpf/test_sockmap.c`_
|
||||
- `tools/testing/selftests/bpf/test_maps.c`_
|
||||
|
||||
The following code sample shows how to create a sockmap, attach a parser and
|
||||
verdict program, as well as add a socket entry.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd)
|
||||
{
|
||||
int index = 0;
|
||||
int map, err;
|
||||
|
||||
map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL);
|
||||
if (map < 0) {
|
||||
fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno));
|
||||
return -1;
|
||||
}
|
||||
|
||||
err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0);
|
||||
if (err){
|
||||
fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno));
|
||||
goto out;
|
||||
}
|
||||
|
||||
err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0);
|
||||
if (err){
|
||||
fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno));
|
||||
goto out;
|
||||
}
|
||||
|
||||
err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST);
|
||||
if (err) {
|
||||
fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno));
|
||||
goto out;
|
||||
}
|
||||
|
||||
out:
|
||||
close(map);
|
||||
return err;
|
||||
}
|
||||
|
||||
References
|
||||
===========
|
||||
|
||||
- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f
|
||||
- https://lwn.net/Articles/731133/
|
||||
- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
|
||||
- https://lwn.net/Articles/748628/
|
||||
- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/
|
||||
|
||||
.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
|
||||
.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
|
||||
.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c
|
||||
.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
|
||||
.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c
|
||||
.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c
|
||||
.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
|
||||
.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c
|
@ -178,7 +178,7 @@ The following code snippet shows how to update an XSKMAP with an XSK entry.
|
||||
|
||||
For an example of how to create AF_XDP sockets, please see the AF_XDP-example and
|
||||
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
|
||||
For a detailed explaination of the AF_XDP interface please see:
|
||||
For a detailed explanation of the AF_XDP interface please see:
|
||||
|
||||
- `libxdp-readme`_.
|
||||
- `AF_XDP`_ kernel documentation.
|
||||
|
@ -6,4 +6,5 @@ Other
|
||||
:maxdepth: 1
|
||||
|
||||
ringbuf
|
||||
llvm_reloc
|
||||
llvm_reloc
|
||||
graph_ds_impl
|
||||
|
@ -124,7 +124,7 @@ buffer. Currently 4 are supported:
|
||||
|
||||
- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
|
||||
- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
|
||||
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition
|
||||
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
|
||||
of consumer/producer, respectively.
|
||||
|
||||
Returned values are momentary snapshots of ring buffer state and could be
|
||||
@ -146,7 +146,7 @@ Design and Implementation
|
||||
This reserve/commit schema allows a natural way for multiple producers, either
|
||||
on different CPUs or even on the same CPU/in the same BPF program, to reserve
|
||||
independent records and work with them without blocking other producers. This
|
||||
means that if BPF program was interruped by another BPF program sharing the
|
||||
means that if BPF program was interrupted by another BPF program sharing the
|
||||
same ring buffer, they will both get a record reserved (provided there is
|
||||
enough space left) and can work with it and submit it independently. This
|
||||
applies to NMI context as well, except that due to using a spinlock during
|
||||
|
@ -192,7 +192,7 @@ checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
|
||||
As well as range-checking, the tracked information is also used for enforcing
|
||||
alignment of pointer accesses. For instance, on most systems the packet pointer
|
||||
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
|
||||
over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
|
||||
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
|
||||
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
|
||||
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
|
||||
that pointer are safe.
|
||||
@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled
|
||||
registers it may hold). They must all be safe for the branch to be pruned.
|
||||
This is implemented in states_equal().
|
||||
|
||||
Some technical details about state pruning implementation could be found below.
|
||||
|
||||
Register liveness tracking
|
||||
--------------------------
|
||||
|
||||
In order to make state pruning effective, liveness state is tracked for each
|
||||
register and stack slot. The basic idea is to track which registers and stack
|
||||
slots are actually used during subsequent execution of the program, until
|
||||
program exit is reached. Registers and stack slots that were never used could be
|
||||
removed from the cached state thus making more states equivalent to a cached
|
||||
state. This could be illustrated by the following program::
|
||||
|
||||
0: call bpf_get_prandom_u32()
|
||||
1: r1 = 0
|
||||
2: if r0 == 0 goto +1
|
||||
3: r0 = 1
|
||||
--- checkpoint ---
|
||||
4: r0 = r1
|
||||
5: exit
|
||||
|
||||
Suppose that a state cache entry is created at instruction #4 (such entries are
|
||||
also called "checkpoints" in the text below). The verifier could reach the
|
||||
instruction with one of two possible register states:
|
||||
|
||||
* r0 = 1, r1 = 0
|
||||
* r0 = 0, r1 = 0
|
||||
|
||||
However, only the value of register ``r1`` is important to successfully finish
|
||||
verification. The goal of the liveness tracking algorithm is to spot this fact
|
||||
and figure out that both states are actually equivalent.
|
||||
|
||||
Data structures
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Liveness is tracked using the following data structures::
|
||||
|
||||
enum bpf_reg_liveness {
|
||||
REG_LIVE_NONE = 0,
|
||||
REG_LIVE_READ32 = 0x1,
|
||||
REG_LIVE_READ64 = 0x2,
|
||||
REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
|
||||
REG_LIVE_WRITTEN = 0x4,
|
||||
REG_LIVE_DONE = 0x8,
|
||||
};
|
||||
|
||||
struct bpf_reg_state {
|
||||
...
|
||||
struct bpf_reg_state *parent;
|
||||
...
|
||||
enum bpf_reg_liveness live;
|
||||
...
|
||||
};
|
||||
|
||||
struct bpf_stack_state {
|
||||
struct bpf_reg_state spilled_ptr;
|
||||
...
|
||||
};
|
||||
|
||||
struct bpf_func_state {
|
||||
struct bpf_reg_state regs[MAX_BPF_REG];
|
||||
...
|
||||
struct bpf_stack_state *stack;
|
||||
}
|
||||
|
||||
struct bpf_verifier_state {
|
||||
struct bpf_func_state *frame[MAX_CALL_FRAMES];
|
||||
struct bpf_verifier_state *parent;
|
||||
...
|
||||
}
|
||||
|
||||
* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
|
||||
verifier state creation;
|
||||
|
||||
* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
|
||||
defined by some instruction verified between this verifier state's parent and
|
||||
verifier state itself;
|
||||
|
||||
* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
|
||||
is read by some child state of this verifier state;
|
||||
|
||||
* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
|
||||
processing same verifier state multiple times and for some sanity checks;
|
||||
|
||||
* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
|
||||
values using bitwise or.
|
||||
|
||||
Register parentage chains
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
In order to propagate information between parent and child states, a *register
|
||||
parentage chain* is established. Each register or stack slot is linked to a
|
||||
corresponding register or stack slot in its parent state via a ``->parent``
|
||||
pointer. This link is established upon state creation in ``is_state_visited()``
|
||||
and might be modified by ``set_callee_state()`` called from
|
||||
``__check_func_call()``.
|
||||
|
||||
The rules for correspondence between registers / stack slots are as follows:
|
||||
|
||||
* For the current stack frame, registers and stack slots of the new state are
|
||||
linked to the registers and stack slots of the parent state with the same
|
||||
indices.
|
||||
|
||||
* For the outer stack frames, only callee saved registers (r6-r9) and stack
|
||||
slots are linked to the registers and stack slots of the parent state with the
|
||||
same indices.
|
||||
|
||||
* When function call is processed a new ``struct bpf_func_state`` instance is
|
||||
allocated, it encapsulates a new set of registers and stack slots. For this
|
||||
new frame, parent links for r6-r9 and stack slots are set to nil, parent links
|
||||
for r1-r5 are set to match caller r1-r5 parent links.
|
||||
|
||||
This could be illustrated by the following diagram (arrows stand for
|
||||
``->parent`` pointers)::
|
||||
|
||||
   ...                    ; Frame #0, some instructions
  --- checkpoint #0 ---
  1 : r6 = 42             ; Frame #0
  --- checkpoint #1 ---
  2 : call foo()          ; Frame #0
   ...                    ; Frame #1, instructions from foo()
  --- checkpoint #2 ---
   ...                    ; Frame #1, instructions from foo()
  --- checkpoint #3 ---
      exit                ; Frame #1, return from foo()
  3 : r1 = r6             ; Frame #0  <- current state

                 +-------------------------------+-------------------------------+
                 |           Frame #0            |           Frame #1            |
  Checkpoint     +-------------------------------+-------------------------------+
  #0             | r0 | r1-r5 | r6-r9 | fp-8 ... |
                 +-------------------------------+
                    ^    ^       ^       ^
                    |    |       |       |
  Checkpoint     +-------------------------------+
  #1             | r0 | r1-r5 | r6-r9 | fp-8 ... |
                 +-------------------------------+
                         ^       ^       ^
                         |_______|_______|_______________
                                 |       |               |
                   nil  nil      |       |               |      nil     nil
                    |    |       |       |               |       |       |
  Checkpoint     +-------------------------------+-------------------------------+
  #2             | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
                 +-------------------------------+-------------------------------+
                                 ^       ^               ^       ^       ^
                   nil  nil      |       |               |       |       |
                    |    |       |       |               |       |       |
  Checkpoint     +-------------------------------+-------------------------------+
  #3             | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
                 +-------------------------------+-------------------------------+
                                 ^       ^
                   nil  nil      |       |
                    |    |       |       |
  Current        +-------------------------------+
  state          | r0 | r1-r5 | r6-r9 | fp-8 ... |
                 +-------------------------------+
                                   \
                                     r6 read mark is propagated via these links
                                     all the way up to checkpoint #1.
                                     The checkpoint #1 contains a write mark for r6
                                     because of instruction (1), thus read propagation
                                     does not reach checkpoint #0 (see section below).
|
||||
|
||||
Liveness marks tracking
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For each processed instruction, the verifier tracks read and written registers
|
||||
and stack slots. The main idea of the algorithm is that read marks propagate
|
||||
back along the state parentage chain until they hit a write mark, which 'screens
|
||||
off' earlier states from the read. The information about reads is propagated by
|
||||
function ``mark_reg_read()`` which could be summarized as follows::
|
||||
|
||||
  mark_reg_read(struct bpf_reg_state *state, ...):
      parent = state->parent
      while parent:
          if state->live & REG_LIVE_WRITTEN:
              break
          if parent->live & REG_LIVE_READ64:
              break
          parent->live |= REG_LIVE_READ64
          state = parent
          parent = state->parent
|
||||
|
||||
Notes:
|
||||
|
||||
* The read marks are applied to the **parent** state while write marks are
|
||||
applied to the **current** state. The write mark on a register or stack slot
|
||||
means that it is updated by some instruction in the straight-line code leading
|
||||
from the parent state to the current state.
|
||||
|
||||
* Details about REG_LIVE_READ32 are omitted.
|
||||
|
||||
* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
|
||||
might override the first parent link. Please refer to the comments in the
|
||||
``propagate_liveness()`` and ``mark_reg_read()`` source code for further
|
||||
details.
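The propagation rule summarized above can be sketched as a small stand-alone C model. This is an illustration under stated assumptions, not the verifier's code: ``struct reg_state``, ``model_mark_reg_read()`` and ``model_r6_chain()`` are hypothetical names, only 64-bit read marks are modelled, and the chain replays the ``r6`` column of the parentage diagram.

```c
#include <stddef.h>

/* Simplified stand-ins for the verifier types quoted above; only the
 * fields needed for liveness propagation are modelled (assumption). */
enum reg_liveness {
	REG_LIVE_NONE    = 0,
	REG_LIVE_READ64  = 0x2,
	REG_LIVE_WRITTEN = 0x4,
};

struct reg_state {
	struct reg_state *parent;
	unsigned int live;
};

/* Walk the parentage chain and set read marks on parents until a write
 * mark screens off earlier states or an existing read mark is found. */
static void model_mark_reg_read(struct reg_state *state)
{
	struct reg_state *parent = state->parent;

	while (parent) {
		if (state->live & REG_LIVE_WRITTEN)
			break;
		if (parent->live & REG_LIVE_READ64)
			break;
		parent->live |= REG_LIVE_READ64;
		state = parent;
		parent = state->parent;
	}
}

/* Hypothetical helper replaying the r6 chain from the diagram:
 * checkpoint #0 <- checkpoint #1 (writes r6) <- checkpoint #2
 * <- checkpoint #3 <- current state (reads r6).
 * Returns a bitmask with bit i set if checkpoint #i got a read mark. */
static unsigned int model_r6_chain(void)
{
	struct reg_state cp[4] = { { NULL, 0 } };
	struct reg_state cur;
	unsigned int mask = 0;
	int i;

	for (i = 1; i < 4; i++)
		cp[i].parent = &cp[i - 1];
	cp[1].live = REG_LIVE_WRITTEN;	/* insn (1): r6 = 42 */
	cur.parent = &cp[3];
	cur.live = REG_LIVE_NONE;

	model_mark_reg_read(&cur);	/* insn (3): r1 = r6 */

	for (i = 0; i < 4; i++)
		if (cp[i].live & REG_LIVE_READ64)
			mask |= 1u << i;
	return mask;
}
```

In this model checkpoints #1 to #3 receive a read mark while checkpoint #0 does not, because the write mark on checkpoint #1 stops the walk.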
|
||||
|
||||
Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
|
||||
applied conservatively: stack slots are marked as written only if write size
|
||||
corresponds to the size of the register, e.g. see function ``save_register_state()``.
|
||||
|
||||
Consider the following example::
|
||||
|
||||
0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8
|
||||
--- checkpoint #0 ---
|
||||
1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes
|
||||
2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1)
|
||||
3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0)
|
||||
|
||||
As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
|
||||
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
|
||||
from (3) to checkpoint #0.
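The effect of the conservative write marking can be illustrated with a minimal model. The names below (``slot_state``, ``model_stack_write()``, ``model_read_reaches_checkpoint()``) are hypothetical, and the sketch assumes a single 8-byte stack slot with one parent checkpoint, as in the example above.

```c
#include <stddef.h>

#define MODEL_REG_SIZE     8	/* bytes in a full stack slot (assumption) */
#define MODEL_LIVE_READ64  0x2	/* mirrors REG_LIVE_READ64 */
#define MODEL_LIVE_WRITTEN 0x4	/* mirrors REG_LIVE_WRITTEN */

struct slot_state {
	struct slot_state *parent;
	unsigned int live;
};

/* Only a full-width store may mark the slot as written. */
static void model_stack_write(struct slot_state *slot, int size)
{
	if (size == MODEL_REG_SIZE)
		slot->live |= MODEL_LIVE_WRITTEN;
}

/* Same walk as the mark_reg_read() summary, applied to a stack slot. */
static void model_stack_read(struct slot_state *slot)
{
	struct slot_state *parent = slot->parent;

	while (parent) {
		if (slot->live & MODEL_LIVE_WRITTEN)
			break;
		if (parent->live & MODEL_LIVE_READ64)
			break;
		parent->live |= MODEL_LIVE_READ64;
		slot = parent;
		parent = slot->parent;
	}
}

/* Replays insns (1) and (3): a store of write_size bytes to fp-8 followed
 * by a read of the slot. Returns 1 if checkpoint #0 receives a read mark. */
static int model_read_reaches_checkpoint(int write_size)
{
	struct slot_state cp0 = { NULL, 0 };
	struct slot_state cur = { &cp0, 0 };

	model_stack_write(&cur, write_size);
	model_stack_read(&cur);
	return (cp0.live & MODEL_LIVE_READ64) != 0;
}
```

With a 4-byte store the read mark reaches checkpoint #0, so the upper bytes defined at (0) stay live. Had the partial store counted as a write, as the 8-byte case shows, the mark would be screened off.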
|
||||
|
||||
Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is
|
||||
called to update the ``->branches`` counter for each verifier state in a chain
|
||||
of parent verifier states. When the ``->branches`` counter reaches zero the
|
||||
verifier state becomes a valid entry in a set of cached verifier states.
|
||||
|
||||
Each entry of the verifier states cache is post-processed by a function
|
||||
``clean_live_states()``. This function marks all registers and stack slots
|
||||
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
|
||||
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
|
||||
called from ``states_equal()`` when a state cache entry is considered for
|
||||
equivalence with a current state.
|
||||
|
||||
Now it is possible to explain how the example from the beginning of the section
|
||||
works::
|
||||
|
||||
0: call bpf_get_prandom_u32()
|
||||
1: r1 = 0
|
||||
2: if r0 == 0 goto +1
|
||||
3: r0 = 1
|
||||
--- checkpoint[0] ---
|
||||
4: r0 = r1
|
||||
5: exit
|
||||
|
||||
* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
|
||||
is pushed to states processing queue (pc stands for program counter).
|
||||
|
||||
* At instruction #4:
|
||||
|
||||
* ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
|
||||
* ``checkpoint[0].r0`` is marked as written;
|
||||
* ``checkpoint[0].r1`` is marked as read;
|
||||
|
||||
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
|
||||
by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
|
||||
read mark and all other registers and stack slots are marked as ``NOT_INIT``
|
||||
or ``STACK_INVALID``.
|
||||
|
||||
* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
|
||||
and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
|
||||
are considered equivalent.
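The walk-through above can be mimicked with a toy model of cleaning and comparison. This is only a sketch: the real ``states_equal()`` compares value ranges, types and stack contents, whereas the hypothetical ``model_states_equal()`` below compares two registers holding exact constants.

```c
#include <stdbool.h>

#define MODEL_NREGS        2
#define MODEL_LIVE_READ    0x3	/* REG_LIVE_READ32 | REG_LIVE_READ64 */
#define MODEL_LIVE_READ64  0x2
#define MODEL_LIVE_WRITTEN 0x4

struct model_reg {
	bool init;		/* false models NOT_INIT */
	long value;
	unsigned int live;
};

struct model_state {
	struct model_reg regs[MODEL_NREGS];
	int pc;
};

/* Forget every register that carries no read mark. */
static void model_clean_state(struct model_state *st)
{
	int i;

	for (i = 0; i < MODEL_NREGS; i++)
		if (!(st->regs[i].live & MODEL_LIVE_READ))
			st->regs[i].init = false;
}

/* The cached state covers the current one if every register it still
 * tracks has the same value in the current state. */
static bool model_states_equal(const struct model_state *cached,
			       const struct model_state *cur)
{
	int i;

	if (cached->pc != cur->pc)
		return false;
	for (i = 0; i < MODEL_NREGS; i++) {
		if (!cached->regs[i].init)
			continue;
		if (!cur->regs[i].init ||
		    cur->regs[i].value != cached->regs[i].value)
			return false;
	}
	return true;
}

/* checkpoint[0] is { r0 == 1, r1 == 0, pc == 4 } with r0 written and
 * r1 read; the queued state is { r0 == 0, r1 == 0, pc == 4 }. */
static bool model_example(bool clean)
{
	struct model_state cached = {
		.regs = { { true, 1, MODEL_LIVE_WRITTEN },
			  { true, 0, MODEL_LIVE_READ64 } },
		.pc = 4,
	};
	struct model_state queued = {
		.regs = { { true, 0, 0 }, { true, 0, 0 } },
		.pc = 4,
	};

	if (clean)
		model_clean_state(&cached);
	return model_states_equal(&cached, &queued);
}
```

Without cleaning, the differing ``r0`` values make the comparison fail; after cleaning, only ``r1`` is compared and the queued state is pruned.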
|
||||
|
||||
.. _read_marks_for_cache_hits:
|
||||
|
||||
Read marks propagation for cache hits
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Another point is the handling of read marks when a previously verified state is
|
||||
found in the states cache. Upon cache hit verifier must behave in the same way
|
||||
as if the current state was verified to the program exit. This means that all
|
||||
read marks, present on registers and stack slots of the cached state, must be
|
||||
propagated over the parentage chain of the current state. Example below shows
|
||||
why this is important. Function ``propagate_liveness()`` handles this case.
|
||||
|
||||
Consider the following state parentage chain (S is a starting state, A-E are
|
||||
derived states, -> arrows show which state is derived from which)::
|
||||
|
||||
r1 read
|
||||
<------------- A[r1] == 0
|
||||
C[r1] == 0
|
||||
S ---> A ---> B ---> exit E[r1] == 1
|
||||
|
|
||||
` ---> C ---> D
|
||||
|
|
||||
` ---> E ^
|
||||
|___ suppose all these
|
||||
^ states are at insn #Y
|
||||
|
|
||||
suppose all these
|
||||
states are at insn #X
|
||||
|
||||
* Chain of states ``S -> A -> B -> exit`` is verified first.
|
||||
|
||||
* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
|
||||
propagated up to state ``A``.
|
||||
|
||||
* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
|
||||
equivalent to state ``B``.
|
||||
|
||||
* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
|
||||
``C`` might get mistakenly marked as equivalent to state ``E`` even though
|
||||
values for register ``r1`` differ between ``C`` and ``E``.
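A minimal sketch of this propagation, for a single register ``r1``, is given below. It is loosely modelled on the behaviour described in this section: the names are hypothetical, states ``A``, ``C`` and ``E`` are assumed to be derived from ``S`` and to carry write marks for ``r1`` (their values differ), and the first link is overridden as mentioned in the notes on ``mark_reg_read()``.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LIVE_READ64  0x2
#define MODEL_LIVE_WRITTEN 0x4

struct reg_state {
	struct reg_state *parent;
	unsigned int live;
};

/* Variant of the mark_reg_read() summary taking an explicit first
 * parent. Write marks are honoured only on genuine parent links. */
static void model_mark_reg_read(const struct reg_state *state,
				struct reg_state *parent)
{
	bool writes = parent == state->parent;

	while (parent) {
		if (writes && (state->live & MODEL_LIVE_WRITTEN))
			break;
		if (parent->live & MODEL_LIVE_READ64)
			break;
		parent->live |= MODEL_LIVE_READ64;
		state = parent;
		parent = state->parent;
		writes = true;
	}
}

/* On a cache hit, replay the cached state's read mark on the chain of
 * the current state. */
static void model_propagate_liveness(const struct reg_state *cached,
				     struct reg_state *cur)
{
	if (cached->live & MODEL_LIVE_READ64)
		model_mark_reg_read(cached, cur);
}

/* Builds S -> A -> B and S -> C -> D, where B is cached with a read
 * mark for r1 and D is the current state found equivalent to B.
 * Returns true if C ends up with a read mark for r1. */
static bool model_c_gets_read_mark(bool propagate)
{
	struct reg_state s = { NULL, 0 };
	struct reg_state a = { &s, MODEL_LIVE_WRITTEN | MODEL_LIVE_READ64 };
	struct reg_state b = { &a, MODEL_LIVE_READ64 };
	struct reg_state c = { &s, MODEL_LIVE_WRITTEN };
	struct reg_state d = { &c, 0 };

	if (propagate)
		model_propagate_liveness(&b, &d);
	return (c.live & MODEL_LIVE_READ64) != 0;
}
```

Without the propagation step ``C`` keeps no read mark for ``r1``, so cleaning would drop ``r1`` from ``C`` and a later state ``E`` with a different ``r1`` could be pruned against it by mistake.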
|
||||
|
||||
Understanding eBPF verifier messages
|
||||
====================================
|
||||
|
||||
|
@ -116,6 +116,9 @@ if major >= 3:
|
||||
|
||||
# include/linux/linkage.h:
|
||||
"asmlinkage",
|
||||
|
||||
# include/linux/btf.h
|
||||
"__bpf_kfunc",
|
||||
]
|
||||
|
||||
else:
|
||||
|
@ -127,6 +127,7 @@ Documents that don't fit elsewhere or which have yet to be categorized.
|
||||
:maxdepth: 1
|
||||
|
||||
librs
|
||||
netlink
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
|
101
Documentation/core-api/netlink.rst
Normal file
@ -0,0 +1,101 @@
|
||||
.. SPDX-License-Identifier: BSD-3-Clause
|
||||
|
||||
.. _kernel_netlink:
|
||||
|
||||
===================================
|
||||
Netlink notes for kernel developers
|
||||
===================================
|
||||
|
||||
General guidance
|
||||
================
|
||||
|
||||
Attribute enums
|
||||
---------------
|
||||
|
||||
Older families often define "null" attributes and commands with value
|
||||
of ``0`` and named ``unspec``. This is supported (``type: unused``)
|
||||
but should be avoided in new families. The ``unspec`` enum values are
|
||||
not used in practice, so just set the value of the first attribute to ``1``.
|
||||
|
||||
Message enums
|
||||
-------------
|
||||
|
||||
Use the same command IDs for requests and replies. This makes it easier
|
||||
to match them up, and we have plenty of ID space.
|
||||
|
||||
Use separate command IDs for notifications. This makes it easier to
|
||||
sort the notifications from replies (and present them to the user
|
||||
application via a different API than replies).
|
||||
|
||||
Answer requests
|
||||
---------------
|
||||
|
||||
Older families do not reply to all of the commands, especially NEW / ADD
|
||||
commands. User only gets information whether the operation succeeded or
|
||||
not via the ACK. Try to find useful data to return. Once the command is
|
||||
added, whether it replies with a full message or only an ACK is uAPI and
|
||||
cannot be changed. It's better to err on the side of replying.
|
||||
|
||||
Specifically NEW and ADD commands should reply with information identifying
|
||||
the created object such as the allocated object's ID (without having to
|
||||
resort to using ``NLM_F_ECHO``).
|
||||
|
||||
NLM_F_ECHO
|
||||
----------
|
||||
|
||||
Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO``
|
||||
to take effect. This is useful for programs that need precise feedback
|
||||
from the kernel (for example for logging purposes).
|
||||
|
||||
Support dump consistency
|
||||
------------------------
|
||||
|
||||
If iterating over objects during dump may skip over objects or repeat
|
||||
them - make sure to report dump inconsistency with ``NLM_F_DUMP_INTR``.
|
||||
This is usually implemented by maintaining a generation id for the
|
||||
structure and recording it in the ``seq`` member of struct netlink_callback.
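The generation id check can be sketched in user space as follows. The struct and function names are hypothetical stand-ins. In the kernel the ``seq`` and ``prev_seq`` members live in struct netlink_callback, and the comparison is normally done by a helper such as ``nl_dump_check_consistent()``. The sketch assumes the generation id is never zero, since a zero ``prev_seq`` is treated as "first message of the dump".

```c
/* Same numeric value as NLM_F_DUMP_INTR in the uAPI headers. */
#define MODEL_NLM_F_DUMP_INTR 0x10

/* Hypothetical stand-in for the seq/prev_seq members of
 * struct netlink_callback. */
struct model_dump_cb {
	unsigned int seq;
	unsigned int prev_seq;
};

/* Called once per dumped message with the current generation id of the
 * dumped structure. Returns the flags to OR into the message header. */
static unsigned int model_dump_check_consistent(struct model_dump_cb *cb,
						unsigned int generation)
{
	unsigned int flags = 0;

	cb->seq = generation;
	if (cb->prev_seq && cb->seq != cb->prev_seq)
		flags |= MODEL_NLM_F_DUMP_INTR;
	cb->prev_seq = cb->seq;
	return flags;
}
```

A dump that sees the same generation id for every message reports no flag. As soon as the id changes between two messages, the later message carries the interrupted-dump flag and user space can restart the dump.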
|
||||
|
||||
Netlink specification
|
||||
=====================
|
||||
|
||||
Documentation of the Netlink specification parts which are only relevant
|
||||
to the kernel space.
|
||||
|
||||
Globals
|
||||
-------
|
||||
|
||||
kernel-policy
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
Defines if the kernel validation policy is per operation (``per-op``)
|
||||
or for the entire family (``global``). New families should use ``per-op``
|
||||
(default) to be able to narrow down the attributes accepted by a specific
|
||||
command.
|
||||
|
||||
checks
|
||||
------
|
||||
|
||||
Documentation for the ``checks`` sub-sections of attribute specs.
|
||||
|
||||
unterminated-ok
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Accept strings without the null-termination (for legacy families only).
|
||||
Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type.
|
||||
|
||||
max-len
|
||||
~~~~~~~
|
||||
|
||||
Defines max length for a binary or string attribute (corresponding
|
||||
to the ``len`` member of struct nla_policy). For string attributes terminating
|
||||
null character is not counted towards ``max-len``.
|
||||
|
||||
The field may either be a literal integer value or a name of a defined
|
||||
constant. String types may reduce the constant by one
|
||||
(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating
|
||||
character so implementations should recognize such pattern.
|
||||
|
||||
min-len
|
||||
~~~~~~~
|
||||
|
||||
Similar to ``max-len`` but defines minimum length.
|
@ -161,6 +161,6 @@ xxx_packing() that calls it using the proper QUIRK_* one-hot bits set.
|
||||
|
||||
The packing() function returns an int-encoded error code, which protects the
|
||||
programmer against incorrect API use. The errors are not expected to occur
|
||||
durring runtime, therefore it is reasonable for xxx_packing() to return void
|
||||
during runtime, therefore it is reasonable for xxx_packing() to return void
|
||||
and simply swallow those errors. Optionally it can dump stack or print the
|
||||
error description.
|
||||
|
@ -57,6 +57,15 @@ patternProperties:
|
||||
enum:
|
||||
- mscc,ocelot-miim
|
||||
|
||||
"^ethernet-switch@[0-9a-f]+$":
|
||||
type: object
|
||||
$ref: /schemas/net/mscc,vsc7514-switch.yaml
|
||||
unevaluatedProperties: false
|
||||
properties:
|
||||
compatible:
|
||||
enum:
|
||||
- mscc,vsc7512-switch
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
|
@ -0,0 +1,80 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/amlogic,g12a-mdio-mux.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: MDIO bus multiplexer/glue of Amlogic G12a SoC family
|
||||
|
||||
description:
|
||||
This is a special case of an MDIO bus multiplexer. It allows choosing between
|
||||
the internal mdio bus leading to the embedded 10/100 PHY or the external
|
||||
MDIO bus.
|
||||
|
||||
maintainers:
|
||||
- Neil Armstrong <neil.armstrong@linaro.org>
|
||||
|
||||
allOf:
|
||||
- $ref: mdio-mux.yaml#
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: amlogic,g12a-mdio-mux
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
clocks:
|
||||
items:
|
||||
- description: peripheral clock
|
||||
- description: platform crystal
|
||||
- description: SoC 50MHz MPLL
|
||||
|
||||
clock-names:
|
||||
items:
|
||||
- const: pclk
|
||||
- const: clkin0
|
||||
- const: clkin1
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- clocks
|
||||
- clock-names
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
#include <dt-bindings/interrupt-controller/irq.h>
|
||||
#include <dt-bindings/interrupt-controller/arm-gic.h>
|
||||
mdio-multiplexer@4c000 {
|
||||
compatible = "amlogic,g12a-mdio-mux";
|
||||
reg = <0x4c000 0xa4>;
|
||||
clocks = <&clkc_eth_phy>, <&xtal>, <&clkc_mpll>;
|
||||
clock-names = "pclk", "clkin0", "clkin1";
|
||||
mdio-parent-bus = <&mdio0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
mdio@0 {
|
||||
reg = <0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
};
|
||||
|
||||
mdio@1 {
|
||||
reg = <1>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
ethernet-phy@8 {
|
||||
compatible = "ethernet-phy-id0180.3301",
|
||||
"ethernet-phy-ieee802.3-c22";
|
||||
interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>;
|
||||
reg = <8>;
|
||||
max-speed = <100>;
|
||||
};
|
||||
};
|
||||
};
|
||||
...
|
@ -0,0 +1,64 @@
|
||||
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/amlogic,gxl-mdio-mux.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Amlogic GXL MDIO bus multiplexer
|
||||
|
||||
maintainers:
|
||||
- Jerome Brunet <jbrunet@baylibre.com>
|
||||
|
||||
description:
|
||||
This is a special case of an MDIO bus multiplexer. It allows choosing between
|
||||
the internal mdio bus leading to the embedded 10/100 PHY or the external
|
||||
MDIO bus on the Amlogic GXL SoC family.
|
||||
|
||||
allOf:
|
||||
- $ref: mdio-mux.yaml#
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: amlogic,gxl-mdio-mux
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
clocks:
|
||||
maxItems: 1
|
||||
|
||||
clock-names:
|
||||
items:
|
||||
- const: ref
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- clocks
|
||||
- clock-names
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
eth_phy_mux: mdio@558 {
|
||||
compatible = "amlogic,gxl-mdio-mux";
|
||||
reg = <0x558 0xc>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
clocks = <&refclk>;
|
||||
clock-names = "ref";
|
||||
mdio-parent-bus = <&mdio0>;
|
||||
|
||||
external_mdio: mdio@0 {
|
||||
reg = <0x0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
};
|
||||
|
||||
internal_mdio: mdio@1 {
|
||||
reg = <0x1>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
};
|
||||
};
|
@ -19,6 +19,7 @@ description: |
|
||||
|
||||
allOf:
|
||||
- $ref: ethernet-controller.yaml#
|
||||
- $ref: /schemas/spi/spi-peripheral-props.yaml
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
@ -39,8 +40,8 @@ properties:
|
||||
it should be marked GPIO_ACTIVE_LOW.
|
||||
maxItems: 1
|
||||
|
||||
controller-data: true
|
||||
local-mac-address: true
|
||||
|
||||
mac-address: true
|
||||
|
||||
required:
|
||||
|
@ -28,6 +28,12 @@ properties:
|
||||
- renesas,r8a77995-canfd # R-Car D3
|
||||
- const: renesas,rcar-gen3-canfd # R-Car Gen3 and RZ/G2
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- renesas,r8a779a0-canfd # R-Car V3U
|
||||
- renesas,r8a779g0-canfd # R-Car V4H
|
||||
- const: renesas,rcar-gen4-canfd # R-Car Gen4
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- renesas,r9a07g043-canfd # RZ/G2UL and RZ/Five
|
||||
@ -35,8 +41,6 @@ properties:
|
||||
- renesas,r9a07g054-canfd # RZ/V2L
|
||||
- const: renesas,rzg2l-canfd # RZ/G2L family
|
||||
|
||||
- const: renesas,r8a779a0-canfd # R-Car V3U
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
@ -60,7 +64,7 @@ properties:
|
||||
$ref: /schemas/types.yaml#/definitions/flag
|
||||
description:
|
||||
The controller can operate in either CAN FD only mode (default) or
|
||||
Classical CAN only mode. The mode is global to both the channels.
|
||||
Classical CAN only mode. The mode is global to all channels.
|
||||
Specify this property to put the controller in Classical CAN only mode.
|
||||
|
||||
assigned-clocks:
|
||||
@ -80,6 +84,10 @@ patternProperties:
|
||||
The controller supports multiple channels and each is represented as a
|
||||
child node. Each channel can be enabled/disabled individually.
|
||||
|
||||
properties:
|
||||
phys:
|
||||
maxItems: 1
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
required:
|
||||
@ -159,7 +167,7 @@ allOf:
|
||||
properties:
|
||||
compatible:
|
||||
contains:
|
||||
const: renesas,r8a779a0-canfd
|
||||
const: renesas,rcar-gen4-canfd
|
||||
then:
|
||||
patternProperties:
|
||||
"^channel[2-7]$": false
|
||||
|
@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
title: Arrow SpeedChips XRS7000 Series Switch
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
|
||||
maintainers:
|
||||
- George McCollister <george.mccollister@gmail.com>
|
||||
|
@ -66,7 +66,7 @@ required:
|
||||
- reg
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
- if:
|
||||
properties:
|
||||
compatible:
|
||||
|
@ -85,11 +85,16 @@ properties:
|
||||
ports:
|
||||
type: object
|
||||
|
||||
properties:
|
||||
brcm,use-bcm-hdr:
|
||||
description: if present, indicates that the switch port has Broadcom
|
||||
tags enabled (per-packet metadata)
|
||||
type: boolean
|
||||
patternProperties:
|
||||
'^port@[0-9a-f]$':
|
||||
$ref: dsa-port.yaml#
|
||||
unevaluatedProperties: false
|
||||
|
||||
properties:
|
||||
brcm,use-bcm-hdr:
|
||||
description: if present, indicates that the switch port has Broadcom
|
||||
tags enabled (per-packet metadata)
|
||||
type: boolean
|
||||
|
||||
required:
|
||||
- reg
|
||||
|
@ -4,18 +4,19 @@
|
||||
$id: http://devicetree.org/schemas/net/dsa/dsa-port.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Ethernet Switch port
|
||||
title: Generic DSA Switch Port
|
||||
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
- Florian Fainelli <f.fainelli@gmail.com>
|
||||
- Vivien Didelot <vivien.didelot@gmail.com>
|
||||
- Vladimir Oltean <olteanv@gmail.com>
|
||||
|
||||
description:
|
||||
Ethernet switch port Description
|
||||
A DSA switch port is a component of a switch that manages one MAC, and can
|
||||
pass Ethernet frames. It can act as a standard Ethernet switch port, or have
|
||||
DSA-specific functionality.
|
||||
|
||||
allOf:
|
||||
- $ref: /schemas/net/ethernet-controller.yaml#
|
||||
$ref: /schemas/net/ethernet-switch-port.yaml#
|
||||
|
||||
properties:
|
||||
reg:
|
||||
@ -58,25 +59,6 @@ properties:
|
||||
- rtl8_4t
|
||||
- seville
|
||||
|
||||
phy-handle: true
|
||||
|
||||
phy-mode: true
|
||||
|
||||
fixed-link: true
|
||||
|
||||
mac-address: true
|
||||
|
||||
sfp: true
|
||||
|
||||
managed: true
|
||||
|
||||
rx-internal-delay-ps: true
|
||||
|
||||
tx-internal-delay-ps: true
|
||||
|
||||
required:
|
||||
- reg
|
||||
|
||||
# CPU and DSA ports must have phylink-compatible link descriptions
|
||||
if:
|
||||
oneOf:
|
||||
|
@ -9,7 +9,7 @@ title: Ethernet Switch
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
- Florian Fainelli <f.fainelli@gmail.com>
|
||||
- Vivien Didelot <vivien.didelot@gmail.com>
|
||||
- Vladimir Oltean <olteanv@gmail.com>
|
||||
|
||||
description:
|
||||
This binding represents Ethernet Switches which have a dedicated CPU
|
||||
@ -18,10 +18,9 @@ description:
|
||||
|
||||
select: false
|
||||
|
||||
properties:
|
||||
$nodename:
|
||||
pattern: "^(ethernet-)?switch(@.*)?$"
|
||||
$ref: /schemas/net/ethernet-switch.yaml#
|
||||
|
||||
properties:
|
||||
dsa,member:
|
||||
minItems: 2
|
||||
maxItems: 2
|
||||
@ -32,30 +31,28 @@ properties:
|
||||
(single device hanging off a CPU port) must not specify this property
|
||||
$ref: /schemas/types.yaml#/definitions/uint32-array
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?ports$":
|
||||
type: object
|
||||
properties:
|
||||
'#address-cells':
|
||||
const: 1
|
||||
'#size-cells':
|
||||
const: 0
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-9]+$":
|
||||
type: object
|
||||
description: Ethernet switch ports
|
||||
|
||||
$ref: dsa-port.yaml#
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
oneOf:
|
||||
- required:
|
||||
- ports
|
||||
- required:
|
||||
- ethernet-ports
|
||||
|
||||
additionalProperties: true
|
||||
|
||||
$defs:
|
||||
ethernet-ports:
|
||||
description: A DSA switch without any extra port properties
|
||||
$ref: '#/'
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?ports$":
|
||||
type: object
|
||||
additionalProperties: false
|
||||
|
||||
properties:
|
||||
'#address-cells':
|
||||
const: 1
|
||||
'#size-cells':
|
||||
const: 0
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-9]+$":
|
||||
description: Ethernet switch ports
|
||||
$ref: dsa-port.yaml#
|
||||
unevaluatedProperties: false
|
||||
|
||||
...
|
||||
|
@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
title: Hirschmann Hellcreek TSN Switch
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
|
@ -24,56 +24,46 @@ description: |
|
||||
|
||||
There is only the standalone version of MT7531.
|
||||
|
||||
Port 5 on MT7530 has got various ways of configuration.
|
||||
|
||||
For standalone MT7530:
|
||||
Port 5 on MT7530 has got various ways of configuration:
|
||||
|
||||
- Port 5 can be used as a CPU port.
|
||||
|
||||
- PHY 0 or 4 of the switch can be muxed to connect to the gmac of the SoC
|
||||
which port 5 is wired to. Usually used for connecting the wan port
|
||||
directly to the CPU to achieve 2 Gbps routing in total.
|
||||
- PHY 0 or 4 of the switch can be muxed to gmac5 of the switch. Therefore,
|
||||
the gmac of the SoC which is wired to port 5 can connect to the PHY.
|
||||
This is usually used for connecting the wan port directly to the CPU to
|
||||
achieve 2 Gbps routing in total.
|
||||
|
||||
The driver looks up the reg on the ethernet-phy node which the phy-handle
|
||||
property refers to on the gmac node to mux the specified phy.
|
||||
The driver looks up the reg on the ethernet-phy node, which the phy-handle
|
||||
property on the gmac node refers to, to mux the specified phy.
|
||||
|
||||
The driver requires the gmac of the SoC to have "mediatek,eth-mac" as the
|
||||
compatible string and the reg must be 1. So, for now, only gmac1 of an
|
||||
compatible string and the reg must be 1. So, for now, only gmac1 of a
|
||||
MediaTek SoC can benefit from this. Banana Pi BPI-R2 suits this.
|
||||
Check out example 5 for a similar configuration.
|
||||
|
||||
- Port 5 can be wired to an external phy. Port 5 becomes a DSA slave.
|
||||
Check out example 7 for a similar configuration.
|
||||
|
||||
For multi-chip module MT7530:
|
||||
|
||||
- Port 5 can be used as a CPU port.
|
||||
|
||||
- PHY 0 or 4 of the switch can be muxed to connect to gmac1 of the SoC.
|
||||
Usually used for connecting the wan port directly to the CPU to achieve 2
|
||||
Gbps routing in total.
|
||||
|
||||
The driver looks up the reg on the ethernet-phy node which the phy-handle
|
||||
property refers to on the gmac node to mux the specified phy.
|
||||
|
||||
For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function.
|
||||
|
||||
Check out example 5.
|
||||
|
||||
- In case of an external phy wired to gmac1 of the SoC, port 5 must not be
|
||||
enabled.
|
||||
- For the multi-chip module MT7530, in case of an external phy wired to
|
||||
gmac1 of the SoC, port 5 must not be enabled.
|
||||
|
||||
In case of muxing PHY 0 or 4, the external phy must not be enabled.
|
||||
|
||||
For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function.
|
||||
|
||||
Check out example 6.
|
||||
|
||||
- Port 5 can be muxed to an external phy. Port 5 becomes a DSA slave.
|
||||
The external phy must be wired TX to TX to gmac1 of the SoC for this to
|
||||
work. Ubiquiti EdgeRouter X SFP is wired this way.
|
||||
- Port 5 can be wired to an external phy. Port 5 becomes a DSA slave.
|
||||
|
||||
Muxing PHY 0 or 4 won't work when the external phy is connected TX to TX.
|
||||
For the multi-chip module MT7530, the external phy must be wired TX to TX
|
||||
to gmac1 of the SoC for this to work. Ubiquiti EdgeRouter X SFP is wired
|
||||
this way.
|
||||
|
||||
For the multi-chip module MT7530, muxing PHY 0 or 4 won't work when the
|
||||
external phy is connected TX to TX.
|
||||
|
||||
For the MT7621 SoCs, rgmii2 group must be claimed with gpio function.
|
||||
|
||||
Check out example 7.
|
||||
|
||||
properties:
|
||||
@ -157,9 +147,6 @@ patternProperties:
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-9]+$":
|
||||
type: object
|
||||
description: Ethernet switch ports
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
properties:
|
||||
reg:
|
||||
@ -168,7 +155,6 @@ patternProperties:
|
||||
for user ports.
|
||||
|
||||
allOf:
|
||||
- $ref: dsa-port.yaml#
|
||||
- if:
|
||||
required: [ ethernet ]
|
||||
then:
|
||||
@ -238,7 +224,7 @@ $defs:
|
||||
- sgmii
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
- if:
|
||||
required:
|
||||
- mediatek,mcm
|
||||
@ -605,7 +591,7 @@ examples:
|
||||
label = "lan4";
|
||||
};
|
||||
|
||||
/* Commented out, phy4 is muxed to gmac1.
|
||||
/* Commented out, phy4 is connected to gmac1.
|
||||
port@4 {
|
||||
reg = <4>;
|
||||
label = "wan";
|
||||
|
@ -11,7 +11,7 @@ maintainers:
|
||||
- Woojung Huh <Woojung.Huh@microchip.com>
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
- $ref: /schemas/spi/spi-peripheral-props.yaml#
|
||||
|
||||
properties:
|
||||
|
@ -10,7 +10,7 @@ maintainers:
|
||||
- UNGLinuxDriver@microchip.com
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
|
@ -78,7 +78,7 @@ required:
|
||||
- reg
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
- if:
|
||||
properties:
|
||||
compatible:
|
||||
|
@ -13,7 +13,7 @@ description:
|
||||
depends on the SPI bus master driver.
|
||||
|
||||
allOf:
|
||||
- $ref: "dsa.yaml#"
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
- $ref: /schemas/spi/spi-peripheral-props.yaml#
|
||||
|
||||
maintainers:
|
||||
|
@ -66,15 +66,11 @@ properties:
|
||||
With the legacy mapping the reg corresponding to the internal
|
||||
mdio is the switch reg with an offset of -1.
|
||||
|
||||
$ref: "dsa.yaml#"
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?ports$":
|
||||
type: object
|
||||
properties:
|
||||
'#address-cells':
|
||||
const: 1
|
||||
'#size-cells':
|
||||
const: 0
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-6]$":
|
||||
type: object
|
||||
@ -116,7 +112,7 @@ required:
|
||||
- compatible
|
||||
- reg
|
||||
|
||||
additionalProperties: true
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
@ -148,8 +144,6 @@ examples:
|
||||
|
||||
switch@10 {
|
||||
compatible = "qca,qca8337";
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>;
|
||||
reg = <0x10>;
|
||||
|
||||
@ -209,8 +203,6 @@ examples:
|
||||
|
||||
switch@10 {
|
||||
compatible = "qca,qca8337";
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>;
|
||||
reg = <0x10>;
|
||||
|
||||
|
@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
title: Realtek switches for unmanaged switches
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
|
||||
maintainers:
|
||||
- Linus Walleij <linus.walleij@linaro.org>
|
||||
|
@ -14,7 +14,7 @@ description: |
|
||||
handles 4 ports + 1 CPU management port.
|
||||
|
||||
allOf:
|
||||
- $ref: dsa.yaml#
|
||||
- $ref: dsa.yaml#/$defs/ethernet-ports
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
|
@ -0,0 +1,26 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/ethernet-switch-port.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Generic Ethernet Switch Port
|
||||
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
- Florian Fainelli <f.fainelli@gmail.com>
|
||||
- Vladimir Oltean <olteanv@gmail.com>
|
||||
|
||||
description:
|
||||
An Ethernet switch port is a component of a switch that manages one MAC, and
|
||||
can pass Ethernet frames.
|
||||
|
||||
$ref: ethernet-controller.yaml#
|
||||
|
||||
properties:
|
||||
reg:
|
||||
description: Port number
|
||||
|
||||
additionalProperties: true
|
||||
|
||||
...
|
62
Documentation/devicetree/bindings/net/ethernet-switch.yaml
Normal file
@ -0,0 +1,62 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/ethernet-switch.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Generic Ethernet Switch
|
||||
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
- Florian Fainelli <f.fainelli@gmail.com>
|
||||
- Vladimir Oltean <olteanv@gmail.com>
|
||||
|
||||
description:
|
||||
Ethernet switches are multi-port Ethernet controllers. Each port has
|
||||
its own number and is represented as its own Ethernet controller.
|
||||
The minimum required functionality is to pass packets to software.
|
||||
They may or may not be able to forward packets autonomously between
|
||||
ports.
|
||||
|
||||
select: false
|
||||
|
||||
properties:
|
||||
$nodename:
|
||||
pattern: "^(ethernet-)?switch(@.*)?$"
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?ports$":
|
||||
type: object
|
||||
unevaluatedProperties: false
|
||||
|
||||
properties:
|
||||
'#address-cells':
|
||||
const: 1
|
||||
'#size-cells':
|
||||
const: 0
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-9]+$":
|
||||
type: object
|
||||
description: Ethernet switch ports
|
||||
|
||||
oneOf:
|
||||
- required:
|
||||
- ports
|
||||
- required:
|
||||
- ethernet-ports
|
||||
|
||||
additionalProperties: true
|
||||
|
||||
$defs:
|
||||
base:
|
||||
description: An ethernet switch without any extra port properties
|
||||
$ref: '#/'
|
||||
|
||||
patternProperties:
|
||||
"^(ethernet-)?port@[0-9]+$":
|
||||
description: Ethernet switch ports
|
||||
$ref: ethernet-switch-port.yaml#
|
||||
unevaluatedProperties: false
|
||||
|
||||
...
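A minimal sketch of a node shaped the way the generic ethernet-switch.yaml schema above describes, assuming a device binding that references it (the schema itself sets "select: false", so it only applies through such a reference). The compatible string and the phy0/phy1 labels are hypothetical placeholders, not taken from any real binding; phy-mode and phy-handle are the properties that ethernet-switch-port.yaml pulls in from ethernet-controller.yaml.

```dts
/* Illustrative only: "vendor,example-switch" and &phy0/&phy1 are
 * hypothetical and must be replaced by a real binding's values.
 */
ethernet-switch@0 {
    compatible = "vendor,example-switch";
    reg = <0>;

    /* Matches "^(ethernet-)?ports$"; exactly one of "ports" or
     * "ethernet-ports" is expected by the oneOf clause.
     */
    ethernet-ports {
        #address-cells = <1>;   /* const: 1 in the schema */
        #size-cells = <0>;      /* const: 0 in the schema */

        /* Matches "^(ethernet-)?port@[0-9]+$"; reg is the port number. */
        port@0 {
            reg = <0>;
            phy-mode = "internal";
            phy-handle = <&phy0>;
        };

        port@1 {
            reg = <1>;
            phy-mode = "internal";
            phy-handle = <&phy1>;
        };
    };
};
```

The node name uses the "ethernet-" prefixed form; the unprefixed "switch@0", "ports" and "port@N" spellings are accepted by the same patterns.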
|
@ -51,6 +51,7 @@ properties:
|
||||
- fsl,imx8mm-fec
|
||||
- fsl,imx8mn-fec
|
||||
- fsl,imx8mp-fec
|
||||
- fsl,imx93-fec
|
||||
- const: fsl,imx8mq-fec
|
||||
- const: fsl,imx6sx-fec
|
||||
- items:
|
||||
|
47
Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml
Normal file
@ -0,0 +1,47 @@
|
||||
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/maxlinear,gpy2xx.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: MaxLinear GPY2xx PHY
|
||||
|
||||
maintainers:
|
||||
- Andrew Lunn <andrew@lunn.ch>
|
||||
- Michael Walle <michael@walle.cc>
|
||||
|
||||
allOf:
|
||||
- $ref: ethernet-phy.yaml#
|
||||
|
||||
properties:
|
||||
maxlinear,use-broken-interrupts:
|
||||
description: |
|
||||
Interrupts are broken on some GPY2xx PHYs in that they keep the
|
||||
interrupt line asserted even after the interrupt status register is
|
||||
cleared. Thus it is blocking the interrupt line which is usually bad
|
||||
for shared lines. By default interrupts are disabled for this PHY and
|
||||
polling mode is used. If one can live with the consequences, this
|
||||
property can be used to enable interrupt handling.
|
||||
|
||||
Affected PHYs (as far as known) are GPY215B and GPY215C.
|
||||
type: boolean
|
||||
|
||||
dependencies:
|
||||
maxlinear,use-broken-interrupts: [ interrupts ]
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
ethernet {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
ethernet-phy@0 {
|
||||
reg = <0>;
|
||||
interrupts-extended = <&intc 0>;
|
||||
maxlinear,use-broken-interrupts;
|
||||
};
|
||||
};
|
||||
|
||||
...
|
@ -1,48 +0,0 @@
|
||||
Properties for the MDIO bus multiplexer/glue of Amlogic G12a SoC family.
|
||||
|
||||
This is a special case of a MDIO bus multiplexer. It allows to choose between
|
||||
the internal mdio bus leading to the embedded 10/100 PHY or the external
|
||||
MDIO bus.
|
||||
|
||||
Required properties in addition to the generic multiplexer properties:
|
||||
- compatible : amlogic,g12a-mdio-mux
|
||||
- reg: physical address and length of the multiplexer/glue registers
|
||||
- clocks: list of clock phandle, one for each entry clock-names.
|
||||
- clock-names: should contain the following:
|
||||
* "pclk" : peripheral clock.
|
||||
* "clkin0" : platform crytal
|
||||
* "clkin1" : SoC 50MHz MPLL
|
||||
|
||||
Example :
|
||||
|
||||
mdio_mux: mdio-multiplexer@4c000 {
|
||||
compatible = "amlogic,g12a-mdio-mux";
|
||||
reg = <0x0 0x4c000 0x0 0xa4>;
|
||||
clocks = <&clkc CLKID_ETH_PHY>,
|
||||
<&xtal>,
|
||||
<&clkc CLKID_MPLL_5OM>;
|
||||
clock-names = "pclk", "clkin0", "clkin1";
|
||||
mdio-parent-bus = <&mdio0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
ext_mdio: mdio@0 {
|
||||
reg = <0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
};
|
||||
|
||||
int_mdio: mdio@1 {
|
||||
reg = <1>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
internal_ephy: ethernet-phy@8 {
|
||||
compatible = "ethernet-phy-id0180.3301",
|
||||
"ethernet-phy-ieee802.3-c22";
|
||||
interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>;
|
||||
reg = <8>;
|
||||
max-speed = <100>;
|
||||
};
|
||||
};
|
||||
};
|
@ -158,6 +158,7 @@ KSZ9031:
|
||||
no link will be established.
|
||||
|
||||
KSZ9131:
|
||||
LAN8841:
|
||||
|
||||
All skew control options are specified in picoseconds. The increment
|
||||
step is 100ps. Unlike KSZ9031, the values represent picosecond delays.
|
||||
|
117
Documentation/devicetree/bindings/net/motorcomm,yt8xxx.yaml
Normal file
@ -0,0 +1,117 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/motorcomm,yt8xxx.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: MotorComm yt8xxx Ethernet PHY
|
||||
|
||||
maintainers:
|
||||
- Frank Sae <frank.sae@motor-comm.com>
|
||||
|
||||
allOf:
|
||||
- $ref: ethernet-phy.yaml#
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
enum:
|
||||
- ethernet-phy-id4f51.e91a
|
||||
- ethernet-phy-id4f51.e91b
|
||||
|
||||
rx-internal-delay-ps:
|
||||
description: |
|
||||
RGMII RX Clock Delay used only when PHY operates in RGMII mode with
|
||||
internal delay (phy-mode is 'rgmii-id' or 'rgmii-rxid') in pico-seconds.
|
||||
enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650,
|
||||
1800, 1900, 1950, 2050, 2100, 2200, 2250, 2350, 2500, 2650, 2800,
|
||||
2950, 3100, 3250, 3400, 3550, 3700, 3850, 4000, 4150 ]
|
||||
default: 1950
|
||||
|
||||
tx-internal-delay-ps:
|
||||
description: |
|
||||
RGMII TX Clock Delay used only when PHY operates in RGMII mode with
|
||||
internal delay (phy-mode is 'rgmii-id' or 'rgmii-txid') in pico-seconds.
|
||||
enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650, 1800,
|
||||
1950, 2100, 2250 ]
|
||||
default: 1950
|
||||
|
||||
motorcomm,clk-out-frequency-hz:
|
||||
description: clock output on clock output pin.
|
||||
enum: [0, 25000000, 125000000]
|
||||
default: 0
|
||||
|
||||
motorcomm,keep-pll-enabled:
|
||||
description: |
|
||||
If set, keep the PLL enabled even if there is no link. Useful if you
|
||||
want to use the clock output without an ethernet link.
|
||||
type: boolean
|
||||
|
||||
motorcomm,auto-sleep-disabled:
|
||||
description: |
|
||||
If set, the PHY will not enter sleep mode and close the AFE after the cable
|
||||
has been unplugged for a period of time.
|
||||
type: boolean
|
||||
|
||||
motorcomm,tx-clk-adj-enabled:
|
||||
description: |
|
||||
This configuration is mainly to adapt to VF2 with JH7110 SoC.
|
||||
Useful if you want to use tx-clk-xxxx-inverted to adjust the delay of the tx clock.
|
||||
type: boolean
|
||||
|
||||
motorcomm,tx-clk-10-inverted:
|
||||
description: |
|
||||
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
|
||||
Transmit PHY Clock delay train configuration when speed is 10Mbps.
|
||||
type: boolean
|
||||
|
||||
motorcomm,tx-clk-100-inverted:
|
||||
description: |
|
||||
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
|
||||
Transmit PHY Clock delay train configuration when speed is 100Mbps.
|
||||
type: boolean
|
||||
|
||||
motorcomm,tx-clk-1000-inverted:
|
||||
description: |
|
||||
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
|
||||
Transmit PHY Clock delay train configuration when speed is 1000Mbps.
|
||||
type: boolean
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
mdio {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
phy-mode = "rgmii-id";
|
||||
ethernet-phy@4 {
|
||||
/* Only needed to make DT lint tools work. Do not copy/paste
|
||||
* into real DTS files.
|
||||
*/
|
||||
compatible = "ethernet-phy-id4f51.e91a";
|
||||
|
||||
reg = <4>;
|
||||
rx-internal-delay-ps = <2100>;
|
||||
tx-internal-delay-ps = <150>;
|
||||
motorcomm,clk-out-frequency-hz = <0>;
|
||||
motorcomm,keep-pll-enabled;
|
||||
motorcomm,auto-sleep-disabled;
|
||||
};
|
||||
};
|
||||
- |
|
||||
mdio {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
phy-mode = "rgmii";
|
||||
ethernet-phy@5 {
|
||||
/* Only needed to make DT lint tools work. Do not copy/paste
|
||||
* into real DTS files.
|
||||
*/
|
||||
compatible = "ethernet-phy-id4f51.e91a";
|
||||
|
||||
reg = <5>;
|
||||
motorcomm,clk-out-frequency-hz = <125000000>;
|
||||
motorcomm,keep-pll-enabled;
|
||||
motorcomm,auto-sleep-disabled;
|
||||
};
|
||||
};
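The two examples above do not exercise the tx-clk-* properties. A hedged sketch of how they might be combined follows; which speeds need the inverted clock is board specific, so the choice of the 1000Mbps flag here is an assumption for illustration, and the delay values are simply the schema defaults.

```dts
mdio {
    #address-cells = <1>;
    #size-cells = <0>;

    ethernet-phy@0 {
        reg = <0>;

        /* Both values are members of the enums listed in the schema. */
        rx-internal-delay-ps = <1950>;
        tx-internal-delay-ps = <1950>;

        /* Enable tx clock adjustment, then select the inverted clock
         * for one speed. The speed chosen here is illustrative only.
         */
        motorcomm,tx-clk-adj-enabled;
        motorcomm,tx-clk-1000-inverted;
    };
};
```

Per the property description, the tx-clk-*-inverted flags are meant to be used together with motorcomm,tx-clk-adj-enabled.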
|
@ -18,14 +18,52 @@ description: |
|
||||
packets using CPU. Additionally, PTP is supported as well as FDMA for faster
|
||||
packet extraction/injection.
|
||||
|
||||
properties:
|
||||
$nodename:
|
||||
pattern: "^switch@[0-9a-f]+$"
|
||||
allOf:
|
||||
- if:
|
||||
properties:
|
||||
compatible:
|
||||
const: mscc,vsc7514-switch
|
||||
then:
|
||||
$ref: ethernet-switch.yaml#
|
||||
required:
|
||||
- interrupts
|
||||
- interrupt-names
|
||||
properties:
|
||||
reg:
|
||||
minItems: 21
|
||||
reg-names:
|
||||
minItems: 21
|
||||
ethernet-ports:
|
||||
patternProperties:
|
||||
"^port@[0-9a-f]+$":
|
||||
$ref: ethernet-switch-port.yaml#
|
||||
unevaluatedProperties: false
|
||||
|
||||
- if:
|
||||
properties:
|
||||
compatible:
|
||||
const: mscc,vsc7512-switch
|
||||
then:
|
||||
$ref: /schemas/net/dsa/dsa.yaml#
|
||||
properties:
|
||||
reg:
|
||||
maxItems: 20
|
||||
reg-names:
|
||||
maxItems: 20
|
||||
ethernet-ports:
|
||||
patternProperties:
|
||||
"^port@[0-9a-f]+$":
|
||||
$ref: /schemas/net/dsa/dsa-port.yaml#
|
||||
unevaluatedProperties: false
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: mscc,vsc7514-switch
|
||||
enum:
|
||||
- mscc,vsc7512-switch
|
||||
- mscc,vsc7514-switch
|
||||
|
||||
reg:
|
||||
minItems: 20
|
||||
items:
|
||||
- description: system target
|
||||
- description: rewriter target
|
||||
@ -50,6 +88,7 @@ properties:
|
||||
- description: fdma target
|
||||
|
||||
reg-names:
|
||||
minItems: 20
|
||||
items:
|
||||
- const: sys
|
||||
- const: rew
|
||||
@ -87,59 +126,16 @@ properties:
|
||||
- const: xtr
|
||||
- const: fdma
|
||||
|
||||
ethernet-ports:
|
||||
type: object
|
||||
|
||||
properties:
|
||||
'#address-cells':
|
||||
const: 1
|
||||
'#size-cells':
|
||||
const: 0
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
patternProperties:
|
||||
"^port@[0-9a-f]+$":
|
||||
type: object
|
||||
description: Ethernet ports handled by the switch
|
||||
|
||||
$ref: ethernet-controller.yaml#
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
properties:
|
||||
reg:
|
||||
description: Switch port number
|
||||
|
||||
phy-handle: true
|
||||
|
||||
phy-mode: true
|
||||
|
||||
fixed-link: true
|
||||
|
||||
mac-address: true
|
||||
|
||||
required:
|
||||
- reg
|
||||
- phy-mode
|
||||
|
||||
oneOf:
|
||||
- required:
|
||||
- phy-handle
|
||||
- required:
|
||||
- fixed-link
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- reg-names
|
||||
- interrupts
|
||||
- interrupt-names
|
||||
- ethernet-ports
|
||||
|
||||
additionalProperties: false
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
# VSC7514 (Switchdev)
|
||||
- |
|
||||
switch@1010000 {
|
||||
compatible = "mscc,vsc7514-switch";
|
||||
@ -187,5 +183,51 @@ examples:
|
||||
};
|
||||
};
|
||||
};
|
||||
# VSC7512 (DSA)
|
||||
- |
|
||||
ethernet-switch@1 {
|
||||
compatible = "mscc,vsc7512-switch";
|
||||
reg = <0x71010000 0x10000>,
|
||||
<0x71030000 0x10000>,
|
||||
<0x71080000 0x100>,
|
||||
<0x710e0000 0x10000>,
|
||||
<0x711e0000 0x100>,
|
||||
<0x711f0000 0x100>,
|
||||
<0x71200000 0x100>,
|
||||
<0x71210000 0x100>,
|
||||
<0x71220000 0x100>,
|
||||
<0x71230000 0x100>,
|
||||
<0x71240000 0x100>,
|
||||
<0x71250000 0x100>,
|
||||
<0x71260000 0x100>,
|
||||
<0x71270000 0x100>,
|
||||
<0x71280000 0x100>,
|
||||
<0x71800000 0x80000>,
|
||||
<0x71880000 0x10000>,
|
||||
<0x71040000 0x10000>,
|
||||
<0x71050000 0x10000>,
|
||||
<0x71060000 0x10000>;
|
||||
reg-names = "sys", "rew", "qs", "ptp", "port0", "port1",
|
||||
"port2", "port3", "port4", "port5", "port6",
|
||||
"port7", "port8", "port9", "port10", "qsys",
|
||||
"ana", "s0", "s1", "s2";
|
||||
|
||||
ethernet-ports {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
port@0 {
|
||||
reg = <0>;
|
||||
ethernet = <&mac_sw>;
|
||||
phy-handle = <&phy0>;
|
||||
phy-mode = "internal";
|
||||
};
|
||||
port@1 {
|
||||
reg = <1>;
|
||||
phy-handle = <&phy1>;
|
||||
phy-mode = "internal";
|
||||
};
|
||||
};
|
||||
};
|
||||
|
||||
...
|
||||
|
@ -4,7 +4,7 @@
|
||||
$id: http://devicetree.org/schemas/net/nxp,dwmac-imx.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: NXP i.MX8 DWMAC glue layer
|
||||
title: NXP i.MX8/9 DWMAC glue layer
|
||||
|
||||
maintainers:
|
||||
- Clark Wang <xiaoning.wang@nxp.com>
|
||||
@ -19,6 +19,7 @@ select:
|
||||
enum:
|
||||
- nxp,imx8mp-dwmac-eqos
|
||||
- nxp,imx8dxl-dwmac-eqos
|
||||
- nxp,imx93-dwmac-eqos
|
||||
required:
|
||||
- compatible
|
||||
|
||||
@ -32,6 +33,7 @@ properties:
|
||||
- enum:
|
||||
- nxp,imx8mp-dwmac-eqos
|
||||
- nxp,imx8dxl-dwmac-eqos
|
||||
- nxp,imx93-dwmac-eqos
|
||||
- const: snps,dwmac-5.10a
|
||||
|
||||
clocks:
|
||||
|
51
Documentation/devicetree/bindings/net/rfkill-gpio.yaml
Normal file
@ -0,0 +1,51 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/rfkill-gpio.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: GPIO controlled rfkill switch
|
||||
|
||||
maintainers:
|
||||
- Johannes Berg <johannes@sipsolutions.net>
|
||||
- Philipp Zabel <p.zabel@pengutronix.de>
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: rfkill-gpio
|
||||
|
||||
label:
|
||||
description: rfkill switch name, defaults to node name
|
||||
|
||||
radio-type:
|
||||
description: rfkill radio type
|
||||
enum:
|
||||
- bluetooth
|
||||
- fm
|
||||
- gps
|
||||
- nfc
|
||||
- ultrawideband
|
||||
- wimax
|
||||
- wlan
|
||||
- wwan
|
||||
|
||||
shutdown-gpios:
|
||||
maxItems: 1
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- radio-type
|
||||
- shutdown-gpios
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
#include <dt-bindings/gpio/gpio.h>
|
||||
|
||||
rfkill {
|
||||
compatible = "rfkill-gpio";
|
||||
label = "rfkill-pcie-wlan";
|
||||
radio-type = "wlan";
|
||||
shutdown-gpios = <&gpio2 25 GPIO_ACTIVE_HIGH>;
|
||||
};
|
@ -49,11 +49,11 @@ properties:
|
||||
- rockchip,rk3368-gmac
|
||||
- rockchip,rk3399-gmac
|
||||
- rockchip,rv1108-gmac
|
||||
- rockchip,rv1126-gmac
|
||||
- items:
|
||||
- enum:
|
||||
- rockchip,rk3568-gmac
|
||||
- rockchip,rk3588-gmac
|
||||
- rockchip,rv1126-gmac
|
||||
- const: snps,dwmac-4.20a
|
||||
|
||||
clocks:
|
||||
|
@ -552,7 +552,7 @@ required:
|
||||
|
||||
dependencies:
|
||||
snps,reset-active-low: ["snps,reset-gpio"]
|
||||
snps,reset-delay-us: ["snps,reset-gpio"]
|
||||
snps,reset-delays-us: ["snps,reset-gpio"]
|
||||
|
||||
allOf:
|
||||
- $ref: "ethernet-controller.yaml#"
|
||||
|
@ -57,6 +57,7 @@ properties:
|
||||
- ti,am654-cpsw-nuss
|
||||
- ti,j7200-cpswxg-nuss
|
||||
- ti,j721e-cpsw-nuss
|
||||
- ti,j721e-cpswxg-nuss
|
||||
- ti,am642-cpsw-nuss
|
||||
|
||||
reg:
|
||||
@ -111,7 +112,7 @@ properties:
|
||||
const: 0
|
||||
|
||||
patternProperties:
|
||||
"^port@[1-4]$":
|
||||
"^port@[1-8]$":
|
||||
type: object
|
||||
description: CPSWxG NUSS external ports
|
||||
|
||||
@ -121,7 +122,7 @@ properties:
|
||||
properties:
|
||||
reg:
|
||||
minimum: 1
|
||||
maximum: 4
|
||||
maximum: 8
|
||||
description: CPSW port number
|
||||
|
||||
phys:
|
||||
@ -186,12 +187,36 @@ allOf:
|
||||
properties:
|
||||
compatible:
|
||||
contains:
|
||||
const: ti,j7200-cpswxg-nuss
|
||||
const: ti,j721e-cpswxg-nuss
|
||||
then:
|
||||
properties:
|
||||
ethernet-ports:
|
||||
patternProperties:
|
||||
"^port@[3-4]$": false
|
||||
"^port@[5-8]$": false
|
||||
"^port@[1-4]$":
|
||||
properties:
|
||||
reg:
|
||||
minimum: 1
|
||||
maximum: 4
|
||||
|
||||
- if:
|
||||
not:
|
||||
properties:
|
||||
compatible:
|
||||
contains:
|
||||
enum:
|
||||
- ti,j721e-cpswxg-nuss
|
||||
- ti,j7200-cpswxg-nuss
|
||||
then:
|
||||
properties:
|
||||
ethernet-ports:
|
||||
patternProperties:
|
||||
"^port@[3-8]$": false
|
||||
"^port@[1-2]$":
|
||||
properties:
|
||||
reg:
|
||||
minimum: 1
|
||||
maximum: 2
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
|
@ -93,6 +93,14 @@ properties:
|
||||
description:
|
||||
Number of timestamp Generator function outputs (TS_GENFx)
|
||||
|
||||
ti,pps:
|
||||
$ref: /schemas/types.yaml#/definitions/uint32-array
|
||||
minItems: 2
|
||||
maxItems: 2
|
||||
description: |
|
||||
The pair of HWx_TS_PUSH input and TS_GENFy output indexes used for
|
||||
PPS events generation. Platform/board specific.
|
||||
|
||||
refclk-mux:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
|
@ -29,15 +29,15 @@ additionalProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
mmc {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
mmc {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
wifi@1 {
|
||||
compatible = "esp,esp8089";
|
||||
reg = <1>;
|
||||
esp,crystal-26M-en = <2>;
|
||||
};
|
||||
};
|
||||
wifi@1 {
|
||||
compatible = "esp,esp8089";
|
||||
reg = <1>;
|
||||
esp,crystal-26M-en = <2>;
|
||||
};
|
||||
};
|
||||
|
||||
...
|
||||
|
@ -1,6 +1,5 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
|
||||
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/wireless/ieee80211.yaml#
|
||||
|
@ -1,4 +1,4 @@
|
||||
Marvell 8787/8897/8997 (sd8787/sd8897/sd8997/pcie8997) SDIO/PCIE devices
|
||||
Marvell 8787/8897/8978/8997 (sd8787/sd8897/sd8978/sd8997/pcie8997) SDIO/PCIE devices
|
||||
------
|
||||
|
||||
This node provides properties for controlling the Marvell SDIO/PCIE wireless device.
|
||||
@ -10,7 +10,9 @@ Required properties:
|
||||
- compatible : should be one of the following:
|
||||
* "marvell,sd8787"
|
||||
* "marvell,sd8897"
|
||||
* "marvell,sd8978"
|
||||
* "marvell,sd8997"
|
||||
* "nxp,iw416"
|
||||
* "pci11ab,2b42"
|
||||
* "pci1b4b,2b42"
|
||||
|
||||
|
@ -1,6 +1,5 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
|
||||
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/wireless/mediatek,mt76.yaml#
|
||||
|
@ -1,6 +1,5 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
|
||||
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/net/wireless/qcom,ath11k.yaml#
|
||||
@ -21,6 +20,7 @@ properties:
|
||||
- qcom,ipq8074-wifi
|
||||
- qcom,ipq6018-wifi
|
||||
- qcom,wcn6750-wifi
|
||||
- qcom,ipq5018-wifi
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
@ -262,10 +262,10 @@ allOf:
|
||||
examples:
|
||||
- |
|
||||
|
||||
q6v5_wcss: q6v5_wcss@CD00000 {
|
||||
q6v5_wcss: remoteproc@cd00000 {
|
||||
compatible = "qcom,ipq8074-wcss-pil";
|
||||
reg = <0xCD00000 0x4040>,
|
||||
<0x4AB000 0x20>;
|
||||
reg = <0xcd00000 0x4040>,
|
||||
<0x4ab000 0x20>;
|
||||
reg-names = "qdsp6",
|
||||
"rmb";
|
||||
};
|
||||
@ -386,7 +386,7 @@ examples:
|
||||
#address-cells = <2>;
|
||||
#size-cells = <2>;
|
||||
|
||||
qcn9074_0: qcn9074_0@51100000 {
|
||||
qcn9074_0: wifi@51100000 {
|
||||
no-map;
|
||||
reg = <0x0 0x51100000 0x0 0x03500000>;
|
||||
};
|
||||
@ -463,6 +463,6 @@ examples:
|
||||
qcom,smem-states = <&wlan_smp2p_out 0>;
|
||||
qcom,smem-state-names = "wlan-smp2p-out";
|
||||
wifi-firmware {
|
||||
iommus = <&apps_smmu 0x1c02 0x1>;
|
||||
iommus = <&apps_smmu 0x1c02 0x1>;
|
||||
};
|
||||
};
|
||||
|
@ -2,7 +2,6 @@
|
||||
# Copyright (c) 2020, Silicon Laboratories, Inc.
|
||||
%YAML 1.2
|
||||
---
|
||||
|
||||
$id: http://devicetree.org/schemas/net/wireless/silabs,wfx.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
|
@ -90,47 +90,47 @@ examples:
|
||||
|
||||
// For wl12xx family:
|
||||
spi1 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
wlcore1: wlcore@1 {
|
||||
compatible = "ti,wl1271";
|
||||
reg = <1>;
|
||||
spi-max-frequency = <48000000>;
|
||||
interrupts = <8 IRQ_TYPE_LEVEL_HIGH>;
|
||||
vwlan-supply = <&vwlan_fixed>;
|
||||
clock-xtal;
|
||||
ref-clock-frequency = <38400000>;
|
||||
};
|
||||
wlcore1: wlcore@1 {
|
||||
compatible = "ti,wl1271";
|
||||
reg = <1>;
|
||||
spi-max-frequency = <48000000>;
|
||||
interrupts = <8 IRQ_TYPE_LEVEL_HIGH>;
|
||||
vwlan-supply = <&vwlan_fixed>;
|
||||
clock-xtal;
|
||||
ref-clock-frequency = <38400000>;
|
||||
};
|
||||
};
|
||||
|
||||
// For wl18xx family:
|
||||
spi2 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
wlcore2: wlcore@0 {
|
||||
compatible = "ti,wl1835";
|
||||
reg = <0>;
|
||||
spi-max-frequency = <48000000>;
|
||||
interrupts = <27 IRQ_TYPE_EDGE_RISING>;
|
||||
vwlan-supply = <&vwlan_fixed>;
|
||||
};
|
||||
wlcore2: wlcore@0 {
|
||||
compatible = "ti,wl1835";
|
||||
reg = <0>;
|
||||
spi-max-frequency = <48000000>;
|
||||
interrupts = <27 IRQ_TYPE_EDGE_RISING>;
|
||||
vwlan-supply = <&vwlan_fixed>;
|
||||
};
|
||||
};
|
||||
|
||||
// SDIO example:
|
||||
mmc3 {
|
||||
vmmc-supply = <&wlan_en_reg>;
|
||||
bus-width = <4>;
|
||||
cap-power-off-card;
|
||||
keep-power-in-suspend;
|
||||
vmmc-supply = <&wlan_en_reg>;
|
||||
bus-width = <4>;
|
||||
cap-power-off-card;
|
||||
keep-power-in-suspend;
|
||||
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
wlcore3: wlcore@2 {
|
||||
compatible = "ti,wl1835";
|
||||
reg = <2>;
|
||||
interrupts = <19 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
wlcore3: wlcore@2 {
|
||||
compatible = "ti,wl1835";
|
||||
reg = <2>;
|
||||
interrupts = <19 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
};
|
||||
|
@ -785,6 +785,8 @@ patternProperties:
|
||||
description: MaxBotix Inc.
|
||||
"^maxim,.*":
|
||||
description: Maxim Integrated Products
|
||||
"^maxlinear,.*":
|
||||
description: MaxLinear Inc.
|
||||
"^mbvl,.*":
|
||||
description: Mobiveil Inc.
|
||||
"^mcube,.*":
|
||||
@ -855,6 +857,8 @@ patternProperties:
|
||||
description: Moortec Semiconductor Ltd.
|
||||
"^mosaixtech,.*":
|
||||
description: Mosaix Technologies, Inc.
|
||||
"^motorcomm,.*":
|
||||
description: MotorComm, Inc.
|
||||
"^motorola,.*":
|
||||
description: Motorola, Inc.
|
||||
"^moxa,.*":
|
||||
|
@ -323,7 +323,7 @@ If the lowest bit of showcapimsgs is set, kernelcapi logs controller and
 application up and down events.
 
 In addition, every registered CAPI controller has an associated traceflag
-parameter controlling how CAPI messages sent from and to tha controller are
+parameter controlling how CAPI messages sent from and to the controller are
 logged. The traceflag parameter is initialized with the value of the
 showcapimsgs parameter when the controller is registered, but can later be
 changed via the MANUFACTURER_REQ command KCAPI_CMD_TRACE.
|
||||
|
@ -3,7 +3,7 @@ mISDN Driver
 ============
 
 mISDN is a new modular ISDN driver, in the long term it should replace
-the old I4L driver architecture for passiv ISDN cards.
+the old I4L driver architecture for passive ISDN cards.
 It was designed to allow a broad range of applications and interfaces
 but only have the basic function in kernel, the interface to the user
 space is based on sockets with a own address family AF_ISDN.
|
||||
|
331
Documentation/netlink/genetlink-c.yaml
Normal file
@ -0,0 +1,331 @@
|
||||
# SPDX-License-Identifier: GPL-2.0
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://kernel.org/schemas/netlink/genetlink-c.yaml#
|
||||
$schema: https://json-schema.org/draft-07/schema
|
||||
|
||||
# Common defines
|
||||
$defs:
|
||||
uint:
|
||||
type: integer
|
||||
minimum: 0
|
||||
len-or-define:
|
||||
type: [ string, integer ]
|
||||
pattern: ^[0-9A-Za-z_]+( - 1)?$
|
||||
minimum: 0
|
||||
|
||||
# Schema for specs
|
||||
title: Protocol
|
||||
description: Specification of a genetlink protocol
|
||||
type: object
|
||||
required: [ name, doc, attribute-sets, operations ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: Name of the genetlink family.
|
||||
type: string
|
||||
doc:
|
||||
type: string
|
||||
version:
|
||||
description: Generic Netlink family version. Default is 1.
|
||||
type: integer
|
||||
minimum: 1
|
||||
protocol:
|
||||
description: Schema compatibility level. Default is "genetlink".
|
||||
enum: [ genetlink, genetlink-c ]
|
||||
# Start genetlink-c
|
||||
uapi-header:
|
||||
description: Path to the uAPI header, default is linux/${family-name}.h
|
||||
type: string
|
||||
c-family-name:
|
||||
description: Name of the define for the family name.
|
||||
type: string
|
||||
c-version-name:
|
||||
description: Name of the define for the version of the family.
|
||||
type: string
|
||||
max-by-define:
|
||||
description: Makes the number of attributes and commands be specified by a define, not an enum value.
|
||||
type: boolean
|
||||
# End genetlink-c
|
||||
|
||||
definitions:
|
||||
description: List of type and constant definitions (enums, flags, defines).
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ type, name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
header:
|
||||
description: For C-compatible languages, header which already defines this value.
|
||||
type: string
|
||||
type:
|
||||
enum: [ const, enum, flags ]
|
||||
doc:
|
||||
type: string
|
||||
# For const
|
||||
value:
|
||||
description: For const - the value.
|
||||
type: [ string, integer ]
|
||||
# For enum and flags
|
||||
value-start:
|
||||
description: For enum or flags the literal initializer for the first value.
|
||||
type: [ string, integer ]
|
||||
entries:
|
||||
description: For enum or flags array of values.
|
||||
type: array
|
||||
items:
|
||||
oneOf:
|
||||
- type: string
|
||||
- type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
value:
|
||||
type: integer
|
||||
doc:
|
||||
type: string
|
||||
render-max:
|
||||
description: Render the max members for this enum.
|
||||
type: boolean
|
||||
# Start genetlink-c
|
||||
enum-name:
|
||||
description: Name for enum, if empty no name will be used.
|
||||
type: [ string, "null" ]
|
||||
name-prefix:
|
||||
description: For enum the prefix of the values, optional.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
|
||||
attribute-sets:
|
||||
description: Definition of attribute spaces for this family.
|
||||
type: array
|
||||
items:
|
||||
description: Definition of a single attribute space.
|
||||
type: object
|
||||
required: [ name, attributes ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
Name used when referring to this space in other definitions, not used outside of the spec.
|
||||
type: string
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type of the attribute.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation of the space.
|
||||
type: string
|
||||
subset-of:
|
||||
description: |
|
||||
Name of another space which this is a logical part of. Sub-spaces can be used to define
|
||||
a limited group of attributes which are used in a nest.
|
||||
type: string
|
||||
# Start genetlink-c
|
||||
attr-cnt-name:
|
||||
description: The explicit name for constant holding the count of attributes (last attr + 1).
|
||||
type: string
|
||||
attr-max-name:
|
||||
description: The explicit name for last member of attribute enum.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
attributes:
|
||||
description: List of attributes in the space.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name, type ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
type: &attr-type
|
||||
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
|
||||
string, nest, array-nest, nest-type-value ]
|
||||
doc:
|
||||
description: Documentation of the attribute.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum item representing this attribute in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
type-value:
|
||||
description: Name of the value extracted from the type of a nest-type-value attribute.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
byte-order:
|
||||
enum: [ little-endian, big-endian ]
|
||||
multi-attr:
|
||||
type: boolean
|
||||
nested-attributes:
|
||||
description: Name of the space (sub-space) used inside the attribute.
|
||||
type: string
|
||||
enum:
|
||||
description: Name of the enum type used for the attribute.
|
||||
type: string
|
||||
enum-as-flags:
|
||||
description: |
|
||||
Treat the enum as flags. In most cases enum is either used as flags or as values.
|
||||
Sometimes, however, both forms are necessary, in which case header contains the enum
|
||||
form while specific attributes may request to convert the values into a bitfield.
|
||||
type: boolean
|
||||
checks:
|
||||
description: Kernel input validation.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
flags-mask:
|
||||
description: Name of the flags constant on which to base mask (unsigned scalar types only).
|
||||
type: string
|
||||
min:
|
||||
description: Min value for an integer attribute.
|
||||
type: integer
|
||||
min-len:
|
||||
description: Min length for a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
max-len:
|
||||
description: Max length for a string or a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
sub-type: *attr-type
|
||||
|
||||
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
|
||||
dependencies:
|
||||
name-prefix:
|
||||
not:
|
||||
required: [ subset-of ]
|
||||
subset-of:
|
||||
not:
|
||||
required: [ name-prefix ]
|
||||
|
||||
operations:
|
||||
description: Operations supported by the protocol.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
enum-model:
|
||||
description: |
|
||||
The model of assigning values to the operations.
|
||||
"unified" is the recommended model where all message types belong
|
||||
to a single enum.
|
||||
"directional" has the messages sent to the kernel and from the kernel
|
||||
enumerated separately.
|
||||
enum: [ unified ]
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the command. The name is formed by concatenating
|
||||
the prefix with the upper case name of the command, with dashes replaced by underscores.
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type with commands.
|
||||
type: string
|
||||
async-prefix:
|
||||
description: Same as name-prefix but used to render notifications and events to separate enum.
|
||||
type: string
|
||||
async-enum:
|
||||
description: Name for the enum type with notifications/events.
|
||||
type: string
|
||||
list:
|
||||
description: List of commands
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
required: [ name, doc ]
|
||||
properties:
|
||||
name:
|
||||
description: Name of the operation, also defining its C enum value in uAPI.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation for the command.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
attribute-set:
|
||||
description: |
|
||||
Attribute space from which attributes directly in the requests and replies
|
||||
to this command are defined.
|
||||
type: string
|
||||
flags: &cmd_flags
|
||||
description: Command flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ admin-perm ]
|
||||
dont-validate:
|
||||
description: Kernel attribute validation flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ strict, dump ]
|
||||
do: &subop-type
|
||||
description: Main command handler.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
request: &subop-attr-list
|
||||
description: Definition of the request message for a given command.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: |
|
||||
Names of attributes from the attribute-set (not full attribute
|
||||
definitions, just names).
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
reply: *subop-attr-list
|
||||
pre:
|
||||
description: Hook for a function to run before the main callback (pre_doit or start).
|
||||
type: string
|
||||
post:
|
||||
description: Hook for a function to run after the main callback (post_doit or done).
|
||||
type: string
|
||||
dump: *subop-type
|
||||
notify:
|
||||
description: Name of the command sharing the reply type with this notification.
|
||||
type: string
|
||||
event:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: Explicit list of the attributes for the notification.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
mcgrp:
|
||||
description: Name of the multicast group generating given notification.
|
||||
type: string
|
||||
mcast-groups:
|
||||
description: List of multicast groups.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
list:
|
||||
description: List of groups.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
The name for the group, used to form the define and the value of the define.
|
||||
type: string
|
||||
# Start genetlink-c
|
||||
c-define-name:
|
||||
description: Override for the name of the define in C uAPI.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
flags: *cmd_flags
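The `len-or-define` entry under `$defs` combines three keywords: `type: [ string, integer ]`, a `pattern`, and `minimum: 0`. Under JSON Schema draft-07 rules, `pattern` constrains only strings and `minimum` only numbers, so the entry accepts either a non-negative integer or a string such as a C define name optionally followed by ` - 1`. A minimal Python sketch of that rule follows; the function name `is_len_or_define` is hypothetical and not part of the kernel tree.

```python
import re

# Pattern copied from the schema's $defs.len-or-define entry.
LEN_OR_DEFINE = re.compile(r"^[0-9A-Za-z_]+( - 1)?$")


def is_len_or_define(value):
    """Return True if value would pass the len-or-define constraints.

    Hypothetical helper for illustration: integers must be >= 0,
    strings must match the pattern, anything else is rejected.
    """
    # bool is a subclass of int in Python, but JSON Schema does not
    # treat booleans as integers, so reject them explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value >= 0
    if isinstance(value, str):
        # fullmatch avoids accepting a trailing newline after '$'.
        return LEN_OR_DEFINE.fullmatch(value) is not None
    return False
```

Note that the pattern requires spaces around the minus sign, so a string like `IFNAMSIZ - 1` is accepted while `IFNAMSIZ-1` is not.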
|
361
Documentation/netlink/genetlink-legacy.yaml
Normal file
@ -0,0 +1,361 @@
|
||||
# SPDX-License-Identifier: GPL-2.0
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://kernel.org/schemas/netlink/genetlink-legacy.yaml#
|
||||
$schema: https://json-schema.org/draft-07/schema
|
||||
|
||||
# Common defines
|
||||
$defs:
|
||||
uint:
|
||||
type: integer
|
||||
minimum: 0
|
||||
len-or-define:
|
||||
type: [ string, integer ]
|
||||
pattern: ^[0-9A-Za-z_]+( - 1)?$
|
||||
minimum: 0
|
||||
|
||||
# Schema for specs
|
||||
title: Protocol
|
||||
description: Specification of a genetlink protocol
|
||||
type: object
|
||||
required: [ name, doc, attribute-sets, operations ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: Name of the genetlink family.
|
||||
type: string
|
||||
doc:
|
||||
type: string
|
||||
version:
|
||||
description: Generic Netlink family version. Default is 1.
|
||||
type: integer
|
||||
minimum: 1
|
||||
protocol:
|
||||
description: Schema compatibility level. Default is "genetlink".
|
||||
enum: [ genetlink, genetlink-c, genetlink-legacy ] # Trim
|
||||
# Start genetlink-c
|
||||
uapi-header:
|
||||
description: Path to the uAPI header, default is linux/${family-name}.h
|
||||
type: string
|
||||
c-family-name:
|
||||
description: Name of the define for the family name.
|
||||
type: string
|
||||
c-version-name:
|
||||
description: Name of the define for the version of the family.
|
||||
type: string
|
||||
max-by-define:
|
||||
description: Makes the number of attributes and commands be specified by a define, not an enum value.
|
||||
type: boolean
|
||||
# End genetlink-c
|
||||
# Start genetlink-legacy
|
||||
kernel-policy:
|
||||
description: |
|
||||
Defines if the input policy in the kernel is global, per-operation, or split per operation type.
|
||||
Default is split.
|
||||
enum: [ split, per-op, global ]
|
||||
# End genetlink-legacy
|
||||
|
||||
definitions:
|
||||
description: List of type and constant definitions (enums, flags, defines).
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ type, name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
header:
|
||||
description: For C-compatible languages, header which already defines this value.
|
||||
type: string
|
||||
type:
|
||||
enum: [ const, enum, flags, struct ] # Trim
|
||||
doc:
|
||||
type: string
|
||||
# For const
|
||||
value:
|
||||
description: For const - the value.
|
||||
type: [ string, integer ]
|
||||
# For enum and flags
|
||||
value-start:
|
||||
description: For enum or flags the literal initializer for the first value.
|
||||
type: [ string, integer ]
|
||||
entries:
|
||||
description: For enum or flags array of values.
|
||||
type: array
|
||||
items:
|
||||
oneOf:
|
||||
- type: string
|
||||
- type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
value:
|
||||
type: integer
|
||||
doc:
|
||||
type: string
|
||||
render-max:
|
||||
description: Render the max members for this enum.
|
||||
type: boolean
|
||||
# Start genetlink-c
|
||||
enum-name:
|
||||
description: Name for enum, if empty no name will be used.
|
||||
type: [ string, "null" ]
|
||||
name-prefix:
|
||||
description: For enum the prefix of the values, optional.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
# Start genetlink-legacy
|
||||
members:
|
||||
description: List of struct members. Only scalars and strings members allowed.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name, type ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
type:
|
||||
enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ]
|
||||
len:
|
||||
$ref: '#/$defs/len-or-define'
|
||||
# End genetlink-legacy
|
||||
|
||||
attribute-sets:
|
||||
description: Definition of attribute spaces for this family.
|
||||
type: array
|
||||
items:
|
||||
description: Definition of a single attribute space.
|
||||
type: object
|
||||
required: [ name, attributes ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
Name used when referring to this space in other definitions, not used outside of the spec.
|
||||
type: string
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type of the attribute.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation of the space.
|
||||
type: string
|
||||
subset-of:
|
||||
description: |
|
||||
Name of another space which this is a logical part of. Sub-spaces can be used to define
|
||||
a limited group of attributes which are used in a nest.
|
||||
type: string
|
||||
# Start genetlink-c
|
||||
attr-cnt-name:
|
||||
description: The explicit name for constant holding the count of attributes (last attr + 1).
|
||||
type: string
|
||||
attr-max-name:
|
||||
description: The explicit name for last member of attribute enum.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
attributes:
|
||||
description: List of attributes in the space.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name, type ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
type: &attr-type
|
||||
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
|
||||
string, nest, array-nest, nest-type-value ]
|
||||
doc:
|
||||
description: Documentation of the attribute.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum item representing this attribute in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
type-value:
|
||||
description: Name of the value extracted from the type of a nest-type-value attribute.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
byte-order:
|
||||
enum: [ little-endian, big-endian ]
|
||||
multi-attr:
|
||||
type: boolean
|
||||
nested-attributes:
|
||||
description: Name of the space (sub-space) used inside the attribute.
|
||||
type: string
|
||||
enum:
|
||||
description: Name of the enum type used for the attribute.
|
||||
type: string
|
||||
enum-as-flags:
|
||||
description: |
|
||||
Treat the enum as flags. In most cases enum is either used as flags or as values.
|
||||
Sometimes, however, both forms are necessary, in which case header contains the enum
|
||||
form while specific attributes may request to convert the values into a bitfield.
|
||||
type: boolean
|
||||
checks:
|
||||
description: Kernel input validation.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
flags-mask:
|
||||
description: Name of the flags constant on which to base mask (unsigned scalar types only).
|
||||
type: string
|
||||
min:
|
||||
description: Min value for an integer attribute.
|
||||
type: integer
|
||||
min-len:
|
||||
description: Min length for a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
max-len:
|
||||
description: Max length for a string or a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
sub-type: *attr-type
|
||||
|
||||
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
|
||||
dependencies:
|
||||
name-prefix:
|
||||
not:
|
||||
required: [ subset-of ]
|
||||
subset-of:
|
||||
not:
|
||||
required: [ name-prefix ]
|
||||
|
||||
operations:
|
||||
description: Operations supported by the protocol.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
enum-model:
|
||||
description: |
|
||||
The model of assigning values to the operations.
|
||||
"unified" is the recommended model where all message types belong
|
||||
to a single enum.
|
||||
"directional" has the messages sent to the kernel and from the kernel
|
||||
enumerated separately.
|
||||
enum: [ unified, directional ] # Trim
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the command. The name is formed by concatenating
|
||||
the prefix with the upper case name of the command, with dashes replaced by underscores.
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type with commands.
|
||||
type: string
|
||||
async-prefix:
|
||||
description: Same as name-prefix but used to render notifications and events to separate enum.
|
||||
type: string
|
||||
async-enum:
|
||||
description: Name for the enum type with notifications/events.
|
||||
type: string
|
||||
list:
|
||||
description: List of commands
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
required: [ name, doc ]
|
||||
properties:
|
||||
name:
|
||||
description: Name of the operation, also defining its C enum value in uAPI.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation for the command.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
attribute-set:
|
||||
description: |
|
||||
Attribute space from which attributes directly in the requests and replies
|
||||
to this command are defined.
|
||||
type: string
|
||||
flags: &cmd_flags
|
||||
description: Command flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ admin-perm ]
|
||||
dont-validate:
|
||||
description: Kernel attribute validation flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ strict, dump ]
|
||||
do: &subop-type
|
||||
description: Main command handler.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
request: &subop-attr-list
|
||||
description: Definition of the request message for a given command.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: |
|
||||
Names of attributes from the attribute-set (not full attribute
|
||||
definitions, just names).
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
# Start genetlink-legacy
|
||||
value:
|
||||
description: |
|
||||
ID of this message if value for request and response differ,
|
||||
i.e. requests and responses have different message enums.
|
||||
$ref: '#/$defs/uint'
|
||||
# End genetlink-legacy
|
||||
reply: *subop-attr-list
|
||||
pre:
|
||||
description: Hook for a function to run before the main callback (pre_doit or start).
|
||||
type: string
|
||||
post:
|
||||
description: Hook for a function to run after the main callback (post_doit or done).
|
||||
type: string
|
||||
dump: *subop-type
|
||||
notify:
|
||||
description: Name of the command sharing the reply type with this notification.
|
||||
type: string
|
||||
event:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: Explicit list of the attributes for the notification.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
mcgrp:
|
||||
description: Name of the multicast group generating given notification.
|
||||
type: string
|
||||
mcast-groups:
|
||||
description: List of multicast groups.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
list:
|
||||
description: List of groups.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
The name for the group, used to form the define and the value of the define.
|
||||
type: string
|
||||
# Start genetlink-c
|
||||
c-define-name:
|
||||
description: Override for the name of the define in C uAPI.
|
||||
type: string
|
||||
# End genetlink-c
|
||||
flags: *cmd_flags
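The `name-prefix` description under `operations` states that the C enum name is formed by concatenating the prefix with the upper-case command name, with dashes replaced by underscores. A minimal Python sketch of that naming rule is shown below. The function name `c_enum_name` and the sample prefix are hypothetical, and applying the upper-case and dash replacement to the prefix as well as the command name is an assumption of this sketch, not something the schema text spells out.

```python
def c_enum_name(name_prefix: str, op_name: str) -> str:
    """Build a C enum member name from a spec prefix and operation name.

    Assumption: the prefix receives the same transformation as the
    operation name (upper-cased, dashes turned into underscores).
    """
    return (name_prefix + op_name).upper().replace("-", "_")


# Hypothetical prefix and operation name, for illustration only.
example = c_enum_name("demo-cmd-", "get-info")
```

With these inputs, `example` holds `DEMO_CMD_GET_INFO`.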
|
296
Documentation/netlink/genetlink.yaml
Normal file
@ -0,0 +1,296 @@
|
||||
# SPDX-License-Identifier: GPL-2.0
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://kernel.org/schemas/netlink/genetlink-legacy.yaml#
|
||||
$schema: https://json-schema.org/draft-07/schema
|
||||
|
||||
# Common defines
|
||||
$defs:
|
||||
uint:
|
||||
type: integer
|
||||
minimum: 0
|
||||
len-or-define:
|
||||
type: [ string, integer ]
|
||||
pattern: ^[0-9A-Za-z_]+( - 1)?$
|
||||
minimum: 0
|
||||
|
||||
# Schema for specs
|
||||
title: Protocol
|
||||
description: Specification of a genetlink protocol
|
||||
type: object
|
||||
required: [ name, doc, attribute-sets, operations ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: Name of the genetlink family.
|
||||
type: string
|
||||
doc:
|
||||
type: string
|
||||
version:
|
||||
description: Generic Netlink family version. Default is 1.
|
||||
type: integer
|
||||
minimum: 1
|
||||
protocol:
|
||||
description: Schema compatibility level. Default is "genetlink".
|
||||
enum: [ genetlink ]
|
||||
|
||||
definitions:
|
||||
description: List of type and constant definitions (enums, flags, defines).
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ type, name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
header:
|
||||
description: For C-compatible languages, header which already defines this value.
|
||||
type: string
|
||||
type:
|
||||
enum: [ const, enum, flags ]
|
||||
doc:
|
||||
type: string
|
||||
# For const
|
||||
value:
|
||||
description: For const - the value.
|
||||
type: [ string, integer ]
|
||||
# For enum and flags
|
||||
value-start:
|
||||
description: For enum or flags the literal initializer for the first value.
|
||||
type: [ string, integer ]
|
||||
entries:
|
||||
description: For enum or flags array of values.
|
||||
type: array
|
||||
items:
|
||||
oneOf:
|
||||
- type: string
|
||||
- type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
value:
|
||||
type: integer
|
||||
doc:
|
||||
type: string
|
||||
render-max:
|
||||
description: Render the max members for this enum.
|
||||
type: boolean
|
||||
|
||||
attribute-sets:
|
||||
description: Definition of attribute spaces for this family.
|
||||
type: array
|
||||
items:
|
||||
description: Definition of a single attribute space.
|
||||
type: object
|
||||
required: [ name, attributes ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
Name used when referring to this space in other definitions, not used outside of the spec.
|
||||
type: string
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type of the attribute.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation of the space.
|
||||
type: string
|
||||
subset-of:
|
||||
description: |
|
||||
Name of another space which this is a logical part of. Sub-spaces can be used to define
|
||||
a limited group of attributes which are used in a nest.
|
||||
type: string
|
||||
attributes:
|
||||
description: List of attributes in the space.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name, type ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
type: string
|
||||
type: &attr-type
|
||||
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
|
||||
string, nest, array-nest, nest-type-value ]
|
||||
doc:
|
||||
description: Documentation of the attribute.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum item representing this attribute in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
type-value:
|
||||
description: Name of the value extracted from the type of a nest-type-value attribute.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
byte-order:
|
||||
enum: [ little-endian, big-endian ]
|
||||
multi-attr:
|
||||
type: boolean
|
||||
nested-attributes:
|
||||
description: Name of the space (sub-space) used inside the attribute.
|
||||
type: string
|
||||
enum:
|
||||
description: Name of the enum type used for the attribute.
|
||||
type: string
|
||||
enum-as-flags:
|
||||
description: |
|
||||
Treat the enum as flags. In most cases enum is either used as flags or as values.
|
||||
Sometimes, however, both forms are necessary, in which case header contains the enum
|
||||
form while specific attributes may request to convert the values into a bitfield.
|
||||
type: boolean
|
||||
checks:
|
||||
description: Kernel input validation.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
flags-mask:
|
||||
description: Name of the flags constant on which to base mask (unsigned scalar types only).
|
||||
type: string
|
||||
min:
|
||||
description: Min value for an integer attribute.
|
||||
type: integer
|
||||
min-len:
|
||||
description: Min length for a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
max-len:
|
||||
description: Max length for a string or a binary attribute.
|
||||
$ref: '#/$defs/len-or-define'
|
||||
sub-type: *attr-type
|
||||
|
||||
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
|
||||
dependencies:
|
||||
name-prefix:
|
||||
not:
|
||||
required: [ subset-of ]
|
||||
subset-of:
|
||||
not:
|
||||
required: [ name-prefix ]
|
||||
|
||||
operations:
|
||||
description: Operations supported by the protocol.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
enum-model:
|
||||
description: |
|
||||
The model of assigning values to the operations.
|
||||
"unified" is the recommended model where all message types belong
|
||||
to a single enum.
|
||||
"directional" has the messages sent to the kernel and from the kernel
|
||||
enumerated separately.
|
||||
enum: [ unified ]
|
||||
name-prefix:
|
||||
description: |
|
||||
Prefix for the C enum name of the command. The name is formed by concatenating
|
||||
the prefix with the upper case name of the command, with dashes replaced by underscores.
|
||||
type: string
|
||||
enum-name:
|
||||
description: Name for the enum type with commands.
|
||||
type: string
|
||||
async-prefix:
|
||||
description: Same as name-prefix but used to render notifications and events to separate enum.
|
||||
type: string
|
||||
async-enum:
|
||||
description: Name for the enum type with notifications/events.
|
||||
type: string
|
||||
list:
|
||||
description: List of commands
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
required: [ name, doc ]
|
||||
properties:
|
||||
name:
|
||||
description: Name of the operation, also defining its C enum value in uAPI.
|
||||
type: string
|
||||
doc:
|
||||
description: Documentation for the command.
|
||||
type: string
|
||||
value:
|
||||
description: Value for the enum in the uAPI.
|
||||
$ref: '#/$defs/uint'
|
||||
attribute-set:
|
||||
description: |
|
||||
Attribute space from which attributes directly in the requests and replies
|
||||
to this command are defined.
|
||||
type: string
|
||||
flags: &cmd_flags
|
||||
description: Command flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ admin-perm ]
|
||||
dont-validate:
|
||||
description: Kernel attribute validation flags.
|
||||
type: array
|
||||
items:
|
||||
enum: [ strict, dump ]
|
||||
do: &subop-type
|
||||
description: Main command handler.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
request: &subop-attr-list
|
||||
description: Definition of the request message for a given command.
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: |
|
||||
Names of attributes from the attribute-set (not full attribute
|
||||
definitions, just names).
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
reply: *subop-attr-list
|
||||
pre:
|
||||
description: Hook for a function to run before the main callback (pre_doit or start).
|
||||
type: string
|
||||
post:
|
||||
description: Hook for a function to run after the main callback (post_doit or done).
|
||||
type: string
|
||||
dump: *subop-type
|
||||
notify:
|
||||
description: Name of the command sharing the reply type with this notification.
|
||||
type: string
|
||||
event:
|
||||
type: object
|
||||
additionalProperties: False
|
||||
properties:
|
||||
attributes:
|
||||
description: Explicit list of the attributes for the notification.
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
mcgrp:
|
||||
description: Name of the multicast group generating given notification.
|
||||
type: string
|
||||
mcast-groups:
|
||||
description: List of multicast groups.
|
||||
type: object
|
||||
required: [ list ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
list:
|
||||
description: List of groups.
|
||||
type: array
|
||||
items:
|
||||
type: object
|
||||
required: [ name ]
|
||||
additionalProperties: False
|
||||
properties:
|
||||
name:
|
||||
description: |
|
||||
The name for the group, used to form the define and the value of the define.
|
||||
type: string
|
||||
flags: *cmd_flags
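# Illustrative note, not part of the schema: a sketch of the name-prefix rule
# described under "operations" above, using a hypothetical family.
#   operations:
#     name-prefix: demo-cmd-
#     list:
#       -
#         name: dev-get
#         doc: Hypothetical command.
# Per the description, the C enum entry is the prefix concatenated with the
# upper-case command name, with dashes replaced by underscores. Assuming the
# prefix is upper-cased the same way, this would render as DEMO_CMD_DEV_GET.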
|
397
Documentation/netlink/specs/ethtool.yaml
Normal file
@@ -0,0 +1,397 @@
|
||||
name: ethtool
|
||||
|
||||
protocol: genetlink-legacy
|
||||
|
||||
doc: Partial family for Ethtool Netlink.
|
||||
|
||||
attribute-sets:
|
||||
-
|
||||
name: header
|
||||
attributes:
|
||||
-
|
||||
name: dev-index
|
||||
type: u32
|
||||
value: 1
|
||||
-
|
||||
name: dev-name
|
||||
type: string
|
||||
-
|
||||
name: flags
|
||||
type: u32
|
||||
|
||||
-
|
||||
name: bitset-bit
|
||||
attributes:
|
||||
-
|
||||
name: index
|
||||
type: u32
|
||||
value: 1
|
||||
-
|
||||
name: name
|
||||
type: string
|
||||
-
|
||||
name: value
|
||||
type: flag
|
||||
-
|
||||
name: bitset-bits
|
||||
attributes:
|
||||
-
|
||||
name: bit
|
||||
type: nest
|
||||
nested-attributes: bitset-bit
|
||||
value: 1
|
||||
-
|
||||
name: bitset
|
||||
attributes:
|
||||
-
|
||||
name: nomask
|
||||
type: flag
|
||||
value: 1
|
||||
-
|
||||
name: size
|
||||
type: u32
|
||||
-
|
||||
name: bits
|
||||
type: nest
|
||||
nested-attributes: bitset-bits
|
||||
|
||||
-
|
||||
name: string
|
||||
attributes:
|
||||
-
|
||||
name: index
|
||||
type: u32
|
||||
value: 1
|
||||
-
|
||||
name: value
|
||||
type: string
|
||||
-
|
||||
name: strings
|
||||
attributes:
|
||||
-
|
||||
name: string
|
||||
type: nest
|
||||
value: 1
|
||||
multi-attr: true
|
||||
nested-attributes: string
|
||||
-
|
||||
name: stringset
|
||||
attributes:
|
||||
-
|
||||
name: id
|
||||
type: u32
|
||||
value: 1
|
||||
-
|
||||
name: count
|
||||
type: u32
|
||||
-
|
||||
name: strings
|
||||
type: nest
|
||||
multi-attr: true
|
||||
nested-attributes: strings
|
||||
-
|
||||
name: stringsets
|
||||
attributes:
|
||||
-
|
||||
name: stringset
|
||||
type: nest
|
||||
multi-attr: true
|
||||
value: 1
|
||||
nested-attributes: stringset
|
||||
-
|
||||
name: strset
|
||||
attributes:
|
||||
-
|
||||
name: header
|
||||
value: 1
|
||||
type: nest
|
||||
nested-attributes: header
|
||||
-
|
||||
name: stringsets
|
||||
type: nest
|
||||
nested-attributes: stringsets
|
||||
-
|
||||
name: counts-only
|
||||
type: flag
|
||||
|
||||
-
|
||||
name: privflags
|
||||
attributes:
|
||||
-
|
||||
name: header
|
||||
value: 1
|
||||
type: nest
|
||||
nested-attributes: header
|
||||
-
|
||||
name: flags
|
||||
type: nest
|
||||
nested-attributes: bitset
|
||||
|
||||
-
|
||||
name: rings
|
||||
attributes:
|
||||
-
|
||||
name: header
|
||||
value: 1
|
||||
type: nest
|
||||
nested-attributes: header
|
||||
-
|
||||
name: rx-max
|
||||
type: u32
|
||||
-
|
||||
name: rx-mini-max
|
||||
type: u32
|
||||
-
|
||||
name: rx-jumbo-max
|
||||
type: u32
|
||||
-
|
||||
name: tx-max
|
||||
type: u32
|
||||
-
|
||||
name: rx
|
||||
type: u32
|
||||
-
|
||||
name: rx-mini
|
||||
type: u32
|
||||
-
|
||||
name: rx-jumbo
|
||||
type: u32
|
||||
-
|
||||
name: tx
|
||||
type: u32
|
||||
-
|
||||
name: rx-buf-len
|
||||
type: u32
|
||||
-
|
||||
name: tcp-data-split
|
||||
type: u8
|
||||
-
|
||||
name: cqe-size
|
||||
type: u32
|
||||
-
|
||||
name: tx-push
|
||||
type: u8
|
||||
-
|
||||
name: rx-push
|
||||
type: u8
|
||||
|
||||
-
|
||||
name: mm-stat
|
||||
attributes:
|
||||
-
|
||||
name: pad
|
||||
value: 1
|
||||
type: pad
|
||||
-
|
||||
name: reassembly-errors
|
||||
type: u64
|
||||
-
|
||||
name: smd-errors
|
||||
type: u64
|
||||
-
|
||||
name: reassembly-ok
|
||||
type: u64
|
||||
-
|
||||
name: rx-frag-count
|
||||
type: u64
|
||||
-
|
||||
name: tx-frag-count
|
||||
type: u64
|
||||
-
|
||||
name: hold-count
|
||||
type: u64
|
||||
-
|
||||
name: mm
|
||||
attributes:
|
||||
-
|
||||
name: header
|
||||
value: 1
|
||||
type: nest
|
||||
nested-attributes: header
|
||||
-
|
||||
name: pmac-enabled
|
||||
type: u8
|
||||
-
|
||||
name: tx-enabled
|
||||
type: u8
|
||||
-
|
||||
name: tx-active
|
||||
type: u8
|
||||
-
|
||||
name: tx-min-frag-size
|
||||
type: u32
|
||||
-
|
||||
name: rx-min-frag-size
|
||||
type: u32
|
||||
-
|
||||
name: verify-enabled
|
||||
type: u8
|
||||
-
|
||||
name: verify-status
|
||||
type: u8
|
||||
-
|
||||
name: verify-time
|
||||
type: u32
|
||||
-
|
||||
name: max-verify-time
|
||||
type: u32
|
||||
-
|
||||
name: stats
|
||||
type: nest
|
||||
nested-attributes: mm-stat
|
||||
|
||||
operations:
|
||||
enum-model: directional
|
||||
list:
|
||||
-
|
||||
name: strset-get
|
||||
doc: Get string set from the kernel.
|
||||
|
||||
attribute-set: strset
|
||||
|
||||
do: &strset-get-op
|
||||
request:
|
||||
value: 1
|
||||
attributes:
|
||||
- header
|
||||
- stringsets
|
||||
- counts-only
|
||||
reply:
|
||||
value: 1
|
||||
attributes:
|
||||
- header
|
||||
- stringsets
|
||||
dump: *strset-get-op
|
||||
|
||||
# TODO: fill in the requests in between
|
||||
|
||||
-
|
||||
name: privflags-get
|
||||
doc: Get device private flags.
|
||||
|
||||
attribute-set: privflags
|
||||
|
||||
do: &privflag-get-op
|
||||
request:
|
||||
value: 13
|
||||
attributes:
|
||||
- header
|
||||
reply:
|
||||
value: 14
|
||||
attributes:
|
||||
- header
|
||||
- flags
|
||||
dump: *privflag-get-op
|
||||
-
|
||||
name: privflags-set
|
||||
doc: Set device private flags.
|
||||
|
||||
attribute-set: privflags
|
||||
|
||||
do:
|
||||
request:
|
||||
attributes:
|
||||
- header
|
||||
- flags
|
||||
-
|
||||
name: privflags-ntf
|
||||
doc: Notification for change in device private flags.
|
||||
notify: privflags-get
|
||||
|
||||
-
|
||||
name: rings-get
|
||||
doc: Get ring params.
|
||||
|
||||
attribute-set: rings
|
||||
|
||||
do: &ring-get-op
|
||||
request:
|
||||
attributes:
|
||||
- header
|
||||
reply:
|
||||
attributes:
|
||||
- header
|
||||
- rx-max
|
||||
- rx-mini-max
|
||||
- rx-jumbo-max
|
||||
- tx-max
|
||||
- rx
|
||||
- rx-mini
|
||||
- rx-jumbo
|
||||
- tx
|
||||
- rx-buf-len
|
||||
- tcp-data-split
|
||||
- cqe-size
|
||||
- tx-push
|
||||
- rx-push
|
||||
dump: *ring-get-op
|
||||
-
|
||||
name: rings-set
|
||||
doc: Set ring params.
|
||||
|
||||
attribute-set: rings
|
||||
|
||||
do:
|
||||
request:
|
||||
attributes:
|
||||
- header
|
||||
- rx
|
||||
- rx-mini
|
||||
- rx-jumbo
|
||||
- tx
|
||||
- rx-buf-len
|
||||
- tcp-data-split
|
||||
- cqe-size
|
||||
- tx-push
|
||||
- rx-push
|
||||
-
|
||||
name: rings-ntf
|
||||
doc: Notification for change in ring params.
|
||||
notify: rings-get
|
||||
|
||||
# TODO: fill in the requests in between
|
||||
|
||||
-
|
||||
name: mm-get
|
||||
doc: Get MAC Merge configuration and state
|
||||
|
||||
attribute-set: mm
|
||||
|
||||
do: &mm-get-op
|
||||
request:
|
||||
value: 42
|
||||
attributes:
|
||||
- header
|
||||
reply:
|
||||
value: 42
|
||||
attributes:
|
||||
- header
|
||||
- pmac-enabled
|
||||
- tx-enabled
|
||||
- tx-active
|
||||
- tx-min-frag-size
|
||||
- rx-min-frag-size
|
||||
- verify-enabled
|
||||
- verify-time
|
||||
- max-verify-time
|
||||
- stats
|
||||
dump: *mm-get-op
|
||||
-
|
||||
name: mm-set
|
||||
doc: Set MAC Merge configuration
|
||||
|
||||
attribute-set: mm
|
||||
|
||||
do:
|
||||
request:
|
||||
attributes:
|
||||
- header
|
||||
- verify-enabled
|
||||
- verify-time
|
||||
- tx-enabled
|
||||
- pmac-enabled
|
||||
- tx-min-frag-size
|
||||
-
|
||||
name: mm-ntf
|
||||
doc: Notification for change in MAC Merge configuration.
|
||||
notify: mm-get
|
128
Documentation/netlink/specs/fou.yaml
Normal file
@@ -0,0 +1,128 @@
|
||||
name: fou
|
||||
|
||||
protocol: genetlink-legacy
|
||||
|
||||
doc: |
|
||||
Foo-over-UDP.
|
||||
|
||||
c-family-name: fou-genl-name
|
||||
c-version-name: fou-genl-version
|
||||
max-by-define: true
|
||||
kernel-policy: global
|
||||
|
||||
definitions:
|
||||
-
|
||||
type: enum
|
||||
name: encap_type
|
||||
name-prefix: fou-encap-
|
||||
enum-name:
|
||||
entries: [ unspec, direct, gue ]
|
||||
|
||||
attribute-sets:
|
||||
-
|
||||
name: fou
|
||||
name-prefix: fou-attr-
|
||||
attributes:
|
||||
-
|
||||
name: unspec
|
||||
type: unused
|
||||
-
|
||||
name: port
|
||||
type: u16
|
||||
byte-order: big-endian
|
||||
-
|
||||
name: af
|
||||
type: u8
|
||||
-
|
||||
name: ipproto
|
||||
type: u8
|
||||
-
|
||||
name: type
|
||||
type: u8
|
||||
-
|
||||
name: remcsum_nopartial
|
||||
type: flag
|
||||
-
|
||||
name: local_v4
|
||||
type: u32
|
||||
-
|
||||
name: local_v6
|
||||
type: binary
|
||||
checks:
|
||||
min-len: 16
|
||||
-
|
||||
name: peer_v4
|
||||
type: u32
|
||||
-
|
||||
name: peer_v6
|
||||
type: binary
|
||||
checks:
|
||||
min-len: 16
|
||||
-
|
||||
name: peer_port
|
||||
type: u16
|
||||
byte-order: big-endian
|
||||
-
|
||||
name: ifindex
|
||||
type: s32
|
||||
|
||||
operations:
|
||||
list:
|
||||
-
|
||||
name: unspec
|
||||
doc: unused
|
||||
|
||||
-
|
||||
name: add
|
||||
doc: Add port.
|
||||
attribute-set: fou
|
||||
|
||||
dont-validate: [ strict, dump ]
|
||||
flags: [ admin-perm ]
|
||||
|
||||
do:
|
||||
request: &all_attrs
|
||||
attributes:
|
||||
- port
|
||||
- ipproto
|
||||
- type
|
||||
- remcsum_nopartial
|
||||
- local_v4
|
||||
- peer_v4
|
||||
- local_v6
|
||||
- peer_v6
|
||||
- peer_port
|
||||
- ifindex
|
||||
|
||||
-
|
||||
name: del
|
||||
doc: Delete port.
|
||||
attribute-set: fou
|
||||
|
||||
dont-validate: [ strict, dump ]
|
||||
flags: [ admin-perm ]
|
||||
|
||||
do:
|
||||
request: &select_attrs
|
||||
attributes:
|
||||
- af
|
||||
- ifindex
|
||||
- port
|
||||
- peer_port
|
||||
- local_v4
|
||||
- peer_v4
|
||||
- local_v6
|
||||
- peer_v6
|
||||
|
||||
-
|
||||
name: get
|
||||
doc: Get tunnel info.
|
||||
attribute-set: fou
|
||||
dont-validate: [ strict, dump ]
|
||||
|
||||
do:
|
||||
request: *select_attrs
|
||||
reply: *all_attrs
|
||||
|
||||
dump:
|
||||
reply: *all_attrs
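# Illustrative note, not part of the spec: "&select_attrs" and "&all_attrs"
# are plain YAML anchors, so "request: *select_attrs" in the get operation
# should load as if the list from the del operation had been written out again:
#   request:
#     attributes: [ af, ifindex, port, peer_port, local_v4, peer_v4, local_v6, peer_v6 ]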
|
100
Documentation/netlink/specs/netdev.yaml
Normal file
@@ -0,0 +1,100 @@
|
||||
name: netdev
|
||||
|
||||
doc:
|
||||
netdev configuration over generic netlink.
|
||||
|
||||
definitions:
|
||||
-
|
||||
type: flags
|
||||
name: xdp-act
|
||||
entries:
|
||||
-
|
||||
name: basic
|
||||
doc:
|
||||
XDP features set supported by all drivers
|
||||
(XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX)
|
||||
-
|
||||
name: redirect
|
||||
doc:
|
||||
The netdev supports XDP_REDIRECT
|
||||
-
|
||||
name: ndo-xmit
|
||||
doc:
|
||||
This feature informs if netdev implements ndo_xdp_xmit callback.
|
||||
-
|
||||
name: xsk-zerocopy
|
||||
doc:
|
||||
This feature informs if netdev supports AF_XDP in zero copy mode.
|
||||
-
|
||||
name: hw-offload
|
||||
doc:
|
||||
This feature informs if netdev supports XDP hw offloading.
|
||||
-
|
||||
name: rx-sg
|
||||
doc:
|
||||
This feature informs if netdev implements non-linear XDP buffer
|
||||
support in the driver napi callback.
|
||||
-
|
||||
name: ndo-xmit-sg
|
||||
doc:
|
||||
This feature informs if netdev implements non-linear XDP buffer
|
||||
support in ndo_xdp_xmit callback.
|
||||
|
||||
attribute-sets:
|
||||
-
|
||||
name: dev
|
||||
attributes:
|
||||
-
|
||||
name: ifindex
|
||||
doc: netdev ifindex
|
||||
type: u32
|
||||
value: 1
|
||||
checks:
|
||||
min: 1
|
||||
-
|
||||
name: pad
|
||||
type: pad
|
||||
-
|
||||
name: xdp-features
|
||||
doc: Bitmask of enabled xdp-features.
|
||||
type: u64
|
||||
enum: xdp-act
|
||||
enum-as-flags: true
|
||||
|
||||
operations:
|
||||
list:
|
||||
-
|
||||
name: dev-get
|
||||
doc: Get / dump information about a netdev.
|
||||
value: 1
|
||||
attribute-set: dev
|
||||
do:
|
||||
request:
|
||||
attributes:
|
||||
- ifindex
|
||||
reply: &dev-all
|
||||
attributes:
|
||||
- ifindex
|
||||
- xdp-features
|
||||
dump:
|
||||
reply: *dev-all
|
||||
-
|
||||
name: dev-add-ntf
|
||||
doc: Notification about device appearing.
|
||||
notify: dev-get
|
||||
mcgrp: mgmt
|
||||
-
|
||||
name: dev-del-ntf
|
||||
doc: Notification about device disappearing.
|
||||
notify: dev-get
|
||||
mcgrp: mgmt
|
||||
-
|
||||
name: dev-change-ntf
|
||||
doc: Notification about device configuration being changed.
|
||||
notify: dev-get
|
||||
mcgrp: mgmt
|
||||
|
||||
mcast-groups:
|
||||
list:
|
||||
-
|
||||
name: mgmt
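# Illustrative note, not part of the spec: assuming entries of a "flags"
# definition take consecutive bits in list order starting at bit 0, xdp-act
# would map as
#   basic = 0x01, redirect = 0x02, ndo-xmit = 0x04, xsk-zerocopy = 0x08,
#   hw-offload = 0x10, rx-sg = 0x20, ndo-xmit-sg = 0x40
# so an xdp-features value of 0x23 would decode to basic, redirect and rx-sg.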
|
@@ -419,7 +419,7 @@ XDP_UMEM_REG setsockopt
|
||||
-----------------------
|
||||
|
||||
This setsockopt registers a UMEM to a socket. This is the area that
|
||||
contain all the buffers that packet can recide in. The call takes a
|
||||
contain all the buffers that packet can reside in. The call takes a
|
||||
pointer to the beginning of this area and the size of it. Moreover, it
|
||||
also has parameter called chunk_size that is the size that the UMEM is
|
||||
divided into. It can only be 2K or 4K at the moment. If you have an
|
||||
@@ -592,7 +592,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
|
||||
A number of other ways are possible all up to the capabilities of
|
||||
the NIC you have.
|
||||
|
||||
Q: Can I use the XSKMAP to implement a switch betwen different umems
|
||||
Q: Can I use the XSKMAP to implement a switch between different umems
|
||||
in copy mode?
|
||||
|
||||
A: The short answer is no, that is not supported at the moment. The
|
||||
|
@@ -1902,7 +1902,7 @@ of 32 possible I/O Base addresses using the following tables::
|
||||
6 | 10
|
||||
|
||||
The I/O address is sum of all switches set to "1". Remember that
|
||||
the I/O address space bellow 0x200 is RESERVED for mainboard, so
|
||||
the I/O address space below 0x200 is RESERVED for mainboard, so
|
||||
switch 1 should be ALWAYS SET TO OFF.
|
||||
|
||||
|
||||
|
@@ -159,7 +159,7 @@ Please send us comments, experiences, questions, anything :)
|
||||
IRC:
|
||||
#batadv on ircs://irc.hackint.org/
|
||||
Mailing-list:
|
||||
b.a.t.m.a.n@open-mesh.org (optional subscription at
|
||||
b.a.t.m.a.n@lists.open-mesh.org (optional subscription at
|
||||
https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
|
||||
|
||||
You can also contact the Authors:
|
||||
|
@@ -931,7 +931,7 @@ ival1:
|
||||
ival2:
|
||||
Throttle the received message rate down to the value of ival2. This
|
||||
is useful to reduce messages for the application when the signal inside the
|
||||
CAN frame is stateless as state changes within the ival2 periode may get
|
||||
CAN frame is stateless as state changes within the ival2 period may get
|
||||
lost.
|
||||
|
||||
Broadcast Manager Multiplex Message Receive Filter
|
||||
|
@@ -50,7 +50,7 @@ Setup Packet
|
||||
``wIndex`` USB Interface Index (0 for device commands)
|
||||
``wLength`` * Host to Device - Number of bytes to transmit
|
||||
* Device to Host - Maximum Number of bytes to
|
||||
receive. If the device send less. Commom ZLP
|
||||
receive. If the device send less. Common ZLP
|
||||
semantics are used.
|
||||
================= =====================================================
|
||||
|
||||
|
@@ -93,7 +93,7 @@ MBIM function can be looked up using sysfs. For example::
|
||||
USB configuration descriptors
|
||||
-----------------------------
|
||||
The wMaxControlMessage field of the CDC MBIM functional descriptor
|
||||
limits the maximum control message size. The managament application is
|
||||
limits the maximum control message size. The management application is
|
||||
responsible for negotiating a control message size complying with the
|
||||
requirements in section 9.3.1 of [1], taking this descriptor field
|
||||
into consideration.
|
||||
|
@@ -4,7 +4,7 @@
|
||||
ATM (i)Chip IA Linux Driver Source
|
||||
==================================
|
||||
|
||||
READ ME FISRT
|
||||
READ ME FIRST
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
|
@@ -577,7 +577,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment
|
||||
|
||||
* Linux driver development
|
||||
* continuous integration platform architect and GHDL updates
|
||||
* theses `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
|
||||
* thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
|
||||
|
||||
* Jiri Novak <jnovak@fel.cvut.cz>
|
||||
|
||||
@@ -603,7 +603,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment
|
||||
* Jan Charvat
|
||||
|
||||
* implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
|
||||
* Bachelor theses Model of CAN FD Communication Controller for QEMU Emulator
|
||||
* Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
|
||||
|
||||
Notes
|
||||
-----
|
||||
|
@@ -129,10 +129,10 @@
|
||||
</g>
|
||||
</g>
|
||||
<text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
|
||||
<text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccesfull</tspan></text>
|
||||
<text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
|
||||
<g font-size="12px" stroke-width="3.77953" text-anchor="middle">
|
||||
<text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
|
||||
<text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">succesfull</tspan></text>
|
||||
<text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
|
||||
<text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text>
|
||||
</g>
|
||||
<g stroke-width="3.77953" text-anchor="middle">
|
||||
|
@@ -254,7 +254,7 @@ Media selection
|
||||
A number of the older NICs such as the 3c590 and 3c900 series have
|
||||
10base2 and AUI interfaces.
|
||||
|
||||
Prior to January, 2001 this driver would autoeselect the 10base2 or AUI
|
||||
Prior to January, 2001 this driver would autoselect the 10base2 or AUI
|
||||
port if it didn't detect activity on the 10baseT port. It would then
|
||||
get stuck on the 10base2 port and a driver reload was necessary to
|
||||
switch back to 10baseT. This behaviour could not be prevented with a
|
||||
|
@@ -270,7 +270,7 @@ RX flow rules (ntuple filters)
|
||||
|
||||
ethtool -K ethX ntuple <on|off>
|
||||
|
||||
When disabling ntuple filters, all the user programed filters are
|
||||
When disabling ntuple filters, all the user programmed filters are
|
||||
flushed from the driver cache and hardware. All needed filters must
|
||||
be re-added when ntuple is re-enabled.
|
||||
|
||||
@@ -418,7 +418,7 @@ Default value: 0xFFFF
|
||||
0 Disable interrupt throttling.
|
||||
1 Enable interrupt throttling and use specified tx and rx rates.
|
||||
0xFFFF Auto throttling mode. Driver will choose the best RX and TX
|
||||
interrupt throtting settings based on link speed.
|
||||
interrupt throttling settings based on link speed.
|
||||
====== ==============================================================
|
||||
|
||||
aq_itr_tx - TX interrupt throttle rate
|
||||
@@ -456,7 +456,7 @@ AQ_CFG_RX_PAGEORDER
|
||||
|
||||
Default value: 0
|
||||
|
||||
RX page order override. Thats a power of 2 number of RX pages allocated for
|
||||
RX page order override. That's a power of 2 number of RX pages allocated for
|
||||
each descriptor. Received descriptor size is still limited by
|
||||
AQ_CFG_RX_FRAME_MAX.
|
||||
|
||||
|
@@ -11,7 +11,7 @@ Overview
|
||||
--------
|
||||
|
||||
The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
|
||||
drivers (dpaa2-eth, dpaa2-ethsw) interract with the PHY library.
|
||||
drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
|
||||
|
||||
DPAA2 Software Architecture
|
||||
---------------------------
|
||||
|
@@ -39,7 +39,7 @@ Contents:
|
||||
intel/ice
|
||||
marvell/octeontx2
|
||||
marvell/octeon_ep
|
||||
mellanox/mlx5
|
||||
mellanox/mlx5/index
|
||||
microsoft/netvsc
|
||||
neterion/s2io
|
||||
netronome/nfp
|
||||
|
@@ -901,15 +901,17 @@ To enable/disable UDP Segmentation Offload, issue the following command::
|
||||
|
||||
# ethtool -K <ethX> tx-udp-segmentation [off|on]
|
||||
|
||||
|
||||
GNSS module
|
||||
-----------
|
||||
Allows user to read messages from the GNSS module and write supported commands.
|
||||
If the module is physically present, driver creates 2 TTYs for each supported
|
||||
device in /dev, ttyGNSS_<device>:<function>_0 and _1. First one (_0) is RW and
|
||||
the second one is RO.
|
||||
The protocol of write commands is dependent on the GNSS module as the driver
|
||||
writes raw bytes from the TTY to the GNSS i2c. Please refer to the module
|
||||
documentation for details.
|
||||
Requires kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m.
|
||||
Allows user to read messages from the GNSS hardware module and write supported
|
||||
commands. If the module is physically present, a GNSS device is spawned:
|
||||
``/dev/gnss<id>``.
|
||||
The protocol of write command is dependent on the GNSS hardware module as the
|
||||
driver writes raw bytes by the GNSS object to the receiver through i2c. Please
|
||||
refer to the hardware GNSS module documentation for configuration details.
|
||||
|
||||
|
||||
Performance Optimization
|
||||
========================
|
||||
|
@@ -127,7 +127,7 @@ Type1:
|
||||
Type2:
|
||||
- RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels.
|
||||
- A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of
|
||||
VF0 will be received by VF1 and viceversa.
|
||||
VF0 will be received by VF1 and vice versa.
|
||||
- These VFs can be used by applications or virtual machines to communicate between them
|
||||
without sending traffic outside. There is no switch present in HW, hence the support
|
||||
for loopback VFs.
|
||||
|
@@ -1,746 +0,0 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
|
||||
=================================================
|
||||
Mellanox ConnectX(R) mlx5 core VPI Network Driver
|
||||
=================================================
|
||||
|
||||
Copyright (c) 2019, Mellanox Technologies LTD.
|
||||
|
||||
Contents
|
||||
========
|
||||
|
||||
- `Enabling the driver and kconfig options`_
|
||||
- `Devlink info`_
|
||||
- `Devlink parameters`_
|
||||
- `Bridge offload`_
|
||||
- `mlx5 subfunction`_
|
||||
- `mlx5 function attributes`_
|
||||
- `Devlink health reporters`_
|
||||
- `mlx5 tracepoints`_
|
||||
|
||||
Enabling the driver and kconfig options
|
||||
=======================================
|
||||
|
||||
| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
|
||||
| at build time via kernel Kconfig flags.
|
||||
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
|
||||
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
|
||||
| For the list of advanced features, please see below.
|
||||
|
||||
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
|
||||
|
||||
| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
|
||||
| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_EN=(y/n)**
|
||||
|
||||
| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
|
||||
| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
|
||||
| built-in into mlx5_core.ko.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_ARFS=(y/n)**
|
||||
|
||||
| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
|
||||
| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_RXNFC=(y/n)**
|
||||
|
||||
| Enables ethtool receive network flow classification, which allows user defined
|
||||
| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
|
||||
|
||||
| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
|
||||
|
||||
|
||||
**CONFIG_MLX5_MPFS=(y/n)**
|
||||
|
||||
| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
|
||||
| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
|
||||
| user configured unicast MAC addresses to the requesting PF.
|
||||
|
||||
|
||||
**CONFIG_MLX5_ESWITCH=(y/n)**
|
||||
|
||||
| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
|
||||
| and switching for the enabled VFs and PF in two available modes:
|
||||
| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
|
||||
| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_IPOIB=(y/n)**
|
||||
|
||||
| IPoIB offloads & acceleration support.
|
||||
| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
|
||||
| IPoIB ulp netdevice.
|
||||
|
||||
|
||||
**CONFIG_MLX5_FPGA=(y/n)**
|
||||
|
||||
| Build support for the Innova family of network cards by Mellanox Technologies.
|
||||
| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
|
||||
| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
|
||||
| building sandbox-specific client drivers.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_IPSEC=(y/n)**
|
||||
|
||||
| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
|
||||
|
||||
**CONFIG_MLX5_EN_TLS=(y/n)**
|
||||
|
||||
| TLS cryptography-offload acceleration.
|
||||
|
||||
|
||||
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
|
||||
|
||||
| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
|
||||
|
||||
**CONFIG_MLX5_SF=(y/n)**
|
||||
|
||||
| Build support for subfunction.
|
||||
| Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
|
||||
| will enable support for creating subfunction devices.
|
||||
|
||||
**External options** ( Choose if the corresponding mlx5 feature is required )
|
||||
|
||||
- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
|
||||
- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
|
||||
- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
|
||||
|
||||
Devlink info
|
||||
============
|
||||
|
||||
The devlink info reports the running and stored firmware versions on device.
|
||||
It also prints the device PSID which represents the HCA board type ID.
|
||||
|
||||
User command example::
|
||||
|
||||
$ devlink dev info pci/0000:00:06.0
|
||||
pci/0000:00:06.0:
|
||||
driver mlx5_core
|
||||
versions:
|
||||
fixed:
|
||||
fw.psid MT_0000000009
|
||||
running:
|
||||
fw.version 16.26.0100
|
||||
stored:
|
||||
fw.version 16.26.0100
|
||||
|
||||
Devlink parameters
|
||||
==================
|
||||
|
||||
flow_steering_mode: Device flow steering mode
|
||||
---------------------------------------------
|
||||
The flow steering mode parameter controls the flow steering mode of the driver.
|
||||
Two modes are supported:
|
||||
1. 'dmfs' - Device managed flow steering.
|
||||
2. 'smfs' - Software/Driver managed flow steering.
|
||||
|
||||
In DMFS mode, the HW steering entities are created and managed through the
|
||||
Firmware.
|
||||
In SMFS mode, the HW steering entities are created and managed though by
|
||||
the driver directly into hardware without firmware intervention.
|
||||
|
||||
SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Set SMFS flow steering mode::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
|
||||
|
||||
- Read device flow steering mode::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
|
||||
pci/0000:06:00.0:
|
||||
name flow_steering_mode type driver-specific
|
||||
values:
|
||||
cmode runtime value smfs
|
||||
|
||||
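The ``devlink dev param show`` output has the same shape for every parameter in this section, so one small parser covers them all. The sketch below is one possible way to read it. The function name and the dictionary layout are choices made for this example, and the sample text is the output shown above.

```python
# Sample text copied from the flow_steering_mode example above.
SAMPLE_PARAM = """\
pci/0000:06:00.0:
  name flow_steering_mode type driver-specific
    values:
      cmode runtime value smfs
"""


def parse_param_show(text):
    """Parse 'devlink dev param show' text into a dict for one parameter."""
    result = {"values": {}}
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if len(tokens) == 1 and tokens[0].endswith(":") and tokens[0] != "values:":
            # Device handle line, e.g. "pci/0000:06:00.0:"
            result["device"] = tokens[0][:-1]
        elif tokens[0] == "name":
            # e.g. "name flow_steering_mode type driver-specific"
            pairs = dict(zip(tokens[0::2], tokens[1::2]))
            result["name"] = pairs["name"]
            result["type"] = pairs.get("type")
        elif tokens[0] == "cmode":
            # e.g. "cmode runtime value smfs"
            pairs = dict(zip(tokens[0::2], tokens[1::2]))
            result["values"][pairs["cmode"]] = pairs.get("value")
    return result


param = parse_param_show(SAMPLE_PARAM)
print(param["name"], param["values"])
```

Values are kept as strings, keyed by cmode, so boolean parameters such as ``enable_roce`` would show up as ``"true"`` or ``"false"``.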
enable_roce: RoCE enablement state
|
||||
----------------------------------
|
||||
RoCE enablement state controls driver support for RoCE traffic.
|
||||
When RoCE is disabled, there is no GID table, only raw Ethernet QPs are supported, and traffic on the well-known UDP RoCE port is handled as raw Ethernet traffic.
|
||||
|
||||
To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Disable RoCE::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
|
||||
$ devlink dev reload pci/0000:06:00.0
|
||||
|
||||
- Read RoCE enablement state::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name enable_roce
|
||||
pci/0000:06:00.0:
|
||||
name enable_roce type generic
|
||||
values:
|
||||
cmode driverinit value true
|
||||
|
||||
esw_port_metadata: Eswitch port metadata state
|
||||
----------------------------------------------
|
||||
When applicable, disabling eswitch metadata can increase packet rate
|
||||
up to 20% depending on the use case and packet sizes.
|
||||
|
||||
Eswitch port metadata state controls whether to internally tag packets with
|
||||
metadata. Metadata tagging must be enabled for multi-port RoCE, failover
|
||||
between representors and stacked devices.
|
||||
By default metadata is enabled on the supported devices in E-switch.
|
||||
Metadata is applicable only for E-switch in switchdev mode and
|
||||
users may disable it when NONE of the below use cases will be in use:
|
||||
1. HCA is in Dual/multi-port RoCE mode.
|
||||
2. VF/SF representor bonding (Usually used for Live migration)
|
||||
3. Stacked devices
|
||||
|
||||
When metadata is disabled, the above use cases will fail to initialize if
|
||||
users try to enable them.
|
||||
|
||||
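The rule above (metadata may be disabled only when none of the three use cases is in use) can be written as a small predicate. This is a sketch of the decision only. The class and field names are hypothetical and nothing here queries the device.

```python
from dataclasses import dataclass


@dataclass
class EswitchUseCases:
    """Which of the three metadata-dependent use cases are planned."""
    multiport_roce: bool = False       # HCA in dual/multi-port RoCE mode
    representor_bonding: bool = False  # VF/SF representor bonding
    stacked_devices: bool = False      # stacked devices


def may_disable_port_metadata(use_cases):
    """Metadata may be disabled only when none of the listed use cases apply."""
    return not (
        use_cases.multiport_roce
        or use_cases.representor_bonding
        or use_cases.stacked_devices
    )


print(may_disable_port_metadata(EswitchUseCases()))
print(may_disable_port_metadata(EswitchUseCases(representor_bonding=True)))
```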
- Show eswitch port metadata::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
|
||||
pci/0000:06:00.0:
|
||||
name esw_port_metadata type driver-specific
|
||||
values:
|
||||
cmode runtime value true
|
||||
|
||||
- Disable eswitch port metadata::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
|
||||
|
||||
- Change eswitch mode to switchdev mode after choosing the metadata value::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
Bridge offload
|
||||
==============
|
||||
The mlx5 driver implements support for offloading bridge rules when in switchdev
|
||||
mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
|
||||
representor is attached to bridge.
|
||||
|
||||
- Change device to switchdev mode::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
|
||||
|
||||
$ ip link set enp8s0f0 master bridge1
|
||||
|
||||
VLANs
|
||||
-----
|
||||
The following bridge VLAN functions are supported by mlx5:
|
||||
|
||||
- VLAN filtering (including multiple VLANs per port)::
|
||||
|
||||
$ ip link set bridge1 type bridge vlan_filtering 1
|
||||
$ bridge vlan add dev enp8s0f0 vid 2-3
|
||||
|
||||
- VLAN push on bridge ingress::
|
||||
|
||||
$ bridge vlan add dev enp8s0f0 vid 3 pvid
|
||||
|
||||
- VLAN pop on bridge egress::
|
||||
|
||||
$ bridge vlan add dev enp8s0f0 vid 3 untagged
|
||||
|
||||
mlx5 subfunction
|
||||
================
|
||||
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
|
||||
|
||||
A subfunction has its own function capabilities and its own resources. This
|
||||
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
|
||||
queues are neither shared nor stolen from the parent PCI function.
|
||||
|
||||
When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
|
||||
resources neither shared nor stolen from the parent PCI function.
|
||||
|
||||
A subfunction has a dedicated window in PCI BAR space that is not shared
|
||||
with the other subfunctions or the parent PCI function. This ensures that all
|
||||
devices (netdev, rdma, vdpa, etc.) of the subfunction access only the assigned
|
||||
PCI BAR space.
|
||||
|
||||
A subfunction supports eswitch representation through which it supports tc
|
||||
offloads. The user configures eswitch to send/receive packets from/to
|
||||
the subfunction port.
|
||||
|
||||
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
|
||||
other subfunctions and/or with their parent PCI function.
|
||||
|
||||
Example mlx5 software, system, and device view::
|
||||
|
||||
       _______
      | admin |
      | user  |----------
      |_______|         |
          |             |
      ____|____       __|______            _________________
     |         |     |         |          |                 |
     | devlink |     | tc tool |          |    user         |
     | tool    |     |_________|          |  applications   |
     |_________|         |                |_________________|
           |             |                   |          |
           |             |                   |          |         Userspace
 +---------|-------------|-------------------|----------|--------------------+
           |             |           +----------+   +----------+   Kernel
           |             |           |  netdev  |   | rdma dev |
           |             |           +----------+   +----------+
   (devlink port add/del |              ^               ^
    port function set)   |              |               |
           |             |              +---------------|
      _____|___          |              |        _______|_______
     |         |         |              |       | mlx5 class    |
     | devlink |   +------------+       |       |   drivers     |
     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
     |_________|   +------------+       |       |_______________|
           |             |              |               ^
   (devlink ops)         |              |          (probe/remove)
  _________|________     |              |           ____|________
 | subfunction      |    |     +---------------+   | subfunction |
 | management driver|-----     | subfunction   |---|  driver     |
 | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
 |__________________|          +---------------+   |_____________|
           |                            ^
  (sf add/del, vhca events)             |
           |                      (device add/del)
      _____|____                    ____|________
     |          |                  | subfunction |
     |  PCI NIC |--- activate/deactivate events--->| host driver |
     |__________|                  | (mlx5_core) |
                                   |_____________|
|
||||
|
||||
A subfunction is created using the devlink port interface.
|
||||
|
||||
- Change device to switchdev mode::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
- Add a devlink port of subfunction flavour::
|
||||
|
||||
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
|
||||
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:00:00 state inactive opstate detached
|
||||
|
||||
- Show a devlink port of the subfunction::
|
||||
|
||||
$ devlink port show pci/0000:06:00.0/32768
|
||||
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
|
||||
function:
|
||||
hw_addr 00:00:00:00:00:00 state inactive opstate detached
|
||||
|
||||
- Delete a devlink port of subfunction after use::
|
||||
|
||||
$ devlink port del pci/0000:06:00.0/32768
|
||||
|
||||
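The switchdev, add, show and delete steps above can be collected into one helper when scripting subfunction setup. The sketch below only builds the command lines shown in this section and does not run them. The function name is hypothetical. The port index is taken as a parameter here, whereas in practice it would be read from the output of ``devlink port add``.

```python
import shlex


def sf_port_commands(pci_dev, pfnum, sfnum, port_index):
    """Build the argv lists for the switchdev, add, show and delete steps."""
    port = f"{pci_dev}/{port_index}"
    return [
        ["devlink", "dev", "eswitch", "set", pci_dev, "mode", "switchdev"],
        ["devlink", "port", "add", pci_dev, "flavour", "pcisf",
         "pfnum", str(pfnum), "sfnum", str(sfnum)],
        ["devlink", "port", "show", port],
        ["devlink", "port", "del", port],
    ]


# Values taken from the examples above.
for argv in sf_port_commands("pci/0000:06:00.0", 0, 88, 32768):
    print(shlex.join(argv))
```

Keeping the commands as argv lists avoids shell quoting issues if they are later passed to ``subprocess.run``.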
mlx5 function attributes
|
||||
========================
|
||||
The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in
|
||||
a unified way for SmartNIC and non-SmartNIC.
|
||||
|
||||
This is supported only when the eswitch mode is set to switchdev. Port function
|
||||
configuration of the PCI VF/SF is supported through devlink eswitch port.
|
||||
|
||||
Port function attributes should be set before PCI VF/SF is enumerated by the
|
||||
driver.
|
||||
|
||||
MAC address setup
|
||||
-----------------
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up the MAC
|
||||
address. (refer to Documentation/networking/devlink/devlink-port.rst)
|
||||
|
||||
RoCE capability setup
|
||||
---------------------
|
||||
Not all mlx5 PCI devices/SFs require RoCE capability.
|
||||
|
||||
Disabling RoCE capability saves 1 Mbyte of system memory per
|
||||
PCI device/SF.
|
||||
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up the RoCE
|
||||
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
|
||||
|
||||
migratable capability setup
|
||||
---------------------------
|
||||
Users who want mlx5 PCI VFs to be able to perform live migration need to
|
||||
explicitly enable the VF migratable capability.
|
||||
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up the migratable
|
||||
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
|
||||
|
||||
SF state setup
|
||||
--------------
|
||||
To use the SF, the user must activate the SF using the SF function state
|
||||
attribute.
|
||||
|
||||
- Get the state of the SF identified by its unique devlink port index::
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state inactive opstate detached
|
||||
|
||||
- Activate the function and verify its state is active::
|
||||
|
||||
$ devlink port function set ens2f0npf0sf88 state active
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state active opstate detached
|
||||
|
||||
Upon function activation, the PF driver instance gets the event from the device
|
||||
that a particular SF was activated. It's the cue to put the device on bus, probe
|
||||
it and instantiate the devlink instance and class specific auxiliary devices
|
||||
for it.
|
||||
|
||||
- Show the auxiliary device and port of the subfunction::
|
||||
|
||||
$ devlink dev show
|
||||
devlink dev show auxiliary/mlx5_core.sf.4
|
||||
|
||||
$ devlink port show auxiliary/mlx5_core.sf.4/1
|
||||
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
|
||||
|
||||
$ rdma link show mlx5_0/1
|
||||
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
|
||||
|
||||
$ rdma dev show
|
||||
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
|
||||
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
|
||||
|
||||
- Subfunction auxiliary device and class device hierarchy::
|
||||
|
||||
             mlx5_core.sf.4
      (subfunction auxiliary device)
                   /\
                  /  \
                 /    \
                /      \
               /        \
  mlx5_core.eth.4     mlx5_core.rdma.4
 (sf eth aux dev)     (sf rdma aux dev)
      |                      |
      |                      |
   p0sf88                  mlx5_0
 (sf netdev)          (sf rdma device)
|
||||
|
||||
Additionally, the SF port also gets the event when the driver attaches to the
|
||||
auxiliary device of the subfunction. This results in changing the operational
|
||||
state of the function. This provides visibility to the user to decide when it is
|
||||
safe to delete the SF port for graceful termination of the subfunction.
|
||||
|
||||
- Show the SF port operational state::
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state active opstate attached
|
||||
|
||||
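A script that manages SF ports can read the ``state`` and ``opstate`` fields from the output above. The sketch below extracts them from text shaped like the example. The helper names are hypothetical, and treating ``opstate detached`` as the signal that no driver is attached any more is an assumption of this example.

```python
# Sample text copied from the operational state example above.
SAMPLE_PORT = """\
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached
"""


def parse_port_function(text):
    """Extract hw_addr, state and opstate from 'devlink port show' text."""
    tokens = text.split()
    function = {}
    for key in ("hw_addr", "state", "opstate"):
        if key in tokens:
            index = tokens.index(key)
            if index + 1 < len(tokens):
                function[key] = tokens[index + 1]
    return function


def driver_detached(function):
    """True once opstate reports that no driver is attached to the SF."""
    return function.get("opstate") == "detached"


function = parse_port_function(SAMPLE_PORT)
print(function, driver_detached(function))
```

Tokens are matched exactly, so the ``state`` lookup does not pick up the ``opstate`` field.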
Devlink health reporters
|
||||
========================
|
||||
|
||||
tx reporter
|
||||
-----------
|
||||
The tx reporter is responsible for reporting and recovering from the following two error scenarios:
|
||||
|
||||
- tx timeout
|
||||
Report on kernel tx timeout detection.
|
||||
Recover by searching lost interrupts.
|
||||
- tx error completion
|
||||
Report on error tx completion.
|
||||
Recover by flushing the tx queue and resetting it.
|
||||
|
||||
The tx reporter also supports an on-demand diagnose callback, through which it provides
|
||||
real-time information about its send queues' status.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Diagnose send queues status::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter tx
|
||||
|
||||
NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
|
||||
|
||||
- Show number of tx errors indicated, number of recover flows ended successfully,
|
||||
whether autorecover is enabled, and the grace period since the last recover::
|
||||
|
||||
$ devlink health show pci/0000:82:00.0 reporter tx
|
||||
|
||||
rx reporter
|
||||
-----------
|
||||
The rx reporter is responsible for reporting and recovering from the following two error scenarios:
|
||||
|
||||
- rx queues' initialization (population) timeout
|
||||
Population of rx queues' descriptors on ring initialization is done
|
||||
in napi context via triggering an irq. In case of a failure to get
|
||||
the minimum amount of descriptors, a timeout would occur, and
|
||||
descriptors could be recovered by polling the EQ (Event Queue).
|
||||
- rx completions with errors (reported by HW on interrupt context)
|
||||
Report on rx completion error.
|
||||
Recover (if needed) by flushing the related queue and resetting it.
|
||||
|
||||
rx reporter also supports on demand diagnose callback, on which it
|
||||
provides real time information of its receive queues' status.
|
||||
|
||||
- Diagnose rx queues' status and corresponding completion queue::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter rx
|
||||
|
||||
NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
|
||||
|
||||
- Show number of rx errors indicated, number of recover flows ended successfully,
|
||||
whether autorecover is enabled, and the grace period since the last recover::
|
||||
|
||||
$ devlink health show pci/0000:82:00.0 reporter rx
|
||||
|
||||
fw reporter
|
||||
-----------
|
||||
The fw reporter implements `diagnose` and `dump` callbacks.
|
||||
It follows symptoms of fw error such as fw syndrome by triggering
|
||||
fw core dump and storing it into the dump buffer.
|
||||
The fw reporter diagnose command can be triggered any time by the user to check
|
||||
current fw status.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Check fw health status::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter fw
|
||||
|
||||
- Read FW core dump if already stored or trigger new one::
|
||||
|
||||
$ devlink health dump show pci/0000:82:00.0 reporter fw
|
||||
|
||||
NOTE: This command can run only on the PF which has fw tracer ownership,
|
||||
running it on other PF or any VF will return "Operation not permitted".
|
||||
|
||||
fw fatal reporter
|
||||
-----------------
|
||||
The fw fatal reporter implements `dump` and `recover` callbacks.
|
||||
It follows fatal error indications with a CR-space dump and a recover flow.
|
||||
The CR-space dump uses vsc interface which is valid even if the FW command
|
||||
interface is not functional, which is the case in most FW fatal errors.
|
||||
The recover function runs recover flow which reloads the driver and triggers fw
|
||||
reset if needed.
|
||||
On firmware error, the health buffer is dumped into the dmesg. The log
|
||||
level is derived from the error's severity (given in health buffer).
|
||||
|
||||
User command examples:
|
||||
|
||||
- Run fw recover flow manually::
|
||||
|
||||
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
|
||||
|
||||
- Read FW CR-space dump if already stored or trigger new one::
|
||||
|
||||
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
|
||||
|
||||
NOTE: This command can run only on PF.
|
||||
|
||||
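The four reporters described above differ in which callbacks they offer. The sketch below records that as a table and builds the matching ``devlink health`` command line, rejecting combinations this section does not mention. The table reflects only what is written here (the driver may implement more), and the function name is hypothetical.

```python
# Callbacks per reporter, as described in the sections above.
REPORTER_CALLBACKS = {
    "tx": {"diagnose", "recover"},
    "rx": {"diagnose", "recover"},
    "fw": {"diagnose", "dump"},
    "fw_fatal": {"dump", "recover"},
}


def health_command(device, reporter, action):
    """Return the devlink health argv for an action, or raise if unsupported."""
    if action not in REPORTER_CALLBACKS.get(reporter, set()):
        raise ValueError(f"reporter {reporter!r} has no {action!r} callback")
    # Dumps are read with "dump show", the other actions are single verbs.
    verb = ["dump", "show"] if action == "dump" else [action]
    return ["devlink", "health", *verb, device, "reporter", reporter]


print(" ".join(health_command("pci/0000:82:00.0", "fw", "dump")))
print(" ".join(health_command("pci/0000:82:00.0", "fw_fatal", "recover")))
```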
mlx5 tracepoints
|
||||
================
|
||||
|
||||
mlx5 driver provides internal tracepoints for tracking and debugging using
|
||||
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
|
||||
|
||||
For the list of supported mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
|
||||
|
||||
tc and eswitch offloads tracepoints:
|
||||
|
||||
- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
|
||||
|
||||
- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
|
||||
|
||||
- mlx5e_stats_flower: trace flower stats request::
|
||||
|
||||
$ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
|
||||
|
||||
- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
|
||||
|
||||
- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
|
||||
|
||||
$ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
|
||||
|
||||
Bridge offloads tracepoints:
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
|
||||
mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
|
||||
|
||||
- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
|
||||
representor::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vlan_create >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
|
||||
|
||||
- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
|
||||
representor::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
|
||||
|
||||
- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
|
||||
device::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vport_init >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
|
||||
|
||||
- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
|
||||
device::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vport_cleanup >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
|
||||
|
||||
Eswitch QoS tracepoints:
|
||||
|
||||
- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
|
||||
|
||||
- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
|
||||
|
||||
- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
|
||||
|
||||
- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
|
||||
|
||||
- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
|
||||
|
||||
- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
|
||||
|
||||
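The trace lines shown in this section share a common layout: task and PID, CPU, flags, timestamp, event name, then ``key=value`` fields. The sketch below is one way to split such a line for scripted analysis. The regular expressions and the function name are assumptions of this example, checked only against the sample lines shown here.

```python
import re

TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<timestamp>[\d.]+):\s+(?P<event>\w+):\s*(?P<payload>.*)$"
)
# key=value pairs; a trailing comma after the value is not part of it.
FIELD = re.compile(r"(\w+)=([^\s,]+)")

# Sample line copied from the mlx5_esw_vport_qos_config example above.
SAMPLE_TRACE = (
    "<...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: "
    "(0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 "
    "group=000000007b576bb3"
)


def parse_trace_line(line):
    """Split one trace line into header parts and a dict of key=value fields."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["fields"] = dict(FIELD.findall(record.pop("payload")))
    return record


record = parse_trace_line(SAMPLE_TRACE)
print(record["event"], record["fields"])
```

All values are kept as strings. Lines that do not match the layout, such as the ``...`` placeholder in the examples, return ``None``.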
SF tracepoints:
|
||||
|
||||
- mlx5_sf_add: trace addition of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_free: trace freeing of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
|
||||
|
||||
- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
|
||||
|
||||
- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
|
||||
|
||||
- mlx5_sf_vhca_event: trace SF vhca event and state::
|
||||
|
||||
$ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
|
||||
|
||||
- mlx5_sf_dev_add: trace SF device add event::
|
||||
|
||||
$ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_dev_del: trace SF device delete event::
|
||||
|
||||
$ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
|
@ -0,0 +1,224 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
=======
|
||||
Devlink
|
||||
=======
|
||||
|
||||
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
|
||||
Contents
|
||||
========
|
||||
|
||||
- `Info`_
|
||||
- `Parameters`_
|
||||
- `Health reporters`_
|
||||
|
||||
Info
|
||||
====
|
||||
|
||||
The devlink info reports the running and stored firmware versions on the device.
|
||||
It also prints the device PSID which represents the HCA board type ID.
|
||||
|
||||
User command example::
|
||||
|
||||
$ devlink dev info pci/0000:00:06.0
|
||||
pci/0000:00:06.0:
|
||||
driver mlx5_core
|
||||
versions:
|
||||
fixed:
|
||||
fw.psid MT_0000000009
|
||||
running:
|
||||
fw.version 16.26.0100
|
||||
stored:
|
||||
fw.version 16.26.0100
|
||||
|
||||
Parameters
|
||||
==========
|
||||
|
||||
flow_steering_mode: Device flow steering mode
|
||||
---------------------------------------------
|
||||
The flow steering mode parameter controls the flow steering mode of the driver.
|
||||
Two modes are supported:
|
||||
1. 'dmfs' - Device managed flow steering.
|
||||
2. 'smfs' - Software/Driver managed flow steering.
|
||||
|
||||
In DMFS mode, the HW steering entities are created and managed through the
|
||||
Firmware.
|
||||
In SMFS mode, the HW steering entities are created and managed by
|
||||
the driver directly into hardware without firmware intervention.
|
||||
|
||||
SMFS mode is faster and provides a better rule insertion rate than the default DMFS mode.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Set SMFS flow steering mode::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
|
||||
|
||||
- Read device flow steering mode::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
|
||||
pci/0000:06:00.0:
|
||||
name flow_steering_mode type driver-specific
|
||||
values:
|
||||
cmode runtime value smfs
|
||||
|
||||
enable_roce: RoCE enablement state
|
||||
----------------------------------
|
||||
If the device supports RoCE disablement, RoCE enablement state controls device
|
||||
support for RoCE capability. Otherwise, the control occurs in the driver stack.
|
||||
When RoCE is disabled at the driver level, only raw ethernet QPs are supported.
|
||||
|
||||
To change RoCE enablement state, a user must change the driverinit cmode value
|
||||
and run devlink reload.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Disable RoCE::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
|
||||
$ devlink dev reload pci/0000:06:00.0
|
||||
|
||||
- Read RoCE enablement state::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name enable_roce
|
||||
pci/0000:06:00.0:
|
||||
name enable_roce type generic
|
||||
values:
|
||||
cmode driverinit value true
|
||||
|
||||
esw_port_metadata: Eswitch port metadata state
|
||||
----------------------------------------------
|
||||
When applicable, disabling eswitch metadata can increase packet rate
|
||||
up to 20% depending on the use case and packet sizes.
|
||||
|
||||
Eswitch port metadata state controls whether to internally tag packets with
|
||||
metadata. Metadata tagging must be enabled for multi-port RoCE, failover
|
||||
between representors and stacked devices.
|
||||
By default metadata is enabled on the supported devices in E-switch.
|
||||
Metadata is applicable only for E-switch in switchdev mode and
|
||||
users may disable it when NONE of the below use cases will be in use:
|
||||
1. HCA is in Dual/multi-port RoCE mode.
|
||||
2. VF/SF representor bonding (Usually used for Live migration)
|
||||
3. Stacked devices
|
||||
|
||||
When metadata is disabled, the above use cases will fail to initialize if
|
||||
users try to enable them.
|
||||
|
||||
- Show eswitch port metadata::
|
||||
|
||||
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
|
||||
pci/0000:06:00.0:
|
||||
name esw_port_metadata type driver-specific
|
||||
values:
|
||||
cmode runtime value true
|
||||
|
||||
- Disable eswitch port metadata::
|
||||
|
||||
$ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
|
||||
|
||||
- Change eswitch mode to switchdev mode after choosing the metadata value::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
Health reporters
|
||||
================
|
||||
|
||||
tx reporter
|
||||
-----------
|
||||
The tx reporter is responsible for reporting and recovering from the following two error scenarios:
|
||||
|
||||
- tx timeout
|
||||
Report on kernel tx timeout detection.
|
||||
Recover by searching lost interrupts.
|
||||
- tx error completion
|
||||
Report on error tx completion.
|
||||
Recover by flushing the tx queue and resetting it.
|
||||
|
||||
The tx reporter also supports an on-demand diagnose callback, through which it provides
|
||||
real-time information about its send queues' status.
|
||||
|
||||
User command examples:
|
||||
|
||||
- Diagnose send queues status::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter tx
|
||||
|
||||
NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
|
||||
|
||||
- Show number of tx errors indicated, number of recover flows ended successfully,
|
||||
whether autorecover is enabled, and the grace period since the last recover::
|
||||
|
||||
$ devlink health show pci/0000:82:00.0 reporter tx
|
||||
|
||||
rx reporter
|
||||
-----------
|
||||
The rx reporter is responsible for reporting and recovering from the following two error scenarios:
|
||||
|
||||
- rx queues' initialization (population) timeout
|
||||
Population of rx queues' descriptors on ring initialization is done
|
||||
in napi context via triggering an irq. In case of a failure to get
|
||||
the minimum amount of descriptors, a timeout would occur, and
|
||||
descriptors could be recovered by polling the EQ (Event Queue).
|
||||
- rx completions with errors (reported by HW on interrupt context)
|
||||
Report on rx completion error.
|
||||
Recover (if needed) by flushing the related queue and resetting it.
|
||||
|
||||
The rx reporter also supports an on-demand diagnose callback, which
|
||||
provides real-time information about the status of its receive queues.
|
||||
|
||||
- Diagnose rx queues' status and corresponding completion queue::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter rx
|
||||
|
||||
NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
|
||||
|
||||
- Show the number of rx errors indicated, the number of recover flows that ended successfully,
|
||||
whether autorecover is enabled, and the grace period since the last recover::
|
||||
|
||||
$ devlink health show pci/0000:82:00.0 reporter rx
|
||||
|
||||
fw reporter
|
||||
-----------
|
||||
The fw reporter implements `diagnose` and `dump` callbacks.
|
||||
It follows symptoms of fw error such as fw syndrome by triggering
|
||||
fw core dump and storing it into the dump buffer.
|
||||
The fw reporter diagnose command can be triggered any time by the user to check
|
||||
current fw status.
|
||||
|
||||
User commands examples:
|
||||
|
||||
- Check fw health status::
|
||||
|
||||
$ devlink health diagnose pci/0000:82:00.0 reporter fw
|
||||
|
||||
- Read FW core dump if already stored or trigger new one::
|
||||
|
||||
$ devlink health dump show pci/0000:82:00.0 reporter fw
|
||||
|
||||
NOTE: This command can run only on the PF which has fw tracer ownership;
|
||||
running it on another PF or on any VF will return "Operation not permitted".
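Because devlink health keeps only one stored dump per reporter, a stored dump usually has to be cleared before a later error can store a new one. The sketch below saves the fw dump to a file and then clears it. The helper name ``fw_dump_save`` and the output path are hypothetical; ``devlink health dump clear`` is the standard devlink subcommand, assumed here to be available in the installed iproute2.

```shell
# Hypothetical helper: save the stored (or freshly triggered) fw dump to a file,
# then clear it so the reporter can store a new dump on the next error.
fw_dump_save() {
    dev=$1    # devlink device handle of the PF, e.g. pci/0000:82:00.0
    out=$2    # destination file for the dump text
    devlink health dump show "$dev" reporter fw > "$out" &&
    devlink health dump clear "$dev" reporter fw
}
```

The dump is cleared only if reading it succeeded, so a failed read (for example on a PF without fw tracer ownership) leaves any stored dump in place.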
|
||||
|
||||
fw fatal reporter
|
||||
-----------------
|
||||
The fw fatal reporter implements `dump` and `recover` callbacks.
|
||||
It follows fatal error indications with a CR-space dump and a recover flow.
|
||||
The CR-space dump uses vsc interface which is valid even if the FW command
|
||||
interface is not functional, which is the case in most FW fatal errors.
|
||||
The recover function runs recover flow which reloads the driver and triggers fw
|
||||
reset if needed.
|
||||
On firmware error, the health buffer is dumped into the dmesg. The log
|
||||
level is derived from the error's severity (given in health buffer).
|
||||
|
||||
User commands examples:
|
||||
|
||||
- Run fw recover flow manually::
|
||||
|
||||
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
|
||||
|
||||
- Read FW CR-space dump if already stored or trigger new one::
|
||||
|
||||
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
|
||||
|
||||
NOTE: This command can run only on PF.
|
@ -0,0 +1,26 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
Mellanox ConnectX(R) mlx5 core VPI Network Driver
|
||||
=================================================
|
||||
|
||||
:Copyright: |copy| 2019, Mellanox Technologies LTD.
|
||||
:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
|
||||
Contents:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
kconfig
|
||||
devlink
|
||||
switchdev
|
||||
tracepoints
|
||||
counters
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
@ -0,0 +1,168 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
=======================================
|
||||
Enabling the driver and kconfig options
|
||||
=======================================
|
||||
|
||||
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
|
||||
| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
|
||||
| at build time via kernel Kconfig flags.
|
||||
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
|
||||
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
|
||||
| For the list of advanced features, please see below.
|
||||
|
||||
**CONFIG_MLX5_BRIDGE=(y/n)**
|
||||
|
||||
| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
|
||||
| This will provide the ability to add representors of mlx5 uplink and VF
|
||||
| ports to Bridge and offloading rules for traffic between such ports.
|
||||
| Supports VLANs (trunk and access modes).
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
|
||||
|
||||
| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
|
||||
| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_EN=(y/n)**
|
||||
|
||||
| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
|
||||
| mlx5e is the mlx5 ulp driver which provides the netdevice kernel interface. When chosen, mlx5e will be
|
||||
| built into mlx5_core.ko.
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_EN_DCB=(y/n)**
|
||||
|
||||
| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
|
||||
|
||||
|
||||
**CONFIG_MLX5_CORE_IPOIB=(y/n)**
|
||||
|
||||
| IPoIB offloads & acceleration support.
|
||||
| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
|
||||
| IPoIB ulp netdevice.
|
||||
|
||||
|
||||
**CONFIG_MLX5_CLS_ACT=(y/n)**
|
||||
|
||||
| Enables offload support for TC classifier action (NET_CLS_ACT).
|
||||
| Works in both native NIC mode and Switchdev SRIOV mode.
|
||||
| Flow-based classifiers, such as those registered through
|
||||
| `tc-flower(8)`, are processed by the device, rather than the
|
||||
| host. Actions that would then overwrite matching classification
|
||||
| results would then be instant due to the offload.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_ARFS=(y/n)**
|
||||
|
||||
| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
|
||||
| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_IPSEC=(y/n)**
|
||||
|
||||
| Enables `IPSec XFRM cryptography-offload acceleration <https://support.mellanox.com/s/article/ConnectX-6DX-Bluefield-2-IPsec-HW-Full-Offload-Configuration-Guide>`_.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_MACSEC=(y/n)**
|
||||
|
||||
| Build support for MACsec cryptography-offload acceleration in the NIC.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_RXNFC=(y/n)**
|
||||
|
||||
| Enables ethtool receive network flow classification, which allows user defined
|
||||
| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
|
||||
|
||||
|
||||
**CONFIG_MLX5_EN_TLS=(y/n)**
|
||||
|
||||
| TLS cryptography-offload acceleration.
|
||||
|
||||
|
||||
**CONFIG_MLX5_ESWITCH=(y/n)**
|
||||
|
||||
| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
|
||||
| and switching for the enabled VFs and PF in two available modes:
|
||||
| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
|
||||
| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
|
||||
|
||||
|
||||
**CONFIG_MLX5_FPGA=(y/n)**
|
||||
|
||||
| Build support for the Innova family of network cards by Mellanox Technologies.
|
||||
| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
|
||||
| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
|
||||
| building sandbox-specific client drivers.
|
||||
|
||||
|
||||
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
|
||||
|
||||
| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
|
||||
|
||||
|
||||
**CONFIG_MLX5_MPFS=(y/n)**
|
||||
|
||||
| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
|
||||
| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
|
||||
| user configured unicast MAC addresses to the requesting PF.
|
||||
|
||||
|
||||
**CONFIG_MLX5_SF=(y/n)**
|
||||
|
||||
| Build support for subfunction.
|
||||
| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
|
||||
| will enable support for creating subfunction devices.
|
||||
|
||||
|
||||
**CONFIG_MLX5_SF_MANAGER=(y/n)**
|
||||
|
||||
| Build support for subfunction port in the NIC. A Mellanox subfunction
|
||||
| port is managed through devlink. A subfunction supports RDMA, netdevice
|
||||
| and vdpa device. It is similar to a SRIOV VF but it doesn't require
|
||||
| SRIOV support.
|
||||
|
||||
|
||||
**CONFIG_MLX5_SW_STEERING=(y/n)**
|
||||
|
||||
| Build support for software-managed steering in the NIC.
|
||||
|
||||
|
||||
**CONFIG_MLX5_TC_CT=(y/n)**
|
||||
|
||||
| Support offloading connection tracking rules via tc ct action.
|
||||
|
||||
|
||||
**CONFIG_MLX5_TC_SAMPLE=(y/n)**
|
||||
|
||||
| Support offloading sample rules via tc sample action.
|
||||
|
||||
|
||||
**CONFIG_MLX5_VDPA=(y/n)**
|
||||
|
||||
| Support library for Mellanox VDPA drivers. Provides code that is
|
||||
| common for all types of VDPA drivers. The following drivers are planned:
|
||||
| net, block.
|
||||
|
||||
|
||||
**CONFIG_MLX5_VDPA_NET=(y/n)**
|
||||
|
||||
| VDPA network driver for ConnectX6 and newer. Provides offloading
|
||||
| of virtio net datapath such that descriptors put on the ring will
|
||||
| be executed by the hardware. It also supports a variety of stateless
|
||||
| offloads depending on the actual device used and firmware version.
|
||||
|
||||
|
||||
**CONFIG_MLX5_VFIO_PCI=(y/n)**
|
||||
|
||||
| This provides migration support for MLX5 devices using the VFIO framework.
|
||||
|
||||
|
||||
**External options** ( Choose if the corresponding mlx5 feature is required )
|
||||
|
||||
- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
|
||||
- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled.
|
||||
- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
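To check which of the options above are enabled in a given kernel configuration, the config file can be searched directly. This is a minimal sketch: the helper name ``kconfig_missing`` is hypothetical, and the location of the config file (for example ``/boot/config-$(uname -r)`` on many distributions, or ``.config`` in a build tree) is an assumption that varies between systems.

```shell
# Hypothetical helper: print every listed CONFIG_ symbol that is NOT set
# to y or m in the given kernel config file.
kconfig_missing() {
    cfg=$1; shift
    for sym in "$@"; do
        grep -Eq "^${sym}=(y|m)$" "$cfg" || echo "$sym"
    done
}
```

For example, ``kconfig_missing .config CONFIG_MLX5_CORE CONFIG_MLX5_CORE_EN CONFIG_MLX5_ESWITCH`` prints nothing when all three options are enabled.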
|
@ -0,0 +1,239 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
=========
|
||||
Switchdev
|
||||
=========
|
||||
|
||||
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
|
||||
.. _mlx5_bridge_offload:
|
||||
|
||||
Bridge offload
|
||||
==============
|
||||
|
||||
The mlx5 driver implements support for offloading bridge rules when in switchdev
|
||||
mode. Linux bridge FDBs are automatically offloaded when an mlx5 switchdev
|
||||
representor is attached to a bridge.
|
||||
|
||||
- Change device to switchdev mode::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
|
||||
|
||||
$ ip link set enp8s0f0 master bridge1
|
||||
|
||||
VLANs
|
||||
-----
|
||||
|
||||
The following bridge VLAN functions are supported by mlx5:
|
||||
|
||||
- VLAN filtering (including multiple VLANs per port)::
|
||||
|
||||
$ ip link set bridge1 type bridge vlan_filtering 1
|
||||
$ bridge vlan add dev enp8s0f0 vid 2-3
|
||||
|
||||
- VLAN push on bridge ingress::
|
||||
|
||||
$ bridge vlan add dev enp8s0f0 vid 3 pvid
|
||||
|
||||
- VLAN pop on bridge egress::
|
||||
|
||||
$ bridge vlan add dev enp8s0f0 vid 3 untagged
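The individual steps above can be combined into one sequence. The following is a sketch under stated assumptions: the helper name ``bridge_offload_setup`` is hypothetical, the bridge is assumed not to exist yet (hence the ``ip link add`` step, which is not shown in the examples above), and the representor name must match the one created on the system after the switch to switchdev mode.

```shell
# Hypothetical helper: switch the eswitch to switchdev mode, create a
# VLAN-filtering bridge, attach the representor and allow VLANs 2-3 on it.
bridge_offload_setup() {
    pci=$1   # devlink device handle, e.g. pci/0000:06:00.0
    rep=$2   # mlx5 switchdev representor netdev, e.g. enp8s0f0
    br=$3    # bridge netdev to create, e.g. bridge1
    devlink dev eswitch set "$pci" mode switchdev &&
    ip link add name "$br" type bridge &&
    ip link set "$br" type bridge vlan_filtering 1 &&
    ip link set "$rep" master "$br" &&
    bridge vlan add dev "$rep" vid 2-3
}
```

Each step runs only if the previous one succeeded, so a failure to enter switchdev mode leaves the bridge configuration untouched.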
|
||||
|
||||
Subfunction
|
||||
===========
|
||||
|
||||
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
|
||||
|
||||
A subfunction has its own function capabilities and its own resources. This
|
||||
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
|
||||
queues are neither shared nor stolen from the parent PCI function.
|
||||
|
||||
When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
|
||||
resources neither shared nor stolen from the parent PCI function.
|
||||
|
||||
A subfunction has a dedicated window in PCI BAR space that is not shared
|
||||
with the other subfunctions or the parent PCI function. This ensures that all
|
||||
devices (netdev, rdma, vdpa, etc.) of the subfunction access only the assigned
|
||||
PCI BAR space.
|
||||
|
||||
A subfunction supports eswitch representation through which it supports tc
|
||||
offloads. The user configures eswitch to send/receive packets from/to
|
||||
the subfunction port.
|
||||
|
||||
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
|
||||
other subfunctions and/or with their parent PCI function.
|
||||
|
||||
Example mlx5 software, system, and device view::
|
||||
|
||||
_______
|
||||
| admin |
|
||||
| user |----------
|
||||
|_______| |
|
||||
| |
|
||||
____|____ __|______ _________________
|
||||
| | | | | |
|
||||
| devlink | | tc tool | | user |
|
||||
| tool | |_________| | applications |
|
||||
|_________| | |_________________|
|
||||
| | | |
|
||||
| | | | Userspace
|
||||
+---------|-------------|-------------------|----------|--------------------+
|
||||
| | +----------+ +----------+ Kernel
|
||||
| | | netdev | | rdma dev |
|
||||
| | +----------+ +----------+
|
||||
(devlink port add/del | ^ ^
|
||||
port function set) | | |
|
||||
| | +---------------|
|
||||
_____|___ | | _______|_______
|
||||
| | | | | mlx5 class |
|
||||
| devlink | +------------+ | | drivers |
|
||||
| kernel | | rep netdev | | |(mlx5_core,ib) |
|
||||
|_________| +------------+ | |_______________|
|
||||
| | | ^
|
||||
(devlink ops) | | (probe/remove)
|
||||
_________|________ | | ____|________
|
||||
| subfunction | | +---------------+ | subfunction |
|
||||
| management driver|----- | subfunction |---| driver |
|
||||
| (mlx5_core) | | auxiliary dev | | (mlx5_core) |
|
||||
|__________________| +---------------+ |_____________|
|
||||
| ^
|
||||
(sf add/del, vhca events) |
|
||||
| (device add/del)
|
||||
_____|____ ____|________
|
||||
| | | subfunction |
|
||||
| PCI NIC |--- activate/deactivate events--->| host driver |
|
||||
|__________| | (mlx5_core) |
|
||||
|_____________|
|
||||
|
||||
A subfunction is created using the devlink port interface.
|
||||
|
||||
- Change device to switchdev mode::
|
||||
|
||||
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
|
||||
|
||||
- Add a devlink port of subfunction flavour::
|
||||
|
||||
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
|
||||
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:00:00 state inactive opstate detached
|
||||
|
||||
- Show a devlink port of the subfunction::
|
||||
|
||||
$ devlink port show pci/0000:06:00.0/32768
|
||||
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
|
||||
function:
|
||||
hw_addr 00:00:00:00:00:00 state inactive opstate detached
|
||||
|
||||
- Delete a devlink port of subfunction after use::
|
||||
|
||||
$ devlink port del pci/0000:06:00.0/32768
|
||||
|
||||
Function attributes
|
||||
===================
|
||||
|
||||
The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in
|
||||
a unified way for SmartNIC and non-SmartNIC.
|
||||
|
||||
This is supported only when the eswitch mode is set to switchdev. Port function
|
||||
configuration of the PCI VF/SF is supported through devlink eswitch port.
|
||||
|
||||
Port function attributes should be set before PCI VF/SF is enumerated by the
|
||||
driver.
|
||||
|
||||
MAC address setup
|
||||
-----------------
|
||||
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up the MAC
|
||||
address (refer to Documentation/networking/devlink/devlink-port.rst).
|
||||
|
||||
RoCE capability setup
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
Not all mlx5 PCI devices/SFs require RoCE capability.
|
||||
|
||||
Disabling RoCE capability saves 1 Mbyte of system memory per
|
||||
PCI device/SF.
|
||||
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up RoCE
|
||||
capability (refer to Documentation/networking/devlink/devlink-port.rst).
|
||||
|
||||
migratable capability setup
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Users who want mlx5 PCI VFs to be able to perform live migration need to
|
||||
explicitly enable the VF migratable capability.
|
||||
|
||||
The mlx5 driver supports the devlink port function attr mechanism to set up migratable
|
||||
capability (refer to Documentation/networking/devlink/devlink-port.rst).
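As an illustration of the port function attr mechanism, the sketch below sets a MAC address and disables RoCE on a port before the function is enumerated. The helper name ``port_function_prepare`` is hypothetical, and the ``hw_addr`` and ``roce`` keywords are those described in devlink-port.rst; support for ``roce`` is assumed to depend on the installed iproute2 and kernel versions.

```shell
# Hypothetical helper: configure function attributes on a VF/SF devlink port,
# then show the port so the result can be checked.
port_function_prepare() {
    port=$1   # devlink port handle, e.g. pci/0000:06:00.0/32768
    mac=$2    # MAC address to assign, e.g. 00:00:00:00:88:88
    devlink port function set "$port" hw_addr "$mac" &&
    devlink port function set "$port" roce disable &&
    devlink port show "$port"
}
```

For a VF that should support live migration, the analogous command would use the ``migratable enable`` attribute instead of ``roce disable``.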
|
||||
|
||||
SF state setup
|
||||
--------------
|
||||
|
||||
To use the SF, the user must activate the SF using the SF function state
|
||||
attribute.
|
||||
|
||||
- Get the state of the SF identified by its unique devlink port index::
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state inactive opstate detached
|
||||
|
||||
- Activate the function and verify its state is active::
|
||||
|
||||
$ devlink port function set ens2f0npf0sf88 state active
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state active opstate detached
|
||||
|
||||
Upon function activation, the PF driver instance gets the event from the device
|
||||
that a particular SF was activated. It's the cue to put the device on bus, probe
|
||||
it and instantiate the devlink instance and class specific auxiliary devices
|
||||
for it.
|
||||
|
||||
- Show the auxiliary device and port of the subfunction::
|
||||
|
||||
$ devlink dev show
|
||||
devlink dev show auxiliary/mlx5_core.sf.4
|
||||
|
||||
$ devlink port show auxiliary/mlx5_core.sf.4/1
|
||||
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
|
||||
|
||||
$ rdma link show mlx5_0/1
|
||||
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
|
||||
|
||||
$ rdma dev show
|
||||
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
|
||||
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
|
||||
|
||||
- Subfunction auxiliary device and class device hierarchy::
|
||||
|
||||
mlx5_core.sf.4
|
||||
(subfunction auxiliary device)
|
||||
/\
|
||||
/ \
|
||||
/ \
|
||||
/ \
|
||||
/ \
|
||||
mlx5_core.eth.4 mlx5_core.rdma.4
|
||||
(sf eth aux dev) (sf rdma aux dev)
|
||||
| |
|
||||
| |
|
||||
p0sf88 mlx5_0
|
||||
(sf netdev) (sf rdma device)
|
||||
|
||||
Additionally, the SF port also gets the event when the driver attaches to the
|
||||
auxiliary device of the subfunction. This results in changing the operational
|
||||
state of the function. This provides visibility to the user to decide when it is
|
||||
safe to delete the SF port for graceful termination of the subfunction.
|
||||
|
||||
- Show the SF port operational state::
|
||||
|
||||
$ devlink port show ens2f0npf0sf88
|
||||
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
|
||||
function:
|
||||
hw_addr 00:00:00:00:88:88 state active opstate attached
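The operational state can be used to script a graceful teardown: deactivate the function, wait until ``opstate`` reports ``detached``, and only then delete the port. The following is a sketch, not a definitive procedure. The helper names ``kv`` and ``sf_graceful_del`` and the one-second polling interval are hypothetical; the parsing relies only on the "key value" layout of the ``devlink port show`` output shown above.

```shell
# Print the value that follows KEY in whitespace-separated "key value" output.
kv() {
    awk -v key="$1" '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }'
}

# Hypothetical helper: deactivate an SF port, wait for its opstate to become
# "detached", then delete the port.
sf_graceful_del() {
    port=$1          # devlink port handle, e.g. pci/0000:06:00.0/32768
    tries=${2:-10}   # polling attempts, one second apart
    devlink port function set "$port" state inactive || return 1
    while [ "$tries" -gt 0 ]; do
        if [ "$(devlink port show "$port" | kv opstate)" = detached ]; then
            devlink port del "$port"
            return
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "timed out waiting for $port to detach" >&2
    return 1
}
```

The same ``kv`` helper can read other fields of the output, for example ``devlink port show ens2f0npf0sf88 | kv state``.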
|
@ -0,0 +1,229 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===========
|
||||
Tracepoints
|
||||
===========
|
||||
|
||||
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
|
||||
The mlx5 driver provides internal tracepoints for tracking and debugging using
|
||||
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
|
||||
|
||||
For the list of supported mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
|
||||
|
||||
tc and eswitch offloads tracepoints:
|
||||
|
||||
- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
|
||||
|
||||
- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
|
||||
|
||||
- mlx5e_stats_flower: trace flower stats request::
|
||||
|
||||
$ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
|
||||
|
||||
- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
|
||||
|
||||
- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
|
||||
|
||||
$ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
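The enable commands in this document all append ``mlx5:<event>`` to ``set_event``. The sketch below wraps that pattern and adds the matching disable step, which uses the ftrace convention of prefixing the event with ``!``. The helper names and the ``TRACING`` variable are hypothetical; the default path assumes debugfs is mounted at ``/sys/kernel/debug``, and writing to ``set_event`` requires root.

```shell
# Location of the ftrace control files; override TRACING if tracefs is
# mounted elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# Enable one mlx5 tracepoint by name, e.g. mlx5e_configure_flower.
mlx5_trace_on() {
    printf 'mlx5:%s\n' "$1" >> "$TRACING/set_event"
}

# Disable it again; ftrace treats a leading "!" as "remove this event".
mlx5_trace_off() {
    printf '!mlx5:%s\n' "$1" >> "$TRACING/set_event"
}
```

A typical session would call ``mlx5_trace_on mlx5e_configure_flower``, reproduce the issue, read ``$TRACING/trace``, and then call ``mlx5_trace_off mlx5e_configure_flower``.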
|
||||
|
||||
Bridge offloads tracepoints:
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
|
||||
|
||||
- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
|
||||
mlx5::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
|
||||
|
||||
- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
|
||||
representor::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
|
||||
|
||||
- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
|
||||
representor::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
|
||||
|
||||
- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
|
||||
device::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
|
||||
|
||||
- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
|
||||
device::
|
||||
|
||||
$ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
|
||||
|
||||
Eswitch QoS tracepoints:
|
||||
|
||||
- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
|
||||
|
||||
- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
|
||||
|
||||
- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
|
||||
|
||||
$ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
|
||||
|
||||
- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
|
||||
|
||||
- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
|
||||
|
||||
- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
|
||||
|
||||
$ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
<...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
|
||||
|
||||
SF tracepoints:
|
||||
|
||||
- mlx5_sf_add: trace addition of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_free: trace freeing of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
|
||||
|
||||
- mlx5_sf_activate: trace activation of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
|
||||
|
||||
- mlx5_sf_deactivate: trace deactivation of the SF port::
|
||||
|
||||
$ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
|
||||
|
||||
- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
|
||||
|
||||
- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
|
||||
|
||||
$ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
|
||||
|
||||
- mlx5_sf_update_state: trace state updates for SF contexts::
|
||||
|
||||
$ echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
|
||||
|
||||
- mlx5_sf_vhca_event: trace SF vhca event and state::
|
||||
|
||||
$ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
|
||||
|
||||
- mlx5_sf_dev_add: trace SF device add event::
|
||||
|
||||
$ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
|
||||
|
||||
- mlx5_sf_dev_del: trace SF device delete event::
|
||||
|
||||
$ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
|
||||
$ cat /sys/kernel/debug/tracing/trace
|
||||
...
|
||||
kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
|
@ -83,7 +83,7 @@ Configuring the Driver
|
||||
MTU
|
||||
---
|
||||
|
||||
Jumbo frame support is available with a maximim size of 9194 bytes.
|
||||
Jumbo frame support is available with a maximum size of 9194 bytes.
|
||||
|
||||
Interrupt coalescing
|
||||
--------------------
|
||||
|
@ -124,7 +124,7 @@ Multicast flooding
|
||||
==================
|
||||
CPU port mcast_flooding is always on
|
||||
|
||||
Turning flooding on/off on swithch ports:
|
||||
Turning flooding on/off on switch ports:
|
||||
bridge link set dev sw0p1 mcast_flood on/off
|
||||
|
||||
Access and Trunk port
|
||||
|
@ -174,7 +174,7 @@ Multicast flooding
|
||||
==================
|
||||
CPU port mcast_flooding is always on
|
||||
|
||||
Turning flooding on/off on swithch ports:
|
||||
Turning flooding on/off on switch ports:
|
||||
bridge link set dev sw0p1 mcast_flood on/off
|
||||
|
||||
Access and Trunk port
|
||||
|
@ -69,7 +69,7 @@ wwan0-X network device
|
||||
The IOSM driver exposes IP link interface "wwan0-X" of type "wwan" for IP
|
||||
traffic. Iproute network utility is used for creating "wwan0-X" network
|
||||
interface and for associating it with MBIM IP session. The Driver supports
|
||||
upto 8 IP sessions for simultaneous IP communication.
|
||||
up to 8 IP sessions for simultaneous IP communication.
|
||||
|
||||
The userspace management application is responsible for creating new IP link
|
||||
prior to establishing MBIM IP session where the SessionId is greater than 0.
|
||||
|
@@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.:
|
||||
* Recovery procedures
|
||||
* Diagnostics procedures
|
||||
* Object dump procedures
|
||||
* OOB initial parameters
|
||||
* Out Of Box initial parameters
|
||||
|
||||
Different parts of the driver can register different types of health reporters
|
||||
with different handlers.
|
||||
@@ -46,12 +46,31 @@ Once an error is reported, devlink health will perform the following actions:
|
||||
* A log is being send to the kernel trace events buffer
|
||||
* Health status and statistics are being updated for the reporter instance
|
||||
* Object dump is being taken and saved at the reporter instance (as long as
|
||||
there is no other dump which is already stored)
|
||||
auto-dump is set and there is no other dump which is already stored)
|
||||
* Auto recovery attempt is being done. Depends on:
|
||||
|
||||
- Auto-recovery configuration
|
||||
- Grace period vs. time passed since last recover
|
||||
|
||||
Devlink formatted message
|
||||
=========================
|
||||
|
||||
To handle devlink health diagnose and health dump requests, devlink creates a
|
||||
formatted message structure ``devlink_fmsg`` and send it to the driver's callback
|
||||
to fill the data in using the devlink fmsg API.
|
||||
|
||||
Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
|
||||
json-like format. The API allows the driver to add nested attributes such as
|
||||
object, object pair and value array, in addition to attributes such as name and
|
||||
value.
|
||||
|
||||
Driver should use this API to fill the fmsg context in a format which will be
|
||||
translated by the devlink to the netlink message later. When it needs to send
|
||||
the data using SKBs to the netlink layer, it fragments the data between
|
||||
different SKBs. In order to do this fragmentation, it uses virtual nests
|
||||
attributes, to avoid actual nesting use which cannot be divided between
|
||||
different SKBs.
|
||||
|
||||
User Interface
|
||||
==============
|
||||
|
||||
|
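The "virtual nests" idea in the hunk above can be pictured with a toy model: nesting is expressed as flat start and end markers in a token stream, so the stream can be cut between any two tokens and still be rebuilt on the receiving side. This is a sketch under stated assumptions only. The token names (`OBJ_START`, `PAIR_START`, `ARR_START`, `NEST_END`, `NAME`, `VALUE`) and the helpers `flatten`, `fragment`, and `rebuild` are hypothetical, and fixed-size token lists stand in for SKBs; none of this is the kernel's devlink fmsg API or its netlink encoding.

```python
# Toy model only: token names and helpers are hypothetical, not kernel identifiers.

def flatten(value):
    """Turn nested dicts/lists into a flat token list with start/end markers."""
    if isinstance(value, dict):
        tokens = [("OBJ_START",)]
        for name, child in value.items():
            tokens += [("PAIR_START",), ("NAME", name)]
            tokens += flatten(child)
            tokens += [("NEST_END",)]  # closes the pair
        return tokens + [("NEST_END",)]  # closes the object
    if isinstance(value, list):
        tokens = [("ARR_START",)]
        for child in value:
            tokens += flatten(child)
        return tokens + [("NEST_END",)]  # closes the array
    return [("VALUE", value)]


def fragment(tokens, per_buffer):
    """Cut the flat stream into fixed-size chunks, ignoring nesting boundaries."""
    return [tokens[i:i + per_buffer] for i in range(0, len(tokens), per_buffer)]


def rebuild(fragments):
    """Replay the tokens of all fragments in order to recover the nested value."""
    tokens = [tok for frag in fragments for tok in frag]
    pos = 0

    def parse():
        nonlocal pos
        kind = tokens[pos][0]
        if kind == "VALUE":
            value = tokens[pos][1]
            pos += 1
            return value
        if kind == "ARR_START":
            pos += 1
            items = []
            while tokens[pos][0] != "NEST_END":
                items.append(parse())
            pos += 1  # skip the array's NEST_END
            return items
        if kind == "OBJ_START":
            pos += 1
            obj = {}
            while tokens[pos][0] != "NEST_END":
                pos += 1  # skip PAIR_START
                name = tokens[pos][1]
                pos += 1
                obj[name] = parse()
                pos += 1  # skip the pair's NEST_END
            pos += 1  # skip the object's NEST_END
            return obj
        raise ValueError(f"unexpected token {tokens[pos]!r}")

    return parse()


report = {"reporter": "tx", "state": {"error": 1, "recover": 0}, "queues": [1, 2, 3]}
frags = fragment(flatten(report), per_buffer=4)
print(len(frags), rebuild(frags) == report)
```

A real nested container would have to fit inside one buffer; with flat markers, the cut points in `fragment` can fall anywhere, which is the property the text attributes to virtual nests.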
@@ -285,7 +285,7 @@ features are enabled after the hierarchy is exported, but before any
|
||||
changes are made.
|
||||
|
||||
This feature is also dependent on switchdev being enabled in the system.
|
||||
It's required bacause devlink-rate requires devlink-port objects to be
|
||||
It's required because devlink-rate requires devlink-port objects to be
|
||||
present, and those objects are only created in switchdev mode.
|
||||
|
||||
If the driver is set to the switchdev mode, it will export internal
|
||||
@@ -320,7 +320,7 @@ nodes and nodes with children also can't be deleted.
|
||||
* - ``tx_weight``
|
||||
- allows for usage of Weighted Fair Queuing arbitration scheme among
|
||||
siblings. This arbitration scheme can be used simultaneously with
|
||||
the strict priority. Range 1-200. Only relative values mater for
|
||||
the strict priority. Range 1-200. Only relative values matter for
|
||||
arbitration.
|
||||
|
||||
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
|
||||
|
@@ -66,3 +66,4 @@ parameters, info versions, and other features it supports.
|
||||
prestera
|
||||
iosm
|
||||
octeontx2
|
||||
sfc
|
||||
|
@@ -54,6 +54,24 @@ parameters.
|
||||
- Control the number of large groups (size > 1) in the FDB table.
|
||||
|
||||
* The default value is 15, and the range is between 1 and 1024.
|
||||
* - ``esw_multiport``
|
||||
- Boolean
|
||||
- runtime
|
||||
- Control MultiPort E-Switch shared fdb mode.
|
||||
|
||||
An experimental mode where a single E-Switch is used and all the vports
|
||||
and physical ports on the NIC are connected to it.
|
||||
|
||||
An example is to send traffic from a VF that is created on PF0 to an
|
||||
uplink that is natively associated with the uplink of PF1
|
||||
|
||||
Note: Future devices, ConnectX-8 and onward, will eventually have this
|
||||
as the default to allow forwarding between all NIC ports in a single
|
||||
E-switch environment and the dual E-switch mode will likely get
|
||||
deprecated.
|
||||
|
||||
Default: disabled
|
||||
|
||||
|
||||
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
|
||||
|
||||
|
@@ -95,5 +95,5 @@ Driver-specific Traps
|
||||
* - ``fid_miss``
|
||||
- ``exception``
|
||||
- When a packet enters the device it is classified to a filtering
|
||||
indentifier (FID) based on the ingress port and VLAN. This trap is used
|
||||
identifier (FID) based on the ingress port and VLAN. This trap is used
|
||||
to trap packets for which a FID could not be found
|
||||
|
@@ -138,4 +138,4 @@ Driver-specific Traps
|
||||
- Drops packets with zero (0) IPV4 source address.
|
||||
* - ``met_red``
|
||||
- ``drop``
|
||||
- Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwith.
|
||||
- Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
|
||||
|
Some files were not shown because too many files have changed in this diff