Networking changes for 6.9.

Core & protocols
 ----------------
 
  - Large effort by Eric to lower rtnl_lock pressure and remove locks:
 
    - Make commonly used parts of rtnetlink (address, route dumps etc.)
      lockless, protected by RCU instead of rtnl_lock.
 
    - Add a netns exit callback which already holds rtnl_lock,
      allowing netns exit to take rtnl_lock once in the core
      instead of once for each driver / callback.
 
    - Remove locks / serialization in the socket diag interface.
 
    - Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
 
    - Remove the dev_base_lock, depend on RCU where necessary.
 
  - Support busy polling on a per-epoll context basis. Poll length
    and budget parameters can be set independently of system defaults.
 
  - Introduce struct net_hotdata, to make sure read-mostly global config
    variables fit in as few cache lines as possible.
 
  - Add optional per-nexthop statistics to ease monitoring / debug
    of ECMP imbalance problems.
 
  - Support TCP_NOTSENT_LOWAT in MPTCP.
 
  - Ensure that IPv6 temporary addresses' preferred lifetimes are long
    enough, compared to other configured lifetimes, and at least 2 sec.
 
  - Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
 
  - Add support for the independent control state machine for bonding
    per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
    control state machine.
 
  - Add "network ID" to MCTP socket APIs to support hosts with multiple
    disjoint MCTP networks.
 
  - Re-use the mono_delivery_time skbuff bit for packets which user
    space wants to be sent at a specified time. Maintain the timing
    information while traversing veth links, bridge etc.
 
  - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
 
  - Simplify many places iterating over netdevs by using an xarray
    instead of a hash table walk (hash table remains in place, for
    use on fastpaths).
 
  - Speed up scanning for expired routes by keeping a dedicated list.
 
  - Speed up "generic" XDP by trying harder to avoid large allocations.
 
  - Support attaching arbitrary metadata to netconsole messages.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
    VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena).
 
  - Rework selftest harness to enable the use of the full range of
    ksft exit code (pass, fail, skip, xfail, xpass).
 
 Netfilter
 ---------
 
  - Allow userspace to define a table that is exclusively owned by a daemon
    (via netlink socket aliveness) without auto-removing this table when
    the userspace program exits. Such table gets marked as orphaned and
    a restarting management daemon can re-attach/regain ownership.
 
  - Speed up element insertions to nftables' concatenated-ranges set type.
    Compact a few related data structures.
 
 BPF
 ---
 
  - Add BPF token support for delegating a subset of BPF subsystem
    functionality from privileged system-wide daemons such as systemd
    through special mount options for userns-bound BPF fs to a trusted
    & unprivileged application.
 
  - Introduce bpf_arena which is sparse shared memory region between BPF
    program and user space where structures inside the arena can have
    pointers to other areas of the arena, and pointers work seamlessly
    for both user-space programs and BPF programs.
 
  - Introduce may_goto instruction that is a contract between the verifier
    and the program. The verifier allows the program to loop assuming it's
    behaving well, but reserves the right to terminate it.
 
  - Extend the BPF verifier to enable static subprog calls in spin lock
    critical sections.
 
  - Support registration of struct_ops types from modules which helps
    projects like fuse-bpf that seeks to implement a new struct_ops type.
 
  - Add support for retrieval of cookies for perf/kprobe multi links.
 
  - Support arbitrary TCP SYN cookie generation / validation in the TC
    layer with BPF to allow creating SYN flood handling in BPF firewalls.
 
  - Add code generation to inline the bpf_kptr_xchg() helper which
    improves performance when stashing/popping the allocated BPF objects.
 
 Wireless
 --------
 
  - Add SPP (signaling and payload protected) AMSDU support.
 
  - Support wider bandwidth OFDMA, as required for EHT operation.
 
 Driver API
 ----------
 
  - Major overhaul of the Energy Efficient Ethernet internals to support
    new link modes (2.5GE, 5GE), share more code between drivers
    (especially those using phylib), and encourage more uniform behavior.
    Convert and clean up drivers.
 
  - Define an API for querying per netdev queue statistics from drivers.
 
  - IPSec: account in global stats for fully offloaded sessions.
 
  - Create a concept of Ethernet PHY Packages at the Device Tree level,
    to allow parameterizing the existing PHY package code.
 
  - Enable Rx hashing (RSS) on GTP protocol fields.
 
 Misc
 ----
 
  - Improvements and refactoring all over networking selftests.
 
  - Create uniform module aliases for TC classifiers, actions,
    and packet schedulers to simplify creating modprobe policies.
 
  - Address all missing MODULE_DESCRIPTION() warnings in networking.
 
  - Extend the Netlink descriptions in YAML to cover message encapsulation
    or "Netlink polymorphism", where interpretation of nested attributes
    depends on link type, classifier type or some other "class type".
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
    - Intel (100G, ice, idpf):
      - support E825-C devices
    - nVidia/Mellanox:
      - support devices with one port and multiple PCIe links
    - Broadcom (bnxt):
      - support n-tuple filters
      - support configuring the RSS key
    - Wangxun (ngbe/txgbe):
      - implement irq_domain for TXGBE's sub-interrupts
    - Pensando/AMD:
      - support XDP
      - optimize queue submission and wakeup handling (+17% bps)
      - optimize struct layout, saving 28% of memory on queues
 
  - Ethernet NICs embedded and virtual:
    - Google cloud vNIC:
      - refactor driver to perform memory allocations for new queue
        config before stopping and freeing the old queue memory
    - Synopsys (stmmac):
      - obey queueMaxSDU and implement counters required by 802.1Qbv
    - Renesas (ravb):
      - support packet checksum offload
      - suspend to RAM and runtime PM support
 
  - Ethernet switches:
    - nVidia/Mellanox:
      - support for nexthop group statistics
    - Microchip:
      - ksz8: implement PHY loopback
      - add support for KSZ8567, a 7-port 10/100Mbps switch
 
  - PTP:
    - New driver for RENESAS FemtoClock3 Wireless clock generator.
    - Support OCP PTP cards designed and built by Adva.
 
  - CAN:
    - Support recvmsg() flags for own, local and remote traffic
      on CAN BCM sockets.
    - Support for esd GmbH PCIe/402 CAN device family.
    - m_can:
      - Rx/Tx submission coalescing
      - wake on frame Rx
 
  - WiFi:
    - Intel (iwlwifi):
      - enable signaling and payload protected A-MSDUs
      - support wider-bandwidth OFDMA
      - support for new devices
      - bump FW API to 89 for AX devices; 90 for BZ/SC devices
    - MediaTek (mt76):
      - mt7915: newer ADIE version support
      - mt7925: radio temperature sensor support
    - Qualcomm (ath11k):
      - support 6 GHz station power modes: Low Power Indoor (LPI),
        Standard Power) SP and Very Low Power (VLP)
      - QCA6390 & WCN6855: support 2 concurrent station interfaces
      - QCA2066 support
    - Qualcomm (ath12k):
      - refactoring in preparation for Multi-Link Operation (MLO) support
      - 1024 Block Ack window size support
      - firmware-2.bin support
      - support having multiple identical PCI devices (firmware needs to
        have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
      - QCN9274: support split-PHY devices
      - WCN7850: enable Power Save Mode in station mode
      - WCN7850: P2P support
    - RealTek:
      - rtw88: support for more rtw8811cu and rtw8821cu devices
      - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
      - rtlwifi: speed up USB firmware initialization
      - rtwl8xxxu:
        - RTL8188F: concurrent interface support
        - Channel Switch Announcement (CSA) support in AP mode
    - Broadcom (brcmfmac):
      - per-vendor feature support
      - per-vendor SAE password setup
      - DMI nvram filename quirk for ACEPC W5 Pro
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmXv0mgACgkQMUZtbf5S
 IrtgMxAAuRd+WJW++SENr4KxIWhYO1q6Xcxnai43wrNkan9swD24icG8TYALt4f3
 yoT6idQvWReAb5JNlh9rUQz8R7E0nJXlvEFn5MtJwcthx2C6wFo/XkJlddlRrT+j
 c2xGILwLjRhW65LaC0MZ2ECbEERkFz8xcGfK2SWzUgh6KYvPjcRfKFxugpM7xOQK
 P/Wnqhs4fVRS/Mj/bCcXcO+yhwC121Q3qVeQVjGS0AzEC65hAW87a/kc2BfgcegD
 EyI9R7mf6criQwX+0awubjfoIdr4oW/8oDVNvUDczkJkbaEVaLMQk9P5x/0XnnVS
 UHUchWXyI80Q8Rj12uN1/I0h3WtwNQnCRBuLSmtm6GLfCAwbLvp2nGWDnaXiqryW
 DVKUIHGvqPKjkOOMOVfSvfB3LvkS3xsFVVYiQBQCn0YSs/gtu4CoF2Nty9CiLPbK
 tTuxUnLdPDZDxU//l0VArZmP8p2JM7XQGJ+JH8GFH4SBTyBR23e0iyPSoyaxjnYn
 RReDnHMVsrS1i7GPhbqDJWn+uqMSs7N149i0XmmyeqwQHUVSJN3J2BApP2nCaDfy
 H2lTuYly5FfEezt61NvCE4qr/VsWeEjm1fYlFQ9dFn4pGn+HghyCpw+xD1ZN56DN
 lujemau5B3kk1UTtAT4ypPqvuqjkRFqpNV2LzsJSk/Js+hApw8Y=
 =oY52
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Large effort by Eric to lower rtnl_lock pressure and remove locks:

      - Make commonly used parts of rtnetlink (address, route dumps
        etc) lockless, protected by RCU instead of rtnl_lock.

      - Add a netns exit callback which already holds rtnl_lock,
        allowing netns exit to take rtnl_lock once in the core instead
        of once for each driver / callback.

      - Remove locks / serialization in the socket diag interface.

      - Remove 6 calls to synchronize_rcu() while holding rtnl_lock.

      - Remove the dev_base_lock, depend on RCU where necessary.

   - Support busy polling on a per-epoll context basis. Poll length and
     budget parameters can be set independently of system defaults.

   - Introduce struct net_hotdata, to make sure read-mostly global
     config variables fit in as few cache lines as possible.

   - Add optional per-nexthop statistics to ease monitoring / debug of
     ECMP imbalance problems.

   - Support TCP_NOTSENT_LOWAT in MPTCP.

   - Ensure that IPv6 temporary addresses' preferred lifetimes are long
     enough, compared to other configured lifetimes, and at least 2 sec.

   - Support forwarding of ICMP Error messages in IPSec, per RFC 4301.

   - Add support for the independent control state machine for bonding
     per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
     control state machine.

   - Add "network ID" to MCTP socket APIs to support hosts with multiple
     disjoint MCTP networks.

   - Re-use the mono_delivery_time skbuff bit for packets which user
     space wants to be sent at a specified time. Maintain the timing
     information while traversing veth links, bridge etc.

   - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.

   - Simplify many places iterating over netdevs by using an xarray
     instead of a hash table walk (hash table remains in place, for use
     on fastpaths).

   - Speed up scanning for expired routes by keeping a dedicated list.

   - Speed up "generic" XDP by trying harder to avoid large allocations.

   - Support attaching arbitrary metadata to netconsole messages.

  Things we sprinkled into general kernel code:

   - Enforce VM_IOREMAP flag and range in ioremap_page_range and
     introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
     bpf_arena).

   - Rework selftest harness to enable the use of the full range of ksft
     exit code (pass, fail, skip, xfail, xpass).

  Netfilter:

   - Allow userspace to define a table that is exclusively owned by a
     daemon (via netlink socket aliveness) without auto-removing this
     table when the userspace program exits. Such table gets marked as
     orphaned and a restarting management daemon can re-attach/regain
     ownership.

   - Speed up element insertions to nftables' concatenated-ranges set
     type. Compact a few related data structures.

  BPF:

   - Add BPF token support for delegating a subset of BPF subsystem
     functionality from privileged system-wide daemons such as systemd
     through special mount options for userns-bound BPF fs to a trusted
     & unprivileged application.

   - Introduce bpf_arena which is sparse shared memory region between
     BPF program and user space where structures inside the arena can
     have pointers to other areas of the arena, and pointers work
     seamlessly for both user-space programs and BPF programs.

   - Introduce may_goto instruction that is a contract between the
     verifier and the program. The verifier allows the program to loop
     assuming it's behaving well, but reserves the right to terminate
     it.

   - Extend the BPF verifier to enable static subprog calls in spin lock
     critical sections.

   - Support registration of struct_ops types from modules which helps
     projects like fuse-bpf that seeks to implement a new struct_ops
     type.

   - Add support for retrieval of cookies for perf/kprobe multi links.

   - Support arbitrary TCP SYN cookie generation / validation in the TC
     layer with BPF to allow creating SYN flood handling in BPF
     firewalls.

   - Add code generation to inline the bpf_kptr_xchg() helper which
     improves performance when stashing/popping the allocated BPF
     objects.

  Wireless:

   - Add SPP (signaling and payload protected) AMSDU support.

   - Support wider bandwidth OFDMA, as required for EHT operation.

  Driver API:

   - Major overhaul of the Energy Efficient Ethernet internals to
     support new link modes (2.5GE, 5GE), share more code between
     drivers (especially those using phylib), and encourage more
     uniform behavior. Convert and clean up drivers.

   - Define an API for querying per netdev queue statistics from
     drivers.

   - IPSec: account in global stats for fully offloaded sessions.

   - Create a concept of Ethernet PHY Packages at the Device Tree level,
     to allow parameterizing the existing PHY package code.

   - Enable Rx hashing (RSS) on GTP protocol fields.

  Misc:

   - Improvements and refactoring all over networking selftests.

   - Create uniform module aliases for TC classifiers, actions, and
     packet schedulers to simplify creating modprobe policies.

   - Address all missing MODULE_DESCRIPTION() warnings in networking.

   - Extend the Netlink descriptions in YAML to cover message
     encapsulation or "Netlink polymorphism", where interpretation of
     nested attributes depends on link type, classifier type or some
     other "class type".

  Drivers:

   - Ethernet high-speed NICs:
      - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
      - Intel (100G, ice, idpf):
         - support E825-C devices
      - nVidia/Mellanox:
         - support devices with one port and multiple PCIe links
      - Broadcom (bnxt):
         - support n-tuple filters
         - support configuring the RSS key
      - Wangxun (ngbe/txgbe):
         - implement irq_domain for TXGBE's sub-interrupts
      - Pensando/AMD:
         - support XDP
         - optimize queue submission and wakeup handling (+17% bps)
         - optimize struct layout, saving 28% of memory on queues

   - Ethernet NICs embedded and virtual:
      - Google cloud vNIC:
         - refactor driver to perform memory allocations for new queue
           config before stopping and freeing the old queue memory
      - Synopsys (stmmac):
         - obey queueMaxSDU and implement counters required by 802.1Qbv
      - Renesas (ravb):
         - support packet checksum offload
         - suspend to RAM and runtime PM support

   - Ethernet switches:
      - nVidia/Mellanox:
         - support for nexthop group statistics
      - Microchip:
         - ksz8: implement PHY loopback
         - add support for KSZ8567, a 7-port 10/100Mbps switch

   - PTP:
      - New driver for RENESAS FemtoClock3 Wireless clock generator.
      - Support OCP PTP cards designed and built by Adva.

   - CAN:
      - Support recvmsg() flags for own, local and remote traffic on CAN
        BCM sockets.
      - Support for esd GmbH PCIe/402 CAN device family.
      - m_can:
         - Rx/Tx submission coalescing
         - wake on frame Rx

   - WiFi:
      - Intel (iwlwifi):
         - enable signaling and payload protected A-MSDUs
         - support wider-bandwidth OFDMA
         - support for new devices
         - bump FW API to 89 for AX devices; 90 for BZ/SC devices
      - MediaTek (mt76):
         - mt7915: newer ADIE version support
         - mt7925: radio temperature sensor support
      - Qualcomm (ath11k):
         - support 6 GHz station power modes: Low Power Indoor (LPI),
           Standard Power) SP and Very Low Power (VLP)
         - QCA6390 & WCN6855: support 2 concurrent station interfaces
         - QCA2066 support
      - Qualcomm (ath12k):
         - refactoring in preparation for Multi-Link Operation (MLO)
           support
         - 1024 Block Ack window size support
         - firmware-2.bin support
         - support having multiple identical PCI devices (firmware needs
           to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
         - QCN9274: support split-PHY devices
         - WCN7850: enable Power Save Mode in station mode
         - WCN7850: P2P support
      - RealTek:
         - rtw88: support for more rtw8811cu and rtw8821cu devices
         - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
         - rtlwifi: speed up USB firmware initialization
         - rtwl8xxxu:
             - RTL8188F: concurrent interface support
             - Channel Switch Announcement (CSA) support in AP mode
      - Broadcom (brcmfmac):
         - per-vendor feature support
         - per-vendor SAE password setup
         - DMI nvram filename quirk for ACEPC W5 Pro"

* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
  nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
  nexthop: Fix out-of-bounds access during attribute validation
  nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
  nexthop: Only parse NHA_OP_FLAGS for get messages that require it
  bpf: move sleepable flag from bpf_prog_aux to bpf_prog
  bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
  selftests/bpf: Add kprobe multi triggering benchmarks
  ptp: Move from simple ida to xarray
  vxlan: Remove generic .ndo_get_stats64
  vxlan: Do not alloc tstats manually
  devlink: Add comments to use netlink gen tool
  nfp: flower: handle acti_netdevs allocation failure
  net/packet: Add getsockopt support for PACKET_COPY_THRESH
  net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
  selftests/bpf: Add bpf_arena_htab test.
  selftests/bpf: Add bpf_arena_list test.
  selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  bpf: Add helper macro bpf_addr_space_cast()
  libbpf: Recognize __arena global variables.
  bpftool: Recognize arena map type
  ...
This commit is contained in:
Linus Torvalds 2024-03-12 17:44:08 -07:00
commit 9187210eee
1881 changed files with 91672 additions and 36254 deletions

View File

@ -1,4 +1,5 @@
Alan Cox <alan@lxorguk.ukuu.org.uk>
Alan Cox <root@hraefn.swansea.linux.org.uk>
Christoph Hellwig <hch@lst.de>
Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Marc Gonzalez <marc.w.gonzalez@free.fr>

View File

@ -573,6 +573,7 @@ Simon Kelley <simon@thekelleys.org.uk>
Sricharan Ramabadhran <quic_srichara@quicinc.com> <sricharan@codeaurora.org>
Srinivas Ramana <quic_sramana@quicinc.com> <sramana@codeaurora.org>
Sriram R <quic_srirrama@quicinc.com> <srirrama@codeaurora.org>
Stefan Wahren <wahrenst@gmx.net> <stefan.wahren@i2se.com>
Stéphane Witzmann <stephane.witzmann@ubpmes.univ-bpclermont.fr>
Stephen Hemminger <stephen@networkplumber.org> <shemminger@linux-foundation.org>
Stephen Hemminger <stephen@networkplumber.org> <shemminger@osdl.org>

View File

@ -96,3 +96,26 @@ Description:
Indicates the absolute minimum limit of bytes allowed to be
queued on this network device transmit queue. Default value is
0.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Tx completion stall detection threshold in ms. Kernel will
guarantee to detect all stalls longer than this threshold but
may also detect stalls longer than half of the threshold.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Number of detected Tx completion stalls.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Longest detected Tx completion stall. Write 0 to clear.

View File

@ -206,6 +206,11 @@ Will increase power usage.
Default: 0 (off)
mem_pcpu_rsv
------------
Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
rmem_default
------------

View File

@ -177,10 +177,10 @@ In addition to kfuncs' arguments, verifier may need more information about the
type of kfunc(s) being registered with the BPF subsystem. To do so, we define
flags on a set of kfuncs as follows::
BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Ofcourse, it is also allowed to specify no flags.
@ -347,10 +347,10 @@ Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::
BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)
static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
.owner = THIS_MODULE,

View File

@ -17,7 +17,7 @@ significant byte.
LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
operations is a ``struct bpf_lpm_trie_key_u8``, extended by
``max_prefixlen/8`` bytes.
- For IPv4 addresses the data length is 4 bytes

View File

@ -1,11 +1,11 @@
.. contents::
.. sectnum::
=======================================
BPF Instruction Set Specification, v1.0
=======================================
======================================
BPF Instruction Set Architecture (ISA)
======================================
This document specifies version 1.0 of the BPF instruction set.
This document specifies the BPF instruction set architecture (ISA).
Documentation conventions
=========================
@ -24,22 +24,22 @@ a type's signedness (`S`) and bit width (`N`), respectively.
.. table:: Meaning of signedness notation.
==== =========
`S` Meaning
S Meaning
==== =========
`u` unsigned
`s` signed
u unsigned
s signed
==== =========
.. table:: Meaning of bit-width notation.
===== =========
`N` Bit width
N Bit width
===== =========
`8` 8 bits
`16` 16 bits
`32` 32 bits
`64` 64 bits
`128` 128 bits
8 8 bits
16 16 bits
32 32 bits
64 64 bits
128 128 bits
===== =========
For example, `u32` is a type whose valid values are all the 32-bit unsigned
@ -48,31 +48,31 @@ numbers.
Functions
---------
* `htobe16`: Takes an unsigned 16-bit number in host-endian format and
* htobe16: Takes an unsigned 16-bit number in host-endian format and
returns the equivalent number as an unsigned 16-bit number in big-endian
format.
* `htobe32`: Takes an unsigned 32-bit number in host-endian format and
* htobe32: Takes an unsigned 32-bit number in host-endian format and
returns the equivalent number as an unsigned 32-bit number in big-endian
format.
* `htobe64`: Takes an unsigned 64-bit number in host-endian format and
* htobe64: Takes an unsigned 64-bit number in host-endian format and
returns the equivalent number as an unsigned 64-bit number in big-endian
format.
* `htole16`: Takes an unsigned 16-bit number in host-endian format and
* htole16: Takes an unsigned 16-bit number in host-endian format and
returns the equivalent number as an unsigned 16-bit number in little-endian
format.
* `htole32`: Takes an unsigned 32-bit number in host-endian format and
* htole32: Takes an unsigned 32-bit number in host-endian format and
returns the equivalent number as an unsigned 32-bit number in little-endian
format.
* `htole64`: Takes an unsigned 64-bit number in host-endian format and
* htole64: Takes an unsigned 64-bit number in host-endian format and
returns the equivalent number as an unsigned 64-bit number in little-endian
format.
* `bswap16`: Takes an unsigned 16-bit number in either big- or little-endian
* bswap16: Takes an unsigned 16-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
* `bswap32`: Takes an unsigned 32-bit number in either big- or little-endian
* bswap32: Takes an unsigned 32-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
* `bswap64`: Takes an unsigned 64-bit number in either big- or little-endian
* bswap64: Takes an unsigned 64-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
@ -97,40 +97,101 @@ Definitions
A: 10000110
B: 11111111 10000110
Conformance groups
------------------
An implementation does not need to support all instructions specified in this
document (e.g., deprecated instructions). Instead, a number of conformance
groups are specified. An implementation must support the base32 conformance
group and may support additional conformance groups, where supporting a
conformance group means it must support all instructions in that conformance
group.
The use of named conformance groups enables interoperability between a runtime
that executes instructions, and tools as such compilers that generate
instructions for the runtime. Thus, capability discovery in terms of
conformance groups might be done manually by users or automatically by tools.
Each conformance group has a short ASCII label (e.g., "base32") that
corresponds to a set of instructions that are mandatory. That is, each
instruction has one or more conformance groups of which it is a member.
This document defines the following conformance groups:
* base32: includes all instructions defined in this
specification unless otherwise noted.
* base64: includes base32, plus instructions explicitly noted
as being in the base64 conformance group.
* atomic32: includes 32-bit atomic operation instructions (see `Atomic operations`_).
* atomic64: includes atomic32, plus 64-bit atomic operation instructions.
* divmul32: includes 32-bit division, multiplication, and modulo instructions.
* divmul64: includes divmul32, plus 64-bit division, multiplication,
and modulo instructions.
* packet: deprecated packet access instructions.
Instruction encoding
====================
BPF has two instruction encodings:
* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
constant) value after the basic instruction for a total of 128 bits.
* the wide instruction encoding, which appends a second 64 bits
after the basic instruction for a total of 128 bits.
The fields conforming an encoded basic instruction are stored in the
following order::
Basic instruction encoding
--------------------------
opcode:8 src_reg:4 dst_reg:4 offset:16 imm:32 // In little-endian BPF.
opcode:8 dst_reg:4 src_reg:4 offset:16 imm:32 // In big-endian BPF.
A basic instruction is encoded as follows::
**imm**
signed integer immediate value
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| opcode | regs | offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
**opcode**
operation to perform, encoded as follows::
+-+-+-+-+-+-+-+-+
|specific |class|
+-+-+-+-+-+-+-+-+
**specific**
The format of these bits varies by instruction class
**class**
The instruction class (see `Instruction classes`_)
**regs**
The source and destination register numbers, encoded as follows
on a little-endian host::
+-+-+-+-+-+-+-+-+
|src_reg|dst_reg|
+-+-+-+-+-+-+-+-+
and as follows on a big-endian host::
+-+-+-+-+-+-+-+-+
|dst_reg|src_reg|
+-+-+-+-+-+-+-+-+
**src_reg**
the source register number (0-10), except where otherwise specified
(`64-bit immediate instructions`_ reuse this field for other purposes)
**dst_reg**
destination register number (0-10)
**offset**
signed integer offset used with pointer arithmetic
**src_reg**
the source register number (0-10), except where otherwise specified
(`64-bit immediate instructions`_ reuse this field for other purposes)
**imm**
signed integer immediate value
**dst_reg**
destination register number (0-10)
**opcode**
operation to perform
Note that the contents of multi-byte fields ('imm' and 'offset') are
stored using big-endian byte ordering in big-endian BPF and
little-endian byte ordering in little-endian BPF.
Note that the contents of multi-byte fields ('offset' and 'imm') are
stored using big-endian byte ordering on big-endian hosts and
little-endian byte ordering on little-endian hosts.
For example::
@ -143,71 +204,83 @@ For example::
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
instruction uses a 64-bit immediate value that is constructed as follows.
The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value.
Wide instruction encoding
--------------------------
Some instructions are defined to use the wide instruction encoding,
which uses two 32-bit immediate values. The 64 bits following
the basic instruction format contain a pseudo instruction
with 'opcode', 'dst_reg', 'src_reg', and 'offset' all set to zero.
This is depicted in the following figure::
basic_instruction
.-----------------------------.
| |
code:8 regs:8 offset:16 imm:32 unused:32 imm:32
| |
'--------------'
pseudo instruction
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| opcode | regs | offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| next_imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Thus the 64-bit immediate value is constructed as follows:
**opcode**
operation to perform, encoded as explained above
imm64 = (next_imm << 32) | imm
**regs**
The source and destination register numbers, encoded as explained above
where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction. The unused bytes in the pseudo
instruction are reserved and shall be cleared to zero.
**offset**
signed integer offset used with pointer arithmetic
**imm**
signed integer immediate value
**reserved**
unused, set to zero
**next_imm**
second signed integer immediate value
Instruction classes
-------------------
The three LSB bits of the 'opcode' field store the instruction class:
The three least significant bits of the 'opcode' field store the instruction class:
========= ===== =============================== ===================================
class value description reference
========= ===== =============================== ===================================
BPF_LD 0x00 non-standard load operations `Load and store instructions`_
BPF_LDX 0x01 load into register operations `Load and store instructions`_
BPF_ST 0x02 store from immediate operations `Load and store instructions`_
BPF_STX 0x03 store from register operations `Load and store instructions`_
BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
========= ===== =============================== ===================================
===== ===== =============================== ===================================
class value description reference
===== ===== =============================== ===================================
LD 0x0 non-standard load operations `Load and store instructions`_
LDX 0x1 load into register operations `Load and store instructions`_
ST 0x2 store from immediate operations `Load and store instructions`_
STX 0x3 store from register operations `Load and store instructions`_
ALU 0x4 32-bit arithmetic operations `Arithmetic and jump instructions`_
JMP 0x5 64-bit jump operations `Arithmetic and jump instructions`_
JMP32 0x6 32-bit jump operations `Arithmetic and jump instructions`_
ALU64 0x7 64-bit arithmetic operations `Arithmetic and jump instructions`_
===== ===== =============================== ===================================
Arithmetic and jump instructions
================================
For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
For arithmetic and jump instructions (``ALU``, ``ALU64``, ``JMP`` and
``JMP32``), the 8-bit 'opcode' field is divided into three parts::
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
code source instruction class
============== ====== =================
+-+-+-+-+-+-+-+-+
| code |s|class|
+-+-+-+-+-+-+-+-+
**code**
the operation code, whose meaning varies by instruction class
**source**
**s (source)**
the source operand location, which unless otherwise specified is one of:
====== ===== ==============================================
source value description
====== ===== ==============================================
BPF_K 0x00 use 32-bit 'imm' value as source operand
BPF_X 0x08 use 'src_reg' register value as source operand
K 0 use 32-bit 'imm' value as source operand
X 1 use 'src_reg' register value as source operand
====== ===== ==============================================
**instruction class**
@ -216,70 +289,75 @@ code source instruction class
Arithmetic instructions
-----------------------
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
``ALU`` uses 32-bit wide operands while ``ALU64`` uses 64-bit wide operands for
otherwise identical operations. ``ALU64`` instructions belong to the
base64 conformance group unless noted otherwise.
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
to the values of the source and destination registers, respectively.
========= ===== ======= ==========================================================
code value offset description
========= ===== ======= ==========================================================
BPF_ADD 0x00 0 dst += src
BPF_SUB 0x10 0 dst -= src
BPF_MUL 0x20 0 dst \*= src
BPF_DIV 0x30 0 dst = (src != 0) ? (dst / src) : 0
BPF_SDIV 0x30 1 dst = (src != 0) ? (dst s/ src) : 0
BPF_OR 0x40 0 dst \|= src
BPF_AND 0x50 0 dst &= src
BPF_LSH 0x60 0 dst <<= (src & mask)
BPF_RSH 0x70 0 dst >>= (src & mask)
BPF_NEG 0x80 0 dst = -dst
BPF_MOD 0x90 0 dst = (src != 0) ? (dst % src) : dst
BPF_SMOD 0x90 1 dst = (src != 0) ? (dst s% src) : dst
BPF_XOR 0xa0 0 dst ^= src
BPF_MOV 0xb0 0 dst = src
BPF_MOVSX 0xb0 8/16/32 dst = (s8,s16,s32)src
BPF_ARSH 0xc0 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
BPF_END 0xd0 0 byte swap operations (see `Byte swap instructions`_ below)
========= ===== ======= ==========================================================
===== ===== ======= ==========================================================
name code offset description
===== ===== ======= ==========================================================
ADD 0x0 0 dst += src
SUB 0x1 0 dst -= src
MUL 0x2 0 dst \*= src
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
OR 0x4 0 dst \|= src
AND 0x5 0 dst &= src
LSH 0x6 0 dst <<= (src & mask)
RSH 0x7 0 dst >>= (src & mask)
NEG 0x8 0 dst = -dst
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
XOR 0xa 0 dst ^= src
MOV 0xb 0 dst = src
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
===== ===== ======= ==========================================================
Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If BPF program execution would
result in division by zero, the destination register is instead set to zero.
If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
the destination register is unchanged whereas for ``BPF_ALU`` the upper
If execution would result in modulo by zero, for ``ALU64`` the value of
the destination register is unchanged whereas for ``ALU`` the upper
32 bits of the destination register are zeroed.
``BPF_ADD | BPF_X | BPF_ALU`` means::
``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::
dst = (u32) ((u32) dst + (u32) src)
where '(u32)' indicates that the upper 32 bits are zeroed.
``BPF_ADD | BPF_X | BPF_ALU64`` means::
``{ADD, X, ALU64}`` means::
dst = dst + src
``BPF_XOR | BPF_K | BPF_ALU`` means::
``{XOR, K, ALU}`` means::
dst = (u32) dst ^ (u32) imm32
dst = (u32) dst ^ (u32) imm
``BPF_XOR | BPF_K | BPF_ALU64`` means::
``{XOR, K, ALU64}`` means::
dst = dst ^ imm32
dst = dst ^ imm
Note that most instructions have instruction offset of 0. Only three instructions
(``BPF_SDIV``, ``BPF_SMOD``, ``BPF_MOVSX``) have a non-zero offset.
(``SDIV``, ``SMOD``, ``MOVSX``) have a non-zero offset.
Division, multiplication, and modulo operations for ``ALU`` are part
of the "divmul32" conformance group, and division, multiplication, and
modulo operations for ``ALU64`` are part of the "divmul64" conformance
group.
The division and modulo operations support both unsigned and signed flavors.
For unsigned operations (``BPF_DIV`` and ``BPF_MOD``), for ``BPF_ALU``,
'imm' is interpreted as a 32-bit unsigned value. For ``BPF_ALU64``,
For unsigned operations (``DIV`` and ``MOD``), for ``ALU``,
'imm' is interpreted as a 32-bit unsigned value. For ``ALU64``,
'imm' is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit unsigned value.
For signed operations (``BPF_SDIV`` and ``BPF_SMOD``), for ``BPF_ALU``,
'imm' is interpreted as a 32-bit signed value. For ``BPF_ALU64``, 'imm'
For signed operations (``SDIV`` and ``SMOD``), for ``ALU``,
'imm' is interpreted as a 32-bit signed value. For ``ALU64``, 'imm'
is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit signed value.
@ -291,11 +369,15 @@ etc. This specification requires that signed modulo use truncated division
a % n = a - n * trunc(a / n)
The ``BPF_MOVSX`` instruction does a move operation with sign extension.
``BPF_ALU | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
The ``MOVSX`` instruction does a move operation with sign extension.
``{MOVSX, X, ALU}`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
bit operands, and zeroes the remaining upper 32 bits.
``BPF_ALU64 | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
operands into 64 bit operands.
``{MOVSX, X, ALU64}`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
operands into 64 bit operands. Unlike other arithmetic instructions,
``MOVSX`` is only defined for register source operands (``X``).
The ``NEG`` instruction is only defined when the source bit is clear
(``K``).
Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31)
for 32-bit operations.
@ -303,43 +385,45 @@ for 32-bit operations.
Byte swap instructions
----------------------
The byte swap instructions use instruction classes of ``BPF_ALU`` and ``BPF_ALU64``
and a 4-bit 'code' field of ``BPF_END``.
The byte swap instructions use instruction classes of ``ALU`` and ``ALU64``
and a 4-bit 'code' field of ``END``.
The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.
For ``BPF_ALU``, the 1-bit source operand field in the opcode is used to
For ``ALU``, the 1-bit source operand field in the opcode is used to
select what byte order the operation converts from or to. For
``BPF_ALU64``, the 1-bit source operand field in the opcode is reserved
``ALU64``, the 1-bit source operand field in the opcode is reserved
and must be set to 0.
========= ========= ===== =================================================
class source value description
========= ========= ===== =================================================
BPF_ALU BPF_TO_LE 0x00 convert between host byte order and little endian
BPF_ALU BPF_TO_BE 0x08 convert between host byte order and big endian
BPF_ALU64 Reserved 0x00 do byte swap unconditionally
========= ========= ===== =================================================
===== ======== ===== =================================================
class source value description
===== ======== ===== =================================================
ALU TO_LE 0 convert between host byte order and little endian
ALU TO_BE 1 convert between host byte order and big endian
ALU64 Reserved 0 do byte swap unconditionally
===== ======== ===== =================================================
The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64.
are supported: 16, 32 and 64. Width 64 operations belong to the base64
conformance group and other swap operations belong to the base32
conformance group.
Examples:
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_LE, ALU}`` with imm = 16/32/64 means::
dst = htole16(dst)
dst = htole32(dst)
dst = htole64(dst)
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_BE, ALU}`` with imm = 16/32/64 means::
dst = htobe16(dst)
dst = htobe32(dst)
dst = htobe64(dst)
``BPF_ALU64 | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_LE, ALU64}`` with imm = 16/32/64 means::
dst = bswap16(dst)
dst = bswap32(dst)
@ -348,56 +432,61 @@ Examples:
Jump instructions
-----------------
``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
``JMP32`` uses 32-bit wide operands and indicates the base32
conformance group, while ``JMP`` uses 64-bit wide operands for
otherwise identical operations, and indicates the base64 conformance
group unless otherwise specified.
The 'code' field encodes the operation as below:
======== ===== === =========================================== =========================================
code value src description notes
======== ===== === =========================================== =========================================
BPF_JA 0x0 0x0 PC += offset BPF_JMP class
BPF_JA 0x0 0x0 PC += imm BPF_JMP32 class
BPF_JEQ 0x1 any PC += offset if dst == src
BPF_JGT 0x2 any PC += offset if dst > src unsigned
BPF_JGE 0x3 any PC += offset if dst >= src unsigned
BPF_JSET 0x4 any PC += offset if dst & src
BPF_JNE 0x5 any PC += offset if dst != src
BPF_JSGT 0x6 any PC += offset if dst > src signed
BPF_JSGE 0x7 any PC += offset if dst >= src signed
BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_
BPF_CALL 0x8 0x1 call PC += imm see `Program-local functions`_
BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_
BPF_EXIT 0x9 0x0 return BPF_JMP only
BPF_JLT 0xa any PC += offset if dst < src unsigned
BPF_JLE 0xb any PC += offset if dst <= src unsigned
BPF_JSLT 0xc any PC += offset if dst < src signed
BPF_JSLE 0xd any PC += offset if dst <= src signed
======== ===== === =========================================== =========================================
======== ===== ======= =============================== ===================================================
code value src_reg description notes
======== ===== ======= =============================== ===================================================
JA 0x0 0x0 PC += offset {JA, K, JMP} only
JA 0x0 0x0 PC += imm {JA, K, JMP32} only
JEQ 0x1 any PC += offset if dst == src
JGT 0x2 any PC += offset if dst > src unsigned
JGE 0x3 any PC += offset if dst >= src unsigned
JSET 0x4 any PC += offset if dst & src
JNE 0x5 any PC += offset if dst != src
JSGT 0x6 any PC += offset if dst > src signed
JSGE 0x7 any PC += offset if dst >= src signed
CALL 0x8 0x0 call helper function by address {CALL, K, JMP} only, see `Helper functions`_
CALL 0x8 0x1 call PC += imm {CALL, K, JMP} only, see `Program-local functions`_
CALL 0x8 0x2 call helper function by BTF ID {CALL, K, JMP} only, see `Helper functions`_
EXIT 0x9 0x0 return {CALL, K, JMP} only
JLT 0xa any PC += offset if dst < src unsigned
JLE 0xb any PC += offset if dst <= src unsigned
JSLT 0xc any PC += offset if dst < src signed
JSLE 0xd any PC += offset if dst <= src signed
======== ===== ======= =============================== ===================================================
The BPF program needs to store the return value into register R0 before doing a
``BPF_EXIT``.
The BPF program needs to store the return value into register R0 before doing an
``EXIT``.
Example:
``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
``{JSGE, X, JMP32}`` means::
if (s32)dst s>= (s32)src goto +offset
where 's>=' indicates a signed '>=' comparison.
``BPF_JA | BPF_K | BPF_JMP32`` (0x06) means::
``{JA, K, JMP32}`` means::
gotol +imm
where 'imm' means the branch offset comes from insn 'imm' field.
Note that there are two flavors of ``BPF_JA`` instructions. The
``BPF_JMP`` class permits a 16-bit jump offset specified by the 'offset'
field, whereas the ``BPF_JMP32`` class permits a 32-bit jump offset
Note that there are two flavors of ``JA`` instructions. The
``JMP`` class permits a 16-bit jump offset specified by the 'offset'
field, whereas the ``JMP32`` class permits a 32-bit jump offset
specified by the 'imm' field. A > 16-bit conditional jump may be
converted to a < 16-bit conditional jump plus a 32-bit unconditional
jump.
All ``CALL`` and ``JA`` instructions belong to the
base32 conformance group.
Helper functions
~~~~~~~~~~~~~~~~
@ -416,78 +505,83 @@ Program-local functions
~~~~~~~~~~~~~~~~~~~~~~~
Program-local functions are functions exposed by the same BPF program as the
caller, and are referenced by offset from the call instruction, similar to
``BPF_JA``. The offset is encoded in the imm field of the call instruction.
A ``BPF_EXIT`` within the program-local function will return to the caller.
``JA``. The offset is encoded in the imm field of the call instruction.
A ``EXIT`` within the program-local function will return to the caller.
Load and store instructions
===========================
For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:
For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
8-bit 'opcode' field is divided as::
============ ====== =================
3 bits (MSB) 2 bits 3 bits (LSB)
============ ====== =================
mode size instruction class
============ ====== =================
+-+-+-+-+-+-+-+-+
|mode |sz |class|
+-+-+-+-+-+-+-+-+
The mode modifier is one of:
**mode**
The mode modifier is one of:
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
BPF_IMM 0x00 64-bit immediate instructions `64-bit immediate instructions`_
BPF_ABS 0x20 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
BPF_IND 0x40 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
BPF_MEM 0x60 regular load and store operations `Regular load and store operations`_
BPF_MEMSX 0x80 sign-extension load operations `Sign-extension load operations`_
BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_
============= ===== ==================================== =============
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
IMM 0 64-bit immediate instructions `64-bit immediate instructions`_
ABS 1 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
IND 2 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
MEM 3 regular load and store operations `Regular load and store operations`_
MEMSX 4 sign-extension load operations `Sign-extension load operations`_
ATOMIC 6 atomic operations `Atomic operations`_
============= ===== ==================================== =============
The size modifier is one of:
**sz (size)**
The size modifier is one of:
============= ===== =====================
size modifier value description
============= ===== =====================
BPF_W 0x00 word (4 bytes)
BPF_H 0x08 half word (2 bytes)
BPF_B 0x10 byte
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================
==== ===== =====================
size value description
==== ===== =====================
W 0 word (4 bytes)
H 1 half word (2 bytes)
B 2 byte
DW 3 double word (8 bytes)
==== ===== =====================
Instructions using ``DW`` belong to the base64 conformance group.
**class**
The instruction class (see `Instruction classes`_)
Regular load and store operations
---------------------------------
The ``BPF_MEM`` mode modifier is used to encode regular load and store
The ``MEM`` mode modifier is used to encode regular load and store
instructions that transfer data between a register and memory.
``BPF_MEM | <size> | BPF_STX`` means::
``{MEM, <size>, STX}`` means::
*(size *) (dst + offset) = src
``BPF_MEM | <size> | BPF_ST`` means::
``{MEM, <size>, ST}`` means::
*(size *) (dst + offset) = imm32
*(size *) (dst + offset) = imm
``BPF_MEM | <size> | BPF_LDX`` means::
``{MEM, <size>, LDX}`` means::
dst = *(unsigned size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW`` and
'unsigned size' is one of u8, u16, u32 or u64.
Where '<size>' is one of: ``B``, ``H``, ``W``, or ``DW``, and
'unsigned size' is one of: u8, u16, u32, or u64.
Sign-extension load operations
------------------------------
The ``BPF_MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
The ``MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
instructions that transfer data between a register and memory.
``BPF_MEMSX | <size> | BPF_LDX`` means::
``{MEMSX, <size>, LDX}`` means::
dst = *(signed size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H`` or ``BPF_W``, and
'signed size' is one of s8, s16 or s32.
Where size is one of: ``B``, ``H``, or ``W``, and
'signed size' is one of: s8, s16, or s32.
Atomic operations
-----------------
@ -497,10 +591,12 @@ interrupted or corrupted by other access to the same memory region
by other BPF programs or means outside of this specification.
All atomic operations supported by BPF are encoded as store operations
that use the ``BPF_ATOMIC`` mode modifier as follows:
that use the ``ATOMIC`` mode modifier as follows:
* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
* ``{ATOMIC, W, STX}`` for 32-bit operations, which are
part of the "atomic32" conformance group.
* ``{ATOMIC, DW, STX}`` for 64-bit operations, which are
part of the "atomic64" conformance group.
* 8-bit and 16-bit wide atomic operations are not supported.
The 'imm' field is used to encode the actual atomic operation.
@ -510,18 +606,18 @@ arithmetic operations in the 'imm' field to encode the atomic operation:
======== ===== ===========
imm value description
======== ===== ===========
BPF_ADD 0x00 atomic add
BPF_OR 0x40 atomic or
BPF_AND 0x50 atomic and
BPF_XOR 0xa0 atomic xor
ADD 0x00 atomic add
OR 0x40 atomic or
AND 0x50 atomic and
XOR 0xa0 atomic xor
======== ===== ===========
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
``{ATOMIC, W, STX}`` with 'imm' = ADD means::
*(u32 *)(dst + offset) += src
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
``{ATOMIC, DW, STX}`` with 'imm' = ADD means::
*(u64 *)(dst + offset) += src
@ -531,20 +627,20 @@ two complex atomic operations:
=========== ================ ===========================
imm value description
=========== ================ ===========================
BPF_FETCH 0x01 modifier: return old value
BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
FETCH 0x01 modifier: return old value
XCHG 0xe0 | FETCH atomic exchange
CMPXCHG 0xf0 | FETCH atomic compare and exchange
=========== ================ ===========================
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
The ``FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``FETCH`` flag
is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
The ``XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
The ``CMPXCHG`` operation atomically compares the value addressed by
``dst + offset`` with ``R0``. If they match, the value addressed by
``dst + offset`` is replaced with ``src``. In either case, the
value that was at ``dst + offset`` before the operation is zero-extended
@ -553,25 +649,25 @@ and loaded back to ``R0``.
64-bit immediate instructions
-----------------------------
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src' field of the
Instructions with the ``IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src_reg' field of the
basic instruction to hold an opcode subtype.
The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
with opcode subtypes in the 'src' field, using new terms such as "map"
The following table defines a set of ``{IMM, DW, LD}`` instructions
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
defined further below:
========================= ====== === ========================================= =========== ==============
opcode construction opcode src pseudocode imm type dst type
========================= ====== === ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = imm64 integer integer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
========================= ====== === ========================================= =========== ==============
======= ========================================= =========== ==============
src_reg pseudocode imm type dst type
======= ========================================= =========== ==============
0x0 dst = (next_imm << 32) | imm integer integer
0x1 dst = map_by_fd(imm) map fd map
0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
0x3 dst = var_addr(imm) variable id data pointer
0x4 dst = code_addr(imm) integer code pointer
0x5 dst = map_by_idx(imm) map index map
0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
======= ========================================= =========== ==============
where
@ -609,5 +705,9 @@ Legacy BPF Packet access instructions
-------------------------------------
BPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. However, these instructions are
deprecated and should no longer be used.
carried over from classic BPF. These instructions used an instruction
class of ``LD``, a size modifier of ``W``, ``H``, or ``B``, and a
mode modifier of ``ABS`` or ``IND``. The 'dst_reg' and 'offset' fields were
set to zero, and 'src_reg' was set to zero for ``ABS``. However, these
instructions are deprecated and should no longer be used. All legacy packet
access instructions belong to the "packet" conformance group.

View File

@ -562,7 +562,7 @@ works::
* ``checkpoint[0].r1`` is marked as read;
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a
by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
read mark and all other registers and stack slots are marked as ``NOT_INIT``
or ``STACK_INVALID``

View File

@ -259,9 +259,21 @@ Contributing new tests (details)
TEST_PROGS_EXTENDED, TEST_GEN_PROGS_EXTENDED mean it is the
executable which is not tested by default.
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
test.
TEST_INCLUDES is similar to TEST_FILES, it lists files which should be
included when exporting or installing the tests, with the following
differences:
* symlinks to files in other directories are preserved
* the part of paths below tools/testing/selftests/ is preserved when
copying the files to the output directory
TEST_INCLUDES is meant to list dependencies located in other directories of
the selftests hierarchy.
* First use the headers inside the kernel source and/or git repo, and then the
system headers. Headers for the kernel release as opposed to headers
installed by the distro on the system should be the primary focus to be able

View File

@ -200,6 +200,18 @@ properties:
#trigger-source-cells property in the source node.
$ref: /schemas/types.yaml#/definitions/phandle-array
active-low:
type: boolean
description:
Makes LED active low. To turn the LED ON, line needs to be
set to low voltage instead of high.
inactive-high-impedance:
type: boolean
description:
Set LED to high-impedance mode to turn the LED OFF. LED might also
describe this mode as tristate.
# Required properties for flash LED child nodes:
flash-max-microamp:
description:

View File

@ -52,10 +52,6 @@ patternProperties:
maxItems: 1
description: LED pin number
active-low:
type: boolean
description: Makes LED active low
required:
- reg

View File

@ -78,10 +78,6 @@ patternProperties:
- maximum: 23
description: LED pin number (only LEDs 0 to 23 are valid).
active-low:
type: boolean
description: Makes LED active low.
brcm,hardware-controlled:
type: boolean
description: Makes this LED hardware controlled.

View File

@ -25,8 +25,6 @@ LED sub-node required properties:
LED sub-node optional properties:
- label : see Documentation/devicetree/bindings/leds/common.txt
- active-low : Boolean, makes LED active low.
Default : false
- default-state : see
Documentation/devicetree/bindings/leds/common.txt
- linux,default-trigger : see

View File

@ -41,9 +41,7 @@ properties:
pwm-names: true
active-low:
description: For PWMs where the LED is wired to supply rather than ground.
type: boolean
active-low: true
color: true

View File

@ -34,11 +34,6 @@ patternProperties:
Maximum brightness possible for the LED
$ref: /schemas/types.yaml#/definitions/uint32
active-low:
description:
For PWMs where the LED is wired to supply rather than ground.
type: boolean
required:
- pwms
- max-brightness

View File

@ -15,6 +15,10 @@ description: Broadcom Ethernet controller first introduced with 72165
properties:
compatible:
oneOf:
- items:
- enum:
- brcm,bcm74165b0-asp
- const: brcm,asp-v2.2
- items:
- enum:
- brcm,bcm74165-asp

View File

@ -24,6 +24,7 @@ properties:
- brcm,genet-mdio-v5
- brcm,asp-v2.0-mdio
- brcm,asp-v2.1-mdio
- brcm,asp-v2.2-mdio
- brcm,unimac-mdio
reg:

View File

@ -28,6 +28,8 @@ Optional properties:
available with tcan4552/4553.
- device-wake-gpios: Wake up GPIO to wake up the TCAN device. Not
available with tcan4552/4553.
- wakeup-source: Leave the chip running when suspended, and configure
the RX interrupt to wake up the device.
Example:
tcan4x5x: tcan4x5x@0 {
@ -42,4 +44,5 @@ tcan4x5x: tcan4x5x@0 {
device-state-gpios = <&gpio3 21 GPIO_ACTIVE_HIGH>;
device-wake-gpios = <&gpio1 15 GPIO_ACTIVE_HIGH>;
reset-gpios = <&gpio1 27 GPIO_ACTIVE_HIGH>;
wakeup-source;
};

View File

@ -49,6 +49,10 @@ properties:
resets:
maxItems: 1
xlnx,has-ecc:
$ref: /schemas/types.yaml#/definitions/flag
description: CAN TX_OL, TX_TL and RX FIFOs have ECC support(AXI CAN)
required:
- compatible
- reg
@ -137,6 +141,7 @@ examples:
interrupts = <GIC_SPI 59 IRQ_TYPE_EDGE_RISING>;
tx-fifo-depth = <0x40>;
rx-fifo-depth = <0x40>;
xlnx,has-ecc;
};
- |

View File

@ -59,6 +59,11 @@ properties:
- cdns,gem # Generic
- cdns,macb # Generic
- items:
- enum:
- microchip,sam9x7-gem # Microchip SAM9X7 gigabit ethernet interface
- const: microchip,sama7g5-gem # Microchip SAMA7G5 gigabit ethernet interface
reg:
minItems: 1
items:

View File

@ -1,147 +0,0 @@
Atheros AR9331 built-in switch
=============================
It is a switch built-in to Atheros AR9331 WiSoC and addressable over internal
MDIO bus. All PHYs are built-in as well.
Required properties:
- compatible: should be: "qca,ar9331-switch"
- reg: Address on the MII bus for the switch.
- resets : Must contain an entry for each entry in reset-names.
- reset-names : Must include the following entries: "switch"
- interrupt-parent: Phandle to the parent interrupt controller
- interrupts: IRQ line for the switch
- interrupt-controller: Indicates the switch is itself an interrupt
controller. This is used for the PHY interrupts.
- #interrupt-cells: must be 1
- mdio: Container of PHY and devices on the switches MDIO bus.
See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional
required and optional properties.
Examples:
eth0: ethernet@19000000 {
compatible = "qca,ar9330-eth";
reg = <0x19000000 0x200>;
interrupts = <4>;
resets = <&rst 9>, <&rst 22>;
reset-names = "mac", "mdio";
clocks = <&pll ATH79_CLK_AHB>, <&pll ATH79_CLK_AHB>;
clock-names = "eth", "mdio";
phy-mode = "mii";
phy-handle = <&phy_port4>;
};
eth1: ethernet@1a000000 {
compatible = "qca,ar9330-eth";
reg = <0x1a000000 0x200>;
interrupts = <5>;
resets = <&rst 13>, <&rst 23>;
reset-names = "mac", "mdio";
clocks = <&pll ATH79_CLK_AHB>, <&pll ATH79_CLK_AHB>;
clock-names = "eth", "mdio";
phy-mode = "gmii";
fixed-link {
speed = <1000>;
full-duplex;
};
mdio {
#address-cells = <1>;
#size-cells = <0>;
switch10: switch@10 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "qca,ar9331-switch";
reg = <0x10>;
resets = <&rst 8>;
reset-names = "switch";
interrupt-parent = <&miscintc>;
interrupts = <12>;
interrupt-controller;
#interrupt-cells = <1>;
ports {
#address-cells = <1>;
#size-cells = <0>;
switch_port0: port@0 {
reg = <0x0>;
ethernet = <&eth1>;
phy-mode = "gmii";
fixed-link {
speed = <1000>;
full-duplex;
};
};
switch_port1: port@1 {
reg = <0x1>;
phy-handle = <&phy_port0>;
phy-mode = "internal";
};
switch_port2: port@2 {
reg = <0x2>;
phy-handle = <&phy_port1>;
phy-mode = "internal";
};
switch_port3: port@3 {
reg = <0x3>;
phy-handle = <&phy_port2>;
phy-mode = "internal";
};
switch_port4: port@4 {
reg = <0x4>;
phy-handle = <&phy_port3>;
phy-mode = "internal";
};
};
mdio {
#address-cells = <1>;
#size-cells = <0>;
interrupt-parent = <&switch10>;
phy_port0: phy@0 {
reg = <0x0>;
interrupts = <0>;
};
phy_port1: phy@1 {
reg = <0x1>;
interrupts = <0>;
};
phy_port2: phy@2 {
reg = <0x2>;
interrupts = <0>;
};
phy_port3: phy@3 {
reg = <0x3>;
interrupts = <0>;
};
phy_port4: phy@4 {
reg = <0x4>;
interrupts = <0>;
};
};
};
};
};

View File

@ -31,6 +31,7 @@ properties:
- microchip,ksz9893
- microchip,ksz9563
- microchip,ksz8563
- microchip,ksz8567
reset-gpios:
description:

View File

@ -0,0 +1,161 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/dsa/qca,ar9331.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Atheros AR9331 built-in switch
maintainers:
- Oleksij Rempel <o.rempel@pengutronix.de>
description:
Qualcomm Atheros AR9331 is a switch built-in to Atheros AR9331 WiSoC and
addressable over internal MDIO bus. All PHYs are built-in as well.
properties:
compatible:
const: qca,ar9331-switch
reg:
maxItems: 1
interrupts:
maxItems: 1
interrupt-controller: true
'#interrupt-cells':
const: 1
mdio:
$ref: /schemas/net/mdio.yaml#
unevaluatedProperties: false
properties:
interrupt-parent: true
patternProperties:
'(ethernet-)?phy@[0-4]+$':
type: object
unevaluatedProperties: false
properties:
reg: true
interrupts:
maxItems: 1
resets:
maxItems: 1
reset-names:
items:
- const: switch
required:
- compatible
- reg
- interrupts
- interrupt-controller
- '#interrupt-cells'
- mdio
- ports
- resets
- reset-names
allOf:
- $ref: dsa.yaml#/$defs/ethernet-ports
unevaluatedProperties: false
examples:
- |
mdio {
#address-cells = <1>;
#size-cells = <0>;
switch10: switch@10 {
compatible = "qca,ar9331-switch";
reg = <0x10>;
interrupt-parent = <&miscintc>;
interrupts = <12>;
interrupt-controller;
#interrupt-cells = <1>;
resets = <&rst 8>;
reset-names = "switch";
ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0x0>;
ethernet = <&eth1>;
phy-mode = "gmii";
fixed-link {
speed = <1000>;
full-duplex;
};
};
port@1 {
reg = <0x1>;
phy-handle = <&phy_port0>;
phy-mode = "internal";
};
port@2 {
reg = <0x2>;
phy-handle = <&phy_port1>;
phy-mode = "internal";
};
port@3 {
reg = <0x3>;
phy-handle = <&phy_port2>;
phy-mode = "internal";
};
port@4 {
reg = <0x4>;
phy-handle = <&phy_port3>;
phy-mode = "internal";
};
};
mdio {
#address-cells = <1>;
#size-cells = <0>;
interrupt-parent = <&switch10>;
phy_port0: ethernet-phy@0 {
reg = <0x0>;
interrupts = <0>;
};
phy_port1: ethernet-phy@1 {
reg = <0x1>;
interrupts = <0>;
};
phy_port2: ethernet-phy@2 {
reg = <0x2>;
interrupts = <0>;
};
phy_port3: ethernet-phy@3 {
reg = <0x3>;
interrupts = <0>;
};
phy_port4: ethernet-phy@4 {
reg = <0x4>;
interrupts = <0>;
};
};
};
};

View File

@ -59,6 +59,9 @@ properties:
description: GPIO to be used to reset the whole device
maxItems: 1
resets:
maxItems: 1
realtek,disable-leds:
type: boolean
description: |
@ -127,7 +130,6 @@ else:
- mdc-gpios
- mdio-gpios
- mdio
- reset-gpios
required:
- compatible

View File

@ -14,7 +14,6 @@ properties:
pattern: "^ethernet(@.*)?$"
label:
$ref: /schemas/types.yaml#/definitions/string
description: Human readable label on a port of a box.
local-mac-address:

View File

@ -0,0 +1,52 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/ethernet-phy-package.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Ethernet PHY Package Common Properties
maintainers:
- Christian Marangi <ansuelsmth@gmail.com>
description:
PHY packages are multi-port Ethernet PHY of the same family
and each Ethernet PHY is affected by the global configuration
of the PHY package.
Each reg of the PHYs defined in the PHY package node is
absolute and describe the real address of the Ethernet PHY on
the MDIO bus.
properties:
$nodename:
pattern: "^ethernet-phy-package@[a-f0-9]+$"
reg:
minimum: 0
maximum: 31
description:
The base ID number for the PHY package.
Commonly the ID of the first PHY in the PHY package.
Some PHY in the PHY package might be not defined but
still occupy ID on the device (just not attached to
anything) hence the PHY package reg might correspond
to a not attached PHY (offset 0).
'#address-cells':
const: 1
'#size-cells':
const: 0
patternProperties:
^ethernet-phy@[a-f0-9]+$:
$ref: ethernet-phy.yaml#
required:
- reg
- '#address-cells'
- '#size-cells'
additionalProperties: true

View File

@ -224,6 +224,9 @@ properties:
Can be omitted thus no delay is observed. Delay is in range of 1ms to 1000ms.
Other delays are invalid.
iommus:
maxItems: 1
required:
- compatible
- reg

View File

@ -73,7 +73,7 @@ examples:
#include <dt-bindings/gpio/gpio.h>
#include <dt-bindings/interrupt-controller/irq.h>
i2c {
spi {
#address-cells = <1>;
#size-cells = <0>;

View File

@ -0,0 +1,54 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/qca,qca808x.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Atheros QCA808X PHY
maintainers:
- Christian Marangi <ansuelsmth@gmail.com>
description:
QCA808X PHYs can have up to 3 LEDs attached.
All 3 LEDs are disabled by default.
2 LEDs have dedicated pins with the 3rd LED having the
double function of Interrupt LEDs/GPIO or additional LED.
By default this special PIN is set to LED function.
allOf:
- $ref: ethernet-phy.yaml#
properties:
compatible:
enum:
- ethernet-phy-id004d.d101
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/leds/common.h>
mdio {
#address-cells = <1>;
#size-cells = <0>;
ethernet-phy@0 {
compatible = "ethernet-phy-id004d.d101";
reg = <0>;
leds {
#address-cells = <1>;
#size-cells = <0>;
led@0 {
reg = <0>;
color = <LED_COLOR_ID_GREEN>;
function = LED_FUNCTION_WAN;
default-state = "keep";
};
};
};
};

View File

@ -37,12 +37,14 @@ properties:
items:
- description: Combined signal for various interrupt events
- description: The interrupt that occurs when Rx exits the LPI state
- description: The interrupt that occurs when HW safety error triggered
interrupt-names:
minItems: 1
items:
- const: macirq
- const: eth_lpi
- enum: [eth_lpi, sfty]
- const: sfty
clocks:
maxItems: 4
@ -89,8 +91,9 @@ examples:
<&gcc GCC_ETH_PTP_CLK>,
<&gcc GCC_ETH_RGMII_CLK>;
interrupts = <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "macirq", "eth_lpi";
<GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 782 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "macirq", "eth_lpi", "sfty";
rx-fifo-depth = <4096>;
tx-fifo-depth = <4096>;

View File

@ -159,7 +159,7 @@ properties:
when the AP (not the modem) performs early initialization.
firmware-name:
$ref: /schemas/types.yaml#/definitions/string
maxItems: 1
description:
If present, name (or relative path) of the file within the
firmware search path containing the firmware image used when

View File

@ -44,6 +44,21 @@ properties:
items:
- const: gcc_mdio_ahb_clk
clock-frequency:
description:
The MDIO bus clock that must be output by the MDIO bus hardware, if
absent, the default hardware values are used.
MDC rate is feed by an external clock (fixed 100MHz) and is divider
internally. The default divider is /256 resulting in the default rate
applied of 390KHz.
To follow 802.3 standard that instruct up to 2.5MHz by default, if
this property is not declared and the divider is set to /256, by
default 1.5625Mhz is select.
enum: [ 390625, 781250, 1562500, 3125000, 6250000, 12500000 ]
default: 1562500
required:
- compatible
- reg

View File

@ -0,0 +1,184 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/qcom,qca807x.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm QCA807x Ethernet PHY
maintainers:
- Christian Marangi <ansuelsmth@gmail.com>
- Robert Marko <robert.marko@sartura.hr>
description: |
Qualcomm QCA8072/5 Ethernet PHY is PHY package of 2 or 5
IEEE 802.3 clause 22 compliant 10BASE-Te, 100BASE-TX and
1000BASE-T PHY-s.
They feature 2 SerDes, one for PSGMII or QSGMII connection with
MAC, while second one is SGMII for connection to MAC or fiber.
Both models have a combo port that supports 1000BASE-X and
100BASE-FX fiber.
Each PHY inside of QCA807x series has 4 digitally controlled
output only pins that natively drive LED-s for up to 2 attached
LEDs. Some vendor also use these 4 output for GPIO usage without
attaching LEDs.
Note that output pins can be set to drive LEDs OR GPIO, mixed
definition are not accepted.
$ref: ethernet-phy-package.yaml#
properties:
compatible:
enum:
- qcom,qca8072-package
- qcom,qca8075-package
qcom,package-mode:
description: |
PHY package can be configured in 3 mode following this table:
First Serdes mode Second Serdes mode
Option 1 PSGMII for copper Disabled
ports 0-4
Option 2 PSGMII for copper 1000BASE-X / 100BASE-FX
ports 0-4
Option 3 QSGMII for copper SGMII for
ports 0-3 copper port 4
PSGMII mode (option 1 or 2) is configured dynamically based on
the presence of a connected SFP device.
$ref: /schemas/types.yaml#/definitions/string
enum:
- qsgmii
- psgmii
default: psgmii
qcom,tx-drive-strength-milliwatt:
description: set the TX Amplifier value in mv.
$ref: /schemas/types.yaml#/definitions/uint32
enum: [140, 160, 180, 200, 220,
240, 260, 280, 300, 320,
400, 500, 600]
default: 600
patternProperties:
^ethernet-phy@[a-f0-9]+$:
$ref: ethernet-phy.yaml#
properties:
qcom,dac-full-amplitude:
description:
Set Analog MDI driver amplitude to FULL.
With this not defined, amplitude is set to DSP.
(amplitude is adjusted based on cable length)
With this enabled and qcom,dac-full-bias-current
and qcom,dac-disable-bias-current-tweak disabled,
bias current is half.
type: boolean
qcom,dac-full-bias-current:
description:
Set Analog MDI driver bias current to FULL.
With this not defined, bias current is set to DSP.
(bias current is adjusted based on cable length)
Actual bias current might be different with
qcom,dac-disable-bias-current-tweak disabled.
type: boolean
qcom,dac-disable-bias-current-tweak:
description: |
Set Analog MDI driver bias current to disable tweak
to bias current.
With this not defined, bias current tweak are enabled
by default.
With this enabled the following tweak are NOT applied:
- With both FULL amplitude and FULL bias current: bias current
is set to half.
- With only DSP amplitude: bias current is set to half and
is set to 1/4 with cable < 10m.
- With DSP bias current (included both DSP amplitude and
DSP bias current): bias current is half the detected current
with cable < 10m.
type: boolean
gpio-controller: true
'#gpio-cells':
const: 2
if:
required:
- gpio-controller
then:
properties:
leds: false
unevaluatedProperties: false
required:
- compatible
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/leds/common.h>
mdio {
#address-cells = <1>;
#size-cells = <0>;
ethernet-phy-package@0 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "qcom,qca8075-package";
reg = <0>;
qcom,package-mode = "qsgmii";
ethernet-phy@0 {
reg = <0>;
leds {
#address-cells = <1>;
#size-cells = <0>;
led@0 {
reg = <0>;
color = <LED_COLOR_ID_GREEN>;
function = LED_FUNCTION_LAN;
default-state = "keep";
};
};
};
ethernet-phy@1 {
reg = <1>;
};
ethernet-phy@2 {
reg = <2>;
gpio-controller;
#gpio-cells = <2>;
};
ethernet-phy@3 {
reg = <3>;
};
ethernet-phy@4 {
reg = <4>;
};
};
};

View File

@ -46,6 +46,7 @@ properties:
- enum:
- renesas,etheravb-r8a779a0 # R-Car V3U
- renesas,etheravb-r8a779g0 # R-Car V4H
- renesas,etheravb-r8a779h0 # R-Car V4M
- const: renesas,etheravb-rcar-gen4 # R-Car Gen4
- items:

View File

@ -95,6 +95,7 @@ properties:
- snps,dwmac-5.20
- snps,dwxgmac
- snps,dwxgmac-2.10
- starfive,jh7100-dwmac
- starfive,jh7110-dwmac
reg:
@ -107,13 +108,15 @@ properties:
- description: Combined signal for various interrupt events
- description: The interrupt to manage the remote wake-up packet detection
- description: The interrupt that occurs when Rx exits the LPI state
- description: The interrupt that occurs when HW safety error triggered
interrupt-names:
minItems: 1
items:
- const: macirq
- enum: [eth_wake_irq, eth_lpi]
- const: eth_lpi
- enum: [eth_wake_irq, eth_lpi, sfty]
- enum: [eth_wake_irq, eth_lpi, sfty]
- enum: [eth_wake_irq, eth_lpi, sfty]
clocks:
minItems: 1
@ -144,10 +147,12 @@ properties:
- description: AHB reset
reset-names:
minItems: 1
items:
- const: stmmaceth
- const: ahb
oneOf:
- items:
- enum: [stmmaceth, ahb]
- items:
- const: stmmaceth
- const: ahb
power-domains:
maxItems: 1

View File

@ -16,16 +16,20 @@ select:
compatible:
contains:
enum:
- starfive,jh7100-dwmac
- starfive,jh7110-dwmac
required:
- compatible
properties:
compatible:
items:
- enum:
- starfive,jh7110-dwmac
- const: snps,dwmac-5.20
oneOf:
- items:
- const: starfive,jh7100-dwmac
- const: snps,dwmac
- items:
- const: starfive,jh7110-dwmac
- const: snps,dwmac-5.20
reg:
maxItems: 1
@ -46,24 +50,6 @@ properties:
- const: tx
- const: gtx
interrupts:
minItems: 3
maxItems: 3
interrupt-names:
minItems: 3
maxItems: 3
resets:
items:
- description: MAC Reset signal.
- description: AHB Reset signal.
reset-names:
items:
- const: stmmaceth
- const: ahb
starfive,tx-use-rgmii-clk:
description:
Tx clock is provided by external rgmii clock.
@ -94,6 +80,48 @@ required:
allOf:
- $ref: snps,dwmac.yaml#
- if:
properties:
compatible:
contains:
const: starfive,jh7100-dwmac
then:
properties:
interrupts:
minItems: 2
maxItems: 2
interrupt-names:
minItems: 2
maxItems: 2
resets:
maxItems: 1
reset-names:
const: ahb
- if:
properties:
compatible:
contains:
const: starfive,jh7110-dwmac
then:
properties:
interrupts:
minItems: 3
maxItems: 3
interrupt-names:
minItems: 3
maxItems: 3
resets:
minItems: 2
reset-names:
minItems: 2
unevaluatedProperties: false
examples:

View File

@ -7,8 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: TI SoC Ethernet Switch Controller (CPSW)
maintainers:
- Grygorii Strashko <grygorii.strashko@ti.com>
- Sekhar Nori <nsekhar@ti.com>
- Siddharth Vadapalli <s-vadapalli@ti.com>
- Ravi Gunasekaran <r-gunasekaran@ti.com>
- Roger Quadros <rogerq@kernel.org>
description:
The 3-port switch gigabit ethernet subsystem provides ethernet packet

View File

@ -62,6 +62,40 @@ properties:
for the PHY. The internal delay for the PHY is fixed to 3.5ns relative
to transmit data.
ti,cfg-dac-minus-one-bp:
description: |
DP83826 PHY only.
Sets the voltage ratio (with respect to the nominal value)
of the logical level -1 for the MLT-3 encoded TX data.
enum: [5000, 5625, 6250, 6875, 7500, 8125, 8750, 9375, 10000,
10625, 11250, 11875, 12500, 13125, 13750, 14375, 15000]
default: 10000
ti,cfg-dac-plus-one-bp:
description: |
DP83826 PHY only.
Sets the voltage ratio (with respect to the nominal value)
of the logical level +1 for the MLT-3 encoded TX data.
enum: [5000, 5625, 6250, 6875, 7500, 8125, 8750, 9375, 10000,
10625, 11250, 11875, 12500, 13125, 13750, 14375, 15000]
default: 10000
ti,rmii-mode:
description: |
If present, select the RMII operation mode. Two modes are
available:
- RMII master, where the PHY outputs a 50MHz reference clock which can
be connected to the MAC.
- RMII slave, where the PHY expects a 50MHz reference clock input
shared with the MAC.
The RMII operation mode can also be configured by its straps.
If the strap pin is not set correctly or not set at all, then this can be
used to configure it.
$ref: /schemas/types.yaml#/definitions/string
enum:
- master
- slave
required:
- reg

View File

@ -7,8 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: The TI AM654x/J721E/AM642x SoC Gigabit Ethernet MAC (Media Access Controller)
maintainers:
- Grygorii Strashko <grygorii.strashko@ti.com>
- Sekhar Nori <nsekhar@ti.com>
- Siddharth Vadapalli <s-vadapalli@ti.com>
- Ravi Gunasekaran <r-gunasekaran@ti.com>
- Roger Quadros <rogerq@kernel.org>
description:
The TI AM654x/J721E SoC Gigabit Ethernet MAC (CPSW2G NUSS) has two ports

View File

@ -7,8 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: The TI AM654x/J721E Common Platform Time Sync (CPTS) module
maintainers:
- Grygorii Strashko <grygorii.strashko@ti.com>
- Sekhar Nori <nsekhar@ti.com>
- Siddharth Vadapalli <s-vadapalli@ti.com>
- Ravi Gunasekaran <r-gunasekaran@ti.com>
- Roger Quadros <rogerq@kernel.org>
description: |+
The TI AM654x/J721E CPTS module is used to facilitate host control of time

View File

@ -19,9 +19,6 @@ description: |
Alternatively, it can specify the wireless part of the MT7628/MT7688
or MT7622/MT7986 SoC.
allOf:
- $ref: ieee80211.yaml#
properties:
compatible:
enum:
@ -38,7 +35,12 @@ properties:
MT7986 should contain 3 regions consys, dcm, and sku, in this order.
interrupts:
maxItems: 1
minItems: 1
items:
- description: major interrupt for rings
- description: additional interrupt for ring 19
- description: additional interrupt for ring 4
- description: additional interrupt for ring 5
power-domains:
maxItems: 1
@ -217,6 +219,24 @@ required:
- compatible
- reg
allOf:
- $ref: ieee80211.yaml#
- if:
properties:
compatible:
contains:
enum:
- mediatek,mt7981-wmac
- mediatek,mt7986-wmac
then:
properties:
interrupts:
minItems: 4
else:
properties:
interrupts:
maxItems: 1
unevaluatedProperties: false
examples:
@ -293,7 +313,10 @@ examples:
reg = <0x18000000 0x1000000>,
<0x10003000 0x1000>,
<0x11d10000 0x1000>;
interrupts = <GIC_SPI 213 IRQ_TYPE_LEVEL_HIGH>;
interrupts = <GIC_SPI 213 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 214 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 215 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 216 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&topckgen 50>,
<&topckgen 62>;
clock-names = "mcu", "ap2conn";

View File

@ -8,6 +8,7 @@ title: Qualcomm Technologies ath10k wireless devices
maintainers:
- Kalle Valo <kvalo@kernel.org>
- Jeff Johnson <jjohnson@kernel.org>
description:
Qualcomm Technologies, Inc. IEEE 802.11ac devices.

View File

@ -9,6 +9,7 @@ title: Qualcomm Technologies ath11k wireless devices (PCIe)
maintainers:
- Kalle Valo <kvalo@kernel.org>
- Jeff Johnson <jjohnson@kernel.org>
description: |
Qualcomm Technologies IEEE 802.11ax PCIe devices

View File

@ -9,6 +9,7 @@ title: Qualcomm Technologies ath11k wireless devices
maintainers:
- Kalle Valo <kvalo@kernel.org>
- Jeff Johnson <jjohnson@kernel.org>
description: |
These are dt entries for Qualcomm Technologies, Inc. IEEE 802.11ax

View File

@ -11,7 +11,7 @@ $defs:
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
pattern: ^[0-9A-Za-z_-]+( - 1)?$
minimum: 0
len-or-limit:
# literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
@ -126,8 +126,9 @@ properties:
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
description: |
Name for the enum type of the attribute, if empty no name will be used.
type: [ string, "null" ]
doc:
description: Documentation of the space.
type: string
@ -208,6 +209,11 @@ properties:
exact-len:
description: Exact length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
unterminated-ok:
description: |
For string attributes, do not check whether attribute
contains the terminating null character.
type: boolean
sub-type: *attr-type
display-hint: &display-hint
description: |
@ -261,14 +267,16 @@ properties:
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
list:
description: List of commands
type: array
@ -370,3 +378,22 @@ properties:
type: string
# End genetlink-c
flags: *cmd_flags
kernel-family:
description: Additional global attributes used for kernel C code generation.
type: object
additionalProperties: False
properties:
headers:
description: |
List of extra headers which should be included in the source
of the generated code.
type: array
items:
type: string
sock-priv:
description: |
Literal name of the type which is used within the kernel
to store the socket state. The type / structure is internal
to the kernel, and is not defined in the spec.
type: string

View File

@ -11,7 +11,7 @@ $defs:
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
pattern: ^[0-9A-Za-z_-]+( - 1)?$
minimum: 0
len-or-limit:
# literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
@ -168,8 +168,9 @@ properties:
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
description: |
Name for the enum type of the attribute, if empty no name will be used.
type: [ string, "null" ]
doc:
description: Documentation of the space.
type: string
@ -251,6 +252,11 @@ properties:
exact-len:
description: Exact length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
unterminated-ok:
description: |
For string attributes, do not check whether attribute
contains the terminating null character.
type: boolean
sub-type: *attr-type
display-hint: *display-hint
# Start genetlink-c
@ -304,14 +310,16 @@ properties:
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
# Start genetlink-legacy
fixed-header: &fixed-header
description: |
@ -431,3 +439,22 @@ properties:
type: string
# End genetlink-c
flags: *cmd_flags
kernel-family:
description: Additional global attributes used for kernel C code generation.
type: object
additionalProperties: False
properties:
headers:
description: |
List of extra headers which should be included in the source
of the generated code.
type: array
items:
type: string
sock-priv:
description: |
Literal name of the type which is used within the kernel
to store the socket state. The type / structure is internal
to the kernel, and is not defined in the spec.
type: string

View File

@ -11,7 +11,7 @@ $defs:
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
pattern: ^[0-9A-Za-z_-]+( - 1)?$
minimum: 0
len-or-limit:
# literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
@ -328,3 +328,22 @@ properties:
The name for the group, used to form the define and the value of the define.
type: string
flags: *cmd_flags
kernel-family:
description: Additional global attributes used for kernel C code generation.
type: object
additionalProperties: False
properties:
headers:
description: |
List of extra headers which should be included in the source
of the generated code.
type: array
items:
type: string
sock-priv:
description: |
Literal name of the type which is used within the kernel
to store the socket state. The type / structure is internal
to the kernel, and is not defined in the spec.
type: string

View File

@ -11,7 +11,7 @@ $defs:
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
pattern: ^[0-9A-Za-z_-]+( - 1)?$
minimum: 0
# Schema for specs
@ -152,14 +152,23 @@ properties:
the right formatting mechanism when displaying values of this
type.
enum: [ hex, mac, fddi, ipv4, ipv6, uuid ]
struct:
description: Name of the nested struct type.
type: string
if:
properties:
type:
oneOf:
- const: binary
- const: pad
const: pad
then:
required: [ len ]
if:
properties:
type:
const: binary
then:
oneOf:
- required: [ len ]
- required: [ struct ]
# End genetlink-legacy
attribute-sets:
@ -180,8 +189,9 @@ properties:
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
description: |
Name for the enum type of the attribute, if empty no name will be used.
type: [ string, "null" ]
doc:
description: Documentation of the space.
type: string
@ -261,6 +271,11 @@ properties:
exact-len:
description: Exact length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
unterminated-ok:
description: |
For string attributes, do not check whether attribute
contains the terminating null character.
type: boolean
sub-type: *attr-type
display-hint: *display-hint
# Start genetlink-c
@ -362,14 +377,16 @@ properties:
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
description: |
Name for the enum type with commands, if empty no name will be used.
type: [ string, "null" ]
# Start genetlink-legacy
fixed-header: &fixed-header
description: |

View File

@ -290,7 +290,7 @@ attribute-sets:
enum: eswitch-mode
-
name: eswitch-inline-mode
type: u16
type: u8
enum: eswitch-inline-mode
-
name: dpipe-tables

View File

@ -51,6 +51,40 @@ definitions:
if dpll lock-state was not DPLL_LOCK_STATUS_LOCKED_HO_ACQ, the
dpll's lock-state shall remain DPLL_LOCK_STATUS_UNLOCKED)
render-max: true
-
type: enum
name: lock-status-error
doc: |
if previous status change was done due to a failure, this provides
information of dpll device lock status error.
Valid values for DPLL_A_LOCK_STATUS_ERROR attribute
entries:
-
name: none
doc: |
dpll device lock status was changed without any error
value: 1
-
name: undefined
doc: |
dpll device lock status was changed due to undefined error.
Driver fills this value up in case it is not able
to obtain suitable exact error type.
-
name: media-down
doc: |
dpll device lock status was changed because of associated
media got down.
This may happen for example if dpll device was previously
locked on an input pin of type PIN_TYPE_SYNCE_ETH_PORT.
-
name: fractional-frequency-offset-too-high
doc: |
the FFO (Fractional Frequency Offset) between the RX and TX
symbol rate on the media got too high.
This may happen for example if dpll device was previously
locked on an input pin of type PIN_TYPE_SYNCE_ETH_PORT.
render-max: true
-
type: const
name: temp-divider
@ -214,6 +248,10 @@ attribute-sets:
name: type
type: u32
enum: type
-
name: lock-status-error
type: u32
enum: lock-status-error
-
name: pin
enum-name: dpll_a_pin
@ -274,6 +312,7 @@ attribute-sets:
-
name: capabilities
type: u32
enum: pin-capabilities
-
name: parent-device
type: nest
@ -379,6 +418,7 @@ operations:
- mode
- mode-supported
- lock-status
- lock-status-error
- temp
- clock-id
- type

View File

@ -292,13 +292,14 @@ operations:
-
name: get-addr
doc: Get endpoint information
attribute-set: endpoint
attribute-set: attr
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
do: &get-addr-attrs
request:
attributes:
- addr
- token
reply:
attributes:
- addr

View File

@ -74,6 +74,10 @@ definitions:
name: queue-type
type: enum
entries: [ rx, tx ]
-
name: qstats-scope
type: flags
entries: [ queue ]
attribute-sets:
-
@ -265,6 +269,73 @@ attribute-sets:
doc: ID of the NAPI instance which services this queue.
type: u32
-
name: qstats
doc: |
Get device statistics, scoped to a device or a queue.
These statistics extend (and partially duplicate) statistics available
in struct rtnl_link_stats64.
Value of the `scope` attribute determines how statistics are
aggregated. When aggregated for the entire device the statistics
represent the total number of events since last explicit reset of
the device (i.e. not a reconfiguration like changing queue count).
When reported per-queue, however, the statistics may not add
up to the total number of events, will only be reported for currently
active objects, and will likely report the number of events since last
reconfiguration.
attributes:
-
name: ifindex
doc: ifindex of the netdevice to which stats belong.
type: u32
checks:
min: 1
-
name: queue-type
doc: Queue type as rx, tx, for queue-id.
type: u32
enum: queue-type
-
name: queue-id
doc: Queue ID, if stats are scoped to a single queue instance.
type: u32
-
name: scope
doc: |
What object type should be used to iterate over the stats.
type: uint
enum: qstats-scope
-
name: rx-packets
doc: |
Number of wire packets successfully received and passed to the stack.
For drivers supporting XDP, XDP is considered the first layer
of the stack, so packets consumed by XDP are still counted here.
type: uint
value: 8 # reserve some attr ids in case we need more metadata later
-
name: rx-bytes
doc: Successfully received bytes, see `rx-packets`.
type: uint
-
name: tx-packets
doc: |
Number of wire packets successfully sent. Packet is considered to be
successfully sent once it is in device memory (usually this means
the device has issued a DMA completion for the packet).
type: uint
-
name: tx-bytes
doc: Successfully sent bytes, see `tx-packets`.
type: uint
-
name: rx-alloc-fail
doc: |
Number of times skb or buffer allocation failed on the Rx datapath.
Allocation failure may, or may not result in a packet drop, depending
on driver implementation and whether system recovers quickly.
type: uint
operations:
list:
-
@ -405,6 +476,26 @@ operations:
attributes:
- ifindex
reply: *napi-get-op
-
name: qstats-get
doc: |
Get / dump fine grained statistics. Which statistics are reported
depends on the device and the driver, and whether the driver stores
software counters per-queue.
attribute-set: qstats
dump:
request:
attributes:
- scope
reply:
attributes:
- ifindex
- queue-type
- queue-id
- rx-packets
- rx-bytes
- tx-packets
- tx-bytes
mcast-groups:
list:

View File

@ -0,0 +1,206 @@
# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
name: nlctrl
protocol: genetlink-legacy
uapi-header: linux/genetlink.h
doc: |
genetlink meta-family that exposes information about all genetlink
families registered in the kernel (including itself).
definitions:
-
name: op-flags
type: flags
enum-name:
entries:
- admin-perm
- cmd-cap-do
- cmd-cap-dump
- cmd-cap-haspol
- uns-admin-perm
-
name: attr-type
enum-name: netlink-attribute-type
type: enum
entries:
- invalid
- flag
- u8
- u16
- u32
- u64
- s8
- s16
- s32
- s64
- binary
- string
- nul-string
- nested
- nested-array
- bitfield32
- sint
- uint
attribute-sets:
-
name: ctrl-attrs
name-prefix: ctrl-attr-
attributes:
-
name: family-id
type: u16
-
name: family-name
type: string
-
name: version
type: u32
-
name: hdrsize
type: u32
-
name: maxattr
type: u32
-
name: ops
type: array-nest
nested-attributes: op-attrs
-
name: mcast-groups
type: array-nest
nested-attributes: mcast-group-attrs
-
name: policy
type: nest-type-value
type-value: [ policy-id, attr-id ]
nested-attributes: policy-attrs
-
name: op-policy
type: nest-type-value
type-value: [ op-id ]
nested-attributes: op-policy-attrs
-
name: op
type: u32
-
name: mcast-group-attrs
name-prefix: ctrl-attr-mcast-grp-
enum-name:
attributes:
-
name: name
type: string
-
name: id
type: u32
-
name: op-attrs
name-prefix: ctrl-attr-op-
enum-name:
attributes:
-
name: id
type: u32
-
name: flags
type: u32
enum: op-flags
enum-as-flags: true
-
name: policy-attrs
name-prefix: nl-policy-type-attr-
enum-name:
attributes:
-
name: type
type: u32
enum: attr-type
-
name: min-value-s
type: s64
-
name: max-value-s
type: s64
-
name: min-value-u
type: u64
-
name: max-value-u
type: u64
-
name: min-length
type: u32
-
name: max-length
type: u32
-
name: policy-idx
type: u32
-
name: policy-maxtype
type: u32
-
name: bitfield32-mask
type: u32
-
name: mask
type: u64
-
name: pad
type: pad
-
name: op-policy-attrs
name-prefix: ctrl-attr-policy-
enum-name:
attributes:
-
name: do
type: u32
-
name: dump
type: u32
operations:
enum-model: directional
name-prefix: ctrl-cmd-
list:
-
name: getfamily
doc: Get / dump genetlink families
attribute-set: ctrl-attrs
do:
request:
value: 3
attributes:
- family-name
reply: &all-attrs
value: 1
attributes:
- family-id
- family-name
- hdrsize
- maxattr
- mcast-groups
- ops
- version
dump:
reply: *all-attrs
-
name: getpolicy
doc: Get / dump genetlink policies
attribute-set: ctrl-attrs
dump:
request:
value: 10
attributes:
- family-name
- family-id
- op
reply:
value: 10
attributes:
- family-id
- op-policy
- policy

File diff suppressed because it is too large Load Diff

View File

@ -329,23 +329,24 @@ XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.
There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.
In this case, it is possible to use the NIC's packet steering
capabilities to steer the packets to the right queue. This is not
possible in the previous example as there is only one queue shared
among sockets, so the NIC cannot do this steering as it can only steer
between queues.
In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.
In libxdp (or libbpf prior to version 1.0), you need to use the
xsk_socket__create_shared() API as it takes a reference to a FILL ring
and a COMPLETION ring that will be created for you and bound to the
shared UMEM. You can use this function for all the sockets you create,
or you can use it for the second and following ones and use
xsk_socket__create() for the first one. Both methods yield the same
result.
Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.
devices at the same time. It is also possible to redirect to any
socket as long as it is bound to the same umem with XDP_SHARED_UMEM.
XDP_USE_NEED_WAKEUP bind flag
-----------------------------
@ -822,6 +823,10 @@ A: The short answer is no, that is not supported at the moment. The
switch, or other distribution mechanism, in your NIC to direct
traffic to the correct queue id and socket.
Note that if you are using the XDP_SHARED_UMEM option, it is
possible to switch traffic between any socket bound to the same
umem.
Q: My packets are sometimes corrupted. What is wrong?
A: Care has to be taken not to feed the same buffer in the UMEM into

View File

@ -444,6 +444,18 @@ arp_missed_max
The default value is 2, and the allowable range is 1 - 255.
coupled_control
Specifies whether the LACP state machine's MUX in the 802.3ad mode
should have separate Collecting and Distributing states.
This is by implementing the independent control state machine per
IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control
state machine.
The default value is 1. This setting does not separate the Collecting
and Distributing states, maintaining the bond in coupled control.
downdelay
Specifies the time, in milliseconds, to wait before disabling

View File

@ -444,6 +444,24 @@ definitions are specified for CAN specific MTUs in include/linux/can.h:
#define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame
Returned Message Flags
----------------------
When using the system call recvmsg(2) on a RAW or a BCM socket, the
msg->msg_flags field may contain the following flags:
MSG_DONTROUTE:
set when the received frame was created on the local host.
MSG_CONFIRM:
set when the frame was sent via the socket it is received on.
This flag can be interpreted as a 'transmission confirmation' when the
CAN driver supports the echo of frames on driver level, see
:ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
(Note: In order to receive such messages on a RAW socket,
CAN_RAW_RECV_OWN_MSGS must be set.)
.. _socketcan-raw-sockets:
RAW Protocol Sockets with can_filters (SOCK_RAW)
@ -693,22 +711,6 @@ where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or
CAN ID ranges from the incoming traffic.
RAW Socket Returned Message Flags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When using recvmsg() call, the msg->msg_flags may contain following flags:
MSG_DONTROUTE:
set when the received frame was created on the local host.
MSG_CONFIRM:
set when the frame was sent via the socket it is received on.
This flag can be interpreted as a 'transmission confirmation' when the
CAN driver supports the echo of frames on driver level, see
:ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
Broadcast Manager Protocol Sockets (SOCK_DGRAM)
-----------------------------------------------

View File

@ -211,10 +211,16 @@ Documentation/networking/net_dim.rst
RX copybreak
============
The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.
This option controls the maximum packet length for which the RX
descriptor it was received on would be recycled. When a packet smaller
than RX copybreak bytes is received, it is copied into a new memory
buffer and the RX descriptor is returned to HW.
Statistics
==========

View File

@ -42,6 +42,7 @@ Contents:
intel/ice
marvell/octeontx2
marvell/octeon_ep
marvell/octeon_ep_vf
mellanox/mlx5/index
microsoft/netvsc
neterion/s2io

View File

@ -368,15 +368,28 @@ more options for Receive Side Scaling (RSS) hash byte configuration.
# ethtool -N <ethX> rx-flow-hash <type> <option>
Where <type> is:
tcp4 signifying TCP over IPv4
udp4 signifying UDP over IPv4
tcp6 signifying TCP over IPv6
udp6 signifying UDP over IPv6
tcp4 signifying TCP over IPv4
udp4 signifying UDP over IPv4
gtpc4 signifying GTP-C over IPv4
gtpc4t signifying GTP-C (include TEID) over IPv4
gtpu4 signifying GTP-U over IPV4
gtpu4e signifying GTP-U and Extension Header over IPV4
gtpu4u signifying GTP-U PSC Uplink over IPV4
gtpu4d signifying GTP-U PSC Downlink over IPV4
tcp6 signifying TCP over IPv6
udp6 signifying UDP over IPv6
gtpc6 signifying GTP-C over IPv6
gtpc6t signifying GTP-C (include TEID) over IPv6
gtpu6 signifying GTP-U over IPV6
gtpu6e signifying GTP-U and Extension Header over IPV6
gtpu6u signifying GTP-U PSC Uplink over IPV6
gtpu6d signifying GTP-U PSC Downlink over IPV6
And <option> is one or more of:
s Hash on the IP source address of the Rx packet.
d Hash on the IP destination address of the Rx packet.
f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
e Hash on GTP Packet on TEID (4bytes) of the Rx packet.
Accelerated Receive Flow Steering (aRFS)

View File

@ -0,0 +1,24 @@
.. SPDX-License-Identifier: GPL-2.0+
=======================================================================
Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC VF
=======================================================================
Network driver for Marvell's Octeon PCI EndPoint NIC VF.
Copyright (c) 2020 Marvell International Ltd.
Overview
========
This driver implements networking functionality of Marvell's Octeon PCI
EndPoint NIC VF.
Supported Devices
=================
Currently, this driver support following devices:
* Network controller: Cavium, Inc. Device b203
* Network controller: Cavium, Inc. Device b403
* Network controller: Cavium, Inc. Device b103
* Network controller: Cavium, Inc. Device b903
* Network controller: Cavium, Inc. Device ba03
* Network controller: Cavium, Inc. Device bc03
* Network controller: Cavium, Inc. Device bd03

View File

@ -39,6 +39,34 @@ command and receive response:
- open the AT control channel using a UART tool or a special user tool
Sysfs
=====
The driver provides sysfs interfaces to userspace.
t7xx_mode
---------
The sysfs interface provides userspace with access to the device mode, this interface
supports read and write operations.
Device mode:
- ``unknown`` represents that device in unknown status
- ``ready`` represents that device in ready status
- ``reset`` represents that device in reset status
- ``fastboot_switching`` represents that device in fastboot switching status
- ``fastboot_download`` represents that device in fastboot download status
- ``fastboot_dump`` represents that device in fastboot dump status
Read from userspace to get the current device mode.
::
$ cat /sys/bus/pci/devices/${bdf}/t7xx_mode
Write from userspace to set the device mode.
::
$ echo fastboot_switching > /sys/bus/pci/devices/${bdf}/t7xx_mode
Management application development
==================================
The driver and userspace interfaces are described below. The MBIM protocol is
@ -97,6 +125,20 @@ The driver exposes an AT port by implementing AT WWAN Port.
The userspace end of the control port is a /dev/wwan0at0 character
device. Application shall use this interface to issue AT commands.
fastboot port userspace ABI
---------------------------
/dev/wwan0fastboot0 character device
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The driver exposes a fastboot protocol interface by implementing
fastboot WWAN Port. The userspace end of the fastboot channel pipe is a
/dev/wwan0fastboot0 character device. Application shall use this interface for
fastboot protocol communication.
Please note that driver needs to be reloaded to export /dev/wwan0fastboot0
port, because device needs a cold reset after enter ``fastboot_switching``
mode.
The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
References
@ -118,3 +160,7 @@ speak the Mobile Interface Broadband Model (MBIM) protocol"*
[4] *Specification # 27.007 - 3GPP*
- https://www.3gpp.org/DynaReport/27007.htm
[5] *fastboot "a mechanism for communicating with bootloaders"*
- https://android.googlesource.com/platform/system/core/+/refs/heads/main/fastboot/README.md

View File

@ -97,6 +97,10 @@ parameters.
When metadata is disabled, the above use cases will fail to initialize if
users try to enable them.
Note: Setting this parameter does not take effect immediately. Setting
must happen in legacy mode and eswitch port metadata takes effect after
enabling switchdev mode.
* - ``hairpin_num_queues``
- u32
- driverinit
@ -246,7 +250,7 @@ them in realtime.
Description of the vnic counters:
- total_q_under_processor_handle
- total_error_queues
number of queues in an error state due to
an async error or errored command.
- send_queue_priority_update_flow
@ -255,7 +259,8 @@ Description of the vnic counters:
number of times CQ entered an error state due to an overflow.
- async_eq_overrun
number of times an EQ mapped to async events was overrun.
comp_eq_overrun number of times an EQ mapped to completion events was
- comp_eq_overrun
number of times an EQ mapped to completion events was
overrun.
- quota_exceeded_command
number of commands issued and failed due to quota exceeded.

View File

@ -74,6 +74,7 @@ Contents:
mpls-sysctl
mptcp-sysctl
multiqueue
multi-pf-netdev
napi
net_cachelines/index
netconsole

View File

@ -2503,7 +2503,7 @@ use_tempaddr - INTEGER
temp_valid_lft - INTEGER
valid lifetime (in seconds) for temporary addresses. If less than the
minimum required lifetime (typically 5 seconds), temporary addresses
minimum required lifetime (typically 5-7 seconds), temporary addresses
will not be created.
Default: 172800 (2 days)
@ -2511,7 +2511,7 @@ temp_valid_lft - INTEGER
temp_prefered_lft - INTEGER
Preferred lifetime (in seconds) for temporary addresses. If
temp_prefered_lft is less than the minimum required lifetime (typically
5 seconds), temporary addresses will not be created. If
5-7 seconds), the preferred lifetime is the minimum required. If
temp_prefered_lft is greater than temp_valid_lft, the preferred lifetime
is temp_valid_lft.
@ -2535,6 +2535,16 @@ max_desync_factor - INTEGER
Default: 600
regen_min_advance - INTEGER
How far in advance (in seconds), at minimum, to create a new temporary
address before the current one is deprecated. This value is added to
the amount of time that may be required for duplicate address detection
to determine when to create a new address. Linux permits setting this
value to less than the default of 2 seconds, but a value less than 2
does not conform to RFC 8981.
Default: 2
regen_max_retry - INTEGER
Number of attempts before give up attempting to generate
valid temporary addresses.

View File

@ -386,12 +386,19 @@ Sample userspace code:
- Create session PPPoX data socket::
struct sockaddr_pppol2tp sax;
int fd;
/* Note, the tunnel socket must be bound already, else it
* will not be ready
/* Input: the L2TP tunnel UDP socket `tunnel_fd`, which needs to be
* bound already (both sockname and peername), otherwise it will not be
* ready.
*/
struct sockaddr_pppol2tp sax;
int session_fd;
int ret;
session_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
if (session_fd < 0)
return -errno;
sax.sa_family = AF_PPPOX;
sax.sa_protocol = PX_PROTO_OL2TP;
sax.pppol2tp.fd = tunnel_fd;
@ -406,12 +413,128 @@ Sample userspace code:
/* session_fd is the fd of the session's PPPoL2TP socket.
* tunnel_fd is the fd of the tunnel UDP / L2TPIP socket.
*/
fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
if (fd < 0 ) {
ret = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
if (ret < 0 ) {
close(session_fd);
return -errno;
}
return session_fd;
L2TP control packets will still be available for read on `tunnel_fd`.
- Create PPP channel::
/* Input: the session PPPoX data socket `session_fd` which was created
* as described above.
*/
int ppp_chan_fd;
int chindx;
int ret;
ret = ioctl(session_fd, PPPIOCGCHAN, &chindx);
if (ret < 0)
return -errno;
ppp_chan_fd = open("/dev/ppp", O_RDWR);
if (ppp_chan_fd < 0)
return -errno;
ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx);
if (ret < 0) {
close(ppp_chan_fd);
return -errno;
}
return ppp_chan_fd;
LCP PPP frames will be available for read on `ppp_chan_fd`.
- Create PPP interface::
/* Input: the PPP channel `ppp_chan_fd` which was created as described
* above.
*/
int ifunit = -1;
int ppp_if_fd;
int ret;
ppp_if_fd = open("/dev/ppp", O_RDWR);
if (ppp_if_fd < 0)
return -errno;
ret = ioctl(ppp_if_fd, PPPIOCNEWUNIT, &ifunit);
if (ret < 0) {
close(ppp_if_fd);
return -errno;
}
ret = ioctl(ppp_chan_fd, PPPIOCCONNECT, &ifunit);
if (ret < 0) {
close(ppp_if_fd);
return -errno;
}
return ppp_if_fd;
IPCP/IPv6CP PPP frames will be available for read on `ppp_if_fd`.
The ppp<ifunit> interface can then be configured as usual with netlink's
RTM_NEWLINK, RTM_NEWADDR, RTM_NEWROUTE, or ioctl's SIOCSIFMTU, SIOCSIFADDR,
SIOCSIFDSTADDR, SIOCSIFNETMASK, SIOCSIFFLAGS, or with the `ip` command.
- Bridging L2TP sessions which have PPP pseudowire types (this is also called
L2TP tunnel switching or L2TP multihop) is supported by bridging the PPP
channels of the two L2TP sessions to be bridged::
/* Input: the session PPPoX data sockets `session_fd1` and `session_fd2`
* which were created as described further above.
*/
int ppp_chan_fd;
int chindx1;
int chindx2;
int ret;
ret = ioctl(session_fd1, PPPIOCGCHAN, &chindx1);
if (ret < 0)
return -errno;
ret = ioctl(session_fd2, PPPIOCGCHAN, &chindx2);
if (ret < 0)
return -errno;
ppp_chan_fd = open("/dev/ppp", O_RDWR);
if (ppp_chan_fd < 0)
return -errno;
ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx1);
if (ret < 0) {
close(ppp_chan_fd);
return -errno;
}
ret = ioctl(ppp_chan_fd, PPPIOCBRIDGECHAN, &chindx2);
close(ppp_chan_fd);
if (ret < 0)
return -errno;
return 0;
It can be noted that when bridging PPP channels, the PPP session is not locally
terminated, and no local PPP interface is created. PPP frames arriving on one
channel are directly passed to the other channel, and vice versa.
The PPP channel does not need to be kept open. Only the session PPPoX data
sockets need to be kept open.
More generally, it is also possible in the same way to e.g. bridge a PPPoL2TP
PPP channel with other types of PPP channels, such as PPPoE.
See more details for the PPP side in ppp_generic.rst.
Old L2TPv2-only API
-------------------

View File

@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===============
Multi-PF Netdev
===============
Contents
========
- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_
Background
==========
The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. Through either a connection harness that
splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. This
results in eliminating the network traffic traversing over the internal bus between the sockets,
significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing
network throughput.
Overview
========
The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of
proximity to the device and achieve improved performance.
mlx5 implementation
===================
Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
NIC and has the socket-direct property enabled, once all PFs are probed, we create a single netdev
to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
The netdev network channels are distributed between all devices, a proper configuration would utilize
the correct close NUMA node when working on a certain app/CPU.
We pick one PF to be a primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.
Currently, we limit the support to PFs only, and up to two PFs (sockets).
Channels distribution
=====================
We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.
Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.
::
Example for 2 PFs and 5 channels:
+--------+--------+
| ch idx | PF idx |
+--------+--------+
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
+--------+--------+
The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats are persistent across channel's closure, changing the mapping every single time
would turn the accumulative stats less representing of the channel's history.
This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".
Observability
=============
The relation between PF, irq, napi, and queue can be observed via netlink spec:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
{'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
{'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
{'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
{'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
{'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
{'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
{'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
{'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
[{'id': 543, 'ifindex': 13, 'irq': 42},
{'id': 542, 'ifindex': 13, 'irq': 41},
{'id': 541, 'ifindex': 13, 'irq': 40},
{'id': 540, 'ifindex': 13, 'irq': 39},
{'id': 539, 'ifindex': 13, 'irq': 36}]
Here you can clearly observe our channels distribution policy:
$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
Steering
========
Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. Still maintain a single default RSS table,
that is capable of pointing to the receive queues of a different PF.
In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.
In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
PF on the same node as the CPU.
XPS default config example:
NUMA node(s): 2
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
PF0 on node0, PF1 on node1.
- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
Mutually exclusive features
===========================
The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.

View File

@ -15,6 +15,8 @@ Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023
Userdata append support by Matthew Wood <thepacketgeek@gmail.com>, Jan 22 2024
Please send bug reports to Matt Mackall <mpm@selenic.com>
Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
@ -171,6 +173,70 @@ You can modify these targets in runtime by creating the following targets::
cat cmdline1/remote_ip
10.0.0.3
Append User Data
----------------
Custom user data can be appended to the end of messages with netconsole
dynamic configuration enabled. User data entries can be modified without
changing the "enabled" attribute of a target.
Directories (keys) under `userdata` are limited to 53 character length, and
data in `userdata/<key>/value` are limited to 200 bytes::
cd /sys/kernel/config/netconsole && mkdir cmdline0
cd cmdline0
mkdir userdata/foo
echo bar > userdata/foo/value
mkdir userdata/qux
echo baz > userdata/qux/value
Messages will now include this additional user data::
echo "This is a message" > /dev/kmsg
Sends::
12,607,22085407756,-;This is a message
foo=bar
qux=baz
Preview the userdata that will be appended with::
cd /sys/kernel/config/netconsole/cmdline0/userdata
for f in `ls userdata`; do echo $f=$(cat userdata/$f/value); done
If a `userdata` entry is created but no data is written to the `value` file,
the entry will be omitted from netconsole messages::
cd /sys/kernel/config/netconsole && mkdir cmdline0
cd cmdline0
mkdir userdata/foo
echo bar > userdata/foo/value
mkdir userdata/qux
The `qux` key is omitted since it has no value::
echo "This is a message" > /dev/kmsg
12,607,22085407756,-;This is a message
foo=bar
Delete `userdata` entries with `rmdir`::
rmdir /sys/kernel/config/netconsole/cmdline0/userdata/qux
.. warning::
When writing strings to user data values, input is broken up per line in
configfs store calls and this can cause confusing behavior::
mkdir userdata/testing
printf "val1\nval2" > userdata/testing/value
# userdata store value is called twice, first with "val1\n" then "val2"
# so "val2" is stored, being the last value stored
cat userdata/testing/value
val2
It is recommended to not write user data values with newlines.
Extended console:
=================

View File

@ -252,8 +252,8 @@ ndo_eth_ioctl:
Context: process
ndo_get_stats:
Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
Context: atomic (can't sleep under rwlock or RCU)
Synchronization: rtnl_lock() semaphore, or RCU.
Context: atomic (can't sleep under RCU)
ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.

View File

@ -231,16 +231,136 @@ this documentation.
For further information on these methods, please see the inline
documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
9. Remove calls to of_parse_phandle() for the PHY,
of_phy_register_fixed_link() for fixed links etc. from the probe
function, and replace with:
9. Fill-in the :c:type:`struct phylink_config <phylink_config>` fields with
a reference to the :c:type:`struct device <device>` associated to your
:c:type:`struct net_device <net_device>`:
.. code-block:: c
struct phylink *phylink;
priv->phylink_config.dev = &dev.dev;
priv->phylink_config.type = PHYLINK_NETDEV;
Fill-in the various speeds, pause and duplex modes your MAC can handle:
.. code-block:: c
priv->phylink_config.mac_capabilities = MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000FD;
10. Some Ethernet controllers work in pair with a PCS (Physical Coding Sublayer)
block, that can handle among other things the encoding/decoding, link
establishment detection and autonegotiation. While some MACs have internal
PCS whose operation is transparent, some other require dedicated PCS
configuration for the link to become functional. In that case, phylink
provides a PCS abstraction through :c:type:`struct phylink_pcs <phylink_pcs>`.
Identify if your driver has one or more internal PCS blocks, and/or if
your controller can use an external PCS block that might be internally
connected to your controller.
If your controller doesn't have any internal PCS, you can go to step 11.
If your Ethernet controller contains one or several PCS blocks, create
one :c:type:`struct phylink_pcs <phylink_pcs>` instance per PCS block within
your driver's private data structure:
.. code-block:: c
struct phylink_pcs pcs;
Populate the relevant :c:type:`struct phylink_pcs_ops <phylink_pcs_ops>` to
configure your PCS. Create a :c:func:`pcs_get_state` function that reports
the inband link state, a :c:func:`pcs_config` function to configure your
PCS according to phylink-provided parameters, and a :c:func:`pcs_validate`
function that report to phylink all accepted configuration parameters for
your PCS:
.. code-block:: c
struct phylink_pcs_ops foo_pcs_ops = {
.pcs_validate = foo_pcs_validate,
.pcs_get_state = foo_pcs_get_state,
.pcs_config = foo_pcs_config,
};
Arrange for PCS link state interrupts to be forwarded into
phylink, via:
.. code-block:: c
phylink_pcs_change(pcs, link_is_up);
where ``link_is_up`` is true if the link is currently up or false
otherwise. If a PCS is unable to provide these interrupts, then
it should set ``pcs->pcs_poll = true;`` when creating the PCS.
11. If your controller relies on, or accepts the presence of an external PCS
controlled through its own driver, add a pointer to a phylink_pcs instance
in your driver private data structure:
.. code-block:: c
struct phylink_pcs *pcs;
The way of getting an instance of the actual PCS depends on the platform,
some PCS sit on an MDIO bus and are grabbed by passing a pointer to the
corresponding :c:type:`struct mii_bus <mii_bus>` and the PCS's address on
that bus. In this example, we assume the controller attaches to a Lynx PCS
instance:
.. code-block:: c
priv->pcs = lynx_pcs_create_mdiodev(bus, 0);
Some PCS can be recovered based on firmware information:
.. code-block:: c
priv->pcs = lynx_pcs_create_fwnode(of_fwnode_handle(node));
12. Populate the :c:func:`mac_select_pcs` callback and add it to your
:c:type:`struct phylink_mac_ops <phylink_mac_ops>` set of ops. This function
must return a pointer to the relevant :c:type:`struct phylink_pcs <phylink_pcs>`
that will be used for the requested link configuration:
.. code-block:: c
static struct phylink_pcs *foo_select_pcs(struct phylink_config *config,
phy_interface_t interface)
{
struct foo_priv *priv = container_of(config, struct foo_priv,
phylink_config);
if ( /* 'interface' needs a PCS to function */ )
return priv->pcs;
return NULL;
}
See :c:func:`mvpp2_select_pcs` for an example of a driver that has multiple
internal PCS.
13. Fill-in all the :c:type:`phy_interface_t <phy_interface_t>` (i.e. all MAC to
PHY link modes) that your MAC can output. The following example shows a
configuration for a MAC that can handle all RGMII modes, SGMII and 1000BaseX.
You must adjust these according to what your MAC and all PCS associated
with this MAC are capable of, and not just the interface you wish to use:
.. code-block:: c
phy_interface_set_rgmii(priv->phylink_config.supported_interfaces);
__set_bit(PHY_INTERFACE_MODE_SGMII,
priv->phylink_config.supported_interfaces);
__set_bit(PHY_INTERFACE_MODE_1000BASEX,
priv->phylink_config.supported_interfaces);
14. Remove calls to of_parse_phandle() for the PHY,
of_phy_register_fixed_link() for fixed links etc. from the probe
function, and replace with:
.. code-block:: c
struct phylink *phylink;
phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops);
if (IS_ERR(phylink)) {
err = PTR_ERR(phylink);
@ -249,14 +369,14 @@ this documentation.
priv->phylink = phylink;
and arrange to destroy the phylink in the probe failure path as
appropriate and the removal path too by calling:
and arrange to destroy the phylink in the probe failure path as
appropriate and the removal path too by calling:
.. code-block:: c
.. code-block:: c
phylink_destroy(priv->phylink);
10. Arrange for MAC link state interrupts to be forwarded into
15. Arrange for MAC link state interrupts to be forwarded into
phylink, via:
.. code-block:: c
@ -264,17 +384,16 @@ this documentation.
phylink_mac_change(priv->phylink, link_is_up);
where ``link_is_up`` is true if the link is currently up or false
otherwise. If a MAC is unable to provide these interrupts, then
it should set ``priv->phylink_config.pcs_poll = true;`` in step 9.
otherwise.
11. Verify that the driver does not call::
16. Verify that the driver does not call::
netif_carrier_on()
netif_carrier_off()
as these will interfere with phylink's tracking of the link state,
and cause phylink to omit calls via the :c:func:`mac_link_up` and
:c:func:`mac_link_down` methods.
as these will interfere with phylink's tracking of the link state,
and cause phylink to omit calls via the :c:func:`mac_link_up` and
:c:func:`mac_link_down` methods.
Network drivers should call phylink_stop() and phylink_start() via their
suspend/resume paths, which ensures that the appropriate

View File

@ -41,6 +41,15 @@ If `-s` is specified once the detailed errors won't be shown.
`ip` supports JSON formatting via the `-j` option.
Queue statistics
~~~~~~~~~~~~~~~~
Queue statistics are accessible via the netdev netlink family.
Currently no widely distributed CLI exists to access those statistics.
Kernel development tools (ynl) can be used to experiment with them,
see `Documentation/userspace-api/netlink/intro-specs.rst`.
Protocol-specific statistics
----------------------------
@ -147,6 +156,12 @@ Statistics are reported both in the responses to link information
requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
netdev (netlink)
~~~~~~~~~~~~~~~~
`netdev` generic netlink family allows accessing page pool and per queue
statistics.
ethtool
-------

View File

@ -71,9 +71,9 @@ Callbacks to implement
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
void (*xdo_dev_state_update_stats) (struct xfrm_state *x);
/* Solely packet offload callbacks */
void (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack);
void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
void (*xdo_dev_policy_free) (struct xfrm_policy *x);
@ -191,6 +191,6 @@ xdo_dev_policy_free() on any remaining offloaded states.
Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
The HW/driver are responsible to perform it and provide accurate data when
xdo_dev_state_update_curlft() is called. In case of one of these limits
xdo_dev_state_update_stats() is called. In case of one of these limits
occuried, the driver needs to call to xfrm_state_check_expire() to make sure
that XFRM performs rekeying sequence.

View File

@ -310,6 +310,7 @@ Code Seq# Include File Comments
0x89 0B-DF linux/sockios.h
0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range
0x89 F0-FF linux/sockios.h SIOCDEVPRIVATE range
0x8A 00-1F linux/eventpoll.h
0x8B all linux/wireless.h
0x8C 00-3F WiNRADiO driver
<http://www.winradio.com.au/>

View File

@ -150,3 +150,45 @@ attributes from an ``attribute-set``. For example the following
Note that a selector attribute must appear in a netlink message before any
sub-message attributes that depend on it.
If an attribute such as ``kind`` is defined at more than one nest level, then a
sub-message selector will be resolved using the value 'closest' to the selector.
For example, if the same attribute name is defined in a nested ``attribute-set``
alongside a sub-message selector and also in a top level ``attribute-set``, then
the selector will be resolved using the value 'closest' to the selector. If the
value is not present in the message at the same level as defined in the spec
then this is an error.
Nested struct definitions
-------------------------
Many raw netlink families such as :doc:`tc<../../networking/netlink_spec/tc>`
make use of nested struct definitions. The ``netlink-raw`` schema makes it
possible to embed a struct within a struct definition using the ``struct``
property. For example, the following struct definition embeds the
``tc-ratespec`` struct definition for both the ``rate`` and the ``peakrate``
members of ``struct tc-tbf-qopt``.
.. code-block:: yaml
-
name: tc-tbf-qopt
type: struct
members:
-
name: rate
type: binary
struct: tc-ratespec
-
name: peakrate
type: binary
struct: tc-ratespec
-
name: limit
type: u32
-
name: buffer
type: u32
-
name: mtu
type: u32

View File

@ -3807,6 +3807,7 @@ M: Alexei Starovoitov <ast@kernel.org>
M: Daniel Borkmann <daniel@iogearbox.net>
M: Andrii Nakryiko <andrii@kernel.org>
R: Martin KaFai Lau <martin.lau@linux.dev>
R: Eduard Zingerman <eddyz87@gmail.com>
R: Song Liu <song@kernel.org>
R: Yonghong Song <yonghong.song@linux.dev>
R: John Fastabend <john.fastabend@gmail.com>
@ -3867,6 +3868,7 @@ F: net/unix/unix_bpf.c
BPF [LIBRARY] (libbpf)
M: Andrii Nakryiko <andrii@kernel.org>
M: Eduard Zingerman <eddyz87@gmail.com>
L: bpf@vger.kernel.org
S: Maintained
F: tools/lib/bpf/
@ -3924,6 +3926,7 @@ F: security/bpf/
BPF [SELFTESTS] (Test Runners & Infrastructure)
M: Andrii Nakryiko <andrii@kernel.org>
M: Eduard Zingerman <eddyz87@gmail.com>
R: Mykola Lysenko <mykolal@fb.com>
L: bpf@vger.kernel.org
S: Maintained
@ -4637,8 +4640,8 @@ S: Maintained
F: net/sched/sch_cake.c
CAN NETWORK DRIVERS
M: Wolfgang Grandegger <wg@grandegger.com>
M: Marc Kleine-Budde <mkl@pengutronix.de>
M: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
L: linux-can@vger.kernel.org
S: Maintained
W: https://github.com/linux-can
@ -7897,6 +7900,13 @@ S: Maintained
F: include/linux/errseq.h
F: lib/errseq.c
ESD CAN NETWORK DRIVERS
M: Stefan Mätje <stefan.maetje@esd.eu>
R: socketcan@esd.eu
L: linux-can@vger.kernel.org
S: Maintained
F: drivers/net/can/esd/
ESD CAN/USB DRIVERS
M: Frank Jungclaus <frank.jungclaus@esd.eu>
R: socketcan@esd.eu
@ -8599,6 +8609,13 @@ F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
F: drivers/soc/fsl/qe/qmc.c
F: include/soc/fsl/qe/qmc.h
FREESCALE QUICC ENGINE QMC HDLC DRIVER
M: Herve Codina <herve.codina@bootlin.com>
L: netdev@vger.kernel.org
L: linuxppc-dev@lists.ozlabs.org
S: Maintained
F: drivers/net/wan/fsl_qmc_hdlc.c
FREESCALE QUICC ENGINE TSA DRIVER
M: Herve Codina <herve.codina@bootlin.com>
L: linuxppc-dev@lists.ozlabs.org
@ -13082,6 +13099,15 @@ L: netdev@vger.kernel.org
S: Supported
F: drivers/net/ethernet/marvell/octeon_ep
MARVELL OCTEON ENDPOINT VF DRIVER
M: Veerasenareddy Burru <vburru@marvell.com>
M: Sathesh Edara <sedara@marvell.com>
M: Shinas Rasheed <srasheed@marvell.com>
M: Satananda Burla <sburla@marvell.com>
L: netdev@vger.kernel.org
S: Supported
F: drivers/net/ethernet/marvell/octeon_ep_vf
MARVELL OCTEONTX2 PHYSICAL FUNCTION DRIVER
M: Sunil Goutham <sgoutham@marvell.com>
M: Geetha sowjanya <gakula@marvell.com>
@ -15120,6 +15146,7 @@ NETDEVSIM
M: Jakub Kicinski <kuba@kernel.org>
S: Maintained
F: drivers/net/netdevsim/*
F: tools/testing/selftests/drivers/net/netdevsim/*
NETEM NETWORK EMULATOR
M: Stephen Hemminger <stephen@networkplumber.org>
@ -18041,6 +18068,13 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
F: Documentation/devicetree/bindings/net/wireless/qca,ath9k.yaml
F: drivers/net/wireless/ath/ath9k/
QUALCOMM ATHEROS QCA7K ETHERNET DRIVER
M: Stefan Wahren <wahrenst@gmx.net>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/qca,qca7000.txt
F: drivers/net/ethernet/qualcomm/qca*
QUALCOMM BAM-DMUX WWAN NETWORK DRIVER
M: Stephan Gerhold <stephan@gerhold.net>
L: netdev@vger.kernel.org
@ -24168,7 +24202,6 @@ F: drivers/net/ethernet/xilinx/xilinx_axienet*
XILINX CAN DRIVER
M: Appana Durga Kedareswara rao <appana.durga.rao@xilinx.com>
R: Naga Sureshkumar Relli <naga.sureshkumar.relli@xilinx.com>
L: linux-can@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/can/xilinx,can.yaml

View File

@ -110,8 +110,8 @@ void __init add_static_vm_early(struct static_vm *svm)
int ioremap_page(unsigned long virt, unsigned long phys,
const struct mem_type *mtype)
{
return ioremap_page_range(virt, virt + PAGE_SIZE, phys,
__pgprot(mtype->prot_pte));
return vmap_page_range(virt, virt + PAGE_SIZE, phys,
__pgprot(mtype->prot_pte));
}
EXPORT_SYMBOL(ioremap_page);
@ -466,8 +466,8 @@ int pci_remap_iospace(const struct resource *res, phys_addr_t phys_addr)
if (res->end > IO_SPACE_LIMIT)
return -EINVAL;
return ioremap_page_range(vaddr, vaddr + resource_size(res), phys_addr,
__pgprot(get_mem_type(pci_ioremap_mem_type)->prot_pte));
return vmap_page_range(vaddr, vaddr + resource_size(res), phys_addr,
__pgprot(get_mem_type(pci_ioremap_mem_type)->prot_pte));
}
EXPORT_SYMBOL(pci_remap_iospace);

View File

@ -8,6 +8,8 @@ int aarch64_insn_read(void *addr, u32 *insnp);
int aarch64_insn_write(void *addr, u32 insn);
int aarch64_insn_write_literal_u64(void *addr, u64 val);
void *aarch64_insn_set(void *dst, u32 insn, size_t len);
void *aarch64_insn_copy(void *dst, void *src, size_t len);
int aarch64_insn_patch_text_nosync(void *addr, u32 insn);
int aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt);

View File

@ -105,6 +105,81 @@ noinstr int aarch64_insn_write_literal_u64(void *addr, u64 val)
return ret;
}
typedef void text_poke_f(void *dst, void *src, size_t patched, size_t len);
static void *__text_poke(text_poke_f func, void *addr, void *src, size_t len)
{
unsigned long flags;
size_t patched = 0;
size_t size;
void *waddr;
void *ptr;
raw_spin_lock_irqsave(&patch_lock, flags);
while (patched < len) {
ptr = addr + patched;
size = min_t(size_t, PAGE_SIZE - offset_in_page(ptr),
len - patched);
waddr = patch_map(ptr, FIX_TEXT_POKE0);
func(waddr, src, patched, size);
patch_unmap(FIX_TEXT_POKE0);
patched += size;
}
raw_spin_unlock_irqrestore(&patch_lock, flags);
flush_icache_range((uintptr_t)addr, (uintptr_t)addr + len);
return addr;
}
static void text_poke_memcpy(void *dst, void *src, size_t patched, size_t len)
{
copy_to_kernel_nofault(dst, src + patched, len);
}
static void text_poke_memset(void *dst, void *src, size_t patched, size_t len)
{
u32 c = *(u32 *)src;
memset32(dst, c, len / 4);
}
/**
* aarch64_insn_copy - Copy instructions into (an unused part of) RX memory
* @dst: address to modify
* @src: source of the copy
* @len: length to copy
*
* Useful for JITs to dump new code blocks into unused regions of RX memory.
*/
noinstr void *aarch64_insn_copy(void *dst, void *src, size_t len)
{
/* A64 instructions must be word aligned */
if ((uintptr_t)dst & 0x3)
return NULL;
return __text_poke(text_poke_memcpy, dst, src, len);
}
/**
* aarch64_insn_set - memset for RX memory regions.
* @dst: address to modify
* @insn: value to set
* @len: length of memory region.
*
* Useful for JITs to fill regions of RX memory with illegal instructions.
*/
noinstr void *aarch64_insn_set(void *dst, u32 insn, size_t len)
{
if ((uintptr_t)dst & 0x3)
return NULL;
return __text_poke(text_poke_memset, dst, &insn, len);
}
int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
u32 *tp = addr;

View File

@ -7,6 +7,7 @@
#include <linux/kernel.h>
#include <linux/efi.h>
#include <linux/export.h>
#include <linux/filter.h>
#include <linux/ftrace.h>
#include <linux/kprobes.h>
#include <linux/sched.h>
@ -266,6 +267,31 @@ noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
kunwind_stack_walk(arch_kunwind_consume_entry, &data, task, regs);
}
struct bpf_unwind_consume_entry_data {
bool (*consume_entry)(void *cookie, u64 ip, u64 sp, u64 fp);
void *cookie;
};
static bool
arch_bpf_unwind_consume_entry(const struct kunwind_state *state, void *cookie)
{
struct bpf_unwind_consume_entry_data *data = cookie;
return data->consume_entry(data->cookie, state->common.pc, 0,
state->common.fp);
}
noinline noinstr void arch_bpf_stack_walk(bool (*consume_entry)(void *cookie, u64 ip, u64 sp,
u64 fp), void *cookie)
{
struct bpf_unwind_consume_entry_data data = {
.consume_entry = consume_entry,
.cookie = cookie,
};
kunwind_stack_walk(arch_bpf_unwind_consume_entry, &data, current, NULL);
}
static bool dump_backtrace_entry(void *arg, unsigned long where)
{
char *loglvl = arg;

View File

@ -76,6 +76,7 @@ struct jit_ctx {
int *offset;
int exentry_idx;
__le32 *image;
__le32 *ro_image;
u32 stack_size;
int fpb_offset;
};
@ -205,6 +206,14 @@ static void jit_fill_hole(void *area, unsigned int size)
*ptr++ = cpu_to_le32(AARCH64_BREAK_FAULT);
}
int bpf_arch_text_invalidate(void *dst, size_t len)
{
if (!aarch64_insn_set(dst, AARCH64_BREAK_FAULT, len))
return -EINVAL;
return 0;
}
static inline int epilogue_offset(const struct jit_ctx *ctx)
{
int to = ctx->epilogue_offset;
@ -285,7 +294,8 @@ static bool is_lsi_offset(int offset, int scale)
/* Tail call offset to jump into */
#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
bool is_exception_cb)
{
const struct bpf_prog *prog = ctx->prog;
const bool is_main_prog = !bpf_is_subprog(prog);
@ -333,19 +343,34 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
emit(A64_NOP, ctx);
/* Sign lr */
if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
emit(A64_PACIASP, ctx);
if (!is_exception_cb) {
/* Sign lr */
if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
emit(A64_PACIASP, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */
emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
emit(A64_MOV(1, A64_FP, A64_SP), ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */
emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
emit(A64_MOV(1, A64_FP, A64_SP), ctx);
/* Save callee-saved registers */
emit(A64_PUSH(r6, r7, A64_SP), ctx);
emit(A64_PUSH(r8, r9, A64_SP), ctx);
emit(A64_PUSH(fp, tcc, A64_SP), ctx);
emit(A64_PUSH(fpb, A64_R(28), A64_SP), ctx);
/* Save callee-saved registers */
emit(A64_PUSH(r6, r7, A64_SP), ctx);
emit(A64_PUSH(r8, r9, A64_SP), ctx);
emit(A64_PUSH(fp, tcc, A64_SP), ctx);
emit(A64_PUSH(fpb, A64_R(28), A64_SP), ctx);
} else {
/*
* Exception callback receives FP of Main Program as third
* parameter
*/
emit(A64_MOV(1, A64_FP, A64_R(2)), ctx);
/*
* Main Program already pushed the frame record and the
* callee-saved registers. The exception callback will not push
* anything and re-use the main program's stack.
*
* 10 registers are on the stack
*/
emit(A64_SUB_I(1, A64_SP, A64_FP, 80), ctx);
}
/* Set up BPF prog stack base register */
emit(A64_MOV(1, fp, A64_SP), ctx);
@ -365,6 +390,20 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
emit_bti(A64_BTI_J, ctx);
}
/*
* Program acting as exception boundary should save all ARM64
* Callee-saved registers as the exception callback needs to recover
* all ARM64 Callee-saved registers in its epilogue.
*/
if (prog->aux->exception_boundary) {
/*
* As we are pushing two more registers, BPF_FP should be moved
* 16 bytes
*/
emit(A64_SUB_I(1, fp, fp, 16), ctx);
emit(A64_PUSH(A64_R(23), A64_R(24), A64_SP), ctx);
}
emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
/* Stack must be multiples of 16B */
@ -653,7 +692,7 @@ static void build_plt(struct jit_ctx *ctx)
plt->target = (u64)&dummy_tramp;
}
static void build_epilogue(struct jit_ctx *ctx)
static void build_epilogue(struct jit_ctx *ctx, bool is_exception_cb)
{
const u8 r0 = bpf2a64[BPF_REG_0];
const u8 r6 = bpf2a64[BPF_REG_6];
@ -666,6 +705,15 @@ static void build_epilogue(struct jit_ctx *ctx)
/* We're done with BPF stack */
emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
/*
* Program acting as exception boundary pushes R23 and R24 in addition
* to BPF callee-saved registers. Exception callback uses the boundary
* program's stack frame, so recover these extra registers in the above
* two cases.
*/
if (ctx->prog->aux->exception_boundary || is_exception_cb)
emit(A64_POP(A64_R(23), A64_R(24), A64_SP), ctx);
/* Restore x27 and x28 */
emit(A64_POP(fpb, A64_R(28), A64_SP), ctx);
/* Restore fs (x25) and x26 */
@ -707,7 +755,8 @@ static int add_exception_handler(const struct bpf_insn *insn,
struct jit_ctx *ctx,
int dst_reg)
{
off_t offset;
off_t ins_offset;
off_t fixup_offset;
unsigned long pc;
struct exception_table_entry *ex;
@ -724,12 +773,17 @@ static int add_exception_handler(const struct bpf_insn *insn,
return -EINVAL;
ex = &ctx->prog->aux->extable[ctx->exentry_idx];
pc = (unsigned long)&ctx->image[ctx->idx - 1];
pc = (unsigned long)&ctx->ro_image[ctx->idx - 1];
offset = pc - (long)&ex->insn;
if (WARN_ON_ONCE(offset >= 0 || offset < INT_MIN))
/*
* This is the relative offset of the instruction that may fault from
* the exception table itself. This will be written to the exception
* table and if this instruction faults, the destination register will
* be set to '0' and the execution will jump to the next instruction.
*/
ins_offset = pc - (long)&ex->insn;
if (WARN_ON_ONCE(ins_offset >= 0 || ins_offset < INT_MIN))
return -ERANGE;
ex->insn = offset;
/*
* Since the extable follows the program, the fixup offset is always
@ -738,12 +792,25 @@ static int add_exception_handler(const struct bpf_insn *insn,
* bits. We don't need to worry about buildtime or runtime sort
* modifying the upper bits because the table is already sorted, and
* isn't part of the main exception table.
*
* The fixup_offset is set to the next instruction from the instruction
* that may fault. The execution will jump to this after handling the
* fault.
*/
offset = (long)&ex->fixup - (pc + AARCH64_INSN_SIZE);
if (!FIELD_FIT(BPF_FIXUP_OFFSET_MASK, offset))
fixup_offset = (long)&ex->fixup - (pc + AARCH64_INSN_SIZE);
if (!FIELD_FIT(BPF_FIXUP_OFFSET_MASK, fixup_offset))
return -ERANGE;
ex->fixup = FIELD_PREP(BPF_FIXUP_OFFSET_MASK, offset) |
/*
* The offsets above have been calculated using the RO buffer but we
* need to use the R/W buffer for writes.
* switch ex to rw buffer for writing.
*/
ex = (void *)ctx->image + ((void *)ex - (void *)ctx->ro_image);
ex->insn = ins_offset;
ex->fixup = FIELD_PREP(BPF_FIXUP_OFFSET_MASK, fixup_offset) |
FIELD_PREP(BPF_FIXUP_REG_MASK, dst_reg);
ex->type = EX_TYPE_BPF;
@ -1511,7 +1578,8 @@ static inline void bpf_flush_icache(void *start, void *end)
struct arm64_jit_data {
struct bpf_binary_header *header;
u8 *image;
u8 *ro_image;
struct bpf_binary_header *ro_header;
struct jit_ctx ctx;
};
@ -1520,12 +1588,14 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
int image_size, prog_size, extable_size, extable_align, extable_offset;
struct bpf_prog *tmp, *orig_prog = prog;
struct bpf_binary_header *header;
struct bpf_binary_header *ro_header;
struct arm64_jit_data *jit_data;
bool was_classic = bpf_prog_was_classic(prog);
bool tmp_blinded = false;
bool extra_pass = false;
struct jit_ctx ctx;
u8 *image_ptr;
u8 *ro_image_ptr;
if (!prog->jit_requested)
return orig_prog;
@ -1552,8 +1622,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
}
if (jit_data->ctx.offset) {
ctx = jit_data->ctx;
image_ptr = jit_data->image;
ro_image_ptr = jit_data->ro_image;
ro_header = jit_data->ro_header;
header = jit_data->header;
image_ptr = (void *)header + ((void *)ro_image_ptr
- (void *)ro_header);
extra_pass = true;
prog_size = sizeof(u32) * ctx.idx;
goto skip_init_ctx;
@ -1575,7 +1648,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
* BPF line info needs ctx->offset[i] to be the offset of
* instruction[i] in jited image, so build prologue first.
*/
if (build_prologue(&ctx, was_classic)) {
if (build_prologue(&ctx, was_classic, prog->aux->exception_cb)) {
prog = orig_prog;
goto out_off;
}
@ -1586,7 +1659,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
}
ctx.epilogue_offset = ctx.idx;
build_epilogue(&ctx);
build_epilogue(&ctx, prog->aux->exception_cb);
build_plt(&ctx);
extable_align = __alignof__(struct exception_table_entry);
@ -1598,63 +1671,81 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
/* also allocate space for plt target */
extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
image_size = extable_offset + extable_size;
header = bpf_jit_binary_alloc(image_size, &image_ptr,
sizeof(u32), jit_fill_hole);
if (header == NULL) {
ro_header = bpf_jit_binary_pack_alloc(image_size, &ro_image_ptr,
sizeof(u32), &header, &image_ptr,
jit_fill_hole);
if (!ro_header) {
prog = orig_prog;
goto out_off;
}
/* 2. Now, the actual pass. */
/*
* Use the image(RW) for writing the JITed instructions. But also save
* the ro_image(RX) for calculating the offsets in the image. The RW
* image will be later copied to the RX image from where the program
* will run. The bpf_jit_binary_pack_finalize() will do this copy in the
* final step.
*/
ctx.image = (__le32 *)image_ptr;
ctx.ro_image = (__le32 *)ro_image_ptr;
if (extable_size)
prog->aux->extable = (void *)image_ptr + extable_offset;
prog->aux->extable = (void *)ro_image_ptr + extable_offset;
skip_init_ctx:
ctx.idx = 0;
ctx.exentry_idx = 0;
build_prologue(&ctx, was_classic);
build_prologue(&ctx, was_classic, prog->aux->exception_cb);
if (build_body(&ctx, extra_pass)) {
bpf_jit_binary_free(header);
prog = orig_prog;
goto out_off;
goto out_free_hdr;
}
build_epilogue(&ctx);
build_epilogue(&ctx, prog->aux->exception_cb);
build_plt(&ctx);
/* 3. Extra pass to validate JITed code. */
if (validate_ctx(&ctx)) {
bpf_jit_binary_free(header);
prog = orig_prog;
goto out_off;
goto out_free_hdr;
}
/* And we're done. */
if (bpf_jit_enable > 1)
bpf_jit_dump(prog->len, prog_size, 2, ctx.image);
bpf_flush_icache(header, ctx.image + ctx.idx);
if (!prog->is_func || extra_pass) {
if (extra_pass && ctx.idx != jit_data->ctx.idx) {
pr_err_once("multi-func JIT bug %d != %d\n",
ctx.idx, jit_data->ctx.idx);
bpf_jit_binary_free(header);
prog->bpf_func = NULL;
prog->jited = 0;
prog->jited_len = 0;
goto out_free_hdr;
}
if (WARN_ON(bpf_jit_binary_pack_finalize(prog, ro_header,
header))) {
/* ro_header has been freed */
ro_header = NULL;
prog = orig_prog;
goto out_off;
}
bpf_jit_binary_lock_ro(header);
/*
* The instructions have now been copied to the ROX region from
* where they will execute. Now the data cache has to be cleaned to
* the PoU and the I-cache has to be invalidated for the VAs.
*/
bpf_flush_icache(ro_header, ctx.ro_image + ctx.idx);
} else {
jit_data->ctx = ctx;
jit_data->image = image_ptr;
jit_data->ro_image = ro_image_ptr;
jit_data->header = header;
jit_data->ro_header = ro_header;
}
prog->bpf_func = (void *)ctx.image;
prog->bpf_func = (void *)ctx.ro_image;
prog->jited = 1;
prog->jited_len = prog_size;
@ -1675,6 +1766,14 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
bpf_jit_prog_release_other(prog, prog == orig_prog ?
tmp : orig_prog);
return prog;
out_free_hdr:
if (header) {
bpf_arch_text_copy(&ro_header->size, &header->size,
sizeof(header->size));
bpf_jit_binary_pack_free(ro_header, header);
}
goto out_off;
}
bool bpf_jit_supports_kfunc_call(void)
@ -1682,6 +1781,13 @@ bool bpf_jit_supports_kfunc_call(void)
return true;
}
void *bpf_arch_text_copy(void *dst, void *src, size_t len)
{
if (!aarch64_insn_copy(dst, src, len))
return ERR_PTR(-EINVAL);
return dst;
}
u64 bpf_jit_alloc_exec_limit(void)
{
return VMALLOC_END - VMALLOC_START;
@ -1970,7 +2076,7 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
/* store return value */
emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
/* reserve a nop for bpf_tramp_image_put */
im->ip_after_call = ctx->image + ctx->idx;
im->ip_after_call = ctx->ro_image + ctx->idx;
emit(A64_NOP, ctx);
}
@ -1985,7 +2091,7 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
run_ctx_off, false);
if (flags & BPF_TRAMP_F_CALL_ORIG) {
im->ip_epilogue = ctx->image + ctx->idx;
im->ip_epilogue = ctx->ro_image + ctx->idx;
emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
emit_call((const u64)__bpf_tramp_exit, ctx);
}
@ -2018,9 +2124,6 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
emit(A64_RET(A64_R(10)), ctx);
}
if (ctx->image)
bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
kfree(branches);
return ctx->idx;
@ -2063,14 +2166,43 @@ int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
return ret < 0 ? ret : ret * AARCH64_INSN_SIZE;
}
int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
void *image_end, const struct btf_func_model *m,
void *arch_alloc_bpf_trampoline(unsigned int size)
{
return bpf_prog_pack_alloc(size, jit_fill_hole);
}
void arch_free_bpf_trampoline(void *image, unsigned int size)
{
bpf_prog_pack_free(image, size);
}
void arch_protect_bpf_trampoline(void *image, unsigned int size)
{
}
void arch_unprotect_bpf_trampoline(void *image, unsigned int size)
{
}
int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
void *ro_image_end, const struct btf_func_model *m,
u32 flags, struct bpf_tramp_links *tlinks,
void *func_addr)
{
int ret, nregs;
void *image, *tmp;
u32 size = ro_image_end - ro_image;
/* image doesn't need to be in module memory range, so we can
* use kvmalloc.
*/
image = kvmalloc(size, GFP_KERNEL);
if (!image)
return -ENOMEM;
struct jit_ctx ctx = {
.image = image,
.ro_image = ro_image,
.idx = 0,
};
@ -2079,15 +2211,26 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
if (nregs > 8)
return -ENOTSUPP;
jit_fill_hole(image, (unsigned int)(image_end - image));
jit_fill_hole(image, (unsigned int)(ro_image_end - ro_image));
ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags);
if (ret > 0 && validate_code(&ctx) < 0)
if (ret > 0 && validate_code(&ctx) < 0) {
ret = -EINVAL;
goto out;
}
if (ret > 0)
ret *= AARCH64_INSN_SIZE;
tmp = bpf_arch_text_copy(ro_image, image, size);
if (IS_ERR(tmp)) {
ret = PTR_ERR(tmp);
goto out;
}
bpf_flush_icache(ro_image, ro_image + size);
out:
kvfree(image);
return ret;
}
@ -2305,3 +2448,42 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
return ret;
}
bool bpf_jit_supports_ptr_xchg(void)
{
return true;
}
bool bpf_jit_supports_exceptions(void)
{
/* We unwind through both kernel frames starting from within bpf_throw
* call and BPF frames. Therefore we require FP unwinder to be enabled
* to walk kernel frames and reach BPF frames in the stack trace.
* ARM64 kernel is aways compiled with CONFIG_FRAME_POINTER=y
*/
return true;
}
void bpf_jit_free(struct bpf_prog *prog)
{
if (prog->jited) {
struct arm64_jit_data *jit_data = prog->aux->jit_data;
struct bpf_binary_header *hdr;
/*
* If we fail the final pass of JIT (from jit_subprogs),
* the program may not be finalized yet. Call finalize here
* before freeing it.
*/
if (jit_data) {
bpf_arch_text_copy(&jit_data->ro_header->size, &jit_data->header->size,
sizeof(jit_data->header->size));
kfree(jit_data);
}
hdr = bpf_jit_binary_pack_hdr(prog);
bpf_jit_binary_pack_free(hdr, NULL);
WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
}
bpf_prog_unlock_free(prog);
}

View File

@ -490,7 +490,7 @@ static int __init add_legacy_isa_io(struct fwnode_handle *fwnode,
}
vaddr = (unsigned long)(PCI_IOBASE + range->io_start);
ioremap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
vmap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
return 0;
}

View File

@ -180,7 +180,7 @@ static int __init add_legacy_isa_io(struct fwnode_handle *fwnode, resource_size_
vaddr = PCI_IOBASE + range->io_start;
ioremap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
vmap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
return 0;
}

View File

@ -46,8 +46,8 @@ static void remap_isa_base(phys_addr_t pa, unsigned long size)
WARN_ON_ONCE(size & ~PAGE_MASK);
if (slab_is_available()) {
if (ioremap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa,
pgprot_noncached(PAGE_KERNEL)))
if (vmap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa,
pgprot_noncached(PAGE_KERNEL)))
vunmap_range(ISA_IO_BASE, ISA_IO_BASE + size);
} else {
early_ioremap_range(ISA_IO_BASE, pa, size,

View File

@ -13,11 +13,28 @@ struct pt_regs;
#ifdef CONFIG_CFI_CLANG
enum bug_trap_type handle_cfi_failure(struct pt_regs *regs);
#define __bpfcall
static inline int cfi_get_offset(void)
{
return 4;
}
#define cfi_get_offset cfi_get_offset
extern u32 cfi_bpf_hash;
extern u32 cfi_bpf_subprog_hash;
extern u32 cfi_get_func_hash(void *func);
#else
static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
{
return BUG_TRAP_TYPE_NONE;
}
#define cfi_bpf_hash 0U
#define cfi_bpf_subprog_hash 0U
static inline u32 cfi_get_func_hash(void *func)
{
return 0;
}
#endif /* CONFIG_CFI_CLANG */
#endif /* _ASM_RISCV_CFI_H */

View File

@ -75,3 +75,56 @@ enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
return report_cfi_failure(regs, regs->epc, &target, type);
}
#ifdef CONFIG_CFI_CLANG
struct bpf_insn;
/* Must match bpf_func_t / DEFINE_BPF_PROG_RUN() */
extern unsigned int __bpf_prog_runX(const void *ctx,
const struct bpf_insn *insn);
/*
* Force a reference to the external symbol so the compiler generates
* __kcfi_typid.
*/
__ADDRESSABLE(__bpf_prog_runX);
/* u32 __ro_after_init cfi_bpf_hash = __kcfi_typeid___bpf_prog_runX; */
asm (
" .pushsection .data..ro_after_init,\"aw\",@progbits \n"
" .type cfi_bpf_hash,@object \n"
" .globl cfi_bpf_hash \n"
" .p2align 2, 0x0 \n"
"cfi_bpf_hash: \n"
" .word __kcfi_typeid___bpf_prog_runX \n"
" .size cfi_bpf_hash, 4 \n"
" .popsection \n"
);
/* Must match bpf_callback_t */
extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64);
__ADDRESSABLE(__bpf_callback_fn);
/* u32 __ro_after_init cfi_bpf_subprog_hash = __kcfi_typeid___bpf_callback_fn; */
asm (
" .pushsection .data..ro_after_init,\"aw\",@progbits \n"
" .type cfi_bpf_subprog_hash,@object \n"
" .globl cfi_bpf_subprog_hash \n"
" .p2align 2, 0x0 \n"
"cfi_bpf_subprog_hash: \n"
" .word __kcfi_typeid___bpf_callback_fn \n"
" .size cfi_bpf_subprog_hash, 4 \n"
" .popsection \n"
);
u32 cfi_get_func_hash(void *func)
{
u32 hash;
if (get_kernel_nofault(hash, func - cfi_get_offset()))
return 0;
return hash;
}
#endif

View File

@ -18,6 +18,11 @@ static inline bool rvc_enabled(void)
return IS_ENABLED(CONFIG_RISCV_ISA_C);
}
static inline bool rvzbb_enabled(void)
{
return IS_ENABLED(CONFIG_RISCV_ISA_ZBB) && riscv_has_extension_likely(RISCV_ISA_EXT_ZBB);
}
enum {
RV_REG_ZERO = 0, /* The constant value 0 */
RV_REG_RA = 1, /* Return address */
@ -730,6 +735,33 @@ static inline u16 rvc_swsp(u32 imm8, u8 rs2)
return rv_css_insn(0x6, imm, rs2, 0x2);
}
/* RVZBB instrutions. */
static inline u32 rvzbb_sextb(u8 rd, u8 rs1)
{
return rv_i_insn(0x604, rs1, 1, rd, 0x13);
}
static inline u32 rvzbb_sexth(u8 rd, u8 rs1)
{
return rv_i_insn(0x605, rs1, 1, rd, 0x13);
}
static inline u32 rvzbb_zexth(u8 rd, u8 rs)
{
if (IS_ENABLED(CONFIG_64BIT))
return rv_i_insn(0x80, rs, 4, rd, 0x3b);
return rv_i_insn(0x80, rs, 4, rd, 0x33);
}
static inline u32 rvzbb_rev8(u8 rd, u8 rs)
{
if (IS_ENABLED(CONFIG_64BIT))
return rv_i_insn(0x6b8, rs, 5, rd, 0x13);
return rv_i_insn(0x698, rs, 5, rd, 0x13);
}
/*
* RV64-only instructions.
*
@ -1087,9 +1119,111 @@ static inline void emit_subw(u8 rd, u8 rs1, u8 rs2, struct rv_jit_context *ctx)
emit(rv_subw(rd, rs1, rs2), ctx);
}
static inline void emit_sextb(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
if (rvzbb_enabled()) {
emit(rvzbb_sextb(rd, rs), ctx);
return;
}
emit_slli(rd, rs, 56, ctx);
emit_srai(rd, rd, 56, ctx);
}
static inline void emit_sexth(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
if (rvzbb_enabled()) {
emit(rvzbb_sexth(rd, rs), ctx);
return;
}
emit_slli(rd, rs, 48, ctx);
emit_srai(rd, rd, 48, ctx);
}
static inline void emit_sextw(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
emit_addiw(rd, rs, 0, ctx);
}
static inline void emit_zexth(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
if (rvzbb_enabled()) {
emit(rvzbb_zexth(rd, rs), ctx);
return;
}
emit_slli(rd, rs, 48, ctx);
emit_srli(rd, rd, 48, ctx);
}
static inline void emit_zextw(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
emit_slli(rd, rs, 32, ctx);
emit_srli(rd, rd, 32, ctx);
}
static inline void emit_bswap(u8 rd, s32 imm, struct rv_jit_context *ctx)
{
if (rvzbb_enabled()) {
int bits = 64 - imm;
emit(rvzbb_rev8(rd, rd), ctx);
if (bits)
emit_srli(rd, rd, bits, ctx);
return;
}
emit_li(RV_REG_T2, 0, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
if (imm == 16)
goto out_be;
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
if (imm == 32)
goto out_be;
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
out_be:
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_mv(rd, RV_REG_T2, ctx);
}
#endif /* __riscv_xlen == 64 */
void bpf_jit_build_prologue(struct rv_jit_context *ctx);
void bpf_jit_build_prologue(struct rv_jit_context *ctx, bool is_subprog);
void bpf_jit_build_epilogue(struct rv_jit_context *ctx);
int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,

View File

@ -1301,7 +1301,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
return 0;
}
void bpf_jit_build_prologue(struct rv_jit_context *ctx)
void bpf_jit_build_prologue(struct rv_jit_context *ctx, bool is_subprog)
{
const s8 *fp = bpf2rv32[BPF_REG_FP];
const s8 *r1 = bpf2rv32[BPF_REG_1];

View File

@ -11,6 +11,7 @@
#include <linux/memory.h>
#include <linux/stop_machine.h>
#include <asm/patch.h>
#include <asm/cfi.h>
#include "bpf_jit.h"
#define RV_FENTRY_NINSNS 2
@ -141,6 +142,19 @@ static bool in_auipc_jalr_range(s64 val)
val < ((1L << 31) - (1L << 11));
}
/* Modify rd pointer to alternate reg to avoid corrupting original reg */
static void emit_sextw_alt(u8 *rd, u8 ra, struct rv_jit_context *ctx)
{
emit_sextw(ra, *rd, ctx);
*rd = ra;
}
static void emit_zextw_alt(u8 *rd, u8 ra, struct rv_jit_context *ctx)
{
emit_zextw(ra, *rd, ctx);
*rd = ra;
}
/* Emit fixed-length instructions for address */
static int emit_addr(u8 rd, u64 addr, bool extra_pass, struct rv_jit_context *ctx)
{
@ -326,12 +340,6 @@ static void emit_branch(u8 cond, u8 rd, u8 rs, int rvoff,
emit(rv_jalr(RV_REG_ZERO, RV_REG_T1, lower), ctx);
}
static void emit_zext_32(u8 reg, struct rv_jit_context *ctx)
{
emit_slli(reg, reg, 32, ctx);
emit_srli(reg, reg, 32, ctx);
}
static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
{
int tc_ninsn, off, start_insn = ctx->ninsns;
@ -346,7 +354,7 @@ static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
*/
tc_ninsn = insn ? ctx->offset[insn] - ctx->offset[insn - 1] :
ctx->offset[0];
emit_zext_32(RV_REG_A2, ctx);
emit_zextw(RV_REG_A2, RV_REG_A2, ctx);
off = offsetof(struct bpf_array, map.max_entries);
if (is_12b_check(off, insn))
@ -405,38 +413,6 @@ static void init_regs(u8 *rd, u8 *rs, const struct bpf_insn *insn,
*rs = bpf_to_rv_reg(insn->src_reg, ctx);
}
static void emit_zext_32_rd_rs(u8 *rd, u8 *rs, struct rv_jit_context *ctx)
{
emit_mv(RV_REG_T2, *rd, ctx);
emit_zext_32(RV_REG_T2, ctx);
emit_mv(RV_REG_T1, *rs, ctx);
emit_zext_32(RV_REG_T1, ctx);
*rd = RV_REG_T2;
*rs = RV_REG_T1;
}
static void emit_sext_32_rd_rs(u8 *rd, u8 *rs, struct rv_jit_context *ctx)
{
emit_addiw(RV_REG_T2, *rd, 0, ctx);
emit_addiw(RV_REG_T1, *rs, 0, ctx);
*rd = RV_REG_T2;
*rs = RV_REG_T1;
}
static void emit_zext_32_rd_t1(u8 *rd, struct rv_jit_context *ctx)
{
emit_mv(RV_REG_T2, *rd, ctx);
emit_zext_32(RV_REG_T2, ctx);
emit_zext_32(RV_REG_T1, ctx);
*rd = RV_REG_T2;
}
static void emit_sext_32_rd(u8 *rd, struct rv_jit_context *ctx)
{
emit_addiw(RV_REG_T2, *rd, 0, ctx);
*rd = RV_REG_T2;
}
static int emit_jump_and_link(u8 rd, s64 rvoff, bool fixed_addr,
struct rv_jit_context *ctx)
{
@ -480,6 +456,12 @@ static int emit_call(u64 addr, bool fixed_addr, struct rv_jit_context *ctx)
return emit_jump_and_link(RV_REG_RA, off, fixed_addr, ctx);
}
static inline void emit_kcfi(u32 hash, struct rv_jit_context *ctx)
{
if (IS_ENABLED(CONFIG_CFI_CLANG))
emit(hash, ctx);
}
static void emit_atomic(u8 rd, u8 rs, s16 off, s32 imm, bool is64,
struct rv_jit_context *ctx)
{
@ -519,32 +501,32 @@ static void emit_atomic(u8 rd, u8 rs, s16 off, s32 imm, bool is64,
emit(is64 ? rv_amoadd_d(rs, rs, rd, 0, 0) :
rv_amoadd_w(rs, rs, rd, 0, 0), ctx);
if (!is64)
emit_zext_32(rs, ctx);
emit_zextw(rs, rs, ctx);
break;
case BPF_AND | BPF_FETCH:
emit(is64 ? rv_amoand_d(rs, rs, rd, 0, 0) :
rv_amoand_w(rs, rs, rd, 0, 0), ctx);
if (!is64)
emit_zext_32(rs, ctx);
emit_zextw(rs, rs, ctx);
break;
case BPF_OR | BPF_FETCH:
emit(is64 ? rv_amoor_d(rs, rs, rd, 0, 0) :
rv_amoor_w(rs, rs, rd, 0, 0), ctx);
if (!is64)
emit_zext_32(rs, ctx);
emit_zextw(rs, rs, ctx);
break;
case BPF_XOR | BPF_FETCH:
emit(is64 ? rv_amoxor_d(rs, rs, rd, 0, 0) :
rv_amoxor_w(rs, rs, rd, 0, 0), ctx);
if (!is64)
emit_zext_32(rs, ctx);
emit_zextw(rs, rs, ctx);
break;
/* src_reg = atomic_xchg(dst_reg + off16, src_reg); */
case BPF_XCHG:
emit(is64 ? rv_amoswap_d(rs, rs, rd, 0, 0) :
rv_amoswap_w(rs, rs, rd, 0, 0), ctx);
if (!is64)
emit_zext_32(rs, ctx);
emit_zextw(rs, rs, ctx);
break;
/* r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg); */
case BPF_CMPXCHG:
@ -894,6 +876,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
emit_sd(RV_REG_SP, stack_size - 16, RV_REG_FP, ctx);
emit_addi(RV_REG_FP, RV_REG_SP, stack_size, ctx);
} else {
/* emit kcfi hash */
emit_kcfi(cfi_get_func_hash(func_addr), ctx);
/* For the trampoline called directly, just handle
* the frame of trampoline.
*/
@ -1091,7 +1075,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
case BPF_ALU64 | BPF_MOV | BPF_X:
if (imm == 1) {
/* Special mov32 for zext */
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
}
switch (insn->off) {
@ -1099,16 +1083,17 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_mv(rd, rs, ctx);
break;
case 8:
emit_sextb(rd, rs, ctx);
break;
case 16:
emit_slli(RV_REG_T1, rs, 64 - insn->off, ctx);
emit_srai(rd, RV_REG_T1, 64 - insn->off, ctx);
emit_sexth(rd, rs, ctx);
break;
case 32:
emit_addiw(rd, rs, 0, ctx);
emit_sextw(rd, rs, ctx);
break;
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
/* dst = dst OP src */
@ -1116,7 +1101,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
case BPF_ALU64 | BPF_ADD | BPF_X:
emit_add(rd, rd, rs, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_SUB | BPF_X:
case BPF_ALU64 | BPF_SUB | BPF_X:
@ -1126,31 +1111,31 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_subw(rd, rd, rs, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_AND | BPF_X:
case BPF_ALU64 | BPF_AND | BPF_X:
emit_and(rd, rd, rs, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_OR | BPF_X:
case BPF_ALU64 | BPF_OR | BPF_X:
emit_or(rd, rd, rs, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_XOR | BPF_X:
case BPF_ALU64 | BPF_XOR | BPF_X:
emit_xor(rd, rd, rs, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_MUL | BPF_X:
case BPF_ALU64 | BPF_MUL | BPF_X:
emit(is64 ? rv_mul(rd, rd, rs) : rv_mulw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_DIV | BPF_X:
case BPF_ALU64 | BPF_DIV | BPF_X:
@ -1159,7 +1144,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
else
emit(is64 ? rv_divu(rd, rd, rs) : rv_divuw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_MOD | BPF_X:
case BPF_ALU64 | BPF_MOD | BPF_X:
@ -1168,25 +1153,25 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
else
emit(is64 ? rv_remu(rd, rd, rs) : rv_remuw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_LSH | BPF_X:
case BPF_ALU64 | BPF_LSH | BPF_X:
emit(is64 ? rv_sll(rd, rd, rs) : rv_sllw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_RSH | BPF_X:
case BPF_ALU64 | BPF_RSH | BPF_X:
emit(is64 ? rv_srl(rd, rd, rs) : rv_srlw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_ARSH | BPF_X:
case BPF_ALU64 | BPF_ARSH | BPF_X:
emit(is64 ? rv_sra(rd, rd, rs) : rv_sraw(rd, rd, rs), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
/* dst = -dst */
@ -1194,73 +1179,27 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
case BPF_ALU64 | BPF_NEG:
emit_sub(rd, RV_REG_ZERO, rd, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
/* dst = BSWAP##imm(dst) */
case BPF_ALU | BPF_END | BPF_FROM_LE:
switch (imm) {
case 16:
emit_slli(rd, rd, 48, ctx);
emit_srli(rd, rd, 48, ctx);
emit_zexth(rd, rd, ctx);
break;
case 32:
if (!aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case 64:
/* Do nothing */
break;
}
break;
case BPF_ALU | BPF_END | BPF_FROM_BE:
case BPF_ALU64 | BPF_END | BPF_FROM_LE:
emit_li(RV_REG_T2, 0, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
if (imm == 16)
goto out_be;
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
if (imm == 32)
goto out_be;
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_slli(RV_REG_T2, RV_REG_T2, 8, ctx);
emit_srli(rd, rd, 8, ctx);
out_be:
emit_andi(RV_REG_T1, rd, 0xff, ctx);
emit_add(RV_REG_T2, RV_REG_T2, RV_REG_T1, ctx);
emit_mv(rd, RV_REG_T2, ctx);
emit_bswap(rd, imm, ctx);
break;
/* dst = imm */
@ -1268,7 +1207,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
case BPF_ALU64 | BPF_MOV | BPF_K:
emit_imm(rd, imm, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
/* dst = dst OP imm */
@ -1281,7 +1220,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_add(rd, rd, RV_REG_T1, ctx);
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_SUB | BPF_K:
case BPF_ALU64 | BPF_SUB | BPF_K:
@ -1292,7 +1231,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_sub(rd, rd, RV_REG_T1, ctx);
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_AND | BPF_K:
case BPF_ALU64 | BPF_AND | BPF_K:
@ -1303,7 +1242,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_and(rd, rd, RV_REG_T1, ctx);
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_OR | BPF_K:
case BPF_ALU64 | BPF_OR | BPF_K:
@ -1314,7 +1253,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_or(rd, rd, RV_REG_T1, ctx);
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_XOR | BPF_K:
case BPF_ALU64 | BPF_XOR | BPF_K:
@ -1325,7 +1264,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit_xor(rd, rd, RV_REG_T1, ctx);
}
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_MUL | BPF_K:
case BPF_ALU64 | BPF_MUL | BPF_K:
@ -1333,7 +1272,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit(is64 ? rv_mul(rd, rd, RV_REG_T1) :
rv_mulw(rd, rd, RV_REG_T1), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_DIV | BPF_K:
case BPF_ALU64 | BPF_DIV | BPF_K:
@ -1345,7 +1284,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit(is64 ? rv_divu(rd, rd, RV_REG_T1) :
rv_divuw(rd, rd, RV_REG_T1), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_MOD | BPF_K:
case BPF_ALU64 | BPF_MOD | BPF_K:
@ -1357,14 +1296,14 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit(is64 ? rv_remu(rd, rd, RV_REG_T1) :
rv_remuw(rd, rd, RV_REG_T1), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_LSH | BPF_K:
case BPF_ALU64 | BPF_LSH | BPF_K:
emit_slli(rd, rd, imm, ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_RSH | BPF_K:
case BPF_ALU64 | BPF_RSH | BPF_K:
@ -1374,7 +1313,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit(rv_srliw(rd, rd, imm), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
case BPF_ALU | BPF_ARSH | BPF_K:
case BPF_ALU64 | BPF_ARSH | BPF_K:
@ -1384,7 +1323,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
emit(rv_sraiw(rd, rd, imm), ctx);
if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
emit_zextw(rd, rd, ctx);
break;
/* JUMP off */
@ -1425,10 +1364,13 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
rvoff = rv_offset(i, off, ctx);
if (!is64) {
s = ctx->ninsns;
if (is_signed_bpf_cond(BPF_OP(code)))
emit_sext_32_rd_rs(&rd, &rs, ctx);
else
emit_zext_32_rd_rs(&rd, &rs, ctx);
if (is_signed_bpf_cond(BPF_OP(code))) {
emit_sextw_alt(&rs, RV_REG_T1, ctx);
emit_sextw_alt(&rd, RV_REG_T2, ctx);
} else {
emit_zextw_alt(&rs, RV_REG_T1, ctx);
emit_zextw_alt(&rd, RV_REG_T2, ctx);
}
e = ctx->ninsns;
/* Adjust for extra insns */
@ -1439,8 +1381,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
/* Adjust for and */
rvoff -= 4;
emit_and(RV_REG_T1, rd, rs, ctx);
emit_branch(BPF_JNE, RV_REG_T1, RV_REG_ZERO, rvoff,
ctx);
emit_branch(BPF_JNE, RV_REG_T1, RV_REG_ZERO, rvoff, ctx);
} else {
emit_branch(BPF_OP(code), rd, rs, rvoff, ctx);
}
@ -1469,18 +1410,18 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
case BPF_JMP32 | BPF_JSLE | BPF_K:
rvoff = rv_offset(i, off, ctx);
s = ctx->ninsns;
if (imm) {
if (imm)
emit_imm(RV_REG_T1, imm, ctx);
rs = RV_REG_T1;
} else {
/* If imm is 0, simply use zero register. */
rs = RV_REG_ZERO;
}
rs = imm ? RV_REG_T1 : RV_REG_ZERO;
if (!is64) {
if (is_signed_bpf_cond(BPF_OP(code)))
emit_sext_32_rd(&rd, ctx);
else
emit_zext_32_rd_t1(&rd, ctx);
if (is_signed_bpf_cond(BPF_OP(code))) {
emit_sextw_alt(&rd, RV_REG_T2, ctx);
/* rs has been sign extended */
} else {
emit_zextw_alt(&rd, RV_REG_T2, ctx);
if (imm)
emit_zextw(rs, rs, ctx);
}
}
e = ctx->ninsns;
@ -1504,7 +1445,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
* as t1 is used only in comparison against zero.
*/
if (!is64 && imm < 0)
emit_addiw(RV_REG_T1, RV_REG_T1, 0, ctx);
emit_sextw(RV_REG_T1, RV_REG_T1, ctx);
e = ctx->ninsns;
rvoff -= ninsns_rvoff(e - s);
emit_branch(BPF_JNE, RV_REG_T1, RV_REG_ZERO, rvoff, ctx);
@ -1779,7 +1720,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
return 0;
}
void bpf_jit_build_prologue(struct rv_jit_context *ctx)
void bpf_jit_build_prologue(struct rv_jit_context *ctx, bool is_subprog)
{
int i, stack_adjust = 0, store_offset, bpf_stack_adjust;
@ -1808,6 +1749,9 @@ void bpf_jit_build_prologue(struct rv_jit_context *ctx)
store_offset = stack_adjust - 8;
/* emit kcfi type preamble immediately before the first insn */
emit_kcfi(is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash, ctx);
/* nops reserved for auipc+jalr pair */
for (i = 0; i < RV_FENTRY_NINSNS; i++)
emit(rv_nop(), ctx);
@ -1874,3 +1818,8 @@ bool bpf_jit_supports_kfunc_call(void)
{
return true;
}
bool bpf_jit_supports_ptr_xchg(void)
{
return true;
}

View File

@ -10,6 +10,7 @@
#include <linux/filter.h>
#include <linux/memory.h>
#include <asm/patch.h>
#include <asm/cfi.h>
#include "bpf_jit.h"
/* Number of iterations to try until offsets converge. */
@ -100,7 +101,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
pass++;
ctx->ninsns = 0;
bpf_jit_build_prologue(ctx);
bpf_jit_build_prologue(ctx, bpf_is_subprog(prog));
ctx->prologue_len = ctx->ninsns;
if (build_body(ctx, extra_pass, ctx->offset)) {
@ -160,7 +161,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
ctx->ninsns = 0;
ctx->nexentries = 0;
bpf_jit_build_prologue(ctx);
bpf_jit_build_prologue(ctx, bpf_is_subprog(prog));
if (build_body(ctx, extra_pass, NULL)) {
prog = orig_prog;
goto out_free_hdr;
@ -170,9 +171,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
if (bpf_jit_enable > 1)
bpf_jit_dump(prog->len, prog_size, pass, ctx->insns);
prog->bpf_func = (void *)ctx->ro_insns;
prog->bpf_func = (void *)ctx->ro_insns + cfi_get_offset();
prog->jited = 1;
prog->jited_len = prog_size;
prog->jited_len = prog_size - cfi_get_offset();
if (!prog->is_func || extra_pass) {
if (WARN_ON(bpf_jit_binary_pack_finalize(prog, jit_data->ro_header,

View File

@ -113,6 +113,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
/* Pick a register outside of BPF range for JIT internal work */
#define AUX_REG (MAX_BPF_JIT_REG + 1)
#define X86_REG_R9 (MAX_BPF_JIT_REG + 2)
#define X86_REG_R12 (MAX_BPF_JIT_REG + 3)
/*
* The following table maps BPF registers to x86-64 registers.
@ -139,6 +140,7 @@ static const int reg2hex[] = {
[BPF_REG_AX] = 2, /* R10 temp register */
[AUX_REG] = 3, /* R11 temp register */
[X86_REG_R9] = 1, /* R9 register, 6th function argument */
[X86_REG_R12] = 4, /* R12 callee saved */
};
static const int reg2pt_regs[] = {
@ -167,6 +169,7 @@ static bool is_ereg(u32 reg)
BIT(BPF_REG_8) |
BIT(BPF_REG_9) |
BIT(X86_REG_R9) |
BIT(X86_REG_R12) |
BIT(BPF_REG_AX));
}
@ -205,6 +208,17 @@ static u8 add_2mod(u8 byte, u32 r1, u32 r2)
return byte;
}
static u8 add_3mod(u8 byte, u32 r1, u32 r2, u32 index)
{
if (is_ereg(r1))
byte |= 1;
if (is_ereg(index))
byte |= 2;
if (is_ereg(r2))
byte |= 4;
return byte;
}
/* Encode 'dst_reg' register into x86-64 opcode 'byte' */
static u8 add_1reg(u8 byte, u32 dst_reg)
{
@ -645,6 +659,8 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
pop_r12(&prog);
} else {
pop_callee_regs(&prog, callee_regs_used);
if (bpf_arena_get_kern_vm_start(bpf_prog->aux->arena))
pop_r12(&prog);
}
EMIT1(0x58); /* pop rax */
@ -704,6 +720,8 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
pop_r12(&prog);
} else {
pop_callee_regs(&prog, callee_regs_used);
if (bpf_arena_get_kern_vm_start(bpf_prog->aux->arena))
pop_r12(&prog);
}
EMIT1(0x58); /* pop rax */
@ -887,6 +905,18 @@ static void emit_insn_suffix(u8 **pprog, u32 ptr_reg, u32 val_reg, int off)
*pprog = prog;
}
static void emit_insn_suffix_SIB(u8 **pprog, u32 ptr_reg, u32 val_reg, u32 index_reg, int off)
{
u8 *prog = *pprog;
if (is_imm8(off)) {
EMIT3(add_2reg(0x44, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
} else {
EMIT2_off32(add_2reg(0x84, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
}
*pprog = prog;
}
/*
* Emit a REX byte if it will be necessary to address these registers
*/
@ -968,6 +998,37 @@ static void emit_ldsx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
*pprog = prog;
}
static void emit_ldx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
{
u8 *prog = *pprog;
switch (size) {
case BPF_B:
/* movzx rax, byte ptr [rax + r12 + off] */
EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB6);
break;
case BPF_H:
/* movzx rax, word ptr [rax + r12 + off] */
EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB7);
break;
case BPF_W:
/* mov eax, dword ptr [rax + r12 + off] */
EMIT2(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x8B);
break;
case BPF_DW:
/* mov rax, qword ptr [rax + r12 + off] */
EMIT2(add_3mod(0x48, src_reg, dst_reg, index_reg), 0x8B);
break;
}
emit_insn_suffix_SIB(&prog, src_reg, dst_reg, index_reg, off);
*pprog = prog;
}
static void emit_ldx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
{
emit_ldx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
}
/* STX: *(u8*)(dst_reg + off) = src_reg */
static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
{
@ -1002,6 +1063,71 @@ static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
*pprog = prog;
}
/* STX: *(u8*)(dst_reg + index_reg + off) = src_reg */
static void emit_stx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
{
u8 *prog = *pprog;
switch (size) {
case BPF_B:
/* mov byte ptr [rax + r12 + off], al */
EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x88);
break;
case BPF_H:
/* mov word ptr [rax + r12 + off], ax */
EMIT3(0x66, add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
break;
case BPF_W:
/* mov dword ptr [rax + r12 + 1], eax */
EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
break;
case BPF_DW:
/* mov qword ptr [rax + r12 + 1], rax */
EMIT2(add_3mod(0x48, dst_reg, src_reg, index_reg), 0x89);
break;
}
emit_insn_suffix_SIB(&prog, dst_reg, src_reg, index_reg, off);
*pprog = prog;
}
static void emit_stx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
{
emit_stx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
}
/* ST: *(u8*)(dst_reg + index_reg + off) = imm32 */
static void emit_st_index(u8 **pprog, u32 size, u32 dst_reg, u32 index_reg, int off, int imm)
{
u8 *prog = *pprog;
switch (size) {
case BPF_B:
/* mov byte ptr [rax + r12 + off], imm8 */
EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC6);
break;
case BPF_H:
/* mov word ptr [rax + r12 + off], imm16 */
EMIT3(0x66, add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
break;
case BPF_W:
/* mov dword ptr [rax + r12 + 1], imm32 */
EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
break;
case BPF_DW:
/* mov qword ptr [rax + r12 + 1], imm32 */
EMIT2(add_3mod(0x48, dst_reg, 0, index_reg), 0xC7);
break;
}
emit_insn_suffix_SIB(&prog, dst_reg, 0, index_reg, off);
EMIT(imm, bpf_size_to_x86_bytes(size));
*pprog = prog;
}
static void emit_st_r12(u8 **pprog, u32 size, u32 dst_reg, int off, int imm)
{
emit_st_index(pprog, size, dst_reg, X86_REG_R12, off, imm);
}
static int emit_atomic(u8 **pprog, u8 atomic_op,
u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size)
{
@ -1043,12 +1169,15 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
return 0;
}
#define DONT_CLEAR 1
bool ex_handler_bpf(const struct exception_table_entry *x, struct pt_regs *regs)
{
u32 reg = x->fixup >> 8;
/* jump over faulting load and clear dest register */
*(unsigned long *)((void *)regs + reg) = 0;
if (reg != DONT_CLEAR)
*(unsigned long *)((void *)regs + reg) = 0;
regs->ip += x->fixup & 0xff;
return true;
}
@ -1147,11 +1276,15 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
bool tail_call_seen = false;
bool seen_exit = false;
u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
u64 arena_vm_start, user_vm_start;
int i, excnt = 0;
int ilen, proglen = 0;
u8 *prog = temp;
int err;
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
detect_reg_usage(insn, insn_cnt, callee_regs_used,
&tail_call_seen);
@ -1172,8 +1305,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
push_r12(&prog);
push_callee_regs(&prog, all_callee_regs_used);
} else {
if (arena_vm_start)
push_r12(&prog);
push_callee_regs(&prog, callee_regs_used);
}
if (arena_vm_start)
emit_mov_imm64(&prog, X86_REG_R12,
arena_vm_start >> 32, (u32) arena_vm_start);
ilen = prog - temp;
if (rw_image)
@ -1213,6 +1351,40 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
break;
case BPF_ALU64 | BPF_MOV | BPF_X:
if (insn->off == BPF_ADDR_SPACE_CAST &&
insn->imm == 1U << 16) {
if (dst_reg != src_reg)
/* 32-bit mov */
emit_mov_reg(&prog, false, dst_reg, src_reg);
/* shl dst_reg, 32 */
maybe_emit_1mod(&prog, dst_reg, true);
EMIT3(0xC1, add_1reg(0xE0, dst_reg), 32);
/* or dst_reg, user_vm_start */
maybe_emit_1mod(&prog, dst_reg, true);
if (is_axreg(dst_reg))
EMIT1_off32(0x0D, user_vm_start >> 32);
else
EMIT2_off32(0x81, add_1reg(0xC8, dst_reg), user_vm_start >> 32);
/* rol dst_reg, 32 */
maybe_emit_1mod(&prog, dst_reg, true);
EMIT3(0xC1, add_1reg(0xC0, dst_reg), 32);
/* xor r11, r11 */
EMIT3(0x4D, 0x31, 0xDB);
/* test dst_reg32, dst_reg32; check if lower 32-bit are zero */
maybe_emit_mod(&prog, dst_reg, dst_reg, false);
EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
/* cmove r11, dst_reg; if so, set dst_reg to zero */
/* WARNING: Intel swapped src/dst register encoding in CMOVcc !!! */
maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
break;
}
fallthrough;
case BPF_ALU | BPF_MOV | BPF_X:
if (insn->off == 0)
emit_mov_reg(&prog,
@ -1564,6 +1736,56 @@ st: if (is_imm8(insn->off))
emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
break;
case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
start_of_ldx = prog;
emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
goto populate_extable;
/* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
start_of_ldx = prog;
if (BPF_CLASS(insn->code) == BPF_LDX)
emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
else
emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
populate_extable:
{
struct exception_table_entry *ex;
u8 *_insn = image + proglen + (start_of_ldx - temp);
s64 delta;
if (!bpf_prog->aux->extable)
break;
if (excnt >= bpf_prog->aux->num_exentries) {
pr_err("mem32 extable bug\n");
return -EFAULT;
}
ex = &bpf_prog->aux->extable[excnt++];
delta = _insn - (u8 *)&ex->insn;
/* switch ex to rw buffer for writes */
ex = (void *)rw_image + ((void *)ex - (void *)image);
ex->insn = delta;
ex->data = EX_TYPE_BPF;
ex->fixup = (prog - start_of_ldx) |
((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
}
break;
/* LDX: dst_reg = *(u8*)(src_reg + off) */
case BPF_LDX | BPF_MEM | BPF_B:
case BPF_LDX | BPF_PROBE_MEM | BPF_B:
@ -2036,6 +2258,8 @@ st: if (is_imm8(insn->off))
pop_r12(&prog);
} else {
pop_callee_regs(&prog, callee_regs_used);
if (arena_vm_start)
pop_r12(&prog);
}
EMIT1(0xC9); /* leave */
emit_return(&prog, image + addrs[i - 1] + (prog - temp));
@ -3242,3 +3466,13 @@ void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
BUG_ON(ret < 0);
}
}
bool bpf_jit_supports_arena(void)
{
return true;
}
bool bpf_jit_supports_ptr_xchg(void)
{
return true;
}

View File

@ -2550,14 +2550,12 @@ static int fore200e_sba_probe(struct platform_device *op)
return 0;
}
static int fore200e_sba_remove(struct platform_device *op)
static void fore200e_sba_remove(struct platform_device *op)
{
struct fore200e *fore200e = dev_get_drvdata(&op->dev);
fore200e_shutdown(fore200e);
kfree(fore200e);
return 0;
}
static const struct of_device_id fore200e_sba_match[] = {
@ -2574,7 +2572,7 @@ static struct platform_driver fore200e_sba_driver = {
.of_match_table = fore200e_sba_match,
},
.probe = fore200e_sba_probe,
.remove = fore200e_sba_remove,
.remove_new = fore200e_sba_remove,
};
#endif

View File

@ -68,7 +68,7 @@ static struct attribute *bcma_device_attrs[] = {
};
ATTRIBUTE_GROUPS(bcma_device);
static struct bus_type bcma_bus_type = {
static const struct bus_type bcma_bus_type = {
.name = "bcma",
.match = bcma_bus_match,
.probe = bcma_device_probe,

View File

@ -11,6 +11,7 @@
#include <linux/firmware.h>
#include <linux/dmi.h>
#include <linux/of.h>
#include <linux/string.h>
#include <asm/unaligned.h>
#include <net/bluetooth/bluetooth.h>
@ -543,8 +544,6 @@ static const char *btbcm_get_board_name(struct device *dev)
struct device_node *root;
char *board_type;
const char *tmp;
int len;
int i;
root = of_find_node_by_path("/");
if (!root)
@ -554,13 +553,8 @@ static const char *btbcm_get_board_name(struct device *dev)
return NULL;
/* get rid of any '/' in the compatible string */
len = strlen(tmp) + 1;
board_type = devm_kzalloc(dev, len, GFP_KERNEL);
strscpy(board_type, tmp, len);
for (i = 0; i < len; i++) {
if (board_type[i] == '/')
board_type[i] = '-';
}
board_type = devm_kstrdup(dev, tmp, GFP_KERNEL);
strreplace(board_type, '/', '-');
of_node_put(root);
return board_type;

View File

@ -441,7 +441,7 @@ int btintel_read_version(struct hci_dev *hdev, struct intel_version *ver)
return PTR_ERR(skb);
}
if (skb->len != sizeof(*ver)) {
if (!skb || skb->len != sizeof(*ver)) {
bt_dev_err(hdev, "Intel version event size mismatch");
kfree_skb(skb);
return -EILSEQ;
@ -2670,6 +2670,119 @@ static void btintel_set_msft_opcode(struct hci_dev *hdev, u8 hw_variant)
}
}
static void btintel_print_fseq_info(struct hci_dev *hdev)
{
struct sk_buff *skb;
u8 *p;
u32 val;
const char *str;
skb = __hci_cmd_sync(hdev, 0xfcb3, 0, NULL, HCI_CMD_TIMEOUT);
if (IS_ERR(skb)) {
bt_dev_dbg(hdev, "Reading fseq status command failed (%ld)",
PTR_ERR(skb));
return;
}
if (skb->len < (sizeof(u32) * 16 + 2)) {
bt_dev_dbg(hdev, "Malformed packet of length %u received",
skb->len);
kfree_skb(skb);
return;
}
p = skb_pull_data(skb, 1);
if (*p) {
bt_dev_dbg(hdev, "Failed to get fseq status (0x%2.2x)", *p);
kfree_skb(skb);
return;
}
p = skb_pull_data(skb, 1);
switch (*p) {
case 0:
str = "Success";
break;
case 1:
str = "Fatal error";
break;
case 2:
str = "Semaphore acquire error";
break;
default:
str = "Unknown error";
break;
}
if (*p) {
bt_dev_err(hdev, "Fseq status: %s (0x%2.2x)", str, *p);
kfree_skb(skb);
return;
}
bt_dev_info(hdev, "Fseq status: %s (0x%2.2x)", str, *p);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Reason: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Global version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Installed version: 0x%8.8x", val);
p = skb->data;
skb_pull_data(skb, 4);
bt_dev_info(hdev, "Fseq executed: %2.2u.%2.2u.%2.2u.%2.2u", p[0], p[1],
p[2], p[3]);
p = skb->data;
skb_pull_data(skb, 4);
bt_dev_info(hdev, "Fseq BT Top: %2.2u.%2.2u.%2.2u.%2.2u", p[0], p[1],
p[2], p[3]);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq Top init version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq Cnvio init version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq MBX Wifi file version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq BT version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq Top reset address: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq MBX timeout: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq MBX ack: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq CNVi id: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq CNVr id: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq Error handle: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq Magic noalive indication: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq OTP version: 0x%8.8x", val);
val = get_unaligned_le32(skb_pull_data(skb, 4));
bt_dev_dbg(hdev, "Fseq MBX otp version: 0x%8.8x", val);
kfree_skb(skb);
}
static int btintel_setup_combined(struct hci_dev *hdev)
{
const u8 param[1] = { 0xFF };
@ -2902,6 +3015,7 @@ static int btintel_setup_combined(struct hci_dev *hdev)
err = btintel_bootloader_setup_tlv(hdev, &ver_tlv);
btintel_register_devcoredump_support(hdev);
btintel_print_fseq_info(hdev);
break;
default:
bt_dev_err(hdev, "Unsupported Intel hw variant (%u)",

View File

@ -372,8 +372,10 @@ int btmtk_process_coredump(struct hci_dev *hdev, struct sk_buff *skb)
struct btmediatek_data *data = hci_get_priv(hdev);
int err;
if (!IS_ENABLED(CONFIG_DEV_COREDUMP))
if (!IS_ENABLED(CONFIG_DEV_COREDUMP)) {
kfree_skb(skb);
return 0;
}
switch (data->cd_info.state) {
case HCI_DEVCOREDUMP_IDLE:
@ -420,5 +422,6 @@ MODULE_LICENSE("GPL");
MODULE_FIRMWARE(FIRMWARE_MT7622);
MODULE_FIRMWARE(FIRMWARE_MT7663);
MODULE_FIRMWARE(FIRMWARE_MT7668);
MODULE_FIRMWARE(FIRMWARE_MT7922);
MODULE_FIRMWARE(FIRMWARE_MT7961);
MODULE_FIRMWARE(FIRMWARE_MT7925);

View File

@ -4,6 +4,7 @@
#define FIRMWARE_MT7622 "mediatek/mt7622pr2h.bin"
#define FIRMWARE_MT7663 "mediatek/mt7663pr2h.bin"
#define FIRMWARE_MT7668 "mediatek/mt7668pr2h.bin"
#define FIRMWARE_MT7922 "mediatek/BT_RAM_CODE_MT7922_1_1_hdr.bin"
#define FIRMWARE_MT7961 "mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin"
#define FIRMWARE_MT7925 "mediatek/mt7925/BT_RAM_CODE_MT7925_1_1_hdr.bin"

View File

@ -126,6 +126,7 @@ struct ps_data {
struct hci_dev *hdev;
struct work_struct work;
struct timer_list ps_timer;
struct mutex ps_lock;
};
struct wakeup_cmd_payload {
@ -317,6 +318,9 @@ static void ps_start_timer(struct btnxpuart_dev *nxpdev)
if (psdata->cur_psmode == PS_MODE_ENABLE)
mod_timer(&psdata->ps_timer, jiffies + msecs_to_jiffies(psdata->h2c_ps_interval));
if (psdata->ps_state == PS_STATE_AWAKE && psdata->ps_cmd == PS_CMD_ENTER_PS)
cancel_work_sync(&psdata->work);
}
static void ps_cancel_timer(struct btnxpuart_dev *nxpdev)
@ -337,6 +341,7 @@ static void ps_control(struct hci_dev *hdev, u8 ps_state)
!test_bit(BTNXPUART_SERDEV_OPEN, &nxpdev->tx_state))
return;
mutex_lock(&psdata->ps_lock);
switch (psdata->cur_h2c_wakeupmode) {
case WAKEUP_METHOD_DTR:
if (ps_state == PS_STATE_AWAKE)
@ -350,12 +355,15 @@ static void ps_control(struct hci_dev *hdev, u8 ps_state)
status = serdev_device_break_ctl(nxpdev->serdev, 0);
else
status = serdev_device_break_ctl(nxpdev->serdev, -1);
msleep(20); /* Allow chip to detect UART-break and enter sleep */
bt_dev_dbg(hdev, "Set UART break: %s, status=%d",
str_on_off(ps_state == PS_STATE_SLEEP), status);
break;
}
if (!status)
psdata->ps_state = ps_state;
mutex_unlock(&psdata->ps_lock);
if (ps_state == PS_STATE_AWAKE)
btnxpuart_tx_wakeup(nxpdev);
}
@ -391,17 +399,25 @@ static void ps_setup(struct hci_dev *hdev)
psdata->hdev = hdev;
INIT_WORK(&psdata->work, ps_work_func);
mutex_init(&psdata->ps_lock);
timer_setup(&psdata->ps_timer, ps_timeout_func, 0);
}
static void ps_wakeup(struct btnxpuart_dev *nxpdev)
static bool ps_wakeup(struct btnxpuart_dev *nxpdev)
{
struct ps_data *psdata = &nxpdev->psdata;
u8 ps_state;
if (psdata->ps_state != PS_STATE_AWAKE) {
mutex_lock(&psdata->ps_lock);
ps_state = psdata->ps_state;
mutex_unlock(&psdata->ps_lock);
if (ps_state != PS_STATE_AWAKE) {
psdata->ps_cmd = PS_CMD_EXIT_PS;
schedule_work(&psdata->work);
return true;
}
return false;
}
static int send_ps_cmd(struct hci_dev *hdev, void *data)
@ -1171,7 +1187,6 @@ static struct sk_buff *nxp_dequeue(void *data)
{
struct btnxpuart_dev *nxpdev = (struct btnxpuart_dev *)data;
ps_wakeup(nxpdev);
ps_start_timer(nxpdev);
return skb_dequeue(&nxpdev->txq);
}
@ -1186,6 +1201,9 @@ static void btnxpuart_tx_work(struct work_struct *work)
struct sk_buff *skb;
int len;
if (ps_wakeup(nxpdev))
return;
while ((skb = nxp_dequeue(nxpdev))) {
len = serdev_device_write_buf(serdev, skb->data, skb->len);
hdev->stat.byte_tx += len;
@ -1234,6 +1252,9 @@ static int btnxpuart_close(struct hci_dev *hdev)
ps_wakeup(nxpdev);
serdev_device_close(nxpdev->serdev);
skb_queue_purge(&nxpdev->txq);
kfree_skb(nxpdev->rx_skb);
nxpdev->rx_skb = NULL;
clear_bit(BTNXPUART_SERDEV_OPEN, &nxpdev->tx_state);
return 0;
}

View File

@ -69,6 +69,7 @@ enum btrtl_chip_id {
CHIP_ID_8852B = 20,
CHIP_ID_8852C = 25,
CHIP_ID_8851B = 36,
CHIP_ID_8852BT = 47,
};
struct id_table {
@ -307,6 +308,15 @@ static const struct id_table ic_id_table[] = {
.fw_name = "rtl_bt/rtl8851bu_fw",
.cfg_name = "rtl_bt/rtl8851bu_config",
.hw_info = "rtl8851bu" },
/* 8852BT/8852BE-VT */
{ IC_INFO(RTL_ROM_LMP_8852A, 0x87, 0xc, HCI_USB),
.config_needed = false,
.has_rom_version = true,
.has_msft_ext = true,
.fw_name = "rtl_bt/rtl8852btu_fw",
.cfg_name = "rtl_bt/rtl8852btu_config",
.hw_info = "rtl8852btu" },
};
static const struct id_table *btrtl_match_ic(u16 lmp_subver, u16 hci_rev,
@ -645,6 +655,7 @@ static int rtlbt_parse_firmware(struct hci_dev *hdev,
{ RTL_ROM_LMP_8852A, 20 }, /* 8852B */
{ RTL_ROM_LMP_8852A, 25 }, /* 8852C */
{ RTL_ROM_LMP_8851B, 36 }, /* 8851B */
{ RTL_ROM_LMP_8852A, 47 }, /* 8852BT */
};
if (btrtl_dev->fw_len <= 8)
@ -1275,6 +1286,7 @@ void btrtl_set_quirks(struct hci_dev *hdev, struct btrtl_device_info *btrtl_dev)
case CHIP_ID_8852B:
case CHIP_ID_8852C:
case CHIP_ID_8851B:
case CHIP_ID_8852BT:
set_bit(HCI_QUIRK_VALID_LE_STATES, &hdev->quirks);
set_bit(HCI_QUIRK_WIDEBAND_SPEECH_SUPPORTED, &hdev->quirks);
@ -1505,6 +1517,8 @@ MODULE_FIRMWARE("rtl_bt/rtl8852bs_fw.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852bs_config.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852bu_fw.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852bu_config.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852btu_fw.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852btu_config.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852cu_fw.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852cu_fw_v2.bin");
MODULE_FIRMWARE("rtl_bt/rtl8852cu_config.bin");

View File

@ -553,6 +553,9 @@ static const struct usb_device_id quirks_table[] = {
{ USB_DEVICE(0x13d3, 0x3572), .driver_info = BTUSB_REALTEK |
BTUSB_WIDEBAND_SPEECH },
/* Realtek 8852BT/8852BE-VT Bluetooth devices */
{ USB_DEVICE(0x0bda, 0x8520), .driver_info = BTUSB_REALTEK |
BTUSB_WIDEBAND_SPEECH },
/* Realtek Bluetooth devices */
{ USB_VENDOR_AND_INTERFACE_INFO(0x0bda, 0xe0, 0x01, 0x01),
.driver_info = BTUSB_REALTEK },
@ -655,6 +658,11 @@ static const struct usb_device_id quirks_table[] = {
BTUSB_WIDEBAND_SPEECH |
BTUSB_VALID_LE_STATES },
/* Additional MediaTek MT7925 Bluetooth devices */
{ USB_DEVICE(0x13d3, 0x3602), .driver_info = BTUSB_MEDIATEK |
BTUSB_WIDEBAND_SPEECH |
BTUSB_VALID_LE_STATES },
/* Additional Realtek 8723AE Bluetooth devices */
{ USB_DEVICE(0x0930, 0x021d), .driver_info = BTUSB_REALTEK },
{ USB_DEVICE(0x13d3, 0x3394), .driver_info = BTUSB_REALTEK },
@ -3080,7 +3088,7 @@ static int btusb_mtk_setup(struct hci_dev *hdev)
int err, status;
u32 dev_id = 0;
char fw_bin_name[64];
u32 fw_version = 0;
u32 fw_version = 0, fw_flavor = 0;
u8 param;
struct btmediatek_data *mediatek;
@ -3103,6 +3111,11 @@ static int btusb_mtk_setup(struct hci_dev *hdev)
bt_dev_err(hdev, "Failed to get fw version (%d)", err);
return err;
}
err = btusb_mtk_id_get(data, 0x70010020, &fw_flavor);
if (err < 0) {
bt_dev_err(hdev, "Failed to get fw flavor (%d)", err);
return err;
}
}
mediatek = hci_get_priv(hdev);
@ -3127,6 +3140,10 @@ static int btusb_mtk_setup(struct hci_dev *hdev)
snprintf(fw_bin_name, sizeof(fw_bin_name),
"mediatek/mt%04x/BT_RAM_CODE_MT%04x_1_%x_hdr.bin",
dev_id & 0xffff, dev_id & 0xffff, (fw_version & 0xff) + 1);
else if (dev_id == 0x7961 && fw_flavor)
snprintf(fw_bin_name, sizeof(fw_bin_name),
"mediatek/BT_RAM_CODE_MT%04x_1a_%x_hdr.bin",
dev_id & 0xffff, (fw_version & 0xff) + 1);
else
snprintf(fw_bin_name, sizeof(fw_bin_name),
"mediatek/BT_RAM_CODE_MT%04x_1_%x_hdr.bin",
@ -3273,7 +3290,6 @@ static int btusb_recv_acl_mtk(struct hci_dev *hdev, struct sk_buff *skb)
{
struct btusb_data *data = hci_get_drvdata(hdev);
u16 handle = le16_to_cpu(hci_acl_hdr(skb)->handle);
struct sk_buff *skb_cd;
switch (handle) {
case 0xfc6f: /* Firmware dump from device */
@ -3286,9 +3302,12 @@ static int btusb_recv_acl_mtk(struct hci_dev *hdev, struct sk_buff *skb)
* for backward compatibility, so we have to clone the packet
* extraly for the in-kernel coredump support.
*/
skb_cd = skb_clone(skb, GFP_ATOMIC);
if (skb_cd)
btmtk_process_coredump(hdev, skb_cd);
if (IS_ENABLED(CONFIG_DEV_COREDUMP)) {
struct sk_buff *skb_cd = skb_clone(skb, GFP_ATOMIC);
if (skb_cd)
btmtk_process_coredump(hdev, skb_cd);
}
fallthrough;
case 0x05ff: /* Firmware debug logging 1 */
@ -4481,6 +4500,7 @@ static int btusb_probe(struct usb_interface *intf,
set_bit(HCI_QUIRK_BROKEN_READ_TRANSMIT_POWER, &hdev->quirks);
set_bit(HCI_QUIRK_BROKEN_SET_RPA_TIMEOUT, &hdev->quirks);
set_bit(HCI_QUIRK_BROKEN_EXT_SCAN, &hdev->quirks);
set_bit(HCI_QUIRK_BROKEN_READ_ENC_KEY_SIZE, &hdev->quirks);
}
if (!reset)

View File

@ -113,6 +113,7 @@ struct h5_vnd {
int (*suspend)(struct h5 *h5);
int (*resume)(struct h5 *h5);
const struct acpi_gpio_mapping *acpi_gpio_map;
int sizeof_priv;
};
struct h5_device_data {
@ -863,7 +864,8 @@ static int h5_serdev_probe(struct serdev_device *serdev)
if (IS_ERR(h5->device_wake_gpio))
return PTR_ERR(h5->device_wake_gpio);
return hci_uart_register_device(&h5->serdev_hu, &h5p);
return hci_uart_register_device_priv(&h5->serdev_hu, &h5p,
h5->vnd->sizeof_priv);
}
static void h5_serdev_remove(struct serdev_device *serdev)
@ -1070,6 +1072,7 @@ static struct h5_vnd rtl_vnd = {
.suspend = h5_btrtl_suspend,
.resume = h5_btrtl_resume,
.acpi_gpio_map = acpi_btrtl_gpios,
.sizeof_priv = sizeof(struct btrealtek_data),
};
static const struct h5_device_data h5_data_rtl8822cs = {

Some files were not shown because too many files have changed in this diff Show More