Documentation: networking: Add description for multi-pf netdev
Add documentation for the multi-pf netdev feature.
Describe the mlx5 implementation and design decisions.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

parent ed29705e4e
commit 77d9ec3f6c

@@ -74,6 +74,7 @@ Contents:
   mpls-sysctl
   mptcp-sysctl
   multiqueue
   multi-pf-netdev
   napi
   net_cachelines/index
   netconsole

174
Documentation/networking/multi-pf-netdev.rst
Normal file
@@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. This is done either through a connection
harness that splits the PCIe lanes between two cards, or by bifurcating a PCIe slot for a single
card. As a result, network traffic no longer traverses the internal bus between the sockets, which
significantly reduces overhead and latency, in addition to reducing CPU utilization and increasing
network throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like the PCI
function, sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to remain close to the
device and achieve improved performance.

mlx5 implementation
===================

Multi-PF, or Socket-direct, in mlx5 is achieved by grouping together PFs that belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices. A proper configuration would
utilize the closest NUMA node when working on a certain app/CPU.

We pick one PF to be the primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).
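
The netdev lifecycle described above can be pictured with a small toy model. This is only a
sketch for illustration, not the driver's code: the class name ``MultiPfGroup``, its methods, and
the use of PCI addresses as PF identifiers are all hypothetical.

```python
class MultiPfGroup:
    """Toy model of a Multi-PF group (hypothetical names, not mlx5 code)."""

    def __init__(self, expected_pfs=2):
        # The document states that up to two PFs are currently supported.
        self.expected_pfs = expected_pfs
        self.probed = set()
        self.netdev_up = False

    def probe(self, pf):
        """Register a PF; the single netdev appears once all PFs are probed."""
        self.probed.add(pf)
        if len(self.probed) == self.expected_pfs:
            self.netdev_up = True

    def remove(self, pf):
        """Unregister a PF; the netdev is destroyed when any member PF goes away."""
        if pf in self.probed:
            self.probed.discard(pf)
            self.netdev_up = False


if __name__ == "__main__":
    group = MultiPfGroup()
    group.probe("0000:08:00.0")
    print(group.netdev_up)   # only one PF probed so far
    group.probe("0000:09:00.0")
    print(group.netdev_up)   # all PFs probed
    group.remove("0000:08:00.0")
    print(group.netdev_up)   # any removal tears the netdev down
```

The model is deliberately symmetric: creation waits for the whole group, while teardown is
triggered by the first removal.
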

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+

We prefer round-robin because it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats persist across channel closure, changing the mapping every time would make the
accumulated stats less representative of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".
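
The stability argument can be checked with a few lines of Python. This is a minimal sketch, not
driver code: ``channel_map`` expresses the round-robin policy as ``ch_idx % num_pfs``, and
``block_map`` is a hypothetical contiguous split included only for contrast.

```python
def pf_index_for_channel(ch_idx, num_pfs):
    """Round-robin policy: channel i works against PF (i mod num_pfs)."""
    if num_pfs <= 0:
        raise ValueError("num_pfs must be positive")
    return ch_idx % num_pfs


def channel_map(num_channels, num_pfs):
    """PF index for every channel under the round-robin policy."""
    return [pf_index_for_channel(ch, num_pfs) for ch in range(num_channels)]


def block_map(num_channels, num_pfs):
    """Hypothetical alternative: give each PF one contiguous block of channels."""
    per_pf = -(-num_channels // num_pfs)  # ceiling division
    return [ch // per_pf for ch in range(num_channels)]


if __name__ == "__main__":
    print(channel_map(5, 2))    # same mapping as the table above
    print(channel_map(8, 2))    # first five entries are unchanged
    print(block_map(5, 2))
    print(block_map(8, 2))      # channel 3 moves to a different PF
```

With round-robin, growing from 5 to 8 channels leaves the PF of channels 0-4 untouched. With the
contiguous split, channel 3 changes PF, which is the kind of remapping that would blur the
per-channel stats history.
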

Observability
=============
The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp3@pci:0000:08:00.0
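
The three outputs can be joined by hand: queue to NAPI ID, NAPI ID to IRQ, IRQ to PCI function.
The sketch below does that join over the sample data shown above, hard-coded rather than read from
a live system. The helper name ``queue_to_pci`` is hypothetical.

```python
# Sample data copied from the example output above.
queues = [
    {"id": q, "ifindex": 13, "napi-id": 539 + q, "type": qtype}
    for qtype in ("rx", "tx")
    for q in range(5)
]
napis = [
    {"id": 543, "ifindex": 13, "irq": 42},
    {"id": 542, "ifindex": 13, "irq": 41},
    {"id": 541, "ifindex": 13, "irq": 40},
    {"id": 540, "ifindex": 13, "irq": 39},
    {"id": 539, "ifindex": 13, "irq": 36},
]
irq_names = {
    36: "mlx5_comp1@pci:0000:08:00.0",
    39: "mlx5_comp1@pci:0000:09:00.0",
    40: "mlx5_comp2@pci:0000:08:00.0",
    41: "mlx5_comp2@pci:0000:09:00.0",
    42: "mlx5_comp3@pci:0000:08:00.0",
}


def queue_to_pci(queues, napis, irq_names):
    """Return (type, queue id, napi id, irq, PCI address) for every queue."""
    irq_of_napi = {n["id"]: n["irq"] for n in napis}
    rows = []
    for q in queues:
        irq = irq_of_napi[q["napi-id"]]
        # The IRQ directory name ends with "@pci:<address>".
        pci = irq_names[irq].split("@pci:", 1)[1]
        rows.append((q["type"], q["id"], q["napi-id"], irq, pci))
    return rows


if __name__ == "__main__":
    for row in queue_to_pci(queues, napis, irq_names):
        print(row)
```

For the sample data, consecutive queue indices alternate between the two PCI functions, which is
the round-robin channel distribution seen from user space.
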

Steering
========
Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
the PF on the same node as the CPU.

XPS default config example::

  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11
  NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
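
The masks in the list above follow a simple pattern, which the sketch below reproduces. It is an
illustration of the example only, not the driver's XPS code. It assumes PF ``n`` sits on NUMA node
``n``, that each node owns a contiguous CPU range of equal size, and that there is one Tx queue per
CPU. The function name ``default_xps_masks`` is hypothetical.

```python
def default_xps_masks(cpus_per_node, num_pfs):
    """Return {tx queue index: hex CPU mask} for the layout in the example.

    Assumptions (not taken from driver code): PF n is on NUMA node n, node n
    owns CPUs [n * cpus_per_node, (n + 1) * cpus_per_node), one Tx queue per CPU.
    """
    total_cpus = cpus_per_node * num_pfs
    width = (total_cpus + 3) // 4          # hex digits needed for all CPUs
    masks = {}
    for txq in range(total_cpus):
        pf = txq % num_pfs                 # round-robin channel -> PF
        local = txq // num_pfs             # position among that PF's queues
        cpu = pf * cpus_per_node + local   # a CPU on the PF's own node
        masks[txq] = format(1 << cpu, "0{}x".format(width))
    return masks


if __name__ == "__main__":
    for txq, mask in default_xps_masks(12, 2).items():
        print("/sys/class/net/eth2/queues/tx-{}/xps_cpus:{}".format(txq, mask))
```

Even-numbered queues get a single bit in the low 12 bits (node0 CPUs) and odd-numbered queues a
single bit in the high 12 bits (node1 CPUs), so a sender on either node is matched to an SQ of its
local PF.
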

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.