mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-01-06 13:23:18 +00:00
xsk: add multi-buffer documentation
Add AF_XDP multi-buffer support documentation including two
pseudo-code samples.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-18-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
parent a92b96c4ae
commit 49ca37d0d8
@@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt
Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

Multi-Buffer Support
====================

With multi-buffer support, programs using AF_XDP sockets can receive
and transmit packets consisting of multiple buffers both in copy and
zero-copy mode. For example, a packet can consist of two
frames/buffers, one with the header and the other one with the data,
or a 9K Ethernet jumbo frame can be constructed by chaining together
three 4K frames.

Some definitions:

* A packet consists of one or more frames

* A descriptor in one of the AF_XDP rings always refers to a single
  frame. In the case the packet consists of a single frame, the
  descriptor refers to the whole packet.

To enable multi-buffer support for an AF_XDP socket, use the new bind
flag XDP_USE_SG. If this is not provided, all multi-buffer packets
will be dropped just as before. Note that the XDP program loaded also
needs to be in multi-buffer mode. This can be accomplished by using
"xdp.frags" as the section name of the XDP program used.

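As a rough sketch of the socket side, the flag goes into the sxdp_flags
field of the struct sockaddr_xdp that is later passed to bind(). The
helper name fill_sg_bind_addr() is hypothetical, the fallback defines
are only there for older installed headers, and UMEM setup, ring setup
and the bind() call itself are left out:

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

/* Fallbacks for older installed headers; values as in linux/if_xdp.h
 * and the socket headers.
 */
#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef XDP_USE_SG
#define XDP_USE_SG (1 << 4)
#endif

/* Hypothetical helper: prepare a bind address that requests
 * multi-buffer support. extra_flags may carry e.g. XDP_COPY or
 * XDP_ZEROCOPY.
 */
static void fill_sg_bind_addr(struct sockaddr_xdp *sxdp, unsigned int ifindex,
                              unsigned int queue_id, unsigned short extra_flags)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
    sxdp->sxdp_flags = XDP_USE_SG | extra_flags;
}
```

On the BPF side nothing but the section name changes: the program is
placed in the "xdp.frags" section instead of "xdp".
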
To represent a packet consisting of multiple frames, a new flag called
XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
descriptors. If it is true (1) the packet continues with the next
descriptor and if it is false (0) it means this is the last descriptor
of the packet. Why the reverse logic of the end-of-packet (eop) flag
found in many NICs? Just to preserve compatibility with
non-multi-buffer applications: on Rx they see this bit set to false
for all packets, and on Tx they set the options field to zero, as
anything else will be treated as an invalid descriptor.

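The flag logic can be sketched as below. struct desc is a local
stand-in with the same three fields as struct xdp_desc, the
XDP_PKT_CONTD fallback value mirrors linux/if_xdp.h, and both helper
names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)  /* value as in linux/if_xdp.h */
#endif

/* Local stand-in for struct xdp_desc (same fields). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* True if this descriptor ends its packet, i.e. XDP_PKT_CONTD is clear. */
static bool desc_is_last(const struct desc *d)
{
    return !(d->options & XDP_PKT_CONTD);
}

/* Number of descriptors in the packet starting at descs[0], or 0 if the
 * packet does not end within the n descriptors given.
 */
static size_t packet_frame_count(const struct desc *descs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (desc_is_last(&descs[i]))
            return i + 1;
    return 0;
}
```

An application that always writes options = 0 therefore keeps producing
single-descriptor packets, which is exactly the pre-multi-buffer
behaviour.
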
These are the semantics for producing packets consisting of multiple
frames onto the AF_XDP Tx ring:

* When an invalid descriptor is found, all the other
  descriptors/frames of this packet are marked as invalid and not
  completed. The next descriptor is treated as the start of a new
  packet, even if this was not the intent (because we cannot guess
  the intent). As before, if your program is producing invalid
  descriptors you have a bug that must be fixed.

* Zero length descriptors are treated as invalid descriptors.

* For copy mode, the maximum supported number of frames in a packet is
  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
  descriptors accumulated so far are dropped and treated as
  invalid. To produce an application that will work on any system
  regardless of this config setting, limit the number of frags to 18,
  as the minimum value of the config is 17.

* For zero-copy mode, the limit is up to what the NIC HW
  supports. Usually at least five on the NICs we have checked. We
  consciously chose to not enforce a rigid limit (such as
  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
  resulted in copy actions under the hood to fit into what limit the
  NIC supports, which would defeat the purpose of zero-copy mode. How
  to probe for this limit is explained in the "Probing for
  Multi-Buffer Support" section.

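A user-space pre-check that mirrors the rules above could look roughly
like this. It is a sketch under stated assumptions: struct desc stands
in for struct xdp_desc, PORTABLE_MAX_FRAGS and tx_packet_ok() are
hypothetical names, and the kernel's other descriptor checks (address
range and so on) are not reproduced:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)  /* value as in linux/if_xdp.h */
#endif

/* Copy-mode limit that holds for any CONFIG_MAX_SKB_FRAGS setting. */
#define PORTABLE_MAX_FRAGS 18

/* Local stand-in for struct xdp_desc (same fields). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* Check one packet made of n descriptors before queuing it on Tx. */
static bool tx_packet_ok(const struct desc *descs, size_t n, size_t max_frags)
{
    if (n == 0 || n > max_frags)
        return false;

    for (size_t i = 0; i < n; i++) {
        bool contd = descs[i].options & XDP_PKT_CONTD;

        /* Zero length descriptors are invalid. */
        if (descs[i].len == 0)
            return false;
        /* All but the last descriptor must set XDP_PKT_CONTD. */
        if (contd != (i + 1 < n))
            return false;
    }
    return true;
}
```

For zero-copy mode, max_frags would be the value probed from the device
instead of PORTABLE_MAX_FRAGS.
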
On the Rx path in copy-mode, the xsk core copies the XDP data into
multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
detailed before. Zero-copy mode works the same, though the data is not
copied. When the application gets a descriptor with the XDP_PKT_CONTD
flag set to one, it means that the packet consists of multiple buffers
and it continues with the next buffer in the following
descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
means that this is the last buffer of the packet. AF_XDP guarantees
that only a complete packet (all frames in the packet) is sent to the
application. If there is not enough space in the AF_XDP Rx ring, all
frames of the packet will be dropped.

If the application reads a batch of descriptors, using for example the
libxdp interfaces, it is not guaranteed that the batch will end with a
full packet. It might end in the middle of a packet and the rest of the
buffers of that packet will arrive at the beginning of the next batch,
since the libxdp interface does not read the whole ring (unless you
have an enormous batch size or a very small ring size).

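One way to cope with a batch that ends mid-packet is to keep the
reassembly state outside the per-batch loop. A minimal sketch, assuming
a local struct desc stand-in for struct xdp_desc and hypothetical names
(struct rx_state, rx_consume_batch()). It only tracks lengths and frame
counts and does not touch packet data:

```c
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)  /* value as in linux/if_xdp.h */
#endif

/* Local stand-in for struct xdp_desc (same fields). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* State that survives between batches. Zero-initialize before first use. */
struct rx_state {
    uint32_t cur_len;    /* bytes seen so far in the unfinished packet */
    uint32_t cur_frags;  /* frames seen so far in the unfinished packet */
    uint64_t pkts_done;  /* complete packets seen overall */
    uint64_t bytes_done; /* bytes in complete packets */
};

/* Consume one batch. A packet left open at the end of the batch is
 * simply continued by the next call.
 */
static void rx_consume_batch(struct rx_state *st, const struct desc *descs,
                             size_t n)
{
    for (size_t i = 0; i < n; i++) {
        st->cur_len += descs[i].len;
        st->cur_frags++;

        if (!(descs[i].options & XDP_PKT_CONTD)) {
            /* Last frame: account for the packet and reset. */
            st->pkts_done++;
            st->bytes_done += st->cur_len;
            st->cur_len = 0;
            st->cur_frags = 0;
        }
    }
}
```

A real application would store fragment addresses in the state as well,
so that the packet can be processed once its last frame arrives.
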
An example program each for Rx and Tx multi-buffer support can be found
later in this document.

Usage
-----

In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
@@ -541,6 +625,131 @@ like this:
But please use the libbpf functions as they are optimized and ready to
use. Will make your life easier.

Usage Multi-Buffer Rx
---------------------

Here is a simple Rx path pseudo-code example (using libxdp interfaces
for simplicity). Error paths have been excluded to keep it short:

.. code-block:: c

    void rx_packets(struct xsk_socket_info *xsk)
    {
        /* Static, since a packet may straddle two calls (batches). */
        static bool new_packet = true;
        u32 idx_rx = 0, idx_fq = 0;
        static char *pkt;

        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);

        /* Error path excluded: assumes rcvd fill ring entries are free. */
        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);

        for (int i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
            /* XDP_PKT_CONTD clear means this is the last frame. */
            bool eop = !(desc->options & XDP_PKT_CONTD);

            if (new_packet)
                pkt = frag;
            else
                add_frag_to_pkt(pkt, frag);

            if (eop)
                process_pkt(pkt);

            new_packet = eop;

            /* Hand the frame back to the kernel through the fill ring. */
            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
        }

        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }

Usage Multi-Buffer Tx
---------------------

Here is a Tx path pseudo-code example (using libxdp interfaces for
simplicity). It ignores that the umem is finite in size and that we
eventually will run out of packets to send. It also assumes that
pkts.addr points to a valid location in the umem.

.. code-block:: c

    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
                    int batch_size)
    {
        u32 idx, i, pkt_nb = 0;

        /* Error path excluded: assumes batch_size Tx entries are free. */
        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);

        for (i = 0; i < batch_size;) {
            u64 addr = pkts[pkt_nb].addr;
            u32 len = pkts[pkt_nb].size;

            /* One descriptor per frame of the current packet. */
            do {
                struct xdp_desc *tx_desc;

                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
                tx_desc->addr = addr;

                if (len > xsk_frame_size) {
                    /* More frames follow for this packet. */
                    tx_desc->len = xsk_frame_size;
                    tx_desc->options = XDP_PKT_CONTD;
                } else {
                    /* Last frame of the packet. */
                    tx_desc->len = len;
                    tx_desc->options = 0;
                    pkt_nb++;
                }
                len -= tx_desc->len;
                addr += xsk_frame_size;

                if (i == batch_size) {
                    /* Remember len, addr, pkt_nb for next iteration.
                     * Skipped for simplicity.
                     */
                    break;
                }
            } while (len);
        }

        xsk_ring_prod__submit(&xsk->tx, i);
    }

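The state that the comment above skips could be kept in a small cursor.
The following is a sketch only: struct desc and struct pkt are local
stand-ins (struct pkt with the addr and size fields the pseudo-code
uses), struct tx_cursor and fill_tx_batch() are hypothetical names, and
the descriptors are written to a plain array instead of the Tx ring:

```c
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)  /* value as in linux/if_xdp.h */
#endif

/* Local stand-ins for struct xdp_desc and the example's struct pkt. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

struct pkt {
    uint64_t addr;
    uint32_t size;
};

/* Where the previous batch stopped. Zero-initialize before first use. */
struct tx_cursor {
    size_t pkt_nb;  /* packet currently being sent */
    uint32_t off;   /* bytes of that packet already queued */
};

/* Fill up to batch_size descriptors, resuming mid-packet if needed.
 * Assumes every packet has a non-zero size. Returns the number of
 * descriptors written.
 */
static size_t fill_tx_batch(struct tx_cursor *cur, const struct pkt *pkts,
                            size_t nb_pkts, uint32_t frame_size,
                            struct desc *out, size_t batch_size)
{
    size_t i = 0;

    while (i < batch_size && cur->pkt_nb < nb_pkts) {
        const struct pkt *p = &pkts[cur->pkt_nb];
        uint32_t left = p->size - cur->off;

        out[i].addr = p->addr + cur->off;
        if (left > frame_size) {
            /* More frames follow for this packet. */
            out[i].len = frame_size;
            out[i].options = XDP_PKT_CONTD;
            cur->off += frame_size;
        } else {
            /* Last frame: move on to the next packet. */
            out[i].len = left;
            out[i].options = 0;
            cur->pkt_nb++;
            cur->off = 0;
        }
        i++;
    }
    return i;
}
```

With a 9000 byte packet, 4096 byte frames and a batch size of two, the
first call emits two full frames with XDP_PKT_CONTD set and the second
call starts with the remaining 808 bytes.
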
Probing for Multi-Buffer Support
--------------------------------

To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
querying for XDP multi-buffer support. If XDP supports multi-buffer in
a driver, then AF_XDP will also support that in SKB and DRV mode.

To discover if a driver supports multi-buffer AF_XDP in zero-copy
mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
flag. If it is set, it means that at least zero-copy is supported and
you should go and check the netlink attribute
NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
value will be returned stating the max number of frags that are
supported by this device in zero-copy mode. These are the possible
return values:

1: Multi-buffer for zero-copy is not supported by this device, as max
   one fragment supported means that multi-buffer is not possible.

>=2: Multi-buffer is supported in zero-copy mode for this device. The
     returned number signifies the max number of frags supported.

For an example on how these are used through libbpf, please take a
look at tools/testing/selftests/bpf/xskxceiver.c.

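The decision logic itself can be sketched without any netlink plumbing.
The two ACT_* constants below are local copies that are assumed to
match the NETDEV_XDP_ACT_XSK_ZEROCOPY and NETDEV_XDP_ACT_RX_SG bits, so
verify them against your linux/netdev.h. The function names are
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of NETDEV_XDP_ACT_* bits (assumed values, verify
 * against linux/netdev.h).
 */
#define ACT_XSK_ZEROCOPY (1ULL << 3)
#define ACT_RX_SG        (1ULL << 5)

/* Multi-buffer AF_XDP in SKB or DRV mode. */
static bool mb_copy_supported(uint64_t features)
{
    return (features & ACT_RX_SG) != 0;
}

/* Max number of frags usable in zero-copy mode, or 0 if multi-buffer
 * zero-copy is not available (no zero-copy at all, or a reported
 * maximum of one segment).
 */
static uint32_t mb_zc_max_frags(uint64_t features, uint32_t zc_max_segs)
{
    if (!(features & ACT_XSK_ZEROCOPY))
        return 0;
    return zc_max_segs >= 2 ? zc_max_segs : 0;
}
```

With libbpf, the two inputs would typically come from bpf_xdp_query(),
assuming a libbpf version whose query options expose the feature flags
and the zero-copy max segments value.
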
Multi-Buffer Support for Zero-Copy Drivers
------------------------------------------

Zero-copy drivers usually use the batched APIs for Rx and Tx
processing. Note that the Tx batch API guarantees that it will provide
a batch of Tx descriptors that ends with a full packet. This is to
facilitate extending a zero-copy driver with multi-buffer support.

Sample application
==================