docs: networking: Describe irq suspension
Describe irq suspension, the epoll ioctls, and the tradeoffs of using
different gro_flush_timeout values.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://patch.msgid.link/20241109050245.191288-7-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
@@ -192,6 +192,33 @@ is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.

The above parameters can also be set on a per-NAPI basis using netlink via
netdev-genl. When used with netlink and configured on a per-NAPI basis, the
parameters mentioned above use hyphens instead of underscores:
``gro-flush-timeout`` and ``napi-defer-hard-irqs``.

Per-NAPI configuration can be done programmatically in a user application
or by using a script included in the kernel source tree:
``tools/net/ynl/cli.py``.

For example, using the script:

.. code-block:: bash

  $ kernel-source/tools/net/ynl/cli.py \
            --spec Documentation/netlink/specs/netdev.yaml \
            --do napi-set \
            --json='{"id": 345,
                     "defer-hard-irqs": 111,
                     "gro-flush-timeout": 11111}'

Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
via netdev-genl. There is no global sysfs parameter for this value.

``irq-suspend-timeout`` is used to determine how long an application can
completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
which can be set on a per-epoll context basis with the ``EPIOCSPARAMS`` ioctl.

.. _poll:

Busy polling
@@ -207,6 +234,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists.

epoll-based busy polling
------------------------

It is possible to trigger packet processing directly from calls to
``epoll_wait``. In order to use this feature, a user application must ensure
all file descriptors which are added to an epoll context have the same NAPI ID.

If the application uses a dedicated acceptor thread, the application can obtain
the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
distribute that file descriptor to a worker thread. The worker thread would add
the file descriptor to its epoll context. This would ensure each worker thread
has an epoll context with FDs that have the same NAPI ID.
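
For illustration, the following is a minimal sketch (not part of this patch) of
the acceptor-thread approach: the acceptor reads the NAPI ID of each accepted
connection with SO_INCOMING_NAPI_ID and hands the file descriptor to the worker
thread owning the matching epoll context. The ``dispatch_to_worker()`` helper is
hypothetical and stands in for whatever handoff mechanism the application uses;
the sketch assumes a libc that exposes SO_INCOMING_NAPI_ID, and error handling
is trimmed.

.. code-block:: c

  #include <stdio.h>
  #include <sys/socket.h>

  /* hypothetical stand-in for handing the fd to the right worker thread */
  static void dispatch_to_worker(int fd, unsigned int napi_id)
  {
      printf("fd %d -> worker handling NAPI ID %u\n", fd, napi_id);
  }

  void acceptor_loop(int listen_fd)
  {
      for (;;) {
          int fd = accept(listen_fd, NULL, NULL);
          unsigned int napi_id = 0;
          socklen_t len = sizeof(napi_id);

          if (fd < 0)
              continue;

          /* NAPI ID of the RX queue that delivered this connection */
          if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                         &napi_id, &len) == 0)
              dispatch_to_worker(fd, napi_id);
      }
  }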

Alternatively, if the application uses SO_REUSEPORT, a BPF program can be
inserted to distribute incoming connections to threads such that each thread
is only given incoming connections with the same NAPI ID. Care must be taken
to handle cases where a system may have multiple NICs.

In order to enable busy polling, there are two choices:

1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to
   busy loop waiting for events. This is a system-wide setting and will cause
   all epoll-based applications to busy poll when they call epoll_wait. This
   may not be desirable as many applications may not have the need to busy
   poll.

2. Applications using recent kernels can issue an ioctl on the epoll context
   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
   epoll_params``, which user programs can define as follows:

.. code-block:: c

  struct epoll_params {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t prefer_busy_poll;

      /* pad the struct to a multiple of 64bits */
      uint8_t __pad;
  };
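
As an illustration of choice 2 (a sketch, not code from the kernel tree), the
parameters can be set and read back on an existing epoll file descriptor as
shown below. This assumes a libc/kernel-header combination that exposes
``EPIOCSPARAMS``, ``EPIOCGPARAMS`` and ``struct epoll_params`` through
``<sys/epoll.h>``; the timing and budget values are purely illustrative.

.. code-block:: c

  #include <stdio.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  int enable_epoll_busy_poll(int epoll_fd)
  {
      struct epoll_params params = {
          .busy_poll_usecs = 64,  /* illustrative busy poll duration */
          .busy_poll_budget = 8,  /* illustrative packet budget */
          .prefer_busy_poll = 0,  /* see the IRQ mitigation section */
      };

      if (ioctl(epoll_fd, EPIOCSPARAMS, &params) == -1) {
          perror("EPIOCSPARAMS");
          return -1;
      }

      /* read back what the kernel accepted */
      if (ioctl(epoll_fd, EPIOCGPARAMS, &params) == -1) {
          perror("EPIOCGPARAMS");
          return -1;
      }

      printf("busy_poll_usecs=%u budget=%u prefer_busy_poll=%u\n",
             (unsigned int)params.busy_poll_usecs,
             (unsigned int)params.busy_poll_budget,
             (unsigned int)params.prefer_busy_poll);
      return 0;
  }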

IRQ mitigation
---------------

@@ -222,12 +289,111 @@ Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
busy polling applications, the ``prefer_busy_poll`` field of ``struct
epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
enable this mode. See the above section for more details.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
applications, the ``busy_poll_budget`` field can be adjusted to the desired
value in ``struct epoll_params`` and set on a specific epoll context using
the ``EPIOCSPARAMS`` ioctl. See the above section for more details.
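
As an aside, the following is a minimal sketch (illustrative only, not from
this patch) of how the pledge and budget discussed above might be set with
plain socket options for applications that busy poll on individual sockets
rather than through epoll. The budget value is illustrative and the options
require a libc that exposes SO_PREFER_BUSY_POLL and SO_BUSY_POLL_BUDGET.

.. code-block:: c

  #include <stdio.h>
  #include <sys/socket.h>

  int pledge_busy_poll(int sock_fd)
  {
      int prefer = 1;  /* pledge periodic busy polling; keep IRQs masked */
      int budget = 64; /* illustrative per-poll packet budget */

      if (setsockopt(sock_fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                     &prefer, sizeof(prefer)) == -1) {
          perror("SO_PREFER_BUSY_POLL");
          return -1;
      }

      if (setsockopt(sock_fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                     &budget, sizeof(budget)) == -1) {
          perror("SO_BUSY_POLL_BUDGET");
          return -1;
      }

      return 0;
  }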

It is important to note that choosing a large value for ``gro_flush_timeout``
will defer IRQs to allow for better batch processing, but will induce latency
when the system is not fully loaded. Choosing a small value for
``gro_flush_timeout`` can cause device IRQs and softirq processing to
interfere with a user application that is attempting to busy poll. This value
should be chosen carefully with these tradeoffs in mind. epoll-based busy
polling applications may be able to mitigate how much user processing happens
by choosing an appropriate value for ``maxevents``.

Users may want to consider an alternate approach, IRQ suspension, to help deal
with these tradeoffs.

IRQ suspension
--------------

IRQ suspension is a mechanism wherein device IRQs are masked while epoll
triggers NAPI packet processing.

While application calls to epoll_wait successfully retrieve events, the kernel
will defer the IRQ suspension timer. If the kernel does not retrieve any events
while busy polling (for example, because network traffic levels subsided), IRQ
suspension is disabled and the IRQ mitigation strategies described above are
engaged.

This allows users to balance CPU consumption with network processing
efficiency.

To use this mechanism:

1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
   maximum time (in nanoseconds) the application can have its IRQs
   suspended. This is done using netlink, as described above. This timeout
   serves as a safety mechanism to restart IRQ-driven interrupt processing if
   the application has stalled. This value should be chosen so that it covers
   the amount of time the user application needs to process data from its
   call to epoll_wait, noting that applications can control how much data
   they retrieve by setting ``maxevents`` when calling epoll_wait.

2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
   and ``napi_defer_hard_irqs`` can be set to low values. They will be used
   to defer IRQs after busy poll has found no data.

3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
   the ``EPIOCSPARAMS`` ioctl as described above.

4. The application uses epoll as described above to trigger NAPI packet
   processing (see the sketch after this list).
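
The sketch below (illustrative only, not part of this patch) ties steps 3 and 4
together, assuming steps 1 and 2 were already performed via netlink and sysfs,
and that ``EPIOCSPARAMS`` and ``struct epoll_params`` are available from
``<sys/epoll.h>``. The numeric values and the ``process_event()`` helper are
placeholders.

.. code-block:: c

  #include <stdio.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  #define MAX_EVENTS 8  /* bounds how much data one processing cycle retrieves */

  /* placeholder for application-specific work */
  static void process_event(struct epoll_event *ev)
  {
      printf("event on fd %d\n", ev->data.fd);
  }

  int suspend_irq_loop(int epoll_fd)
  {
      struct epoll_event events[MAX_EVENTS];
      struct epoll_params params = {
          .busy_poll_usecs = 64,  /* illustrative */
          .busy_poll_budget = 64, /* illustrative */
          .prefer_busy_poll = 1,  /* step 3: required for IRQ suspension */
      };

      if (ioctl(epoll_fd, EPIOCSPARAMS, &params) == -1) {
          perror("EPIOCSPARAMS");
          return -1;
      }

      /*
       * Step 4: while events keep arriving, IRQs stay suspended; an empty
       * poll hands control back to gro_flush_timeout/napi_defer_hard_irqs.
       */
      for (;;) {
          int i, n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);

          for (i = 0; i < n; i++)
              process_event(&events[i]);
      }
  }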

As mentioned above, as long as subsequent calls to epoll_wait return events to
userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
allows the application to process data without interference.

Once a call to epoll_wait results in no events being found, IRQ suspension is
automatically disabled and the ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` mitigation mechanisms take over.

It is expected that ``irq-suspend-timeout`` will be set to a value much larger
than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
the duration of one userland processing cycle.

While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
``gro_flush_timeout`` to use IRQ suspension, their use is strongly
recommended.

IRQ suspension causes the system to alternate between polling mode and
irq-driven packet delivery. During busy periods, ``irq-suspend-timeout``
overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
epoll finds no events, the setting of ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` determines the next step.

There are essentially three possible loops for network processing and
packet delivery:

1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping

Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` are set.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
and 3 "wrestle" with each other for control.

During busy periods, ``irq-suspend-timeout`` is used as the timer in Loop 2,
which essentially tilts network processing in favour of Loop 3.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
cannot take control from Loop 1.

Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.

.. _threaded: