.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. Its goals are to:

* Provide alert debug information.
* Enable self healing.
* Provide a way to gather all needed debugging information if a problem
  needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

* Recovery procedures
* Diagnostics procedures
* Object dump procedures
* Out Of Box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health will perform the following actions:

* A log is sent to the kernel trace events buffer
* Health status and statistics are updated for the reporter instance
* An object dump is taken and saved at the reporter instance (as long as
  auto-dump is set and there is no other dump which is already stored)
* An auto recovery attempt is made, depending on:

  - Auto-recovery configuration
  - Grace period vs. time passed since last recovery

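The decision flow above can be sketched as a small userspace model. This is
only an illustration of the conditions listed in this section, not the kernel
implementation: ``struct toy_reporter``, ``toy_health_report()`` and all field
names are hypothetical, time is passed in as plain milliseconds, and the trace
event step is left out.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, simplified reporter state; not the kernel's structure. */
    struct toy_reporter {
        bool auto_dump;             /* take a dump automatically on error */
        bool auto_recover;          /* attempt recovery automatically on error */
        uint64_t grace_period_ms;   /* minimum time between auto recoveries */
        uint64_t last_recovery_ms;  /* time of the last recovery */
        bool has_recovered;         /* true once any recovery has happened */
        bool dump_stored;           /* a dump is already saved */
        bool healthy;               /* current health status */
        unsigned int error_count;   /* statistics: errors reported */
        unsigned int recover_count; /* statistics: recoveries performed */
    };

    /*
     * Model the actions taken for one error report at time now_ms.
     * Returns true if an auto recovery attempt was made.
     */
    static bool toy_health_report(struct toy_reporter *r, uint64_t now_ms)
    {
        assert(r != NULL);

        /* Update health status and statistics. */
        r->healthy = false;
        r->error_count++;

        /* Take a dump only if auto-dump is set and no dump is stored. */
        if (r->auto_dump && !r->dump_stored)
            r->dump_stored = true;

        /* Auto recovery depends on configuration... */
        if (!r->auto_recover)
            return false;

        /* ...and on the grace period vs. time since the last recovery. */
        if (r->has_recovered &&
            now_ms - r->last_recovery_ms < r->grace_period_ms)
            return false;

        r->last_recovery_ms = now_ms;
        r->has_recovered = true;
        r->recover_count++;
        r->healthy = true;
        return true;
    }

In this model, with a grace period of 500 ms, a second error reported 200 ms
after a recovery updates the statistics but does not trigger another recovery
attempt, while an error reported 600 ms after it does.
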
Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's callback
to fill the data in using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
JSON-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which
devlink later translates to a netlink message. When devlink needs to send
the data to the netlink layer using SKBs, it fragments the data between
different SKBs. To make this fragmentation possible, it uses virtual nest
attributes instead of actual nesting, which cannot be divided between
different SKBs.

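To illustrate what "JSON-like" nesting of objects, name/value pairs and value
arrays means, the following userspace sketch builds such a message into a
string buffer. It is a toy stand-in and does not use the kernel fmsg API:
``struct toy_fmsg``, every ``toy_*`` function and the example attribute names
(``sqn``, ``ring_sizes``) are hypothetical, and no SKB fragmentation is
modelled.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical userspace stand-in; not the kernel's devlink_fmsg. */
    struct toy_fmsg {
        char buf[256];
        size_t len;
        int need_sep;   /* non-zero if the next element needs ", " first */
    };

    /* Append a string if it fits; silently drop it otherwise. */
    static void toy_append(struct toy_fmsg *m, const char *s)
    {
        size_t n = strlen(s);

        assert(m != NULL);
        if (m->len + n < sizeof(m->buf)) {
            memcpy(m->buf + m->len, s, n + 1);
            m->len += n;
        }
    }

    /* Emit a separator between sibling elements. */
    static void toy_sep(struct toy_fmsg *m)
    {
        if (m->need_sep)
            toy_append(m, ", ");
    }

    static void toy_obj_start(struct toy_fmsg *m)
    {
        toy_sep(m);
        toy_append(m, "{");
        m->need_sep = 0;
    }

    static void toy_obj_end(struct toy_fmsg *m)
    {
        toy_append(m, "}");
        m->need_sep = 1;
    }

    /* Name/value pair with an unsigned integer value. */
    static void toy_pair_u32(struct toy_fmsg *m, const char *name, unsigned int v)
    {
        char tmp[64];

        toy_sep(m);
        snprintf(tmp, sizeof(tmp), "\"%s\": %u", name, v);
        toy_append(m, tmp);
        m->need_sep = 1;
    }

    /* Start a named value array. */
    static void toy_arr_start(struct toy_fmsg *m, const char *name)
    {
        char tmp[64];

        toy_sep(m);
        snprintf(tmp, sizeof(tmp), "\"%s\": [", name);
        toy_append(m, tmp);
        m->need_sep = 0;
    }

    static void toy_arr_end(struct toy_fmsg *m)
    {
        toy_append(m, "]");
        m->need_sep = 1;
    }

    /* Bare value, used inside an array. */
    static void toy_u32(struct toy_fmsg *m, unsigned int v)
    {
        char tmp[16];

        toy_sep(m);
        snprintf(tmp, sizeof(tmp), "%u", v);
        toy_append(m, tmp);
        m->need_sep = 1;
    }

    /* Plays the role of a driver callback filling the message. */
    static void toy_diagnose(struct toy_fmsg *m)
    {
        toy_obj_start(m);
        toy_pair_u32(m, "sqn", 12);
        toy_arr_start(m, "ring_sizes");
        toy_u32(m, 256);
        toy_u32(m, 512);
        toy_arr_end(m);
        toy_obj_end(m);
    }

Calling ``toy_diagnose()`` on a zero-initialized ``struct toy_fmsg`` leaves
``{"sqn": 12, "ring_sizes": [256, 512]}`` in ``buf``. A real driver would use
the devlink fmsg API declared in ``include/net/devlink.h`` instead.
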
User Interface
==============

The user can access/change each reporter's parameters and driver specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

* Configure reporter's generic parameters (like: disable/enable auto recovery)
* Invoke recovery procedure
* Run diagnostics
* Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

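The single-dump behaviour of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be modelled in a few lines of
userspace C. This is a sketch only: ``struct toy_dump_slot``,
``toy_dump_get()`` and ``toy_dump_clear()`` are hypothetical names, a dump is
represented by a simple counter value, and the model assumes that a dump
generated on request is kept as the stored dump.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical single-dump slot for one reporter. */
    struct toy_dump_slot {
        bool stored;            /* a dump is currently saved */
        unsigned int dump_id;   /* identifies the saved dump */
        unsigned int generated; /* number of dumps generated so far */
    };

    /*
     * DUMP_GET model: return the stored dump, or generate a new one
     * if nothing is stored for this reporter.
     */
    static unsigned int toy_dump_get(struct toy_dump_slot *s)
    {
        assert(s != NULL);
        if (!s->stored) {
            s->dump_id = ++s->generated;
            s->stored = true;
        }
        return s->dump_id;
    }

    /* DUMP_CLEAR model: drop the saved dump. */
    static void toy_dump_clear(struct toy_dump_slot *s)
    {
        assert(s != NULL);
        s->stored = false;
    }

In this model, two consecutive get requests return the same dump, and a new
dump is generated only after the stored one has been cleared.
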
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+