KVM/arm64 fixes for 5.17, take #2

- A couple of fixes when handling an exception while a SError has been
  delivered

- Workaround for Cortex-A510's single-step erratum

Merge tag 'kvmarm-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

Paolo Bonzini 2022-02-05 00:58:25 -05:00
commit 7e6a6b400d
10509 changed files with 453875 additions and 215194 deletions

@ -216,7 +216,6 @@ ForEachMacros:
- 'for_each_migratetype_order'
- 'for_each_msi_entry'
- 'for_each_msi_entry_safe'
- 'for_each_msi_vector'
- 'for_each_net'
- 'for_each_net_continue_reverse'
- 'for_each_netdev'

@ -10,10 +10,12 @@
# Please keep this list dictionary sorted.
#
Aaron Durbin <adurbin@google.com>
Abhinav Kumar <quic_abhinavk@quicinc.com> <abhinavk@codeaurora.org>
Adam Oldham <oldhamca@gmail.com>
Adam Radford <aradford@gmail.com>
Adriana Reus <adi.reus@gmail.com> <adriana.reus@intel.com>
Adrian Bunk <bunk@stusta.de>
Akhil P Oommen <quic_akhilpo@quicinc.com> <akhilpo@codeaurora.org>
Alan Cox <alan@lxorguk.ukuu.org.uk>
Alan Cox <root@hraefn.swansea.linux.org.uk>
Aleksandar Markovic <aleksandar.markovic@mips.com> <aleksandar.markovic@imgtec.com>
@ -42,6 +44,7 @@ Andrew Vasquez <andrew.vasquez@qlogic.com>
Andrey Konovalov <andreyknvl@gmail.com> <andreyknvl@google.com>
Andrey Ryabinin <ryabinin.a.a@gmail.com> <a.ryabinin@samsung.com>
Andrey Ryabinin <ryabinin.a.a@gmail.com> <aryabinin@virtuozzo.com>
Andrzej Hajda <andrzej.hajda@intel.com> <a.hajda@samsung.com>
Andy Adamson <andros@citi.umich.edu>
Antoine Tenart <atenart@kernel.org> <antoine.tenart@bootlin.com>
Antoine Tenart <atenart@kernel.org> <antoine.tenart@free-electrons.com>
@ -67,6 +70,7 @@ Boris Brezillon <bbrezillon@kernel.org> <boris.brezillon@bootlin.com>
Boris Brezillon <bbrezillon@kernel.org> <boris.brezillon@free-electrons.com>
Brian Avery <b.avery@hp.com>
Brian King <brking@us.ibm.com>
Brian Silverman <bsilver16384@gmail.com> <brian.silverman@bluerivertech.com>
Changbin Du <changbin.du@intel.com> <changbin.du@gmail.com>
Changbin Du <changbin.du@intel.com> <changbin.du@intel.com>
Chao Yu <chao@kernel.org> <chao2.yu@samsung.com>
@ -174,6 +178,7 @@ Jeff Layton <jlayton@kernel.org> <jlayton@redhat.com>
Jens Axboe <axboe@suse.de>
Jens Osterkamp <Jens.Osterkamp@de.ibm.com>
Jernej Skrabec <jernej.skrabec@gmail.com> <jernej.skrabec@siol.net>
Jessica Zhang <quic_jesszhan@quicinc.com> <jesszhan@codeaurora.org>
Jiri Slaby <jirislaby@kernel.org> <jirislaby@gmail.com>
Jiri Slaby <jirislaby@kernel.org> <jslaby@novell.com>
Jiri Slaby <jirislaby@kernel.org> <jslaby@suse.com>
@ -193,6 +198,7 @@ Juha Yrjola <at solidboot.com>
Juha Yrjola <juha.yrjola@nokia.com>
Juha Yrjola <juha.yrjola@solidboot.com>
Julien Thierry <julien.thierry.kdev@gmail.com> <julien.thierry@arm.com>
Kalyan Thota <quic_kalyant@quicinc.com> <kalyan_t@codeaurora.org>
Kay Sievers <kay.sievers@vrfy.org>
Kees Cook <keescook@chromium.org> <kees.cook@canonical.com>
Kees Cook <keescook@chromium.org> <keescook@google.com>
@ -204,9 +210,11 @@ Kenneth W Chen <kenneth.w.chen@intel.com>
Konstantin Khlebnikov <koct9i@gmail.com> <khlebnikov@yandex-team.ru>
Konstantin Khlebnikov <koct9i@gmail.com> <k.khlebnikov@samsung.com>
Koushik <raghavendra.koushik@neterion.com>
Krishna Manikandan <quic_mkrishn@quicinc.com> <mkrishn@codeaurora.org>
Krzysztof Kozlowski <krzk@kernel.org> <k.kozlowski.k@gmail.com>
Krzysztof Kozlowski <krzk@kernel.org> <k.kozlowski@samsung.com>
Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Kuogee Hsieh <quic_khsieh@quicinc.com> <khsieh@codeaurora.org>
Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com>
Leonid I Ananiev <leonid.i.ananiev@intel.com>
Leon Romanovsky <leon@kernel.org> <leon@leon.nu>
@ -313,6 +321,7 @@ Qais Yousef <qsyousef@gmail.com> <qais.yousef@imgtec.com>
Quentin Monnet <quentin@isovalent.com> <quentin.monnet@netronome.com>
Quentin Perret <qperret@qperret.net> <quentin.perret@arm.com>
Rafael J. Wysocki <rjw@rjwysocki.net> <rjw@sisk.pl>
Rajeev Nandan <quic_rajeevny@quicinc.com> <rajeevny@codeaurora.org>
Rajesh Shah <rajesh.shah@intel.com>
Ralf Baechle <ralf@linux-mips.org>
Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
@ -327,6 +336,7 @@ Rui Saraiva <rmps@joel.ist.utl.pt>
Sachin P Sant <ssant@in.ibm.com>
Sakari Ailus <sakari.ailus@linux.intel.com> <sakari.ailus@iki.fi>
Sam Ravnborg <sam@mars.ravnborg.org>
Sankeerth Billakanti <quic_sbillaka@quicinc.com> <sbillaka@codeaurora.org>
Santosh Shilimkar <santosh.shilimkar@oracle.org>
Santosh Shilimkar <ssantosh@kernel.org>
Sarangdhar Joshi <spjoshi@codeaurora.org>

@ -315,6 +315,11 @@ S: Via Delle Palme, 9
S: Terni 05100
S: Italy
N: Ohad Ben Cohen
E: ohad@wizery.com
D: Remote Processor (remoteproc) subsystem
D: Remote Processor Messaging (rpmsg) subsystem
N: Krzysztof Benedyczak
E: golbi@mat.uni.torun.pl
W: http://www.mat.uni.torun.pl/~golbi

@ -1,22 +0,0 @@
What: /sys/class/dax/
Date: May, 2016
KernelVersion: v4.7
Contact: nvdimm@lists.linux.dev
Description: Device DAX is the device-centric analogue of Filesystem
DAX (CONFIG_FS_DAX). It allows memory ranges to be
allocated and mapped without need of an intervening file
system. Device DAX is strict, precise and predictable.
Specifically this interface:
1. Guarantees fault granularity with respect to a given
page size (pte, pmd, or pud) set at configuration time.
2. Enforces deterministic behavior by being strict about
what fault scenarios are supported.
The /sys/class/dax/ interface enumerates all the
device-dax instances in the system. The ABI is
deprecated and will be removed after 2020. It is
replaced with the DAX bus interface /sys/bus/dax/ where
device-dax instances can be found under
/sys/bus/dax/devices/

@ -0,0 +1,676 @@
What: /sys/block/<disk>/alignment_offset
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report a physical block size that is
bigger than the logical block size (for instance a drive
with 4KB physical sectors exposing 512-byte logical
blocks to the operating system). This parameter
indicates how many bytes the beginning of the device is
offset from the disk's natural alignment.
What: /sys/block/<disk>/discard_alignment
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may
internally allocate space in units that are bigger than
the exported logical block size. The discard_alignment
parameter indicates how many bytes the beginning of the
device is offset from the internal allocation unit's
natural alignment.
What: /sys/block/<disk>/diskseq
Date: February 2021
Contact: Matteo Croce <mcroce@microsoft.com>
Description:
The /sys/block/<disk>/diskseq file reports the disk
sequence number, which is a monotonically increasing
number assigned to every drive.
Some devices, like the loop device, refresh such number
every time the backing file is changed.
The value type is 64 bit unsigned.
What: /sys/block/<disk>/inflight
Date: October 2009
Contact: Jens Axboe <axboe@kernel.dk>, Nikanth Karthikesan <knikanth@suse.de>
Description:
Reports the number of I/O requests currently in progress
(pending / in flight) in a device driver. This can be less
than the number of requests queued in the block device queue.
The report contains 2 fields: one for read requests
and one for write requests.
The value type is unsigned int.
Cf. Documentation/block/stat.rst, which reports a single value for
requests in flight.
This is related to /sys/block/<disk>/queue/nr_requests
and, for SCSI devices, also to their queue_depth.
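A minimal sketch of how the two-field inflight report described above could be parsed. The `Inflight` type, the `parse_inflight` helper and the sample string are illustrative assumptions, not part of the kernel documentation.

```python
from typing import NamedTuple


class Inflight(NamedTuple):
    """In-flight request counts as reported by /sys/block/<disk>/inflight."""
    reads: int
    writes: int


def parse_inflight(text: str) -> Inflight:
    """Parse the two whitespace-separated fields: reads first, then writes."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    return Inflight(reads=int(fields[0]), writes=int(fields[1]))


# Hypothetical file content; on a real system, read the sysfs file instead.
sample = "       3        1\n"
print(parse_inflight(sample))
```

On a live system the text would come from reading `/sys/block/<disk>/inflight`, with `<disk>` replaced by an actual device name.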
What: /sys/block/<disk>/integrity/device_is_integrity_capable
Date: July 2014
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether a storage device is capable of storing
integrity metadata. Set if the device is T10 PI-capable.
What: /sys/block/<disk>/integrity/format
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Metadata format for integrity capable block device.
E.g. T10-DIF-TYPE1-CRC.
What: /sys/block/<disk>/integrity/protection_interval_bytes
Date: July 2015
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Describes the number of data bytes which are protected
by one integrity tuple. Typically the device's logical
block size.
What: /sys/block/<disk>/integrity/read_verify
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether the block layer should verify the
integrity of read requests serviced by devices that
support sending integrity metadata.
What: /sys/block/<disk>/integrity/tag_size
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Number of bytes of integrity tag space available per
512 bytes of data.
What: /sys/block/<disk>/integrity/write_generate
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether the block layer should automatically
generate checksums for write requests bound for
devices that support receiving integrity metadata.
What: /sys/block/<disk>/<partition>/alignment_offset
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report a physical block size that is
bigger than the logical block size (for instance a drive
with 4KB physical sectors exposing 512-byte logical
blocks to the operating system). This parameter
indicates how many bytes the beginning of the partition
is offset from the disk's natural alignment.
What: /sys/block/<disk>/<partition>/discard_alignment
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may
internally allocate space in units that are bigger than
the exported logical block size. The discard_alignment
parameter indicates how many bytes the beginning of the
partition is offset from the internal allocation unit's
natural alignment.
What: /sys/block/<disk>/<partition>/stat
Date: February 2008
Contact: Jerome Marchand <jmarchan@redhat.com>
Description:
The /sys/block/<disk>/<partition>/stat files display the
I/O statistics of partition <partition>. The format is the
same as the format of /sys/block/<disk>/stat.
What: /sys/block/<disk>/queue/add_random
Date: June 2010
Contact: linux-block@vger.kernel.org
Description:
[RW] This file allows turning off the disk entropy contribution.
The default value of this file is '1' (on).
What: /sys/block/<disk>/queue/chunk_sectors
Date: September 2016
Contact: Hannes Reinecke <hare@suse.com>
Description:
[RO] chunk_sectors has different meaning depending on the type
of the disk. For a RAID device (dm-raid), chunk_sectors
indicates the size in 512B sectors of the RAID volume stripe
segment. For a zoned block device, either host-aware or
host-managed, chunk_sectors indicates the size in 512B sectors
of the zones of the device, with the possible exception of the
last zone of the device which may be smaller.
What: /sys/block/<disk>/queue/dax
Date: June 2016
Contact: linux-block@vger.kernel.org
Description:
[RO] This file indicates whether the device supports Direct
Access (DAX), used by CPU-addressable storage to bypass the
pagecache. It shows '1' if true, '0' if not.
What: /sys/block/<disk>/queue/discard_granularity
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Devices that support discard functionality may internally
allocate space using units that are bigger than the logical
block size. The discard_granularity parameter indicates the size
of the internal allocation unit in bytes if reported by the
device. Otherwise the discard_granularity will be set to match
the device's physical block size. A discard_granularity of 0
means that the device does not support discard functionality.
What: /sys/block/<disk>/queue/discard_max_bytes
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RW] While discard_max_hw_bytes is the hardware limit for the
device, this setting is the software limit. Some devices exhibit
large latencies when large discards are issued; setting this
value lower will make Linux issue smaller discards and
potentially help reduce latencies induced by large discard
operations.
What: /sys/block/<disk>/queue/discard_max_hw_bytes
Date: July 2015
Contact: linux-block@vger.kernel.org
Description:
[RO] Devices that support discard functionality may have
internal limits on the number of bytes that can be trimmed or
unmapped in a single operation. The `discard_max_hw_bytes`
parameter is set by the device driver to the maximum number of
bytes that can be discarded in a single operation. Discard
requests issued to the device must not exceed this limit. A
`discard_max_hw_bytes` value of 0 means that the device does not
support discard functionality.
What: /sys/block/<disk>/queue/discard_zeroes_data
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Will always return 0. Don't rely on any specific behavior
for discards, and don't read this file.
What: /sys/block/<disk>/queue/fua
Date: May 2018
Contact: linux-block@vger.kernel.org
Description:
[RO] Whether or not the block driver supports the FUA flag for
write requests. FUA stands for Force Unit Access. If the FUA
flag is set that means that write requests must bypass the
volatile cache of the storage device.
What: /sys/block/<disk>/queue/hw_sector_size
Date: January 2008
Contact: linux-block@vger.kernel.org
Description:
[RO] This is the hardware sector size of the device, in bytes.
What: /sys/block/<disk>/queue/independent_access_ranges/
Date: October 2021
Contact: linux-block@vger.kernel.org
Description:
[RO] The presence of this sub-directory of the
/sys/block/xxx/queue/ directory indicates that the device is
capable of executing requests targeting different sector ranges
in parallel. For instance, single LUN multi-actuator hard-disks
will have an independent_access_ranges directory if the device
correctly advertises the sector ranges of its actuators.
The independent_access_ranges directory contains one directory
per access range, with each range described using the sector
(RO) attribute file to indicate the first sector of the range
and the nr_sectors (RO) attribute file to indicate the total
number of sectors in the range starting from the first sector of
the range. For example, a dual-actuator hard-disk will have the
following independent_access_ranges entries::

    $ tree /sys/block/<disk>/queue/independent_access_ranges/
    /sys/block/<disk>/queue/independent_access_ranges/
    |-- 0
    |   |-- nr_sectors
    |   `-- sector
    `-- 1
        |-- nr_sectors
        `-- sector

The sector and nr_sectors attributes use 512B sector unit,
regardless of the actual block size of the device. Independent
access ranges do not overlap and include all sectors within the
device capacity. The access ranges are numbered in increasing
order of the range start sector, that is, the sector attribute
of range 0 always has the value 0.
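The properties stated above (ranges start at sector 0, do not overlap, and cover the whole device capacity) can be checked with a short sketch. The helper names `read_ranges` and `ranges_are_consistent` are hypothetical, and the capacity value must be supplied by the caller in 512B sectors.

```python
from pathlib import Path


def read_ranges(disk: str) -> list[tuple[int, int]]:
    """Read (sector, nr_sectors) pairs in range-number order.

    `disk` is a placeholder device name; the directory only exists on
    devices that advertise independent access ranges.
    """
    base = Path("/sys/block") / disk / "queue" / "independent_access_ranges"
    ranges = []
    for entry in sorted(base.iterdir(), key=lambda p: int(p.name)):
        start = int((entry / "sector").read_text())
        count = int((entry / "nr_sectors").read_text())
        ranges.append((start, count))
    return ranges


def ranges_are_consistent(ranges, capacity_sectors: int) -> bool:
    """Check that the ranges are contiguous from sector 0 up to the capacity."""
    expected_start = 0
    for start, count in ranges:
        if start != expected_start or count <= 0:
            return False  # gap, overlap, or empty range
        expected_start = start + count
    return expected_start == capacity_sectors
```

For a dual-actuator disk, `ranges_are_consistent(read_ranges(disk), capacity)` would be expected to return `True` if the device reports its ranges as described.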
What: /sys/block/<disk>/queue/io_poll
Date: November 2015
Contact: linux-block@vger.kernel.org
Description:
[RW] When read, this file shows whether polling is enabled (1)
or disabled (0). Writing '0' to this file will disable polling
for this device. Writing any non-zero value will enable this
feature.
What: /sys/block/<disk>/queue/io_poll_delay
Date: November 2016
Contact: linux-block@vger.kernel.org
Description:
[RW] If polling is enabled, this controls what kind of polling
will be performed. It defaults to -1, which is classic polling.
In this mode, the CPU will repeatedly ask for completions
without giving up any time. If set to 0, a hybrid polling mode
is used, where the kernel will attempt to make an educated guess
at when the IO will complete. Based on this guess, the kernel
will put the process issuing IO to sleep for an amount of time,
before entering a classic poll loop. This mode might be a little
slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process
issuing IO to sleep for this amount of microseconds before
entering classic polling.
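The three cases of io_poll_delay described above can be summarized in a small sketch. The function name and the returned strings are illustrative only, and rejecting values below -1 is an assumption, since the text does not say how such values are handled.

```python
def describe_io_poll_delay(value: int) -> str:
    """Map an io_poll_delay value to the polling mode described in the text."""
    if value == -1:
        return "classic polling"
    if value == 0:
        return "hybrid polling, sleep time estimated by the kernel"
    if value > 0:
        return f"hybrid polling, fixed sleep of {value} microseconds"
    # Assumption: values below -1 are not described, so treat them as invalid.
    raise ValueError(f"unsupported io_poll_delay value: {value}")


for v in (-1, 0, 50):
    print(v, "->", describe_io_poll_delay(v))
```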
What: /sys/block/<disk>/queue/io_timeout
Date: November 2018
Contact: Weiping Zhang <zhangweiping@didiglobal.com>
Description:
[RW] io_timeout is the request timeout in milliseconds. If a
request does not complete in this time then the block driver
timeout handler is invoked. That timeout handler can decide to
retry the request, to fail it or to start a device recovery
strategy.
What: /sys/block/<disk>/queue/iostats
Date: January 2009
Contact: linux-block@vger.kernel.org
Description:
[RW] This file is used to control (on/off) the iostats
accounting of the disk.
What: /sys/block/<disk>/queue/logical_block_size
Date: May 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] This is the smallest unit the storage device can address.
It is typically 512 bytes.
What: /sys/block/<disk>/queue/max_active_zones
Date: July 2020
Contact: Niklas Cassel <niklas.cassel@wdc.com>
Description:
[RO] For zoned block devices (zoned attribute indicating
"host-managed" or "host-aware"), the sum of zones belonging to
any of the zone states: EXPLICIT OPEN, IMPLICIT OPEN or CLOSED,
is limited by this value. If this value is 0, there is no limit.
If the host attempts to exceed this limit, the driver should
report this error with BLK_STS_ZONE_ACTIVE_RESOURCE, which user
space may see as the EOVERFLOW errno.
What: /sys/block/<disk>/queue/max_discard_segments
Date: February 2017
Contact: linux-block@vger.kernel.org
Description:
[RO] The maximum number of DMA scatter/gather entries in a
discard request.
What: /sys/block/<disk>/queue/max_hw_sectors_kb
Date: September 2004
Contact: linux-block@vger.kernel.org
Description:
[RO] This is the maximum number of kilobytes supported in a
single data transfer.
What: /sys/block/<disk>/queue/max_integrity_segments
Date: September 2010
Contact: linux-block@vger.kernel.org
Description:
[RO] Maximum number of elements in a DMA scatter/gather list
with integrity data that will be submitted by the block layer
core to the associated block driver.
What: /sys/block/<disk>/queue/max_open_zones
Date: July 2020
Contact: Niklas Cassel <niklas.cassel@wdc.com>
Description:
[RO] For zoned block devices (zoned attribute indicating
"host-managed" or "host-aware"), the sum of zones belonging to
any of the zone states: EXPLICIT OPEN or IMPLICIT OPEN, is
limited by this value. If this value is 0, there is no limit.
What: /sys/block/<disk>/queue/max_sectors_kb
Date: September 2004
Contact: linux-block@vger.kernel.org
Description:
[RW] This is the maximum number of kilobytes that the block
layer will allow for a filesystem request. Must be smaller than
or equal to the maximum size allowed by the hardware.
What: /sys/block/<disk>/queue/max_segment_size
Date: March 2010
Contact: linux-block@vger.kernel.org
Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
Contact: linux-block@vger.kernel.org
Description:
[RO] Maximum number of elements in a DMA scatter/gather list
that is submitted to the associated block driver.
What: /sys/block/<disk>/queue/minimum_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Storage devices may report a granularity or preferred
minimum I/O size which is the smallest request the device can
perform without incurring a performance penalty. For disk
drives this is often the physical block size. For RAID arrays
it is often the stripe chunk size. A properly aligned multiple
of minimum_io_size is the preferred request size for workloads
where a high number of I/O operations is desired.
What: /sys/block/<disk>/queue/nomerges
Date: January 2010
Contact: linux-block@vger.kernel.org
Description:
[RW] Standard I/O elevator operations include attempts to merge
contiguous I/Os. For known random I/O loads these attempts will
always fail and result in extra cycles being spent in the
kernel. This allows one to turn off this behavior in one of two
ways: When set to 1, complex merge checks are disabled, but the
simple one-shot merges with the previous I/O request are
enabled. When set to 2, all merge tries are disabled. The
default value is 0 - which enables all types of merge tries.
What: /sys/block/<disk>/queue/nr_requests
Date: July 2003
Contact: linux-block@vger.kernel.org
Description:
[RW] This controls how many requests may be allocated in the
block layer for read or write requests. Note that the total
allocated number may be twice this amount, since it applies only
to reads or writes (not the accumulated sum).
To avoid priority inversion through request starvation, a
request queue maintains a separate request pool per each cgroup
when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
each such per-block-cgroup request pool. IOW, if there are N
block cgroups, each request queue may have up to N request
pools, each independently regulated by nr_requests.
What: /sys/block/<disk>/queue/nr_zones
Date: November 2018
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
[RO] nr_zones indicates the total number of zones of a zoned
block device ("host-aware" or "host-managed" zone model). For
regular block devices, the value is always 0.
What: /sys/block/<disk>/queue/optimal_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Storage devices may report an optimal I/O size, which is
the device's preferred unit for sustained I/O. This is rarely
reported for disk drives. For RAID arrays it is usually the
stripe width or the internal track size. A properly aligned
multiple of optimal_io_size is the preferred request size for
workloads where sustained throughput is desired. If no optimal
I/O size is reported this file contains 0.
What: /sys/block/<disk>/queue/physical_block_size
Date: May 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] This is the smallest unit a physical storage device can
write atomically. It is usually the same as the logical block
size but may be bigger. One example is SATA drives with 4KB
sectors that expose a 512-byte logical block size to the
operating system. For stacked block devices the
physical_block_size variable contains the maximum
physical_block_size of the component devices.
What: /sys/block/<disk>/queue/read_ahead_kb
Date: May 2004
Contact: linux-block@vger.kernel.org
Description:
[RW] Maximum number of kilobytes to read-ahead for filesystems
on this block device.
What: /sys/block/<disk>/queue/rotational
Date: January 2009
Contact: linux-block@vger.kernel.org
Description:
[RW] This file is used to state whether the device is of
rotational type or non-rotational type.
What: /sys/block/<disk>/queue/rq_affinity
Date: September 2008
Contact: linux-block@vger.kernel.org
Description:
[RW] If this option is '1', the block layer will migrate request
completions to the cpu "group" that originally submitted the
request. For some workloads this provides a significant
reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of
completion processing, setting this option to '2' forces the
completion to run on the requesting cpu (bypassing the "group"
aggregation logic).
What: /sys/block/<disk>/queue/scheduler
Date: October 2004
Contact: linux-block@vger.kernel.org
Description:
[RW] When read, this file will display the current and available
IO schedulers for this block device. The currently active IO
scheduler will be enclosed in [] brackets. Writing an IO
scheduler name to this file will switch control of this block
device to that new IO scheduler. Note that writing an IO
scheduler name to this file will attempt to load that IO
scheduler module, if it isn't already present in the system.
What: /sys/block/<disk>/queue/stable_writes
Date: September 2020
Contact: linux-block@vger.kernel.org
Description:
[RW] This file will contain '1' if memory must not be modified
while it is being used in a write request to this device. When
this is the case and the kernel is performing writeback of a
page, the kernel will wait for writeback to complete before
allowing the page to be modified again, rather than allowing
immediate modification as is normally the case. This
restriction arises when the device accesses the memory multiple
times where the same data must be seen every time -- for
example, once to calculate a checksum and once to actually write
the data. If no such restriction exists, this file will contain
'0'. This file is writable for testing purposes.
What: /sys/block/<disk>/queue/throttle_sample_time
Date: March 2017
Contact: linux-block@vger.kernel.org
Description:
[RW] This is the time window over which blk-throttle samples
data, in milliseconds. blk-throttle makes decisions based on the
samples. A lower time gives cgroups smoother throughput,
but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
What: /sys/block/<disk>/queue/virt_boundary_mask
Date: April 2021
Contact: linux-block@vger.kernel.org
Description:
[RO] This file shows the I/O segment memory alignment mask for
the block device. I/O requests to this device will be split
between segments wherever either the memory address of the end
of the previous segment or the memory address of the beginning
of the current segment is not aligned to virt_boundary_mask + 1
bytes.
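The split rule above amounts to a bitwise test on the two addresses. The sketch below assumes that virt_boundary_mask + 1 is a power of two, so that "aligned to virt_boundary_mask + 1 bytes" is equivalent to `address & mask == 0`; the function name is hypothetical.

```python
def needs_segment_split(prev_end: int, cur_start: int, mask: int) -> bool:
    """Return True if a split is required between two adjacent segments.

    prev_end:  memory address of the end of the previous segment
    cur_start: memory address of the beginning of the current segment
    mask:      value read from virt_boundary_mask (assumed 2**n - 1)
    """
    return (prev_end & mask) != 0 or (cur_start & mask) != 0


mask = 0xFFF  # example: 4096-byte boundary
print(needs_segment_split(0x2000, 0x5000, mask))  # both aligned
print(needs_segment_split(0x2000, 0x5200, mask))  # start not aligned
```

With a mask of 0 every address counts as aligned, so this rule never forces a split.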
What: /sys/block/<disk>/queue/wbt_lat_usec
Date: November 2016
Contact: linux-block@vger.kernel.org
Description:
[RW] If the device is registered for writeback throttling, then
this file shows the target minimum read latency. If this latency
is exceeded in a given window of time (see wb_window_usec), then
the writeback throttling will start scaling back writes. Writing
a value of '0' to this file disables the feature. Writing a
value of '-1' to this file resets the value to the default
setting.
What: /sys/block/<disk>/queue/write_cache
Date: April 2016
Contact: linux-block@vger.kernel.org
Description:
[RW] When read, this file will display whether the device has
write back caching enabled or not. It will return "write back"
for the former case, and "write through" for the latter. Writing
to this file can change the kernel's view of the device, but it
doesn't alter the device state. This means that it might not be
safe to toggle the setting from "write back" to "write through",
since that will also eliminate cache flushes issued by the
kernel.
What: /sys/block/<disk>/queue/write_same_max_bytes
Date: January 2012
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Some devices support a write same operation in which a
single data block can be written to a range of several
contiguous blocks on storage. This can be used to wipe areas on
disk or to initialize drives in a RAID configuration.
write_same_max_bytes indicates how many bytes can be written in
a single write same command. If write_same_max_bytes is 0, write
same is not supported by the device.
What: /sys/block/<disk>/queue/write_zeroes_max_bytes
Date: November 2016
Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Description:
[RO] Some devices support a write zeroes operation in which a
single request can be issued to zero out a range of contiguous
blocks on storage without having any payload in the request.
This can be used to optimize writing zeroes to the devices.
write_zeroes_max_bytes indicates how many bytes can be written
in a single write zeroes command. If write_zeroes_max_bytes is
0, write zeroes is not supported by the device.
What: /sys/block/<disk>/queue/zone_append_max_bytes
Date: May 2020
Contact: linux-block@vger.kernel.org
Description:
[RO] This is the maximum number of bytes that can be written to
a sequential zone of a zoned block device using a zone append
write operation (REQ_OP_ZONE_APPEND). This value is always 0 for
regular block devices.
What: /sys/block/<disk>/queue/zone_write_granularity
Date: January 2021
Contact: linux-block@vger.kernel.org
Description:
[RO] This indicates the alignment constraint, in bytes, for
write operations in sequential zones of zoned block devices
(devices with a zoned attribute that reports "host-managed" or
"host-aware"). This value is always 0 for regular block devices.
What: /sys/block/<disk>/queue/zoned
Date: September 2016
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
[RO] zoned indicates if the device is a zoned block device and
the zone model of the device if it is indeed zoned. The
possible values indicated by zoned are "none" for regular block
devices and "host-aware" or "host-managed" for zoned block
devices. The characteristics of host-aware and host-managed
zoned block devices are described in the ZBC (Zoned Block
Commands) and ZAC (Zoned Device ATA Command Set) standards.
These standards also define the "drive-managed" zone model.
However, since drive-managed zoned block devices do not support
zone commands, they will be treated as regular block devices and
zoned will report "none".
What: /sys/block/<disk>/stat
Date: February 2008
Contact: Jerome Marchand <jmarchan@redhat.com>
Description:
The /sys/block/<disk>/stat file displays the I/O
statistics of disk <disk>. It contains the following 17 fields:

==  ==============================================
 1  reads completed successfully
 2  reads merged
 3  sectors read
 4  time spent reading (ms)
 5  writes completed
 6  writes merged
 7  sectors written
 8  time spent writing (ms)
 9  I/Os currently in progress
10  time spent doing I/Os (ms)
11  weighted time spent doing I/Os (ms)
12  discards completed
13  discards merged
14  sectors discarded
15  time spent discarding (ms)
16  flush requests completed
17  time spent flushing (ms)
==  ==============================================

For more details refer to Documentation/admin-guide/iostats.rst
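A minimal sketch of turning the stat line into named values, following the field order of the table above. The key names in `STAT_FIELDS` and the `parse_stat` helper are illustrative choices, not kernel identifiers; the helper tolerates shorter lines on the assumption that some kernels expose fewer fields.

```python
STAT_FIELDS = [
    "reads_completed", "reads_merged", "sectors_read", "read_ms",
    "writes_completed", "writes_merged", "sectors_written", "write_ms",
    "in_progress", "io_ms", "weighted_io_ms",
    "discards_completed", "discards_merged", "sectors_discarded", "discard_ms",
    "flushes_completed", "flush_ms",
]


def parse_stat(text: str) -> dict[str, int]:
    """Map the whitespace-separated stat values to the field names above.

    If fewer values are present than names, only the leading fields are
    returned (zip stops at the shorter sequence).
    """
    values = [int(v) for v in text.split()]
    return dict(zip(STAT_FIELDS, values))


# Hypothetical content of /sys/block/<disk>/stat
sample = "100 5 800 40 200 10 1600 90 2 120 130 7 0 56 3 15 9\n"
stats = parse_stat(sample)
print(stats["sectors_read"], stats["in_progress"])
```

Since sectors in this file are counted in 512-byte units (see iostats.rst), a caller could multiply `sectors_read` by 512 to obtain bytes read.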

@ -176,3 +176,9 @@ Contact: Keith Busch <keith.busch@intel.com>
Description:
The cache write policy: 0 for write-back, 1 for write-through,
other or unknown.
What: /sys/devices/system/node/nodeX/x86/sgx_total_bytes
Date: November 2021
Contact: Jarkko Sakkinen <jarkko@kernel.org>
Description:
The total amount of SGX physical memory in bytes.

@ -41,14 +41,14 @@ KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The maximum number of groups that can be created under this device.
What: /sys/bus/dsa/devices/dsa<m>/max_tokens
Date: Oct 25, 2019
KernelVersion: 5.6.0
What: /sys/bus/dsa/devices/dsa<m>/max_read_buffers
Date: Dec 10, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: The total number of bandwidth tokens supported by this device.
The bandwidth tokens represent resources within the DSA
Description: The total number of read buffers supported by this device.
The read buffers represent resources within the DSA
implementation, and these resources are allocated by engines to
support operations.
support operations. See DSA spec v1.2 9.2.4 Total Read Buffers.
What: /sys/bus/dsa/devices/dsa<m>/max_transfer_size
Date: Oct 25, 2019
@ -115,13 +115,13 @@ KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: To indicate if this device is configurable or not.
What: /sys/bus/dsa/devices/dsa<m>/token_limit
Date: Oct 25, 2019
KernelVersion: 5.6.0
What: /sys/bus/dsa/devices/dsa<m>/read_buffer_limit
Date: Dec 10, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: The maximum number of bandwidth tokens that may be in use at
Description: The maximum number of read buffers that may be in use at
one time by operations that access low bandwidth memory in the
device.
device. See DSA spec v1.2 9.2.8 GENCFG on Global Read Buffer Limit.
What: /sys/bus/dsa/devices/dsa<m>/cmd_status
Date: Aug 28, 2020
@ -220,8 +220,38 @@ Contact: dmaengine@vger.kernel.org
Description: Show the current number of entries in this WQ if WQ Occupancy
Support bit WQ capabilities is 1.
What: /sys/bus/dsa/devices/wq<m>.<n>/enqcmds_retries
Date: Oct 29, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: Indicates the number of retries for an enqcmds submission on a shared wq.
The maximum value that can be set for this attribute is 64.
What: /sys/bus/dsa/devices/engine<m>.<n>/group_id
Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The group that this engine belongs to.
What: /sys/bus/dsa/devices/group<m>.<n>/use_read_buffer_limit
Date: Dec 10, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: Enable the use of global read buffer limit for the group. See DSA
spec v1.2 9.2.18 GRPCFG Use Global Read Buffer Limit.
What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_allowed
Date: Dec 10, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: Indicates max number of read buffers that may be in use at one time
by all engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read
Buffers Allowed.
What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_reserved
Date: Dec 10, 2021
KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: Indicates the number of Read Buffers reserved for the use of
engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read Buffers
Reserved.

View File

@ -27,6 +27,6 @@ Description:
(in 1/256 dB)
p_volume_res playback volume control resolution
(in 1/256 dB)
req_number the number of pre-allocated request
req_number the number of pre-allocated requests
for both capture and playback
===================== =======================================

View File

@ -30,4 +30,6 @@ Description:
(in 1/256 dB)
p_volume_res playback volume control resolution
(in 1/256 dB)
req_number the number of pre-allocated requests
for both capture and playback
===================== =======================================

View File

@ -21,11 +21,11 @@ Description: Allow the root user to disable/enable in runtime the clock
a different engine to disable/enable its clock gating feature.
The bitmask is composed of 20 bits:
======= ============
======= ============
0 - 7 DMA channels
8 - 11 MME engines
12 - 19 TPC engines
======= ============
======= ============
The bit's location of a specific engine can be determined
using (1 << GAUDI_ENGINE_ID_*). GAUDI_ENGINE_ID_* values
@ -155,6 +155,13 @@ Description: Triggers an I2C transaction that is generated by the device's
CPU. Writing to this file generates a write transaction while
reading from the file generates a read transaction
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_len
Date: Dec 2021
KernelVersion: 5.17
Contact: obitton@habana.ai
Description: Sets I2C length in bytes for I2C transaction that is generated by
the device's CPU
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_reg
Date: Jan 2019
KernelVersion: 5.1
@ -226,12 +233,6 @@ Description: Gets the state dump occurring on a CS timeout or failure.
Writing an integer X discards X state dumps, so that the
next read would return X+1-st newest state dump.
What: /sys/kernel/debug/habanalabs/hl<n>/timeout_locked
Date: Sep 2021
KernelVersion: 5.16
Contact: obitton@habana.ai
Description: Sets the command submission timeout value in seconds.
What: /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
Date: Mar 2020
KernelVersion: 5.6
@ -239,6 +240,12 @@ Contact: ogabbay@kernel.org
Description: Sets the stop-on-error option for the device engines. Value of
"0" is for disable, otherwise enable.
What: /sys/kernel/debug/habanalabs/hl<n>/timeout_locked
Date: Sep 2021
KernelVersion: 5.16
Contact: obitton@habana.ai
Description: Sets the command submission timeout value in seconds.
What: /sys/kernel/debug/habanalabs/hl<n>/userptr
Date: Jan 2019
KernelVersion: 5.1

View File

@ -1,346 +0,0 @@
What: /sys/block/<disk>/stat
Date: February 2008
Contact: Jerome Marchand <jmarchan@redhat.com>
Description:
The /sys/block/<disk>/stat files display the I/O
statistics of disk <disk>. They contain 17 fields:
== ==============================================
1 reads completed successfully
2 reads merged
3 sectors read
4 time spent reading (ms)
5 writes completed
6 writes merged
7 sectors written
8 time spent writing (ms)
9 I/Os currently in progress
10 time spent doing I/Os (ms)
11 weighted time spent doing I/Os (ms)
12 discards completed
13 discards merged
14 sectors discarded
15 time spent discarding (ms)
16 flush requests completed
17 time spent flushing (ms)
== ==============================================
For more details refer to Documentation/admin-guide/iostats.rst
What: /sys/block/<disk>/inflight
Date: October 2009
Contact: Jens Axboe <axboe@kernel.dk>, Nikanth Karthikesan <knikanth@suse.de>
Description:
Reports the number of I/O requests currently in progress
(pending / in flight) in a device driver. This can be less
than the number of requests queued in the block device queue.
The report contains 2 fields: one for read requests
and one for write requests.
The value type is unsigned int.
Cf. Documentation/block/stat.rst which contains a single value for
requests in flight.
This is related to nr_requests in Documentation/block/queue-sysfs.rst
and for SCSI device also its queue_depth.
What: /sys/block/<disk>/diskseq
Date: February 2021
Contact: Matteo Croce <mcroce@microsoft.com>
Description:
The /sys/block/<disk>/diskseq file reports the disk
sequence number, which is a monotonically increasing
number assigned to every drive.
Some devices, like the loop device, refresh such number
every time the backing file is changed.
The value type is 64 bit unsigned.
What: /sys/block/<disk>/<part>/stat
Date: February 2008
Contact: Jerome Marchand <jmarchan@redhat.com>
Description:
The /sys/block/<disk>/<part>/stat files display the
I/O statistics of partition <part>. The format is the
same as the above-written /sys/block/<disk>/stat
format.
What: /sys/block/<disk>/integrity/format
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Metadata format for integrity capable block device.
E.g. T10-DIF-TYPE1-CRC.
What: /sys/block/<disk>/integrity/read_verify
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether the block layer should verify the
integrity of read requests serviced by devices that
support sending integrity metadata.
What: /sys/block/<disk>/integrity/tag_size
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Number of bytes of integrity tag space available per
512 bytes of data.
What: /sys/block/<disk>/integrity/device_is_integrity_capable
Date: July 2014
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether a storage device is capable of storing
integrity metadata. Set if the device is T10 PI-capable.
What: /sys/block/<disk>/integrity/protection_interval_bytes
Date: July 2015
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Describes the number of data bytes which are protected
by one integrity tuple. Typically the device's logical
block size.
What: /sys/block/<disk>/integrity/write_generate
Date: June 2008
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Indicates whether the block layer should automatically
generate checksums for write requests bound for
devices that support receiving integrity metadata.
What: /sys/block/<disk>/alignment_offset
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report a physical block size that is
bigger than the logical block size (for instance a drive
with 4KB physical sectors exposing 512-byte logical
blocks to the operating system). This parameter
indicates how many bytes the beginning of the device is
offset from the disk's natural alignment.
What: /sys/block/<disk>/<partition>/alignment_offset
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report a physical block size that is
bigger than the logical block size (for instance a drive
with 4KB physical sectors exposing 512-byte logical
blocks to the operating system). This parameter
indicates how many bytes the beginning of the partition
is offset from the disk's natural alignment.
What: /sys/block/<disk>/queue/logical_block_size
Date: May 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
This is the smallest unit the storage device can
address. It is typically 512 bytes.
What: /sys/block/<disk>/queue/physical_block_size
Date: May 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
This is the smallest unit a physical storage device can
write atomically. It is usually the same as the logical
block size but may be bigger. One example is SATA
drives with 4KB sectors that expose a 512-byte logical
block size to the operating system. For stacked block
devices the physical_block_size variable contains the
maximum physical_block_size of the component devices.
What: /sys/block/<disk>/queue/minimum_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report a granularity or preferred
minimum I/O size which is the smallest request the
device can perform without incurring a performance
penalty. For disk drives this is often the physical
block size. For RAID arrays it is often the stripe
chunk size. A properly aligned multiple of
minimum_io_size is the preferred request size for
workloads where a high number of I/O operations is
desired.
What: /sys/block/<disk>/queue/optimal_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Storage devices may report an optimal I/O size, which is
the device's preferred unit for sustained I/O. This is
rarely reported for disk drives. For RAID arrays it is
usually the stripe width or the internal track size. A
properly aligned multiple of optimal_io_size is the
preferred request size for workloads where sustained
throughput is desired. If no optimal I/O size is
reported this file contains 0.
What: /sys/block/<disk>/queue/nomerges
Date: January 2010
Contact:
Description:
Standard I/O elevator operations include attempts to
merge contiguous I/Os. For known random I/O loads these
attempts will always fail and result in extra cycles
being spent in the kernel. This allows one to turn off
this behavior in one of two ways: When set to 1, complex
merge checks are disabled, but the simple one-shot merges
with the previous I/O request are enabled. When set to 2,
all merge tries are disabled. The default value is 0 -
which enables all types of merge tries.
What: /sys/block/<disk>/discard_alignment
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may
internally allocate space in units that are bigger than
the exported logical block size. The discard_alignment
parameter indicates how many bytes the beginning of the
device is offset from the internal allocation unit's
natural alignment.
What: /sys/block/<disk>/<partition>/discard_alignment
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may
internally allocate space in units that are bigger than
the exported logical block size. The discard_alignment
parameter indicates how many bytes the beginning of the
partition is offset from the internal allocation unit's
natural alignment.
What: /sys/block/<disk>/queue/discard_granularity
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may
internally allocate space using units that are bigger
than the logical block size. The discard_granularity
parameter indicates the size of the internal allocation
unit in bytes if reported by the device. Otherwise the
discard_granularity will be set to match the device's
physical block size. A discard_granularity of 0 means
that the device does not support discard functionality.
What: /sys/block/<disk>/queue/discard_max_bytes
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Devices that support discard functionality may have
internal limits on the number of bytes that can be
trimmed or unmapped in a single operation. Some storage
protocols also have inherent limits on the number of
blocks that can be described in a single command. The
discard_max_bytes parameter is set by the device driver
to the maximum number of bytes that can be discarded in
a single operation. Discard requests issued to the
device must not exceed this limit. A discard_max_bytes
value of 0 means that the device does not support
discard functionality.
What: /sys/block/<disk>/queue/discard_zeroes_data
Date: May 2011
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Will always return 0. Don't rely on any specific behavior
for discards, and don't read this file.
What: /sys/block/<disk>/queue/write_same_max_bytes
Date: January 2012
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
Some devices support a write same operation in which a
single data block can be written to a range of several
contiguous blocks on storage. This can be used to wipe
areas on disk or to initialize drives in a RAID
configuration. write_same_max_bytes indicates how many
bytes can be written in a single write same command. If
write_same_max_bytes is 0, write same is not supported
by the device.
What: /sys/block/<disk>/queue/write_zeroes_max_bytes
Date: November 2016
Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Description:
Some devices support a write zeroes operation in which a
single request can be issued to zero out a range of
contiguous blocks on storage without having any payload
in the request. This can be used to optimize writing zeroes
to the devices. write_zeroes_max_bytes indicates how many
bytes can be written in a single write zeroes command. If
write_zeroes_max_bytes is 0, write zeroes is not supported
by the device.
What: /sys/block/<disk>/queue/zoned
Date: September 2016
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
zoned indicates if the device is a zoned block device
and the zone model of the device if it is indeed zoned.
The possible values indicated by zoned are "none" for
regular block devices and "host-aware" or "host-managed"
for zoned block devices. The characteristics of
host-aware and host-managed zoned block devices are
described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards
also define the "drive-managed" zone model. However,
since drive-managed zoned block devices do not support
zone commands, they will be treated as regular block
devices and zoned will report "none".
What: /sys/block/<disk>/queue/nr_zones
Date: November 2018
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
nr_zones indicates the total number of zones of a zoned block
device ("host-aware" or "host-managed" zone model). For regular
block devices, the value is always 0.
What: /sys/block/<disk>/queue/max_active_zones
Date: July 2020
Contact: Niklas Cassel <niklas.cassel@wdc.com>
Description:
For zoned block devices (zoned attribute indicating
"host-managed" or "host-aware"), the sum of zones belonging to
any of the zone states: EXPLICIT OPEN, IMPLICIT OPEN or CLOSED,
is limited by this value. If this value is 0, there is no limit.
What: /sys/block/<disk>/queue/max_open_zones
Date: July 2020
Contact: Niklas Cassel <niklas.cassel@wdc.com>
Description:
For zoned block devices (zoned attribute indicating
"host-managed" or "host-aware"), the sum of zones belonging to
any of the zone states: EXPLICIT OPEN or IMPLICIT OPEN,
is limited by this value. If this value is 0, there is no limit.
What: /sys/block/<disk>/queue/chunk_sectors
Date: September 2016
Contact: Hannes Reinecke <hare@suse.com>
Description:
chunk_sectors has a different meaning depending on the type
of the disk. For a RAID device (dm-raid), chunk_sectors
indicates the size in 512B sectors of the RAID volume
stripe segment. For a zoned block device, either
host-aware or host-managed, chunk_sectors indicates the
size in 512B sectors of the zones of the device, with
the possible exception of the last zone of the device
which may be smaller.
What: /sys/block/<disk>/queue/io_timeout
Date: November 2018
Contact: Weiping Zhang <zhangweiping@didiglobal.com>
Description:
io_timeout is the request timeout in milliseconds. If a request
does not complete in this time then the block driver timeout
handler is invoked. That timeout handler can decide to retry
the request, to fail it or to start a device recovery strategy.

View File

@ -0,0 +1,16 @@
What: /sys/bus/iio/devices/iio:deviceX/filter_mode_available
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Reading this returns the valid values that can be written to the
on_altvoltage0_mode attribute:
- auto -> Adjust bandpass filter to track changes in input clock rate.
- manual -> disable/unregister the clock rate notifier / input clock tracking.
What: /sys/bus/iio/devices/iio:deviceX/filter_mode
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
This attribute configures the filter mode.
Reading returns the actual mode.

View File

@ -0,0 +1,38 @@
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0-1_i_calibphase
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write unscaled value for the Local Oscillator path quadrature I phase shift.
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0-1_q_calibphase
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write unscaled value for the Local Oscillator path quadrature Q phase shift.
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_i_calibbias
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write value for the Local Oscillator Feedthrough Offset Calibration I Positive
side.
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_q_calibbias
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write value for the Local Oscillator Feedthrough Offset Calibration Q Positive side.
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage1_i_calibbias
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write raw value for the Local Oscillator Feedthrough Offset Calibration I Negative
side.
What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage1_q_calibbias
KernelVersion:
Contact: linux-iio@vger.kernel.org
Description:
Read/write raw value for the Local Oscillator Feedthrough Offset Calibration Q Negative
side.

View File

@ -244,6 +244,15 @@ Description:
is permitted, "u2" if only u2 is permitted, "u1_u2" if both u1 and
u2 are permitted.
What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/connector
Date: December 2021
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Description:
Link to the USB Type-C connector when available. This link is
only created when USB Type-C Connector Class is enabled, and
only if the system firmware is capable of describing the
connection between a port and its connector.
What: /sys/bus/usb/devices/.../power/usb2_lpm_l1_timeout
Date: May 2013
Contact: Mathias Nyman <mathias.nyman@linux.intel.com>

View File

@ -0,0 +1,57 @@
What: /sys/bus/vdpa/driver_autoprobe
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Description:
This file determines whether new devices are immediately bound
to a driver after the creation. It initially contains 1, which
means the kernel automatically binds devices to a compatible
driver immediately after they are created.
Writing "0" to this file disables this feature; any other string
enables it.
What: /sys/bus/vdpa/driver_probe
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Description:
Writing a device name to this file will cause the kernel to bind
the device to a compatible driver.
This can be useful when /sys/bus/vdpa/driver_autoprobe is
disabled.
What: /sys/bus/vdpa/drivers/.../bind
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Description:
Writing a device name to this file will cause the driver to
attempt to bind to the device. This is useful for overriding
default bindings.
What: /sys/bus/vdpa/drivers/.../unbind
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Description:
Writing a device name to this file will cause the driver to
attempt to unbind from the device. This may be useful when
overriding default bindings.
What: /sys/bus/vdpa/devices/.../driver_override
Date: November 2021
Contact: virtualization@lists.linux-foundation.org
Description:
This file allows the driver for a device to be specified.
When specified, only a driver with a name matching the value
written to driver_override will have an opportunity to bind to
the device. The override is specified by writing a string to the
driver_override file (echo vhost-vdpa > driver_override) and may
be cleared with an empty string (echo > driver_override).
This returns the device to standard matching rules binding.
Writing to driver_override does not automatically unbind the
device from its current driver or make any attempt to
automatically load the specified driver. If no driver with a
matching name is currently loaded in the kernel, the device will
not bind to any driver. This also allows devices to opt-out of
driver binding using a driver_override name such as "none".
Only a single driver may be specified in the override; there is
no support for parsing delimiters.

View File

@ -161,6 +161,15 @@ Description:
power-on:
Representing a password required to use
the system
system-mgmt:
Representing System Management password.
See Lenovo extensions section for details
HDD:
Representing HDD password
See Lenovo extensions section for details
NVMe:
Representing NVMe password
See Lenovo extensions section for details
mechanism:
The means of authentication. This attribute is mandatory.
@ -207,6 +216,13 @@ Description:
On Lenovo systems the following additional settings are available:
role: system-mgmt This gives the same authority as the bios-admin password to control
security related features. The authorities allocated can be set via
the BIOS menu SMP Access Control Policy
role: HDD & NVMe This password is used to unlock access to the drive at boot. See the
'level' and 'index' extensions below.
lenovo_encoding:
The encoding method that is used. This can be either "ascii"
or "scancode". Default is set to "ascii"
@ -216,6 +232,22 @@ Description:
two char code (e.g. "us", "fr", "gr") and may vary per platform.
Default is set to "us"
level:
Available for HDD and NVMe authentication to set 'user' or 'master'
privilege level.
If only the user password is configured then this should be used to
unlock the drive at boot. If both master and user passwords are set
then either can be used. If a master password is set a user password
is required.
This attribute defaults to 'user' level
index:
Used with HDD and NVMe authentication to set the drive index
that is being referenced (e.g. hdd0, hdd1, etc.)
This attribute defaults to device 0.
What: /sys/class/firmware-attributes/*/attributes/pending_reboot
Date: February 2021
KernelVersion: 5.11

View File

@ -413,7 +413,7 @@ Description:
"Over voltage", "Unspecified failure", "Cold",
"Watchdog timer expire", "Safety timer expire",
"Over current", "Calibration required", "Warm",
"Cool", "Hot"
"Cool", "Hot", "No battery"
What: /sys/class/power_supply/<supply_name>/precharge_current
Date: June 2017
@ -455,6 +455,20 @@ Description:
"Unknown", "Charging", "Discharging",
"Not charging", "Full"
What: /sys/class/power_supply/<supply_name>/charge_behaviour
Date: November 2021
Contact: linux-pm@vger.kernel.org
Description:
Represents the charging behaviour.
Access: Read, Write
Valid values:
================ ====================================
auto: Charge normally, respect thresholds
inhibit-charge: Do not charge while AC is attached
force-discharge: Force discharge while AC is attached
What: /sys/class/power_supply/<supply_name>/technology
Date: May 2007
Contact: linux-pm@vger.kernel.org

View File

@ -666,3 +666,18 @@ Description: Preferred MTE tag checking mode
================ ==============================================
See also: Documentation/arm64/memory-tagging-extension.rst
What: /sys/devices/system/cpu/nohz_full
Date: Apr 2015
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) the list of CPUs that are in nohz_full mode.
These CPUs are set by boot parameter "nohz_full=".
What: /sys/devices/system/cpu/isolated
Date: Apr 2015
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) the list of CPUs that are isolated and don't
participate in load balancing. These CPUs are set by
boot parameter "isolcpus=".
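Both nohz_full and isolated report a list of CPUs. A minimal sketch for expanding such a list is shown below, assuming the files use the usual kernel CPU-list notation of comma-separated numbers and ranges (for example "1-3,5"); the function name is hypothetical.

```python
def parse_cpu_list(text):
    """Expand a CPU list such as "1-3,5" into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            # An empty file (no CPUs in the set) yields an empty list.
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus
```

A caller could pass it the contents of /sys/devices/system/cpu/isolated to check whether a given CPU takes part in load balancing.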

View File

@ -0,0 +1,16 @@
What: /sys/fs/erofs/features/
Date: November 2021
Contact: "Huang Jianan" <huangjianan@oppo.com>
Description: Shows all enabled kernel features.
Supported features:
zero_padding, compr_cfgs, big_pcluster, chunked_file,
device_table, compr_head2, sb_chksum.
What: /sys/fs/erofs/<disk>/sync_decompress
Date: November 2021
Contact: "Huang Jianan" <huangjianan@oppo.com>
Description: Control strategy of sync decompression
- 0 (default, auto): enable for readpage, and enable for
readahead on atomic contexts only,
- 1 (force on): enable for readpage and readahead.
- 2 (force off): disable for all situations.

View File

@ -112,6 +112,11 @@ Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
Description: Set timeout to issue discard commands during umount.
Default: 5 secs
What: /sys/fs/f2fs/<disk>/pending_discard
Date: November 2021
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
Description: Shows the number of pending discard commands in the queue.
What: /sys/fs/f2fs/<disk>/max_victim_search
Date: January 2014
Contact: "Jaegeuk Kim" <jaegeuk.kim@samsung.com>
@ -528,3 +533,10 @@ Description: With "mode=fragment:block" mount options, we can scatter block allo
f2fs will allocate 1..<max_fragment_chunk> blocks in a chunk and make a hole
in the length of 1..<max_fragment_hole> by turns. This value can be set
between 1..512 and the default value is 4.
What: /sys/fs/f2fs/<disk>/gc_urgent_high_remaining
Date: December 2021
Contact: "Daeho Jeong" <daehojeong@google.com>
Description: You can set the trial count limit for GC urgent high mode with this value.
If GC thread gets to the limit, the mode will turn back to GC normal mode.
By default, the value is zero, which means there is no limit like before.

View File

@ -0,0 +1,35 @@
What: /sys/fs/ubifsX_Y/error_magic
Date: October 2021
KernelVersion: 5.16
Contact: linux-mtd@lists.infradead.org
Description:
Exposes magic errors: every node starts with a magic number.
This counter keeps track of the number of accesses of nodes
with a corrupted magic number.
The counter is reset to 0 with a remount.
What: /sys/fs/ubifsX_Y/error_node
Date: October 2021
KernelVersion: 5.16
Contact: linux-mtd@lists.infradead.org
Description:
Exposes node errors. Every node embeds its type.
This counter keeps track of the number of accesses of nodes
with a corrupted node type.
The counter is reset to 0 with a remount.
What: /sys/fs/ubifsX_Y/error_crc
Date: October 2021
KernelVersion: 5.16
Contact: linux-mtd@lists.infradead.org
Description:
Exposes crc errors: every node embeds a crc checksum.
This counter keeps track of the number of accesses of nodes
with a bad crc checksum.
The counter is reset to 0 with a remount.

View File

@ -19,6 +19,8 @@ endif
SPHINXBUILD = sphinx-build
SPHINXOPTS =
SPHINXDIRS = .
DOCS_THEME =
DOCS_CSS =
_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)))
SPHINX_CONF = conf.py
PAPER =
@ -84,7 +86,10 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
-D version=$(KERNELVERSION) -D release=$(KERNELRELEASE) \
$(ALLSPHINXOPTS) \
$(abspath $(srctree)/$(src)/$5) \
$(abspath $(BUILDDIR)/$3/$4)
$(abspath $(BUILDDIR)/$3/$4) && \
if [ "x$(DOCS_CSS)" != "x" ]; then \
cp $(if $(patsubst /%,,$(DOCS_CSS)),$(abspath $(srctree)/$(DOCS_CSS)),$(DOCS_CSS)) $(BUILDDIR)/$3/_static/; \
fi
htmldocs:
@$(srctree)/scripts/sphinx-pre-install --version-check
@ -154,4 +159,8 @@ dochelp:
@echo ' make SPHINX_CONF={conf-file} [target] use *additional* sphinx-build'
@echo ' configuration. This is e.g. useful to build with nit-picking config.'
@echo
@echo ' make DOCS_THEME={sphinx-theme} selects a different Sphinx theme.'
@echo
@echo ' make DOCS_CSS={a .css file} adds a DOCS_CSS override file for html/epub output.'
@echo
@echo ' Default location for the generated documents is Documentation/output'
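The DOCS_CSS copy step added to the sphinx recipe in the hunk above picks the source path with a nested `$(if $(patsubst /%,,...))`: a path that starts with "/" is used as given, anything else is resolved against $(srctree). The sketch below restates that selection in Python purely for illustration; the function name is hypothetical and the build itself does not use Python for this step.

```python
import os


def resolve_docs_css(docs_css, srctree):
    """Return the CSS path the recipe would copy, or None if DOCS_CSS is unset."""
    if not docs_css:
        # Mirrors the `[ "x$(DOCS_CSS)" != "x" ]` guard: nothing is copied.
        return None
    if docs_css.startswith("/"):
        # $(patsubst /%,,...) yields an empty string, so the path is used as-is.
        return docs_css
    # Relative paths are taken relative to the top of the source tree.
    return os.path.abspath(os.path.join(srctree, docs_css))
```

The resolved file is then copied into the _static directory of the build output, as the recipe shows.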

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"

View File

@ -88,7 +88,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -103,7 +103,7 @@
id="text2993"
y="-261.66608"
x="412.12299"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@ -135,7 +135,7 @@
</g>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="268.18076"
id="text4429"
@ -146,7 +146,7 @@
y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="439.13766"
id="text4441"
@ -157,7 +157,7 @@
y="439.13766">WRITE_ONCE(b, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.60869"
y="309.29346"
id="text4445"
@ -168,7 +168,7 @@
y="309.29346">r1 = READ_ONCE(a);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.14423"
y="520.61786"
id="text4449"
@ -179,7 +179,7 @@
y="520.61786">WRITE_ONCE(c, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="384.71124"
id="text4453"
@ -190,7 +190,7 @@
y="384.71124">r2 = READ_ONCE(b);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="582.13617"
id="text4457"
@ -201,7 +201,7 @@
y="582.13617">r3 = READ_ONCE(c);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.08231"
y="213.91006"
id="text4461"
@ -212,7 +212,7 @@
y="213.91006">thread0()</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="252.34512"
y="213.91006"
id="text4461-6"
@ -223,7 +223,7 @@
y="213.91006">thread1()</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.42557"
y="213.91006"
id="text4461-2"
@ -251,7 +251,7 @@
inkscape:connector-curvature="0" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="251.53981"
id="text4429-8"
@ -262,7 +262,7 @@
y="251.53981">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="367.91556"
id="text4429-8-9"
@ -273,7 +273,7 @@
y="367.91556">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="597.40289"
id="text4429-8-9-3"
@ -284,7 +284,7 @@
y="597.40289">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="453.15311"
id="text4429-8-9-3-1"
@ -300,7 +300,7 @@
inkscape:connector-curvature="0" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="394.94427"
y="345.66351"
id="text4648"
@ -324,7 +324,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.11968"
y="475.77856"
id="text4648-4"
@ -361,7 +361,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="254.85066"
y="348.96619"
id="text4648-4-3"

View File

@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@ -131,7 +131,7 @@
id="text2993"
y="-261.66608"
x="436.12299"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@ -163,7 +163,7 @@
</g>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="268.18076"
id="text4429"
@ -174,7 +174,7 @@
y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="487.13766"
id="text4441"
@ -185,7 +185,7 @@
y="487.13766">WRITE_ONCE(b, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.60869"
y="297.29346"
id="text4445"
@ -196,7 +196,7 @@
y="297.29346">r1 = READ_ONCE(a);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.14423"
y="554.61786"
id="text4449"
@ -207,7 +207,7 @@
y="554.61786">WRITE_ONCE(c, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="370.71124"
id="text4453"
@ -218,7 +218,7 @@
y="370.71124">WRITE_ONCE(d, 1);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="572.13617"
id="text4457"
@ -229,7 +229,7 @@
y="572.13617">r2 = READ_ONCE(c);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.08231"
y="213.91006"
id="text4461"
@ -240,7 +240,7 @@
y="213.91006">thread0()</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="252.34512"
y="213.91006"
id="text4461-6"
@ -251,7 +251,7 @@
y="213.91006">thread1()</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.42557"
y="213.91006"
id="text4461-2"
@ -281,7 +281,7 @@
sodipodi:nodetypes="cc" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="251.53981"
id="text4429-8"
@ -292,7 +292,7 @@
y="251.53981">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="353.91556"
id="text4429-8-9"
@ -303,7 +303,7 @@
y="353.91556">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="587.40289"
id="text4429-8-9-3"
@ -314,7 +314,7 @@
y="587.40289">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="501.15311"
id="text4429-8-9-3-1"
@ -331,7 +331,7 @@
sodipodi:nodetypes="cc" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="394.94427"
y="331.66351"
id="text4648"
@ -355,7 +355,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.11968"
y="523.77856"
id="text4648-4"
@ -392,7 +392,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="254.85066"
y="336.96619"
id="text4648-4-3"
@ -421,7 +421,7 @@
id="text2993-7"
y="-261.66608"
x="440.12299"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@ -453,7 +453,7 @@
</g>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="541.70508"
y="387.6217"
id="text4445-0"
@ -464,7 +464,7 @@
y="387.6217">r3 = READ_ONCE(d);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="541.2406"
y="646.94611"
id="text4449-6"
@ -488,7 +488,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="540.94702"
y="427.29443"
id="text4648-4-3-1"
@ -499,7 +499,7 @@
y="427.29443">QS</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="461.83929"
id="text4453-7"
@ -510,7 +510,7 @@
y="461.83929">r4 = READ_ONCE(b);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="669.26422"
id="text4457-9"
@ -521,7 +521,7 @@
y="669.26422">r5 = READ_ONCE(e);</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="445.04358"
id="text4429-8-9-33"
@ -532,7 +532,7 @@
y="445.04358">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="684.53094"
id="text4429-8-9-3-8"
@ -543,7 +543,7 @@
y="684.53094">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="685.11914"
y="422.79153"
id="text4648-9"
@ -567,7 +567,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="397.85934"
y="609.59003"
id="text4648-5"
@ -591,7 +591,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="256.75986"
y="586.99133"
id="text4648-5-2"
@ -615,7 +615,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="546.22791"
y="213.91006"
id="text4461-2-5"
@ -626,7 +626,7 @@
y="213.91006">thread3()</tspan></text>
<text
xml:space="preserve"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="684.00067"
y="213.91006"
id="text4461-2-1"

View File

@ -254,17 +254,6 @@ period (in this case 2603), the grace-period sequence number (7075), and
an estimate of the total number of RCU callbacks queued across all CPUs
(625 in this case).
In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed
for each CPU::
0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1
The "last_accelerate:" prints the low-order 16 bits (in hex) of the
jiffies counter when this CPU last invoked rcu_try_advance_all_cbs()
from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from
rcu_prepare_for_idle(). "dyntick_enabled: 1" indicates that dyntick-idle
processing is enabled.
If the grace period ends just as the stall warning starts printing,
there will be a spurious stall-warning message, which will include
the following::

View File

@ -39,9 +39,11 @@ different paths, as follows:
:ref:`6. ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`
:ref:`7. FULL LIST OF RCU APIs <7_whatisRCU>`
:ref:`7. ANALOGY WITH REFERENCE COUNTING <7_whatisRCU>`
:ref:`8. ANSWERS TO QUICK QUIZZES <8_whatisRCU>`
:ref:`8. FULL LIST OF RCU APIs <8_whatisRCU>`
:ref:`9. ANSWERS TO QUICK QUIZZES <9_whatisRCU>`
People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
@ -677,7 +679,7 @@ Quick Quiz #1:
occur when using this algorithm in a real-world Linux
kernel? How could this deadlock be avoided?
:ref:`Answers to Quick Quiz <8_whatisRCU>`
:ref:`Answers to Quick Quiz <9_whatisRCU>`
5B. "TOY" EXAMPLE #2: CLASSIC RCU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -732,7 +734,7 @@ Quick Quiz #2:
Give an example where Classic RCU's read-side
overhead is **negative**.
:ref:`Answers to Quick Quiz <8_whatisRCU>`
:ref:`Answers to Quick Quiz <9_whatisRCU>`
.. _quiz_3:
@ -741,7 +743,7 @@ Quick Quiz #3:
critical section, what the heck do you do in
CONFIG_PREEMPT_RT, where normal spinlocks can block???
:ref:`Answers to Quick Quiz <8_whatisRCU>`
:ref:`Answers to Quick Quiz <9_whatisRCU>`
.. _6_whatisRCU:
@ -872,7 +874,79 @@ be used in place of synchronize_rcu().
.. _7_whatisRCU:
7. FULL LIST OF RCU APIs
7. ANALOGY WITH REFERENCE COUNTING
-----------------------------------
The reader-writer analogy (illustrated by the previous section) is not
always the best way to think about using RCU. Another helpful analogy
considers RCU an effective reference count on everything which is
protected by RCU.
A reference count typically does not prevent the referenced object's
values from changing, but does prevent changes to type -- particularly the
gross change of type that happens when that object's memory is freed and
re-allocated for some other purpose. Once a type-safe reference to the
object is obtained, some other mechanism is needed to ensure consistent
access to the data in the object. This could involve taking a spinlock,
but with RCU the typical approach is to perform reads with SMP-aware
operations such as smp_load_acquire(), to perform updates with atomic
read-modify-write operations, and to provide the necessary ordering.
RCU provides a number of support functions that embed the required
operations and ordering, such as the list_for_each_entry_rcu() macro
used in the previous section.
A more focused view of the reference counting behavior is that,
between rcu_read_lock() and rcu_read_unlock(), any reference taken with
rcu_dereference() on a pointer marked as ``__rcu`` can be treated as
though a reference-count on that object has been temporarily increased.
This prevents the object from changing type. Exactly what this means
will depend on normal expectations of objects of that type, but it
typically includes that spinlocks can still be safely locked, normal
reference counters can be safely manipulated, and ``__rcu`` pointers
can be safely dereferenced.
Some operations that one might expect to see on an object for
which an RCU reference is held include:
- Copying out data that is guaranteed to be stable by the object's type.
- Using kref_get_unless_zero() or similar to get a longer-term
reference. This may fail of course.
- Acquiring a spinlock in the object, and checking if the object still
is the expected object and if so, manipulating it freely.
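The "get a longer-term reference, which may fail" step in the list above can be illustrated outside the kernel. The following is a minimal userspace sketch using C11 atomics; it is not the kernel's kref implementation, and ``struct obj`` and ``obj_get_unless_zero()`` are hypothetical names chosen for illustration. It only shows why such an operation can fail: once the count has reached zero, the object is already being torn down and no new reference may be taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference-counted object. */
struct obj {
	atomic_int refcount;	/* 0 means the object is already being torn down */
};

/*
 * Take a reference only if the count is still nonzero.
 * Returns true on success, false if the object was already dying.
 */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In the RCU setting described here, a reader would attempt such an operation between rcu_read_lock() and rcu_read_unlock() and simply treat the object as not found if it fails.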
The understanding that RCU provides a reference that only prevents a
change of type is particularly visible with objects allocated from a
slab cache marked ``SLAB_TYPESAFE_BY_RCU``. RCU operations may yield a
reference to an object from such a cache that has been concurrently
freed and the memory reallocated to a completely different object,
though of the same type. In this case RCU doesn't even protect the
identity of the object from changing, only its type. So the object
found may not be the one expected, but it will be one where it is safe
to take a reference or spinlock and then confirm that the identity
matches the expectations.
With traditional reference counting -- such as that implemented by the
kref library in Linux -- there is typically code that runs when the last
reference to an object is dropped. With kref, this is the function
passed to kref_put(). When RCU is being used, such finalization code
must not be run until all ``__rcu`` pointers referencing the object have
been updated, and then a grace period has passed. Every remaining
globally visible pointer to the object must be considered to be a
potential counted reference, and the finalization code is typically run
using call_rcu() only after all those pointers have been changed.
To see how to choose between these two analogies -- of RCU as a
reader-writer lock and RCU as a reference counting system -- it is useful
to reflect on the scale of the thing being protected. The reader-writer
lock analogy looks at larger multi-part objects such as a linked list
and shows how RCU can facilitate concurrency while elements are added
to, and removed from, the list. The reference-count analogy looks at
the individual objects and looks at how they can be accessed safely
within whatever whole they are a part of.
.. _8_whatisRCU:
8. FULL LIST OF RCU APIs
-------------------------
The RCU APIs are documented in docbook-format header comments in the
@ -1035,9 +1109,9 @@ g. Otherwise, use RCU.
Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
.. _8_whatisRCU:
.. _9_whatisRCU:
8. ANSWERS TO QUICK QUIZZES
9. ANSWERS TO QUICK QUIZZES
----------------------------
Quick Quiz #1:

View File

@ -13,6 +13,8 @@ a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim
e) thrashing page cache
f) direct compact
and makes these statistics available to userspace through
the taskstats interface.
@ -41,11 +43,12 @@ generic data structure to userspace corresponding to per-pid and per-tgid
statistics. The delay accounting functionality populates specific fields of
this structure. See
include/linux/taskstats.h
include/uapi/linux/taskstats.h
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
cache, direct compact etc.
Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@ -88,41 +91,37 @@ seen.
General format of the getdelays command::
getdelays [-t tgid] [-p pid] [-c cmd...]
getdelays [-dilv] [-t tgid] [-p pid]
Get delays, since system boot, for pid 10::
# ./getdelays -p 10
# ./getdelays -d -p 10
(output similar to next case)
Get sum of delays, since system boot, for all pids with tgid 5::
# ./getdelays -t 5
# ./getdelays -d -t 5
print delayacct stats ON
TGID 5
CPU count real total virtual total delay total
7876 92005750 100000000 24001500
IO count delay total
0 0
SWAP count delay total
0 0
RECLAIM count delay total
0 0
CPU count real total virtual total delay total delay average
8 7000000 6872122 3382277 0.423ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
0 0 0ms
COMPACT count delay total delay average
0 0 0ms
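The CPU row of the sample output above is consistent with the "delay average" column being the cumulative delay divided by the event count, converted from nanoseconds to milliseconds (3382277 ns over 8 events is roughly 0.423 ms). The sketch below shows that arithmetic, plus the difference of two successive readings of a cumulative counter. It assumes the totals are in nanoseconds; the helper names are hypothetical and are not taken from the getdelays source.

```c
/* Delay accumulated between two readings of a cumulative counter, in ns. */
static unsigned long long delay_delta_ns(unsigned long long earlier_ns,
					 unsigned long long later_ns)
{
	return later_ns - earlier_ns;
}

/* Average delay per event in milliseconds; 0 when there were no events. */
static double delay_average_ms(unsigned long long total_ns,
			       unsigned long long count)
{
	if (count == 0)
		return 0.0;
	return (double)total_ns / 1e6 / (double)count;
}
```

Guarding against a zero count matches the rows in the sample where both the count and the average are reported as 0.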
Get delays seen in executing a given simple command::
Get IO accounting for pid 1 (this works only with -p)::
# ./getdelays -c ls /
# ./getdelays -i -p 1
printing IO accounting
linuxrc: read=65536, write=0, cancelled_write=0
bin data1 data3 data5 dev home media opt root srv sys usr
boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
CPU count real total virtual total delay total
6 4000250 4000000 0
IO count delay total
0 0
SWAP count delay total
0 0
RECLAIM count delay total
0 0
The above command can be used with -v to get more debug information.

View File

@ -92,7 +92,8 @@ Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However for each trigger a separate
file descriptor is required to be able to poll it separately from others,
therefore for each trigger a separate open() syscall should be made even
when opening the same psi interface file.
when opening the same psi interface file. Write operations to a file descriptor
with an already existing psi trigger will fail with EBUSY.
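A minimal sketch of the one-trigger-per-descriptor rule follows. It assumes the trigger string has the form "<some|full> <stall amount in us> <time window in us>" and that the psi file is opened read-write; the helper names ``psi_format_trigger()`` and ``psi_open_trigger()`` are hypothetical. Each trigger gets its own open() call, and the caller keeps the returned descriptor for poll().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "<some|full> <stall us> <window us>" in buf; returns length or -1. */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned long stall_us, unsigned long window_us)
{
	int n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Open a fresh descriptor on a psi file and register one trigger on it.
 * Returns the descriptor (to be polled, then closed) or -errno on failure.
 */
static int psi_open_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -errno;
	/* The terminating NUL is written along with the string. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}
```

With this shape, registering a second trigger means calling ``psi_open_trigger()`` again rather than writing to the descriptor already returned, since a second write to that descriptor is expected to fail with EBUSY.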
Monitors activate only when system enters stall state for the monitored
psi metric and deactivates upon exit from the stall state. While system is

View File

@ -4,6 +4,8 @@
Collaborative Processor Performance Control (CPPC)
==================================================
.. _cppc_sysfs:
CPPC
====

View File

@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB usage limit
hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes

View File

@ -1268,6 +1268,9 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
oom_group_kill
The number of times a group OOM has occurred.
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
@ -1311,6 +1314,9 @@ PAGE_SIZE multiple when read back.
sock (npn)
Amount of memory used in network transmission buffers
vmalloc (npn)
Amount of memory used for vmap backed memory.
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
@ -2260,6 +2266,11 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
hugetlb.<hugepagesize>.numa_stat
Similar to memory.numa_stat, it shows the numa information of the
hugetlb pages of <hugepagesize> in this cgroup. Only active, in-use
hugetlb pages are included. The per-node values are in bytes.
Misc
----

View File

@ -734,10 +734,9 @@ SecurityFlags Flags which control security negotiation and
using weaker password hashes is 0x37037 (lanman,
plaintext, ntlm, ntlmv2, signing allowed). Some
SecurityFlags require the corresponding menuconfig
options to be enabled (lanman and plaintext require
CONFIG_CIFS_WEAK_PW_HASH for example). Enabling
plaintext authentication currently requires also
enabling lanman authentication in the security flags
options to be enabled. Enabling plaintext
authentication currently requires also enabling
lanman authentication in the security flags
because the cifs module only supports sending
plaintext passwords using the older lanman dialect
form of the session setup SMB. (e.g. for authentication

View File

@ -8,11 +8,9 @@ to /proc/cpuinfo output of some architectures. They reside in
Documentation/ABI/stable/sysfs-devices-system-cpu.
Architecture-neutral, drivers/base/topology.c, exports these attributes.
However, the book and drawer related sysfs files will only be created if
CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
where they reflect the cpu and cache hierarchy.
However, the die, cluster, book, and drawer hierarchy related sysfs files will
only be created if an architecture provides the related macros as described
below.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
@ -43,15 +41,14 @@ not defined by include/asm-XXX/topology.h:
2) topology_die_id: -1
3) topology_cluster_id: -1
4) topology_core_id: 0
5) topology_sibling_cpumask: just the given CPU
6) topology_core_cpumask: just the given CPU
7) topology_cluster_cpumask: just the given CPU
8) topology_die_cpumask: just the given CPU
For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
default definitions for topology_book_id() and topology_book_cpumask().
For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
no default definitions for topology_drawer_id() and topology_drawer_cpumask().
5) topology_book_id: -1
6) topology_drawer_id: -1
7) topology_sibling_cpumask: just the given CPU
8) topology_core_cpumask: just the given CPU
9) topology_cluster_cpumask: just the given CPU
10) topology_die_cpumask: just the given CPU
11) topology_book_cpumask: just the given CPU
12) topology_drawer_cpumask: just the given CPU
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal

View File

@ -2339,13 +2339,7 @@
disks (see major number 3) except that the limit on
partitions is 31.
162 char Raw block device interface
0 = /dev/rawctl Raw I/O control device
1 = /dev/raw/raw1 First raw I/O device
2 = /dev/raw/raw2 Second raw I/O device
...
max minor number of raw device is set by kernel config
MAX_RAW_DEVS or raw module parameter 'max_raw_devs'
162 char Used for (now removed) raw block device interface
163 char

View File

@ -0,0 +1,134 @@
.. SPDX-License-Identifier: GPL-2.0-or-later
Configfs GPIO Simulator
=======================
The configfs GPIO Simulator (gpio-sim) provides a way to create simulated GPIO
chips for testing purposes. The lines exposed by these chips can be accessed
using the standard GPIO character device interface as well as manipulated
using sysfs attributes.
Creating simulated chips
------------------------
The gpio-sim module registers a configfs subsystem called ``'gpio-sim'``. For
details of the configfs filesystem, please refer to the configfs documentation.
The user can create a hierarchy of configfs groups and items as well as modify
values of exposed attributes. Once the chip is instantiated, this hierarchy
will be translated to appropriate device properties. The general structure is:
**Group:** ``/config/gpio-sim``
This is the top directory of the gpio-sim configfs tree.
**Group:** ``/config/gpio-sim/gpio-device``
**Attribute:** ``/config/gpio-sim/gpio-device/dev_name``
**Attribute:** ``/config/gpio-sim/gpio-device/live``
This is a directory representing a GPIO platform device. The ``'dev_name'``
attribute is read-only and lets user space read the platform device
name (e.g. ``'gpio-sim.0'``). The ``'live'`` attribute triggers the
actual creation of the device once it's fully configured. The accepted values
are: ``'1'`` to enable the simulated device and ``'0'`` to disable and tear
it down.
**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX``
**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/chip_name``
**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/num_lines``
This group represents a bank of GPIOs under the top platform device. The
``'chip_name'`` attribute is read-only and allows the user-space to read the
device name of the bank device. The ``'num_lines'`` attribute allows to specify
the number of lines exposed by this bank.
**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY``
**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name``
This group represents a single line at the offset Y. The ``'name'`` attribute
allows to set the line name as represented by the ``'gpio-line-names'`` property.
**Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog``
**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/name``
**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/direction``
This item makes the gpio-sim module hog the associated line. The ``'name'``
attribute specifies the in-kernel consumer name to use. The ``'direction'``
attribute specifies the hog direction and must be one of: ``'input'``,
``'output-high'`` and ``'output-low'``.
Inside each bank directory, there's a set of attributes that can be used to
configure the new chip. Additionally the user can ``mkdir()`` subdirectories
inside the chip's directory that allow to pass additional configuration for
specific lines. The name of those subdirectories must take the form of:
``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be
used by the module to assign the config to the specific line at given offset.
Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
order to instantiate the chip. It can be set back to 0 to destroy the simulated
chip. The module will synchronously wait for the new simulated device to be
successfully probed and if this doesn't happen, writing to ``'live'`` will
result in an error.
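The configuration sequence described above could be scripted roughly as
follows. This is a minimal sketch, not something taken from the gpio-sim
module itself: the helper names are hypothetical, and ``CONFIGFS`` is assumed
to default to the ``/config`` mount point used in the paths above (configfs is
often mounted elsewhere, for example at ``/sys/kernel/config``).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; paths follow the layout above.
CONFIGFS="${CONFIGFS:-/config}"

# gpio_sim_add_bank DEVICE BANK NUM_LINES
# Create the device and bank groups, then set the number of lines.
gpio_sim_add_bank() {
    mkdir -p "$CONFIGFS/gpio-sim/$1/$2" || return 1
    echo "$3" > "$CONFIGFS/gpio-sim/$1/$2/num_lines"
}

# gpio_sim_name_line DEVICE BANK OFFSET NAME
# Create the 'line<offset>' group and assign the line name.
gpio_sim_name_line() {
    mkdir -p "$CONFIGFS/gpio-sim/$1/$2/line$3" || return 1
    echo "$4" > "$CONFIGFS/gpio-sim/$1/$2/line$3/name"
}

# gpio_sim_set_live DEVICE 0|1
# Writing 1 instantiates the chip (the write fails if probing fails);
# writing 0 tears the simulated chip down again.
gpio_sim_set_live() {
    echo "$2" > "$CONFIGFS/gpio-sim/$1/live"
}
```

A typical call order would be ``gpio_sim_add_bank gpio-device gpio-bank0 8``,
then ``gpio_sim_name_line gpio-device gpio-bank0 3 sim-foo``, and finally
``gpio_sim_set_live gpio-device 1``, since the hierarchy is only translated
into device properties when the chip is instantiated.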
Simulated GPIO chips can also be defined in device-tree. The compatible string
must be: ``"gpio-simulator"``. Supported properties are:
``"gpio-sim,label"`` - chip label
Other standard GPIO properties (like ``"gpio-line-names"``, ``"ngpios"`` or
``"gpio-hog"``) are also supported. Please refer to the GPIO documentation for
details.
An example device-tree code defining a GPIO simulator:
.. code-block :: none
gpio-sim {
compatible = "gpio-simulator";
bank0 {
gpio-controller;
#gpio-cells = <2>;
ngpios = <16>;
gpio-sim,label = "dt-bank0";
gpio-line-names = "", "sim-foo", "", "sim-bar";
};
bank1 {
gpio-controller;
#gpio-cells = <2>;
ngpios = <8>;
gpio-sim,label = "dt-bank1";
line3 {
gpio-hog;
gpios = <3 0>;
output-high;
line-name = "sim-hog-from-dt";
};
};
};
Manipulating simulated lines
----------------------------
Each simulated GPIO chip creates a separate sysfs group under its device
directory for each exposed line
(e.g. ``/sys/devices/platform/gpio-sim.X/gpiochipY/``). The name of each group
is of the form: ``'sim_gpioX'`` where X is the offset of the line. Inside each
group there are two attributes:
``pull`` - allows to read and set the current simulated pull setting for
every line, when writing the value must be one of: ``'pull-up'``,
``'pull-down'``
``value`` - allows to read the current value of the line which may be
different from the pull if the line is being driven from
user-space
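As an illustration of the two attributes, the helpers below write ``pull`` and
read ``value`` for one line. This is only a sketch: the function names are
hypothetical, and the chip directory (something like
``/sys/devices/platform/gpio-sim.X/gpiochipY``) is passed in by the caller
rather than discovered.

```shell
#!/bin/sh
# gpio_sim_set_pull CHIP_DIR OFFSET pull-up|pull-down
# Reject anything other than the two values the 'pull' attribute accepts.
gpio_sim_set_pull() {
    case "$3" in
        pull-up|pull-down) ;;
        *) echo "invalid pull: $3" >&2; return 1 ;;
    esac
    echo "$3" > "$1/sim_gpio$2/pull"
}

# gpio_sim_get_value CHIP_DIR OFFSET
# Print the current value of the line, which may differ from the pull
# setting if the line is being driven from user space.
gpio_sim_get_value() {
    cat "$1/sim_gpio$2/value"
}
```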

View File

@ -10,6 +10,7 @@ gpio
gpio-aggregator
sysfs
gpio-mockup
gpio-sim
.. only:: subproject and html

View File

@ -468,7 +468,7 @@ Spectre variant 2
before invoking any firmware code to prevent Spectre variant 2 exploits
using the firmware.
Using kernel address space randomization (CONFIG_RANDOMIZE_SLAB=y
Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
attacks on the kernel generally more difficult.

View File

@ -225,14 +225,23 @@
For broken nForce2 BIOS resulting in XT-PIC timer.
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
old_ordering, nonvs, sci_force_enable, nobl }
Format: { s3_bios, s3_mode, s3_beep, s4_hwsig,
s4_nohwsig, old_ordering, nonvs,
sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
s4_hwsig causes the kernel to check the ACPI hardware
signature during resume from hibernation, and gracefully
refuse to resume if it has changed. This complies with
the ACPI specification but not with reality, since
Windows does not do this and many laptops do change it
on docking. So the default behaviour is to allow resume
and simply warn when the signature changes, unless the
s4_hwsig option is enabled.
s4_nohwsig prevents ACPI hardware signature from being
used during resume from hibernation.
used (or even warned about) during resume.
old_ordering causes the ACPI 1.0 ordering of the _PTS
control method, with respect to putting devices into
low power states, to be enforced (the ACPI 2.0 ordering
@ -603,8 +612,8 @@
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
unstable. Defaults to three retries, that is,
four attempts to read the clock under test.
unstable. Defaults to two retries, that is,
three attempts to read the clock under test.
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
@ -2940,7 +2949,7 @@
both parameters are enabled, hugetlb_free_vmemmap takes
precedence over memory_hotplug.memmap_on_memory.
memtest= [KNL,X86,ARM,PPC,RISCV] Enable memtest
memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
@ -3384,7 +3393,7 @@
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
nosmep [X86,PPC]
nosmep [X86,PPC64s]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
@ -3551,6 +3560,13 @@
shutdown the other cpus. Instead use the REBOOT_VECTOR
irq.
nomodeset Disable kernel modesetting. DRM drivers will not perform
display-mode changes or accelerated rendering. Only the
system framebuffer will be available for use if this was
set-up by the firmware or boot loader.
Useful as fallback, or for testing and debugging.
nomodule Disable module load
nopat [X86] Disable PAT (page attribute table extension of
@ -4357,19 +4373,30 @@
Disable the Correctable Errors Collector,
see CONFIG_RAS_CEC help text.
rcu_nocbs= [KNL]
The argument is a cpu list, as described above.
rcu_nocbs[=cpu-list]
[KNL] The optional argument is a cpu list,
as described above.
In kernels built with CONFIG_RCU_NOCB_CPU=y, set
the specified list of CPUs to be no-callback CPUs.
Invocation of these CPUs' RCU callbacks will be
offloaded to "rcuox/N" kthreads created for that
purpose, where "x" is "p" for RCU-preempt, and
"s" for RCU-sched, and "N" is the CPU number.
This reduces OS jitter on the offloaded CPUs,
which can be useful for HPC and real-time
workloads. It can also improve energy efficiency
for asymmetric multiprocessors.
In kernels built with CONFIG_RCU_NOCB_CPU=y,
enable the no-callback CPU mode, which prevents
such CPUs' callbacks from being invoked in
softirq context. Invocation of such CPUs' RCU
callbacks will instead be offloaded to "rcuox/N"
kthreads created for that purpose, where "x" is
"p" for RCU-preempt, "s" for RCU-sched, and "g"
for the kthreads that mediate grace periods; and
"N" is the CPU number. This reduces OS jitter on
the offloaded CPUs, which can be useful for HPC
and real-time workloads. It can also improve
energy efficiency for asymmetric multiprocessors.
If a cpulist is passed as an argument, the specified
list of CPUs is set to no-callback mode from boot.
Otherwise, if the '=' sign and the cpulist
arguments are omitted, no CPU will be set to
no-callback mode from boot but the mode may be
toggled at runtime via cpusets.
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
@ -4503,10 +4530,6 @@
on rcutree.qhimark at boot time and to zero to
disable more aggressive help enlistment.
rcutree.rcu_idle_gp_delay= [KNL]
Set wakeup interval for idle CPUs that have
RCU callbacks (RCU_FAST_NO_HZ=y).
rcutree.rcu_kick_kthreads= [KNL]
Cause the grace-period kthread to get an extra
wake_up() if it sleeps three times longer than
@ -4617,8 +4640,12 @@
in seconds.
rcutorture.fwd_progress= [KNL]
Enable RCU grace-period forward-progress testing
Specifies the number of kthreads to be used
for RCU grace-period forward-progress testing
for the types of RCU supporting this notion.
Defaults to 1 kthread, values less than zero or
greater than the number of CPUs cause the number
of CPUs to be used.
rcutorture.fwd_progress_div= [KNL]
Specify the fraction of a CPU-stall-warning
@ -4819,6 +4846,29 @@
period to instead use normal non-expedited
grace-period processing.
rcupdate.rcu_task_collapse_lim= [KNL]
Set the maximum number of callbacks present
at the beginning of a grace period that allows
the RCU Tasks flavors to collapse back to using
a single callback queue. This switching only
occurs when rcupdate.rcu_task_enqueue_lim is
set to the default value of -1.
rcupdate.rcu_task_contend_lim= [KNL]
Set the minimum number of callback-queuing-time
lock-contention events per jiffy required to
cause the RCU Tasks flavors to switch to per-CPU
callback queuing. This switching only occurs
when rcupdate.rcu_task_enqueue_lim is set to
the default value of -1.
rcupdate.rcu_task_enqueue_lim= [KNL]
Set the number of callback queues to use for the
RCU Tasks family of RCU flavors. The default
of -1 allows this to be automatically (and
dynamically) adjusted. This parameter is intended
for use in testing.
rcupdate.rcu_task_ipi_delay= [KNL]
Set time in jiffies during which RCU tasks will
avoid sending IPIs, starting with the beginning
@ -6452,6 +6502,12 @@
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
xive.store-eoi=off [PPC]
By default on POWER10 and above, the kernel will use
stores for EOI handling when the XIVE interrupt mode
is active. This option allows the XIVE driver to use
loads instead, as on POWER9.
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be

View File

@ -208,7 +208,7 @@ Do at least one of the following:
2. Enable RCU to do its processing remotely via dyntick-idle by
doing all of the following:
a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
a. Build with CONFIG_NO_HZ=y.
b. Ensure that the CPU goes idle frequently, allowing other
CPUs to detect that it has passed through an RCU quiescent
state. If the kernel is built with CONFIG_NO_HZ_FULL=y,

View File

@ -60,6 +60,7 @@ s5p-mfc Samsung S5P MFC Video Codec
sh_veu SuperH VEU mem2mem video processing
sh_vou SuperH VOU video output
stm32-dcmi STM32 Digital Camera Memory Interface (DCMI)
stm32-dma2d STM32 Chrom-Art Accelerator Unit
sun4i-csi Allwinner A10 CMOS Sensor Interface Support
sun6i-csi Allwinner V3s Camera Sensor Interface
sun8i-di Allwinner Deinterlace

View File

@ -208,6 +208,31 @@ PID of the DAMON thread.
If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
-1.
nr_reclaim_tried_regions
------------------------
Number of memory regions that DAMON_RECLAIM tried to reclaim.
bytes_reclaim_tried_regions
---------------------------
Total bytes of memory regions that DAMON_RECLAIM tried to reclaim.
nr_reclaimed_regions
--------------------
Number of memory regions that DAMON_RECLAIM successfully reclaimed.
bytes_reclaimed_regions
-----------------------
Total bytes of memory regions that DAMON_RECLAIM successfully reclaimed.
nr_quota_exceeds
----------------
Number of times that the time/space quota limits have been exceeded.
Example
=======

View File

@ -7,37 +7,40 @@ Detailed Usages
DAMON provides below three interfaces for different users.
- *DAMON user space tool.*
This is for privileged people such as system administrators who want a
just-working human-friendly interface. Using this, users can use the DAMONs
major features in a human-friendly way. It may not be highly tuned for
special cases, though. It supports both virtual and physical address spaces
monitoring.
`This <https://github.com/awslabs/damo>`_ is for privileged people such as
system administrators who want a just-working human-friendly interface.
Using this, users can use DAMON's major features in a human-friendly way.
It may not be highly tuned for special cases, though. It supports both
virtual and physical address spaces monitoring. For more detail, please
refer to its `usage document
<https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
- *debugfs interface.*
This is for privileged user space programmers who want more optimized use of
DAMON. Using this, users can use DAMONs major features by reading
from and writing to special debugfs files. Therefore, you can write and use
your personalized DAMON debugfs wrapper programs that reads/writes the
debugfs files instead of you. The DAMON user space tool is also a reference
implementation of such programs. It supports both virtual and physical
address spaces monitoring.
:ref:`This <debugfs_interface>` is for privileged user space programmers who
want more optimized use of DAMON. Using this, users can use DAMON's major
features by reading from and writing to special debugfs files. Therefore,
you can write and use your personalized DAMON debugfs wrapper programs that
read/write the debugfs files on your behalf. The `DAMON user space tool
<https://github.com/awslabs/damo>`_ is one example of such programs. It
supports both virtual and physical address spaces monitoring. Note that this
interface provides only simple :ref:`statistics <damos_stats>` for the
monitoring results. For detailed monitoring results, DAMON provides a
:ref:`tracepoint <tracepoint>`.
- *Kernel Space Programming Interface.*
This is for kernel space programmers. Using this, users can utilize every
feature of DAMON most flexibly and efficiently by writing kernel space
DAMON application programs for you. You can even extend DAMON for various
address spaces.
:doc:`This </vm/damon/api>` is for kernel space programmers. Using this,
users can utilize every feature of DAMON most flexibly and efficiently by
writing kernel space DAMON application programs for you. You can even extend
DAMON for various address spaces. For detail, please refer to the interface
:doc:`document </vm/damon/api>`.
Nevertheless, you could write your own user space tool using the debugfs
interface. A reference implementation is available at
https://github.com/awslabs/damo. If you are a kernel programmer, you could
refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
the reason, this document describes only the debugfs interface
.. _debugfs_interface:
debugfs Interface
=================
DAMON exports five files, ``attrs``, ``target_ids``, ``init_regions``,
``schemes`` and ``monitor_on`` under its debugfs directory,
``<debugfs>/damon/``.
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
@ -131,24 +134,38 @@ Schemes
For usual DAMON-based data access aware memory management optimizations, users
would simply want the system to apply a memory management action to a memory
region of a specific size having a specific access frequency for a specific
time. DAMON receives such formalized operation schemes from the user and
applies those to the target processes. It also counts the total number and
size of regions that each scheme is applied. This statistics can be used for
online analysis or tuning of the schemes.
region of a specific access pattern. DAMON receives such formalized operation
schemes from the user and applies those to the target processes.
Users can get and set the schemes by reading from and writing to ``schemes``
debugfs file. Reading the file also shows the statistics of each scheme. To
the file, each of the schemes should be represented in each line in below form:
the file, each of the schemes should be represented in each line in below
form::
min-size max-size min-acc max-acc min-age max-age action
<target access pattern> <action> <quota> <watermarks>
Note that the ranges are closed interval. Bytes for the size of regions
(``min-size`` and ``max-size``), number of monitored accesses per aggregate
interval for access frequency (``min-acc`` and ``max-acc``), number of
aggregate intervals for the age of regions (``min-age`` and ``max-age``), and a
predefined integer for memory management actions should be used. The supported
numbers and their meanings are as below.
You can disable schemes by simply writing an empty string to the file.
Target Access Pattern
~~~~~~~~~~~~~~~~~~~~~
The ``<target access pattern>`` is constructed with three ranges in below
form::
min-size max-size min-acc max-acc min-age max-age
Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
number of monitored accesses per aggregate interval for access frequency
(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
regions (``min-age`` and ``max-age``) are specified. Note that the ranges are
closed interval.
Action
~~~~~~
The ``<action>`` is a predefined integer for memory management actions, which
DAMON will apply to the regions having the target access pattern. The
supported numbers and their meanings are as below.
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
- 1: Call ``madvise()`` for the region with ``MADV_COLD``
@ -157,20 +174,82 @@ numbers and their meanings are as below.
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
- 5: Do nothing but count the statistics
You can disable schemes by simply writing an empty string to the file. For
example, below commands applies a scheme saying "If a memory region of size in
[4KiB, 8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
interval in [10, 20], page out the region", check the entered scheme again, and
finally remove the scheme. ::
Quota
~~~~~
Optimal ``target access pattern`` for each ``action`` is workload dependent, so
not easy to find. Worse yet, setting a scheme of some action too aggressive
can cause severe overhead. To avoid such overhead, users can limit time and
size quota for the scheme via the ``<quota>`` in below form::
<ms> <sz> <reset interval> <priority weights>
This makes DAMON try to use only up to ``<ms>`` milliseconds for applying
the action to memory regions of the ``target access pattern`` within the
``<reset interval>`` milliseconds, and to apply the action to only up to
``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both
``<ms>`` and ``<sz>`` zero disables the quota limits.
When the quota limit is expected to be exceeded, DAMON prioritizes found memory
regions of the ``target access pattern`` based on their size, access frequency,
and age. For personalized prioritization, users can set the weights for the
three properties in ``<priority weights>`` in below form::
<size weight> <access frequency weight> <age weight>
Watermarks
~~~~~~~~~~
Some schemes would need to run based on current value of the system's specific
metrics like free memory ratio. For such cases, users can specify watermarks
for the condition in below form::
<metric> <check interval> <high mark> <middle mark> <low mark>
``<metric>`` is a predefined integer for the metric to be checked. The
supported numbers and their meanings are as below.
- 0: Ignore the watermarks
- 1: System's free memory rate (per thousand)
The value of the metric is checked every ``<check interval>`` microseconds.
If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
scheme is deactivated. If the value is lower than ``<middle mark>``, the scheme
is activated.
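The activation rule can be sketched as a small state function. This is an
illustration only, not DAMON's code: the function name is hypothetical, and it
assumes that a metric value between the middle and high marks leaves the
scheme in whatever state it already had, which the text above does not state
explicitly.

```shell
#!/bin/sh
# wmark_next_state CURRENT METRIC HIGH MID LOW
# CURRENT is "active" or "inactive"; prints the state after one check.
wmark_next_state() {
    if [ "$2" -gt "$3" ] || [ "$2" -lt "$5" ]; then
        echo inactive          # above the high mark or below the low mark
    elif [ "$2" -lt "$4" ]; then
        echo active            # below the middle mark
    else
        echo "$1"              # assumption: otherwise keep the current state
    fi
}
```

With marks of 600, 500 and 300 (per thousand), a free memory rate of 450
activates an inactive scheme, while 650 or 250 deactivates an active one.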
.. _damos_stats:
Statistics
~~~~~~~~~~
DAMON also counts the total number and bytes of regions that each scheme tried
to be applied to, the same two numbers for the regions that each scheme was
successfully applied to, and the total number of times the quota limit was
exceeded. These statistics can be used for online analysis or tuning of the
schemes.
The statistics can be shown by reading the ``schemes`` file. Reading the file
will show each scheme you entered in each line, and the five numbers for the
statistics will be added at the end of each line.
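Because the statistics are the last five fields of each line, they can be
picked out without parsing the rest of the scheme. The sketch below is
hypothetical: the function name and the labels are invented for illustration,
and the field order is assumed to follow the order in which the statistics are
listed above (tried count, tried bytes, applied count, applied bytes, quota
exceeds).

```shell
#!/bin/sh
# damos_stats: read lines in the format of the 'schemes' file on stdin and
# print the five trailing statistics fields with (hypothetical) labels.
damos_stats() {
    awk 'NF >= 5 {
        printf "nr_tried=%s sz_tried=%s nr_applied=%s sz_applied=%s qt_exceeds=%s\n",
               $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF
    }'
}
```

It would typically be used as ``damos_stats < schemes`` from within the DAMON
debugfs directory.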
Example
~~~~~~~
The commands below apply a scheme saying "If a memory region of size in [4KiB,
8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
interval in [10, 20], page out the region. For the paging out, use only up to
10ms per second, and also don't page out more than 1GiB per second. Under the
limitation, page out memory regions having longer age first. Also, check the
free memory rate of the system every 5 seconds, start the monitoring and paging
out when the free memory rate becomes lower than 50%, but stop it if the free
memory rate becomes larger than 60%, or lower than 30%".::
# cd <debugfs>/damon
# echo "4096 8192 0 5 10 20 2" > schemes
# cat schemes
4096 8192 0 5 10 20 2 0 0
# echo > schemes
The last two integers in the 4th line of above example is the total number and
the total size of the regions that the scheme is applied.
# scheme="4096 8192 0 5 10 20 2" # target access pattern and action
# scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
# scheme+=" 0 0 100" # prioritization weights
# scheme+=" 1 5000000 600 500 300" # watermarks
# echo "$scheme" > schemes
Turning On/Off
@ -195,6 +274,54 @@ the monitoring is turned on. If you write to the files while DAMON is running,
an error code such as ``-EBUSY`` will be returned.
Monitoring Thread PID
---------------------
DAMON does requested monitoring with a kernel thread called ``kdamond``. You
can get the pid of the thread by reading the ``kdamond_pid`` file. When the
monitoring is turned off, reading the file returns ``none``. ::
# cd <debugfs>/damon
# cat monitor_on
off
# cat kdamond_pid
none
# echo on > monitor_on
# cat kdamond_pid
18594
Using Multiple Monitoring Threads
---------------------------------
One ``kdamond`` thread is created for each monitoring context. You can create
and remove monitoring contexts for multiple ``kdamond`` required use case using
the ``mk_contexts`` and ``rm_contexts`` files.
Writing the name of the new context to the ``mk_contexts`` file creates a
directory of the name on the DAMON debugfs directory. The directory will have
DAMON debugfs files for the context. ::
# cd <debugfs>/damon
# ls foo
# ls: cannot access 'foo': No such file or directory
# echo foo > mk_contexts
# ls foo
# attrs init_regions kdamond_pid schemes target_ids
If the context is not needed anymore, you can remove it and the corresponding
directory by putting the name of the context to the ``rm_contexts`` file. ::
# echo foo > rm_contexts
# ls foo
# ls: cannot access 'foo': No such file or directory
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
root directory only.
.. _tracepoint:
Tracepoint for Monitoring Results
=================================

View File

@ -408,7 +408,7 @@ follows:
Memory Policy APIs
==================
Linux supports 3 system calls for controlling memory policy. These APIS
Linux supports 4 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.
@ -460,6 +460,20 @@ requested via the 'flags' argument.
See the mbind(2) man page for more details.
Set home node for a Range of Task's Address Space::
long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
task's address range. The system call updates the home node only for the existing
mempolicy range. Other address ranges are ignored. A home node is the NUMA node
closest to which page allocations will be made. Specifying the home node overrides
the default allocation policy of allocating memory close to the local node of the
executing CPU.
Memory Policy Command Line Interface
====================================

View File

@ -0,0 +1,106 @@
================================================
HiSilicon PCIe Performance Monitoring Unit (PMU)
================================================
On Hip09, HiSilicon PCIe Performance Monitoring Unit (PMU) could monitor
bandwidth, latency, bus utilization and buffer occupancy data of PCIe.
Each PCIe Core has a PMU to monitor multi Root Ports of this PCIe Core and
all Endpoints downstream these Root Ports.
HiSilicon PCIe PMU driver
=========================
The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe
Core id::
/sys/bus/event_source/devices/hisi_pcie<sicl>_<core>
PMU driver provides description of available events and filter options in sysfs,
see /sys/bus/event_source/devices/hisi_pcie<sicl>_<core>.
The "format" directory describes all formats of the config (events) and config1
(filter options) fields of the perf_event_attr structure. The "events" directory
describes all documented events shown in perf list.
The "identifier" sysfs file allows users to identify the version of the
PMU hardware device.
The "bus" sysfs file allows users to get the bus number of Root Ports
monitored by PMU.
Example usage of perf::
$# perf list
hisi_pcie0_0/rx_mwr_latency/ [kernel PMU event]
hisi_pcie0_0/rx_mwr_cnt/ [kernel PMU event]
------------------------------------------
$# perf stat -e hisi_pcie0_0/rx_mwr_latency/
$# perf stat -e hisi_pcie0_0/rx_mwr_cnt/
$# perf stat -g -e hisi_pcie0_0/rx_mwr_latency/ -e hisi_pcie0_0/rx_mwr_cnt/
The current driver does not support sampling, so "perf record" is unsupported.
Attaching to a task is also unsupported for the PCIe PMU.
Filter options
--------------
1. Target filter
The PMU can only monitor the performance of traffic downstream of target Root
Ports or downstream of a target Endpoint. The PCIe PMU driver supports "port"
and "bdf" interfaces for users, and these two interfaces cannot be used at the
same time.
-port
"port" filter can be used in all PCIe PMU events, target Root Port can be
selected by configuring the 16-bits-bitmap "port". Multi ports can be selected
for AP-layer-events, and only one port can be selected for TL/DL-layer-events.
For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of bitmap
should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4 lanes),
bit8 is set, port=0x100; if these two Root Ports are both monitored, port=0x101.
Example usage of perf::
$# perf stat -e hisi_pcie0_0/rx_mwr_latency,port=0x1/ sleep 5
-bdf
"bdf" filter can only be used in bandwidth events, target Endpoint is selected
by configuring BDF to "bdf". Counter only counts the bandwidth of message
requested by target Endpoint.
For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.
Example usage of perf::
$# perf stat -e hisi_pcie0_0/rx_mrd_flux,bdf=0x3900/ sleep 5
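The "bdf" value can be derived from a ``BB:DD.F`` address with a little
arithmetic. The helper below is a hypothetical sketch: it assumes the
conventional PCI packing of bus in bits 15:8, device in bits 7:3 and function
in bits 2:0, which agrees with the ``0000:39:00.0`` example above but is not
otherwise confirmed by this document.

```shell
#!/bin/sh
# bdf_to_filter BB:DD.F   (hex fields, without the domain prefix)
# Prints the value to pass as bdf=... under the assumed bit layout.
bdf_to_filter() {
    bus=${1%%:*}          # text before the first ':'
    rest=${1#*:}          # "DD.F"
    dev=${rest%%.*}       # text before the '.'
    fn=${rest#*.}         # text after the '.'
    printf '0x%x\n' $(( (0x$bus << 8) | (0x$dev << 3) | 0x$fn ))
}
```

For example, ``bdf_to_filter 39:00.0`` yields the ``0x3900`` used in the perf
command above.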
2. Trigger filter
Event statistics start the first time the TLP length is greater/smaller
than the trigger condition. You can set the trigger condition by writing "trig_len",
and set the trigger mode by writing "trig_mode". This filter can only be used
in bandwidth events.
For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
means statistics start when TLP length > trigger condition, "trig_mode=1"
means start when TLP length < condition.
Example usage of perf::
$# perf stat -e hisi_pcie0_0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
3. Threshold filter
The counter counts when the TLP length is within the specified range. You can set the
threshold by writing "thr_len", and set the threshold mode by writing
"thr_mode". This filter can only be used in bandwidth events.
For example, "thr_len=4" means threshold is 2^4 DW, "thr_mode=0" means
counter counts when TLP length >= threshold, and "thr_mode=1" means counts
when TLP length < threshold.
Example usage of perf::
$# perf stat -e hisi_pcie0_0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
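The threshold semantics can be written out as a small predicate. This is only
an illustration of the rule stated above, with a hypothetical function name;
lengths are in doublewords (DW), and the threshold is 2 raised to "thr_len".

```shell
#!/bin/sh
# thr_counts THR_LEN THR_MODE TLP_LEN_DW
# Returns success (0) if a TLP of TLP_LEN_DW doublewords would be counted.
thr_counts() {
    threshold=$(( 1 << $1 ))          # thr_len=4 -> 2^4 = 16 DW
    if [ "$2" -eq 0 ]; then
        [ "$3" -ge "$threshold" ]     # thr_mode=0: count when length >= threshold
    else
        [ "$3" -lt "$threshold" ]     # thr_mode=1: count when length < threshold
    fi
}
```

With ``thr_len=4``, a 16 DW TLP is counted in mode 0 but not in mode 1, and a
15 DW TLP is counted in mode 1 but not in mode 0.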

View File

@ -0,0 +1,382 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===============================================
``amd-pstate`` CPU Performance Scaling Driver
===============================================
:Copyright: |copy| 2021 Advanced Micro Devices, Inc.
:Author: Huang Rui <ray.huang@amd.com>
Introduction
===================
``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism on modern AMD APU and CPU series in
Linux kernel. The new mechanism is based on Collaborative Processor
Performance Control (CPPC) which provides finer grain frequency management
than legacy ACPI hardware P-States. Current AMD CPU/APU platforms are using
the ACPI P-states driver to manage CPU frequency and clocks with switching
only in 3 P-states. CPPC replaces the ACPI P-states controls and allows a
flexible, low-latency interface for the Linux kernel to directly
communicate the performance hints to hardware.
``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
``ondemand``, etc. to manage the performance hints which are provided by
CPPC hardware functionality that internally follows the hardware
specification (for details refer to AMD64 Architecture Programmer's Manual
Volume 2: System Programming [1]_). Currently ``amd-pstate`` supports basic
frequency control function according to kernel governors on some of the
Zen2 and Zen3 processors, and we will implement more AMD specific functions
in future after we verify them on the hardware and SBIOS.
AMD CPPC Overview
=======================
Collaborative Processor Performance Control (CPPC) interface enumerates a
continuous, abstract, and unit-less performance value in a scale that is
not tied to a specific performance state / frequency. This is an ACPI
standard [2]_ through which software can specify application performance goals
and hints as a relative target to the infrastructure limits. AMD processors
provide the low latency register model (MSR) instead of an AML code
interpreter for performance adjustments. ``amd-pstate`` will initialize a
``struct cpufreq_driver`` instance ``amd_pstate_driver`` with the callbacks
to manage each performance update behavior. ::

 Highest Perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |          Max Perf  ---->|                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
 Nominal Perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |      Desired Perf  ---->|                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
  Lowest non-        |                       |                         |                       |
  linear perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |       Lowest perf  ---->|                       |
                     |                       |                         |                       |
  Lowest perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
          0   ------>+-----------------------+                         +-----------------------+

                                     AMD P-States Performance Scale

.. _perf_cap:
AMD CPPC Performance Capability
--------------------------------
Highest Performance (RO)
.........................
It is the absolute maximum performance an individual processor may reach,
assuming ideal conditions. This performance level may not be sustainable
for long durations and may only be achievable if other platform components
are in a specific state; for example, it may require other processors be in
an idle state. This would be equivalent to the highest frequencies
supported by the processor.
Nominal (Guaranteed) Performance (RO)
......................................
It is the maximum sustained performance level of the processor, assuming
ideal operating conditions. In absence of an external constraint (power,
thermal, etc.) this is the performance level the processor is expected to
be able to maintain continuously. All cores/processors are expected to be
able to sustain their nominal performance state simultaneously.
Lowest non-linear Performance (RO)
...................................
It is the lowest performance level at which nonlinear power savings are
achieved, for example, due to the combined effects of voltage and frequency
scaling. Above this threshold, lower performance levels should be generally
more energy efficient than higher performance levels. This register
effectively conveys the most efficient performance level to ``amd-pstate``.
Lowest Performance (RO)
........................
It is the absolute lowest performance level of the processor. Selecting a
performance level lower than the lowest nonlinear performance level may
cause an efficiency penalty but should reduce the instantaneous power
consumption of the processor.
AMD CPPC Performance Control
------------------------------
``amd-pstate`` passes performance goals through these registers. The
register drives the behavior of the desired performance target.
Minimum requested performance (RW)
...................................
``amd-pstate`` specifies the minimum allowed performance level.
Maximum requested performance (RW)
...................................
``amd-pstate`` specifies a limit on the maximum performance that is expected
to be supplied by the hardware.
Desired performance target (RW)
...................................
``amd-pstate`` specifies a desired target in the CPPC performance scale as
a relative number. This can be expressed as a percentage of nominal
performance (infrastructure max). Below the nominal sustained performance
level, desired performance expresses the average performance level of the
processor subject to hardware. Above the nominal performance level, the
processor must provide at least the nominal performance requested and go
higher if current operating conditions allow.
Energy Performance Preference (EPP) (RW)
.........................................
Provides a hint to the hardware as to whether software wants to bias toward
performance (0x0) or energy efficiency (0xff).
Key Governors Support
=======================
``amd-pstate`` can be used with all the (generic) scaling governors listed
by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
it is responsible for the configuration of policy objects corresponding to
CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
to the policy objects) with accurate information on the maximum and minimum
operating frequencies supported by the hardware. Users can check the
``scaling_cur_freq`` information, which comes from the ``CPUFreq`` core.
``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
frequency control. The processor configuration is fine-tuned for using
``amd-pstate`` with ``schedutil`` and the CPU CFS scheduler. ``amd-pstate``
registers the adjust_perf callback to implement performance update behavior
similar to CPPC. It is initialized by ``sugov_start``, which then populates
the CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as
the utilization update callback function in the CPU scheduler. The CPU
scheduler will call ``cpufreq_update_util`` and assign the target performance
according to the ``struct sugov_cpu`` that the utilization update belongs to.
Then ``amd-pstate`` updates the desired performance according to the value
the CPU scheduler assigned.
Processor Support
=======================
The ``amd-pstate`` initialization will fail if the _CPC entry in the ACPI
SBIOS does not exist for the detected processor; it uses ``acpi_cpc_valid``
to check for the existence of _CPC. All Zen based processors support the
legacy ACPI hardware P-States function, so when ``amd-pstate`` fails to
initialize, the kernel will fall back to initializing the ``acpi-cpufreq``
driver.
There are two types of hardware implementations for ``amd-pstate``: one is
`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
<perf_cap_>`_. The :c:macro:`X86_FEATURE_CPPC` feature flag (for
details refer to Processor Programming Reference (PPR) for AMD Family
19h Model 51h, Revision A1 Processors [3]_) indicates which type is
present. ``amd-pstate`` registers different ``static_call`` instances
for the different hardware implementations.
Currently, some Zen2 and Zen3 processors support ``amd-pstate``. In the
future, it will be supported on more and more AMD processors.
Full MSR Support
-----------------
Some new Zen3 processors such as Cezanne provide the MSR registers directly
when the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
``amd-pstate`` can handle the MSR registers to implement the fast switch
function in ``CPUFreq``, which can reduce the latency of frequency control
in interrupt context. The functions with the ``pstate_xxx`` prefix represent
the operations on MSR registers.
Shared Memory Support
----------------------
If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
processor supports the shared memory solution. In this case, ``amd-pstate``
uses the ``cppc_acpi`` helper methods to implement the callback functions
that are defined on ``static_call``. The functions with the ``cppc_xxx``
prefix represent the operations of the ACPI CPPC helpers for the shared
memory solution.
AMD P-States and ACPI hardware P-States can always both be supported on one
processor. But AMD P-States has the higher priority and if it is enabled
with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor
will respond to requests from AMD P-States.
User Space Interface in ``sysfs``
==================================
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::
root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq
``amd_pstate_highest_perf / amd_pstate_max_freq``
Maximum CPPC performance and CPU frequency that the driver is allowed to
set in percent of the maximum supported CPPC performance level (the highest
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
On some ASICs, the highest CPPC performance is not the one in the _CPC
table, so we need to expose it to sysfs. If boost is not active but
supported, this maximum frequency will be larger than the one in
``cpuinfo``.
This attribute is read-only.
``amd_pstate_lowest_nonlinear_freq``
The lowest non-linear CPPC CPU frequency that the driver is allowed to set
in percent of the maximum supported CPPC performance level (see the
lowest non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_).
This attribute is read-only.
For other performance and frequency values, we can read them back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
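The attributes listed above can be collected with a few lines of script. The
following is a minimal sketch, not part of the driver: the helper name
``read_amd_pstate_attrs`` is hypothetical, and it assumes each attribute file
holds a single integer.

.. code-block:: python

   from pathlib import Path

   # Attribute names taken from the sysfs listing shown in this section.
   AMD_PSTATE_ATTRS = (
       "amd_pstate_highest_perf",
       "amd_pstate_lowest_nonlinear_freq",
       "amd_pstate_max_freq",
   )


   def read_amd_pstate_attrs(policy_dir):
       """Return {attribute: int} for the amd-pstate files found in policy_dir.

       Hypothetical helper: files that are absent (for example when another
       cpufreq driver is loaded) are skipped rather than treated as errors.
       """
       values = {}
       for name in AMD_PSTATE_ATTRS:
           path = Path(policy_dir) / name
           if path.is_file():
               values[name] = int(path.read_text().strip())
       return values

Calling ``read_amd_pstate_attrs("/sys/devices/system/cpu/cpufreq/policy0")``
returns an empty dictionary on systems where ``amd-pstate`` is not in use.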
``amd-pstate`` vs ``acpi-cpufreq``
======================================
On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI
tables provided by the platform firmware are used for CPU performance
scaling, but they only provide 3 P-states on AMD processors.
However, modern AMD APU and CPU series provide collaborative processor
performance control according to the ACPI protocol, customized for AMD
platforms. This gives a fine-grained and continuous frequency range
instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
module which supports the new AMD P-States mechanism on most future AMD
platforms. The AMD P-States mechanism is the frequency management method
with better performance and energy efficiency on AMD processors.
Kernel Module Options for ``amd-pstate``
=========================================
``shared_mem``
Use a module param (shared_mem) to enable related processors manually with
**amd_pstate.shared_mem=1**.
Because of a performance issue on processors with `Shared Memory Support
<perf_cap_>`_, it is disabled for the moment and will be enabled by default
once the performance issue with this solution is addressed.
To check whether the current processor has `Full MSR Support <perf_cap_>`_
or `Shared Memory Support <perf_cap_>`_ : ::
ray@hr-test1:~$ lscpu | grep cppc
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
If the CPU flags include ``cppc``, then this processor supports `Full MSR
Support <perf_cap_>`_. Otherwise it supports `Shared Memory Support <perf_cap_>`_.
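The same check can be scripted. The sketch below is illustrative only: the
function name ``has_cppc_flag`` is hypothetical, and it assumes the input is
text in the ``lscpu`` or ``/proc/cpuinfo`` style, where a ``Flags:`` or
``flags :`` line carries a space-separated list. It matches the whole word, so
a longer flag that merely contains ``cppc`` is not counted.

.. code-block:: python

   def has_cppc_flag(cpu_text):
       """Return True if a flags line in cpu_text lists the cppc flag."""
       for line in cpu_text.splitlines():
           key, sep, value = line.partition(":")
           # Only look at the flags line; ignore model name and other fields.
           if sep and key.strip().lower() == "flags":
               if "cppc" in value.split():
                   return True
       return False


   sample = "Flags: fpu vme rdpru wbnoinvd cppc arat npt"
   print(has_cppc_flag(sample))  # prints True
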
``cpupower`` tool support for ``amd-pstate``
===============================================
``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
the frequency information. Work is in progress to support more operations for
the new ``amd-pstate`` module with this tool. ::
root@hr-test1:/home/ray# cpupower frequency-info
analyzing CPU 0:
driver: amd-pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 131 us
hardware limits: 400 MHz - 4.68 GHz
available cpufreq governors: ondemand conservative powersave userspace performance schedutil
current policy: frequency should be within 400 MHz and 4.68 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 4.02 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
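The performance and frequency pairs in this output can be related by simple
proportional scaling from the nominal point. The sketch below is only an
approximation under that assumption; ``perf_to_mhz`` is a hypothetical helper,
not a driver function. With the values above it reproduces the maximum and
lowest non-linear frequencies, but not the lowest frequency (15 maps to about
423 MHz rather than the reported 400 MHz), so the linear model should not be
taken as the exact rule the hardware or firmware applies.

.. code-block:: python

   def perf_to_mhz(perf, nominal_perf, nominal_mhz):
       """Scale an abstract CPPC performance value to MHz.

       Assumes frequency is proportional to performance, anchored at the
       nominal performance/frequency pair.
       """
       return nominal_mhz * perf / nominal_perf


   # Values taken from the cpupower output above.
   NOMINAL_PERF = 117
   NOMINAL_MHZ = 3300

   for label, perf in (("highest", 166), ("lowest non-linear", 39)):
       print(label, round(perf_to_mhz(perf, NOMINAL_PERF, NOMINAL_MHZ)), "MHz")
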
Diagnostics and Tuning
=======================
Trace Events
--------------
There are two static trace events that can be used for ``amd-pstate``
diagnostics. One of them is the cpu_frequency trace event generally used
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
specific to ``amd-pstate``. The following sequence of shell commands can
be used to enable them and see their output (if the kernel is generally
configured to support event tracing). ::
root@hr-test1:/home/ray# cd /sys/kernel/tracing/
root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
root@hr-test1:/sys/kernel/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 47827/42233061 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [015] dN... 4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
<idle>-0 [007] d.h.. 4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
cat-2161 [000] d.... 4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
sshd-2125 [004] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
<idle>-0 [007] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
<idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
<idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
The cpu_frequency trace event will be triggered either by the ``schedutil`` scaling
governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
policies with other scaling governors).
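For offline analysis, the ``amd_pstate_perf`` lines in the trace output can be
split into their ``key=value`` fields. This is a minimal sketch that assumes
the line layout shown in the sample trace above; ``parse_amd_pstate_perf`` is a
hypothetical name, and the field set may differ between kernel versions.

.. code-block:: python

   import re

   _FIELD = re.compile(r"(\w+)=(\S+)")


   def parse_amd_pstate_perf(line):
       """Return the fields of one amd_pstate_perf trace line as a dict.

       Returns None for lines that belong to other events or to the header.
       """
       _, marker, payload = line.partition("amd_pstate_perf:")
       if not marker:
           return None
       fields = {}
       for key, raw in _FIELD.findall(payload):
           if raw in ("true", "false"):
               fields[key] = raw == "true"
           else:
               try:
                   fields[key] = int(raw)
               except ValueError:
                   fields[key] = raw  # keep unexpected values as text
       return fields

A typical use is to filter for entries where ``changed`` is true in order to
see only the updates that actually altered the requested performance.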
Reference
===========
.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
https://www.amd.com/system/files/TechDocs/24593.pdf
.. [2] Advanced Configuration and Power Interface Specification,
https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf
.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors,
https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip

View File

@ -11,6 +11,7 @@ Working-State Power Management
intel_idle
cpufreq
intel_pstate
amd-pstate
cpufreq_drivers
intel_epb
intel-speed-select

View File

@ -905,6 +905,17 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.
perf_user_access (arm64 only)
=================================
Controls user space access for reading perf event counters. When set to 1,
user space can read performance monitor counter registers directly.
The default value is 0 (access disabled).
See Documentation/arm64/perf.rst for more information.
pid_max
=======

View File

@ -948,7 +948,7 @@ how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 1000, or 10% of memory.
node/system. The maximum value is 3000, or 30% of memory.
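As a worked example of the unit, the sketch below turns a scale factor into a
distance in bytes. It is a simplification: ``watermark_gap_bytes`` is a
hypothetical helper that only applies the "fractions of 10,000" arithmetic
described here, and it ignores any per-zone accounting or lower bounds the
kernel may apply.

.. code-block:: python

   def watermark_gap_bytes(available_bytes, watermark_scale_factor=10):
       """Distance between watermarks for a given amount of available memory.

       The factor is expressed in fractions of 10,000, so the default of 10
       corresponds to 0.1% of available memory.
       """
       return available_bytes * watermark_scale_factor // 10_000


   GIB = 1 << 30
   print(watermark_gap_bytes(16 * GIB))  # 0.1% of 16 GiB, in bytes
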
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate

85
Documentation/arc/arc.rst Normal file
View File

@ -0,0 +1,85 @@
.. SPDX-License-Identifier: GPL-2.0
Linux kernel for ARC processors
*******************************
Other sources of information
############################
Below are some resources where more information can be found on
ARC processors and relevant open source projects.
- `<https://embarc.org>`_ - Community portal for open source on ARC.
Good place to start to find relevant FOSS projects, toolchain releases,
news items and more.
- `<https://github.com/foss-for-synopsys-dwc-arc-processors>`_ -
Home for all development activities regarding open source projects for
ARC processors. Some of the projects are forks of various upstream projects,
where "work in progress" is hosted prior to submission to upstream projects.
Other projects are developed by Synopsys and made available to community
as open source for use on ARC Processors.
- `Official Synopsys ARC Processors website
<https://www.synopsys.com/designware-ip/processor-solutions.html>`_ -
location, with access to some IP documentation (`Programmer's Reference
Manual, AKA PRM for ARC HS processors
<https://www.synopsys.com/dw/doc.php/ds/cc/programmers-reference-manual-ARC-HS.pdf>`_)
and free versions of some commercial tools (`Free nSIM
<https://www.synopsys.com/cgi-bin/dwarcnsim/req1.cgi>`_ and
`MetaWare Light Edition <https://www.synopsys.com/cgi-bin/arcmwtk_lite/reg1.cgi>`_).
Please note though, registration is required to access both the documentation and
the tools.
Important note on ARC processors configurability
################################################
ARC processors are highly configurable and several configurable options
are supported in Linux. Some options are transparent to software
(i.e. cache geometries), some can be detected at runtime and configured
and used accordingly, while some need to be explicitly selected or configured
in the kernel's configuration utility (AKA "make menuconfig").
However, not all configurable options are supported when an ARC processor
is to run Linux. SoC design teams should refer to "Appendix E:
Configuration for ARC Linux" in the ARC HS Databook for configurability
guidelines.
Following these guidelines and selecting valid configuration options
up front is critical to help prevent any unwanted issues during
SoC bringup and software development in general.
Building the Linux kernel for ARC processors
############################################
The process of kernel building for ARC processors is the same as for any other
architecture and can be done in two ways:
- Cross-compilation: process of compiling for ARC targets on a development
host with a different processor architecture (generally x86_64/amd64).
- Native compilation: process of compiling for ARC on an ARC platform
(hardware board or a simulator like QEMU) with complete development environment
(GNU toolchain, dtc, make etc) installed on the platform.
In both cases, an up-to-date GNU toolchain for ARC for the host is needed.
Synopsys offers prebuilt toolchain releases which can be used for this purpose,
available from:
- Synopsys GNU toolchain releases:
`<https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/releases>`_
- Linux kernel compilers collection:
`<https://mirrors.edge.kernel.org/pub/tools/crosstool>`_
- Bootlin's toolchain collection: `<https://toolchains.bootlin.com>`_
Once the toolchain is installed in the system, make sure its "bin" folder
is added to your ``PATH`` environment variable. Then set ``ARCH=arc`` &
``CROSS_COMPILE=arc-linux`` (or whatever matches the installed ARC toolchain
prefix) and then, as usual, run ``make defconfig && make``.
This will produce a "vmlinux" file in the root of the kernel source tree
usable for loading on the target system via JTAG.
If you need to get an image usable with the U-Boot bootloader,
type ``make uImage`` and ``uImage`` will be produced in ``arch/arc/boot``
folder.

View File

@ -0,0 +1,3 @@
.. SPDX-License-Identifier: GPL-2.0
.. kernel-feat:: $srctree/Documentation/features arc

View File

@ -0,0 +1,17 @@
===================
ARC architecture
===================
.. toctree::
:maxdepth: 1
arc
features
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@ -9,6 +9,7 @@ implementation.
.. toctree::
:maxdepth: 2
arc/index
arm/index
arm64/index
ia64/index

View File

@ -266,10 +266,12 @@ Avanta family
-------------
Flavors:
- 88F6500
- 88F6510
- 88F6530P
- 88F6550
- 88F6560
- 88F6601
Homepage:
https://web.archive.org/web/20181005145041/http://www.marvell.com/broadband/

View File

@ -275,6 +275,23 @@ infrastructure:
| SVEVer | [3-0] | y |
+------------------------------+---------+---------+
8) ID_AA64MMFR1_EL1 - Memory model feature register 1
+------------------------------+---------+---------+
| Name | bits | visible |
+------------------------------+---------+---------+
| AFP | [47-44] | y |
+------------------------------+---------+---------+
9) ID_AA64ISAR2_EL1 - Instruction set attribute register 2
+------------------------------+---------+---------+
| Name | bits | visible |
+------------------------------+---------+---------+
| RPRES | [7-4] | y |
+------------------------------+---------+---------+
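Given a raw 64-bit value of one of these registers, a field can be pulled out
with a shift and a mask over the bit range listed in the table. The sketch
below only illustrates that arithmetic; ``extract_field`` is a hypothetical
helper, and how the register value is obtained is outside its scope.

.. code-block:: python

   def extract_field(reg_value, high_bit, low_bit):
       """Return bits [high_bit:low_bit] (inclusive) of reg_value."""
       width = high_bit - low_bit + 1
       return (reg_value >> low_bit) & ((1 << width) - 1)


   # Bit ranges from the tables above.
   AFP_BITS = (47, 44)    # ID_AA64MMFR1_EL1.AFP
   RPRES_BITS = (7, 4)    # ID_AA64ISAR2_EL1.RPRES

   example_isar2 = 0x10   # made-up value with RPRES == 0b0001
   print(extract_field(example_isar2, *RPRES_BITS))  # prints 1
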
Appendix I: Example
-------------------

View File

@ -251,6 +251,14 @@ HWCAP2_ECV
Functionality implied by ID_AA64MMFR0_EL1.ECV == 0b0001.
HWCAP2_AFP
Functionality implied by ID_AA64MMFR1_EL1.AFP == 0b0001.
HWCAP2_RPRES
Functionality implied by ID_AA64ISAR2_EL1.RPRES == 0b0001.
4. Unused AT_HWCAP bits
-----------------------

View File

@ -2,7 +2,10 @@
.. _perf_index:
=====================
====
Perf
====
Perf Event Attributes
=====================
@ -88,3 +91,76 @@ exclude_host. However when using !exclude_hv there is a small blackout
window at the guest entry/exit where host events are not captured.
On VHE systems there are no blackout windows.
Perf Userspace PMU Hardware Counter Access
==========================================
Overview
--------
The perf userspace tool relies on the PMU to monitor events. It offers an
abstraction layer over the hardware counters since the underlying
implementation is cpu-dependent.
Arm64 allows userspace tools to have access to the registers storing the
hardware counters' values directly.
This targets specifically self-monitoring tasks in order to reduce the overhead
by directly accessing the registers without having to go through the kernel.
How-to
------
The focus is set on the armv8 PMUv3 which makes sure that the access to the pmu
registers is enabled and that the userspace has access to the relevant
information in order to use them.
In order to have access to the hardware counters, the global sysctl
kernel/perf_user_access must first be enabled:
.. code-block:: sh
echo 1 > /proc/sys/kernel/perf_user_access
It is necessary to open the event using the perf tool interface with config1:1
attr bit set: the sys_perf_event_open syscall returns a fd which can
subsequently be used with the mmap syscall in order to retrieve a page of memory
containing information about the event. The PMU driver uses this page to expose
to the user the hardware counter's index and other necessary data. Using this
index enables the user to access the PMU registers using the `mrs` instruction.
Access to the PMU registers is only valid while the sequence lock is unchanged.
In particular, the PMSELR_EL0 register is zeroed each time the sequence lock is
changed.
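The sequence-lock rule amounts to a read, re-check and retry loop. The sketch
below shows only that control flow in Python, with the lock and the counter
data supplied as callables; ``read_consistent`` and its parameters are
hypothetical names. Real code reads the lock from the mmap'ed page and needs
compiler or memory barriers around the reads, which this sketch does not
model.

.. code-block:: python

   def read_consistent(read_lock, read_fields, max_retries=100):
       """Return read_fields() from a window in which the lock did not change.

       read_lock   -- callable returning the current sequence lock value
       read_fields -- callable returning the data protected by the lock
       """
       for _ in range(max_retries):
           seq = read_lock()
           fields = read_fields()
           if read_lock() == seq:
               return fields      # lock unchanged: the snapshot is valid
           # Lock changed while reading: discard the snapshot and retry.
       raise RuntimeError("sequence lock kept changing")

Bounding the number of retries is a choice made for this sketch so that a
misbehaving source cannot spin forever; an unbounded loop is equally possible.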
The userspace access is supported in libperf using the perf_evsel__mmap()
and perf_evsel__read() functions. See `tools/lib/perf/tests/test-evsel.c`_ for
an example.
About heterogeneous systems
---------------------------
On heterogeneous systems such as big.LITTLE, userspace PMU counter access can
only be enabled when the tasks are pinned to a homogeneous subset of cores and
the corresponding PMU instance is opened by specifying the 'type' attribute.
The use of generic event types is not supported in this case.
Have a look at `tools/perf/arch/arm64/tests/user-events.c`_ for an example. It
can be run using the perf tool to check that the access to the registers works
correctly from userspace:
.. code-block:: sh
perf test -v user
About chained events and counter sizes
--------------------------------------
The user can request either a 32-bit (config1:0 == 0) or 64-bit (config1:0 == 1)
counter along with userspace access. The sys_perf_event_open syscall will fail
if a 64-bit counter is requested and the hardware doesn't support 64-bit
counters. Chained events are not supported in conjunction with userspace counter
access. If a 32-bit counter is requested on hardware with 64-bit counters, then
userspace must treat the upper 32-bits read from the counter as UNKNOWN. The
'pmc_width' field in the user page will indicate the valid width of the counter
and should be used to mask the upper bits as needed.
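Masking with ``pmc_width`` is a one-line operation. The following sketch shows
it on plain integers; ``mask_counter`` is a hypothetical helper, and reading
the raw value and ``pmc_width`` from the user page is assumed to have happened
already.

.. code-block:: python

   def mask_counter(raw_value, pmc_width):
       """Keep only the low pmc_width bits of a raw counter read."""
       if not 1 <= pmc_width <= 64:
           raise ValueError("pmc_width must be between 1 and 64")
       return raw_value & ((1 << pmc_width) - 1)


   # A 32-bit counter read on 64-bit hardware: the upper half is UNKNOWN.
   raw = 0xDEADBEEF00000010
   print(hex(mask_counter(raw, 32)))  # prints 0x10
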
.. Links
.. _tools/perf/arch/arm64/tests/user-events.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
.. _tools/lib/perf/tests/test-evsel.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/tests/test-evsel.c

View File

@ -52,6 +52,12 @@ stable kernels.
| Allwinner | A64/R18 | UNKNOWN1 | SUN50I_ERRATUM_UNKNOWN1 |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #2064142 | ARM64_ERRATUM_2064142 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #2038923 | ARM64_ERRATUM_2038923 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #1902691 | ARM64_ERRATUM_1902691 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A53 | #826319 | ARM64_ERRATUM_826319 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A53 | #827319 | ARM64_ERRATUM_827319 |
@ -92,12 +98,20 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A77 | #1508412 | ARM64_ERRATUM_1508412 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #2051678 | ARM64_ERRATUM_2051678 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #2077057 | ARM64_ERRATUM_2077057 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2119858 | ARM64_ERRATUM_2119858 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2054223 | ARM64_ERRATUM_2054223 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2224489 | ARM64_ERRATUM_2224489 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2119858 | ARM64_ERRATUM_2119858 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2224489 | ARM64_ERRATUM_2224489 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1349291 | N/A |

View File

@ -255,7 +255,7 @@ prctl(PR_SVE_GET_VL)
vector length change (which would only normally be the case between a
fork() or vfork() and the corresponding execve() in typical use).
To extract the vector length from the result, and it with
To extract the vector length from the result, bitwise and it with
PR_SVE_VL_LEN_MASK.
Return value: a nonnegative value on success, or a negative value on error:

View File

@ -49,7 +49,7 @@ how the user addresses are used by the kernel:
- ``brk()``, ``mmap()`` and the ``new_address`` argument to
``mremap()`` as these have the potential to alias with existing
user addresses.
user addresses.
NOTE: This behaviour changed in v5.6 and so some earlier kernels may
incorrectly accept valid tagged pointers for the ``brk()``,

View File

@ -20,7 +20,6 @@ Block
kyber-iosched
null_blk
pr
queue-sysfs
request
stat
switching-sched

View File

@ -1,321 +0,0 @@
=================
Queue sysfs files
=================
This text file will detail the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions as a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.
add_random (RW)
---------------
This file allows turning off the disk entropy contribution. The default
value of this file is '1' (on).
chunk_sectors (RO)
------------------
This has different meaning depending on the type of the block device.
For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
of the RAID volume stripe segment. For a zoned block device, either host-aware
or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
of the device, with the eventual exception of the last zone of the device which
may be smaller.
dax (RO)
--------
This file indicates whether the device supports Direct Access (DAX),
used by CPU-addressable storage to bypass the pagecache. It shows '1'
if true, '0' if not.
discard_granularity (RO)
------------------------
This shows the size of internal allocation of the device in bytes, if
reported by the device. A value of '0' means the device does not support
the discard functionality.
discard_max_hw_bytes (RO)
-------------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The `discard_max_hw_bytes` parameter is set by the device driver to the
maximum number of bytes that can be discarded in a single operation.
Discard requests issued to the device must not exceed this limit.
A `discard_max_hw_bytes` value of 0 means that the device does not support
discard functionality.
discard_max_bytes (RW)
----------------------
While discard_max_hw_bytes is the hardware limit for the device, this
setting is the software limit. Some devices exhibit large latencies when
large discards are issued; setting this value lower will make Linux issue
smaller discards and potentially help reduce latencies induced by large
discard operations.
discard_zeroes_data (RO)
------------------------
Obsolete. Always zero.
fua (RO)
--------
Whether or not the block driver supports the FUA flag for write requests.
FUA stands for Force Unit Access. If the FUA flag is set that means that
write requests must bypass the volatile cache of the storage device.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
io_poll (RW)
------------
When read, this file shows whether polling is enabled (1) or disabled
(0). Writing '0' to this file will disable polling for this device.
Writing any non-zero value will enable this feature.
io_poll_delay (RW)
------------------
If polling is enabled, this controls what kind of polling will be
performed. It defaults to -1, which is classic polling. In this mode,
the CPU will repeatedly ask for completions without giving up any time.
If set to 0, a hybrid polling mode is used, where the kernel will attempt
to make an educated guess at when the IO will complete. Based on this
guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
IO to sleep for this amount of microseconds before entering classic
polling.
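The three cases can be summarized in code. This is an illustrative sketch of
the semantics described above, with a hypothetical function name; it does not
read or write the sysfs file itself.

.. code-block:: python

   def describe_io_poll_delay(value):
       """Map an io_poll_delay setting to the polling mode it selects."""
       if value == -1:
           return "classic polling"
       if value == 0:
           return "hybrid polling"
       if value > 0:
           return f"sleep {value} us, then classic polling"
       raise ValueError("io_poll_delay must be -1, 0 or a positive number")


   for setting in (-1, 0, 50):
       print(setting, "->", describe_io_poll_delay(setting))
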
io_timeout (RW)
---------------
io_timeout is the request timeout in milliseconds. If a request does not
complete in this time then the block driver timeout handler is invoked.
That timeout handler can decide to retry the request, to fail it or to start
a device recovery strategy.
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
disk.
logical_block_size (RO)
-----------------------
This is the logical block size of the device, in bytes.
max_discard_segments (RO)
-------------------------
The maximum number of DMA scatter/gather entries in a discard request.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_integrity_segments (RO)
---------------------------
Maximum number of elements in a DMA scatter/gather list with integrity
data that will be submitted by the block layer core to the associated
block driver.
max_active_zones (RO)
---------------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), the sum of zones belonging to any of the zone states:
EXPLICIT OPEN, IMPLICIT OPEN or CLOSED, is limited by this value.
If this value is 0, there is no limit.
If the host attempts to exceed this limit, the driver should report this error
with BLK_STS_ZONE_ACTIVE_RESOURCE, which user space may see as the EOVERFLOW
errno.
max_open_zones (RO)
-------------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), the sum of zones belonging to any of the zone states:
EXPLICIT OPEN or IMPLICIT OPEN, is limited by this value.
If this value is 0, there is no limit.
If the host attempts to exceed this limit, the driver should report this error
with BLK_STS_ZONE_OPEN_RESOURCE, which user space may see as the ETOOMANYREFS
errno.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
max_segments (RO)
-----------------
Maximum number of elements in a DMA scatter/gather list that is submitted
to the associated block driver.
max_segment_size (RO)
---------------------
Maximum size in bytes of a single element in a DMA scatter/gather list.
minimum_io_size (RO)
--------------------
This is the smallest preferred IO size reported by the device.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
To avoid priority inversion through request starvation, a request
queue maintains a separate request pool per cgroup when
CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
per-block-cgroup request pool. IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.
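As a rough worked example of the arithmetic above: this sketch reads the text as "up to nr_requests reads plus nr_requests writes per pool, and one pool per block cgroup". That reading, and the function name, are assumptions made for illustration; the result is an upper bound, not a measured value.

```python
def max_allocated_requests(nr_requests: int, block_cgroups: int = 1) -> int:
    """Upper bound on allocated requests for one request queue."""
    # Reads and writes are limited separately, so a pool may hold twice nr_requests.
    per_pool = 2 * nr_requests
    # With CONFIG_BLK_CGROUP, each block cgroup gets its own pool.
    return per_pool * max(block_cgroups, 1)
```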
nr_zones (RO)
-------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), this indicates the total number of zones of the device.
This is always 0 for regular block devices.
optimal_io_size (RO)
--------------------
This is the optimal IO size reported by the device.
physical_block_size (RO)
------------------------
This is the physical block size of the device, in bytes.
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rotational (RW)
---------------
This file is used to state whether the device is of rotational or
non-rotational type.
rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
CPU "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion
processing, setting this option to '2' forces the completion to run on the
requesting CPU (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
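A minimal sketch of how a tool might parse the bracketed format described above. The function name is hypothetical, and the scheduler names in the usage note are sample input, not a statement about which schedulers a given kernel provides.

```python
def parse_scheduler(text: str):
    """Split the contents of the scheduler file into (active, available)."""
    active = None
    available = []
    for name in text.split():
        # The active scheduler is the one enclosed in [] brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available
```

For the sample string ``"mq-deadline [kyber] none"``, this returns ``kyber`` as the active scheduler together with all three names as available.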
write_cache (RW)
----------------
When read, this file will display whether the device has write back
caching enabled or not. It will return "write back" for the former
case, and "write through" for the latter. Writing to this file can
change the kernel's view of the device, but it doesn't alter the
device state. This means that it might not be safe to toggle the
setting from "write back" to "write through", since that will also
eliminate cache flushes issued by the kernel.
write_same_max_bytes (RO)
-------------------------
This is the number of bytes the device can write in a single write-same
command. A value of '0' means write-same is not supported by this
device.
wbt_lat_usec (RW)
-----------------
If the device is registered for writeback throttling, then this file shows
the target minimum read latency. If this latency is exceeded in a given
window of time (see wb_window_usec), then the writeback throttling will start
scaling back writes. Writing a value of '0' to this file disables the
feature. Writing a value of '-1' to this file resets the value to the
default setting.
throttle_sample_time (RW)
-------------------------
This is the time window over which blk-throttle samples data, in milliseconds.
blk-throttle makes decisions based on these samples. A shorter window gives
cgroups smoother throughput at the cost of higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
write_zeroes_max_bytes (RO)
---------------------------
For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
is not supported.
zone_append_max_bytes (RO)
--------------------------
This is the maximum number of bytes that can be written to a sequential
zone of a zoned block device using a zone append write operation
(REQ_OP_ZONE_APPEND). This value is always 0 for regular block devices.
zoned (RO)
----------
This indicates if the device is a zoned block device and the zone model of the
device if it is indeed zoned. The possible values indicated by zoned are
"none" for regular block devices and "host-aware" or "host-managed" for zoned
block devices. The characteristics of host-aware and host-managed zoned block
devices are described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards also define the
"drive-managed" zone model. However, since drive-managed zoned block devices
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".
zone_write_granularity (RO)
---------------------------
This indicates the alignment constraint, in bytes, for write operations in
sequential zones of zoned block devices (devices with a zoned attribute
that reports "host-managed" or "host-aware"). This value is always 0 for
regular block devices.
independent_access_ranges (RO)
------------------------------
The presence of this sub-directory of the /sys/block/xxx/queue/ directory
indicates that the device is capable of executing requests targeting
different sector ranges in parallel. For instance, single LUN multi-actuator
hard-disks will have an independent_access_ranges directory if the device
correctly advertises the sector ranges of its actuators.
The independent_access_ranges directory contains one directory per access
range, with each range described using the sector (RO) attribute file to
indicate the first sector of the range and the nr_sectors (RO) attribute file
to indicate the total number of sectors in the range starting from the first
sector of the range. For example, a dual-actuator hard-disk will have the
following independent_access_ranges entries.::
    $ tree /sys/block/<device>/queue/independent_access_ranges/
    /sys/block/<device>/queue/independent_access_ranges/
    |-- 0
    |   |-- nr_sectors
    |   `-- sector
    `-- 1
        |-- nr_sectors
        `-- sector
The sector and nr_sectors attributes use 512-byte sector units, regardless of
the actual block size of the device. Independent access ranges do not
overlap and include all sectors within the device capacity. The access
ranges are numbered in increasing order of the range start sector,
that is, the sector attribute of range 0 always has the value 0.
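The properties stated above (range 0 starts at sector 0, ranges are ordered by start sector, do not overlap, and together cover the device capacity) can be checked with a short sketch. The function name and the tuple representation are assumptions for illustration; a real tool would first read the sector and nr_sectors files of each range directory.

```python
def check_access_ranges(ranges, capacity_sectors):
    """Validate (sector, nr_sectors) pairs, listed in range-number order.

    All values are in 512-byte sectors, independent of the device block size.
    """
    expected_start = 0  # range 0 must begin at sector 0
    for sector, nr_sectors in ranges:
        # A gap or an overlap shows up as an unexpected start sector.
        if sector != expected_start:
            return False
        expected_start = sector + nr_sectors
    # The ranges together must cover the whole device capacity.
    return expected_start == capacity_sectors
```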
Jens Axboe <jens.axboe@oracle.com>, February 2009


@ -3,7 +3,7 @@ BPF Type Format (BTF)
=====================
1. Introduction
***************
===============
BTF (BPF Type Format) is the metadata format which encodes the debug info
related to BPF program/map. The name BTF was used initially to describe data
@ -30,7 +30,7 @@ sections are discussed in details in :ref:`BTF_Type_String`.
.. _BTF_Type_String:
2. BTF Type and String Encoding
*******************************
===============================
The file ``include/uapi/linux/btf.h`` provides high-level definition of how
types/strings are encoded.
@ -57,13 +57,13 @@ little-endian target. The ``btf_header`` is designed to be extensible with
generated.
2.1 String Encoding
===================
-------------------
The first string in the string section must be a null string. The rest of
the string table is a concatenation of other null-terminated strings.
2.2 Type Encoding
=================
-----------------
The type id ``0`` is reserved for ``void`` type. The type section is parsed
sequentially and type id is assigned to each recognized type starting from id
@ -86,6 +86,7 @@ sequentially and type id is assigned to each recognized type starting from id
#define BTF_KIND_DATASEC 15 /* Section */
#define BTF_KIND_FLOAT 16 /* Floating point */
#define BTF_KIND_DECL_TAG 17 /* Decl Tag */
#define BTF_KIND_TYPE_TAG 18 /* Type Tag */
Note that the type section encodes debug info, not just pure types.
``BTF_KIND_FUNC`` is not a type, and it represents a defined subprogram.
@ -107,7 +108,7 @@ Each type contains the following common data::
* "size" tells the size of the type it is describing.
*
* "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
* FUNC, FUNC_PROTO and DECL_TAG.
* FUNC, FUNC_PROTO, DECL_TAG and TYPE_TAG.
* "type" is a type_id referring to another type.
*/
union {
@ -492,8 +493,18 @@ the attribute is applied to a ``struct``/``union`` member or
a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
valid index (starting from 0) pointing to a member or an argument.
2.2.17 BTF_KIND_TYPE_TAG
~~~~~~~~~~~~~~~~~~~~~~~~
``struct btf_type`` encoding requirement:
* ``name_off``: offset to a non-empty string
* ``info.kind_flag``: 0
* ``info.kind``: BTF_KIND_TYPE_TAG
* ``info.vlen``: 0
* ``type``: the type with ``btf_type_tag`` attribute
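To make the ``info`` requirements concrete, the sketch below packs and unpacks the ``info`` word. It assumes the bit layout documented in ``include/uapi/linux/btf.h`` (vlen in bits 0-15, kind in bits 24-28, kind_flag in bit 31); the Python helper names are hypothetical and only mirror the header's ``BTF_INFO_*`` accessors.

```python
BTF_KIND_TYPE_TAG = 18  # value from the BTF_KIND_* list in this document

def btf_info(kind: int, vlen: int = 0, kind_flag: int = 0) -> int:
    """Pack kind_flag, kind and vlen into the 32-bit btf_type.info word."""
    return ((kind_flag & 0x1) << 31) | ((kind & 0x1f) << 24) | (vlen & 0xffff)

def btf_info_kind(info: int) -> int:
    return (info >> 24) & 0x1f

def btf_info_vlen(info: int) -> int:
    return info & 0xffff

def btf_info_kflag(info: int) -> int:
    return (info >> 31) & 0x1

# A TYPE_TAG entry requires kind_flag == 0 and vlen == 0.
type_tag_info = btf_info(BTF_KIND_TYPE_TAG)
```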
3. BTF Kernel API
*****************
=================
The following bpf syscall command involves BTF:
* BPF_BTF_LOAD: load a blob of BTF data into kernel
@ -536,14 +547,14 @@ The workflow typically looks like:
3.1 BPF_BTF_LOAD
================
----------------
Load a blob of BTF data into the kernel. A blob of data, described in
:ref:`BTF_Type_String`, can be directly loaded into the kernel. A ``btf_fd``
is returned to userspace.
3.2 BPF_MAP_CREATE
==================
------------------
A map can be created with ``btf_fd`` and specified key/value type id.::
@ -570,7 +581,7 @@ automatically.
.. _BPF_Prog_Load:
3.3 BPF_PROG_LOAD
=================
-----------------
During prog_load, func_info and line_info can be passed to kernel with proper
values for the following attributes:
@ -620,7 +631,7 @@ For line_info, the line number and column number are defined as below:
#define BPF_LINE_INFO_LINE_COL(line_col) ((line_col) & 0x3ff)
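A small sketch of how the packed ``line_col`` value splits into a line and a column. The column mask is the macro shown above; the line-number shift of 10 bits is assumed from the companion ``BPF_LINE_INFO_LINE_NUM`` macro, and the Python function name is hypothetical.

```python
def decode_line_col(line_col: int):
    """Split a packed line_col value into (line number, column number)."""
    line = line_col >> 10    # upper bits hold the line number (assumed shift)
    col = line_col & 0x3ff   # low 10 bits hold the column number
    return line, col
```

With this layout the value 8206 decodes to line 8, column 14, which matches the ``.long 8206 # Line 8 Col 14`` entry in the assembly listing later in this file.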
3.4 BPF_{PROG,MAP}_GET_NEXT_ID
==============================
------------------------------
In kernel, every loaded program, map or btf has a unique id. The id won't
change during the lifetime of a program, map, or btf.
@ -630,13 +641,13 @@ each command, to user space, for bpf program or maps, respectively, so an
inspection tool can inspect all programs and maps.
3.5 BPF_{PROG,MAP}_GET_FD_BY_ID
===============================
-------------------------------
An introspection tool cannot use id to get details about program or maps.
A file descriptor needs to be obtained first for reference-counting purpose.
3.6 BPF_OBJ_GET_INFO_BY_FD
==========================
--------------------------
Once a program/map fd is acquired, an introspection tool can get the detailed
information from kernel about this fd, some of which are BTF-related. For
@ -645,7 +656,7 @@ example, ``bpf_map_info`` returns ``btf_id`` and key/value type ids.
bpf byte codes, and jited_line_info.
3.7 BPF_BTF_GET_FD_BY_ID
========================
------------------------
With ``btf_id`` obtained in ``bpf_map_info`` and ``bpf_prog_info``, bpf
syscall command BPF_BTF_GET_FD_BY_ID can retrieve a btf fd. Then, with
@ -657,10 +668,10 @@ tool has full btf knowledge and is able to pretty print map key/values, dump
func signatures and line info, along with byte/jit codes.
4. ELF File Format Interface
****************************
============================
4.1 .BTF section
================
----------------
The .BTF section contains type and string data. The format of this section is
the same as the one described in :ref:`BTF_Type_String`.
@ -668,7 +679,7 @@ same as the one describe in :ref:`BTF_Type_String`.
.. _BTF_Ext_Section:
4.2 .BTF.ext section
====================
--------------------
The .BTF.ext section encodes func_info and line_info, which need loader
manipulation before loading into the kernel.
@ -732,7 +743,7 @@ bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the
beginning of section (``btf_ext_info_sec->sec_name_off``).
4.3 .BTF_ids section
====================
--------------------
The .BTF_ids section encodes BTF ID values that are used within the kernel.
@ -793,10 +804,10 @@ All the BTF ID lists and sets are compiled in the .BTF_ids section and
resolved during the linking phase of kernel build by ``resolve_btfids`` tool.
5. Using BTF
************
============
5.1 bpftool map pretty print
============================
----------------------------
With BTF, the map key/value can be printed based on fields rather than simply
raw bytes. This is especially valuable for large structure or if your data
@ -838,7 +849,7 @@ bpftool is able to pretty print like below:
]
5.2 bpftool prog dump
=====================
---------------------
The following is an example showing how func_info and line_info can help prog
dump with better kernel symbol names, function prototypes and line
@ -872,7 +883,7 @@ information.::
[...]
5.3 Verifier Log
================
----------------
The following is an example of how line_info can help debugging verification
failure.::
@ -898,7 +909,7 @@ failure.::
R2 offset is outside of the packet
6. BTF Generation
*****************
=================
You need latest pahole
@ -1005,6 +1016,6 @@ format.::
.long 8206 # Line 8 Col 14
7. Testing
**********
==========
Kernel bpf selftest `test_btf.c` provides an extensive set of BTF-related tests.


@ -0,0 +1,376 @@
===================
Classic BPF vs eBPF
===================
eBPF is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.
Some core changes of the eBPF format from classic BPF:
- Number of registers increases from 2 to 10:
The old format had two registers A and X, and a hidden frame pointer. The
new layout extends this to be 10 internal registers and a read-only frame
pointer. Since 64-bit CPUs are passing arguments to functions via registers
the number of args from eBPF program to in-kernel function is restricted
to 5 and one register is used to accept return value from an in-kernel
function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
etc, and eBPF calling convention maps directly to ABIs used by the kernel on
64-bit architectures.
On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
and may let more complex programs be interpreted.
R0 - R5 are scratch registers and eBPF program needs to spill/fill them if
necessary across calls. Note that there is only one eBPF program (== one
eBPF main routine) and it cannot call other eBPF functions; it can only
call predefined in-kernel functions.
- Register width increases from 32-bit to 64-bit:
Still, the semantics of the original 32-bit ALU operations are preserved
via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
subregisters that zero-extend into 64-bit if they are being written to.
That behavior maps directly to x86_64 and arm64 subregister definition, but
makes other JITs more difficult.
32-bit architectures run 64-bit eBPF programs via interpreter.
Their JITs may convert BPF programs that only use 32-bit subregisters into
native instruction set and let the rest be interpreted.
Operation is 64-bit because on 64-bit architectures pointers are also
64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
32-bit eBPF registers would otherwise require defining a register-pair
ABI, so a direct eBPF register to HW register mapping would not be possible
and the JIT would need to do combine/split/move operations for every
register in and out of the function, which is complex, bug prone and slow.
Another reason is the use of atomic 64-bit counters.
- Conditional jt/jf targets replaced with jt/fall-through:
While the original design has constructs such as ``if (cond) jump_true;
else jump_false;``, they are being replaced with alternative constructs like
``if (cond) jump_true; /* else fall-through */``.
- Introduces bpf_call insn and register passing convention for zero overhead
calls from/to other kernel functions:
Before an in-kernel function call, the eBPF program needs to
place function arguments into R1 to R5 registers to satisfy calling
convention, then the interpreter will take them from registers and pass
to in-kernel function. If R1 - R5 registers are mapped to CPU registers
that are used for argument passing on given architecture, the JIT compiler
doesn't need to emit extra moves. Function arguments will be in the correct
registers and BPF_CALL instruction will be JITed as single 'call' HW
instruction. This calling convention was picked to cover common call
situations without performance penalty.
After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
a return value of the function. Since R6 - R9 are callee saved, their state
is preserved across the call.
For example, consider three C functions::
u64 f1() { return (*_f2)(1); }
u64 f2(u64 a) { return f3(a + 1, a); }
u64 f3(u64 a, u64 b) { return a - b; }
GCC can compile f1, f3 into x86_64::
f1:
movl $1, %edi
movq _f2(%rip), %rax
jmp *%rax
f3:
movq %rdi, %rax
subq %rsi, %rax
ret
Function f2 in eBPF may look like::
f2:
bpf_mov R2, R1
bpf_add R1, 1
bpf_call f3
bpf_exit
If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3 and
returns will be seamless. Without JIT, the __bpf_prog_run() interpreter needs to
be used to call into f2.
For practical reasons all eBPF programs have only one argument 'ctx' which is
already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
are currently not supported, but these restrictions can be lifted if necessary
in the future.
On 64-bit architectures all registers map to HW registers one to one. For
example, x86_64 JIT compiler can map them as ...
::
R0 - rax
R1 - rdi
R2 - rsi
R3 - rdx
R4 - rcx
R5 - r8
R6 - rbx
R7 - r13
R8 - r14
R9 - r15
R10 - rbp
... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
and rbx, r12 - r15 are callee saved.
Then the following eBPF pseudo-program::
bpf_mov R6, R1 /* save ctx */
bpf_mov R2, 2
bpf_mov R3, 3
bpf_mov R4, 4
bpf_mov R5, 5
bpf_call foo
bpf_mov R7, R0 /* save foo() return value */
bpf_mov R1, R6 /* restore ctx for next call */
bpf_mov R2, 6
bpf_mov R3, 7
bpf_mov R4, 8
bpf_mov R5, 9
bpf_call bar
bpf_add R0, R7
bpf_exit
After JIT to x86_64 may look like::
push %rbp
mov %rsp,%rbp
sub $0x228,%rsp
mov %rbx,-0x228(%rbp)
mov %r13,-0x220(%rbp)
mov %rdi,%rbx
mov $0x2,%esi
mov $0x3,%edx
mov $0x4,%ecx
mov $0x5,%r8d
callq foo
mov %rax,%r13
mov %rbx,%rdi
mov $0x6,%esi
mov $0x7,%edx
mov $0x8,%ecx
mov $0x9,%r8d
callq bar
add %r13,%rax
mov -0x228(%rbp),%rbx
mov -0x220(%rbp),%r13
leaveq
retq
Which is in this example equivalent in C to::
u64 bpf_filter(u64 ctx)
{
return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
}
In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
registers and place their return value into ``%rax`` which is R0 in eBPF.
Prologue and epilogue are emitted by JIT and are implicit in the
interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
them across the calls as defined by calling convention.
For example the following program is invalid::
bpf_mov R1, 1
bpf_call foo
bpf_mov R0, R1
bpf_exit
After the call the registers R1-R5 contain junk values and cannot be read.
An in-kernel verifier (see verifier.rst) is used to validate eBPF programs.
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and eBPF are two operand instructions,
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
The input context pointer for invoking the interpreter function is generic,
its content is defined by a specific use case. For seccomp register R1 points
to seccomp_data, for converted BPF filters R1 points to a skb.
A program that is translated internally consists of the following elements::
op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
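Both layouts occupy 8 bytes, which a quick sketch with Python's ``struct`` module can illustrate. The format strings and helper names are assumptions for illustration; they model a little-endian host, where ``dst_reg`` occupies the low nibble and ``src_reg`` the high nibble of the register byte.

```python
import struct

CLASSIC_FMT = "<HBBI"  # op:16, jt:8, jf:8, k:32
EBPF_FMT = "<BBhi"     # op:8, dst_reg:4 + src_reg:4, off:16, imm:32

def pack_classic(op, jt, jf, k):
    """Pack one classic BPF instruction into 8 bytes."""
    return struct.pack(CLASSIC_FMT, op, jt, jf, k)

def pack_ebpf(op, dst_reg, src_reg, off, imm):
    """Pack one eBPF instruction into 8 bytes (little-endian nibble order)."""
    regs = ((src_reg & 0xf) << 4) | (dst_reg & 0xf)
    return struct.pack(EBPF_FMT, op, regs, off, imm)

def unpack_ebpf(raw):
    """Return (op, dst_reg, src_reg, off, imm) from 8 packed bytes."""
    op, regs, off, imm = struct.unpack(EBPF_FMT, raw)
    return op, regs & 0xf, regs >> 4, off, imm
```

For instance, ``pack_ebpf(0xbf, 6, 1, 0, 0)`` would encode a 64-bit register move of R1 into R6, corresponding to the ``bpf_mov R6, R1`` line in the pseudo-program above.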
So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
instructions must be multiple of 8 bytes to preserve backward compatibility.
eBPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to eBPF.
For example, socket filters are not using the ``exclusive add`` instruction, but
tracing filters may use it to maintain counters of events. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.
eBPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In-kernel usage
may not be bounded by security considerations, since generated eBPF code
may be optimizing an internal code path and not be exposed to user space.
Safety of eBPF can come from the verifier (see verifier.rst). In such use cases as
described, it may be used as a safe instruction set.
Just like the original BPF, eBPF runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow
loops and other CFG validation; second step starts from the first insn and
descends all possible paths. It simulates execution of every insn and observes
the state change of registers and stack.
opcode encoding
===============
eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF.
For arithmetic and jump instructions the 8-bit 'code' field is divided into three
parts::
  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)
Three LSB bits store instruction class which is one of:
===================     ===============
Classic BPF classes     eBPF classes
===================     ===============
BPF_LD    0x00          BPF_LD    0x00
BPF_LDX   0x01          BPF_LDX   0x01
BPF_ST    0x02          BPF_ST    0x02
BPF_STX   0x03          BPF_STX   0x03
BPF_ALU   0x04          BPF_ALU   0x04
BPF_JMP   0x05          BPF_JMP   0x05
BPF_RET   0x06          BPF_JMP32 0x06
BPF_MISC  0x07          BPF_ALU64 0x07
===================     ===============
The 4th bit encodes the source operand ...
::
BPF_K 0x00
BPF_X 0x08
* in classic BPF, this means::
BPF_SRC(code) == BPF_X - use register X as source operand
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
* in eBPF, this means::
BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
... and four MSB bits store operation code.
If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
BPF_ADD 0x00
BPF_SUB 0x10
BPF_MUL 0x20
BPF_DIV 0x30
BPF_OR 0x40
BPF_AND 0x50
BPF_LSH 0x60
BPF_RSH 0x70
BPF_NEG 0x80
BPF_MOD 0x90
BPF_XOR 0xa0
BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
BPF_END 0xd0 /* eBPF only: endianness conversion */
If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
BPF_JA 0x00 /* BPF_JMP only */
BPF_JEQ 0x10
BPF_JGT 0x20
BPF_JGE 0x30
BPF_JSET 0x40
BPF_JNE 0x50 /* eBPF only: jump != */
BPF_JSGT 0x60 /* eBPF only: signed '>' */
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
BPF_JSLT 0xc0 /* eBPF only: signed '<' */
BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg
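The difference between the two ALU classes can be sketched by splitting the opcode with the masks implied by the field layout above. The constant values come from the tables in this document; the lowercase helper names are hypothetical stand-ins for the kernel's ``BPF_CLASS``, ``BPF_SRC`` and ``BPF_OP`` macros, and only ``BPF_ADD`` is modeled.

```python
BPF_ALU, BPF_ALU64 = 0x04, 0x07
BPF_K, BPF_X = 0x00, 0x08
BPF_ADD = 0x00

def bpf_class(code): return code & 0x07  # three LSB bits
def bpf_src(code):   return code & 0x08  # 4th bit
def bpf_op(code):    return code & 0xf0  # four MSB bits

def alu_add(code, dst, src):
    """Emulate BPF_ADD for the BPF_ALU and BPF_ALU64 classes."""
    if bpf_op(code) != BPF_ADD:
        raise ValueError("only BPF_ADD is modeled in this sketch")
    if bpf_class(code) == BPF_ALU64:
        # 64-bit operands, result wraps at 64 bits.
        return (dst + src) & 0xffffffffffffffff
    if bpf_class(code) == BPF_ALU:
        # 32-bit operands; the result is zero-extended into the 64-bit register.
        return ((dst & 0xffffffff) + (src & 0xffffffff)) & 0xffffffff
    raise ValueError("not an arithmetic class")
```

Adding 1 to 0xffffffff wraps to 0 with ``BPF_ADD | BPF_X | BPF_ALU`` but yields 0x100000000 with ``BPF_ADD | BPF_X | BPF_ALU64``.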
Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
operands for the comparisons instead.
For load and store instructions the 8-bit 'code' field is divided as::
  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)
Size modifier is one of ...
::
BPF_W 0x00 /* word */
BPF_H 0x08 /* half word */
BPF_B 0x10 /* byte */
BPF_DW 0x18 /* eBPF only, double word */
... which encodes size of load/store operation::
B - 1 byte
H - 2 byte
W - 4 byte
DW - 8 byte (eBPF only)
Mode modifier is one of::
BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
BPF_ABS 0x20
BPF_IND 0x40
BPF_MEM 0x60
BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */
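As an illustration of the load/store layout, the following sketch splits an opcode byte into mode, access size and class. The masks follow from the 3/2/3-bit layout shown above; the dictionary and function names are hypothetical.

```python
# Size modifier value -> access width in bytes
BPF_SIZE_BYTES = {0x00: 4, 0x08: 2, 0x10: 1, 0x18: 8}

# Mode modifier value -> name
BPF_MODES = {
    0x00: "BPF_IMM", 0x20: "BPF_ABS", 0x40: "BPF_IND", 0x60: "BPF_MEM",
    0x80: "BPF_LEN", 0xa0: "BPF_MSH", 0xc0: "BPF_ATOMIC",
}

def decode_ldst(code: int):
    """Return (mode name, size in bytes, instruction class) for a load/store opcode."""
    mode = BPF_MODES.get(code & 0xe0, "reserved")  # three MSB bits
    size = BPF_SIZE_BYTES[code & 0x18]             # two middle bits
    insn_class = code & 0x07                       # three LSB bits
    return mode, size, insn_class
```

For example, 0x61 (``BPF_LDX | BPF_MEM | BPF_W``) decodes to a 4-byte memory load, and 0x7b (``BPF_STX | BPF_MEM | BPF_DW``) to an 8-byte memory store.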

Documentation/bpf/faq.rst Normal file

@ -0,0 +1,11 @@
================================
Frequently asked questions (FAQ)
================================
Two sets of Questions and Answers (Q&A) are maintained.
.. toctree::
:maxdepth: 1
bpf_design_QA
bpf_devel_QA


@ -0,0 +1,7 @@
Helper functions
================
* `bpf-helpers(7)`_ maintains a list of helpers available to eBPF programs.
.. Links
.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html


@ -5,104 +5,33 @@ BPF Documentation
This directory contains documentation for the BPF (Berkeley Packet
Filter) facility, with a focus on the extended BPF version (eBPF).
This kernel side documentation is still work in progress. The main
textual documentation is (for historical reasons) described in
:ref:`networking-filter`, which describe both classical and extended
BPF instruction-set.
This kernel side documentation is still work in progress.
The Cilium project also maintains a `BPF and XDP Reference Guide`_
that goes into great technical depth about the BPF Architecture.
libbpf
======
Documentation/bpf/libbpf/index.rst is a userspace library for loading and interacting with bpf programs.
BPF Type Format (BTF)
=====================
.. toctree::
:maxdepth: 1
instruction-set
verifier
libbpf/index
btf
Frequently asked questions (FAQ)
================================
Two sets of Questions and Answers (Q&A) are maintained.
.. toctree::
:maxdepth: 1
bpf_design_QA
bpf_devel_QA
Syscall API
===========
The primary info for the bpf syscall is available in the `man-pages`_
for `bpf(2)`_. For more information about the userspace API, see
Documentation/userspace-api/ebpf/index.rst.
Helper functions
================
* `bpf-helpers(7)`_ maintains a list of helpers available to eBPF programs.
Program types
=============
.. toctree::
:maxdepth: 1
prog_cgroup_sockopt
prog_cgroup_sysctl
prog_flow_dissector
bpf_lsm
prog_sk_lookup
Map types
=========
.. toctree::
:maxdepth: 1
map_cgroup_storage
Testing and debugging BPF
=========================
.. toctree::
:maxdepth: 1
drgn
s390
Licensing
=========
.. toctree::
:maxdepth: 1
faq
syscall_api
helpers
programs
maps
classic_vs_extended.rst
bpf_licensing
test_debug
other
.. only:: subproject and html
Other
=====
Indices
=======
.. toctree::
:maxdepth: 1
ringbuf
llvm_reloc
* :ref:`genindex`
.. Links:
.. _networking-filter: ../networking/filter.rst
.. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
.. _BPF and XDP Reference Guide: https://docs.cilium.io/en/latest/bpf/


@ -0,0 +1,279 @@
====================
eBPF Instruction Set
====================
Registers and calling convention
================================
eBPF has 10 general purpose registers and a read-only frame pointer register,
all of which are 64 bits wide.
The eBPF calling convention is defined as:
* R0: return value from function calls, and exit value for eBPF programs
* R1 - R5: arguments for function calls
* R6 - R9: callee saved registers that function calls will preserve
* R10: read-only frame pointer to access stack
R0 - R5 are scratch registers and eBPF programs need to spill/fill them if
necessary across calls.
Instruction encoding
====================
eBPF uses 64-bit instructions with the following encoding:
============= ======= =============== ==================== ============
32 bits (MSB) 16 bits 4 bits          4 bits               8 bits (LSB)
============= ======= =============== ==================== ============
immediate     offset  source register destination register opcode
============= ======= =============== ==================== ============
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
Instruction classes
-------------------
The three LSB bits of the 'opcode' field store the instruction class:
========= ===== ===============================
class value description
========= ===== ===============================
BPF_LD 0x00 non-standard load operations
BPF_LDX 0x01 load into register operations
BPF_ST 0x02 store from immediate operations
BPF_STX 0x03 store from register operations
BPF_ALU 0x04 32-bit arithmetic operations
BPF_JMP 0x05 64-bit jump operations
BPF_JMP32 0x06 32-bit jump operations
BPF_ALU64 0x07 64-bit arithmetic operations
========= ===== ===============================
Arithmetic and jump instructions
================================
For arithmetic and jump instructions (BPF_ALU, BPF_ALU64, BPF_JMP and
BPF_JMP32), the 8-bit 'opcode' field is divided into three parts:
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
operation code source instruction class
============== ====== =================
The 4th bit encodes the source operand:
====== ===== ========================================
source value description
====== ===== ========================================
BPF_K 0x00 use 32-bit immediate as source operand
BPF_X 0x08 use 'src_reg' register as source operand
====== ===== ========================================
The four MSB bits store the operation code.
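The three-way split of the opcode byte can be sketched with plain bit masks. The helper name ``split_alu_jmp_opcode`` is hypothetical and only serves this example.

```python
BPF_OP_MASK = 0xf0     # four MSB bits: operation code
BPF_SRC_MASK = 0x08    # 4th bit: BPF_K (0x00) or BPF_X (0x08)
BPF_CLASS_MASK = 0x07  # three LSB bits: instruction class


def split_alu_jmp_opcode(opcode):
    """Return (operation code, source, class) for an ALU or JMP opcode byte."""
    return (opcode & BPF_OP_MASK,
            opcode & BPF_SRC_MASK,
            opcode & BPF_CLASS_MASK)
```

For instance, the byte 0x0f splits into 0x00, 0x08 and 0x07, that is BPF_ADD | BPF_X | BPF_ALU64.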
Arithmetic instructions
-----------------------
BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for
otherwise identical operations.
The code field encodes the operation as below:
======== ===== ==========================
code value description
======== ===== ==========================
BPF_ADD 0x00 dst += src
BPF_SUB 0x10 dst -= src
BPF_MUL 0x20 dst \*= src
BPF_DIV 0x30 dst /= src
BPF_OR 0x40 dst \|= src
BPF_AND 0x50 dst &= src
BPF_LSH 0x60 dst <<= src
BPF_RSH 0x70 dst >>= src
BPF_NEG 0x80 dst = -dst
BPF_MOD 0x90 dst %= src
BPF_XOR 0xa0 dst ^= src
BPF_MOV 0xb0 dst = src
BPF_ARSH 0xc0 sign extending shift right
BPF_END 0xd0 endianness conversion
======== ===== ==========================
BPF_ADD | BPF_X | BPF_ALU means::
dst_reg = (u32) dst_reg + (u32) src_reg
BPF_ADD | BPF_X | BPF_ALU64 means::
dst_reg = dst_reg + src_reg
BPF_XOR | BPF_K | BPF_ALU means::
dst_reg = (u32) dst_reg ^ (u32) imm32
BPF_XOR | BPF_K | BPF_ALU64 means::
dst_reg = dst_reg ^ imm32
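The difference between the two operand widths can be modelled in a few lines of Python. This is a simplified sketch, not the interpreter's code: ``alu_add`` is a made-up helper, registers are treated as unsigned Python integers, and it assumes the 32-bit result is zero-extended into the 64-bit register.

```python
U32 = 0xffffffff
U64 = 0xffffffffffffffff


def alu_add(dst, src, is64):
    """Model BPF_ADD | BPF_X for BPF_ALU64 (is64=True) or BPF_ALU (is64=False)."""
    if is64:
        return (dst + src) & U64
    # BPF_ALU: both operands are truncated to 32 bits before the addition.
    return ((dst & U32) + (src & U32)) & U32
```

With ``dst = 0xffffffff`` and ``src = 1`` the 32-bit form wraps to 0, while the 64-bit form yields 0x100000000.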
Jump instructions
-----------------
BPF_JMP32 uses 32-bit wide operands while BPF_JMP uses 64-bit wide operands for
otherwise identical operations.
The code field encodes the operation as below:
======== ===== ========================= ============
code value description notes
======== ===== ========================= ============
BPF_JA 0x00 PC += off BPF_JMP only
BPF_JEQ 0x10 PC += off if dst == src
BPF_JGT 0x20 PC += off if dst > src unsigned
BPF_JGE 0x30 PC += off if dst >= src unsigned
BPF_JSET 0x40 PC += off if dst & src
BPF_JNE 0x50 PC += off if dst != src
BPF_JSGT 0x60 PC += off if dst > src signed
BPF_JSGE 0x70 PC += off if dst >= src signed
BPF_CALL 0x80 function call
BPF_EXIT 0x90 function / program return BPF_JMP only
BPF_JLT 0xa0 PC += off if dst < src unsigned
BPF_JLE 0xb0 PC += off if dst <= src unsigned
BPF_JSLT 0xc0 PC += off if dst < src signed
BPF_JSLE 0xd0 PC += off if dst <= src signed
======== ===== ========================= ============
The eBPF program needs to store the return value into register R0 before doing a
BPF_EXIT.
Load and store instructions
===========================
For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
8-bit 'opcode' field is divided as:
============ ====== =================
3 bits (MSB) 2 bits 3 bits (LSB)
============ ====== =================
mode size instruction class
============ ====== =================
The size modifier is one of:
============= ===== =====================
size modifier value description
============= ===== =====================
BPF_W 0x00 word (4 bytes)
BPF_H 0x08 half word (2 bytes)
BPF_B 0x10 byte
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================
The mode modifier is one of:
============= ===== ====================================
mode modifier value description
============= ===== ====================================
BPF_IMM 0x00 used for 64-bit mov
BPF_ABS 0x20 legacy BPF packet access
BPF_IND 0x40 legacy BPF packet access
BPF_MEM 0x60 all normal load and store operations
BPF_ATOMIC 0xc0 atomic operations
============= ===== ====================================
BPF_MEM | <size> | BPF_STX means::
*(size *) (dst_reg + off) = src_reg
BPF_MEM | <size> | BPF_ST means::
*(size *) (dst_reg + off) = imm32
BPF_MEM | <size> | BPF_LDX means::
dst_reg = *(size *) (src_reg + off)
Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW.
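A load/store opcode byte can be taken apart the same way as the arithmetic ones. The sketch below uses the values from the size and mode tables; ``split_ldst_opcode`` is a hypothetical helper name.

```python
# Size modifier value -> access width in bytes, taken from the size table.
BPF_SIZE_BYTES = {0x00: 4, 0x08: 2, 0x10: 1, 0x18: 8}


def split_ldst_opcode(opcode):
    """Return (mode, access width in bytes, class) for a load/store opcode byte."""
    mode = opcode & 0xe0  # three MSB bits
    size = opcode & 0x18  # two middle bits
    cls = opcode & 0x07   # three LSB bits
    return mode, BPF_SIZE_BYTES[size], cls
```

For example, 0x7a decodes as BPF_MEM (0x60), 8 bytes, BPF_ST (0x02), which is an 8-byte store of an immediate.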
Atomic operations
-----------------
eBPF includes atomic operations, which use the immediate field for extra
encoding::
.imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
.imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
The basic atomic operations supported are::
BPF_ADD
BPF_AND
BPF_OR
BPF_XOR
Each has semantics equivalent to the ``BPF_ADD`` example, that is: the
memory location addressed by ``dst_reg + off`` is atomically modified, with
``src_reg`` as the other operand. If the ``BPF_FETCH`` flag is set in the
immediate, then these operations also overwrite ``src_reg`` with the
value that was in memory before it was modified.
The more special operations are::
BPF_XCHG
This atomically exchanges ``src_reg`` with the value addressed by ``dst_reg +
off``. ::
BPF_CMPXCHG
This atomically compares the value addressed by ``dst_reg + off`` with
``R0``. If they match it is replaced with ``src_reg``. In either case, the
value that was there before is zero-extended and loaded back to ``R0``.
Note that 1 and 2 byte atomic operations are not supported.
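The register-level effect of these operations can be sketched in Python. This is a single-threaded model that ignores atomicity and operand width entirely; ``mem`` is a plain dict standing in for memory, and both function names are invented for this example.

```python
import operator

U64 = 0xffffffffffffffff


def atomic_rmw(mem, addr, src_reg, op, fetch=False):
    """Model BPF_ADD/AND/OR/XOR on mem[addr]; return the new value of src_reg."""
    old = mem[addr]
    mem[addr] = op(old, src_reg) & U64
    # With BPF_FETCH, src_reg receives the value that was in memory before.
    return old if fetch else src_reg


def atomic_cmpxchg(mem, addr, r0, src_reg):
    """Model BPF_CMPXCHG; return the new value of R0."""
    old = mem[addr]
    if old == r0:
        mem[addr] = src_reg
    # In either case the previous memory value is loaded back into R0.
    return old
```

A fetching add would be written as ``atomic_rmw(mem, addr, src, operator.add, fetch=True)``.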
Clang can generate atomic instructions by default when ``-mcpu=v3`` is
enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
the atomics features, while keeping a lower ``-mcpu`` version, you can use
``-Xclang -target-feature -Xclang +alu32``.
You may encounter ``BPF_XADD`` - this is a legacy name for ``BPF_ATOMIC``,
referring to the exclusive-add operation encoded when the immediate field is
zero.
16-byte instructions
--------------------
eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which consists
of two consecutive ``struct bpf_insn`` 8-byte blocks and is interpreted as a single
instruction that loads a 64-bit immediate value into dst_reg.
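A minimal sketch of how the 64-bit value could be assembled from the two blocks follows. It assumes that the immediate of the first block carries the low 32 bits and the immediate of the second block carries the high 32 bits; the text above does not state this, so treat it as an assumption. ``combine_imm64`` is a hypothetical name.

```python
def combine_imm64(first_imm, second_imm):
    """Join two 32-bit immediates into one 64-bit value.

    Assumption: first_imm is the low half, second_imm is the high half.
    """
    return ((second_imm & 0xffffffff) << 32) | (first_imm & 0xffffffff)
```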
Packet access instructions
--------------------------
eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.
They had to be carried over from classic BPF to keep socket filters performing
well when run in the eBPF interpreter. These instructions can only
be used when the interpreter context is a pointer to ``struct sk_buff`` and
have seven implicit operands. Register R6 is an implicit input that must
contain a pointer to the sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.
These instructions have an implicit program exit condition as well. When an
eBPF program tries to access data beyond the packet boundary,
the interpreter will abort the execution of the program. JIT compilers
therefore must preserve this property. The src_reg and imm32 fields are
explicit inputs to these instructions.
For example, BPF_IND | BPF_W | BPF_LD means::
R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
and R1 - R5 are clobbered.
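The BPF_IND | BPF_W | BPF_LD example can be modelled on a ``bytes`` object standing in for the packet data. This is only a sketch: ``ld_ind_w`` is an invented name, returning ``None`` stands in for the implicit program exit, and negative offsets are simply treated as out of bounds here.

```python
import struct


def ld_ind_w(packet, src_reg, imm32):
    """Return the network-order u32 at packet[src_reg + imm32], or None."""
    start = src_reg + imm32
    if start < 0 or start + 4 > len(packet):
        return None  # models the implicit exit on out-of-bounds access
    # ">I" reads a big-endian u32, which is what ntohl() of the raw load gives.
    return struct.unpack_from(">I", packet, start)[0]
```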

View File

@ -3,8 +3,6 @@
libbpf
======
For API documentation see the `versioned API documentation site <https://libbpf.readthedocs.io/en/latest/api.html>`_.
.. toctree::
:maxdepth: 1
@ -14,6 +12,8 @@ For API documentation see the `versioned API documentation site <https://libbpf.
This is documentation for libbpf, a userspace library for loading and
interacting with bpf programs.
For API documentation see the `versioned API documentation site <https://libbpf.readthedocs.io/en/latest/api.html>`_.
All general BPF questions, including kernel functionality, libbpf APIs and
their application, should be sent to bpf@vger.kernel.org mailing list.
You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the

View File

@ -0,0 +1,52 @@
=========
eBPF maps
=========
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
The maps are accessed from user space via BPF syscall, which has commands:
- create a map with given type and attributes
``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
returns process-local file descriptor or negative error
- lookup key in a given map
``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key, attr->value
returns zero and stores found elem into value or negative error
- create or update key/value pair in a given map
``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key, attr->value
returns zero or negative error
- find and delete element by key in a given map
``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key
- to delete map: close(fd)
Exiting process will delete maps automatically
Userspace programs use this syscall to create/access maps that eBPF programs
are concurrently updating.
Maps can have different types: hash, array, bloom filter, radix-tree, etc.
The map is defined by:
- type
- max number of elements
- key size in bytes
- value size in bytes
Map Types
=========
.. toctree::
:maxdepth: 1
:glob:
map_*

View File

@ -0,0 +1,9 @@
=====
Other
=====
.. toctree::
:maxdepth: 1
ringbuf
llvm_reloc

View File

@ -0,0 +1,9 @@
=============
Program Types
=============
.. toctree::
:maxdepth: 1
:glob:
prog_*

View File

@ -0,0 +1,11 @@
===========
Syscall API
===========
The primary info for the bpf syscall is available in the `man-pages`_
for `bpf(2)`_. For more information about the userspace API, see
Documentation/userspace-api/ebpf/index.rst.
.. Links:
.. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html

View File

@ -0,0 +1,9 @@
=========================
Testing and debugging BPF
=========================
.. toctree::
:maxdepth: 1
drgn
s390

View File

@ -0,0 +1,529 @@
=============
eBPF verifier
=============
The safety of the eBPF program is determined in two steps.
The first step does a DAG check to disallow loops and performs other CFG validation.
In particular it will detect programs that have unreachable instructions
(though the classic BPF checker allows them).
Second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.
At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
If register was never written to, it's not readable::
bpf_mov R0 = R2
bpf_exit
will be rejected, since R2 is unreadable at the start of the program.
After kernel function call, R1-R5 are reset to unreadable and
R0 has a return type of the function.
Since R6-R9 are callee saved, their state is preserved across the call.
::
bpf_mov R6 = 1
bpf_call foo
bpf_mov R0 = R6
bpf_exit
is a correct program. If there was R1 instead of R6, it would have
been rejected.
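The readability rules above can be sketched as a tiny straight-line checker. This is a toy model and not the verifier's algorithm: the tuple-based instruction format, the function name ``check_readable`` and the exact error strings are all made up for illustration, and function arguments are not checked.

```python
def check_readable(program):
    """Return None if the program is accepted, else an error string.

    program is a list of tuples (hypothetical format):
    ("mov", dst, src), ("mov_imm", dst), ("call",) or ("exit",).
    """
    readable = {1, 10}  # R1 holds the context pointer, R10 the frame pointer
    for idx, insn in enumerate(program):
        op = insn[0]
        if op == "mov":
            _, dst, src = insn
            if src not in readable:
                return f"insn {idx}: R{src} !read_ok"
            readable.add(dst)
        elif op == "mov_imm":
            readable.add(insn[1])
        elif op == "call":
            readable -= {1, 2, 3, 4, 5}  # scratch registers become unreadable
            readable.add(0)              # R0 now holds the return value
        elif op == "exit":
            if 0 not in readable:
                return f"insn {idx}: R0 !read_ok"
            return None
    return "program has no exit"
```

Feeding it the two programs from the text gives a rejection for the read of R2 and an acceptance for the R6 variant.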
load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example::
bpf_mov R1 = 1
bpf_mov R2 = 2
bpf_xadd *(u32 *)(R1 + 3) += R2
bpf_exit
will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.
At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``)
A callback is used to customize verifier to restrict eBPF program access to only
certain fields within ctx structure with specified size and alignment.
For example, the following insn::
bpf_ld R0 = *(u32 *)(R6 + 8)
intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.
The verifier will allow eBPF program to read data from stack only after
it wrote into it.
Classic BPF verifier does similar check with M[0-15] memory slots.
For example::
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is an invalid program.
Though R10 is a correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.
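The stack rules (bounds, alignment, and no read before write) can be put together in a small model. In this sketch ``written`` is a set of byte offsets relative to R10 that have already been stored to; the function name and the wording of the second and third error strings are assumptions made for this example.

```python
MAX_BPF_STACK = 512


def check_stack_access(written, off, size, is_write):
    """Return None if the access at R10 + off is allowed, else an error string."""
    # Valid offsets lie in [-MAX_BPF_STACK, 0).
    if off < -MAX_BPF_STACK or off + size > 0:
        return f"invalid stack off={off} size={size}"
    if off % size != 0:
        return f"misaligned stack access off {off} size {size}"
    span = range(off, off + size)
    if is_write:
        written.update(span)
        return None
    if not all(byte in written for byte in span):
        return f"invalid read from stack off {off} size {size}"
    return None
```

A read at offset -4 fails on a fresh stack and succeeds once the same four bytes have been written.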
Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.
Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
The eBPF verifier will check that registers match argument constraints.
After the call register R0 will be set to return type of the function.
Function calls are the main mechanism to extend the functionality of eBPF programs.
Socket filters may let programs call one set of functions, whereas tracing
filters may allow a completely different set.
If a function is made accessible to eBPF programs, it needs to be thought through
from a safety point of view. The verifier will guarantee that the function is
called with valid arguments.
seccomp vs socket filters have different security restrictions for classic BPF.
Seccomp solves this by two stage verifier: classic BPF verifier is followed
by seccomp verifier. In case of eBPF one configurable verifier is shared for
all use cases.
See details of eBPF verifier in kernel/bpf/verifier.c
Register value tracking
=======================
In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with ``struct bpf_reg_state``, defined in include/linux/
bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
register state has a type, which is either NOT_INIT (the register has not been
written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
pointer type. The types of pointers describe their base, as follows:
PTR_TO_CTX
Pointer to bpf_context.
CONST_PTR_TO_MAP
Pointer to struct bpf_map. "Const" because arithmetic
on these pointers is forbidden.
PTR_TO_MAP_VALUE
Pointer to the value stored in a map element.
PTR_TO_MAP_VALUE_OR_NULL
Either a pointer to a map value, or NULL; map accesses
(see maps.rst) return this type, which becomes a
PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on
these pointers is forbidden.
PTR_TO_STACK
Frame pointer.
PTR_TO_PACKET
skb->data.
PTR_TO_PACKET_END
skb->data + headlen; arithmetic forbidden.
PTR_TO_SOCKET
Pointer to struct bpf_sock_ops, implicitly refcounted.
PTR_TO_SOCKET_OR_NULL
Either a pointer to a socket, or NULL; socket lookup
returns this type, which becomes a PTR_TO_SOCKET when
checked != NULL. PTR_TO_SOCKET is reference-counted,
so programs must release the reference through the
socket release function before the end of the program.
Arithmetic on these pointers is forbidden.
However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.
The verifier's knowledge about the variable offset consists of:
* minimum and maximum values as unsigned
* minimum and maximum values as signed
* knowledge of the values of individual bits, in the form of a 'tnum': a u64
'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
mask and value; no bit should ever be 1 in both. For example, if a byte is read
into a register from memory, the register's top 56 bits are known zero, while
the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
0x1ff), because of potential carries.
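The tnum example above can be reproduced with a short Python sketch. The OR and ADD rules below are written from memory of the kernel's tnum helpers and should be read as an approximation, not a copy of that code. A tnum is modelled as a ``(value, mask)`` tuple, matching the ``(value; mask)`` notation used in the text.

```python
U64 = (1 << 64) - 1


def tnum_const(value):
    """A fully known value: no unknown bits in the mask."""
    return (value & U64, 0)


def tnum_or(a, b):
    """A bit is known 1 if either side knows it is 1; otherwise masks combine."""
    value = a[0] | b[0]
    mask = a[1] | b[1]
    return (value, mask & ~value & U64)


def tnum_add(a, b):
    """Unknown bits are the input masks plus every bit a carry could reach."""
    sum_mask = (a[1] + b[1]) & U64
    sum_value = (a[0] + b[0]) & U64
    sigma = (sum_mask + sum_value) & U64
    carries = sigma ^ sum_value
    mask = carries | a[1] | b[1]
    return (sum_value & ~mask & U64, mask)
```

Starting from the byte load ``(0x0, 0xff)``, OR with the constant 0x40 gives ``(0x40, 0xbf)``, and adding the constant 1 gives ``(0x0, 0x1ff)``.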
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
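How an unsigned comparison narrows the bounds in each branch can be sketched as follows. This covers only the unsigned ``>`` case from the example; the function name is hypothetical and ``None`` marks a branch that cannot be taken.

```python
U64_MAX = (1 << 64) - 1


def refine_unsigned_gt(umin, umax, k):
    """Split [umin, umax] on 'reg > k'; return (true_bounds, false_bounds)."""
    true_bounds = (max(umin, k + 1), umax)   # branch taken: reg >= k + 1
    false_bounds = (umin, min(umax, k))      # fall-through: reg <= k
    if true_bounds[0] > true_bounds[1]:
        true_bounds = None
    if false_bounds[0] > false_bounds[1]:
        false_bounds = None
    return true_bounds, false_bounds
```

For a completely unknown scalar compared ``> 8``, the 'true' branch gets a umin_value of 9 and the 'false' branch a umax_value of 8.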
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding a variable to a packet pointer register A, if you then copy
it to another register B and then add a constant 4 to A, both registers will
share the same 'id' but the A will have a fixed offset of +4. Then if A is
bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
now known to have a safe range of at least 4 bytes. See 'Direct packet access',
below, for more on PTR_TO_PACKET ranges.
The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
to all copies of the pointer returned from a socket lookup. This has similar
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
represents a reference to the corresponding ``struct sock``. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and, in
the non-NULL case, pass the valid reference to the socket release function.
Direct packet access
====================
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex::
1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
2: r3 = *(u32 *)(r1 +76) /* load skb->data */
3: r5 = r3
4: r5 += 14
5: if r5 > r4 goto pc+16
R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
This 2-byte load from the packet is safe to do, since the program author
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes. The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
to the packet data, but constant 14 was added to the register, so
it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.
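The range arithmetic described above reduces to one comparison. In this sketch ``reg_off`` and ``reg_range`` correspond to the ``off=`` and ``r=`` fields of the verifier state, ``insn_off`` is the offset encoded in the load instruction, and the function name is made up.

```python
def pkt_access_ok(reg_off, reg_range, insn_off, size):
    """True if [reg + insn_off, reg + insn_off + size) stays inside the range."""
    start = reg_off + insn_off
    return start >= 0 and start + size <= reg_range
```

The 2-byte load at ``r3 + 12`` passes with ``off=0, r=14``, while any load through R5 (``off=14, r=14``) fails because zero bytes remain.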
More complex packet access may look like::
R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
9: r3 = *(u32 *)(r1 +76) /* load skb->data */
10: r3 += r4
11: r2 = r1
12: r2 <<= 48
13: r2 >>= 48
14: r3 += r2
15: r2 = r3
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
offset within a packet and since the program author did
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add'/'sub' operations on packet registers. Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation ``r3 += rX`` may overflow and become less than original skb->data,
therefore the verifier has to prevent that. So when it sees ``r3 += rX``
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
against skb->data_end will not give us 'range' information, so attempts to read
through the pointer will give "invalid access to packet" error.
Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
of the register are guaranteed to be zero, and nothing is known about the lower
8 bits. After insn ``r4 *= 14`` the state becomes
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
value by constant 14 will keep upper 52 bits as zero, also the least significant
bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
extending. This logic is implemented in adjust_reg_min_max_vals() function,
which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
versa) and adjust_scalar_min_max_vals() for operations on two scalars.
The end result is that a bpf program author can access the packet directly
using normal C code as::
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct eth_hdr *eth = data;
struct iphdr *iph = data + sizeof(*eth);
struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
return 0;
if (eth->h_proto != htons(ETH_P_IP))
return 0;
if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
return 0;
if (udp->dest == 53 || udp->source == 9)
...;
which makes such programs easier to write compared to the LD_ABS insn
and significantly faster.
Pruning
=======
The verifier does not actually walk all possible paths through the program. For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction. If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well. For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe. The implementation is in the function regsafe().
Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().
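The subset idea behind pruning can be illustrated with a heavily simplified model. Register states are plain dicts here, only the NOT_INIT and packet-range cases from the text are handled specially, and alignment, ids, scalar bounds and the stack are ignored; ``reg_safe`` and ``states_safe`` are hypothetical names and do not reproduce regsafe() or states_equal().

```python
def reg_safe(old, cur):
    """True if a state accepted with 'old' implies 'cur' is acceptable too."""
    if old["type"] == "NOT_INIT":
        return True  # the old path never read this register
    if old["type"] != cur["type"]:
        return False
    if old["type"] == "PTR_TO_PACKET":
        # A range as long or longer than before is at least as safe.
        return cur["range"] >= old["range"]
    return old == cur


def states_safe(old_regs, cur_regs):
    """Every register must be safe for the branch to be pruned."""
    return all(reg_safe(o, c) for o, c in zip(old_regs, cur_regs))
```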
Understanding eBPF verifier messages
====================================
The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:
Program with unreachable instructions::
static struct bpf_insn prog[] = {
BPF_EXIT_INSN(),
BPF_EXIT_INSN(),
};
Error::
unreachable insn 1
Program that reads uninitialized register::
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
BPF_EXIT_INSN(),
Error::
0: (bf) r0 = r2
R2 !read_ok
Program that doesn't initialize R0 before exiting::
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
BPF_EXIT_INSN(),
Error::
0: (bf) r2 = r1
1: (95) exit
R0 !read_ok
Program that accesses stack out of bounds::
BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
BPF_EXIT_INSN(),
Error::
0: (7a) *(u64 *)(r10 +8) = 0
invalid stack off=8 size=8
Program that doesn't initialize stack before passing its address into function::
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_EXIT_INSN(),
Error::
0: (bf) r2 = r10
1: (07) r2 += -8
2: (b7) r1 = 0x0
3: (85) call 1
invalid indirect read from stack off -8+0 size 8
Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_EXIT_INSN(),
Error::
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 0x0
4: (85) call 1
fd 0 is not pointing to valid bpf_map
Program that doesn't check return value of map_lookup_elem() before accessing
map element::
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
Error::
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 0x0
4: (85) call 1
5: (7a) *(u64 *)(r0 +0) = 0
R0 invalid mem access 'map_value_or_null'
Program that correctly checks map_lookup_elem() returned value for NULL, but
accesses the memory with incorrect alignment::
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
BPF_EXIT_INSN(),
Error::
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 1
4: (85) call 1
5: (15) if r0 == 0x0 goto pc+1
R0=map_ptr R10=fp
6: (7a) *(u64 *)(r0 +4) = 0
misaligned access off 4 size 8
Program that correctly checks map_lookup_elem() returned value for NULL and
accesses memory with correct alignment in one side of 'if' branch, but fails
to do so in the other side of 'if' branch::
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
Error::
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 1
4: (85) call 1
5: (15) if r0 == 0x0 goto pc+2
R0=map_ptr R10=fp
6: (7a) *(u64 *)(r0 +0) = 0
7: (95) exit
from 5 to 8: R0=imm0 R10=fp
8: (7a) *(u64 *)(r0 +0) = 1
R0 invalid mem access 'imm'
Program that performs a socket lookup then sets the pointer to NULL without
checking it::
BPF_MOV64_IMM(BPF_REG_2, 0),
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_MOV64_IMM(BPF_REG_3, 4),
BPF_MOV64_IMM(BPF_REG_4, 0),
BPF_MOV64_IMM(BPF_REG_5, 0),
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
Error::
0: (b7) r2 = 0
1: (63) *(u32 *)(r10 -8) = r2
2: (bf) r2 = r10
3: (07) r2 += -8
4: (b7) r3 = 4
5: (b7) r4 = 0
6: (b7) r5 = 0
7: (85) call bpf_sk_lookup_tcp#65
8: (b7) r0 = 0
9: (95) exit
Unreleased reference id=1, alloc_insn=7
Program that performs a socket lookup but does not NULL-check the returned
value::
BPF_MOV64_IMM(BPF_REG_2, 0),
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_MOV64_IMM(BPF_REG_3, 4),
BPF_MOV64_IMM(BPF_REG_4, 0),
BPF_MOV64_IMM(BPF_REG_5, 0),
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
BPF_EXIT_INSN(),
Error::
0: (b7) r2 = 0
1: (63) *(u32 *)(r10 -8) = r2
2: (bf) r2 = r10
3: (07) r2 += -8
4: (b7) r3 = 4
5: (b7) r4 = 0
6: (b7) r5 = 0
7: (85) call bpf_sk_lookup_tcp#65
8: (95) exit
Unreleased reference id=1, alloc_insn=7

View File

@ -208,16 +208,86 @@ highlight_language = 'none'
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# The Read the Docs theme is available from
# - https://github.com/snide/sphinx_rtd_theme
# - https://pypi.python.org/pypi/sphinx_rtd_theme
# - python-sphinx-rtd-theme package (on Debian)
try:
import sphinx_rtd_theme
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
except ImportError:
sys.stderr.write('Warning: The Sphinx \'sphinx_rtd_theme\' HTML theme was not found. Make sure you have the theme installed to produce pretty HTML output. Falling back to the default theme.\n')
# Default theme
html_theme = 'sphinx_rtd_theme'
html_css_files = []
if "DOCS_THEME" in os.environ:
html_theme = os.environ["DOCS_THEME"]
if html_theme == 'sphinx_rtd_theme' or html_theme == 'sphinx_rtd_dark_mode':
# Read the Docs theme
try:
import sphinx_rtd_theme
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_css_files = [
'theme_overrides.css',
]
# Read the Docs dark mode override theme
if html_theme == 'sphinx_rtd_dark_mode':
try:
import sphinx_rtd_dark_mode
extensions.append('sphinx_rtd_dark_mode')
except ImportError:
html_theme = 'sphinx_rtd_theme'
if html_theme == 'sphinx_rtd_theme':
# Add color-specific RTD normal mode
html_css_files.append('theme_rtd_colors.css')
except ImportError:
html_theme = 'classic'
if "DOCS_CSS" in os.environ:
css = os.environ["DOCS_CSS"].split(" ")
for l in css:
html_css_files.append(l)
if major <= 1 and minor < 8:
html_context = {
'css_files': [],
}
for l in html_css_files:
html_context['css_files'].append('_static/' + l)
if html_theme == 'classic':
html_theme_options = {
'rightsidebar': False,
'stickysidebar': True,
'collapsiblesidebar': True,
'externalrefs': False,
'footerbgcolor': "white",
'footertextcolor': "white",
'sidebarbgcolor': "white",
'sidebarbtncolor': "black",
'sidebartextcolor': "black",
'sidebarlinkcolor': "#686bff",
'relbarbgcolor': "#133f52",
'relbartextcolor': "white",
'relbarlinkcolor': "white",
'bgcolor': "white",
'textcolor': "black",
'headbgcolor': "#f2f2f2",
'headtextcolor': "#20435c",
'headlinkcolor': "#c60f0f",
'linkcolor': "#355f7c",
'visitedlinkcolor': "#355f7c",
'codebgcolor': "#3f3f3f",
'codetextcolor': "white",
'bodyfont': "serif",
'headfont': "sans-serif",
}
sys.stderr.write("Using %s theme\n" % html_theme)
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
@ -246,20 +316,8 @@ except ImportError:
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['sphinx-static']
html_css_files = [
'theme_overrides.css',
]
if major <= 1 and minor < 8:
html_context = {
'css_files': [
'_static/theme_overrides.css',
],
}
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.

View File

@ -279,6 +279,7 @@ Accounting Framework
Block Devices
=============
.. kernel-doc:: include/linux/bio.h
.. kernel-doc:: block/blk-core.c
:export:
@ -294,9 +295,6 @@ Block Devices
.. kernel-doc:: block/blk-settings.c
:export:
.. kernel-doc:: block/blk-exec.c
:export:
.. kernel-doc:: block/blk-flush.c
:export:

View File

@ -118,7 +118,7 @@ Initialization of kobjects
Code which creates a kobject must, of course, initialize that object. Some
of the internal fields are setup with a (mandatory) call to kobject_init()::
void kobject_init(struct kobject *kobj, struct kobj_type *ktype);
void kobject_init(struct kobject *kobj, const struct kobj_type *ktype);
The ktype is required for a kobject to be created properly, as every kobject
must have an associated kobj_type. After calling kobject_init(), to
@ -156,7 +156,7 @@ kobject_name()::
There is a helper function to both initialize and add the kobject to the
kernel at the same time, called surprisingly enough kobject_init_and_add()::
int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype,
struct kobject *parent, const char *fmt, ...);
The arguments are the same as the individual kobject_init() and
@ -299,7 +299,6 @@ kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
@ -313,10 +312,10 @@ call kobject_init() or kobject_init_and_add().
The release field in struct kobj_type is, of course, a pointer to the
release() method for this type of kobject. The other two fields (sysfs_ops
and default_attrs) control how objects of this type are represented in
and default_groups) control how objects of this type are represented in
sysfs; they are beyond the scope of this document.
The default_attrs pointer is a list of default attributes that will be
The default_groups pointer is a list of default attributes that will be
automatically created for any kobject that is registered with this ktype.
@ -373,10 +372,9 @@ If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops {
int (* const filter)(struct kset *kset, struct kobject *kobj);
const char *(* const name)(struct kset *kset, struct kobject *kobj);
int (* const uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
int (* const filter)(struct kobject *kobj);
const char *(* const name)(struct kobject *kobj);
int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env);
};

View File

@ -32,6 +32,7 @@ Documentation/dev-tools/testing-overview.rst
kgdb
kselftest
kunit/index
ktap
.. only:: subproject and html

View File

@ -204,17 +204,17 @@ Ultimately this allows to determine the possible executions of concurrent code,
and if that code is free from data races.
KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
``atomic_*``, etc.), but is oblivious of any ordering guarantees and simply
assumes that memory barriers are placed correctly. In other words, KCSAN
assumes that as long as a plain access is not observed to race with another
conflicting access, memory operations are correctly ordered.
``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
barriers. With ``CONFIG_KCSAN_WEAK_MEMORY=y``, KCSAN models load or store
buffering, and can detect missing ``smp_mb()``, ``smp_wmb()``, ``smp_rmb()``,
``smp_store_release()``, and all ``atomic_*`` operations with equivalent
implied barriers.
This means that KCSAN will not report *potential* data races due to missing
memory ordering. Developers should therefore carefully consider the required
memory ordering requirements that remain unchecked. If, however, missing
memory ordering (that is observable with a particular compiler and
architecture) leads to an observable data race (e.g. entering a critical
section erroneously), KCSAN would report the resulting data race.
Note, KCSAN will not report all data races due to missing memory ordering,
specifically where a memory barrier would be required to prohibit subsequent
memory operations from reordering before the barrier. Developers should
therefore carefully consider the memory ordering requirements that
remain unchecked.
Race Detection Beyond Data Races
--------------------------------
@ -268,6 +268,56 @@ marked operations, if all accesses to a variable that is accessed concurrently
are properly marked, KCSAN will never trigger a watchpoint and therefore never
report the accesses.
Modeling Weak Memory
~~~~~~~~~~~~~~~~~~~~
KCSAN's approach to detecting data races due to missing memory barriers is
based on modeling access reordering (with ``CONFIG_KCSAN_WEAK_MEMORY=y``).
Each plain memory access for which a watchpoint is set up, is also selected for
simulated reordering within the scope of its function (at most 1 in-flight
access).
Once an access has been selected for reordering, it is checked along every
other access until the end of the function scope. If an appropriate memory
barrier is encountered, the access will no longer be considered for simulated
reordering.
When the result of a memory operation should be ordered by a barrier, KCSAN can
then detect data races where the conflict only occurs as a result of a missing
barrier. Consider the example::
int x, flag;
void T1(void)
{
x = 1; // data race!
WRITE_ONCE(flag, 1); // correct: smp_store_release(&flag, 1)
}
void T2(void)
{
while (!READ_ONCE(flag)); // correct: smp_load_acquire(&flag)
... = x; // data race!
}
When weak memory modeling is enabled, KCSAN can consider ``x`` in ``T1`` for
simulated reordering. After the write of ``flag``, ``x`` is again checked for
concurrent accesses: because ``T2`` is able to proceed after the write of
``flag``, a data race is detected. With the correct barriers in place, ``x``
would not be considered for reordering after the proper release of ``flag``,
and no data race would be detected.
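The effect of delaying the store to ``x`` can be illustrated with a toy model. The sketch below is not how KCSAN works (KCSAN uses watchpoints and compiler instrumentation); it only enumerates interleavings of the two threads from the example, assuming sequential execution of each schedule, and treats a single read of ``flag`` as the loop iteration under consideration. All function names are hypothetical.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(t1_order):
    """Return the set of (flag_seen, x_seen) pairs that T2 can observe."""
    t2 = [("read", "flag"), ("read", "x")]
    seen = set()
    for schedule in interleavings(t1_order, t2):
        memory = {"x": 0, "flag": 0}
        reads = {}
        for op, var in schedule:
            if op == "write":
                memory[var] = 1
            else:
                reads[var] = memory[var]
        seen.add((reads["flag"], reads["x"]))
    return seen


# T1 as written: the store to x comes before the store to flag.
program_order = [("write", "x"), ("write", "flag")]
# T1 with the plain store to x delayed ("buffered") past the store to flag.
reordered = [("write", "flag"), ("write", "x")]

print((1, 0) in outcomes(program_order))
print((1, 0) in outcomes(reordered))
```

In this model, ``T2`` can observe ``flag == 1`` together with a stale ``x == 0`` only when the store to ``x`` is allowed to move after the store to ``flag``, which is the reordering a release store would forbid.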
Deliberate trade-offs in complexity but also practical limitations mean only a
subset of data races due to missing memory barriers can be detected. With
currently available compiler support, the implementation is limited to modeling
the effects of "buffering" (delaying accesses), since the runtime cannot
"prefetch" accesses. Also recall that watchpoints are only set up for plain
accesses, which are the only access type for which KCSAN simulates reordering.
This means reordering of marked accesses is not modeled.
A consequence of the above is that acquire operations do not require barrier
instrumentation (no prefetching). Furthermore, marked accesses introducing
address or control dependencies do not require special handling (the marked
access cannot be reordered, later dependent accesses cannot be prefetched).
Key Properties
~~~~~~~~~~~~~~
@ -290,8 +340,8 @@ Key Properties
4. **Detects Racy Writes from Devices:** Due to checking data values upon
setting up watchpoints, racy writes from devices can also be detected.
5. **Memory Ordering:** KCSAN is *not* explicitly aware of the LKMM's ordering
rules; this may result in missed data races (false negatives).
5. **Memory Ordering:** KCSAN is aware of only a subset of LKMM ordering rules;
this may result in missed data races (false negatives).
6. **Analysis Accuracy:** For observed executions, due to using a sampling
strategy, the analysis is *unsound* (false negatives possible), but aims to

View File

@ -402,7 +402,7 @@ This is a quick example of how to use kdb.
2. Enter the kernel debugger manually or by waiting for an oops or
fault. There are several ways you can enter the kernel debugger
manually; all involve using the :kbd:`SysRq-G`, which means you must have
enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config.
enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config.
- When logged in as root or with a super user session you can run::
@ -461,7 +461,7 @@ This is a quick example of how to use kdb with a keyboard.
2. Enter the kernel debugger manually or by waiting for an oops or
fault. There are several ways you can enter the kernel debugger
manually; all involve using the :kbd:`SysRq-G`, which means you must have
enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config.
enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config.
- When logged in as root or with a super user session you can run::
@ -557,7 +557,7 @@ Connecting with gdb to a serial port
Example (using a directly connected port)::
% gdb ./vmlinux
(gdb) set remotebaud 115200
(gdb) set serial baud 115200
(gdb) target remote /dev/ttyS0

View File

@ -0,0 +1,298 @@
.. SPDX-License-Identifier: GPL-2.0
========================================
The Kernel Test Anything Protocol (KTAP)
========================================
TAP, or the Test Anything Protocol, is a format for specifying test results used
by a number of projects. Its website and specification are found at this `link
<https://testanything.org/>`_. The Linux Kernel largely uses TAP output for test
results. However, Kernel testing frameworks have special needs for test results
which don't align with the original TAP specification. Thus, a "Kernel TAP"
(KTAP) format is specified to extend and alter TAP to support these use-cases.
This specification describes the generally accepted format of KTAP as it is
currently used in the kernel.
KTAP test results describe a series of tests (which may be nested: i.e., a test
can have subtests), each of which can contain both diagnostic data -- e.g., log
lines -- and a final result. The test structure and results are
machine-readable, whereas the diagnostic data is unstructured and is there to
aid human debugging.
KTAP output is built from four different types of lines:
- Version lines
- Plan lines
- Test case result lines
- Diagnostic lines
In general, valid KTAP output should also form valid TAP output, but some
information, in particular nested test results, may be lost. Also note that
there is a stagnant draft specification for TAP14; KTAP diverges from this in
a couple of places (notably the "Subtest" header), which are described where
relevant later in this document.
Version lines
-------------
All KTAP-formatted results begin with a "version line" which specifies which
version of the (K)TAP standard the result is compliant with.
For example:
- "KTAP version 1"
- "TAP version 13"
- "TAP version 14"
Note that, in KTAP, subtests also begin with a version line, which denotes the
start of the nested test results. This differs from TAP14, which uses a
separate "Subtest" line.
While, going forward, "KTAP version 1" should be used by compliant tests, it
is expected that most parsers and other tooling will accept the other versions
listed here for compatibility with existing tests and frameworks.
Plan lines
----------
A test plan provides the number of tests (or subtests) in the KTAP output.
Plan lines must follow the format of "1..N" where N is the number of tests or subtests.
Plan lines follow version lines to indicate the number of nested tests.
While there are cases where the number of tests is not known in advance -- in
which case the test plan may be omitted -- it is strongly recommended one is
present where possible.
Test case result lines
----------------------
Test case result lines indicate the final status of a test.
They are required and must have the format:
.. code-block::
<result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
The result can be either "ok", which indicates the test case passed,
or "not ok", which indicates that the test case failed.
<number> represents the number of the test being performed. The first test must
have the number 1 and the number then must increase by 1 for each additional
subtest within the same test at the same nesting level.
The description is a description of the test, generally the name of
the test, and can be any string of words (can't include #). The
description is optional, but recommended.
The directive and any diagnostic data are optional. If either is present, it
must follow a hash sign, "#".
A directive is a keyword that indicates a different outcome for a test other
than passed and failed. The directive is optional, and consists of a single
keyword preceding the diagnostic data. In the event that a parser encounters
a directive it doesn't support, it should fall back to the "ok" / "not ok"
result.
Currently accepted directives are:
- "SKIP", which indicates a test was skipped (note the result of the test case
result line can be either "ok" or "not ok" if the SKIP directive is used)
- "TODO", which indicates that a test is not expected to pass at the moment,
e.g. because the feature it is testing is known to be broken. While this
directive is inherited from TAP, its use in the kernel is discouraged.
- "XFAIL", which indicates that a test is expected to fail. This is similar
to "TODO", above, and is used by some kselftest tests.
- "TIMEOUT", which indicates a test has timed out (note the result of the test
case result line should be "not ok" if the TIMEOUT directive is used)
- "ERROR", which indicates that the execution of a test has failed due to a
specific error that is included in the diagnostic data (note the result of
the test case result line should be "not ok" if the ERROR directive is used)
The diagnostic data is a plain-text field which contains any additional details
about why this result was produced. This is typically an error message for ERROR
or failed tests, or a description of missing dependencies for a SKIP result.
The diagnostic data field is optional, and results which have neither a
directive nor any diagnostic data do not need to include the "#" field
separator.
Example result lines include:
.. code-block::
ok 1 test_case_name
The test "test_case_name" passed.
.. code-block::
not ok 1 test_case_name
The test "test_case_name" failed.
.. code-block::
ok 1 test # SKIP necessary dependency unavailable
The test "test" was SKIPPED with the diagnostic message "necessary dependency
unavailable".
.. code-block::
not ok 1 test # TIMEOUT 30 seconds
The test "test" timed out, with diagnostic data "30 seconds".
.. code-block::
ok 5 check return code # rcode=0
The test "check return code" passed, with additional diagnostic data "rcode=0".
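As an illustration of the result line format described in this section, here is a minimal parsing sketch. It is not the parser used by kunit_tool or kselftest; the function name, the returned dictionary layout, and the assumption that directives are written in upper case are all choices made for this example.

```python
import re

# Directives listed in this section; any other first word after "#" is
# treated as the start of the diagnostic data.
DIRECTIVES = ("SKIP", "TODO", "XFAIL", "TIMEOUT", "ERROR")

# <result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
RESULT_RE = re.compile(r"^(ok|not ok) (\d+)(?: ([^#]*?))?\s*(?:# ?(.*))?$")


def parse_result_line(line):
    """Return a dict describing a test case result line, or None otherwise."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        return None
    result, number, description, trailer = match.groups()
    directive = None
    diagnostic = (trailer or "").strip()
    first_word = diagnostic.split(" ", 1)[0]
    if first_word in DIRECTIVES:
        directive = first_word
        diagnostic = diagnostic[len(first_word):].strip()
    return {
        "ok": result == "ok",
        "number": int(number),
        "description": (description or "").strip(),
        "directive": directive,
        "diagnostic": diagnostic,
    }


print(parse_result_line("ok 1 test # SKIP necessary dependency unavailable"))
print(parse_result_line("ok 5 check return code # rcode=0"))
```

A parser that meets an unsupported directive can simply leave ``directive`` unset and rely on the "ok" / "not ok" result, which matches the fallback recommended above.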
Diagnostic lines
----------------
If tests wish to output any further information, they should do so using
"diagnostic lines". Diagnostic lines are optional, freeform text, and are
often used to describe what is being tested and any intermediate results in
more detail than the final result and diagnostic data line provides.
Diagnostic lines are formatted as "# <diagnostic_description>", where the
description can be any string. Diagnostic lines can be anywhere in the test
output. As a rule, diagnostic lines regarding a test are directly before the
test result line for that test.
Note that most tools will treat unknown lines (see below) as diagnostic lines,
even if they do not start with a "#": this is to capture any other useful
kernel output which may help debug the test. It is nevertheless recommended
that tests always prefix any diagnostic output they have with a "#" character.
Unknown lines
-------------
There may be lines within KTAP output that do not follow any of
the four line formats described above. This is allowed; however, they will
not influence the status of the tests.
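The four line types and the "unknown" fallback can be told apart with a few pattern checks. The following is a rough sketch under the assumption that indentation has no influence on a line's type; the function name and the returned labels are hypothetical.

```python
import re


def classify_line(line):
    """Classify one line of KTAP output into one of the types described above."""
    text = line.strip()
    if re.fullmatch(r"K?TAP version \d+", text):
        return "version"
    if re.fullmatch(r"1\.\.\d+", text):
        return "plan"
    if re.match(r"(ok|not ok) \d+( |$)", text):
        return "result"
    if text.startswith("#"):
        return "diagnostic"
    # Anything else is allowed but does not influence test status.
    return "unknown"


for sample in ["KTAP version 1", "1..2", "ok 1 test_1",
               "# example failed", "random kernel log line"]:
    print(classify_line(sample), "<-", sample)
```

A tool that wants to keep other kernel output for debugging can treat "unknown" lines the same way as "diagnostic" lines.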
Nested tests
------------
In KTAP, tests can be nested. This is done by having a test include within its
output an entire set of KTAP-formatted results. This can be used to categorize
and group related tests, or to split out different results from the same test.
The "parent" test's result should consist of all of its subtests' results,
starting with another KTAP version line and test plan, and end with the overall
result. If one of the subtests fails, for example, the parent test should also
fail.
Additionally, all result lines in a subtest should be indented. One level of
indentation is two spaces: " ". The indentation should begin at the version
line and should end before the parent test's result line.
An example of a test with two nested subtests:
.. code-block::
    KTAP version 1
    1..1
      KTAP version 1
      1..2
      ok 1 test_1
      not ok 2 test_2
    # example failed
    not ok 1 example
An example format with multiple levels of nested testing:
.. code-block::
    KTAP version 1
    1..2
      KTAP version 1
      1..2
        KTAP version 1
        1..2
        not ok 1 test_1
        ok 2 test_2
      not ok 1 test_3
      ok 2 test_4 # SKIP
    not ok 1 example_test_1
    ok 2 example_test_2
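Because one level of nesting is two spaces of indentation, a consumer can recover the nesting depth of each line before interpreting it. The sketch below assumes the outermost level starts at column zero and that only spaces are used for indentation; the function name is hypothetical.

```python
def split_indentation(line, indent="  "):
    """Return (depth, text), where depth counts leading two-space indent levels."""
    text = line.lstrip(" ")
    depth = (len(line) - len(text)) // len(indent)
    return depth, text.rstrip("\n")


example = """\
KTAP version 1
1..1
  KTAP version 1
  1..2
  ok 1 test_1
  not ok 2 test_2
# example failed
not ok 1 example
"""

for raw in example.splitlines():
    depth, text = split_indentation(raw)
    print(depth, text)
```

Lines at depth 1 belong to the subtests of "example", while the final result line at depth 0 reports the parent test itself.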
Major differences between TAP and KTAP
--------------------------------------
Note the major differences between the TAP and KTAP specifications:
- YAML and JSON are not recommended in diagnostic messages
- TODO directive not recognized
- KTAP allows for an arbitrary number of tests to be nested
The TAP14 specification does permit nested tests, but instead of using another
nested version line, uses a line of the form
"Subtest: <name>" where <name> is the name of the parent test.
Example KTAP output
--------------------
.. code-block::
    KTAP version 1
    1..1
      KTAP version 1
      1..3
        KTAP version 1
        1..1
        # test_1: initializing test_1
        ok 1 test_1
      ok 1 example_test_1
        KTAP version 1
        1..2
        ok 1 test_1 # SKIP test_1 skipped
        ok 2 test_2
      ok 2 example_test_2
        KTAP version 1
        1..3
        ok 1 test_1
        # test_2: FAIL
        not ok 2 test_2
        ok 3 test_3 # SKIP test_3 skipped
      not ok 3 example_test_3
    not ok 1 main_test
This output defines the following hierarchy:
A single test called "main_test", which fails, and has three subtests:
- "example_test_1", which passes, and has one subtest:
   - "test_1", which passes, and outputs the diagnostic message "test_1: initializing test_1"
- "example_test_2", which passes, and has two subtests:
   - "test_1", which is skipped, with the explanation "test_1 skipped"
   - "test_2", which passes
- "example_test_3", which fails, and has three subtests:
   - "test_1", which passes
   - "test_2", which outputs the diagnostic line "test_2: FAIL", and fails.
   - "test_3", which is skipped with the explanation "test_3 skipped"
Note that the individual subtests with the same names do not conflict, as they
are found in different parent tests. This output also exhibits some sensible
rules for "bubbling up" test results: a test fails if any of its subtests fail.
Skipped tests do not affect the result of the parent test (though it often
makes sense for a test to be marked skipped if _all_ of its subtests have been
skipped).
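The "bubbling up" rules above can be written down as a small function. This is only one possible policy, sketched under the assumption that each subtest outcome has already been reduced to one of the strings "pass", "fail", or "skip"; the function name and the treatment of an empty subtest list are choices made for this example.

```python
def parent_result(subtests):
    """Combine subtest outcomes ("pass", "fail", "skip") into a parent outcome."""
    if any(outcome == "fail" for outcome in subtests):
        return "fail"   # a test fails if any of its subtests fail
    if subtests and all(outcome == "skip" for outcome in subtests):
        return "skip"   # everything was skipped, so mark the parent skipped
    return "pass"       # skipped subtests do not affect the parent otherwise


print(parent_result(["pass"]))
print(parent_result(["skip", "pass"]))
print(parent_result(["pass", "fail", "skip"]))
```

Applied to the example output, this gives "pass" for "example_test_1" and "example_test_2", "fail" for "example_test_3", and therefore "fail" for "main_test".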
See also:
---------
- The TAP specification:
https://testanything.org/tap-version-13-specification.html
- The (stagnant) TAP version 14 specification:
https://github.com/TestAnything/Specification/blob/tap-14-specification/specification.md
- The kselftest documentation:
Documentation/dev-tools/kselftest.rst
- The KUnit documentation:
Documentation/dev-tools/kunit/index.rst

View File

@ -12,5 +12,4 @@ following sections:
Documentation/dev-tools/kunit/api/test.rst
- documents all of the standard testing API excluding mocking
or mocking related features.
- documents all of the standard testing API

View File

@ -4,8 +4,7 @@
Test API
========
This file documents all of the standard testing API excluding mocking or mocking
related features.
This file documents all of the standard testing API.
.. kernel-doc:: include/kunit/test.h
:internal:

View File

@ -0,0 +1,204 @@
.. SPDX-License-Identifier: GPL-2.0
==================
KUnit Architecture
==================
The KUnit architecture can be divided into two parts:
- Kernel testing library
- kunit_tool (Command line test harness)
In-Kernel Testing Framework
===========================
The kernel testing library supports KUnit tests written in C using
KUnit. KUnit tests are kernel code. KUnit does several things:
- Organizes tests
- Reports test results
- Provides test utilities
Test Cases
----------
The fundamental unit in KUnit is the test case. The KUnit test cases are
grouped into KUnit suites. A KUnit test case is a function with type
signature ``void (*)(struct kunit *test)``.
These test case functions are wrapped in a struct called
``struct kunit_case``. For code, see:
.. kernel-doc:: include/kunit/test.h
:identifiers: kunit_case
.. note::
``generate_params`` is optional for non-parameterized tests.
Each KUnit test case gets a ``struct kunit`` context
object passed to it that tracks a running test. The KUnit assertion
macros and other KUnit utilities use the ``struct kunit`` context
object. As an exception, tests may use two of its fields directly:
- ``->priv``: The setup functions can use it to store arbitrary test
user data.
- ``->param_value``: It contains the parameter value which can be
retrieved in the parameterized tests.
Test Suites
-----------
A KUnit suite includes a collection of test cases. The KUnit suites
are represented by the ``struct kunit_suite``. For example:
.. code-block:: c
static struct kunit_case example_test_cases[] = {
KUNIT_CASE(example_test_foo),
KUNIT_CASE(example_test_bar),
KUNIT_CASE(example_test_baz),
{}
};
static struct kunit_suite example_test_suite = {
.name = "example",
.init = example_test_init,
.exit = example_test_exit,
.test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);
In the above example, the test suite ``example_test_suite``, runs the
test cases ``example_test_foo``, ``example_test_bar``, and
``example_test_baz``. Before running the test, the ``example_test_init``
is called and after running the test, ``example_test_exit`` is called.
The ``kunit_test_suite(example_test_suite)`` registers the test suite
with the KUnit test framework.
Executor
--------
The KUnit executor can list and run built-in KUnit tests on boot.
The test suites are stored in a linker section
called ``.kunit_test_suites``. For code, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/vmlinux.lds.h?h=v5.15#n945.
The linker section consists of an array of pointers to
``struct kunit_suite``, and is populated by the ``kunit_test_suites()``
macro. To run all tests compiled into the kernel, the KUnit executor
iterates over the linker section array.
.. kernel-figure:: kunit_suitememorydiagram.svg
:alt: KUnit Suite Memory
KUnit Suite Memory Diagram
On kernel boot, the KUnit executor uses the start and end addresses
of this section to iterate over and run all tests. For code, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/kunit/executor.c
When built as a module, the ``kunit_test_suites()`` macro defines a
``module_init()`` function, which runs all the tests in the compilation
unit instead of utilizing the executor.
So that some classes of errors do not affect other tests
or other parts of the kernel, each KUnit case executes in a separate thread
context. For code, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/kunit/try-catch.c?h=v5.15#n58
Assertion Macros
----------------
KUnit tests verify state using expectations/assertions.
All expectations/assertions are formatted as:
``KUNIT_{EXPECT|ASSERT}_<op>[_MSG](kunit, property[, message])``
- ``{EXPECT|ASSERT}`` determines whether the check is an assertion or an
expectation.
- An expectation, on failure, marks the test as failed
and logs the failure.
- An assertion, on failure, causes the test case to terminate
immediately.
- Assertions call the function:
``void __noreturn kunit_abort(struct kunit *)``.
- ``kunit_abort`` calls the function:
``void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)``.
- ``kunit_try_catch_throw`` calls the function:
``void complete_and_exit(struct completion *, long) __noreturn;``
and terminates the special thread context.
- ``<op>`` denotes a check with options: ``TRUE`` (supplied property
has the boolean value “true”), ``EQ`` (two supplied properties are
equal), ``NOT_ERR_OR_NULL`` (supplied pointer is not null and does not
contain an “err” value).
- ``[_MSG]`` prints a custom message on failure.
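The difference between an expectation and an assertion can be shown with a small model. This is only an analogy written in Python: KUnit itself is C code and aborts a test case by terminating its thread context, not by raising an exception. Every name below (``ToyTest``, ``TestAborted``, ``run_case``) is hypothetical.

```python
class TestAborted(Exception):
    """Raised by a failed assertion to stop the current test case."""


class ToyTest:
    """Stand-in for the per-test context object."""

    def __init__(self):
        self.failed = False
        self.log = []

    def expect_eq(self, left, right):
        # Expectation: record the failure and keep running.
        if left != right:
            self.failed = True
            self.log.append(f"expected {left!r} == {right!r}")

    def assert_eq(self, left, right):
        # Assertion: record the failure, then abort the test case.
        if left != right:
            self.failed = True
            self.log.append(f"asserted {left!r} == {right!r}")
            raise TestAborted


def run_case(case):
    """Run one test case function and return its context object."""
    test = ToyTest()
    try:
        case(test)
    except TestAborted:
        pass
    return test


def example_case(test):
    test.expect_eq(1, 2)   # fails, the test case continues
    test.assert_eq(3, 4)   # fails, the test case stops here
    test.expect_eq(5, 6)   # never reached


result = run_case(example_case)
print(result.failed, result.log)
```

Assertions suit checks whose failure makes the rest of the test meaningless, such as a failed allocation; expectations suit checks where later results are still informative.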
Test Result Reporting
---------------------
KUnit prints test results in KTAP format. KTAP is based on TAP14, see:
https://github.com/isaacs/testanything.github.io/blob/tap14/tap-version-14-specification.md.
KTAP, a format that is yet to be standardized, works with KUnit and Kselftest.
The KUnit executor prints KTAP results to dmesg, and debugfs
(if configured).
Parameterized Tests
-------------------
Each KUnit parameterized test is associated with a collection of
parameters. The test is invoked multiple times, once for each parameter
value and the parameter is stored in the ``param_value`` field.
The test case includes a ``KUNIT_CASE_PARAM()`` macro that accepts a
generator function.
The generator function is passed the previous parameter and returns the next
parameter. It also provides a macro to generate common-case generators based on
arrays.
For code, see:
.. kernel-doc:: include/kunit/test.h
:identifiers: KUNIT_ARRAY_PARAM
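The generator protocol described above (receive the previous parameter, return the next one, stop when there is none) can be modeled as follows. This is a Python sketch of the idea only: the real generator works on C pointers, whereas this model uses list indices, and the names ``make_array_generator`` and ``run_parameterized`` are hypothetical.

```python
def make_array_generator(values):
    """Build a generator function over values, in the spirit of an array-based generator."""
    def next_param(prev_index):
        # None means "no previous parameter": start at the first element.
        index = 0 if prev_index is None else prev_index + 1
        # Returning None signals that the parameters are exhausted.
        return index if index < len(values) else None
    return next_param


def run_parameterized(test_fn, generate_params, values):
    """Invoke test_fn once per parameter produced by generate_params."""
    results = []
    index = generate_params(None)
    while index is not None:
        results.append(test_fn(values[index]))
        index = generate_params(index)
    return results


params = [1, 2, 3]
print(run_parameterized(lambda value: value * 2, make_array_generator(params), params))
```

Passing the previous parameter back into the generator keeps the generator itself stateless, so the framework decides when iteration starts and stops.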
kunit_tool (Command Line Test Harness)
======================================
kunit_tool is a Python script (``tools/testing/kunit/kunit.py``)
that can configure, build, execute, and parse tests, or run all of these
steps in order (``run``). You can either run KUnit tests using
kunit_tool, or include KUnit in the kernel and parse the results manually.
- ``configure`` command generates the kernel ``.config`` from a
``.kunitconfig`` file (and any architecture-specific options).
For some architectures, additional config options are specified in the
``qemu_config`` Python script
(For example: ``tools/testing/kunit/qemu_configs/powerpc.py``).
It parses both the existing ``.config`` and the ``.kunitconfig`` files
and ensures that ``.config`` is a superset of ``.kunitconfig``.
If this is not the case, it will combine the two and run
``make olddefconfig`` to regenerate the ``.config`` file. It then
verifies that ``.config`` is now a superset. This checks if all
Kconfig dependencies are correctly specified in ``.kunitconfig``.
``kunit_config.py`` includes the parsing Kconfigs code. The code which
runs ``make olddefconfig`` is a part of ``kunit_kernel.py``. You can
invoke this command via: ``./tools/testing/kunit/kunit.py config`` and
generate a ``.config`` file.
- ``build`` runs ``make`` on the kernel tree with required options
(depends on the architecture and some options, for example: build_dir)
and reports any errors.
To build a KUnit kernel from the current ``.config``, you can use the
``build`` argument: ``./tools/testing/kunit/kunit.py build``.
- ``exec`` command runs the built kernel either directly (using the
User-mode Linux configuration), or via an emulator such
as QEMU. It reads results from the log via standard
output (stdout), and passes them to ``parse`` to be parsed.
If you already have built a kernel with built-in KUnit tests,
you can run the kernel and display the test results with the ``exec``
argument: ``./tools/testing/kunit/kunit.py exec``.
- ``parse`` extracts the KTAP output from a kernel log, parses
the test results, and prints a summary. For failed tests, any
diagnostic output will be included.
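The superset check performed by the ``configure`` step can be sketched as below. This is a simplified illustration and not the code in ``kunit_config.py``: the function names are hypothetical, and treating an option that is absent from ``.config`` as ``n`` is an assumption made for this example.

```python
def parse_kconfig(text):
    """Parse .config-style text into a dict of option name -> value."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # "# CONFIG_FOO is not set" records an explicitly disabled option.
            name = line[len("# "):-len(" is not set")]
            entries[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            entries[name] = value
    return entries


def is_superset(config, kunitconfig):
    """True if every option requested in kunitconfig has that value in config."""
    return all(config.get(name, "n") == value
               for name, value in kunitconfig.items())


config = parse_kconfig("CONFIG_KUNIT=y\nCONFIG_KUNIT_TEST=y\n# CONFIG_FOO is not set\n")
wanted = parse_kconfig("CONFIG_KUNIT=y\nCONFIG_BAR=y\n")
print(is_superset(config, wanted))
```

When the check fails after regenerating ``.config``, the usual cause is an option in ``.kunitconfig`` whose Kconfig dependencies are not also listed there.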

View File

@ -4,56 +4,55 @@
Frequently Asked Questions
==========================
How is this different from Autotest, kselftest, etc?
====================================================
How is this different from Autotest, kselftest, and so on?
==========================================================
KUnit is a unit testing framework. Autotest, kselftest (and some others) are
not.
A `unit test <https://martinfowler.com/bliki/UnitTest.html>`_ is supposed to
test a single unit of code in isolation, hence the name. A unit test should be
the finest granularity of testing and as such should allow all possible code
paths to be tested in the code under test; this is only possible if the code
under test is very small and does not have any external dependencies outside of
test a single unit of code in isolation and hence the name *unit test*. A unit
test should be the finest granularity of testing and should allow all possible
code paths to be tested in the code under test. This is only possible if the
code under test is small and does not have any external dependencies outside of
the test's control like hardware.
There are no testing frameworks currently available for the kernel that do not
require installing the kernel on a test machine or in a VM and all require
tests to be written in userspace and run on the kernel under test; this is true
for Autotest, kselftest, and some others, disqualifying any of them from being
considered unit testing frameworks.
require installing the kernel on a test machine or in a virtual machine. All
testing frameworks require tests to be written in userspace and run on the
kernel under test. This is true for Autotest, kselftest, and some others,
disqualifying any of them from being considered unit testing frameworks.
Does KUnit support running on architectures other than UML?
===========================================================
Yes, well, mostly.
Yes, mostly.
For the most part, the KUnit core framework (what you use to write the tests)
can compile to any architecture; it compiles like just another part of the
For the most part, the KUnit core framework (what we use to write the tests)
can compile to any architecture. It compiles like just another part of the
kernel and runs when the kernel boots, or when built as a module, when the
module is loaded. However, there is some infrastructure,
like the KUnit Wrapper (``tools/testing/kunit/kunit.py``) that does not support
other architectures.
module is loaded. However, there is infrastructure, like the KUnit Wrapper
(``tools/testing/kunit/kunit.py``) that does not support other architectures.
In short, this means that, yes, you can run KUnit on other architectures, but
it might require more work than using KUnit on UML.
In short, yes, you can run KUnit on other architectures, but it might require
more work than using KUnit on UML.
For more information, see :ref:`kunit-on-non-uml`.
What is the difference between a unit test and these other kinds of tests?
==========================================================================
What is the difference between a unit test and other kinds of tests?
====================================================================
Most existing tests for the Linux kernel would be categorized as an integration
test, or an end-to-end test.
- A unit test is supposed to test a single unit of code in isolation, hence the
name. A unit test should be the finest granularity of testing and as such
should allow all possible code paths to be tested in the code under test; this
is only possible if the code under test is very small and does not have any
external dependencies outside of the test's control like hardware.
- A unit test is supposed to test a single unit of code in isolation. A unit
test should be the finest granularity of testing and, as such, allows all
possible code paths to be tested in the code under test. This is only possible
if the code under test is small and does not have any external dependencies
outside of the test's control like hardware.
- An integration test tests the interaction between a minimal set of components,
usually just two or three. For example, someone might write an integration
test to test the interaction between a driver and a piece of hardware, or to
test the interaction between the userspace libraries the kernel provides and
the kernel itself; however, one of these tests would probably not test the
the kernel itself. However, one of these tests would probably not test the
entire kernel along with hardware interactions and interactions with the
userspace.
- An end-to-end test usually tests the entire system from the perspective of the
@ -62,26 +61,26 @@ test, or an end-to-end test.
hardware with a production userspace and then trying to exercise some behavior
that depends on interactions between the hardware, the kernel, and userspace.
KUnit isn't working, what should I do?
======================================
KUnit is not working, what should I do?
=======================================
Unfortunately, there are a number of things which can break, but here are some
things to try.
1. Try running ``./tools/testing/kunit/kunit.py run`` with the ``--raw_output``
1. Run ``./tools/testing/kunit/kunit.py run`` with the ``--raw_output``
parameter. This might show details or error messages hidden by the kunit_tool
parser.
2. Instead of running ``kunit.py run``, try running ``kunit.py config``,
``kunit.py build``, and ``kunit.py exec`` independently. This can help track
down where an issue is occurring. (If you think the parser is at fault, you
can run it manually against stdin or a file with ``kunit.py parse``.)
3. Running the UML kernel directly can often reveal issues or error messages
kunit_tool ignores. This should be as simple as running ``./vmlinux`` after
building the UML kernel (e.g., by using ``kunit.py build``). Note that UML
has some unusual requirements (such as the host having a tmpfs filesystem
mounted), and has had issues in the past when built statically and the host
has KASLR enabled. (On older host kernels, you may need to run ``setarch
`uname -m` -R ./vmlinux`` to disable KASLR.)
can run it manually against ``stdin`` or a file with ``kunit.py parse``.)
3. Running the UML kernel directly can often reveal issues or error messages
that ``kunit_tool`` ignores. This should be as simple as running ``./vmlinux``
after building the UML kernel (for example, by using ``kunit.py build``).
Note that UML has some unusual requirements (such as the host having a tmpfs
filesystem mounted), and has had issues in the past when built statically and
the host has KASLR enabled. (On older host kernels, you may need to run
``setarch `uname -m` -R ./vmlinux`` to disable KASLR.)
4. Make sure the kernel .config has ``CONFIG_KUNIT=y`` and at least one test
(e.g. ``CONFIG_KUNIT_EXAMPLE_TEST=y``). kunit_tool will keep its .config
around, so you can see what config was used after running ``kunit.py run``.

View File

@ -1,13 +1,17 @@
.. SPDX-License-Identifier: GPL-2.0
=========================================
KUnit - Unit Testing for the Linux Kernel
=========================================
=================================
KUnit - Linux Kernel Unit Testing
=================================
.. toctree::
:maxdepth: 2
:caption: Contents:
start
architecture
run_wrapper
run_manual
usage
kunit-tool
api/index
@ -16,82 +20,94 @@ KUnit - Unit Testing for the Linux Kernel
tips
running_tips
What is KUnit?
==============
This section details the kernel unit testing framework.
KUnit is a lightweight unit testing and mocking framework for the Linux kernel.
Introduction
============
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining unit test
cases, grouping related test cases into test suites, providing common
infrastructure for running tests, and much more.
KUnit (Kernel unit testing framework) provides a common framework for
unit tests within the Linux kernel. Using KUnit, you can define groups
of test cases called test suites. The tests either run on kernel boot
if built-in, or load as a module. KUnit automatically flags and reports
failed test cases in the kernel log. The test results appear in `TAP
(Test Anything Protocol) format <https://testanything.org/>`_. It is inspired by
JUnit, Python's unittest.mock, and GoogleTest/GoogleMock (C++ unit testing
framework).
KUnit consists of a kernel component, which provides a set of macros for easily
writing unit tests. Tests written against KUnit will run on kernel boot if
built-in, or when loaded if built as a module. These tests write out results to
the kernel log in `TAP <https://testanything.org/>`_ format.
KUnit tests are part of the kernel, written in the C (programming)
language, and test parts of the kernel implementation (for example, a C
language function). Excluding build time, from invocation to
completion, KUnit can run around 100 tests in less than 10 seconds.
KUnit can test any kernel component, for example: file system, system
calls, memory management, device drivers and so on.
To make running these tests (and reading the results) easier, KUnit offers
:doc:`kunit_tool <kunit-tool>`, which builds a `User Mode Linux
<http://user-mode-linux.sourceforge.net>`_ kernel, runs it, and parses the test
results. This provides a quick way of running KUnit tests during development,
without requiring a virtual machine or separate hardware.
KUnit follows the white-box testing approach. The test has access to
internal system functionality. KUnit runs in kernel space and is not
restricted to things exposed to user-space.
Get started now: Documentation/dev-tools/kunit/start.rst
In addition, KUnit has kunit_tool, a script (``tools/testing/kunit/kunit.py``)
that configures the Linux kernel, runs KUnit tests under QEMU or UML (`User Mode
Linux <http://user-mode-linux.sourceforge.net/>`_), parses the test results and
displays them in a user friendly manner.
Why KUnit?
==========
Features
--------
A unit test is supposed to test a single unit of code in isolation, hence the
name. A unit test should be the finest granularity of testing and as such should
allow all possible code paths to be tested in the code under test; this is only
possible if the code under test is very small and does not have any external
dependencies outside of the test's control like hardware.
- Provides a framework for writing unit tests.
- Runs tests on any kernel architecture.
- Runs a test in milliseconds.
KUnit provides a common framework for unit tests within the kernel.
Prerequisites
-------------
KUnit tests can be run on most architectures, and most tests are architecture
independent. All built-in KUnit tests run on kernel startup. Alternatively,
KUnit and KUnit tests can be built as modules and tests will run when the test
module is loaded.
- Any Linux kernel compatible hardware.
- For Kernel under test, Linux kernel version 5.5 or greater.
.. note::
Unit Testing
============
KUnit can also run tests without needing a virtual machine or actual
hardware under User Mode Linux. User Mode Linux is a Linux architecture,
like ARM or x86, which compiles the kernel as a Linux executable. KUnit
can be used with UML either by building with ``ARCH=um`` (like any other
architecture), or by using :doc:`kunit_tool <kunit-tool>`.
A unit test tests a single unit of code in isolation. A unit test is the finest
granularity of testing and allows all possible code paths to be tested in the
code under test. This is possible if the code under test is small and does not
have any external dependencies outside of the test's control like hardware.
KUnit is fast. Excluding build time, from invocation to completion KUnit can run
several dozen tests in only 10 to 20 seconds; this might not sound like a big
deal to some people, but having such fast and easy to run tests fundamentally
changes the way you go about testing and even writing code in the first place.
Linus himself said in his `git talk at Google
<https://gist.github.com/lorn/1272686/revisions#diff-53c65572127855f1b003db4064a94573R874>`_:
"... a lot of people seem to think that performance is about doing the
same thing, just doing it faster, and that is not true. That is not what
performance is all about. If you can do something really fast, really
well, people will start using it differently."
Write Unit Tests
----------------
In this context Linus was talking about branching and merging,
but this point also applies to testing. If your tests are slow, unreliable, are
difficult to write, and require a special setup or special hardware to run,
then you wait a lot longer to write tests, and you wait a lot longer to run
tests; this means that tests are likely to break, unlikely to test a lot of
things, and are unlikely to be rerun once they pass. If your tests are really
fast, you run them all the time, every time you make a change, and every time
someone sends you some code. Why trust that someone ran all their tests
correctly on every change when you can just run them yourself in less time than
it takes to read their test log?
To write good unit tests, there is a simple but powerful pattern:
Arrange-Act-Assert. This is a great way to structure test cases and
defines an order of operations.
- Arrange inputs and targets: At the start of the test, arrange the data
that allows a function to work. Example: initialize a statement or
object.
- Act on the target behavior: Call your function/code under test.
- Assert expected outcome: Verify that the result (or resulting state) is as
expected.
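The Arrange-Act-Assert steps above can be sketched as follows. KUnit tests themselves are written in C, so this Python sketch only illustrates the ordering of the three steps. The names ``clamp`` and ``test_clamp_upper_bound`` are hypothetical and are not part of KUnit.

```python
def clamp(value, low, high):
    """Hypothetical function under test: limit value to [low, high]."""
    return max(low, min(value, high))


def test_clamp_upper_bound():
    # Arrange: set up the inputs the function needs.
    value, low, high = 150, 0, 100

    # Act: call the code under test exactly once.
    result = clamp(value, low, high)

    # Assert: verify the outcome matches the expectation.
    assert result == 100
    return result


test_clamp_upper_bound()
```

Keeping the three steps visually separate makes it easy to see which line failed and what the test was trying to establish.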
Unit Testing Advantages
-----------------------
- Increases testing speed and development in the long run.
- Detects bugs at an early stage and therefore decreases bug fix cost
compared to acceptance testing.
- Improves code quality.
- Encourages writing testable code.
How do I use it?
================
* Documentation/dev-tools/kunit/start.rst - for new users of KUnit
* Documentation/dev-tools/kunit/tips.rst - for short examples of best practices
* Documentation/dev-tools/kunit/usage.rst - for a more detailed explanation of KUnit features
* Documentation/dev-tools/kunit/api/index.rst - for the list of KUnit APIs used for testing
* Documentation/dev-tools/kunit/kunit-tool.rst - for more information on the kunit_tool helper script
* Documentation/dev-tools/kunit/faq.rst - for answers to some common questions about KUnit
* Documentation/dev-tools/kunit/start.rst - for new users of KUnit.
* Documentation/dev-tools/kunit/architecture.rst - KUnit architecture.
* Documentation/dev-tools/kunit/run_wrapper.rst - run kunit_tool.
* Documentation/dev-tools/kunit/run_manual.rst - run tests without kunit_tool.
* Documentation/dev-tools/kunit/usage.rst - write tests.
* Documentation/dev-tools/kunit/tips.rst - best practices with
examples.
* Documentation/dev-tools/kunit/api/index.rst - KUnit APIs
used for testing.
* Documentation/dev-tools/kunit/kunit-tool.rst - kunit_tool helper
script.
* Documentation/dev-tools/kunit/faq.rst - KUnit common questions and
answers.

View File

@ -0,0 +1,81 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg width="796.93" height="555.73" version="1.1" viewBox="0 0 796.93 555.73" xmlns="http://www.w3.org/2000/svg">
<g transform="translate(-13.724 -17.943)">
<g fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a">
<rect x="323.56" y="18.443" width="115.75" height="41.331"/>
<rect x="323.56" y="463.09" width="115.75" height="41.331"/>
<rect x="323.56" y="531.84" width="115.75" height="41.331"/>
<rect x="323.56" y="88.931" width="115.75" height="74.231"/>
</g>
<g>
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(0 -258.6)">
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(0 -217.27)">
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(0 -175.94)">
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(0 -134.61)">
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(0 -41.331)">
<rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
<text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
</g>
<g transform="translate(3.4459e-5 -.71088)">
<rect x="502.19" y="143.16" width="201.13" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
<text x="512.02319" y="168.02026" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="512.02319" y="168.02026" font-family="monospace">_kunit_suites_start</tspan></text>
</g>
<g transform="translate(3.0518e-5 -3.1753)">
<rect x="502.19" y="445.69" width="201.13" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
<text x="521.61694" y="470.54846" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="521.61694" y="470.54846" font-family="monospace">_kunit_suites_end</tspan></text>
</g>
<rect x="14.224" y="277.78" width="134.47" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
<text x="32.062176" y="304.41287" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="32.062176" y="304.41287" font-family="monospace">.init.data</tspan></text>
<g transform="translate(217.98 145.12)" stroke="#1a1a1a">
<circle cx="149.97" cy="373.01" r="3.4012"/>
<circle cx="163.46" cy="373.01" r="3.4012"/>
<circle cx="176.95" cy="373.01" r="3.4012"/>
</g>
<g transform="translate(217.98 -298.66)" stroke="#1a1a1a">
<circle cx="149.97" cy="373.01" r="3.4012"/>
<circle cx="163.46" cy="373.01" r="3.4012"/>
<circle cx="176.95" cy="373.01" r="3.4012"/>
</g>
<g stroke="#1a1a1a">
<rect x="323.56" y="328.49" width="115.75" height="51.549" fill="#b9dbc6"/>
<g transform="translate(217.98 -18.75)">
<circle cx="149.97" cy="373.01" r="3.4012"/>
<circle cx="163.46" cy="373.01" r="3.4012"/>
<circle cx="176.95" cy="373.01" r="3.4012"/>
</g>
</g>
<g transform="scale(1.0933 .9147)" stroke-width="32.937" aria-label="{">
<path d="m275.49 545.57c-35.836-8.432-47.43-24.769-47.957-64.821v-88.536c-0.527-44.795-10.54-57.97-49.538-67.456 38.998-10.013 49.011-23.715 49.538-67.983v-88.536c0.527-40.052 12.121-56.389 47.957-64.821v-5.797c-65.348 0-85.901 17.391-86.955 73.253v93.806c-0.527 36.89-10.013 50.065-44.795 59.551 34.782 10.013 44.268 23.188 44.795 60.078v93.279c1.581 56.389 21.607 73.78 86.955 73.78z"/>
</g>
<g transform="scale(1.1071 .90325)" stroke-width="14.44" aria-label="{">
<path d="m461.46 443.55c-15.711-3.6967-20.794-10.859-21.025-28.418v-38.815c-0.23104-19.639-4.6209-25.415-21.718-29.574 17.097-4.3898 21.487-10.397 21.718-29.805v-38.815c0.23105-17.559 5.314-24.722 21.025-28.418v-2.5415c-28.649 0-37.66 7.6244-38.122 32.115v41.126c-0.23105 16.173-4.3898 21.949-19.639 26.108 15.249 4.3898 19.408 10.166 19.639 26.339v40.895c0.69313 24.722 9.4728 32.346 38.122 32.346z"/>
</g>
<path d="m449.55 161.84v2.5h49.504v-2.5z" color="#000000" style="-inkscape-stroke:none"/>
<g fill-rule="evenodd">
<path d="m443.78 163.09 8.65-5v10z" color="#000000" stroke-width="1pt" style="-inkscape-stroke:none"/>
<path d="m453.1 156.94-10.648 6.1543 0.99804 0.57812 9.6504 5.5781zm-1.334 2.3125v7.6856l-6.6504-3.8438z" color="#000000" style="-inkscape-stroke:none"/>
</g>
<path d="m449.55 461.91v2.5h49.504v-2.5z" color="#000000" style="-inkscape-stroke:none"/>
<g fill-rule="evenodd">
<path d="m443.78 463.16 8.65-5v10z" color="#000000" stroke-width="1pt" style="-inkscape-stroke:none"/>
<path d="m453.1 457-10.648 6.1562 0.99804 0.57617 9.6504 5.5781zm-1.334 2.3125v7.6856l-6.6504-3.8438z" color="#000000" style="-inkscape-stroke:none"/>
</g>
<rect x="515.64" y="223.9" width="294.52" height="178.49" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
<text x="523.33319" y="262.52542" font-family="monospace" font-size="14.667px" style="line-height:1.25" xml:space="preserve"><tspan x="523.33319" y="262.52542"><tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit_suite {</tspan><tspan x="523.33319" y="280.8588"><tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold"> const char</tspan> name[<tspan fill="#ff00ff" font-size="14.667px">256</tspan>];</tspan><tspan x="523.33319" y="299.19217"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">int</tspan> (*init)(<tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit *);</tspan><tspan x="523.33319" y="317.52554"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">void</tspan> (*exit)(<tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit *);</tspan><tspan x="523.33319" y="335.85892"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit_case *test_cases;</tspan><tspan x="523.33319" y="354.19229"> ...</tspan><tspan x="523.33319" y="372.52567">};</tspan></text>
</g>
</svg>


View File

@ -0,0 +1,57 @@
.. SPDX-License-Identifier: GPL-2.0
============================
Run Tests without kunit_tool
============================
If we do not want to use kunit_tool (for example: we want to integrate
with other systems, or run tests on real hardware), we can
include KUnit in any kernel, read out the results, and parse them manually.
.. note:: KUnit is not designed for use in a production system. It is
possible that tests may reduce the stability or security of
the system.
Configure the Kernel
====================
KUnit tests can run without kunit_tool. This can be useful if:
- We have an existing kernel configuration to test.
- We need to run on real hardware (or using an emulator/VM kunit_tool
does not support).
- We wish to integrate with some existing testing systems.
KUnit is configured with the ``CONFIG_KUNIT`` option, and individual
tests can also be built by enabling their config options in our
``.config``. KUnit tests usually (but don't always) have config options
ending in ``_KUNIT_TEST``. Most tests can either be built as a module,
or be built into the kernel.
.. note ::
We can enable the ``KUNIT_ALL_TESTS`` config option to
automatically enable all tests with satisfied dependencies. This is
a good way of quickly testing everything applicable to the current
config.
Once we have built our kernel (and/or modules), it is simple to run
the tests. If the tests are built-in, they will run automatically on the
kernel boot. The results will be written to the kernel log (``dmesg``)
in TAP format.
If the tests are built as modules, they will run when the module is
loaded.
.. code-block :: bash
# modprobe example-test
The results will appear in TAP format in ``dmesg``.
.. note ::
If ``CONFIG_KUNIT_DEBUGFS`` is enabled, KUnit test results will
be accessible from the ``debugfs`` filesystem (if mounted).
They will be in ``/sys/kernel/debug/kunit/<test_suite>/results``, in
TAP format.

View File

@ -0,0 +1,247 @@
.. SPDX-License-Identifier: GPL-2.0
=========================
Run Tests with kunit_tool
=========================
We can either run KUnit tests using kunit_tool or can run tests
manually, and then use kunit_tool to parse the results. To run tests
manually, see: Documentation/dev-tools/kunit/run_manual.rst.
As long as we can build the kernel, we can run KUnit.
kunit_tool is a Python script which configures and builds a kernel, runs
tests, and formats the test results.
Run command:
.. code-block::
./tools/testing/kunit/kunit.py run
We should see the following:
.. code-block::
Generating .config...
Building KUnit kernel...
Starting KUnit kernel...
We may want to use the following options:
.. code-block::
./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
- ``--timeout`` sets a maximum amount of time for tests to run.
- ``--jobs`` sets the number of threads to build the kernel.
kunit_tool will generate a ``.kunitconfig`` with a default
configuration, if no other ``.kunitconfig`` file exists
(in the build directory). In addition, it verifies that the
generated ``.config`` file contains the ``CONFIG`` options in the
``.kunitconfig``.
It is also possible to pass a separate ``.kunitconfig`` fragment to
kunit_tool. This is useful if we have several different groups of
tests we want to run independently, or if we want to use pre-defined
test configs for certain subsystems.
To use a different ``.kunitconfig`` file (such as one
provided to test a particular subsystem), pass it as an option:
.. code-block::
./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig
To view kunit_tool flags (optional command-line arguments), run:
.. code-block::
./tools/testing/kunit/kunit.py run --help
Create a ``.kunitconfig`` File
===============================
If we want to run a specific set of tests (rather than those listed
in the KUnit ``defconfig``), we can provide Kconfig options in the
``.kunitconfig`` file. For default .kunitconfig, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/kunit/configs/default.config.
A ``.kunitconfig`` is a ``minconfig`` (a .config
generated by running ``make savedefconfig``), used for running a
specific set of tests. This file contains the regular Kernel configs
with specific test targets. The ``.kunitconfig`` also
contains any other config options required by the tests (for example:
dependencies for features under test, configs that enable/disable
certain code blocks, arch configs and so on).
To create a ``.kunitconfig``, using the KUnit ``defconfig``:
.. code-block::
cd $PATH_TO_LINUX_REPO
cp tools/testing/kunit/configs/default.config .kunit/.kunitconfig
We can then add any other Kconfig options. For example:
.. code-block::
CONFIG_LIST_KUNIT_TEST=y
kunit_tool ensures that all config options in ``.kunitconfig`` are
set in the kernel ``.config`` before running the tests. It warns if we
have not included the options' dependencies.
.. note:: Removing something from the ``.kunitconfig`` will
not rebuild the ``.config`` file. The configuration is only
updated if the ``.kunitconfig`` is not a subset of ``.config``.
This means that we can use other tools
(for example: ``make menuconfig``) to adjust other config options.
The build dir needs to be set for ``make menuconfig`` to
work, therefore by default use ``make O=.kunit menuconfig``.
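The subset rule described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not kunit_tool's actual implementation: it treats every line starting with ``#`` as a comment (including ``# CONFIG_FOO is not set`` lines), and the helper names ``parse_kconfig`` and ``is_subset`` are hypothetical.

```python
def parse_kconfig(text):
    """Map option names to values, skipping blank lines and comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        options[name] = value
    return options


def is_subset(kunitconfig_text, config_text):
    """True if every fragment option is in .config with the same value."""
    wanted = parse_kconfig(kunitconfig_text)
    actual = parse_kconfig(config_text)
    return all(actual.get(name) == value for name, value in wanted.items())


kunitconfig = "CONFIG_KUNIT=y\nCONFIG_LIST_KUNIT_TEST=y\n"
config = "CONFIG_KUNIT=y\nCONFIG_LIST_KUNIT_TEST=y\nCONFIG_PRINTK=y\n"

print(is_subset(kunitconfig, config))
print(is_subset(kunitconfig + "CONFIG_FAT_KUNIT_TEST=y\n", config))
```

The first call prints ``True`` because the ``.config`` already contains both options, so nothing would need regenerating. The second prints ``False`` because the added option is missing from ``.config``.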
Configure, Build, and Run Tests
===============================
If we want to make manual changes to the KUnit build process, we
can run part of the KUnit build process independently.
With kunit_tool, we can generate a ``.config`` from a ``.kunitconfig``
by using the ``config`` argument:
.. code-block::
./tools/testing/kunit/kunit.py config
To build a KUnit kernel from the current ``.config``, we can use the
``build`` argument:
.. code-block::
./tools/testing/kunit/kunit.py build
If we have already built a UML kernel with built-in KUnit tests, we
can run the kernel, and display the test results with the ``exec``
argument:
.. code-block::
./tools/testing/kunit/kunit.py exec
The ``run`` command discussed in section: **Run Tests with kunit_tool**,
is equivalent to running the above three commands in sequence.
Parse Test Results
==================
KUnit tests output their results in TAP (Test Anything Protocol)
format. When running tests, kunit_tool parses this output and prints
a summary. To see the raw test results in TAP format, we can pass the
``--raw_output`` argument:
.. code-block::
./tools/testing/kunit/kunit.py run --raw_output
If we have KUnit results in the raw TAP format, we can parse them and
print the human-readable summary with the ``parse`` command for
kunit_tool. This accepts a filename for an argument, or will read from
standard input.
.. code-block:: bash
# Reading from a file
./tools/testing/kunit/kunit.py parse /var/log/dmesg
# Reading from stdin
dmesg | ./tools/testing/kunit/kunit.py parse
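To make the idea of parsing TAP output concrete, here is a minimal Python sketch that collects passed and failed test names from result lines. It is not kunit_tool's parser: it assumes plain ``ok N - name`` and ``not ok N - name`` lines, and it does not handle ``dmesg`` timestamps or directives such as skips. The sample output and the ``summarize`` helper are made up for illustration.

```python
import re

# Matches TAP result lines such as "ok 1 - name" or "not ok 2 - name",
# allowing leading indentation for nested subtests.
RESULT_LINE = re.compile(r"^\s*(not ok|ok) (\d+) - (.*)$")


def summarize(tap_text):
    """Return (passed, failed) lists of test names found in TAP output."""
    passed, failed = [], []
    for line in tap_text.splitlines():
        match = RESULT_LINE.match(line)
        if not match:
            continue  # plan lines, diagnostics, and other log noise
        status, _number, name = match.groups()
        (passed if status == "ok" else failed).append(name)
    return passed, failed


sample = """\
TAP version 14
1..2
    # Subtest: example
    1..2
    ok 1 - example_simple_test
    not ok 2 - example_mock_test
not ok 1 - example
ok 2 - other_suite
"""

passed, failed = summarize(sample)
print(f"passed: {passed}")
print(f"failed: {failed}")
```

Note that a suite is reported as failed when any of its test cases fails, which is why ``example`` lands in the failed list alongside its failing case.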
Run Selected Test Suites
========================
By passing a bash style glob filter to the ``exec`` or ``run``
commands, we can run a subset of the tests built into a kernel. For
example: if we only want to run KUnit resource tests, use:
.. code-block::
./tools/testing/kunit/kunit.py run 'kunit-resource*'
This uses the standard glob format with wildcard characters.
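As a rough illustration of how such a wildcard pattern selects suites, the sketch below uses Python's ``fnmatch`` module. It only demonstrates glob semantics; how kunit_tool and the kernel apply the filter may differ in detail. The suite names in the list and the ``filter_suites`` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical suite names; only the glob itself comes from the text above.
suites = ["kunit-resource-test", "kunit-try-catch-test", "list-kunit-test", "example"]


def filter_suites(names, pattern):
    """Keep the names that match a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(filter_suites(suites, "kunit-resource*"))
print(filter_suites(suites, "*kunit*"))
```

The first pattern keeps only names that begin with ``kunit-resource``, while the second keeps every name containing ``kunit``.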
Run Tests on qemu
=================
kunit_tool supports running tests on qemu as well as
via UML. To run tests on qemu, by default it requires two flags:
- ``--arch``: Selects a collection of configs (Kconfig, qemu config options
and so on) that allows KUnit tests to be run on the specified
architecture in a minimal way. The architecture argument is the same as
the option name passed to the ``ARCH`` variable used by Kbuild.
Not all architectures currently support this flag, but we can use
``--qemu_config`` to handle it. If ``um`` is passed (or this flag
is omitted), the tests will run via UML. Non-UML architectures
(for example: i386, x86_64, arm and so on) run on qemu.
- ``--cross_compile``: Specifies the Kbuild toolchain. It passes the
same argument as passed to the ``CROSS_COMPILE`` variable used by
Kbuild. As a reminder, this will be the prefix for the toolchain
binaries such as GCC. For example:
- ``sparc64-linux-gnu`` if we have the sparc toolchain installed on
our system.
- ``$HOME/toolchains/microblaze/gcc-9.2.0-nolibc/microblaze-linux/bin/microblaze-linux``
if we have downloaded the microblaze toolchain from the 0-day
website to a directory in our home directory called toolchains.
If we want to run KUnit tests on an architecture not supported by
the ``--arch`` flag, or want to run KUnit tests on qemu using a
non-default configuration, then we can write our own ``QemuConfig``.
These ``QemuConfigs`` are written in Python. They have an import line
``from ..qemu_config import QemuArchParams`` at the top of the file.
The file must contain a variable called ``QEMU_ARCH`` that has an
instance of ``QemuArchParams`` assigned to it. See example in:
``tools/testing/kunit/qemu_configs/x86_64.py``.
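The general shape of such a file can be sketched as below. So that the sketch runs on its own, it defines a local stand-in for ``QemuArchParams`` instead of using the relative import. The field names and values are assumptions modelled on the in-tree x86_64 configuration and should be checked against ``tools/testing/kunit/qemu_configs/x86_64.py`` before use.

```python
from typing import List, NamedTuple


class QemuArchParams(NamedTuple):
    """Local stand-in so the sketch runs on its own (field names are assumed)."""
    linux_arch: str
    kconfig: str
    qemu_arch: str
    kernel_path: str
    kernel_command_line: str
    extra_qemu_params: List[str]


# The file must expose its settings through a variable named QEMU_ARCH.
QEMU_ARCH = QemuArchParams(
    linux_arch="x86_64",
    kconfig="CONFIG_SERIAL_8250=y\nCONFIG_SERIAL_8250_CONSOLE=y",
    qemu_arch="x86_64",
    kernel_path="arch/x86/boot/bzImage",
    kernel_command_line="console=ttyS0",
    extra_qemu_params=[],
)

print(QEMU_ARCH.linux_arch, QEMU_ARCH.kernel_path)
```

In a real ``QemuConfig`` file, the stand-in class would be dropped in favour of the import line described above.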
Once we have a ``QemuConfig``, we can pass it into kunit_tool,
using the ``--qemu_config`` flag. When used, this flag replaces the
``--arch`` flag. For example: using
``tools/testing/kunit/qemu_configs/x86_64.py``, the invocation appears
as:
.. code-block:: bash
./tools/testing/kunit/kunit.py run \
--timeout=60 \
--jobs=12 \
--qemu_config=./tools/testing/kunit/qemu_configs/x86_64.py
To run existing KUnit tests on non-UML architectures, see:
Documentation/dev-tools/kunit/non_uml.rst.
Command-Line Arguments
======================
kunit_tool has a number of other command-line arguments which can
be useful for our test environment. Below are the most commonly used
command-line arguments:
- ``--help``: Lists all available options. To list common options,
place ``--help`` before the command. To list options specific to that
command, place ``--help`` after the command.
.. note:: Different commands (``config``, ``build``, ``run``, etc)
have different supported options.
- ``--build_dir``: Specifies the kunit_tool build directory. It includes
the ``.kunitconfig``, ``.config`` files and compiled kernel.
- ``--make_options``: Specifies additional options to pass to make, when
compiling a kernel (using ``build`` or ``run`` commands). For example:
to enable compiler warnings, we can pass ``--make_options W=1``.
- ``--alltests``: Builds a UML kernel with all config options enabled
using ``make allyesconfig``. This allows us to run as many tests as
possible.
.. note:: It is slow and prone to breakage as new options are
added or modified. Instead, enable all tests
which have satisfied dependencies by adding
``CONFIG_KUNIT_ALL_TESTS=y`` to your ``.kunitconfig``.

View File

@ -4,132 +4,137 @@
Getting Started
===============
Installing dependencies
Installing Dependencies
=======================
KUnit has the same dependencies as the Linux kernel. As long as you can build
the kernel, you can run KUnit.
KUnit has the same dependencies as the Linux kernel. As long as you can
build the kernel, you can run KUnit.
Running tests with the KUnit Wrapper
====================================
Included with KUnit is a simple Python wrapper which runs tests under User Mode
Linux, and formats the test results.
The wrapper can be run with:
Running tests with kunit_tool
=============================
kunit_tool is a Python script, which configures and builds a kernel, runs
tests, and formats the test results. From the kernel repository, you
can run kunit_tool:
.. code-block:: bash
./tools/testing/kunit/kunit.py run
For more information on this wrapper (also called kunit_tool) check out the
Documentation/dev-tools/kunit/kunit-tool.rst page.
For more information on this wrapper, see:
Documentation/dev-tools/kunit/run_wrapper.rst.
Creating a .kunitconfig
-----------------------
If you want to run a specific set of tests (rather than those listed in the
KUnit defconfig), you can provide Kconfig options in the ``.kunitconfig`` file.
This file essentially contains the regular Kernel config, with the specific
test targets as well. The ``.kunitconfig`` should also contain any other config
options required by the tests.
Creating a ``.kunitconfig``
---------------------------
A good starting point for a ``.kunitconfig`` is the KUnit defconfig:
By default, kunit_tool runs a selection of tests. However, you can specify which
unit tests to run by creating a ``.kunitconfig`` file with kernel config options
that enable only a specific set of tests and their dependencies.
The ``.kunitconfig`` file contains a list of kconfig options which are required
to run the desired targets. The ``.kunitconfig`` also contains any other test
specific config options, such as test dependencies. For example, the
``FAT_FS`` tests (``FAT_KUNIT_TEST``) depend on
``FAT_FS``. ``FAT_FS`` can be enabled by selecting either ``MSDOS_FS``
or ``VFAT_FS``. To run ``FAT_KUNIT_TEST``, the ``.kunitconfig`` has:
.. code-block:: none
CONFIG_KUNIT=y
CONFIG_MSDOS_FS=y
CONFIG_FAT_KUNIT_TEST=y
1. A good starting point for the ``.kunitconfig``, is the KUnit default
config. Run the command:
.. code-block:: bash
cd $PATH_TO_LINUX_REPO
cp tools/testing/kunit/configs/default.config .kunitconfig
You can then add any other Kconfig options you wish, e.g.:
.. note ::
You may want to remove CONFIG_KUNIT_ALL_TESTS from the ``.kunitconfig`` as
it will enable a number of additional tests that you may not want.
2. You can then add any other Kconfig options, for example:
.. code-block:: none
CONFIG_LIST_KUNIT_TEST=y
:doc:`kunit_tool <kunit-tool>` will ensure that all config options set in
``.kunitconfig`` are set in the kernel ``.config`` before running the tests.
It'll warn you if you haven't included the dependencies of the options you're
using.
Before running the tests, kunit_tool ensures that all config options
set in ``.kunitconfig`` are set in the kernel ``.config``. It will warn
you if you have not included dependencies for the options used.
.. note::
Note that removing something from the ``.kunitconfig`` will not trigger a
rebuild of the ``.config`` file: the configuration is only updated if the
``.kunitconfig`` is not a subset of ``.config``. This means that you can use
other tools (such as make menuconfig) to adjust other config options.
.. note ::
If you change the ``.kunitconfig``, kunit.py will trigger a rebuild of the
``.config`` file. But you can edit the ``.config`` file directly or with
tools like ``make menuconfig O=.kunit``. As long as it's a superset of
``.kunitconfig``, kunit.py won't overwrite your changes.
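The superset rule described in the notes above can be illustrated with a small Python sketch. This is not the code kunit.py uses; ``parse_config`` and ``is_subset`` are hypothetical helpers, and the sketch assumes that only ``NAME=value`` lines matter (comment lines such as ``# CONFIG_FOO is not set`` are ignored).

```python
def parse_config(text):
    """Return a dict mapping option names to values from Kconfig-style text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (including "# CONFIG_FOO is not set").
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def is_subset(kunitconfig_text, config_text):
    """True if every option in .kunitconfig has the same value in .config."""
    wanted = parse_config(kunitconfig_text)
    actual = parse_config(config_text)
    return all(actual.get(name) == value for name, value in wanted.items())


# Illustrative inputs only; real files are much longer.
kunitconfig = "CONFIG_KUNIT=y\nCONFIG_LIST_KUNIT_TEST=y\n"
config = "CONFIG_KUNIT=y\nCONFIG_LIST_KUNIT_TEST=y\nCONFIG_PRINTK=y\n"
print(is_subset(kunitconfig, config))
```

Under this simplified model, extra options in ``.config`` do not break the subset relation, which is why hand edits made with other tools can survive as long as everything in ``.kunitconfig`` is still present.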
Running the tests (KUnit Wrapper)
---------------------------------
To make sure that everything is set up correctly, simply invoke the Python
wrapper from your kernel repo:
Running Tests (KUnit Wrapper)
-----------------------------
1. To make sure that everything is set up correctly, invoke the Python
wrapper from your kernel repository:
.. code-block:: bash
./tools/testing/kunit/kunit.py run
.. note::
You may want to run ``make mrproper`` first.
If everything worked correctly, you should see the following:
.. code-block:: bash
.. code-block::
Generating .config ...
Building KUnit Kernel ...
Starting KUnit Kernel ...
followed by a list of tests that are run. All of them should be passing.
The tests will pass or fail.
.. note::
Because it is building a lot of sources for the first time, the
``Building KUnit kernel`` step may take a while.
.. note ::
Because it is building a lot of sources for the first time, the
``Building KUnit Kernel`` step may take a while.
Running tests without the KUnit Wrapper
Running Tests without the KUnit Wrapper
=======================================
If you do not want to use the KUnit Wrapper (for example, because you
want the code under test to integrate with other systems, or you use a
different or unsupported architecture or configuration), KUnit can be
included in any kernel, and the results can be read out and parsed manually.
If you'd rather not use the KUnit Wrapper (if, for example, you need to
integrate with other systems, or use an architecture other than UML), KUnit can
be included in any kernel, and the results read out and parsed manually.
.. note ::
``CONFIG_KUNIT`` should not be enabled in a production environment.
Enabling KUnit disables Kernel Address-Space Layout Randomization
(KASLR), and tests may affect the state of the kernel in ways not
suitable for production.
.. note::
KUnit is not designed for use in a production system, and it's possible that
tests may reduce the stability or security of the system.
Configuring the kernel
Configuring the Kernel
----------------------
To enable KUnit itself, you need to enable the ``CONFIG_KUNIT`` Kconfig
option (under Kernel Hacking/Kernel Testing and Coverage in
``menuconfig``). From there, you can enable any KUnit tests. They
usually have config options ending in ``_KUNIT_TEST``.
In order to enable KUnit itself, you simply need to enable the ``CONFIG_KUNIT``
Kconfig option (it's under Kernel Hacking/Kernel Testing and Coverage in
menuconfig). From there, you can enable any KUnit tests you want: they usually
have config options ending in ``_KUNIT_TEST``.
KUnit and KUnit tests can be compiled as modules. The tests in a module
will run when the module is loaded.
KUnit and KUnit tests can be compiled as modules: in this case the tests in a
module will be run when the module is loaded.
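Because KUnit test options usually end in ``_KUNIT_TEST``, a quick way to see which tests a configuration enables is to filter the config by that suffix. The Python sketch below is only a heuristic: ``enabled_kunit_tests`` is a hypothetical helper, and tests whose option names do not follow the naming convention are missed.

```python
def enabled_kunit_tests(config_text):
    """List options ending in _KUNIT_TEST that are built in (y) or modular (m)."""
    tests = []
    for line in config_text.splitlines():
        # Lines without "=" (blank lines, "# ... is not set") are skipped.
        name, sep, value = line.strip().partition("=")
        if sep and name.endswith("_KUNIT_TEST") and value in ("y", "m"):
            tests.append((name, value))
    return tests


# Illustrative config fragment, not taken from a real build.
sample_config = """\
CONFIG_KUNIT=y
CONFIG_FAT_KUNIT_TEST=y
CONFIG_LIST_KUNIT_TEST=m
# CONFIG_EXT4_KUNIT_TEST is not set
"""
for name, value in enabled_kunit_tests(sample_config):
    print(name, "built-in" if value == "y" else "module")
```

Options reported as ``module`` correspond to tests that run only when the module is loaded.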
Running the tests (w/o KUnit Wrapper)
Running Tests (without KUnit Wrapper)
-------------------------------------
Build and run your kernel. The test output is printed to the kernel log
in the TAP format. By default, this happens only if KUnit and the tests
are built in; otherwise, the test module must be loaded first.
Build and run your kernel as usual. Test output will be written to the kernel
log in `TAP <https://testanything.org/>`_ format.
.. note ::
Some lines and/or data may get interspersed in the TAP output.
.. note::
It's possible that there will be other lines and/or data interspersed in the
TAP output.
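Since unrelated lines can be interspersed in the TAP output, a manual parser has to pick out the result lines and ignore everything else. The following Python sketch shows one way to do that. The sample log, the timestamp prefix format, and the ``parse_results`` helper are assumptions made for illustration; this is not the parser that kunit_tool uses.

```python
import re

# Matches TAP result lines such as "ok 1 - name" or "not ok 2 - name",
# optionally preceded by a kernel log timestamp like "[    0.123456] ".
RESULT_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s)?\s*(not ok|ok) (\d+)(?: - (.*))?$"
)


def parse_results(log_text):
    """Return (number, name, passed) tuples for each TAP result line."""
    results = []
    for line in log_text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            status, number, name = match.groups()
            results.append((int(number), name or "", status == "ok"))
    return results


# Hypothetical kernel log excerpt with one unrelated line mixed in.
sample_log = """\
[    0.060000] TAP version 14
[    0.060000] 1..1
[    0.060000]     # Subtest: misc-example
[    0.060000]     1..2
[    0.060000]     ok 1 - misc_example_add_test_basic
[    0.061000] random: crng init done
[    0.062000]     not ok 2 - misc_example_test_failure
[    0.062000] not ok 1 - misc-example
"""
for number, name, passed in parse_results(sample_log):
    print(number, name, "PASSED" if passed else "FAILED")
```

The sketch flattens nested subtests into one list; a fuller parser would track indentation to associate test cases with their suite.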
Writing your first test
Writing Your First Test
=======================
In your kernel repository, let's add some code that we can test.
In your kernel repo let's add some code that we can test. Create a file
``drivers/misc/example.h`` with the contents:
1. Create a file ``drivers/misc/example.h``, which includes:
.. code-block:: c
int misc_example_add(int left, int right);
create a file ``drivers/misc/example.c``:
2. Create a file ``drivers/misc/example.c``, which includes:
.. code-block:: c
@ -142,21 +147,22 @@ create a file ``drivers/misc/example.c``:
return left + right;
}
Now add the following lines to ``drivers/misc/Kconfig``:
3. Add the following lines to ``drivers/misc/Kconfig``:
.. code-block:: kconfig
config MISC_EXAMPLE
bool "My example"
and the following lines to ``drivers/misc/Makefile``:
4. Add the following lines to ``drivers/misc/Makefile``:
.. code-block:: make
obj-$(CONFIG_MISC_EXAMPLE) += example.o
Now we are ready to write the test. The test will be in
``drivers/misc/example-test.c``:
Now we are ready to write the test cases.
1. Add the below test case in ``drivers/misc/example_test.c``:
.. code-block:: c
@ -191,7 +197,7 @@ Now we are ready to write the test. The test will be in
};
kunit_test_suite(misc_example_test_suite);
Now add the following to ``drivers/misc/Kconfig``:
2. Add the following lines to ``drivers/misc/Kconfig``:
.. code-block:: kconfig
@ -200,20 +206,20 @@ Now add the following to ``drivers/misc/Kconfig``:
depends on MISC_EXAMPLE && KUNIT=y
default KUNIT_ALL_TESTS
and the following to ``drivers/misc/Makefile``:
3. Add the following lines to ``drivers/misc/Makefile``:
.. code-block:: make
obj-$(CONFIG_MISC_EXAMPLE_TEST) += example-test.o
obj-$(CONFIG_MISC_EXAMPLE_TEST) += example_test.o
Now add it to your ``.kunitconfig``:
4. Add the following lines to ``.kunitconfig``:
.. code-block:: none
CONFIG_MISC_EXAMPLE=y
CONFIG_MISC_EXAMPLE_TEST=y
Now you can run the test:
5. Run the test:
.. code-block:: bash
@ -227,16 +233,23 @@ You should see the following failure:
[16:08:57] [PASSED] misc-example:misc_example_add_test_basic
[16:08:57] [FAILED] misc-example:misc_example_test_failure
[16:08:57] EXPECTATION FAILED at drivers/misc/example-test.c:17
[16:08:57] This test never passes.
[16:08:57] This test never passes.
...
Congrats! You just wrote your first KUnit test!
Congrats! You just wrote your first KUnit test.
Next Steps
==========
* Check out the Documentation/dev-tools/kunit/tips.rst page for tips on
writing idiomatic KUnit tests.
* Check out the :doc:`running_tips` page for tips on
how to make running KUnit tests easier.
* Optional: see the :doc:`usage` page for a more
in-depth explanation of KUnit.
* Documentation/dev-tools/kunit/architecture.rst - KUnit architecture.
* Documentation/dev-tools/kunit/run_wrapper.rst - run kunit_tool.
* Documentation/dev-tools/kunit/run_manual.rst - run tests without kunit_tool.
* Documentation/dev-tools/kunit/usage.rst - write tests.
* Documentation/dev-tools/kunit/tips.rst - best practices with
examples.
* Documentation/dev-tools/kunit/api/index.rst - KUnit APIs
used for testing.
* Documentation/dev-tools/kunit/kunit-tool.rst - kunit_tool helper
script.
* Documentation/dev-tools/kunit/faq.rst - KUnit common questions and
answers.
