mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2024-12-29 01:05:29 +00:00
Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
commit e9a83bd232
@@ -137,7 +137,8 @@ Description: Discover cpuidle policy and mechanism
 current_governor: (RW) displays current idle policy. Users can
 switch the governor at runtime by writing to this file.

-See files in Documentation/cpuidle/ for more information.
+See Documentation/admin-guide/pm/cpuidle.rst and
+Documentation/driver-api/pm/cpuidle.rst for more information.


 What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/name

@@ -11,4 +11,4 @@ Description:
 example would be, if User A has shares = 1024 and user
 B has shares = 2048, User B will get twice the CPU
 bandwidth user A will. For more details refer
-Documentation/scheduler/sched-design-CFS.txt
+Documentation/scheduler/sched-design-CFS.rst
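The proportional split described in the context lines of this hunk can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming both users are continuously runnable and compete for the same CPU; `bandwidth_fractions` is a hypothetical helper, not a kernel interface.

```python
from fractions import Fraction


def bandwidth_fractions(shares):
    """Return each user's fraction of CPU bandwidth, proportional to its shares.

    `shares` maps a user name to its share value. This is a hypothetical
    helper for illustration only; it is not part of any kernel API.
    """
    total = sum(shares.values())
    return {user: Fraction(value, total) for user, value in shares.items()}


# The values from the description: User A has 1024 shares, User B has 2048.
split = bandwidth_fractions({"A": 1024, "B": 2048})
print(split["A"], split["B"])
```

Under these assumptions User B's fraction is exactly twice User A's, which matches the wording of the ABI description.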

@@ -198,7 +198,7 @@ call to set the mask to the value returned.
 ::

 size_t
-dma_direct_max_mapping_size(struct device *dev);
+dma_max_mapping_size(struct device *dev);

 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and

@@ -1,3 +1,9 @@
+:orphan:
+
+====
+EDID
+====
+
 In the good old days when graphics parameters were configured explicitly
 in a file called xorg.conf, even broken hardware could be managed.

@@ -34,16 +40,19 @@ Makefile. Please note that the EDID data structure expects the timing
 values in a different way as compared to the standard X11 format.

 X11:
-HTimings: hdisp hsyncstart hsyncend htotal
-VTimings: vdisp vsyncstart vsyncend vtotal
+HTimings:
+hdisp hsyncstart hsyncend htotal
+VTimings:
+vdisp vsyncstart vsyncend vtotal

-EDID:
-#define XPIX hdisp
-#define XBLANK htotal-hdisp
-#define XOFFSET hsyncstart-hdisp
-#define XPULSE hsyncend-hsyncstart
+EDID::

-#define YPIX vdisp
-#define YBLANK vtotal-vdisp
-#define YOFFSET vsyncstart-vdisp
-#define YPULSE vsyncend-vsyncstart
+#define XPIX hdisp
+#define XBLANK htotal-hdisp
+#define XOFFSET hsyncstart-hdisp
+#define XPULSE hsyncend-hsyncstart
+
+#define YPIX vdisp
+#define YBLANK vtotal-vdisp
+#define YOFFSET vsyncstart-vdisp
+#define YPULSE vsyncend-vsyncstart
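The mapping from X11 timings to the EDID macros in this hunk is plain subtraction, which a short Python sketch can make concrete. The function name `x11_to_edid` and the sample 1024x768 timing values are assumptions chosen for illustration; they do not come from the patch.

```python
def x11_to_edid(hdisp, hsyncstart, hsyncend, htotal,
                vdisp, vsyncstart, vsyncend, vtotal):
    """Convert X11-style timings to the values used by the EDID macros.

    Hypothetical helper: it mirrors the #define lines in the hunk above.
    """
    return {
        "XPIX": hdisp,
        "XBLANK": htotal - hdisp,
        "XOFFSET": hsyncstart - hdisp,
        "XPULSE": hsyncend - hsyncstart,
        "YPIX": vdisp,
        "YBLANK": vtotal - vdisp,
        "YOFFSET": vsyncstart - vdisp,
        "YPULSE": vsyncend - vsyncstart,
    }


# Assumed sample timings for a 1024x768 mode (illustrative values only).
edid = x11_to_edid(1024, 1048, 1184, 1344, 768, 771, 777, 806)
for name, value in edid.items():
    print(f"#define {name} {value}")
```

The point of the sketch is that X11 stores absolute positions along each scan line, while the EDID macros expect widths and offsets relative to the visible area.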

Documentation/Kconfig (normal file, 13 lines changed)
@@ -0,0 +1,13 @@
+config WARN_MISSING_DOCUMENTS
+
+bool "Warn if there's a missing documentation file"
+depends on COMPILE_TEST
+help
+It is not uncommon that a document gets renamed.
+This option makes the Kernel to check for missing dependencies,
+warning when something is missing. Works only if the Kernel
+is built from a git tree.
+
+If unsure, select 'N'.
+
+

@@ -4,6 +4,11 @@

 subdir-y := devicetree/bindings/

+# Check for broken documentation file references
+ifeq ($(CONFIG_WARN_MISSING_DOCUMENTS),y)
+$(shell $(srctree)/scripts/documentation-file-ref-check --warn)
+endif
+
 # You can set these variables from the command line.
 SPHINXBUILD = sphinx-build
 SPHINXOPTS =
@@ -23,11 +28,13 @@ ifeq ($(HAVE_SPHINX),0)
 .DEFAULT:
 $(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
 @echo
-@./scripts/sphinx-pre-install
+@$(srctree)/scripts/sphinx-pre-install
 @echo " SKIP Sphinx $@ target."

 else # HAVE_SPHINX

+export SPHINXOPTS = $(shell perl -e 'open IN,"sphinx-build --version 2>&1 |"; while (<IN>) { if (m/([\d\.]+)/) { print "-jauto" if ($$1 >= "1.7") } ;} close IN')
+
 # User-friendly check for pdflatex and latexmk
 HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi)
 HAVE_LATEXMK := $(shell if which latexmk >/dev/null 2>&1; then echo 1; else echo 0; fi)
@@ -70,12 +77,14 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
 $(abspath $(BUILDDIR)/$3/$4)

 htmldocs:
+@$(srctree)/scripts/sphinx-pre-install --version-check
 @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))

 linkcheckdocs:
 @$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var)))

 latexdocs:
+@$(srctree)/scripts/sphinx-pre-install --version-check
 @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var)))

 ifeq ($(HAVE_PDFLATEX),0)
@@ -87,14 +96,17 @@ pdfdocs:
 else # HAVE_PDFLATEX

 pdfdocs: latexdocs
+@$(srctree)/scripts/sphinx-pre-install --version-check
 $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;)

 endif # HAVE_PDFLATEX

 epubdocs:
+@$(srctree)/scripts/sphinx-pre-install --version-check
 @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var)))

 xmldocs:
+@$(srctree)/scripts/sphinx-pre-install --version-check
 @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var)))

 endif # HAVE_SPHINX
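The new `SPHINXOPTS` line in this Makefile enables `-jauto` only when the installed Sphinx reports version 1.7 or later. The decision can be sketched in Python as below. This is not a port of the Perl one-liner: it compares (major, minor) tuples, which is an assumption about the intent, and `sphinx_jobs_option` is a hypothetical name. The sketch parses a version string passed in by the caller and does not run `sphinx-build` itself.

```python
import re


def sphinx_jobs_option(version_output, minimum=(1, 7)):
    """Return "-jauto" if the reported Sphinx version is at least `minimum`.

    `version_output` is the text printed by `sphinx-build --version`.
    Returns an empty string when no version is found or it is too old.
    """
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        return ""
    version = (int(match.group(1)), int(match.group(2)))
    return "-jauto" if version >= minimum else ""


print(repr(sphinx_jobs_option("sphinx-build 1.7.9")))
print(repr(sphinx_jobs_option("sphinx-build 1.4.9")))
```

Comparing integer tuples avoids the pitfalls of comparing dotted version strings as floating-point numbers or as text.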

@@ -1,17 +1,19 @@
-RCU on Uniprocessor Systems
+.. _up_doc:

+RCU on Uniprocessor Systems
+===========================

 A common misconception is that, on UP systems, the call_rcu() primitive
 may immediately invoke its function. The basis of this misconception
 is that since there is only one CPU, it should not be necessary to
 wait for anything else to get done, since there are no other CPUs for
-anything else to be happening on. Although this approach will -sort- -of-
+anything else to be happening on. Although this approach will *sort of*
 work a surprising amount of the time, it is a very bad idea in general.
 This document presents three examples that demonstrate exactly how bad
 an idea this is.

-
 Example 1: softirq Suicide
+--------------------------

 Suppose that an RCU-based algorithm scans a linked list containing
 elements A, B, and C in process context, and can delete elements from
@@ -28,8 +30,8 @@ your kernel.
 This same problem can occur if call_rcu() is invoked from a hardware
 interrupt handler.

-
 Example 2: Function-Call Fatality
+---------------------------------

 Of course, one could avert the suicide described in the preceding example
 by having call_rcu() directly invoke its arguments only if it was called
@@ -46,11 +48,13 @@ its arguments would cause it to fail to make the fundamental guarantee
 underlying RCU, namely that call_rcu() defers invoking its arguments until
 all RCU read-side critical sections currently executing have completed.

-Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
-this case?
+Quick Quiz #1:
+Why is it *not* legal to invoke synchronize_rcu() in this case?

+:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`

 Example 3: Death by Deadlock
+----------------------------

 Suppose that call_rcu() is invoked while holding a lock, and that the
 callback function must acquire this same lock. In this case, if
@@ -76,25 +80,30 @@ there are cases where this can be quite ugly:
 If call_rcu() directly invokes the callback, painful locking restrictions
 or API changes would be required.

-Quick Quiz #2: What locking restriction must RCU callbacks respect?
+Quick Quiz #2:
+What locking restriction must RCU callbacks respect?

+:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`

 Summary
+-------

 Permitting call_rcu() to immediately invoke its arguments breaks RCU,
 even on a UP system. So do not do it! Even on a UP system, the RCU
-infrastructure -must- respect grace periods, and -must- invoke callbacks
+infrastructure *must* respect grace periods, and *must* invoke callbacks
 from a known environment in which no locks are held.

-Note that it -is- safe for synchronize_rcu() to return immediately on
-UP systems, including !PREEMPT SMP builds running on UP systems.
+Note that it *is* safe for synchronize_rcu() to return immediately on
+UP systems, including PREEMPT SMP builds running on UP systems.

-Quick Quiz #3: Why can't synchronize_rcu() return immediately on
-UP systems running preemptable RCU?
+Quick Quiz #3:
+Why can't synchronize_rcu() return immediately on UP systems running
+preemptable RCU?

+.. _answer_quick_quiz_up:

 Answer to Quick Quiz #1:
-Why is it -not- legal to invoke synchronize_rcu() in this case?
+Why is it *not* legal to invoke synchronize_rcu() in this case?

 Because the calling function is scanning an RCU-protected linked
 list, and is therefore within an RCU read-side critical section.
@@ -104,12 +113,13 @@ Answer to Quick Quiz #1:
 Answer to Quick Quiz #2:
 What locking restriction must RCU callbacks respect?

-Any lock that is acquired within an RCU callback must be
-acquired elsewhere using an _irq variant of the spinlock
-primitive. For example, if "mylock" is acquired by an
-RCU callback, then a process-context acquisition of this
-lock must use something like spin_lock_irqsave() to
-acquire the lock.
+Any lock that is acquired within an RCU callback must be acquired
+elsewhere using an _bh variant of the spinlock primitive.
+For example, if "mylock" is acquired by an RCU callback, then
+a process-context acquisition of this lock must use something
+like spin_lock_bh() to acquire the lock. Please note that
+it is also OK to use _irq variants of spinlocks, for example,
+spin_lock_irqsave().

 If the process-context code were to simply use spin_lock(),
 then, since RCU callbacks can be invoked from softirq context,
@@ -119,7 +129,7 @@ Answer to Quick Quiz #2:

 This restriction might seem gratuitous, since very few RCU
 callbacks acquire locks directly. However, a great many RCU
-callbacks do acquire locks -indirectly-, for example, via
+callbacks do acquire locks *indirectly*, for example, via
 the kfree() primitive.

 Answer to Quick Quiz #3:
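The deadlock argument in Example 3 of this file can be illustrated with a small user-space analogy. This is only a sketch: a non-reentrant `threading.Lock` stands in for a kernel spinlock, `call_rcu_immediate` and `call_rcu_deferred` are hypothetical helpers rather than kernel APIs, and a non-blocking acquire is used so that the would-be deadlock is recorded instead of hanging the program.

```python
import threading

lock = threading.Lock()   # non-reentrant, standing in for a spinlock
results = []              # records whether each callback got the lock
pending = []              # callbacks queued for later invocation


def callback():
    # The callback needs the same lock that the caller may be holding.
    # A blocking acquire here would hang forever if the caller holds it.
    acquired = lock.acquire(blocking=False)
    results.append(acquired)
    if acquired:
        lock.release()


def call_rcu_immediate(cb):
    # Broken variant: runs the callback in the caller's context.
    cb()


def call_rcu_deferred(cb):
    # Safe variant: only queues the callback.
    pending.append(cb)


def run_pending():
    # Runs queued callbacks from a context in which no locks are held.
    while pending:
        pending.pop(0)()


with lock:
    call_rcu_immediate(callback)   # callback cannot take the lock

with lock:
    call_rcu_deferred(callback)    # nothing runs while the lock is held
run_pending()                      # lock is free, callback succeeds

print(results)
```

The first entry in `results` corresponds to the immediate invocation and the second to the deferred one. In the kernel the failed attempt would be a spin that never ends, which is why callbacks must run from a known environment in which no locks are held.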

Documentation/RCU/index.rst (normal file, 19 lines changed)
@@ -0,0 +1,19 @@
+.. _rcu_concepts:
+
+============
+RCU concepts
+============
+
+.. toctree::
+:maxdepth: 1
+
+rcu
+listRCU
+UP
+
+.. only:: subproject and html
+
+Indices
+=======
+
+* :ref:`genindex`

@@ -1,5 +1,7 @@
-Using RCU to Protect Read-Mostly Linked Lists
+.. _list_rcu_doc:

+Using RCU to Protect Read-Mostly Linked Lists
+=============================================

 One of the best applications of RCU is to protect read-mostly linked lists
 ("struct list_head" in list.h). One big advantage of this approach
@@ -7,8 +9,8 @@ is that all of the required memory barriers are included for you in
 the list macros. This document describes several applications of RCU,
 with the best fits first.

-
 Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
+----------------------------------------------------------------------

 The best applications are cases where, if reader-writer locking were
 used, the read-side lock would be dropped before taking any action
@@ -24,7 +26,7 @@ added or deleted, rather than being modified in place.

 A straightforward example of this use of RCU may be found in the
 system-call auditing support. For example, a reader-writer locked
-implementation of audit_filter_task() might be as follows:
+implementation of audit_filter_task() might be as follows::

 static enum audit_state audit_filter_task(struct task_struct *tsk)
 {
@@ -48,7 +50,7 @@ the corresponding value is returned. By the time that this value is acted
 on, the list may well have been modified. This makes sense, since if
 you are turning auditing off, it is OK to audit a few extra system calls.

-This means that RCU can be easily applied to the read side, as follows:
+This means that RCU can be easily applied to the read side, as follows::

 static enum audit_state audit_filter_task(struct task_struct *tsk)
 {
@@ -73,7 +75,7 @@ become list_for_each_entry_rcu(). The _rcu() list-traversal primitives
 insert the read-side memory barriers that are required on DEC Alpha CPUs.

 The changes to the update side are also straightforward. A reader-writer
-lock might be used as follows for deletion and insertion:
+lock might be used as follows for deletion and insertion::

 static inline int audit_del_rule(struct audit_rule *rule,
 struct list_head *list)
@@ -106,7 +108,7 @@ lock might be used as follows for deletion and insertion:
 return 0;
 }

-Following are the RCU equivalents for these two functions:
+Following are the RCU equivalents for these two functions::

 static inline int audit_del_rule(struct audit_rule *rule,
 struct list_head *list)
@@ -154,13 +156,13 @@ otherwise cause concurrent readers to fail spectacularly.
 So, when readers can tolerate stale data and when entries are either added
 or deleted, without in-place modification, it is very easy to use RCU!

-
 Example 2: Handling In-Place Updates
+------------------------------------

 The system-call auditing code does not update auditing rules in place.
 However, if it did, reader-writer-locked code to do so might look as
 follows (presumably, the field_count is only permitted to decrease,
-otherwise, the added fields would need to be filled in):
+otherwise, the added fields would need to be filled in)::

 static inline int audit_upd_rule(struct audit_rule *rule,
 struct list_head *list,
@@ -187,7 +189,7 @@ otherwise, the added fields would need to be filled in):
 The RCU version creates a copy, updates the copy, then replaces the old
 entry with the newly updated entry. This sequence of actions, allowing
 concurrent reads while doing a copy to perform an update, is what gives
-RCU ("read-copy update") its name. The RCU code is as follows:
+RCU ("read-copy update") its name. The RCU code is as follows::

 static inline int audit_upd_rule(struct audit_rule *rule,
 struct list_head *list,
@@ -216,8 +218,8 @@ RCU ("read-copy update") its name. The RCU code is as follows:
 Again, this assumes that the caller holds audit_netlink_sem. Normally,
 the reader-writer lock would become a spinlock in this sort of code.

-
 Example 3: Eliminating Stale Data
+---------------------------------

 The auditing examples above tolerate stale data, as do most algorithms
 that are tracking external state. Because there is a delay from the
@@ -231,13 +233,16 @@ per-entry spinlock, and, if the "deleted" flag is set, pretends that the
 entry does not exist. For this to be helpful, the search function must
 return holding the per-entry spinlock, as ipc_lock() does in fact do.

-Quick Quiz: Why does the search function need to return holding the
-per-entry lock for this deleted-flag technique to be helpful?
+Quick Quiz:
+Why does the search function need to return holding the per-entry lock for
+this deleted-flag technique to be helpful?
+
+:ref:`Answer to Quick Quiz <answer_quick_quiz_list>`

 If the system-call audit module were to ever need to reject stale data,
 one way to accomplish this would be to add a "deleted" flag and a "lock"
 spinlock to the audit_entry structure, and modify audit_filter_task()
-as follows:
+as follows::

 static enum audit_state audit_filter_task(struct task_struct *tsk)
 {
@@ -268,7 +273,7 @@ audit_upd_rule() would need additional memory barriers to ensure
 that the list_add_rcu() was really executed before the list_del_rcu().

 The audit_del_rule() function would need to set the "deleted"
-flag under the spinlock as follows:
+flag under the spinlock as follows::

 static inline int audit_del_rule(struct audit_rule *rule,
 struct list_head *list)
@@ -290,8 +295,8 @@ flag under the spinlock as follows:
 return -EFAULT; /* No matching rule */
 }

-
 Summary
+-------

 Read-mostly list-based data structures that can tolerate stale data are
 the most amenable to use of RCU. The simplest case is where entries are
@@ -302,8 +307,9 @@ If stale data cannot be tolerated, then a "deleted" flag may be used
 in conjunction with a per-entry spinlock in order to allow the search
 function to reject newly deleted data.

+.. _answer_quick_quiz_list:

-Answer to Quick Quiz
+Answer to Quick Quiz:
 Why does the search function need to return holding the per-entry
 lock for this deleted-flag technique to be helpful?


Documentation/RCU/rcu.rst (normal file, 92 lines changed)
@@ -0,0 +1,92 @@
+.. _rcu_doc:
+
+RCU Concepts
+============
+
+The basic idea behind RCU (read-copy update) is to split destructive
+operations into two parts, one that prevents anyone from seeing the data
+item being destroyed, and one that actually carries out the destruction.
+A "grace period" must elapse between the two parts, and this grace period
+must be long enough that any readers accessing the item being deleted have
+since dropped their references. For example, an RCU-protected deletion
+from a linked list would first remove the item from the list, wait for
+a grace period to elapse, then free the element. See the
+Documentation/RCU/listRCU.rst file for more information on using RCU with
+linked lists.
+
+Frequently Asked Questions
+--------------------------
+
+- Why would anyone want to use RCU?
+
+The advantage of RCU's two-part approach is that RCU readers need
+not acquire any locks, perform any atomic instructions, write to
+shared memory, or (on CPUs other than Alpha) execute any memory
+barriers. The fact that these operations are quite expensive
+on modern CPUs is what gives RCU its performance advantages
+in read-mostly situations. The fact that RCU readers need not
+acquire locks can also greatly simplify deadlock-avoidance code.
+
+- How can the updater tell when a grace period has completed
+if the RCU readers give no indication when they are done?
+
+Just as with spinlocks, RCU readers are not permitted to
+block, switch to user-mode execution, or enter the idle loop.
+Therefore, as soon as a CPU is seen passing through any of these
+three states, we know that that CPU has exited any previous RCU
+read-side critical sections. So, if we remove an item from a
+linked list, and then wait until all CPUs have switched context,
+executed in user mode, or executed in the idle loop, we can
+safely free up that item.
+
+Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
+same effect, but require that the readers manipulate CPU-local
+counters. These counters allow limited types of blocking within
+RCU read-side critical sections. SRCU also uses CPU-local
+counters, and permits general blocking within RCU read-side
+critical sections. These variants of RCU detect grace periods
+by sampling these counters.
+
+- If I am running on a uniprocessor kernel, which can only do one
+thing at a time, why should I wait for a grace period?
+
+See the Documentation/RCU/UP.rst file for more information.
+
+- How can I see where RCU is currently used in the Linux kernel?
+
+Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
+"rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock",
+"srcu_read_unlock", "synchronize_rcu", "synchronize_net",
+"synchronize_srcu", and the other RCU primitives. Or grab one
+of the cscope databases from:
+
+(http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html).
+
+- What guidelines should I follow when writing code that uses RCU?
+
+See the checklist.txt file in this directory.
+
+- Why the name "RCU"?
+
+"RCU" stands for "read-copy update". The file Documentation/RCU/listRCU.rst
+has more information on where this name came from, search for
+"read-copy update" to find it.
+
+- I hear that RCU is patented? What is with that?
+
+Yes, it is. There are several known patents related to RCU,
+search for the string "Patent" in RTFP.txt to find them.
+Of these, one was allowed to lapse by the assignee, and the
+others have been contributed to the Linux kernel under GPL.
+There are now also LGPL implementations of user-level RCU
+available (http://liburcu.org/).
+
+- I hear that RCU needs work in order to support realtime kernels?
+
+Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
+kernel configuration parameter.
+
+- Where can I find more information on RCU?
+
+See the RTFP.txt file in this directory.
+Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
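The two-part deletion described at the top of this file (unlink the item, wait for a grace period, then free it) can be modelled with a toy, single-threaded Python sketch. Everything here is an assumption made for illustration: `ToyRcu` and its methods are hypothetical, readers are tracked by explicit IDs, and callbacks run only from an explicit `process_callbacks()` step. Real kernel RCU does not track individual readers this way; it infers grace periods from the CPU states described in the FAQ above.

```python
class ToyRcu:
    """Toy model of grace periods; not how the kernel implements RCU."""

    def __init__(self):
        self.active_readers = set()
        self.next_id = 0
        self.pending = []  # list of (callback, readers it must wait for)

    def read_lock(self):
        # Register a new reader and hand back its ID.
        reader_id = self.next_id
        self.next_id += 1
        self.active_readers.add(reader_id)
        return reader_id

    def read_unlock(self, reader_id):
        self.active_readers.discard(reader_id)

    def call_rcu(self, callback):
        # Defer the callback until every reader active right now is done.
        self.pending.append((callback, set(self.active_readers)))

    def process_callbacks(self):
        # Invoke callbacks whose grace period has elapsed; keep the rest.
        still_waiting = []
        for callback, waiting_for in self.pending:
            waiting_for &= self.active_readers
            if waiting_for:
                still_waiting.append((callback, waiting_for))
            else:
                callback()
        self.pending = still_waiting


rcu = ToyRcu()
items = ["A", "B", "C"]
freed = []

reader = rcu.read_lock()                  # a reader may still see "B"
items.remove("B")                         # part 1: unlink from the list
rcu.call_rcu(lambda: freed.append("B"))   # part 2: deferred destruction
rcu.process_callbacks()
print(freed)                              # the pre-existing reader blocks it

late_reader = rcu.read_lock()             # started after the removal
rcu.read_unlock(reader)                   # the old reader finishes
rcu.process_callbacks()
print(freed)                              # grace period over despite late_reader
```

The design point the sketch tries to capture is that a grace period only waits for readers that were already running when the item was unlinked; readers that start later cannot reach the item and therefore do not delay its destruction.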

@@ -1,89 +0,0 @@
-RCU Concepts
-
-
-The basic idea behind RCU (read-copy update) is to split destructive
-operations into two parts, one that prevents anyone from seeing the data
-item being destroyed, and one that actually carries out the destruction.
-A "grace period" must elapse between the two parts, and this grace period
-must be long enough that any readers accessing the item being deleted have
-since dropped their references. For example, an RCU-protected deletion
-from a linked list would first remove the item from the list, wait for
-a grace period to elapse, then free the element. See the listRCU.txt
-file for more information on using RCU with linked lists.
-
-
-Frequently Asked Questions
-
-o Why would anyone want to use RCU?
-
-The advantage of RCU's two-part approach is that RCU readers need
-not acquire any locks, perform any atomic instructions, write to
-shared memory, or (on CPUs other than Alpha) execute any memory
-barriers. The fact that these operations are quite expensive
-on modern CPUs is what gives RCU its performance advantages
-in read-mostly situations. The fact that RCU readers need not
-acquire locks can also greatly simplify deadlock-avoidance code.
-
-o How can the updater tell when a grace period has completed
-if the RCU readers give no indication when they are done?
-
-Just as with spinlocks, RCU readers are not permitted to
-block, switch to user-mode execution, or enter the idle loop.
-Therefore, as soon as a CPU is seen passing through any of these
-three states, we know that that CPU has exited any previous RCU
-read-side critical sections. So, if we remove an item from a
-linked list, and then wait until all CPUs have switched context,
-executed in user mode, or executed in the idle loop, we can
-safely free up that item.
-
-Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
-same effect, but require that the readers manipulate CPU-local
-counters. These counters allow limited types of blocking within
-RCU read-side critical sections. SRCU also uses CPU-local
-counters, and permits general blocking within RCU read-side
-critical sections. These variants of RCU detect grace periods
-by sampling these counters.
-
-o If I am running on a uniprocessor kernel, which can only do one
-thing at a time, why should I wait for a grace period?
-
-See the UP.txt file in this directory.
-
-o How can I see where RCU is currently used in the Linux kernel?
-
-Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
-"rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock",
-"srcu_read_unlock", "synchronize_rcu", "synchronize_net",
-"synchronize_srcu", and the other RCU primitives. Or grab one
-of the cscope databases from:
-
-http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html
-
-o What guidelines should I follow when writing code that uses RCU?
-
-See the checklist.txt file in this directory.
-
-o Why the name "RCU"?
-
-"RCU" stands for "read-copy update". The file listRCU.txt has
-more information on where this name came from, search for
-"read-copy update" to find it.
-
-o I hear that RCU is patented? What is with that?
-
-Yes, it is. There are several known patents related to RCU,
-search for the string "Patent" in RTFP.txt to find them.
-Of these, one was allowed to lapse by the assignee, and the
-others have been contributed to the Linux kernel under GPL.
-There are now also LGPL implementations of user-level RCU
-available (http://liburcu.org/).
-
-o I hear that RCU needs work in order to support realtime kernels?
-
-Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
-kernel configuration parameter.
-
-o Where can I find more information on RCU?
-
-See the RTFP.txt file in this directory.
-Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.

@@ -1,3 +1,5 @@
+:orphan:
+
 ========================================================
 OpenCAPI (Open Coherent Accelerator Processor Interface)
 ========================================================

@@ -96,4 +96,4 @@ where
 <URL:http://www.uefi.org/sites/default/files/resources/_DSD-hierarchical-data-extension-UUID-v1.1.pdf>,
 referenced 2019-02-21.

-[7] Documentation/acpi/dsd/data-node-reference.txt
+[7] Documentation/firmware-guide/acpi/dsd/data-node-references.rst

@@ -227,7 +227,7 @@ Configuring the kernel
 "make tinyconfig" Configure the tiniest possible kernel.

 You can find more information on using the Linux kernel config tools
-in Documentation/kbuild/kconfig.txt.
+in Documentation/kbuild/kconfig.rst.

 - NOTES on ``make config``:


@@ -90,7 +90,7 @@ the disk is not available then you have three options:
 run a null modem to a second machine and capture the output there
 using your favourite communication program. Minicom works well.

-(3) Use Kdump (see Documentation/kdump/kdump.txt),
+(3) Use Kdump (see Documentation/kdump/kdump.rst),
 extract the kernel ring buffer from old memory with using dmesg
 gdbmacro in Documentation/kdump/gdbmacros.txt.


@@ -9,5 +9,6 @@ are configurable at compile, boot or run time.
 .. toctree::
 :maxdepth: 1

+spectre
 l1tf
 mds

697
Documentation/admin-guide/hw-vuln/spectre.rst
Normal file
@ -0,0 +1,697 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Spectre Side Channels
|
||||
=====================
|
||||
|
||||
Spectre is a class of side channel attacks that exploit branch prediction
|
||||
and speculative execution on modern CPUs to read memory, possibly
|
||||
bypassing access controls. Speculative execution side channel exploits
|
||||
do not modify memory but attempt to infer privileged data in the memory.
|
||||
|
||||
This document covers Spectre variant 1 and Spectre variant 2.
|
||||
|
||||
Affected processors
|
||||
-------------------
|
||||
|
||||
Speculative execution side channel methods affect a wide range of modern
|
||||
high performance processors, since most modern high speed processors
|
||||
use branch prediction and speculative execution.
|
||||
|
||||
The following CPUs are vulnerable:
|
||||
|
||||
- Intel Core, Atom, Pentium, and Xeon processors
|
||||
|
||||
- AMD Phenom, EPYC, and Zen processors
|
||||
|
||||
- IBM POWER and zSeries processors
|
||||
|
||||
- Higher end ARM processors
|
||||
|
||||
- Apple CPUs
|
||||
|
||||
- Higher end MIPS CPUs
|
||||
|
||||
- Likely most other high performance CPUs. Contact your CPU vendor for details.
|
||||
|
||||
Whether a processor is affected or not can be read out from the Spectre
|
||||
vulnerability files in sysfs. See :ref:`spectre_sys_info`.
|
||||
|
||||
Related CVEs
|
||||
------------
|
||||
|
||||
The following CVE entries describe Spectre variants:
|
||||
|
||||
============= ======================= =================
|
||||
CVE-2017-5753 Bounds check bypass Spectre variant 1
|
||||
CVE-2017-5715 Branch target injection Spectre variant 2
|
||||
============= ======================= =================
|
||||
|
||||
Problem
|
||||
-------
|
||||
|
||||
CPUs use speculative operations to improve performance. That may leave
|
||||
traces of memory accesses or computations in the processor's caches,
|
||||
buffers, and branch predictors. Malicious software may be able to
|
||||
influence the speculative execution paths, and then use the side effects
|
||||
of the speculative execution in the CPUs' caches and buffers to infer
|
||||
privileged data touched during the speculative execution.
|
||||
|
||||
Spectre variant 1 attacks take advantage of speculative execution of
|
||||
conditional branches, while Spectre variant 2 attacks use speculative
|
||||
execution of indirect branches to leak privileged memory.
|
||||
See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
|
||||
:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
|
||||
|
||||
Spectre variant 1 (Bounds Check Bypass)
|
||||
---------------------------------------
|
||||
|
||||
The bounds check bypass attack :ref:`[2] <spec_ref2>` takes advantage
|
||||
of speculative execution that bypasses conditional branch instructions
|
||||
used for memory access bounds check (e.g. checking if the index of an
|
||||
array results in memory access within a valid range). This results in
|
||||
memory accesses to invalid memory (with out-of-bound index) that are
|
||||
done speculatively before validation checks resolve. Such speculative
|
||||
memory accesses can leave side effects, creating side channels which
|
||||
leak information to the attacker.
|
||||
|
||||
There are some extensions of Spectre variant 1 attacks for reading data
|
||||
over the network, see :ref:`[12] <spec_ref12>`. However, such attacks
|
||||
are difficult, low bandwidth, fragile, and are considered low risk.
|
||||
|
||||
Spectre variant 2 (Branch Target Injection)
|
||||
-------------------------------------------
|
||||
|
||||
The branch target injection attack takes advantage of speculative
|
||||
execution of indirect branches :ref:`[3] <spec_ref3>`. The indirect
|
||||
branch predictors inside the processor used to guess the target of
|
||||
indirect branches can be influenced by an attacker, causing gadget code
|
||||
to be speculatively executed, thus exposing sensitive data touched by
|
||||
the victim. The side effects left in the CPU's caches during speculative
|
||||
execution can be measured to infer data values.
|
||||
|
||||
.. _poison_btb:
|
||||
|
||||
In Spectre variant 2 attacks, the attacker can steer speculative indirect
|
||||
branches in the victim to gadget code by poisoning the branch target
|
||||
buffer of a CPU used for predicting indirect branch addresses. Such
|
||||
poisoning could be done by indirect branching into existing code,
|
||||
with the address offset of the indirect branch under the attacker's
|
||||
control. Since the branch prediction on impacted hardware does not
|
||||
fully disambiguate branch address and uses the offset for prediction,
|
||||
this could cause privileged code's indirect branch to jump to gadget
|
||||
code with the same offset.
|
||||
|
||||
The most useful gadgets take an attacker-controlled input parameter (such
|
||||
as a register value) so that the memory read can be controlled. Gadgets
|
||||
without input parameters might be possible, but the attacker would have
|
||||
very little control over what memory can be read, reducing the risk of
|
||||
the attack revealing useful data.
|
||||
|
||||
One other variant 2 attack vector is for the attacker to poison the
|
||||
return stack buffer (RSB) :ref:`[13] <spec_ref13>` to cause speculative
|
||||
subroutine return instruction execution to go to a gadget. An attacker's
|
||||
imbalanced subroutine call instructions might "poison" entries in the
|
||||
return stack buffer which are later consumed by a victim's subroutine
|
||||
return instructions. This attack can be mitigated by flushing the return
|
||||
stack buffer on context switch, or virtual machine (VM) exit.
|
||||
|
||||
On systems with simultaneous multi-threading (SMT), attacks are possible
|
||||
from the sibling thread, as level 1 cache and branch target buffer
|
||||
(BTB) may be shared between hardware threads in a CPU core. A malicious
|
||||
program running on the sibling thread may influence its peer's BTB to
|
||||
steer its indirect branch speculations to gadget code, and measure the
|
||||
speculative execution's side effects left in level 1 cache to infer the
|
||||
victim's data.
|
||||
|
||||
Attack scenarios
|
||||
----------------
|
||||
|
||||
The following attack scenarios have been anticipated, but may
|
||||
not cover all possible attack vectors.
|
||||
|
||||
1. A user process attacking the kernel
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The attacker passes a parameter to the kernel via a register or
|
||||
via a known address in memory during a syscall. Such a parameter may
|
||||
be used later by the kernel as an index to an array or to derive
|
||||
a pointer for a Spectre variant 1 attack. The index or pointer
|
||||
is invalid, but bounds checks are bypassed in the code branch taken
|
||||
for speculative execution. This could cause privileged memory to be
|
||||
accessed and leaked.
|
||||
|
||||
For kernel code that has been identified where data pointers could
|
||||
potentially be influenced for Spectre attacks, new "nospec" accessor
|
||||
macros are used to prevent speculative loading of data.
|
||||
|
||||
A Spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
|
||||
target buffer (BTB) before issuing a syscall to launch an attack.
|
||||
After entering the kernel, the kernel could use the poisoned branch
|
||||
target buffer on indirect jump and jump to gadget code in speculative
|
||||
execution.
|
||||
|
||||
If an attacker tries to control the memory addresses leaked during
|
||||
speculative execution, they would also need to pass a parameter to the
|
||||
gadget, either through a register or a known address in memory. After
|
||||
the gadget has executed, they can measure the side effect.
|
||||
|
||||
The kernel can protect itself against consuming poisoned branch
|
||||
target buffer entries by using return trampolines (also known as
|
||||
"retpoline") :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` for all
|
||||
indirect branches. Return trampolines trap speculative execution paths
|
||||
to prevent jumping to gadget code during speculative execution.
|
||||
x86 CPUs with Enhanced Indirect Branch Restricted Speculation
|
||||
(Enhanced IBRS) available in hardware should use the feature to
|
||||
mitigate Spectre variant 2 instead of retpoline. Enhanced IBRS is
|
||||
more efficient than retpoline.
|
||||
|
||||
There may be gadget code in firmware which could be exploited with
|
||||
a Spectre variant 2 attack by a rogue user process. To mitigate such
|
||||
attacks on x86, the Indirect Branch Restricted Speculation (IBRS) feature
|
||||
is turned on before the kernel invokes any firmware code.
|
||||
|
||||
2. A user process attacking another user process
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
A malicious user process can try to attack another user process,
|
||||
either via a context switch on the same hardware thread, or from the
|
||||
sibling hyperthread sharing a physical processor core on a simultaneous
|
||||
multi-threading (SMT) system.
|
||||
|
||||
Spectre variant 1 attacks generally require passing parameters
|
||||
between the processes, which needs a data passing relationship, such
|
||||
as remote procedure calls (RPC). Those parameters are used in gadget
|
||||
code to derive invalid data pointers accessing privileged memory in
|
||||
the attacked process.
|
||||
|
||||
Spectre variant 2 attacks can be launched from a rogue process by
|
||||
:ref:`poisoning <poison_btb>` the branch target buffer. This can
|
||||
influence the indirect branch targets for a victim process that either
|
||||
runs later on the same hardware thread, or runs concurrently on
|
||||
a sibling hardware thread sharing the same physical core.
|
||||
|
||||
A user process can protect itself against Spectre variant 2 attacks
|
||||
by using the prctl() syscall to disable indirect branch speculation
|
||||
for itself. An administrator can also cordon off an unsafe process
|
||||
from polluting the branch target buffer by disabling the process's
|
||||
indirect branch speculation. This comes with a performance cost
|
||||
from not using indirect branch speculation and clearing the branch
|
||||
target buffer. When SMT is enabled on x86, for a process that has
|
||||
indirect branch speculation disabled, Single Thread Indirect Branch
|
||||
Predictors (STIBP) :ref:`[4] <spec_ref4>` are turned on to prevent the
|
||||
sibling thread from controlling the branch target buffer. In addition,
|
||||
the Indirect Branch Prediction Barrier (IBPB) is issued to clear the
|
||||
branch target buffer when context switching to and from such a process.
|
||||
|
||||
On x86, the return stack buffer is stuffed on context switch.
|
||||
This prevents the branch target buffer from being used for branch
|
||||
prediction when the return stack buffer underflows while switching to
|
||||
a deeper call stack. Any poisoned entries in the return stack buffer
|
||||
left by the previous process will also be cleared.
|
||||
|
||||
User programs should use address space randomization to make attacks
|
||||
more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2).
|
||||
|
||||
3. A virtualized guest attacking the host
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The attack mechanism is similar to how user processes attack the
|
||||
kernel. The kernel is entered via hyper-calls or other virtualization
|
||||
exit paths.
|
||||
|
||||
For Spectre variant 1 attacks, rogue guests can pass parameters
|
||||
(e.g. in registers) via hyper-calls to derive invalid pointers to
|
||||
speculate into privileged memory after entering the kernel. For places
|
||||
where such kernel code has been identified, nospec accessor macros
|
||||
are used to stop speculative memory access.
|
||||
|
||||
For Spectre variant 2 attacks, rogue guests can :ref:`poison
|
||||
<poison_btb>` the branch target buffer or return stack buffer, causing
|
||||
the kernel to jump to gadget code in the speculative execution paths.
|
||||
|
||||
To mitigate variant 2, the host kernel can use return trampolines
|
||||
for indirect branches to bypass the poisoned branch target buffer,
|
||||
and flush the return stack buffer on VM exit. This prevents rogue
|
||||
guests from affecting indirect branching in the host kernel.
|
||||
|
||||
To protect host processes from rogue guests, host processes can have
|
||||
indirect branch speculation disabled via prctl(). The branch target
|
||||
buffer is cleared before context switching to such processes.
|
||||
|
||||
4. A virtualized guest attacking another guest
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
A rogue guest may attack another guest to get data accessible by the
|
||||
other guest.
|
||||
|
||||
Spectre variant 1 attacks are possible if parameters can be passed
|
||||
between guests. This may be done via mechanisms such as shared memory
|
||||
or message passing. Such parameters could be used to derive data
|
||||
pointers to privileged data in the guest. The privileged data could be
|
||||
accessed by gadget code in the victim's speculation paths.
|
||||
|
||||
Spectre variant 2 attacks can be launched from a rogue guest by
|
||||
:ref:`poisoning <poison_btb>` the branch target buffer or the return
|
||||
stack buffer. Such poisoned entries could be used to influence
|
||||
speculative execution paths in the victim guest.
|
||||
|
||||
The Linux kernel mitigates attacks on other guests running in the same
|
||||
CPU hardware thread by flushing the return stack buffer on VM exit,
|
||||
and clearing the branch target buffer before switching to a new guest.
|
||||
|
||||
If SMT is used, Spectre variant 2 attacks from an untrusted guest
|
||||
in the sibling hyperthread can be mitigated by the administrator,
|
||||
by turning off the unsafe guest's indirect branch speculation via
|
||||
prctl(). A guest can also protect itself by turning on microcode
|
||||
based mitigations (such as IBPB or STIBP on x86) within the guest.
|
||||
|
||||
.. _spectre_sys_info:
|
||||
|
||||
Spectre system information
|
||||
--------------------------
|
||||
|
||||
The Linux kernel provides a sysfs interface to enumerate the current
|
||||
mitigation status of the system for Spectre: whether the system is
|
||||
vulnerable, and which mitigations are active.
|
||||
|
||||
The sysfs file showing Spectre variant 1 mitigation status is:
|
||||
|
||||
/sys/devices/system/cpu/vulnerabilities/spectre_v1
|
||||
|
||||
The possible values in this file are:
|
||||
|
||||
======================================= =================================
|
||||
'Mitigation: __user pointer sanitation' Protection in kernel on a case by
|
||||
case basis with explicit pointer
|
||||
sanitation.
|
||||
======================================= =================================
|
||||
|
||||
However, the protections are put in place on a case by case basis,
|
||||
and there is no guarantee that all possible attack vectors for Spectre
|
||||
variant 1 are covered.
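As a rough illustration of reading these status files from a script, the following Python sketch returns the status line of one vulnerability file. The helper name ``read_vulnerability_status`` and its ``base_dir`` parameter are hypothetical conveniences, not kernel interfaces; only the sysfs directory itself is taken from the text above.

```python
from pathlib import Path

# Directory named in this document; every other name here is illustrative.
VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_vulnerability_status(name, base_dir=VULN_DIR):
    """Return the one-line status for e.g. 'spectre_v1', or None if unreadable."""
    path = Path(base_dir) / name
    try:
        return path.read_text().strip()
    except OSError:
        # Older kernels or other architectures may not expose the file.
        return None


if __name__ == "__main__":
    for entry in ("spectre_v1", "spectre_v2"):
        print(entry, "->", read_vulnerability_status(entry))
```

Passing a different ``base_dir`` makes it possible to exercise the helper against a saved copy of the directory instead of the live system.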
|
||||
|
||||
The spectre_v2 kernel file reports if the kernel has been compiled with
|
||||
retpoline mitigation or if the CPU has hardware mitigation, and if the
|
||||
CPU has support for additional process-specific mitigation.
|
||||
|
||||
This file also reports CPU features enabled by microcode to mitigate
|
||||
attacks between user processes:
|
||||
|
||||
1. Indirect Branch Prediction Barrier (IBPB) to add additional
|
||||
isolation between processes of different users.
|
||||
2. Single Thread Indirect Branch Predictors (STIBP) to add additional
|
||||
isolation between CPU threads running on the same core.
|
||||
|
||||
These CPU features may impact performance when used and can be enabled
|
||||
per process on a case-by-case basis.
|
||||
|
||||
The sysfs file showing Spectre variant 2 mitigation status is:
|
||||
|
||||
/sys/devices/system/cpu/vulnerabilities/spectre_v2
|
||||
|
||||
The possible values in this file are:
|
||||
|
||||
- Kernel status:
|
||||
|
||||
==================================== =================================
|
||||
'Not affected' The processor is not vulnerable
|
||||
'Vulnerable' Vulnerable, no mitigation
|
||||
'Mitigation: Full generic retpoline' Software-focused mitigation
|
||||
'Mitigation: Full AMD retpoline' AMD-specific software mitigation
|
||||
'Mitigation: Enhanced IBRS' Hardware-focused mitigation
|
||||
==================================== =================================
|
||||
|
||||
- Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
|
||||
used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
|
||||
|
||||
========== =============================================================
|
||||
'IBRS_FW' Protection against user program attacks when calling firmware
|
||||
========== =============================================================
|
||||
|
||||
- Indirect branch prediction barrier (IBPB) status for protection between
|
||||
processes of different users. This feature can be controlled through
|
||||
prctl() per process, or through kernel command line options. This is
|
||||
an x86 only feature. For more details see below.
|
||||
|
||||
=================== ========================================================
|
||||
'IBPB: disabled' IBPB unused
|
||||
'IBPB: always-on' Use IBPB on all tasks
|
||||
'IBPB: conditional' Use IBPB on SECCOMP or indirect branch restricted tasks
|
||||
=================== ========================================================
|
||||
|
||||
- Single Thread Indirect Branch Predictors (STIBP) status for protection
|
||||
between different hyper threads. This feature can be controlled through
|
||||
prctl() per process, or through kernel command line options. This is an x86
|
||||
only feature. For more details see below.
|
||||
|
||||
==================== ========================================================
|
||||
'STIBP: disabled' STIBP unused
|
||||
'STIBP: forced' Use STIBP on all tasks
|
||||
'STIBP: conditional' Use STIBP on SECCOMP or indirect branch restricted tasks
|
||||
==================== ========================================================
|
||||
|
||||
- Return stack buffer (RSB) protection status:
|
||||
|
||||
============= ===========================================
|
||||
'RSB filling' Protection of RSB on context switch enabled
|
||||
============= ===========================================
|
||||
|
||||
Full mitigation might require a microcode update from the CPU
|
||||
vendor. When the necessary microcode is not available, the kernel will
|
||||
report vulnerability.
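The status values listed above can be separated programmatically. The sketch below assumes that the spectre_v2 file reports the kernel status first, followed by the optional feature fields separated by commas; that separator and the sample line in the usage note are assumptions made for illustration, and ``parse_spectre_v2`` is a hypothetical helper name.

```python
def parse_spectre_v2(status):
    """Split a spectre_v2 status line into kernel status and feature flags."""
    fields = [f.strip() for f in status.split(",") if f.strip()]
    result = {"kernel": fields[0] if fields else None, "features": {}}
    for field in fields[1:]:
        # "IBPB: conditional" -> key "IBPB", value "conditional";
        # bare flags such as "IBRS_FW" or "RSB filling" map to True.
        key, sep, value = field.partition(":")
        result["features"][key.strip()] = value.strip() if sep else True
    return result
```

For an illustrative line such as ``Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling``, the helper keeps the first field as the kernel status and maps the remaining fields to the IBPB, firmware, STIBP, and RSB entries described in the tables above.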
|
||||
|
||||
Turning on mitigation for Spectre variant 1 and Spectre variant 2
|
||||
-----------------------------------------------------------------
|
||||
|
||||
1. Kernel mitigation
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
For Spectre variant 1, vulnerable kernel code (as determined
|
||||
by code audit or scanning tools) is annotated on a case by case
|
||||
basis to use nospec accessor macros for bounds clipping :ref:`[2]
|
||||
<spec_ref2>` to avoid any usable disclosure gadgets. However, it may
|
||||
not cover all attack vectors for Spectre variant 1.
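The arithmetic behind "bounds clipping" can be shown in a few lines. The sketch below is loosely modeled on the branch-free mask used by the kernel's generic nospec helpers; it is written in Python only to make the bit manipulation visible, and a Python interpreter gives no guarantees about speculation. The names ``index_mask`` and ``clip_index`` and the 64-bit word size are assumptions made for this illustration.

```python
BITS_PER_LONG = 64  # assumed word size for this illustration
WORD_MASK = (1 << BITS_PER_LONG) - 1


def index_mask(index, size):
    """Return all ones if 0 <= index < size, else 0, without branching."""
    index &= WORD_MASK
    # The top bit of v is set exactly when index >= size (for size <= 2**63).
    v = (index | (size - 1 - index)) & WORD_MASK
    in_bounds = ((~v) & WORD_MASK) >> (BITS_PER_LONG - 1)  # 1 or 0
    return (-in_bounds) & WORD_MASK


def clip_index(index, size):
    """Force an out-of-range index to 0 so it cannot reach past the array."""
    return index & index_mask(index, size)
```

Because the clipped index depends on the mask through data flow and not through a conditional branch, a mispredicted bounds check cannot supply an out-of-range index to the dependent load. In C, this is the property that the nospec accessors rely on.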
|
||||
|
||||
For Spectre variant 2 mitigation, the compiler turns indirect calls or
|
||||
jumps in the kernel into equivalent return trampolines (retpolines)
|
||||
:ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target
|
||||
addresses. Speculative execution paths under retpolines are trapped
|
||||
in an infinite loop to prevent any speculative execution jumping to
|
||||
a gadget.
|
||||
|
||||
To turn on retpoline mitigation on a vulnerable CPU, the kernel
|
||||
needs to be compiled with a gcc compiler that supports the
|
||||
-mindirect-branch=thunk-extern -mindirect-branch-register options.
|
||||
If the kernel is compiled with a Clang compiler, the compiler needs
|
||||
to support the -mretpoline-external-thunk option. The kernel config
|
||||
CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
|
||||
the latest updated microcode.
|
||||
|
||||
On Intel Skylake-era systems the mitigation covers most, but not all,
|
||||
cases. See :ref:`[3] <spec_ref3>` for more details.
|
||||
|
||||
On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
|
||||
IBRS on x86), retpoline is automatically disabled at run time.
|
||||
|
||||
The retpoline mitigation is turned on by default on vulnerable
|
||||
CPUs. It can be forced on or off by the administrator
|
||||
via the kernel command line and sysfs control files. See
|
||||
:ref:`spectre_mitigation_control_command_line`.
|
||||
|
||||
On x86, indirect branch restricted speculation is turned on by default
|
||||
before invoking any firmware code to prevent Spectre variant 2 exploits
|
||||
using the firmware.
|
||||
|
||||
Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
|
||||
and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
|
||||
attacks on the kernel generally more difficult.
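Whether a given kernel was built with the configuration options named in this section can be checked from its configuration file. The sketch below parses the text of such a file; where that file lives (for example under /boot or in /proc/config.gz) depends on the distribution and is not specified here, and ``enabled_options`` is a hypothetical helper name.

```python
def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to 'y' in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as "# CONFIG_FOO is not set"
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return enabled
```

A caller would read the configuration text once and then test membership, for example ``"CONFIG_RETPOLINE" in enabled_options(text)``.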
|
||||
|
||||
2. User program mitigation
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
User programs can mitigate Spectre variant 1 using LFENCE or "bounds
|
||||
clipping". For more details see :ref:`[2] <spec_ref2>`.
|
||||
|
||||
For Spectre variant 2 mitigation, individual user programs
|
||||
can be compiled with return trampolines for indirect branches.
|
||||
This protects them from consuming poisoned entries in the branch
|
||||
target buffer left by malicious software. Alternatively, the
|
||||
programs can disable their indirect branch speculation via prctl()
|
||||
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
|
||||
On x86, this will turn on STIBP to guard against attacks from the
|
||||
sibling thread when the user program is running, and use IBPB to
|
||||
flush the branch target buffer when switching to/from the program.
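A minimal sketch of the prctl() approach follows, using Python's ctypes to reach the C library. The numeric constants are those of the Linux ``<linux/prctl.h>`` UAPI header; the helper names are hypothetical. The call may fail, for example when the running kernel or its mitigation mode does not allow per-task control, so the sketch raises OSError in that case.

```python
import ctypes
import os

# Constants from the Linux UAPI header <linux/prctl.h>.
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_INDIRECT_BRANCH = 1
PR_SPEC_PRCTL = 1 << 0
PR_SPEC_ENABLE = 1 << 1
PR_SPEC_DISABLE = 1 << 2
PR_SPEC_FORCE_DISABLE = 1 << 3


def describe_spec_ctrl(value):
    """Decode a PR_GET_SPECULATION_CTRL result into readable flag names."""
    names = [
        (PR_SPEC_PRCTL, "prctl-controllable"),
        (PR_SPEC_ENABLE, "speculation enabled"),
        (PR_SPEC_DISABLE, "speculation disabled"),
        (PR_SPEC_FORCE_DISABLE, "force-disabled"),
    ]
    return [label for bit, label in names if value & bit] or ["not affected"]


def _prctl(option, arg2, arg3=0):
    # prctl() takes unsigned long arguments; the unused ones must be zero.
    libc = ctypes.CDLL(None, use_errno=True)
    args = [ctypes.c_ulong(a) for a in (arg2, arg3, 0, 0)]
    rc = libc.prctl(ctypes.c_int(option), *args)
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return rc


def disable_indirect_branch_speculation():
    """Restrict indirect branch speculation for the calling thread."""
    _prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE)
    state = _prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH)
    return describe_spec_ctrl(state)
```

A security-sensitive program would call ``disable_indirect_branch_speculation()`` early in its startup, before handling secrets, and treat an OSError as a sign that it must rely on the system-wide mitigation settings.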
|
||||
|
||||
Restricting indirect branch speculation on a user program will
|
||||
also prevent the program from launching a variant 2 attack
|
||||
on x86. All sand-boxed SECCOMP programs have indirect branch
|
||||
speculation restricted by default. Administrators can change
|
||||
that behavior via the kernel command line and sysfs control files.
|
||||
See :ref:`spectre_mitigation_control_command_line`.
|
||||
|
||||
Programs that disable their indirect branch speculation will have
|
||||
more overhead and run slower.
|
||||
|
||||
User programs should use address space randomization
|
||||
(/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more
|
||||
difficult.
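The recommended setting can be verified from a script. This is a small sketch under the assumption that the procfs file holds a single integer; the helper names are hypothetical, and the path is the one given above.

```python
def aslr_level(path="/proc/sys/kernel/randomize_va_space"):
    """Return the integer ASLR setting, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def aslr_enabled(level):
    """True for the values 1 and 2 recommended in the text above."""
    return level in (1, 2)
```

For example, ``aslr_enabled(aslr_level())`` evaluates to False both when randomization is switched off and when the file is unavailable, which is the conservative reading for an audit script.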
|
||||
|
||||
3. VM mitigation
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
Within the kernel, Spectre variant 1 attacks from rogue guests are
|
||||
mitigated on a case by case basis in VM exit paths. Vulnerable code
|
||||
uses nospec accessor macros for "bounds clipping", to avoid any
|
||||
usable disclosure gadgets. However, this may not cover all variant
|
||||
1 attack vectors.
|
||||
|
||||
For Spectre variant 2 attacks from rogue guests to the kernel, the
|
||||
Linux kernel uses retpoline or Enhanced IBRS to prevent consumption of
|
||||
poisoned entries in the branch target buffer left by rogue guests. It also
|
||||
flushes the return stack buffer on every VM exit. This prevents a return
|
||||
stack buffer underflow that would let a poisoned branch target buffer be used,
|
||||
and clears poisoned entries left in the return stack buffer by attacker guests.
|
||||
|
||||
To mitigate guest-to-guest attacks in the same CPU hardware thread,
|
||||
the branch target buffer is sanitized by flushing before switching
|
||||
to a new guest on a CPU.
|
||||
|
||||
The above mitigations are turned on by default on vulnerable CPUs.
|
||||
|
||||
To mitigate guest-to-guest attacks from a sibling thread when SMT is
|
||||
in use, an untrusted guest running in the sibling thread can have
|
||||
its indirect branch speculation disabled by the administrator via prctl().
|
||||
|
||||
The kernel also allows guests to use any microcode based mitigation
|
||||
they choose to use (such as IBPB or STIBP on x86) to protect themselves.
|
||||
|
||||
.. _spectre_mitigation_control_command_line:
|
||||
|
||||
Mitigation control on the kernel command line
|
||||
---------------------------------------------
|
||||
|
||||
Spectre variant 2 mitigation can be disabled or force enabled at the
|
||||
kernel command line.
|
||||
|
||||
nospectre_v2
|
||||
|
||||
[X86] Disable all mitigations for the Spectre variant 2
|
||||
(indirect branch prediction) vulnerability. System may
|
||||
allow data leaks with this option, which is equivalent
|
||||
to spectre_v2=off.
|
||||
|
||||
|
||||
spectre_v2=
|
||||
|
||||
[X86] Control mitigation of Spectre variant 2
|
||||
(indirect branch speculation) vulnerability.
|
||||
The default operation protects the kernel from
|
||||
user space attacks.
|
||||
|
||||
on
|
||||
unconditionally enable, implies
|
||||
spectre_v2_user=on
|
||||
off
|
||||
unconditionally disable, implies
|
||||
spectre_v2_user=off
|
||||
auto
|
||||
kernel detects whether your CPU model is
|
||||
vulnerable
|
||||
|
||||
Selecting 'on' will, and 'auto' may, choose a
|
||||
mitigation method at run time according to the
|
||||
CPU, the available microcode, the setting of the
|
||||
CONFIG_RETPOLINE configuration option, and the
|
||||
compiler with which the kernel was built.
|
||||
|
||||
Selecting 'on' will also enable the mitigation
|
||||
against user space to user space task attacks.
|
||||
|
||||
Selecting 'off' will disable both the kernel and
|
||||
the user space protections.
|
||||
|
||||
Specific mitigations can also be selected manually:
|
||||
|
||||
retpoline
|
||||
replace indirect branches
|
||||
retpoline,generic
|
||||
Google's original retpoline
|
||||
retpoline,amd
|
||||
AMD-specific minimal thunk
|
||||
|
||||
Not specifying this option is equivalent to
|
||||
spectre_v2=auto.
|
||||
|
||||
For user space mitigation:
|
||||
|
||||
spectre_v2_user=
|
||||
|
||||
[X86] Control mitigation of Spectre variant 2
|
||||
(indirect branch speculation) vulnerability between
|
||||
user space tasks
|
||||
|
||||
on
|
||||
Unconditionally enable mitigations. Is
|
||||
enforced by spectre_v2=on
|
||||
|
||||
off
|
||||
Unconditionally disable mitigations. Is
|
||||
enforced by spectre_v2=off
|
||||
|
||||
prctl
|
||||
Indirect branch speculation is enabled,
|
||||
but mitigation can be enabled via prctl
|
||||
per thread. The mitigation control state
|
||||
is inherited on fork.
|
||||
|
||||
prctl,ibpb
|
||||
Like "prctl" above, but only STIBP is
|
||||
controlled per thread. IBPB is issued
|
||||
always when switching between different user
|
||||
space processes.
|
||||
|
||||
seccomp
|
||||
Same as "prctl" above, but all seccomp
|
||||
threads will enable the mitigation unless
|
||||
they explicitly opt out.
|
||||
|
||||
seccomp,ibpb
|
||||
Like "seccomp" above, but only STIBP is
|
||||
controlled per thread. IBPB is issued
|
||||
always when switching between different
|
||||
user space processes.
|
||||
|
||||
auto
|
||||
Kernel selects the mitigation depending on
|
||||
the available CPU features and vulnerability.
|
||||
|
||||
Default mitigation:
|
||||
If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
|
||||
|
||||
Not specifying this option is equivalent to
|
||||
spectre_v2_user=auto.
|
||||
|
||||
In general the kernel by default selects
|
||||
reasonable mitigations for the current CPU. To
|
||||
disable Spectre variant 2 mitigations, boot with
|
||||
spectre_v2=off. Spectre variant 1 mitigations
|
||||
cannot be disabled.
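The interaction between these options can be summarized in a short sketch. It follows only the rules stated above: ``nospectre_v2`` is equivalent to ``spectre_v2=off``, ``spectre_v2=on`` and ``spectre_v2=off`` enforce the matching ``spectre_v2_user`` setting, and an unspecified option means ``auto``. The whitespace tokenization, the choice that a later occurrence overrides an earlier one, and the helper name ``effective_spectre_v2`` are assumptions of this illustration, not a description of the kernel's parser.

```python
def effective_spectre_v2(cmdline):
    """Return (spectre_v2, spectre_v2_user) implied by a kernel command line."""
    v2, user = "auto", "auto"
    disabled = False
    for token in cmdline.split():
        if token == "nospectre_v2":
            disabled = True
        elif token.startswith("spectre_v2_user="):
            user = token.split("=", 1)[1]
        elif token.startswith("spectre_v2="):
            v2 = token.split("=", 1)[1]
    if disabled:
        v2 = "off"
    # spectre_v2=on/off is documented as enforcing the matching user setting.
    if v2 in ("on", "off"):
        user = v2
    return v2, user
```

Feeding the contents of /proc/cmdline to this helper gives a quick view of what was requested at boot; the mitigation actually in effect is still the one reported in the sysfs files.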
|
||||
|
||||
Mitigation selection guide
|
||||
--------------------------
|
||||
|
||||
1. Trusted userspace
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If all userspace applications are from trusted sources and do not
|
||||
execute externally supplied untrusted code, then the mitigations can
|
||||
be disabled.
|
||||
|
||||
2. Protect sensitive programs
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
For security-sensitive programs that have secrets (e.g. crypto
|
||||
keys), protection against Spectre variant 2 can be put in place by
|
||||
disabling indirect branch speculation when the program is running
|
||||
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
|
||||
|
||||
3. Sandbox untrusted programs
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Untrusted programs that could be a source of attacks can be cordoned
|
||||
off by disabling their indirect branch speculation when they are run
|
||||
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
|
||||
This prevents untrusted programs from polluting the branch target
|
||||
buffer. All programs running in SECCOMP sandboxes have indirect
|
||||
branch speculation restricted by default. This behavior can be
|
||||
changed via the kernel command line and sysfs control files. See
|
||||
:ref:`spectre_mitigation_control_command_line`.
|
||||
|
||||
4. High security mode
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
All Spectre variant 2 mitigations can be forced on
|
||||
at boot time for all programs (See the "on" option in
|
||||
:ref:`spectre_mitigation_control_command_line`). This will add
|
||||
overhead as indirect branch speculations for all programs will be
|
||||
restricted.
|
||||
|
||||
On x86, the branch target buffer will be flushed with IBPB when switching
|
||||
to a new program. STIBP is left on all the time to protect programs
|
||||
against variant 2 attacks originating from programs running on
|
||||
sibling threads.
|
||||
|
||||
Alternatively, STIBP can be used only when running programs
|
||||
whose indirect branch speculation is explicitly disabled,
|
||||
while IBPB is still used all the time when switching to a new
|
||||
program to clear the branch target buffer (See the "ibpb" option in
|
||||
:ref:`spectre_mitigation_control_command_line`). This "ibpb" option
|
||||
has less performance cost than the "on" option, which leaves STIBP
|
||||
on all the time.
|
||||
|
||||
References on Spectre
|
||||
---------------------
|
||||
|
||||
Intel white papers:
|
||||
|
||||
.. _spec_ref1:
|
||||
|
||||
[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_.
|
||||
|
||||
.. _spec_ref2:
|
||||
|
||||
[2] `Bounds check bypass <https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass>`_.
|
||||
|
||||
.. _spec_ref3:
|
||||
|
||||
[3] `Deep dive: Retpoline: A branch target injection mitigation <https://software.intel.com/security-software-guidance/insights/deep-dive-retpoline-branch-target-injection-mitigation>`_.
|
||||
|
||||
.. _spec_ref4:
|
||||
|
||||
[4] `Deep Dive: Single Thread Indirect Branch Predictors <https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors>`_.
|
||||
|
||||
AMD white papers:
|
||||
|
||||
.. _spec_ref5:
|
||||
|
||||
[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_.
|
||||
|
||||
.. _spec_ref6:
|
||||
|
||||
[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
|
||||
|
||||
ARM white papers:
|
||||
|
||||
.. _spec_ref7:
|
||||
|
||||
[7] `Cache speculation side-channels <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper>`_.
|
||||
|
||||
.. _spec_ref8:
|
||||
|
||||
[8] `Cache speculation issues update <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update>`_.
|
||||
|
||||
Google white paper:
|
||||
|
||||
.. _spec_ref9:
|
||||
|
||||
[9] `Retpoline: a software construct for preventing branch-target-injection <https://support.google.com/faqs/answer/7625886>`_.
|
||||
|
||||
MIPS white paper:
|
||||
|
||||
.. _spec_ref10:
|
||||
|
||||
[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_.
|
||||
|
||||
Academic papers:
|
||||
|
||||
.. _spec_ref11:
|
||||
|
||||
[11] `Spectre Attacks: Exploiting Speculative Execution <https://spectreattack.com/spectre.pdf>`_.
|
||||
|
||||
.. _spec_ref12:
|
||||
|
||||
[12] `NetSpectre: Read Arbitrary Memory over Network <https://arxiv.org/abs/1807.10535>`_.
|
||||
|
||||
.. _spec_ref13:
|
||||
|
||||
[13] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://www.usenix.org/system/files/conference/woot18/woot18-paper-koruyeh.pdf>`_.
|
@ -70,6 +70,7 @@ configure specific aspects of kernel behavior to your liking.
|
||||
ras
|
||||
bcache
|
||||
ext4
|
||||
binderfs
|
||||
pm/index
|
||||
thunderbolt
|
||||
LSM/index
|
||||
|
@ -9,11 +9,11 @@ and sorted into English Dictionary order (defined as ignoring all
|
||||
punctuation and sorting digits before letters in a case insensitive
|
||||
manner), and with descriptions where known.
|
||||
|
||||
The kernel parses parameters from the kernel command line up to "--";
|
||||
The kernel parses parameters from the kernel command line up to "``--``";
|
||||
if it doesn't recognize a parameter and it doesn't contain a '.', the
|
||||
parameter gets passed to init: parameters with '=' go into init's
|
||||
environment, others are passed as command line arguments to init.
|
||||
Everything after "--" is passed as an argument to init.
|
||||
Everything after "``--``" is passed as an argument to init.
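
The routing rules in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: `split_cmdline` and `known_params` are hypothetical names, tokens are split on whitespace without quoting support, and the '.' check is applied to the parameter name only. The real kernel consults its own parameter tables.

```python
def split_cmdline(cmdline, known_params):
    """Route command-line tokens to the kernel, init's environment, or init's argv.

    known_params is a hypothetical set of names the kernel recognizes.
    """
    kernel, init_env, init_args = [], {}, []
    tokens = cmdline.split()  # simplification: no quoting support
    for i, tok in enumerate(tokens):
        if tok == "--":
            # Everything after "--" is passed to init as arguments.
            init_args.extend(tokens[i + 1:])
            break
        name, sep, value = tok.partition("=")
        if name in known_params or "." in name:
            # Recognized, or of the form "module.param": stays with the kernel.
            kernel.append(tok)
        elif sep:
            # Unrecognized and has '=': goes into init's environment.
            init_env[name] = value
        else:
            # Unrecognized without '=': becomes a command line argument to init.
            init_args.append(tok)
    return kernel, init_env, init_args
```

For example, with `known_params = {"quiet"}`, the token `TERM=linux` lands in init's environment while `single` becomes an init argument.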
|
||||
|
||||
Module parameters can be specified in two ways: via the kernel command
|
||||
line with a module name prefix, or via modprobe, e.g.::
|
||||
@ -167,7 +167,7 @@ parameter is applicable::
|
||||
X86-32 X86-32, aka i386 architecture is enabled.
|
||||
X86-64 X86-64 architecture is enabled.
|
||||
More X86-64 boot options can be found in
|
||||
Documentation/x86/x86_64/boot-options.txt .
|
||||
Documentation/x86/x86_64/boot-options.rst.
|
||||
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
|
||||
X86_UV SGI UV support is enabled.
|
||||
XEN Xen support is enabled
|
||||
@ -181,10 +181,10 @@ In addition, the following text indicates that the option::
|
||||
Parameters denoted with BOOT are actually interpreted by the boot
|
||||
loader, and have no meaning to the kernel directly.
|
||||
Do not modify the syntax of boot loader parameters without extreme
|
||||
need or coordination with <Documentation/x86/boot.txt>.
|
||||
need or coordination with <Documentation/x86/boot.rst>.
|
||||
|
||||
There are also arch-specific kernel-parameters not documented here.
|
||||
See for example <Documentation/x86/x86_64/boot-options.txt>.
|
||||
See for example <Documentation/x86/x86_64/boot-options.rst>.
|
||||
|
||||
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
|
||||
a trailing = on the name of any parameter states that that parameter will
|
||||
|
@ -53,7 +53,7 @@
|
||||
ACPI_DEBUG_PRINT statements, e.g.,
|
||||
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
|
||||
The debug_level mask defaults to "info". See
|
||||
Documentation/acpi/debug.txt for more information about
|
||||
Documentation/firmware-guide/acpi/debug.rst for more information about
|
||||
debug layers and levels.
|
||||
|
||||
Enable processor driver info messages:
|
||||
@ -708,14 +708,14 @@
|
||||
[KNL, x86_64] select a region under 4G first, and
|
||||
fall back to reserve region above 4G when '@offset'
|
||||
hasn't been specified.
|
||||
See Documentation/kdump/kdump.txt for further details.
|
||||
See Documentation/kdump/kdump.rst for further details.
|
||||
|
||||
crashkernel=range1:size1[,range2:size2,...][@offset]
|
||||
[KNL] Same as above, but depends on the memory
|
||||
in the running system. The syntax of range is
|
||||
start-[end] where start and end are both
|
||||
a memory unit (amount[KMG]). See also
|
||||
Documentation/kdump/kdump.txt for an example.
|
||||
Documentation/kdump/kdump.rst for an example.
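
A minimal sketch of how the range syntax above could be evaluated, assuming each range is half-open (start <= system RAM < end) and that the first matching range wins. The helper names `parse_size` and `crashkernel_size` are hypothetical and are not kernel functions.

```python
UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(text):
    """Convert a memory unit of the form amount[KMG] into bytes."""
    suffix = text[-1].upper() if text[-1].isalpha() else ""
    digits = text[:-1] if suffix else text
    return int(digits) * UNITS[suffix]

def crashkernel_size(value, system_ram):
    """Return (size, offset) in bytes for the first matching range, else None.

    value is the text after "crashkernel=", e.g. "512M-2G:64M,2G-:128M".
    """
    spec, _, offset = value.partition("@")
    for entry in spec.split(","):
        rng, _, size = entry.partition(":")
        start, _, end = rng.partition("-")
        lo = parse_size(start)
        hi = parse_size(end) if end else None  # "start-" is open-ended
        if system_ram >= lo and (hi is None or system_ram < hi):
            return parse_size(size), (parse_size(offset) if offset else None)
    return None  # no range matched: nothing is reserved in this sketch
```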
|
||||
|
||||
crashkernel=size[KMG],high
|
||||
[KNL, x86_64] range could be above 4G. Allow kernel
|
||||
@ -932,7 +932,7 @@
|
||||
edid/1680x1050.bin, or edid/1920x1080.bin is given
|
||||
and no file with the same name exists. Details and
|
||||
instructions how to build your own EDID data are
|
||||
available in Documentation/EDID/HOWTO.txt. An EDID
|
||||
available in Documentation/EDID/howto.rst. An EDID
|
||||
data set will only be used for a particular connector,
|
||||
if its name and a colon are prepended to the EDID
|
||||
name. Each connector may use a unique EDID data
|
||||
@ -963,7 +963,7 @@
|
||||
for details.
|
||||
|
||||
nompx [X86] Disables Intel Memory Protection Extensions.
|
||||
See Documentation/x86/intel_mpx.txt for more
|
||||
See Documentation/x86/intel_mpx.rst for more
|
||||
information about the feature.
|
||||
|
||||
nopku [X86] Disable Memory Protection Keys CPU feature found
|
||||
@ -1189,7 +1189,7 @@
|
||||
that is to be dynamically loaded by Linux. If there are
|
||||
multiple variables with the same name but with different
|
||||
vendor GUIDs, all of them will be loaded. See
|
||||
Documentation/acpi/ssdt-overlays.txt for details.
|
||||
Documentation/admin-guide/acpi/ssdt-overlays.rst for details.
|
||||
|
||||
|
||||
eisa_irq_edge= [PARISC,HW]
|
||||
@ -1209,7 +1209,7 @@
|
||||
Specifies physical address of start of kernel core
|
||||
image elf header and optionally the size. Generally
|
||||
kexec loader will pass this option to capture kernel.
|
||||
See Documentation/kdump/kdump.txt for details.
|
||||
See Documentation/kdump/kdump.rst for details.
|
||||
|
||||
enable_mtrr_cleanup [X86]
|
||||
The kernel tries to adjust MTRR layout from continuous
|
||||
@ -1388,9 +1388,6 @@
|
||||
Valid parameters: "on", "off"
|
||||
Default: "on"
|
||||
|
||||
hisax= [HW,ISDN]
|
||||
See Documentation/isdn/README.HiSax.
|
||||
|
||||
hlt [BUGS=ARM,SH]
|
||||
|
||||
hpet= [X86-32,HPET] option to control HPET usage
|
||||
@ -1507,7 +1504,7 @@
|
||||
Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
|
||||
.vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
|
||||
.cdrom .chs .ignore_cable are additional options
|
||||
See Documentation/ide/ide.txt.
|
||||
See Documentation/ide/ide.rst.
|
||||
|
||||
ide-generic.probe-mask= [HW] (E)IDE subsystem
|
||||
Format: <int>
|
||||
@ -2383,7 +2380,7 @@
|
||||
|
||||
mce [X86-32] Machine Check Exception
|
||||
|
||||
mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt
|
||||
mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
|
||||
|
||||
md= [HW] RAID subsystems devices and level
|
||||
See Documentation/admin-guide/md.rst.
|
||||
@ -2439,7 +2436,7 @@
|
||||
set according to the
|
||||
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
|
||||
option.
|
||||
See Documentation/memory-hotplug.txt.
|
||||
See Documentation/admin-guide/mm/memory-hotplug.rst.
|
||||
|
||||
memmap=exactmap [KNL,X86] Enable setting of an exact
|
||||
E820 memory map, as specified by the user.
|
||||
@ -2528,7 +2525,7 @@
|
||||
mem_encrypt=on: Activate SME
|
||||
mem_encrypt=off: Do not activate SME
|
||||
|
||||
Refer to Documentation/x86/amd-memory-encryption.txt
|
||||
Refer to Documentation/virtual/kvm/amd-memory-encryption.rst
|
||||
for details on when memory encryption can be activated.
|
||||
|
||||
mem_sleep_default= [SUSPEND] Default system suspend mode:
|
||||
@ -2836,8 +2833,9 @@
|
||||
0 - turn hardlockup detector in nmi_watchdog off
|
||||
1 - turn hardlockup detector in nmi_watchdog on
|
||||
When panic is specified, panic when an NMI watchdog
|
||||
timeout occurs (or 'nopanic' to override the opposite
|
||||
default). To disable both hard and soft lockup detectors,
|
||||
timeout occurs (or 'nopanic' to not panic on an NMI
|
||||
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)
|
||||
To disable both hard and soft lockup detectors,
|
||||
please see 'nowatchdog'.
|
||||
This is useful when you use a panic=... timeout and
|
||||
need the box quickly up again.
|
||||
@ -3528,7 +3526,7 @@
|
||||
See Documentation/blockdev/paride.txt.
|
||||
|
||||
pirq= [SMP,APIC] Manual mp-table setup
|
||||
See Documentation/x86/i386/IO-APIC.txt.
|
||||
See Documentation/x86/i386/IO-APIC.rst.
|
||||
|
||||
plip= [PPT,NET] Parallel port network link
|
||||
Format: { parport<nr> | timid | 0 }
|
||||
@ -5032,7 +5030,7 @@
|
||||
vector=percpu: enable percpu vector domain
|
||||
|
||||
video= [FB] Frame buffer configuration
|
||||
See Documentation/fb/modedb.txt.
|
||||
See Documentation/fb/modedb.rst.
|
||||
|
||||
video.brightness_switch_enabled= [0,1]
|
||||
If set to 1, on receiving an ACPI notify event
|
||||
@ -5060,7 +5058,7 @@
|
||||
Can be used multiple times for multiple devices.
|
||||
|
||||
vga= [BOOT,X86-32] Select a particular video mode
|
||||
See Documentation/x86/boot.txt and
|
||||
See Documentation/x86/boot.rst and
|
||||
Documentation/svga.txt.
|
||||
Use vga=ask for menu.
|
||||
This is actually a boot loader parameter; the value is
|
||||
@ -5167,7 +5165,7 @@
|
||||
Default: 3 = cyan.
|
||||
|
||||
watchdog timers [HW,WDT] For information on watchdog timers,
|
||||
see Documentation/watchdog/watchdog-parameters.txt
|
||||
see Documentation/watchdog/watchdog-parameters.rst
|
||||
or other driver-specific files in the
|
||||
Documentation/watchdog/ directory.
|
||||
|
||||
|
@ -165,5 +165,6 @@ write-through caching.
|
||||
========
|
||||
See Also
|
||||
========
|
||||
.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
|
||||
Section 5.2.27
|
||||
|
||||
[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
|
||||
- Section 5.2.27
|
||||
|
@ -199,7 +199,7 @@ Architecture (MCA)\ [#f3]_.
|
||||
mode).
|
||||
|
||||
.. [#f3] For more details about the Machine Check Architecture (MCA),
|
||||
please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
|
||||
please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.
|
||||
|
||||
EDAC - Error Detection And Correction
|
||||
*************************************
|
||||
|
@ -1,3 +1,6 @@
|
||||
Introduction
|
||||
============
|
||||
|
||||
ATA over Ethernet is a network protocol that provides simple access to
|
||||
block storage on the LAN.
|
||||
|
||||
@ -22,7 +25,8 @@ document the use of the driver and are not necessary if you install
|
||||
the aoetools.
|
||||
|
||||
|
||||
CREATING DEVICE NODES
|
||||
Creating Device Nodes
|
||||
=====================
|
||||
|
||||
Users of udev should find the block device nodes created
|
||||
automatically, but to create all the necessary device nodes, use the
|
||||
@ -38,7 +42,8 @@ CREATING DEVICE NODES
|
||||
confusing when an AoE device is not present the first time a
|
||||
command is run but appears a second later.
|
||||
|
||||
USING DEVICE NODES
|
||||
Using Device Nodes
|
||||
==================
|
||||
|
||||
"cat /dev/etherd/err" blocks, waiting for error diagnostic output,
|
||||
like any retransmitted packets.
|
||||
@ -55,7 +60,7 @@ USING DEVICE NODES
|
||||
by sysfs counterparts. Using the commands in aoetools insulates
|
||||
users from these implementation details.
|
||||
|
||||
The block devices are named like this:
|
||||
The block devices are named like this::
|
||||
|
||||
e{shelf}.{slot}
|
||||
e{shelf}.{slot}p{part}
|
||||
@ -64,7 +69,8 @@ USING DEVICE NODES
|
||||
first shelf (shelf address zero). That's the whole disk. The first
|
||||
partition on that disk would be "e0.2p1".
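
The naming scheme above can be decoded with a short regular expression. This is a sketch only; `parse_aoe_name` is a hypothetical helper and is not part of the aoetools.

```python
import re

# e{shelf}.{slot} with an optional p{part} suffix
_AOE_NAME = re.compile(r"^e(\d+)\.(\d+)(?:p(\d+))?$")

def parse_aoe_name(name):
    """Return (shelf, slot, part); part is None for a whole disk."""
    m = _AOE_NAME.match(name)
    if m is None:
        raise ValueError("not an AoE block device name: %r" % name)
    shelf, slot, part = m.groups()
    return int(shelf), int(slot), (int(part) if part is not None else None)
```

With this helper, "e0.2" decodes to shelf 0, slot 2, whole disk, and "e0.2p1" to the first partition on that disk.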
|
||||
|
||||
USING SYSFS
|
||||
Using sysfs
|
||||
===========
|
||||
|
||||
Each aoe block device in /sys/block has the extra attributes of
|
||||
state, mac, and netif. The state attribute is "up" when the device
|
||||
@ -78,29 +84,29 @@ USING SYSFS
|
||||
|
||||
There is a script in this directory that formats this information in
|
||||
a convenient way. Users with aoetools should use the aoe-stat
|
||||
command.
|
||||
command::
|
||||
|
||||
root@makki root# sh Documentation/aoe/status.sh
|
||||
e10.0 eth3 up
|
||||
e10.1 eth3 up
|
||||
e10.2 eth3 up
|
||||
e10.3 eth3 up
|
||||
e10.4 eth3 up
|
||||
e10.5 eth3 up
|
||||
e10.6 eth3 up
|
||||
e10.7 eth3 up
|
||||
e10.8 eth3 up
|
||||
e10.9 eth3 up
|
||||
e4.0 eth1 up
|
||||
e4.1 eth1 up
|
||||
e4.2 eth1 up
|
||||
e4.3 eth1 up
|
||||
e4.4 eth1 up
|
||||
e4.5 eth1 up
|
||||
e4.6 eth1 up
|
||||
e4.7 eth1 up
|
||||
e4.8 eth1 up
|
||||
e4.9 eth1 up
|
||||
root@makki root# sh Documentation/aoe/status.sh
|
||||
e10.0 eth3 up
|
||||
e10.1 eth3 up
|
||||
e10.2 eth3 up
|
||||
e10.3 eth3 up
|
||||
e10.4 eth3 up
|
||||
e10.5 eth3 up
|
||||
e10.6 eth3 up
|
||||
e10.7 eth3 up
|
||||
e10.8 eth3 up
|
||||
e10.9 eth3 up
|
||||
e4.0 eth1 up
|
||||
e4.1 eth1 up
|
||||
e4.2 eth1 up
|
||||
e4.3 eth1 up
|
||||
e4.4 eth1 up
|
||||
e4.5 eth1 up
|
||||
e4.6 eth1 up
|
||||
e4.7 eth1 up
|
||||
e4.8 eth1 up
|
||||
e4.9 eth1 up
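
A listing like the one above could also be assembled by reading the sysfs attributes directly. The sketch below is not the status.sh script itself. It assumes that any entry under the block directory carrying both a `state` and a `netif` attribute is an AoE device, and that a sysfs entry name may carry a prefix ending in "!" (for example "etherd!e4.0"), which is stripped. `aoe_status` is a hypothetical name.

```python
import os

def aoe_status(sys_block="/sys/block"):
    """Return a list of (device, netif, state) tuples for AoE block devices."""
    rows = []
    for entry in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, entry)
        state_file = os.path.join(path, "state")
        netif_file = os.path.join(path, "netif")
        # Assumption: only AoE devices expose both attributes.
        if not (os.path.isfile(state_file) and os.path.isfile(netif_file)):
            continue
        name = entry.rsplit("!", 1)[-1]  # drop an "etherd!"-style prefix
        with open(netif_file) as f:
            netif = f.read().strip()
        with open(state_file) as f:
            state = f.read().strip()
        rows.append((name, netif, state))
    return rows
```

Passing a different `sys_block` directory makes it possible to try the function against a mock tree instead of a live system.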
|
||||
|
||||
Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
|
||||
option discussed below) instead of /dev/etherd/interfaces to limit
|
||||
@ -113,12 +119,13 @@ USING SYSFS
|
||||
for this purpose. You can also directly use the
|
||||
/dev/etherd/discover special file described above.
|
||||
|
||||
DRIVER OPTIONS
|
||||
Driver Options
|
||||
==============
|
||||
|
||||
There is a boot option for the built-in aoe driver and a
|
||||
corresponding module parameter, aoe_iflist. Without this option,
|
||||
all network interfaces may be used for ATA over Ethernet. Here is a
|
||||
usage example for the module parameter.
|
||||
usage example for the module parameter::
|
||||
|
||||
modprobe aoe_iflist="eth1 eth3"
|
||||
|
23
Documentation/aoe/examples.rst
Normal file
@ -0,0 +1,23 @@
|
||||
Example of udev rules
|
||||
---------------------
|
||||
|
||||
.. include:: udev.txt
|
||||
:literal:
|
||||
|
||||
Example of udev install rules script
|
||||
------------------------------------
|
||||
|
||||
.. literalinclude:: udev-install.sh
|
||||
:language: shell
|
||||
|
||||
Example script to get status
|
||||
----------------------------
|
||||
|
||||
.. literalinclude:: status.sh
|
||||
:language: shell
|
||||
|
||||
Example of AoE autoload script
|
||||
------------------------------
|
||||
|
||||
.. literalinclude:: autoload.sh
|
||||
:language: shell
|
19
Documentation/aoe/index.rst
Normal file
@ -0,0 +1,19 @@
|
||||
:orphan:
|
||||
|
||||
=======================
|
||||
ATA over Ethernet (AoE)
|
||||
=======================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
aoe
|
||||
todo
|
||||
examples
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
@ -1,3 +1,6 @@
|
||||
TODO
|
||||
====
|
||||
|
||||
There is a potential for deadlock when allocating a struct sk_buff for
|
||||
data that needs to be written out to aoe storage. If the data is
|
||||
being written from a dirty page in order to free that page, and if
|
@ -11,7 +11,7 @@
|
||||
# udev_rules="/etc/udev/rules.d/"
|
||||
# bash# ls /etc/udev/rules.d/
|
||||
# 10-wacom.rules 50-udev.rules
|
||||
# bash# cp /path/to/linux-2.6.xx/Documentation/aoe/udev.txt \
|
||||
# bash# cp /path/to/linux/Documentation/aoe/udev.txt \
|
||||
# /etc/udev/rules.d/60-aoe.rules
|
||||
#
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
Too many problems poped up because of unnoticed misaligned memory access in
|
||||
Too many problems popped up because of unnoticed misaligned memory access in
|
||||
kernel code lately. Therefore the alignment fixup is now unconditionally
|
||||
configured in for SA11x0 based targets. According to Alan Cox, this is a
|
||||
bad idea to configure it out, but Russell King has some good reasons for
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
========================
|
||||
STM32 ARM Linux Overview
|
||||
========================
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
STM32F429 Overview
|
||||
==================
|
||||
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
STM32F746 Overview
|
||||
==================
|
||||
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
STM32F769 Overview
|
||||
==================
|
||||
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
STM32H743 Overview
|
||||
==================
|
||||
|
||||
|
@ -1,3 +1,5 @@
|
||||
:orphan:
|
||||
|
||||
STM32MP157 Overview
|
||||
===================
|
||||
|
||||
|
@ -1,5 +1,7 @@
|
||||
===========
|
||||
ACPI Tables
|
||||
-----------
|
||||
===========
|
||||
|
||||
The expectations of individual ACPI tables are discussed in the list that
|
||||
follows.
|
||||
|
||||
@ -11,54 +13,71 @@ outside of the UEFI Forum (see Section 5.2.6 of the specification).
|
||||
|
||||
For ACPI on arm64, tables also fall into the following categories:
|
||||
|
||||
-- Required: DSDT, FADT, GTDT, MADT, MCFG, RSDP, SPCR, XSDT
|
||||
- Required: DSDT, FADT, GTDT, MADT, MCFG, RSDP, SPCR, XSDT
|
||||
|
||||
-- Recommended: BERT, EINJ, ERST, HEST, PCCT, SSDT
|
||||
- Recommended: BERT, EINJ, ERST, HEST, PCCT, SSDT
|
||||
|
||||
-- Optional: BGRT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT, IORT,
|
||||
- Optional: BGRT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT, IORT,
|
||||
MCHI, MPST, MSCT, NFIT, PMTT, RASF, SBST, SLIT, SPMI, SRAT, STAO,
|
||||
TCPA, TPM2, UEFI, XENV
|
||||
|
||||
-- Not supported: BOOT, DBGP, DMAR, ETDT, HPET, IBFT, IVRS, LPIT,
|
||||
- Not supported: BOOT, DBGP, DMAR, ETDT, HPET, IBFT, IVRS, LPIT,
|
||||
MSDM, OEMx, PSDT, RSDT, SLIC, WAET, WDAT, WDRT, WPBT
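
The four categories above lend themselves to a simple lookup. The following sketch copies the table names from the lists and reports how a given set of firmware tables compares against them. `check_tables` and the report keys are hypothetical, and any name starting with "OEM" is folded into "OEMx" as an assumption.

```python
ARM64_ACPI_TABLES = {
    "required": {"DSDT", "FADT", "GTDT", "MADT", "MCFG", "RSDP", "SPCR", "XSDT"},
    "recommended": {"BERT", "EINJ", "ERST", "HEST", "PCCT", "SSDT"},
    "optional": {
        "BGRT", "CPEP", "CSRT", "DBG2", "DRTM", "ECDT", "FACS", "FPDT", "IORT",
        "MCHI", "MPST", "MSCT", "NFIT", "PMTT", "RASF", "SBST", "SLIT", "SPMI",
        "SRAT", "STAO", "TCPA", "TPM2", "UEFI", "XENV",
    },
    "not_supported": {
        "BOOT", "DBGP", "DMAR", "ETDT", "HPET", "IBFT", "IVRS", "LPIT", "MSDM",
        "OEMx", "PSDT", "RSDT", "SLIC", "WAET", "WDAT", "WDRT", "WPBT",
    },
}

def check_tables(present):
    """Compare a collection of table names against the arm64 categories."""
    # Assumption: every "OEM..." name is treated as the OEMx entry.
    names = {"OEMx" if n.startswith("OEM") else n for n in present}
    known = set().union(*ARM64_ACPI_TABLES.values())
    return {
        "missing_required": sorted(ARM64_ACPI_TABLES["required"] - names),
        "missing_recommended": sorted(ARM64_ACPI_TABLES["recommended"] - names),
        "not_supported": sorted(names & ARM64_ACPI_TABLES["not_supported"]),
        "unknown": sorted(names - known),
    }
```

Note that the sketch works on table names as listed, not on the on-disk signatures, which differ for some tables (for example FADT uses "FACP" and MADT uses "APIC").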
|
||||
|
||||
====== ========================================================================
|
||||
Table Usage for ARMv8 Linux
|
||||
----- ----------------------------------------------------------------
|
||||
====== ========================================================================
|
||||
BERT Section 18.3 (signature == "BERT")
|
||||
== Boot Error Record Table ==
|
||||
|
||||
**Boot Error Record Table**
|
||||
|
||||
Must be supplied if RAS support is provided by the platform. It
|
||||
is recommended this table be supplied.
|
||||
|
||||
BOOT Signature Reserved (signature == "BOOT")
|
||||
== simple BOOT flag table ==
|
||||
|
||||
**simple BOOT flag table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
BGRT Section 5.2.22 (signature == "BGRT")
|
||||
== Boot Graphics Resource Table ==
|
||||
|
||||
**Boot Graphics Resource Table**
|
||||
|
||||
Optional, not currently supported, with no real use-case for an
|
||||
ARM server.
|
||||
|
||||
CPEP Section 5.2.18 (signature == "CPEP")
|
||||
== Corrected Platform Error Polling table ==
|
||||
|
||||
**Corrected Platform Error Polling table**
|
||||
|
||||
Optional, not currently supported, and not recommended until such
|
||||
time as ARM-compatible hardware is available, and the specification
|
||||
suitably modified.
|
||||
|
||||
CSRT Signature Reserved (signature == "CSRT")
|
||||
== Core System Resources Table ==
|
||||
|
||||
**Core System Resources Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
DBG2 Signature Reserved (signature == "DBG2")
|
||||
== DeBuG port table 2 ==
|
||||
|
||||
**DeBuG port table 2**
|
||||
|
||||
License has changed and should be usable. Optional if used instead
|
||||
of earlycon=<device> on the command line.
|
||||
|
||||
DBGP Signature Reserved (signature == "DBGP")
|
||||
== DeBuG Port table ==
|
||||
|
||||
**DeBuG Port table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
DSDT Section 5.2.11.1 (signature == "DSDT")
|
||||
== Differentiated System Description Table ==
|
||||
|
||||
**Differentiated System Description Table**
|
||||
|
||||
A DSDT is required; see also SSDT.
|
||||
|
||||
ACPI tables contain only one DSDT but can contain one or more SSDTs,
|
||||
@ -66,22 +85,30 @@ DSDT Section 5.2.11.1 (signature == "DSDT")
|
||||
but cannot modify or replace anything in the DSDT.
|
||||
|
||||
DMAR Signature Reserved (signature == "DMAR")
|
||||
== DMA Remapping table ==
|
||||
|
||||
**DMA Remapping table**
|
||||
|
||||
x86 only table, will not be supported.
|
||||
|
||||
DRTM Signature Reserved (signature == "DRTM")
|
||||
== Dynamic Root of Trust for Measurement table ==
|
||||
|
||||
**Dynamic Root of Trust for Measurement table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
ECDT Section 5.2.16 (signature == "ECDT")
|
||||
== Embedded Controller Description Table ==
|
||||
|
||||
**Embedded Controller Description Table**
|
||||
|
||||
Optional, not currently supported, but could be used on ARM if and
|
||||
only if one uses the GPE_BIT field to represent an IRQ number, since
|
||||
there are no GPE blocks defined in hardware reduced mode. This would
|
||||
need to be modified in the ACPI specification.
|
||||
|
||||
EINJ Section 18.6 (signature == "EINJ")
|
||||
== Error Injection table ==
|
||||
|
||||
**Error Injection table**
|
||||
|
||||
This table is very useful for testing platform response to error
|
||||
conditions; it allows one to inject an error into the system as
|
||||
if it had actually occurred. However, this table should not be
|
||||
@ -89,27 +116,35 @@ EINJ Section 18.6 (signature == "EINJ")
|
||||
and executed with the ACPICA tools only during testing.
|
||||
|
||||
ERST Section 18.5 (signature == "ERST")
|
||||
== Error Record Serialization Table ==
|
||||
|
||||
**Error Record Serialization Table**
|
||||
|
||||
On a platform that supports RAS, this table must be supplied if it is not
|
||||
UEFI-based; if it is UEFI-based, this table may be supplied. When this
|
||||
table is not present, UEFI run time service will be utilized to save
|
||||
and retrieve hardware error information to and from a persistent store.
|
||||
|
||||
ETDT Signature Reserved (signature == "ETDT")
|
||||
== Event Timer Description Table ==
|
||||
|
||||
**Event Timer Description Table**
|
||||
|
||||
Obsolete table, will not be supported.
|
||||
|
||||
FACS Section 5.2.10 (signature == "FACS")
|
||||
== Firmware ACPI Control Structure ==
|
||||
|
||||
**Firmware ACPI Control Structure**
|
||||
|
||||
It is unlikely that this table will be terribly useful. If it is
|
||||
provided, the Global Lock will NOT be used since it is not part of
|
||||
the hardware reduced profile, and only 64-bit address fields will
|
||||
be considered valid.
|
||||
|
||||
FADT Section 5.2.9 (signature == "FACP")
|
||||
== Fixed ACPI Description Table ==
|
||||
|
||||
**Fixed ACPI Description Table**
|
||||
Required for arm64.
|
||||
|
||||
|
||||
The HW_REDUCED_ACPI flag must be set. All of the fields that are
|
||||
to be ignored when HW_REDUCED_ACPI is set are expected to be set to
|
||||
zero.
|
||||
@ -118,22 +153,28 @@ FADT Section 5.2.9 (signature == "FACP")
|
||||
used, not FIRMWARE_CTRL.
|
||||
|
||||
If PSCI is used (as is recommended), make sure that ARM_BOOT_ARCH is
|
||||
filled in properly -- that the PSCI_COMPLIANT flag is set and that
|
||||
filled in properly - that the PSCI_COMPLIANT flag is set and that
|
||||
PSCI_USE_HVC is set or unset as needed (see table 5-37).
|
||||
|
||||
For the DSDT that is also required, the X_DSDT field is to be used,
|
||||
not the DSDT field.
|
||||
|
||||
FPDT Section 5.2.23 (signature == "FPDT")
|
||||
== Firmware Performance Data Table ==
|
||||
|
||||
**Firmware Performance Data Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
GTDT Section 5.2.24 (signature == "GTDT")
|
||||
== Generic Timer Description Table ==
|
||||
|
||||
**Generic Timer Description Table**
|
||||
|
||||
Required for arm64.
|
||||
|
||||
HEST Section 18.3.2 (signature == "HEST")
|
||||
== Hardware Error Source Table ==
|
||||
|
||||
**Hardware Error Source Table**
|
||||
|
||||
ARM-specific error sources have been defined; please use those or the
|
||||
PCI types such as type 6 (AER Root Port), 7 (AER Endpoint), or 8 (AER
|
||||
Bridge), or use type 9 (Generic Hardware Error Source). Firmware first
|
||||
@ -144,122 +185,174 @@ HEST Section 18.3.2 (signature == "HEST")
|
||||
is recommended this table be supplied.
|
||||
|
||||
HPET Signature Reserved (signature == "HPET")
|
||||
== High Precision Event timer Table ==
|
||||
|
||||
**High Precision Event timer Table**
|
||||
|
||||
x86 only table, will not be supported.
|
||||
|
||||
IBFT Signature Reserved (signature == "IBFT")
|
||||
== iSCSI Boot Firmware Table ==
|
||||
|
||||
**iSCSI Boot Firmware Table**
|
||||
|
||||
Microsoft defined table, support TBD.
|
||||
|
||||
IORT Signature Reserved (signature == "IORT")
|
||||
== Input Output Remapping Table ==
|
||||
|
||||
**Input Output Remapping Table**
|
||||
|
||||
arm64 only table, required in order to describe IO topology, SMMUs,
|
||||
and GIC ITSs, and how those various components are connected together,
|
||||
such as identifying which components are behind which SMMUs/ITSs.
|
||||
This table will only be required on certain SBSA platforms (e.g.,
|
||||
when using GICv3-ITS and an SMMU); on SBSA Level 0 platforms, it
|
||||
when using GICv3-ITS and an SMMU); on SBSA Level 0 platforms, it
|
||||
remains optional.
|
||||
|
||||
IVRS Signature Reserved (signature == "IVRS")
|
||||
== I/O Virtualization Reporting Structure ==
|
||||
|
||||
**I/O Virtualization Reporting Structure**
|
||||
|
||||
x86_64 (AMD) only table, will not be supported.
|
||||
|
||||
LPIT Signature Reserved (signature == "LPIT")
|
||||
== Low Power Idle Table ==
|
||||
|
||||
**Low Power Idle Table**
|
||||
|
||||
x86 only table as of ACPI 5.1; starting with ACPI 6.0, processor
|
||||
descriptions and power states on ARM platforms should use the DSDT
|
||||
and define processor container devices (_HID ACPI0010, Section 8.4,
|
||||
and more specifically 8.4.3 and 8.4.4).
|
||||
|
||||
MADT Section 5.2.12 (signature == "APIC")
|
||||
== Multiple APIC Description Table ==
|
||||
|
||||
**Multiple APIC Description Table**
|
||||
|
||||
Required for arm64. Only the GIC interrupt controller structures
|
||||
should be used (types 0xA - 0xF).
|
||||
|
||||
MCFG Signature Reserved (signature == "MCFG")
|
||||
== Memory-mapped ConFiGuration space ==
|
||||
|
||||
**Memory-mapped ConFiGuration space**
|
||||
|
||||
If the platform supports PCI/PCIe, an MCFG table is required.
|
||||
|
||||
MCHI Signature Reserved (signature == "MCHI")
|
||||
== Management Controller Host Interface table ==
|
||||
|
||||
**Management Controller Host Interface table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
MPST Section 5.2.21 (signature == "MPST")
|
||||
== Memory Power State Table ==
|
||||
|
||||
**Memory Power State Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
MSCT Section 5.2.19 (signature == "MSCT")
|
||||
== Maximum System Characteristic Table ==
|
||||
|
||||
**Maximum System Characteristic Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
MSDM Signature Reserved (signature == "MSDM")
|
||||
== Microsoft Data Management table ==
|
||||
|
||||
**Microsoft Data Management table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
NFIT Section 5.2.25 (signature == "NFIT")
|
||||
== NVDIMM Firmware Interface Table ==
|
||||
|
||||
**NVDIMM Firmware Interface Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
OEMx Signature of "OEMx" only
|
||||
== OEM Specific Tables ==
|
||||
|
||||
**OEM Specific Tables**
|
||||
|
||||
All tables starting with a signature of "OEM" are reserved for OEM
|
||||
use. Since these are not meant to be of general use but are limited
|
||||
to very specific end users, they are not recommended for use and are
|
||||
not supported by the kernel for arm64.
|
||||
|
||||
PCCT Section 14.1 (signature == "PCCT")
|
||||
== Platform Communications Channel Table ==
|
||||
|
||||
**Platform Communications Channel Table**
|
||||
|
||||
Recommended for use on arm64; use of PCC is recommended when using CPPC
|
||||
to control performance and power for platform processors.
|
||||
|
||||
PMTT Section 5.2.21.12 (signature == "PMTT")
|
||||
== Platform Memory Topology Table ==
|
||||
|
||||
**Platform Memory Topology Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
PSDT Section 5.2.11.3 (signature == "PSDT")
|
||||
== Persistent System Description Table ==
|
||||
|
||||
**Persistent System Description Table**
|
||||
|
||||
Obsolete table, will not be supported.
|
||||
|
||||
RASF Section 5.2.20 (signature == "RASF")
|
||||
== RAS Feature table ==
|
||||
|
||||
**RAS Feature table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
RSDP Section 5.2.5 (signature == "RSD PTR")
|
||||
== Root System Description PoinTeR ==
|
||||
|
||||
**Root System Description PoinTeR**
|
||||
|
||||
Required for arm64.
|
||||
|
||||
RSDT Section 5.2.7 (signature == "RSDT")
|
||||
== Root System Description Table ==
|
||||
|
||||
**Root System Description Table**
|
||||
|
||||
Since this table can only provide 32-bit addresses, it is deprecated
|
||||
on arm64, and will not be used. If provided, it will be ignored.
|
||||
|
||||
SBST Section 5.2.14 (signature == "SBST")
|
||||
== Smart Battery Subsystem Table ==
|
||||
|
||||
**Smart Battery Subsystem Table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
SLIC Signature Reserved (signature == "SLIC")
|
||||
== Software LIcensing table ==
|
||||
|
||||
**Software LIcensing table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
SLIT Section 5.2.17 (signature == "SLIT")
|
||||
== System Locality distance Information Table ==
|
||||
|
||||
**System Locality distance Information Table**
|
||||
|
||||
Optional in general, but required for NUMA systems.
|
||||
|
||||
SPCR Signature Reserved (signature == "SPCR")
|
||||
== Serial Port Console Redirection table ==
|
||||
|
||||
**Serial Port Console Redirection table**
|
||||
|
||||
Required for arm64.
|
||||
|
||||
SPMI Signature Reserved (signature == "SPMI")
|
||||
== Server Platform Management Interface table ==
|
||||
|
||||
**Server Platform Management Interface table**
|
||||
|
||||
Optional, not currently supported.
|
||||
|
||||
SRAT Section 5.2.16 (signature == "SRAT")
|
||||
== System Resource Affinity Table ==
|
||||
|
||||
**System Resource Affinity Table**
|
||||
|
||||
Optional, but if used, only the GICC Affinity structures are read.
|
||||
To support arm64 NUMA, this table is required.
|
||||
|
||||
SSDT Section 5.2.11.2 (signature == "SSDT")
|
||||
== Secondary System Description Table ==
|
||||
|
||||
**Secondary System Description Table**
|
||||
|
||||
These tables are a continuation of the DSDT; these are recommended
|
||||
for use with devices that can be added to a running system, but can
|
||||
also serve the purpose of dividing up device descriptions into more
|
||||
@ -272,49 +365,69 @@ SSDT Section 5.2.11.2 (signature == "SSDT")
|
||||
one DSDT but can contain many SSDTs.
|
||||
|
||||
STAO Signature Reserved (signature == "STAO")
|
||||
== _STA Override table ==
|
||||
|
||||
**_STA Override table**
|
||||
|
||||
Optional, but only necessary in virtualized environments in order to
|
||||
hide devices from guest OSs.
|
||||
|
||||
TCPA Signature Reserved (signature == "TCPA")
|
||||
== Trusted Computing Platform Alliance table ==
|
||||
|
||||
**Trusted Computing Platform Alliance table**
|
||||
|
||||
Optional, not currently supported, and may need changes to fully
|
||||
interoperate with arm64.
|
||||
|
||||
TPM2 Signature Reserved (signature == "TPM2")
|
||||
== Trusted Platform Module 2 table ==
|
||||
|
||||
**Trusted Platform Module 2 table**
|
||||
|
||||
Optional, not currently supported, and may need changes to fully
|
||||
interoperate with arm64.
|
||||
|
||||
UEFI Signature Reserved (signature == "UEFI")
|
||||
== UEFI ACPI data table ==
|
||||
|
||||
**UEFI ACPI data table**
|
||||
|
||||
Optional, not currently supported. No known use case for arm64,
|
||||
at present.
|
||||
|
||||
WAET Signature Reserved (signature == "WAET")
|
||||
== Windows ACPI Emulated devices Table ==
|
||||
|
||||
**Windows ACPI Emulated devices Table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
WDAT Signature Reserved (signature == "WDAT")
|
||||
== Watch Dog Action Table ==
|
||||
|
||||
**Watch Dog Action Table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
WDRT Signature Reserved (signature == "WDRT")
|
||||
== Watch Dog Resource Table ==
|
||||
|
||||
**Watch Dog Resource Table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
WPBT Signature Reserved (signature == "WPBT")
|
||||
== Windows Platform Binary Table ==
|
||||
|
||||
**Windows Platform Binary Table**
|
||||
|
||||
Microsoft only table, will not be supported.
|
||||
|
||||
XENV Signature Reserved (signature == "XENV")
|
||||
== Xen project table ==
|
||||
|
||||
**Xen project table**
|
||||
|
||||
Optional, used only by Xen at present.
|
||||
|
||||
XSDT Section 5.2.8 (signature == "XSDT")
|
||||
== eXtended System Description Table ==
|
||||
Required for arm64.
|
||||
|
||||
**eXtended System Description Table**
|
||||
|
||||
Required for arm64.
|
||||
====== ========================================================================
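
Each row in the table above is keyed by a four-character signature. As a minimal sketch, assuming only the statuses quoted in this excerpt, a lookup over a few of those signatures might look like the following. The struct and function names are hypothetical and are not taken from the kernel. The signature is four ASCII bytes with no terminating NUL, so the comparison uses memcmp() rather than strcmp().

```c
#include <stddef.h>
#include <string.h>

/* Status strings paraphrase rows of the table above; the struct and
 * function names are hypothetical, for illustration only. */
struct table_status {
    char sig[4];            /* four ASCII bytes, not NUL-terminated */
    const char *status;
};

static const struct table_status arm64_tables[] = {
    { {'S', 'P', 'C', 'R'}, "required" },
    { {'X', 'S', 'D', 'T'}, "required" },
    { {'S', 'R', 'A', 'T'}, "optional (required for NUMA)" },
    { {'X', 'E', 'N', 'V'}, "optional (Xen only)" },
    { {'W', 'A', 'E', 'T'}, "not supported" },
    { {'W', 'D', 'A', 'T'}, "not supported" },
};

/* Return the status for a 4-byte signature, or NULL if it is not
 * one of the rows copied into this sketch. */
const char *arm64_table_status(const char sig[4])
{
    size_t i;

    for (i = 0; i < sizeof(arm64_tables) / sizeof(arm64_tables[0]); i++) {
        if (memcmp(arm64_tables[i].sig, sig, 4) == 0)
            return arm64_tables[i].status;
    }
    return NULL;
}
```

A NULL return only means the signature is absent from this shortened list, not that the table is unsupported.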
|
||||
|
||||
ACPI Objects
|
||||
------------
|
||||
@ -323,10 +436,11 @@ shown in the list that follows; any object not explicitly mentioned below
|
||||
should be used as needed for a particular platform or particular subsystem,
|
||||
such as power management or PCI.
|
||||
|
||||
===== ================ ========================================================
|
||||
Name Section Usage for ARMv8 Linux
|
||||
---- ------------ -------------------------------------------------
|
||||
===== ================ ========================================================
|
||||
_CCA 6.2.17 This method must be defined for all bus masters
|
||||
on arm64 -- there are no assumptions made about
|
||||
on arm64 - there are no assumptions made about
|
||||
whether such devices are cache coherent or not.
|
||||
The _CCA value is inherited by all descendants of
|
||||
these devices so it does not need to be repeated.
|
||||
@ -422,8 +536,8 @@ _OSC 6.2.11 This method can be a global method in ACPI (i.e.,
|
||||
by the kernel community, then register it with the
|
||||
UEFI Forum.
|
||||
|
||||
\_OSI 5.7.2 Deprecated on ARM64. As far as ACPI firmware is
|
||||
concerned, _OSI is not to be used to determine what
|
||||
\_OSI 5.7.2 Deprecated on ARM64. As far as ACPI firmware is
|
||||
concerned, _OSI is not to be used to determine what
|
||||
sort of system is being used or what functionality
|
||||
is provided. The _OSC method is to be used instead.
|
||||
|
||||
@ -447,7 +561,7 @@ _PSx 7.3.2-5 Use as needed; power management specific. If _PS0 is
|
||||
usage, change them in these methods.
|
||||
|
||||
_RDI 8.4.4.4 Recommended for use with processor definitions (_HID
|
||||
ACPI0010) on arm64. This should only be used in
|
||||
ACPI0010) on arm64. This should only be used in
|
||||
conjunction with _LPI.
|
||||
|
||||
\_REV 5.7.4 Always returns the latest version of ACPI supported.
|
||||
@ -476,6 +590,7 @@ _SWS 7.4.3 Use as needed; power management specific; this may
|
||||
|
||||
_UID 6.1.12 Recommended for distinguishing devices of the same
|
||||
class; define it if at all possible.
|
||||
===== ================ ========================================================
|
||||
|
||||
|
||||
|
||||
@ -488,7 +603,7 @@ platforms, ACPI events must be signaled differently.
|
||||
|
||||
There are two options: GPIO-signaled interrupts (Section 5.6.5), and
|
||||
interrupt-signaled events (Section 5.6.9). Interrupt-signaled events are a
|
||||
new feature in the ACPI 6.1 specification. Either -- or both -- can be used
|
||||
new feature in the ACPI 6.1 specification. Either - or both - can be used
|
||||
on a given platform, and which to use may be dependent on limitations in any
|
||||
given SoC. If possible, interrupt-signaled events are recommended.
|
||||
|
||||
@ -564,39 +679,40 @@ supported.
|
||||
|
||||
The following classes of objects are not supported:
|
||||
|
||||
-- Section 9.2: ambient light sensor devices
|
||||
- Section 9.2: ambient light sensor devices
|
||||
|
||||
-- Section 9.3: battery devices
|
||||
- Section 9.3: battery devices
|
||||
|
||||
-- Section 9.4: lids (e.g., laptop lids)
|
||||
- Section 9.4: lids (e.g., laptop lids)
|
||||
|
||||
-- Section 9.8.2: IDE controllers
|
||||
- Section 9.8.2: IDE controllers
|
||||
|
||||
-- Section 9.9: floppy controllers
|
||||
- Section 9.9: floppy controllers
|
||||
|
||||
-- Section 9.10: GPE block devices
|
||||
- Section 9.10: GPE block devices
|
||||
|
||||
-- Section 9.15: PC/AT RTC/CMOS devices
|
||||
- Section 9.15: PC/AT RTC/CMOS devices
|
||||
|
||||
-- Section 9.16: user presence detection devices
|
||||
- Section 9.16: user presence detection devices
|
||||
|
||||
-- Section 9.17: I/O APIC devices; all GICs must be enumerable via MADT
|
||||
- Section 9.17: I/O APIC devices; all GICs must be enumerable via MADT
|
||||
|
||||
-- Section 9.18: time and alarm devices (see 9.15)
|
||||
- Section 9.18: time and alarm devices (see 9.15)
|
||||
|
||||
-- Section 10: power source and power meter devices
|
||||
- Section 10: power source and power meter devices
|
||||
|
||||
-- Section 11: thermal management
|
||||
- Section 11: thermal management
|
||||
|
||||
-- Section 12: embedded controllers interface
|
||||
- Section 12: embedded controllers interface
|
||||
|
||||
-- Section 13: SMBus interfaces
|
||||
- Section 13: SMBus interfaces
|
||||
|
||||
|
||||
This also means that there is no support for the following objects:
|
||||
|
||||
==== =========================== ==== ==========
|
||||
Name Section Name Section
|
||||
---- ------------ ---- ------------
|
||||
==== =========================== ==== ==========
|
||||
_ALC 9.3.4 _FDM 9.10.3
|
||||
_ALI 9.3.2 _FIX 6.2.7
|
||||
_ALP 9.3.6 _GAI 10.4.5
|
||||
@ -619,4 +735,4 @@ _DCK 6.5.2 _UPD 9.16.1
|
||||
_EC 12.12 _UPP 9.16.2
|
||||
_FDE 9.10.1 _WPC 10.5.2
|
||||
_FDI 9.10.2 _WPP 10.5.3
|
||||
|
||||
==== =========================== ==== ==========
|
@ -1,5 +1,7 @@
|
||||
=====================
|
||||
ACPI on ARMv8 Servers
|
||||
---------------------
|
||||
=====================
|
||||
|
||||
ACPI can be used for ARMv8 general purpose servers designed to follow
|
||||
the ARM SBSA (Server Base System Architecture) [0] and SBBR (Server
|
||||
Base Boot Requirements) [1] specifications. Please note that the SBBR
|
||||
@ -34,28 +36,28 @@ of the summary text almost directly, to be honest.
|
||||
|
||||
The short form of the rationale for ACPI on ARM is:
|
||||
|
||||
-- ACPI’s byte code (AML) allows the platform to encode hardware behavior,
|
||||
- ACPI’s byte code (AML) allows the platform to encode hardware behavior,
|
||||
while DT explicitly does not support this. For hardware vendors, being
|
||||
able to encode behavior is a key tool used in supporting operating
|
||||
system releases on new hardware.
|
||||
|
||||
-- ACPI’s OSPM defines a power management model that constrains what the
|
||||
- ACPI’s OSPM defines a power management model that constrains what the
|
||||
platform is allowed to do into a specific model, while still providing
|
||||
flexibility in hardware design.
|
||||
|
||||
-- In the enterprise server environment, ACPI has established bindings (such
|
||||
- In the enterprise server environment, ACPI has established bindings (such
|
||||
as for RAS) which are currently used in production systems. DT does not.
|
||||
Such bindings could be defined in DT at some point, but doing so means ARM
|
||||
and x86 would end up using completely different code paths in both firmware
|
||||
and the kernel.
|
||||
|
||||
-- Choosing a single interface to describe the abstraction between a platform
|
||||
- Choosing a single interface to describe the abstraction between a platform
|
||||
and an OS is important. Hardware vendors would not be required to implement
|
||||
both DT and ACPI if they want to support multiple operating systems. And,
|
||||
agreeing on a single interface instead of being fragmented into per OS
|
||||
interfaces makes for better interoperability overall.
|
||||
|
||||
-- The new ACPI governance process works well and Linux is now at the same
|
||||
- The new ACPI governance process works well and Linux is now at the same
|
||||
table as hardware vendors and other OS vendors. In fact, there is no
|
||||
longer any reason to feel that ACPI only belongs to Windows or that
|
||||
Linux is in any way secondary to Microsoft in this arena. The move of
|
||||
@ -169,31 +171,31 @@ For the ACPI core to operate properly, and in turn provide the information
|
||||
the kernel needs to configure devices, it expects to find the following
|
||||
tables (all section numbers refer to the ACPI 6.1 specification):
|
||||
|
||||
-- RSDP (Root System Description Pointer), section 5.2.5
|
||||
- RSDP (Root System Description Pointer), section 5.2.5
|
||||
|
||||
-- XSDT (eXtended System Description Table), section 5.2.8
|
||||
- XSDT (eXtended System Description Table), section 5.2.8
|
||||
|
||||
-- FADT (Fixed ACPI Description Table), section 5.2.9
|
||||
- FADT (Fixed ACPI Description Table), section 5.2.9
|
||||
|
||||
-- DSDT (Differentiated System Description Table), section
|
||||
- DSDT (Differentiated System Description Table), section
|
||||
5.2.11.1
|
||||
|
||||
-- MADT (Multiple APIC Description Table), section 5.2.12
|
||||
- MADT (Multiple APIC Description Table), section 5.2.12
|
||||
|
||||
-- GTDT (Generic Timer Description Table), section 5.2.24
|
||||
- GTDT (Generic Timer Description Table), section 5.2.24
|
||||
|
||||
-- If PCI is supported, the MCFG (Memory mapped ConFiGuration
|
||||
- If PCI is supported, the MCFG (Memory mapped ConFiGuration
|
||||
Table), section 5.2.6, specifically Table 5-31.
|
||||
|
||||
-- If booting without a console=<device> kernel parameter is
|
||||
- If booting without a console=<device> kernel parameter is
|
||||
supported, the SPCR (Serial Port Console Redirection table),
|
||||
section 5.2.6, specifically Table 5-31.
|
||||
|
||||
-- If necessary to describe the I/O topology, SMMUs and GIC ITSs,
|
||||
- If necessary to describe the I/O topology, SMMUs and GIC ITSs,
|
||||
the IORT (Input Output Remapping Table, section 5.2.6, specifically
|
||||
Table 5-31).
|
||||
|
||||
-- If NUMA is supported, the SRAT (System Resource Affinity Table)
|
||||
- If NUMA is supported, the SRAT (System Resource Affinity Table)
|
||||
and SLIT (System Locality distance Information Table), sections
|
||||
5.2.16 and 5.2.17, respectively.
|
||||
|
||||
@ -269,9 +271,9 @@ describes how to define the structure of an object returned via _DSD, and
|
||||
how specific data structures are defined by specific UUIDs. Linux should
|
||||
only use the _DSD Device Properties UUID [5]:
|
||||
|
||||
-- UUID: daffd814-6eba-4d8c-8a91-bc9bbf4aa301
|
||||
- UUID: daffd814-6eba-4d8c-8a91-bc9bbf4aa301
|
||||
|
||||
-- http://www.uefi.org/sites/default/files/resources/_DSD-device-properties-UUID.pdf
|
||||
- http://www.uefi.org/sites/default/files/resources/_DSD-device-properties-UUID.pdf
|
||||
|
||||
The UEFI Forum provides a mechanism for registering device properties [4]
|
||||
so that they may be used across all operating systems supporting ACPI.
|
||||
@ -327,10 +329,10 @@ turning a device full off.
|
||||
|
||||
There are two options for using those Power Resources. They can:
|
||||
|
||||
-- be managed in a _PSx method which gets called on entry to power
|
||||
- be managed in a _PSx method which gets called on entry to power
|
||||
state Dx.
|
||||
|
||||
-- be declared separately as power resources with their own _ON and _OFF
|
||||
- be declared separately as power resources with their own _ON and _OFF
|
||||
methods. They are then tied back to D-states for a particular device
|
||||
via _PRx which specifies which power resources a device needs to be on
|
||||
while in Dx. Kernel then tracks number of devices using a power resource
|
||||
@ -339,16 +341,16 @@ There are two options for using those Power Resources. They can:
|
||||
The kernel ACPI code will also assume that the _PSx methods follow the normal
|
||||
ACPI rules for such methods:
|
||||
|
||||
-- If either _PS0 or _PS3 is implemented, then the other method must also
|
||||
- If either _PS0 or _PS3 is implemented, then the other method must also
|
||||
be implemented.
|
||||
|
||||
-- If a device requires usage or setup of a power resource when on, the ASL
|
||||
- If a device requires usage or setup of a power resource when on, the ASL
|
||||
should organize that it is allocated/enabled using the _PS0 method.
|
||||
|
||||
-- Resources allocated or enabled in the _PS0 method should be disabled
|
||||
- Resources allocated or enabled in the _PS0 method should be disabled
|
||||
or de-allocated in the _PS3 method.
|
||||
|
||||
-- Firmware will leave the resources in a reasonable state before handing
|
||||
- Firmware will leave the resources in a reasonable state before handing
|
||||
over control to the kernel.
|
||||
|
||||
Such code in _PSx methods will of course be very platform specific. But,
|
||||
@ -394,52 +396,52 @@ else must be discovered by the driver probe function. Then, have the rest
|
||||
of the driver operate off of the contents of that struct. Doing so should
|
||||
allow most divergence between ACPI and DT functionality to be kept local to
|
||||
the probe function instead of being scattered throughout the driver. For
|
||||
example:
|
||||
example::
|
||||
|
||||
static int device_probe_dt(struct platform_device *pdev)
|
||||
{
|
||||
/* DT specific functionality */
|
||||
...
|
||||
}
|
||||
static int device_probe_dt(struct platform_device *pdev)
|
||||
{
|
||||
/* DT specific functionality */
|
||||
...
|
||||
}
|
||||
|
||||
static int device_probe_acpi(struct platform_device *pdev)
|
||||
{
|
||||
/* ACPI specific functionality */
|
||||
...
|
||||
}
|
||||
static int device_probe_acpi(struct platform_device *pdev)
|
||||
{
|
||||
/* ACPI specific functionality */
|
||||
...
|
||||
}
|
||||
|
||||
static int device_probe(struct platform_device *pdev)
|
||||
{
|
||||
...
|
||||
struct device_node *node = pdev->dev.of_node;
|
||||
...
|
||||
static int device_probe(struct platform_device *pdev)
|
||||
{
|
||||
...
|
||||
struct device_node *node = pdev->dev.of_node;
|
||||
...
|
||||
|
||||
if (node)
|
||||
ret = device_probe_dt(pdev);
|
||||
else if (ACPI_HANDLE(&pdev->dev))
|
||||
ret = device_probe_acpi(pdev);
|
||||
else
|
||||
/* other initialization */
|
||||
...
|
||||
/* Continue with any generic probe operations */
|
||||
...
|
||||
}
|
||||
if (node)
|
||||
ret = device_probe_dt(pdev);
|
||||
else if (ACPI_HANDLE(&pdev->dev))
|
||||
ret = device_probe_acpi(pdev);
|
||||
else
|
||||
/* other initialization */
|
||||
...
|
||||
/* Continue with any generic probe operations */
|
||||
...
|
||||
}
|
||||
|
||||
DO keep the MODULE_DEVICE_TABLE entries together in the driver to make it
|
||||
clear the different names the driver is probed for, both from DT and from
|
||||
ACPI:
|
||||
ACPI::
|
||||
|
||||
static struct of_device_id virtio_mmio_match[] = {
|
||||
{ .compatible = "virtio,mmio", },
|
||||
{ }
|
||||
};
|
||||
MODULE_DEVICE_TABLE(of, virtio_mmio_match);
|
||||
static struct of_device_id virtio_mmio_match[] = {
|
||||
{ .compatible = "virtio,mmio", },
|
||||
{ }
|
||||
};
|
||||
MODULE_DEVICE_TABLE(of, virtio_mmio_match);
|
||||
|
||||
static const struct acpi_device_id virtio_mmio_acpi_match[] = {
|
||||
{ "LNRO0005", },
|
||||
{ }
|
||||
};
|
||||
MODULE_DEVICE_TABLE(acpi, virtio_mmio_acpi_match);
|
||||
static const struct acpi_device_id virtio_mmio_acpi_match[] = {
|
||||
{ "LNRO0005", },
|
||||
{ }
|
||||
};
|
||||
MODULE_DEVICE_TABLE(acpi, virtio_mmio_acpi_match);
|
||||
|
||||
|
||||
ASWG
|
||||
@ -471,7 +473,8 @@ Linux Code
|
||||
Individual items specific to Linux on ARM, contained in the Linux
|
||||
source code, are in the list that follows:
|
||||
|
||||
ACPI_OS_NAME This macro defines the string to be returned when
|
||||
ACPI_OS_NAME
|
||||
This macro defines the string to be returned when
|
||||
an ACPI method invokes the _OS method. On ARM64
|
||||
systems, this macro will be "Linux" by default.
|
||||
The command line parameter acpi_os=<string>
|
||||
@ -482,38 +485,44 @@ ACPI_OS_NAME This macro defines the string to be returned when
|
||||
ACPI Objects
|
||||
------------
|
||||
Detailed expectations for ACPI tables and objects are listed in the file
|
||||
Documentation/arm64/acpi_object_usage.txt.
|
||||
Documentation/arm64/acpi_object_usage.rst.
|
||||
|
||||
|
||||
References
|
||||
----------
|
||||
[0] http://silver.arm.com -- document ARM-DEN-0029, or newer
|
||||
[0] http://silver.arm.com
|
||||
document ARM-DEN-0029, or newer:
|
||||
"Server Base System Architecture", version 2.3, dated 27 Mar 2014
|
||||
|
||||
[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0044a/Server_Base_Boot_Requirements.pdf
|
||||
Document ARM-DEN-0044A, or newer: "Server Base Boot Requirements, System
|
||||
Software on ARM Platforms", dated 16 Aug 2014
|
||||
|
||||
[2] http://www.secretlab.ca/archives/151, 10 Jan 2015, Copyright (c) 2015,
|
||||
[2] http://www.secretlab.ca/archives/151,
|
||||
10 Jan 2015, Copyright (c) 2015,
|
||||
Linaro Ltd., written by Grant Likely.
|
||||
|
||||
[3] AMD ACPI for Seattle platform documentation:
|
||||
[3] AMD ACPI for Seattle platform documentation
|
||||
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Seattle_ACPI_Guide.pdf
|
||||
|
||||
[4] http://www.uefi.org/acpi -- please see the link for the "ACPI _DSD Device
|
||||
|
||||
[4] http://www.uefi.org/acpi
|
||||
please see the link for the "ACPI _DSD Device
|
||||
Property Registry Instructions"
|
||||
|
||||
[5] http://www.uefi.org/acpi -- please see the link for the "_DSD (Device
|
||||
[5] http://www.uefi.org/acpi
|
||||
please see the link for the "_DSD (Device
|
||||
Specific Data) Implementation Guide"
|
||||
|
||||
[6] Kernel code for the unified device property interface can be found in
|
||||
[6] Kernel code for the unified device
|
||||
property interface can be found in
|
||||
include/linux/property.h and drivers/base/property.c.
|
||||
|
||||
|
||||
Authors
|
||||
-------
|
||||
Al Stone <al.stone@linaro.org>
|
||||
Graeme Gregory <graeme.gregory@linaro.org>
|
||||
Hanjun Guo <hanjun.guo@linaro.org>
|
||||
- Al Stone <al.stone@linaro.org>
|
||||
- Graeme Gregory <graeme.gregory@linaro.org>
|
||||
- Hanjun Guo <hanjun.guo@linaro.org>
|
||||
|
||||
Grant Likely <grant.likely@linaro.org>, for the "Why ACPI on ARM?" section
|
||||
- Grant Likely <grant.likely@linaro.org>, for the "Why ACPI on ARM?" section
|
@ -1,7 +1,9 @@
|
||||
Booting AArch64 Linux
|
||||
=====================
|
||||
=====================
|
||||
Booting AArch64 Linux
|
||||
=====================
|
||||
|
||||
Author: Will Deacon <will.deacon@arm.com>
|
||||
|
||||
Date : 07 September 2012
|
||||
|
||||
This document is based on the ARM booting document by Russell King and
|
||||
@ -12,7 +14,7 @@ The AArch64 exception model is made up of a number of exception levels
|
||||
counterpart. EL2 is the hypervisor level and exists only in non-secure
|
||||
mode. EL3 is the highest priority level and exists only in secure mode.
|
||||
|
||||
For the purposes of this document, we will use the term `boot loader'
|
||||
For the purposes of this document, we will use the term `boot loader`
|
||||
simply to define all software that executes on the CPU(s) before control
|
||||
is passed to the Linux kernel. This may include secure monitor and
|
||||
hypervisor code, or it may just be a handful of instructions for
|
||||
@ -70,7 +72,7 @@ Image target is available instead.
|
||||
|
||||
Requirement: MANDATORY
|
||||
|
||||
The decompressed kernel image contains a 64-byte header as follows:
|
||||
The decompressed kernel image contains a 64-byte header as follows::
|
||||
|
||||
u32 code0; /* Executable code */
|
||||
u32 code1; /* Executable code */
|
||||
@ -103,19 +105,26 @@ Header notes:
|
||||
|
||||
- The flags field (introduced in v3.17) is a little-endian 64-bit field
|
||||
composed as follows:
|
||||
Bit 0: Kernel endianness. 1 if BE, 0 if LE.
|
||||
Bit 1-2: Kernel Page size.
|
||||
0 - Unspecified.
|
||||
1 - 4K
|
||||
2 - 16K
|
||||
3 - 64K
|
||||
Bit 3: Kernel physical placement
|
||||
0 - 2MB aligned base should be as close as possible
|
||||
to the base of DRAM, since memory below it is not
|
||||
accessible via the linear mapping
|
||||
1 - 2MB aligned base may be anywhere in physical
|
||||
memory
|
||||
Bits 4-63: Reserved.
|
||||
|
||||
============= ===============================================================
|
||||
Bit 0 Kernel endianness. 1 if BE, 0 if LE.
|
||||
Bit 1-2 Kernel Page size.
|
||||
|
||||
* 0 - Unspecified.
|
||||
* 1 - 4K
|
||||
* 2 - 16K
|
||||
* 3 - 64K
|
||||
Bit 3 Kernel physical placement
|
||||
|
||||
0
|
||||
2MB aligned base should be as close as possible
|
||||
to the base of DRAM, since memory below it is not
|
||||
accessible via the linear mapping
|
||||
1
|
||||
2MB aligned base may be anywhere in physical
|
||||
memory
|
||||
Bits 4-63 Reserved.
|
||||
============= ===============================================================
|
||||
|
||||
- When image_size is zero, a bootloader should attempt to keep as much
|
||||
memory as possible free for use by the kernel immediately after the
|
||||
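
The flags layout described above (bit 0 for endianness, bits 1-2 for page size, bit 3 for physical placement) can be decoded with a few shifts and masks. The following is a sketch only: the struct and function names are hypothetical, and the caller is assumed to have already converted the little-endian field to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the flags field. The names are hypothetical; only
 * the bit layout comes from the header notes above. */
struct image_flags {
    bool big_endian;        /* bit 0: 1 if BE, 0 if LE */
    unsigned page_size_kb;  /* bits 1-2: 0 = unspecified, else 4, 16 or 64 */
    bool anywhere;          /* bit 3: 2MB aligned base may be anywhere */
};

/* 'flags' must already be in host byte order. Bits 4-63 are reserved
 * and are ignored here. */
struct image_flags decode_image_flags(uint64_t flags)
{
    static const unsigned page_kb[4] = { 0, 4, 16, 64 };
    struct image_flags out;

    out.big_endian = (flags & 1u) != 0;
    out.page_size_kb = page_kb[(flags >> 1) & 3u];
    out.anywhere = ((flags >> 3) & 1u) != 0;
    return out;
}
```

A page_size_kb of 0 corresponds to the "Unspecified" encoding rather than to an invalid header.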
@ -147,19 +156,22 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
corrupted by bogus network packets or disk data. This will save
|
||||
you many hours of debug.
|
||||
|
||||
- Primary CPU general-purpose register settings
|
||||
x0 = physical address of device tree blob (dtb) in system RAM.
|
||||
x1 = 0 (reserved for future use)
|
||||
x2 = 0 (reserved for future use)
|
||||
x3 = 0 (reserved for future use)
|
||||
- Primary CPU general-purpose register settings:
|
||||
|
||||
- x0 = physical address of device tree blob (dtb) in system RAM.
|
||||
- x1 = 0 (reserved for future use)
|
||||
- x2 = 0 (reserved for future use)
|
||||
- x3 = 0 (reserved for future use)
|
||||
|
||||
- CPU mode
|
||||
|
||||
All forms of interrupts must be masked in PSTATE.DAIF (Debug, SError,
|
||||
IRQ and FIQ).
|
||||
The CPU must be in either EL2 (RECOMMENDED in order to have access to
|
||||
the virtualisation extensions) or non-secure EL1.
|
||||
|
||||
- Caches, MMUs
|
||||
|
||||
The MMU must be off.
|
||||
Instruction cache may be on or off.
|
||||
The address range corresponding to the loaded kernel image must be
|
||||
@ -172,18 +184,21 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
operations (not recommended) must be configured and disabled.
|
||||
|
||||
- Architected timers
|
||||
|
||||
CNTFRQ must be programmed with the timer frequency and CNTVOFF must
|
||||
be programmed with a consistent value on all CPUs. If entering the
|
||||
kernel at EL1, CNTHCTL_EL2 must have EL1PCTEN (bit 0) set where
|
||||
available.
|
||||
|
||||
- Coherency
|
||||
|
||||
All CPUs to be booted by the kernel must be part of the same coherency
|
||||
domain on entry to the kernel. This may require IMPLEMENTATION DEFINED
|
||||
initialisation to enable the receiving of maintenance operations on
|
||||
each CPU.
|
||||
|
||||
- System registers
|
||||
|
||||
All writable architected system registers at the exception level where
|
||||
the kernel image will be entered must be initialised by software at a
|
||||
higher exception level to prevent execution in an UNKNOWN state.
|
||||
@ -195,28 +210,40 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
|
||||
For systems with a GICv3 interrupt controller to be used in v3 mode:
|
||||
- If EL3 is present:
|
||||
ICC_SRE_EL3.Enable (bit 3) must be initialised to 0b1.
|
||||
ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1.
|
||||
|
||||
- ICC_SRE_EL3.Enable (bit 3) must be initialised to 0b1.
|
||||
- ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1.
|
||||
|
||||
- If the kernel is entered at EL1:
|
||||
ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
|
||||
ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
|
||||
|
||||
- ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
|
||||
- ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
|
||||
|
||||
- The DT or ACPI tables must describe a GICv3 interrupt controller.
|
||||
|
||||
For systems with a GICv3 interrupt controller to be used in
|
||||
compatibility (v2) mode:
|
||||
|
||||
- If EL3 is present:
|
||||
ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b0.
|
||||
|
||||
ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b0.
|
||||
|
||||
- If the kernel is entered at EL1:
|
||||
ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b0.
|
||||
|
||||
ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b0.
|
||||
|
||||
- The DT or ACPI tables must describe a GICv2 interrupt controller.
|
||||
|
||||
For CPUs with pointer authentication functionality:
|
||||
- If EL3 is present:
|
||||
SCR_EL3.APK (bit 16) must be initialised to 0b1
|
||||
SCR_EL3.API (bit 17) must be initialised to 0b1
|
||||
|
||||
- SCR_EL3.APK (bit 16) must be initialised to 0b1
|
||||
- SCR_EL3.API (bit 17) must be initialised to 0b1
|
||||
|
||||
- If the kernel is entered at EL1:
|
||||
HCR_EL2.APK (bit 40) must be initialised to 0b1
|
||||
HCR_EL2.API (bit 41) must be initialised to 0b1
|
||||
|
||||
- HCR_EL2.APK (bit 40) must be initialised to 0b1
|
||||
- HCR_EL2.API (bit 41) must be initialised to 0b1
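
The pointer authentication requirements above reduce to checking two bits in each register. As a hedged sketch, the helpers below test a register value that firmware is assumed to have read at the appropriate exception level. The function names are hypothetical, and only the bit positions come from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions quoted from the list above. UINT64_C keeps the shifts
 * 64 bits wide, which matters for bits 40 and 41. */
#define SCR_EL3_APK (UINT64_C(1) << 16)
#define SCR_EL3_API (UINT64_C(1) << 17)
#define HCR_EL2_APK (UINT64_C(1) << 40)
#define HCR_EL2_API (UINT64_C(1) << 41)

/* True if both pointer authentication bits are set in an SCR_EL3 value. */
bool scr_el3_ptrauth_ok(uint64_t scr_el3)
{
    const uint64_t need = SCR_EL3_APK | SCR_EL3_API;

    return (scr_el3 & need) == need;
}

/* True if both pointer authentication bits are set in an HCR_EL2 value. */
bool hcr_el2_ptrauth_ok(uint64_t hcr_el2)
{
    const uint64_t need = HCR_EL2_APK | HCR_EL2_API;

    return (hcr_el2 & need) == need;
}
```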
|
||||
|
||||
The requirements described above for CPU mode, caches, MMUs, architected
|
||||
timers, coherency and system registers apply to all CPUs. All CPUs must
|
@ -1,5 +1,6 @@
|
||||
ARM64 CPU Feature Registers
|
||||
===========================
|
||||
===========================
|
||||
ARM64 CPU Feature Registers
|
||||
===========================
|
||||
|
||||
Author: Suzuki K Poulose <suzuki.poulose@arm.com>
|
||||
|
||||
@ -9,7 +10,7 @@ registers to userspace. The availability of this ABI is advertised
|
||||
via the HWCAP_CPUID in HWCAPs.
|
||||
|
||||
1. Motivation
|
||||
---------------
|
||||
-------------
|
||||
|
||||
The ARM architecture defines a set of feature registers, which describe
|
||||
the capabilities of the CPU/system. Access to these system registers is
|
||||
@ -33,9 +34,10 @@ there are some issues with their usage.
|
||||
|
||||
|
||||
2. Requirements
|
||||
-----------------
|
||||
---------------
|
||||
|
||||
a) Safety:
|
||||
|
||||
a) Safety :
|
||||
Applications should be able to use the information provided by the
|
||||
infrastructure to run safely across the system. This has greater
|
||||
implications on a system with heterogeneous CPUs.
|
||||
@ -47,7 +49,8 @@ there are some issues with their usage.
|
||||
Otherwise an application could crash when scheduled on the CPU
|
||||
which doesn't support CRC32.
|
||||
|
||||
b) Security :
|
||||
b) Security:
|
||||
|
||||
Applications should only be able to receive information that is
|
||||
relevant to the normal operation in userspace. Hence, some of the
|
||||
fields are masked out (i.e., made invisible) and their values are set to
|
||||
@ -58,10 +61,12 @@ there are some issues with their usage.
|
||||
(even when the CPU provides it).
|
||||
|
||||
c) Implementation Defined Features
|
||||
|
||||
The infrastructure doesn't expose any register which is
|
||||
IMPLEMENTATION DEFINED as per ARMv8-A Architecture.
|
||||
|
||||
d) CPU Identification :
|
||||
d) CPU Identification:
|
||||
|
||||
MIDR_EL1 is exposed to help identify the processor. On a
|
||||
heterogeneous system, this could be racy (just like getcpu()). The
|
||||
process could be migrated to another CPU by the time it uses the
|
||||
@ -70,7 +75,7 @@ there are some issues with their usage.
|
||||
currently executing on. The REVIDR is not exposed due to this
|
||||
constraint, as REVIDR makes sense only in conjunction with the
|
||||
MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
|
||||
at:
|
||||
at::
|
||||
|
||||
/sys/devices/system/cpu/cpu$ID/regs/identification/
|
||||
\- midr
|
||||
@ -85,7 +90,8 @@ exception and ends up in SIGILL being delivered to the process.
|
||||
The infrastructure hooks into the exception handler and emulates the
|
||||
operation if the source belongs to the supported system register space.
|
||||
|
||||
The infrastructure emulates only the following system register space:
|
||||
The infrastructure emulates only the following system register space::
|
||||
|
||||
Op0=3, Op1=0, CRn=0, CRm=0,4,5,6,7
|
||||
|
||||
(See Table C5-6 'System instruction encodings for non-Debug System
|
||||
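
The emulated register space quoted above (Op0=3, Op1=0, CRn=0, CRm=0,4,5,6,7) amounts to a simple membership test on the encoding fields. The following sketch uses a hypothetical helper name and is not the kernel's own check. The quoted line places no constraint on Op2, so it is left out.

```c
#include <stdbool.h>

/* Hypothetical helper: true if a system register encoding falls in the
 * space Op0=3, Op1=0, CRn=0, CRm in {0, 4, 5, 6, 7}. */
bool in_emulated_id_space(unsigned op0, unsigned op1,
                          unsigned crn, unsigned crm)
{
    if (op0 != 3 || op1 != 0 || crn != 0)
        return false;

    return crm == 0 || (crm >= 4 && crm <= 7);
}
```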
@ -107,73 +113,76 @@ infrastructure:
|
||||
-------------------------------------------
|
||||
|
||||
1) ID_AA64ISAR0_EL1 - Instruction Set Attribute Register 0
|
||||
x--------------------------------------------------x
|
||||
|
||||
+------------------------------+---------+---------+
|
||||
| Name | bits | visible |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| TS | [55-52] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| FHM | [51-48] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| DP | [47-44] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| SM4 | [43-40] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| SM3 | [39-36] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| SHA3 | [35-32] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| RDM | [31-28] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| ATOMICS | [23-20] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| CRC32 | [19-16] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| SHA2 | [15-12] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| SHA1 | [11-8] | y |
|
||||
|--------------------------------------------------|
|
||||
+------------------------------+---------+---------+
|
||||
| AES | [7-4] | y |
|
||||
x--------------------------------------------------x
|
||||
+------------------------------+---------+---------+
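
Every field in the table above is four bits wide, so a single shift and mask extracts any of them. The sketch below assumes the register value has already been obtained (for example through the emulated access this document describes). The helper and macro names are hypothetical; the shift values are the low bit of each range in the "bits" column.

```c
#include <stdint.h>

/* Low bit of selected fields, taken from the "bits" column above. */
#define ISAR0_AES_SHIFT      4
#define ISAR0_CRC32_SHIFT    16
#define ISAR0_ATOMICS_SHIFT  20
#define ISAR0_TS_SHIFT       52

/* Extract the 4-bit field whose lowest bit is at position 'shift'. */
static inline unsigned id_field(uint64_t reg, unsigned shift)
{
    return (unsigned)((reg >> shift) & 0xf);
}
```

For instance, id_field(value, ISAR0_CRC32_SHIFT) yields the CRC32 field from bits [19-16].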
|
||||
|
||||
|
||||
2) ID_AA64PFR0_EL1 - Processor Feature Register 0
x--------------------------------------------------x

+------------------------------+---------+---------+
| Name                         |  bits   | visible |
|--------------------------------------------------|
+------------------------------+---------+---------+
| DIT                          | [51-48] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| SVE                          | [35-32] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| GIC                          | [27-24] |    n    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| AdvSIMD                      | [23-20] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| FP                           | [19-16] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| EL3                          | [15-12] |    n    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| EL2                          | [11-8]  |    n    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| EL1                          | [7-4]   |    n    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| EL0                          | [3-0]   |    n    |
x--------------------------------------------------x
+------------------------------+---------+---------+

3) MIDR_EL1 - Main ID Register
x--------------------------------------------------x

+------------------------------+---------+---------+
| Name                         |  bits   | visible |
|--------------------------------------------------|
+------------------------------+---------+---------+
| Implementer                  | [31-24] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| Variant                      | [23-20] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| Architecture                 | [19-16] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| PartNum                      | [15-4]  |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| Revision                     | [3-0]   |    y    |
x--------------------------------------------------x
+------------------------------+---------+---------+
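
The MIDR_EL1 field layout in the table above can be unpacked with plain shifts and masks. The following is a minimal sketch and not part of the original document: ``struct midr_fields`` and ``decode_midr()`` are hypothetical names, and the raw value is assumed to have been obtained with an ``mrs`` read as in the sample program of Appendix I.

```c
#include <stdint.h>

/* Hypothetical container for the MIDR_EL1 fields listed in the table. */
struct midr_fields {
	unsigned int implementer;	/* bits [31-24] */
	unsigned int variant;		/* bits [23-20] */
	unsigned int architecture;	/* bits [19-16] */
	unsigned int partnum;		/* bits [15-4]  */
	unsigned int revision;		/* bits [3-0]   */
};

/* Split a raw MIDR_EL1 value into its fields using shifts and masks. */
static struct midr_fields decode_midr(uint64_t midr)
{
	struct midr_fields f;

	f.implementer  = (midr >> 24) & 0xff;
	f.variant      = (midr >> 20) & 0xf;
	f.architecture = (midr >> 16) & 0xf;
	f.partnum      = (midr >> 4) & 0xfff;
	f.revision     = midr & 0xf;
	return f;
}
```

For example, a raw value of 0x410fd034 would decode to implementer 0x41, part number 0xd03 and revision 4.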

NOTE: The 'visible' fields of MIDR_EL1 will contain the value
as available on the CPU where it is fetched and is not a system
@@ -181,90 +190,92 @@ infrastructure:

4) ID_AA64ISAR1_EL1 - Instruction set attribute register 1

x--------------------------------------------------x
+------------------------------+---------+---------+
| Name                         |  bits   | visible |
|--------------------------------------------------|
+------------------------------+---------+---------+
| GPI                          | [31-28] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| GPA                          | [27-24] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| LRCPC                        | [23-20] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| FCMA                         | [19-16] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| JSCVT                        | [15-12] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| API                          | [11-8]  |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| APA                          | [7-4]   |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| DPB                          | [3-0]   |    y    |
x--------------------------------------------------x
+------------------------------+---------+---------+

5) ID_AA64MMFR2_EL1 - Memory model feature register 2

x--------------------------------------------------x
+------------------------------+---------+---------+
| Name                         |  bits   | visible |
|--------------------------------------------------|
+------------------------------+---------+---------+
| AT                           | [35-32] |    y    |
x--------------------------------------------------x
+------------------------------+---------+---------+

6) ID_AA64ZFR0_EL1 - SVE feature ID register 0

x--------------------------------------------------x
+------------------------------+---------+---------+
| Name                         |  bits   | visible |
|--------------------------------------------------|
+------------------------------+---------+---------+
| SM4                          | [43-40] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| SHA3                         | [35-32] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| BitPerm                      | [19-16] |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| AES                          | [7-4]   |    y    |
|--------------------------------------------------|
+------------------------------+---------+---------+
| SVEVer                       | [3-0]   |    y    |
x--------------------------------------------------x
+------------------------------+---------+---------+

Appendix I: Example
---------------------------
-------------------

/*
 * Sample program to demonstrate the MRS emulation ABI.
 *
 * Copyright (C) 2015-2016, ARM Ltd
 *
 * Author: Suzuki K Poulose <suzuki.poulose@arm.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 */
::

#include <asm/hwcap.h>
#include <stdio.h>
#include <sys/auxv.h>
/*
 * Sample program to demonstrate the MRS emulation ABI.
 *
 * Copyright (C) 2015-2016, ARM Ltd
 *
 * Author: Suzuki K Poulose <suzuki.poulose@arm.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 */

#define get_cpu_ftr(id) ({ \
#include <asm/hwcap.h>
#include <stdio.h>
#include <sys/auxv.h>

#define get_cpu_ftr(id) ({ \
	unsigned long __val; \
	asm("mrs %0, "#id : "=r" (__val)); \
	printf("%-20s: 0x%016lx\n", #id, __val); \
})

int main(void)
{
int main(void)
{

	if (!(getauxval(AT_HWCAP) & HWCAP_CPUID)) {
		fputs("CPUID registers unavailable\n", stderr);
@@ -284,13 +295,10 @@ int main(void)
	get_cpu_ftr(MPIDR_EL1);
	get_cpu_ftr(REVIDR_EL1);

#if 0
#if 0
	/* Unexposed register access causes SIGILL */
	get_cpu_ftr(ID_MMFR0_EL1);
#endif
#endif

	return 0;
}

}
@@ -1,3 +1,4 @@
================
ARM64 ELF hwcaps
================

@@ -15,16 +16,16 @@ of flags called hwcaps, exposed in the auxiliary vector.

Userspace software can test for features by acquiring the AT_HWCAP or
AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant
flags are set, e.g.
flags are set, e.g.::

	bool floating_point_is_present(void)
	{
		unsigned long hwcaps = getauxval(AT_HWCAP);
		if (hwcaps & HWCAP_FP)
			return true;
	bool floating_point_is_present(void)
	{
		unsigned long hwcaps = getauxval(AT_HWCAP);
		if (hwcaps & HWCAP_FP)
			return true;

		return false;
	}
		return false;
	}

Where software relies on a feature described by a hwcap, it should check
the relevant hwcap flag to verify that the feature is present before
@@ -45,7 +46,7 @@ userspace code at EL0. These hwcaps are defined in terms of ID register
fields, and should be interpreted with reference to the definition of
these fields in the ARM Architecture Reference Manual (ARM ARM).

Such hwcaps are described below in the form:
Such hwcaps are described below in the form::

    Functionality implied by idreg.field == val.

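
As an illustration of this form, the sketch below reads a 4-bit field out of a raw ID register value and compares it with the value named in a hwcap description. It is not taken from the original text: ``id_field()``, ``implies_aes()`` and ``implies_pmull()`` are hypothetical helpers, the AES field position ([7-4] of ID_AA64ISAR0_EL1) is the one given in cpu-feature-registers.rst, and treating a larger field value as also implying the smaller ones is an assumption that should be checked against the ARM ARM for any other field.

```c
#include <stdint.h>

/* Extract the 4-bit ID register field that starts at bit 'shift'. */
static unsigned int id_field(uint64_t reg, unsigned int shift)
{
	return (reg >> shift) & 0xf;
}

/* Hypothetical helper: ID_AA64ISAR0_EL1.AES == 0b0001 (HWCAP_AES). */
static int implies_aes(uint64_t isar0)
{
	/* Assumption: a larger AES field value includes this level. */
	return id_field(isar0, 4) >= 0x1;
}

/* Hypothetical helper: ID_AA64ISAR0_EL1.AES == 0b0010 (HWCAP_PMULL). */
static int implies_pmull(uint64_t isar0)
{
	return id_field(isar0, 4) >= 0x2;
}
```

Userspace that only needs a yes/no answer would normally keep testing the hwcap flag itself; decoding the register is mainly useful where the flag and the field need to be cross-checked.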
@@ -64,75 +65,58 @@ reference to ID registers, and may refer to other documentation.
---------------------------------

HWCAP_FP

    Functionality implied by ID_AA64PFR0_EL1.FP == 0b0000.

HWCAP_ASIMD

    Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0000.

HWCAP_EVTSTRM

    The generic timer is configured to generate events at a frequency of
    approximately 100 kHz.

HWCAP_AES

    Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0001.

HWCAP_PMULL

    Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0010.

HWCAP_SHA1

    Functionality implied by ID_AA64ISAR0_EL1.SHA1 == 0b0001.

HWCAP_SHA2

    Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0001.

HWCAP_CRC32

    Functionality implied by ID_AA64ISAR0_EL1.CRC32 == 0b0001.

HWCAP_ATOMICS

    Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0010.

HWCAP_FPHP

    Functionality implied by ID_AA64PFR0_EL1.FP == 0b0001.

HWCAP_ASIMDHP

    Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0001.

HWCAP_CPUID

    EL0 access to certain ID registers is available, to the extent
    described by Documentation/arm64/cpu-feature-registers.txt.
    described by Documentation/arm64/cpu-feature-registers.rst.

    These ID registers may imply the availability of features.

HWCAP_ASIMDRDM

    Functionality implied by ID_AA64ISAR0_EL1.RDM == 0b0001.

HWCAP_JSCVT

    Functionality implied by ID_AA64ISAR1_EL1.JSCVT == 0b0001.

HWCAP_FCMA

    Functionality implied by ID_AA64ISAR1_EL1.FCMA == 0b0001.

HWCAP_LRCPC

    Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0001.

HWCAP_DCPOP

    Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0001.

HWCAP2_DCPODP
@@ -140,27 +124,21 @@ HWCAP2_DCPODP
    Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010.

HWCAP_SHA3

    Functionality implied by ID_AA64ISAR0_EL1.SHA3 == 0b0001.

HWCAP_SM3

    Functionality implied by ID_AA64ISAR0_EL1.SM3 == 0b0001.

HWCAP_SM4

    Functionality implied by ID_AA64ISAR0_EL1.SM4 == 0b0001.

HWCAP_ASIMDDP

    Functionality implied by ID_AA64ISAR0_EL1.DP == 0b0001.

HWCAP_SHA512

    Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0010.

HWCAP_SVE

    Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001.

HWCAP2_SVE2
@@ -188,23 +166,18 @@ HWCAP2_SVESM4
    Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001.

HWCAP_ASIMDFHM

    Functionality implied by ID_AA64ISAR0_EL1.FHM == 0b0001.

HWCAP_DIT

    Functionality implied by ID_AA64PFR0_EL1.DIT == 0b0001.

HWCAP_USCAT

    Functionality implied by ID_AA64MMFR2_EL1.AT == 0b0001.

HWCAP_ILRCPC

    Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0010.

HWCAP_FLAGM

    Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0001.

HWCAP2_FLAGM2
@@ -212,20 +185,17 @@ HWCAP2_FLAGM2
    Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010.

HWCAP_SSBS

    Functionality implied by ID_AA64PFR1_EL1.SSBS == 0b0010.

HWCAP_PACA

    Functionality implied by ID_AA64ISAR1_EL1.APA == 0b0001 or
    ID_AA64ISAR1_EL1.API == 0b0001, as described by
    Documentation/arm64/pointer-authentication.txt.
    Documentation/arm64/pointer-authentication.rst.

HWCAP_PACG

    Functionality implied by ID_AA64ISAR1_EL1.GPA == 0b0001 or
    ID_AA64ISAR1_EL1.GPI == 0b0001, as described by
    Documentation/arm64/pointer-authentication.txt.
    Documentation/arm64/pointer-authentication.rst.

HWCAP2_FRINT
@@ -1,3 +1,4 @@
====================
HugeTLBpage on ARM64
====================

@@ -31,8 +32,10 @@ and level of the page table.

The following hugepage sizes are supported -

       CONT PTE    PMD    CONT PMD    PUD
       --------    ---    --------    ---
====== ========   ====    ========    ===
-      CONT PTE    PMD    CONT PMD    PUD
====== ========   ====    ========    ===
4K:         64K     2M         32M     1G
16K:         2M    32M          1G
64K:         2M   512M         16G
====== ========   ====    ========    ===
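
The PMD and PUD columns follow from the translation granule: a table occupies one page and holds 8-byte descriptors, so a PMD entry maps ``page_size * (page_size / 8)`` bytes and a PUD entry maps that many PMD-sized blocks again. The sketch below only illustrates this arithmetic; the helper names are hypothetical, and the CONT PTE / CONT PMD sizes are not derived because they depend on the contiguous-hint span, which this excerpt does not state.

```c
#include <stdint.h>

/* Number of 8-byte descriptors that fit in one translation table page. */
static uint64_t entries_per_table(uint64_t page_size)
{
	return page_size / 8;
}

/* Bytes mapped by a single PMD block entry. */
static uint64_t pmd_size(uint64_t page_size)
{
	return page_size * entries_per_table(page_size);
}

/* Bytes mapped by a single PUD block entry. */
static uint64_t pud_size(uint64_t page_size)
{
	return pmd_size(page_size) * entries_per_table(page_size);
}
```

With 4K pages this gives 2M and 1G, matching the table; the table lists no PUD hugepage size for 16K or 64K pages, so ``pud_size()`` is only meaningful for the 4K row.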
28
Documentation/arm64/index.rst
Normal file
@@ -0,0 +1,28 @@
:orphan:

==================
ARM64 Architecture
==================

.. toctree::
    :maxdepth: 1

    acpi_object_usage
    arm-acpi
    booting
    cpu-feature-registers
    elf_hwcaps
    hugetlbpage
    legacy_instructions
    memory
    pointer-authentication
    silicon-errata
    sve
    tagged-pointers

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
@@ -1,3 +1,7 @@
===================
Legacy instructions
===================

The arm64 port of the Linux kernel provides infrastructure to support
emulation of instructions which have been deprecated, or obsoleted in
the architecture. The infrastructure code uses undefined instruction
@@ -9,19 +13,22 @@ The emulation mode can be controlled by writing to sysctl nodes
behaviours and the corresponding values of the sysctl nodes -

* Undef
  Value: 0
  Value: 0

  Generates undefined instruction abort. Default for instructions that
  have been obsoleted in the architecture, e.g., SWP

* Emulate
  Value: 1
  Value: 1

  Uses software emulation. To aid migration of software, in this mode
  usage of emulated instruction is traced as well as rate limited
  warnings are issued. This is the default for deprecated
  instructions, e.g., CP15 barriers

* Hardware Execution
  Value: 2
  Value: 2

  Although marked as deprecated, some implementations may support the
  enabling/disabling of hardware support for the execution of these
  instructions. Using hardware execution generally provides better
@@ -38,20 +45,24 @@ individual instruction notes for further information.
Supported legacy instructions
-----------------------------
* SWP{B}
  Node: /proc/sys/abi/swp
  Status: Obsolete
  Default: Undef (0)

  :Node: /proc/sys/abi/swp
  :Status: Obsolete
  :Default: Undef (0)

* CP15 Barriers
  Node: /proc/sys/abi/cp15_barrier
  Status: Deprecated
  Default: Emulate (1)

  :Node: /proc/sys/abi/cp15_barrier
  :Status: Deprecated
  :Default: Emulate (1)

* SETEND
  Node: /proc/sys/abi/setend
  Status: Deprecated
  Default: Emulate (1)*
  Note: All the CPUs on the system must have mixed endian support at EL0
  for this feature to be enabled. If a new CPU - which doesn't support mixed
  endian - is hotplugged in after this feature has been enabled, there could
  be unexpected results in the application.

  :Node: /proc/sys/abi/setend
  :Status: Deprecated
  :Default: Emulate (1)*

  Note: All the CPUs on the system must have mixed endian support at EL0
  for this feature to be enabled. If a new CPU - which doesn't support mixed
  endian - is hotplugged in after this feature has been enabled, there could
  be unexpected results in the application.
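
The three sysctl values described above map onto a small enumeration. The sketch below is illustrative only: the enum, ``parse_legacy_mode()`` and ``legacy_mode_name()`` are hypothetical names rather than kernel symbols, and the input string is assumed to be the text read back from a node such as /proc/sys/abi/swp.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names for the documented sysctl values 0, 1 and 2. */
enum legacy_mode {
	MODE_UNDEF = 0,
	MODE_EMULATE = 1,
	MODE_HW = 2,
};

/* Parse the text read from a sysctl node; returns -1 if it is not 0..2. */
static int parse_legacy_mode(const char *text)
{
	char *end;
	long value = strtol(text, &end, 10);

	if (end == text || value < MODE_UNDEF || value > MODE_HW)
		return -1;
	return (int)value;
}

/* Map a mode value to the name used in this document, or NULL. */
static const char *legacy_mode_name(int mode)
{
	switch (mode) {
	case MODE_UNDEF:
		return "Undef";
	case MODE_EMULATE:
		return "Emulate";
	case MODE_HW:
		return "Hardware Execution";
	default:
		return NULL;
	}
}
```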
98
Documentation/arm64/memory.rst
Normal file
@@ -0,0 +1,98 @@
==============================
Memory Layout on AArch64 Linux
==============================

Author: Catalin Marinas <catalin.marinas@arm.com>

This document describes the virtual memory layout used by the AArch64
Linux kernel. The architecture allows up to 4 levels of translation
tables with a 4KB page size and up to 3 levels with a 64KB page size.

AArch64 Linux uses either 3 levels or 4 levels of translation tables
with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit
(256TB) virtual addresses, respectively, for both user and kernel. With
64KB pages, only 2 levels of translation tables, allowing 42-bit (4TB)
virtual address, are used but the memory layout is the same.

User addresses have bits 63:48 set to 0 while the kernel addresses have
the same bits set to 1. TTBRx selection is given by bit 63 of the
virtual address. The swapper_pg_dir contains only kernel (global)
mappings while the user pgd contains only user (non-global) mappings.
The swapper_pg_dir address is written to TTBR1 and never written to
TTBR0.


AArch64 Linux memory layout with 4KB pages + 3 levels::

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000007fffffffff         512GB          user
  ffffff8000000000      ffffffffffffffff         512GB          kernel


AArch64 Linux memory layout with 4KB pages + 4 levels::

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000ffffffffffff         256TB          user
  ffff000000000000      ffffffffffffffff         256TB          kernel


AArch64 Linux memory layout with 64KB pages + 2 levels::

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      000003ffffffffff           4TB          user
  fffffc0000000000      ffffffffffffffff           4TB          kernel


AArch64 Linux memory layout with 64KB pages + 3 levels::

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000ffffffffffff         256TB          user
  ffff000000000000      ffffffffffffffff         256TB          kernel

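
Each of the layouts above follows from the number of virtual address bits: the user range runs from 0 to 2^VA_BITS - 1, and the kernel range is the top 2^VA_BITS bytes of the 64-bit address space. A minimal sketch of that arithmetic, using hypothetical helper names and assuming ``va_bits`` is less than 64:

```c
#include <stdint.h>

/* Last user address for a given number of virtual address bits. */
static uint64_t user_end(unsigned int va_bits)
{
	return (UINT64_C(1) << va_bits) - 1;
}

/* First kernel address: every bit from va_bits upwards is set to 1. */
static uint64_t kernel_start(unsigned int va_bits)
{
	return ~UINT64_C(0) << va_bits;
}
```

For 39, 42 and 48 bits these helpers reproduce the End column of the user rows and the Start column of the kernel rows.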

For details of the virtual kernel memory layout please see the kernel
booting log.


Translation table lookup with 4KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |         |         |         |         |
   |                 |         |         |         |         v
   |                 |         |         |         |   [11:0]  in-page offset
   |                 |         |         |         +-> [20:12] L3 index
   |                 |         |         +-----------> [29:21] L2 index
   |                 |         +---------------------> [38:30] L1 index
   |                 +-------------------------------> [47:39] L0 index
   +-------------------------------------------------> [63] TTBR0/1

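
The 4KB lookup above amounts to slicing the virtual address into four 9-bit table indices and a 12-bit offset. The sketch below shows that slicing; ``struct va_4k`` and ``split_va_4k()`` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical result of splitting a virtual address (4KB pages). */
struct va_4k {
	unsigned int ttbr;	/* bit [63]: 0 selects TTBR0, 1 selects TTBR1 */
	unsigned int l0;	/* bits [47:39] */
	unsigned int l1;	/* bits [38:30] */
	unsigned int l2;	/* bits [29:21] */
	unsigned int l3;	/* bits [20:12] */
	unsigned int offset;	/* bits [11:0] */
};

/* Slice a virtual address into table indices and in-page offset. */
static struct va_4k split_va_4k(uint64_t va)
{
	struct va_4k r;

	r.ttbr   = (va >> 63) & 0x1;
	r.l0     = (va >> 39) & 0x1ff;
	r.l1     = (va >> 30) & 0x1ff;
	r.l2     = (va >> 21) & 0x1ff;
	r.l3     = (va >> 12) & 0x1ff;
	r.offset = va & 0xfff;
	return r;
}
```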

Translation table lookup with 64KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |    |               |              |
   |                 |    |               |              v
   |                 |    |               |            [15:0]  in-page offset
   |                 |    |               +----------> [28:16] L3 index
   |                 |    +--------------------------> [41:29] L2 index
   |                 +-------------------------------> [47:42] L1 index
   +-------------------------------------------------> [63] TTBR0/1

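
With 64KB pages the same idea applies, but the field widths differ: a 16-bit offset, two 13-bit indices and a 6-bit L1 index. The sketch below uses a generic bit-range helper so the ranges can be written exactly as they appear in the diagram; ``va_range()`` and the ``*_64k()`` wrappers are hypothetical names.

```c
#include <stdint.h>

/* Extract bits [msb:lsb] of a virtual address (msb - lsb < 63 assumed). */
static uint64_t va_range(uint64_t va, unsigned int msb, unsigned int lsb)
{
	uint64_t mask = (UINT64_C(1) << (msb - lsb + 1)) - 1;

	return (va >> lsb) & mask;
}

/* Indices and offset for the 64KB-page lookup shown above. */
static uint64_t l1_index_64k(uint64_t va)    { return va_range(va, 47, 42); }
static uint64_t l2_index_64k(uint64_t va)    { return va_range(va, 41, 29); }
static uint64_t l3_index_64k(uint64_t va)    { return va_range(va, 28, 16); }
static uint64_t page_offset_64k(uint64_t va) { return va_range(va, 15, 0); }
```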

When using KVM without the Virtualization Host Extensions, the
hypervisor maps kernel pages in EL2 at a fixed (and potentially
random) offset from the linear mapping. See the kern_hyp_va macro and
kvm_update_va_mask function for more details. MMIO devices such as
GICv2 get mapped next to the HYP idmap page, as do vectors when
ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.
@@ -1,97 +0,0 @@
Memory Layout on AArch64 Linux
==============================

Author: Catalin Marinas <catalin.marinas@arm.com>

This document describes the virtual memory layout used by the AArch64
Linux kernel. The architecture allows up to 4 levels of translation
tables with a 4KB page size and up to 3 levels with a 64KB page size.

AArch64 Linux uses either 3 levels or 4 levels of translation tables
with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit
(256TB) virtual addresses, respectively, for both user and kernel. With
64KB pages, only 2 levels of translation tables, allowing 42-bit (4TB)
virtual address, are used but the memory layout is the same.

User addresses have bits 63:48 set to 0 while the kernel addresses have
the same bits set to 1. TTBRx selection is given by bit 63 of the
virtual address. The swapper_pg_dir contains only kernel (global)
mappings while the user pgd contains only user (non-global) mappings.
The swapper_pg_dir address is written to TTBR1 and never written to
TTBR0.


AArch64 Linux memory layout with 4KB pages + 3 levels:

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000007fffffffff         512GB          user
  ffffff8000000000      ffffffffffffffff         512GB          kernel


AArch64 Linux memory layout with 4KB pages + 4 levels:

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000ffffffffffff         256TB          user
  ffff000000000000      ffffffffffffffff         256TB          kernel


AArch64 Linux memory layout with 64KB pages + 2 levels:

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      000003ffffffffff           4TB          user
  fffffc0000000000      ffffffffffffffff           4TB          kernel


AArch64 Linux memory layout with 64KB pages + 3 levels:

  Start                 End                     Size            Use
  -----------------------------------------------------------------------
  0000000000000000      0000ffffffffffff         256TB          user
  ffff000000000000      ffffffffffffffff         256TB          kernel


For details of the virtual kernel memory layout please see the kernel
booting log.


Translation table lookup with 4KB pages:

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |         |         |         |         |
   |                 |         |         |         |         v
   |                 |         |         |         |   [11:0]  in-page offset
   |                 |         |         |         +-> [20:12] L3 index
   |                 |         |         +-----------> [29:21] L2 index
   |                 |         +---------------------> [38:30] L1 index
   |                 +-------------------------------> [47:39] L0 index
   +-------------------------------------------------> [63] TTBR0/1


Translation table lookup with 64KB pages:

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |    |               |              |
   |                 |    |               |              v
   |                 |    |               |            [15:0]  in-page offset
   |                 |    |               +----------> [28:16] L3 index
   |                 |    +--------------------------> [41:29] L2 index
   |                 +-------------------------------> [47:42] L1 index
   +-------------------------------------------------> [63] TTBR0/1


When using KVM without the Virtualization Host Extensions, the
hypervisor maps kernel pages in EL2 at a fixed (and potentially
random) offset from the linear mapping. See the kern_hyp_va macro and
kvm_update_va_mask function for more details. MMIO devices such as
GICv2 get mapped next to the HYP idmap page, as do vectors when
ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.
@@ -1,7 +1,9 @@
=======================================
Pointer authentication in AArch64 Linux
=======================================

Author: Mark Rutland <mark.rutland@arm.com>

Date: 2017-07-19

This document briefly describes the provision of pointer authentication
@@ -1,7 +1,9 @@
Silicon Errata and Software Workarounds
=======================================
=======================================
Silicon Errata and Software Workarounds
=======================================

Author: Will Deacon <will.deacon@arm.com>

Date : 27 November 2015

It is an unfortunate fact of life that hardware is often produced with
@@ -9,11 +11,13 @@ so-called "errata", which can cause it to deviate from the architecture
under specific circumstances. For hardware produced by ARM, these
errata are broadly classified into the following categories:

Category A: A critical error without a viable workaround.
Category B: A significant or critical error with an acceptable
========== ========================================================
Category A A critical error without a viable workaround.
Category B A significant or critical error with an acceptable
           workaround.
Category C: A minor error that is not expected to occur under normal
Category C A minor error that is not expected to occur under normal
           operation.
========== ========================================================

For more information, consult one of the "Software Developers Errata
Notice" documents available on infocenter.arm.com (registration
@@ -42,47 +46,86 @@ file acts as a registry of software workarounds in the Linux Kernel and
will be updated when new workarounds are committed and backported to
stable kernels.

| Implementor    | Component       | Erratum ID      | Kconfig                     |
+----------------+-----------------+-----------------+-----------------------------+
| Implementor    | Component       | Erratum ID      | Kconfig                     |
+================+=================+=================+=============================+
| Allwinner      | A64/R18         | UNKNOWN1        | SUN50I_ERRATUM_UNKNOWN1     |
|                |                 |                 |                             |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #826319         | ARM64_ERRATUM_826319        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #827319         | ARM64_ERRATUM_827319        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #824069         | ARM64_ERRATUM_824069        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #819472         | ARM64_ERRATUM_819472        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #845719         | ARM64_ERRATUM_845719        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A53      | #843419         | ARM64_ERRATUM_843419        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A57      | #832075         | ARM64_ERRATUM_832075        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A57      | #852523         | N/A                         |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A57      | #834220         | ARM64_ERRATUM_834220        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A72      | #853709         | N/A                         |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A73      | #858921         | ARM64_ERRATUM_858921        |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A55      | #1024718        | ARM64_ERRATUM_1024718       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A76      | #1188873,1418040| ARM64_ERRATUM_1418040       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A76      | #1165522        | ARM64_ERRATUM_1165522       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A76      | #1286807        | ARM64_ERRATUM_1286807       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A76      | #1463225        | ARM64_ERRATUM_1463225       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Neoverse-N1     | #1188873,1418040| ARM64_ERRATUM_1418040       |
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | MMU-500         | #841119,826419  | N/A                         |
|                |                 |                 |                             |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX ITS    | #22375,24313    | CAVIUM_ERRATUM_22375        |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX ITS    | #23144          | CAVIUM_ERRATUM_23144        |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX GICv3  | #23154          | CAVIUM_ERRATUM_23154        |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX2 SMMUv3| #74             | N/A                         |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium         | ThunderX2 SMMUv3| #126            | N/A                         |
|                |                 |                 |                             |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
|                |                 |                 |                             |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Hisilicon      | Hip0{5,6,7}     | #161010101      | HISILICON_ERRATUM_161010101 |
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Hisilicon | Hip0{6,7} | #161010701 | N/A |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Hisilicon | Hip07 | #161600802 | HISILICON_ERRATUM_161600802 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Hisilicon | Hip08 SMMU PMCG | #162001800 | N/A |
|
||||
| | | | |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Qualcomm Tech. | Kryo/Falkor v1 | E1003 | QCOM_FALKOR_ERRATUM_1003 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Qualcomm Tech. | Falkor v1 | E1009 | QCOM_FALKOR_ERRATUM_1009 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Qualcomm Tech. | QDF2400 ITS | E0065 | QCOM_QDF2400_ERRATUM_0065 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Qualcomm Tech. | Falkor v{1,2} | E1041 | QCOM_FALKOR_ERRATUM_1041 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Fujitsu | A64FX | E#010001 | FUJITSU_ERRATUM_010001 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
@ -1,7 +1,9 @@
|
||||
Scalable Vector Extension support for AArch64 Linux
|
||||
===================================================
|
||||
===================================================
|
||||
Scalable Vector Extension support for AArch64 Linux
|
||||
===================================================
|
||||
|
||||
Author: Dave Martin <Dave.Martin@arm.com>
|
||||
|
||||
Date: 4 August 2017
|
||||
|
||||
This document outlines briefly the interface provided to userspace by Linux in
|
||||
@ -442,7 +444,7 @@ In A64 state, SVE adds the following:
|
||||
|
||||
* FPSR and FPCR are retained from ARMv8-A, and interact with SVE floating-point
|
||||
operations in a similar way to the way in which they interact with ARMv8
|
||||
floating-point operations.
|
||||
floating-point operations::
|
||||
|
||||
8VL-1 128 0 bit index
|
||||
+---- //// -----------------+
|
||||
@ -499,6 +501,8 @@ ARMv8-A defines the following floating-point / SIMD register state:
|
||||
* 32 128-bit vector registers V0..V31
|
||||
* 2 32-bit status/control registers FPSR, FPCR
|
||||
|
||||
::
|
||||
|
||||
127 0 bit index
|
||||
+---------------+
|
||||
V0 | |
|
||||
@ -533,7 +537,7 @@ References
|
||||
[2] arch/arm64/include/uapi/asm/ptrace.h
|
||||
AArch64 Linux ptrace ABI definitions
|
||||
|
||||
[3] Documentation/arm64/cpu-feature-registers.txt
|
||||
[3] Documentation/arm64/cpu-feature-registers.rst
|
||||
|
||||
[4] ARM IHI0055C
|
||||
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055c/IHI0055C_beta_aapcs64.pdf
|
@ -1,7 +1,9 @@
|
||||
Tagged virtual addresses in AArch64 Linux
|
||||
=========================================
|
||||
=========================================
|
||||
Tagged virtual addresses in AArch64 Linux
|
||||
=========================================
|
||||
|
||||
Author: Will Deacon <will.deacon@arm.com>
|
||||
|
||||
Date : 12 June 2013
|
||||
|
||||
This document briefly describes the provision of tagged virtual
|
@ -151,6 +151,7 @@ for the type. The maximum value of ``BTF_INT_BITS()`` is 128.
|
||||
|
||||
The ``BTF_INT_OFFSET()`` specifies the starting bit offset to calculate values
|
||||
for this int. For example, a bitfield struct member has:
|
||||
|
||||
* btf member bit offset 100 from the start of the structure,
|
||||
* btf member pointing to an int type,
|
||||
* the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
|
||||
@ -160,6 +161,7 @@ from bits ``100 + 2 = 102``.
|
||||
|
||||
Alternatively, the bitfield struct member can be the following to access the
|
||||
same bits as the above:
|
||||
|
||||
* btf member bit offset 102,
|
||||
* btf member pointing to an int type,
|
||||
* the int type has ``BTF_INT_OFFSET() = 0`` and ``BTF_INT_BITS() = 4``
|
||||
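The two member layouts described above can be checked with a short calculation. This is a minimal Python sketch of the arithmetic only, not kernel or libbpf code; the helper name ``bitfield_span`` is hypothetical, and it assumes the accessed bits start at the member bit offset plus ``BTF_INT_OFFSET()`` and cover ``BTF_INT_BITS()`` bits.

```python
def bitfield_span(member_bit_offset, int_offset, int_bits):
    """Return (first_bit, last_bit) accessed by a bitfield struct member.

    member_bit_offset: btf member bit offset from the start of the structure
    int_offset:        BTF_INT_OFFSET() of the int type the member points to
    int_bits:          BTF_INT_BITS() of that int type
    """
    first_bit = member_bit_offset + int_offset
    last_bit = first_bit + int_bits - 1
    return first_bit, last_bit


# First layout: member bit offset 100, BTF_INT_OFFSET() = 2, BTF_INT_BITS() = 4.
layout_a = bitfield_span(100, 2, 4)
# Alternative layout: member bit offset 102, BTF_INT_OFFSET() = 0, BTF_INT_BITS() = 4.
layout_b = bitfield_span(102, 0, 4)

print(layout_a, layout_b)
```

Both calls describe the same four bits, 102 through 105, which is why the text treats the two encodings as equivalent.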
|
@ -1,21 +0,0 @@
|
||||
LATEXFILE = cdrom-standard
|
||||
|
||||
all:
|
||||
make clean
|
||||
latex $(LATEXFILE)
|
||||
latex $(LATEXFILE)
|
||||
@if [ -x `which gv` ]; then \
|
||||
`dvips -q -t letter -o $(LATEXFILE).ps $(LATEXFILE).dvi` ;\
|
||||
`gv -antialias -media letter -nocenter $(LATEXFILE).ps` ;\
|
||||
else \
|
||||
`xdvi $(LATEXFILE).dvi &` ;\
|
||||
fi
|
||||
make sortofclean
|
||||
|
||||
clean:
|
||||
rm -f $(LATEXFILE).ps $(LATEXFILE).dvi $(LATEXFILE).aux $(LATEXFILE).log
|
||||
|
||||
sortofclean:
|
||||
rm -f $(LATEXFILE).aux $(LATEXFILE).log
|
||||
|
||||
|
1063
Documentation/cdrom/cdrom-standard.rst
Normal file
File diff suppressed because it is too large
@ -1,18 +1,20 @@
|
||||
IDE-CD driver documentation
|
||||
Originally by scott snyder <snyder@fnald0.fnal.gov> (19 May 1996)
|
||||
Carrying on the torch is: Erik Andersen <andersee@debian.org>
|
||||
New maintainers (19 Oct 1998): Jens Axboe <axboe@image.dk>
|
||||
===========================
|
||||
|
||||
:Originally by: scott snyder <snyder@fnald0.fnal.gov> (19 May 1996)
|
||||
:Carrying on the torch is: Erik Andersen <andersee@debian.org>
|
||||
:New maintainers (19 Oct 1998): Jens Axboe <axboe@image.dk>
|
||||
|
||||
1. Introduction
|
||||
---------------
|
||||
|
||||
The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant
|
||||
The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant
|
||||
CDROM drives which attach to an IDE interface. Note that some CDROM vendors
|
||||
(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made
|
||||
both ATAPI-compliant drives and drives which use a proprietary
|
||||
interface. If your drive uses one of those proprietary interfaces,
|
||||
this driver will not work with it (but one of the other CDROM drivers
|
||||
probably will). This driver will not work with `ATAPI' drives which
|
||||
probably will). This driver will not work with `ATAPI` drives which
|
||||
attach to the parallel port. In addition, there is at least one drive
|
||||
(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI;
|
||||
this driver will not work with drives like that either (but see the
|
||||
@ -31,7 +33,7 @@ This driver provides the following features:
|
||||
from audio tracks. The program cdda2wav can be used for this.
|
||||
Note, however, that only some drives actually support this.
|
||||
|
||||
- There is now support for CDROM changers which comply with the
|
||||
- There is now support for CDROM changers which comply with the
|
||||
ATAPI 2.6 draft standard (such as the NEC CDR-251). This additional
|
||||
functionality includes a function call to query which slot is the
|
||||
currently selected slot, a function call to query which slots contain
|
||||
@ -45,22 +47,22 @@ This driver provides the following features:
|
||||
---------------
|
||||
|
||||
0. The ide-cd relies on the ide disk driver. See
|
||||
Documentation/ide/ide.txt for up-to-date information on the ide
|
||||
Documentation/ide/ide.rst for up-to-date information on the ide
|
||||
driver.
|
||||
|
||||
1. Make sure that the ide and ide-cd drivers are compiled into the
|
||||
kernel you're using. When configuring the kernel, in the section
|
||||
entitled "Floppy, IDE, and other block devices", say either `Y'
|
||||
(which will compile the support directly into the kernel) or `M'
|
||||
kernel you're using. When configuring the kernel, in the section
|
||||
entitled "Floppy, IDE, and other block devices", say either `Y`
|
||||
(which will compile the support directly into the kernel) or `M`
|
||||
(to compile support as a module which can be loaded and unloaded)
|
||||
to the options:
|
||||
to the options::
|
||||
|
||||
ATA/ATAPI/MFM/RLL support
|
||||
Include IDE/ATAPI CDROM support
|
||||
|
||||
Depending on what type of IDE interface you have, you may need to
|
||||
specify additional configuration options. See
|
||||
Documentation/ide/ide.txt.
|
||||
Documentation/ide/ide.rst.
|
||||
|
||||
2. You should also ensure that the iso9660 filesystem is either
|
||||
compiled into the kernel or available as a loadable module. You
|
||||
@ -72,35 +74,35 @@ This driver provides the following features:
|
||||
address and an IRQ number, the standard assignments being
|
||||
0x1f0 and 14 for the primary interface and 0x170 and 15 for the
|
||||
secondary interface. Each interface can control up to two devices,
|
||||
where each device can be a hard drive, a CDROM drive, a floppy drive,
|
||||
or a tape drive. The two devices on an interface are called `master'
|
||||
and `slave'; this is usually selectable via a jumper on the drive.
|
||||
where each device can be a hard drive, a CDROM drive, a floppy drive,
|
||||
or a tape drive. The two devices on an interface are called `master`
|
||||
and `slave`; this is usually selectable via a jumper on the drive.
|
||||
|
||||
Linux names these devices as follows. The master and slave devices
|
||||
on the primary IDE interface are called `hda' and `hdb',
|
||||
on the primary IDE interface are called `hda` and `hdb`,
|
||||
respectively. The drives on the secondary interface are called
|
||||
`hdc' and `hdd'. (Interfaces at other locations get other letters
|
||||
in the third position; see Documentation/ide/ide.txt.)
|
||||
`hdc` and `hdd`. (Interfaces at other locations get other letters
|
||||
in the third position; see Documentation/ide/ide.rst.)
|
||||
|
||||
If you want your CDROM drive to be found automatically by the
|
||||
driver, you should make sure your IDE interface uses either the
|
||||
primary or secondary addresses mentioned above. In addition, if
|
||||
the CDROM drive is the only device on the IDE interface, it should
|
||||
be jumpered as `master'. (If for some reason you cannot configure
|
||||
be jumpered as `master`. (If for some reason you cannot configure
|
||||
your system in this manner, you can probably still use the driver.
|
||||
You may have to pass extra configuration information to the kernel
|
||||
when you boot, however. See Documentation/ide/ide.txt for more
|
||||
when you boot, however. See Documentation/ide/ide.rst for more
|
||||
information.)
|
||||
|
||||
4. Boot the system. If the drive is recognized, you should see a
|
||||
message which looks like
|
||||
message which looks like::
|
||||
|
||||
hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive
|
||||
|
||||
If you do not see this, see section 5 below.
|
||||
|
||||
5. You may want to create a symbolic link /dev/cdrom pointing to the
|
||||
actual device. You can do this with the command
|
||||
actual device. You can do this with the command::
|
||||
|
||||
ln -s /dev/hdX /dev/cdrom
|
||||
|
||||
@ -108,14 +110,14 @@ This driver provides the following features:
|
||||
drive is installed.
|
||||
|
||||
6. You should be able to see any error messages from the driver with
|
||||
the `dmesg' command.
|
||||
the `dmesg` command.
|
||||
|
||||
|
||||
3. Basic usage
|
||||
--------------
|
||||
|
||||
An ISO 9660 CDROM can be mounted by putting the disc in the drive and
|
||||
typing (as root)
|
||||
An ISO 9660 CDROM can be mounted by putting the disc in the drive and
|
||||
typing (as root)::
|
||||
|
||||
mount -t iso9660 /dev/cdrom /mnt/cdrom
|
||||
|
||||
@ -123,7 +125,7 @@ where it is assumed that /dev/cdrom is a link pointing to the actual
|
||||
device (as described in step 5 of the last section) and /mnt/cdrom is
|
||||
an empty directory. You should now be able to see the contents of the
|
||||
CDROM under the /mnt/cdrom directory. If you want to eject the CDROM,
|
||||
you must first dismount it with a command like
|
||||
you must first dismount it with a command like::
|
||||
|
||||
umount /mnt/cdrom
|
||||
|
||||
@ -148,7 +150,7 @@ such as cdda2wav. The only types of drive which I've heard support
|
||||
this are Sony and Toshiba drives. You will get errors if you try to
|
||||
use this function on a drive which does not support it.
|
||||
|
||||
For supported changers, you can use the `cdchange' program (appended to
|
||||
For supported changers, you can use the `cdchange` program (appended to
|
||||
the end of this file) to switch between changer slots. Note that the
|
||||
drive should be unmounted before attempting this. The program takes
|
||||
two arguments: the CDROM device, and the slot number to which you wish
|
||||
@ -161,17 +163,17 @@ to change. If the slot number is -1, the drive is unloaded.
|
||||
This section discusses some common problems encountered when trying to
|
||||
use the driver, and some possible solutions. Note that if you are
|
||||
experiencing problems, you should probably also review
|
||||
Documentation/ide/ide.txt for current information about the underlying
|
||||
Documentation/ide/ide.rst for current information about the underlying
|
||||
IDE support code. Some of these items apply only to earlier versions
|
||||
of the driver, but are mentioned here for completeness.
|
||||
|
||||
In most cases, you should probably check with `dmesg' for any errors
|
||||
In most cases, you should probably check with `dmesg` for any errors
|
||||
from the driver.
|
||||
|
||||
a. Drive is not detected during booting.
|
||||
|
||||
- Review the configuration instructions above and in
|
||||
Documentation/ide/ide.txt, and check how your hardware is
|
||||
Documentation/ide/ide.rst, and check how your hardware is
|
||||
configured.
|
||||
|
||||
- If your drive is the only device on an IDE interface, it should
|
||||
@ -179,14 +181,14 @@ a. Drive is not detected during booting.
|
||||
|
||||
- If your IDE interface is not at the standard addresses of 0x170
|
||||
or 0x1f0, you'll need to explicitly inform the driver using a
|
||||
lilo option. See Documentation/ide/ide.txt. (This feature was
|
||||
lilo option. See Documentation/ide/ide.rst. (This feature was
|
||||
added around kernel version 1.3.30.)
|
||||
|
||||
- If the autoprobing is not finding your drive, you can tell the
|
||||
driver to assume that one exists by using a lilo option of the
|
||||
form `hdX=cdrom', where X is the drive letter corresponding to
|
||||
where your drive is installed. Note that if you do this and you
|
||||
see a boot message like
|
||||
form `hdX=cdrom`, where X is the drive letter corresponding to
|
||||
where your drive is installed. Note that if you do this and you
|
||||
see a boot message like::
|
||||
|
||||
hdX: ATAPI cdrom (?)
|
||||
|
||||
@ -205,7 +207,7 @@ a. Drive is not detected during booting.
|
||||
Support for some interfaces needing extra initialization is
|
||||
provided in later 1.3.x kernels. You may need to turn on
|
||||
additional kernel configuration options to get them to work;
|
||||
see Documentation/ide/ide.txt.
|
||||
see Documentation/ide/ide.rst.
|
||||
|
||||
Even if support is not available for your interface, you may be
|
||||
able to get it to work with the following procedure. First boot
|
||||
@ -220,7 +222,7 @@ b. Timeout/IRQ errors.
|
||||
probably not making it to the host.
|
||||
|
||||
- IRQ problems may also be indicated by the message
|
||||
`IRQ probe failed (<n>)' while booting. If <n> is zero, that
|
||||
`IRQ probe failed (<n>)` while booting. If <n> is zero, that
|
||||
means that the system did not see an interrupt from the drive when
|
||||
it was expecting one (on any feasible IRQ). If <n> is negative,
|
||||
that means the system saw interrupts on multiple IRQ lines, when
|
||||
@ -240,27 +242,27 @@ b. Timeout/IRQ errors.
|
||||
there are hardware problems with the interrupt setup; they
|
||||
apparently don't use interrupts.
|
||||
|
||||
- If you own a Pioneer DR-A24X, you _will_ get nasty error messages
|
||||
- If you own a Pioneer DR-A24X, you _will_ get nasty error messages
|
||||
on boot such as "irq timeout: status=0x50 { DriveReady SeekComplete }"
|
||||
The Pioneer DR-A24X CDROM drives are fairly popular these days.
|
||||
Unfortunately, these drives seem to become very confused when we perform
|
||||
the standard Linux ATA disk drive probe. If you own one of these drives,
|
||||
you can bypass the ATA probing which confuses these CDROM drives, by
|
||||
adding `append="hdX=noprobe hdX=cdrom"' to your lilo.conf file and running
|
||||
lilo (again where X is the drive letter corresponding to where your drive
|
||||
you can bypass the ATA probing which confuses these CDROM drives, by
|
||||
adding `append="hdX=noprobe hdX=cdrom"` to your lilo.conf file and running
|
||||
lilo (again where X is the drive letter corresponding to where your drive
|
||||
is installed.)
|
||||
|
||||
|
||||
c. System hangups.
|
||||
|
||||
- If the system locks up when you try to access the CDROM, the most
|
||||
likely cause is that you have a buggy IDE adapter which doesn't
|
||||
properly handle simultaneous transactions on multiple interfaces.
|
||||
The most notorious of these is the CMD640B chip. This problem can
|
||||
be worked around by specifying the `serialize' option when
|
||||
be worked around by specifying the `serialize` option when
|
||||
booting. Recent kernels should be able to detect the need for
|
||||
this automatically in most cases, but the detection is not
|
||||
foolproof. See Documentation/ide/ide.txt for more information
|
||||
about the `serialize' option and the CMD640B.
|
||||
foolproof. See Documentation/ide/ide.rst for more information
|
||||
about the `serialize` option and the CMD640B.
|
||||
|
||||
- Note that many MS-DOS CDROM drivers will work with such buggy
|
||||
hardware, apparently because they never attempt to overlap CDROM
|
||||
@ -269,14 +271,14 @@ c. System hangups.
|
||||
|
||||
d. Can't mount a CDROM.
|
||||
|
||||
- If you get errors from mount, it may help to check `dmesg' to see
|
||||
- If you get errors from mount, it may help to check `dmesg` to see
|
||||
if there are any more specific errors from the driver or from the
|
||||
filesystem.
|
||||
|
||||
- Make sure there's a CDROM loaded in the drive, and that it's an
|
||||
ISO 9660 disc. You can't mount an audio CD.
|
||||
|
||||
- With the CDROM in the drive and unmounted, try something like
|
||||
- With the CDROM in the drive and unmounted, try something like::
|
||||
|
||||
cat /dev/cdrom | od | more
|
||||
|
||||
@ -284,9 +286,9 @@ d. Can't mount a CDROM.
|
||||
OK, and the problem is at the filesystem level (i.e., the CDROM is
|
||||
not ISO 9660 or has errors in the filesystem structure).
|
||||
|
||||
- If you see `not a block device' errors, check that the definitions
|
||||
- If you see `not a block device` errors, check that the definitions
|
||||
of the device special files are correct. They should be as
|
||||
follows:
|
||||
follows::
|
||||
|
||||
brw-rw---- 1 root disk 3, 0 Nov 11 18:48 /dev/hda
|
||||
brw-rw---- 1 root disk 3, 64 Nov 11 18:48 /dev/hdb
|
||||
@ -301,7 +303,7 @@ d. Can't mount a CDROM.
|
||||
If you have a /dev/cdrom symbolic link, check that it is pointing
|
||||
to the correct device file.
|
||||
|
||||
If you hear people talking of the devices `hd1a' and `hd1b', these
|
||||
If you hear people talking of the devices `hd1a` and `hd1b`, these
|
||||
were old names for what are now called hdc and hdd. Those names
|
||||
should be considered obsolete.
|
||||
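The device-file checks described in this item (that the node is a block device, and that a /dev/cdrom symbolic link points at the right file) can also be scripted. This is a minimal sketch using only the Python standard library; the helper name ``describe_device`` is hypothetical, and the /dev/cdrom path is simply the link suggested earlier in this document.

```python
import os
import stat


def describe_device(path):
    """Return (resolved_path, is_block_device, major, minor) for a path.

    major and minor are None when the resolved path is not a block device.
    """
    resolved = os.path.realpath(path)  # follows a /dev/cdrom -> /dev/hdX link
    st = os.stat(resolved)
    if not stat.S_ISBLK(st.st_mode):
        return resolved, False, None, None
    return resolved, True, os.major(st.st_rdev), os.minor(st.st_rdev)


# Only inspect the link if it exists on this system.
if os.path.exists("/dev/cdrom"):
    print(describe_device("/dev/cdrom"))
```

For a node such as /dev/hda, the major and minor numbers reported here should match the ``3, 0`` shown in the listing of device special files above.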
|
||||
@ -311,8 +313,8 @@ d. Can't mount a CDROM.
|
||||
always give meaningful error messages.
|
||||
|
||||
|
||||
e. Directory listings are unpredictably truncated, and `dmesg' shows
|
||||
`buffer botch' error messages from the driver.
|
||||
e. Directory listings are unpredictably truncated, and `dmesg` shows
|
||||
`buffer botch` error messages from the driver.
|
||||
|
||||
- There was a bug in the version of the driver in 1.2.x kernels
|
||||
which could cause this. It was fixed in 1.3.0. If you can't
|
||||
@ -335,34 +337,36 @@ f. Data corruption.
|
||||
5. cdchange.c
|
||||
-------------
|
||||
|
||||
/*
|
||||
* cdchange.c [-v] <device> [<slot>]
|
||||
*
|
||||
* This loads a CDROM from a specified slot in a changer, and displays
|
||||
* information about the changer status. The drive should be unmounted before
|
||||
* using this program.
|
||||
*
|
||||
* Changer information is displayed if either the -v flag is specified
|
||||
* or no slot was specified.
|
||||
*
|
||||
* Based on code originally from Gerhard Zuber <zuber@berlin.snafu.de>.
|
||||
* Changer status information, and rewrite for the new Uniform CDROM driver
|
||||
* interface by Erik Andersen <andersee@debian.org>.
|
||||
*/
|
||||
::
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <errno.h>
|
||||
#include <string.h>
|
||||
#include <unistd.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <linux/cdrom.h>
|
||||
/*
|
||||
* cdchange.c [-v] <device> [<slot>]
|
||||
*
|
||||
* This loads a CDROM from a specified slot in a changer, and displays
|
||||
* information about the changer status. The drive should be unmounted before
|
||||
* using this program.
|
||||
*
|
||||
* Changer information is displayed if either the -v flag is specified
|
||||
* or no slot was specified.
|
||||
*
|
||||
* Based on code originally from Gerhard Zuber <zuber@berlin.snafu.de>.
|
||||
* Changer status information, and rewrite for the new Uniform CDROM driver
|
||||
* interface by Erik Andersen <andersee@debian.org>.
|
||||
*/
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <errno.h>
|
||||
#include <string.h>
|
||||
#include <unistd.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <linux/cdrom.h>
|
||||
|
||||
|
||||
int
|
||||
main (int argc, char **argv)
|
||||
{
|
||||
int
|
||||
main (int argc, char **argv)
|
||||
{
|
||||
char *program;
|
||||
char *device;
|
||||
int fd; /* file descriptor for CD-ROM device */
|
||||
@ -382,30 +386,30 @@ main (int argc, char **argv)
|
||||
fprintf (stderr, " Slots are numbered 1 -- n.\n");
|
||||
exit (1);
|
||||
}
|
||||
|
||||
|
||||
if (strcmp (argv[0], "-v") == 0) {
|
||||
verbose = 1;
|
||||
++argv;
|
||||
--argc;
|
||||
}
|
||||
|
||||
|
||||
device = argv[0];
|
||||
|
||||
|
||||
if (argc == 2)
|
||||
slot = atoi (argv[1]) - 1;
|
||||
|
||||
/* open device */
|
||||
/* open device */
|
||||
fd = open(device, O_RDONLY | O_NONBLOCK);
|
||||
if (fd < 0) {
|
||||
fprintf (stderr, "%s: open failed for `%s': %s\n",
|
||||
fprintf (stderr, "%s: open failed for `%s`: %s\n",
|
||||
program, device, strerror (errno));
|
||||
exit (1);
|
||||
}
|
||||
|
||||
/* Check CD player status */
|
||||
/* Check CD player status */
|
||||
total_slots_available = ioctl (fd, CDROM_CHANGER_NSLOTS);
|
||||
if (total_slots_available <= 1 ) {
|
||||
fprintf (stderr, "%s: Device `%s' is not an ATAPI "
|
||||
fprintf (stderr, "%s: Device `%s` is not an ATAPI "
|
||||
"compliant CD changer.\n", program, device);
|
||||
exit (1);
|
||||
}
|
||||
@ -418,7 +422,7 @@ main (int argc, char **argv)
|
||||
exit (1);
|
||||
}
|
||||
|
||||
/* load */
|
||||
/* load */
|
||||
slot=ioctl (fd, CDROM_SELECT_DISC, slot);
|
||||
if (slot<0) {
|
||||
fflush(stdout);
|
||||
@ -462,14 +466,14 @@ main (int argc, char **argv)
|
||||
|
||||
for (x_slot=0; x_slot<total_slots_available; x_slot++) {
|
||||
printf ("Slot %2d: ", x_slot+1);
|
||||
status = ioctl (fd, CDROM_DRIVE_STATUS, x_slot);
|
||||
if (status<0) {
|
||||
perror(" CDROM_DRIVE_STATUS");
|
||||
} else switch(status) {
|
||||
status = ioctl (fd, CDROM_DRIVE_STATUS, x_slot);
|
||||
if (status<0) {
|
||||
perror(" CDROM_DRIVE_STATUS");
|
||||
} else switch(status) {
|
||||
case CDS_DISC_OK:
|
||||
printf ("Disc present.");
|
||||
break;
|
||||
case CDS_NO_DISC:
|
||||
case CDS_NO_DISC:
|
||||
printf ("Empty slot.");
|
||||
break;
|
||||
case CDS_TRAY_OPEN:
|
||||
@ -507,11 +511,11 @@ main (int argc, char **argv)
|
||||
break;
|
||||
}
|
||||
}
|
||||
status = ioctl (fd, CDROM_MEDIA_CHANGED, x_slot);
|
||||
if (status<0) {
|
||||
status = ioctl (fd, CDROM_MEDIA_CHANGED, x_slot);
|
||||
if (status<0) {
|
||||
perror(" CDROM_MEDIA_CHANGED");
|
||||
}
|
||||
switch (status) {
|
||||
}
|
||||
switch (status) {
|
||||
case 1:
|
||||
printf ("Changed.\n");
|
||||
break;
|
||||
@ -525,10 +529,10 @@ main (int argc, char **argv)
|
||||
/* close device */
|
||||
status = close (fd);
|
||||
if (status != 0) {
|
||||
fprintf (stderr, "%s: close failed for `%s': %s\n",
|
||||
fprintf (stderr, "%s: close failed for `%s`: %s\n",
|
||||
program, device, strerror (errno));
|
||||
exit (1);
|
||||
}
|
||||
|
||||
|
||||
exit (0);
|
||||
}
|
||||
}
|
19
Documentation/cdrom/index.rst
Normal file
@ -0,0 +1,19 @@
|
||||
:orphan:
|
||||
|
||||
=====
|
||||
cdrom
|
||||
=====
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
cdrom-standard
|
||||
ide-cd
|
||||
packet-writing
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
@ -1,3 +1,7 @@
|
||||
==============
|
||||
Packet writing
|
||||
==============
|
||||
|
||||
Getting started quick
|
||||
---------------------
|
||||
|
||||
@ -10,13 +14,16 @@ Getting started quick
|
||||
Download from http://sourceforge.net/projects/linux-udf/
|
||||
|
||||
- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute
|
||||
as appropriate):
|
||||
as appropriate)::
|
||||
|
||||
# cdrwtool -d /dev/hdc -q
|
||||
|
||||
- Setup your writer
|
||||
- Setup your writer::
|
||||
|
||||
# pktsetup dev_name /dev/hdc
|
||||
|
||||
- Now you can mount /dev/pktcdvd/dev_name and copy files to it. Enjoy!
|
||||
- Now you can mount /dev/pktcdvd/dev_name and copy files to it. Enjoy::
|
||||
|
||||
# mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime
|
||||
|
||||
|
||||
@ -25,11 +32,11 @@ Packet writing for DVD-RW media
|
||||
|
||||
DVD-RW discs can be written to much like CD-RW discs if they are in
|
||||
the so called "restricted overwrite" mode. To put a disc in restricted
|
||||
overwrite mode, run:
|
||||
overwrite mode, run::
|
||||
|
||||
# dvd+rw-format /dev/hdc
|
||||
|
||||
You can then use the disc the same way you would use a CD-RW disc:
|
||||
You can then use the disc the same way you would use a CD-RW disc::
|
||||
|
||||
# pktsetup dev_name /dev/hdc
|
||||
# mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime
|
||||
@ -41,7 +48,7 @@ Packet writing for DVD+RW media
|
||||
According to the DVD+RW specification, a drive supporting DVD+RW discs
|
||||
shall implement "true random writes with 2KB granularity", which means
|
||||
that it should be possible to put any filesystem with a block size >=
|
||||
2KB on such a disc. For example, it should be possible to do:
|
||||
2KB on such a disc. For example, it should be possible to do::
|
||||
|
||||
# dvd+rw-format /dev/hdc (only needed if the disc has never
|
||||
been formatted)
|
||||
@ -54,7 +61,7 @@ follow the specification, but suffer bad performance problems if the
|
||||
writes are not 32KB aligned.
|
||||
|
||||
Both problems can be solved by using the pktcdvd driver, which always
|
||||
generates aligned writes.
|
||||
generates aligned writes::
|
||||
|
||||
# dvd+rw-format /dev/hdc
|
||||
# pktsetup dev_name /dev/hdc
|
||||
@ -83,7 +90,7 @@ Notes
|
||||
|
||||
- Since the pktcdvd driver makes the disc appear as a regular block
|
||||
device with a 2KB block size, you can put any filesystem you like on
|
||||
the disc. For example, run:
|
||||
the disc. For example, run::
|
||||
|
||||
# /sbin/mke2fs /dev/pktcdvd/dev_name
|
||||
|
||||
@ -97,7 +104,7 @@ Since Linux 2.6.20, the pktcdvd module has a sysfs interface
|
||||
and can be controlled by it. For example the "pktcdvd" tool uses
|
||||
this interface. (see http://tom.ist-im-web.de/download/pktcdvd )
|
||||
|
||||
"pktcdvd" works similar to "pktsetup", e.g.:
|
||||
"pktcdvd" works similar to "pktsetup", e.g.::
|
||||
|
||||
# pktcdvd -a dev_name /dev/hdc
|
||||
# mkudffs /dev/pktcdvd/dev_name
|
||||
@ -115,7 +122,7 @@ For a description of the sysfs interface look into the file:
|
||||
Using the pktcdvd debugfs interface
|
||||
-----------------------------------
|
||||
|
||||
To read pktcdvd device infos in human readable form, do:
|
||||
To read pktcdvd device infos in human readable form, do::
|
||||
|
||||
# cat /sys/kernel/debug/pktcdvd/pktcdvd[0-7]/info
|
||||
|
@ -34,7 +34,8 @@ needs_sphinx = '1.3'
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']
|
||||
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',
|
||||
'kfigure', 'sphinx.ext.ifconfig', 'automarkup']
|
||||
|
||||
# The name of the math extension changed on Sphinx 1.4
|
||||
if (major == 1 and minor > 3) or (major > 1):
|
||||
@ -200,7 +201,7 @@ html_context = {
|
||||
|
||||
# If true, SmartyPants will be used to convert quotes and dashes to
|
||||
# typographically correct entities.
|
||||
#html_use_smartypants = True
|
||||
html_use_smartypants = False
|
||||
|
||||
# Custom sidebar templates, maps document names to template names.
|
||||
#html_sidebars = {}
|
||||
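The ``automarkup`` entry added to the ``extensions`` list above enables automatic markup of ``function()`` references. The idea can be illustrated with a plain-text approximation. This sketch does not reproduce the extension's actual implementation; the pattern and the helper name ``mark_function_refs`` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a C-style identifier immediately followed by "()".
FUNCTION_REF = re.compile(r"\b([A-Za-z_]\w*)\(\)")


def mark_function_refs(text):
    """Wrap bare function() references in :c:func: roles (illustration only)."""
    return FUNCTION_REF.sub(r":c:func:`\1()`", text)


sample = "which is true for any pointer returned from kmalloc() and alloc_page()."
print(mark_function_refs(sample))
```

A naive substitution like this would also rewrite text that is already marked up or sits inside a literal block, so a real extension has to be more selective about where it applies.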
|
@ -34,6 +34,8 @@ Core utilities
|
||||
timekeeping
|
||||
boot-time-mm
|
||||
memory-hotplug
|
||||
protection-keys
|
||||
../RCU/index
|
||||
|
||||
|
||||
Interfaces for kernel debugging
|
||||
|
@ -33,6 +33,9 @@ String Conversions
|
||||
.. kernel-doc:: lib/kstrtox.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/string_helpers.c
|
||||
:export:
|
||||
|
||||
String Manipulation
|
||||
-------------------
|
||||
|
||||
@ -138,6 +141,15 @@ Base 2 log and power Functions
|
||||
.. kernel-doc:: include/linux/log2.h
|
||||
:internal:
|
||||
|
||||
Integer power Functions
|
||||
-----------------------
|
||||
|
||||
.. kernel-doc:: lib/math/int_pow.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/math/int_sqrt.c
|
||||
:export:
|
||||
|
||||
Division Functions
|
||||
------------------
|
||||
|
||||
@@ -358,8 +370,6 @@ Read-Copy Update (RCU)
|
||||
|
||||
.. kernel-doc:: kernel/rcu/tree.c
|
||||
|
||||
.. kernel-doc:: kernel/rcu/tree_plugin.h
|
||||
|
||||
.. kernel-doc:: kernel/rcu/tree_exp.h
|
||||
|
||||
.. kernel-doc:: kernel/rcu/update.c
|
||||
|
@@ -115,7 +115,7 @@ Some additional variants exist for more specialized cases:
|
||||
void ktime_get_coarse_clocktai_ts64( struct timespec64 * )
|
||||
|
||||
These are quicker than the non-coarse versions, but less accurate,
|
||||
corresponding to CLOCK_MONONOTNIC_COARSE and CLOCK_REALTIME_COARSE
|
||||
corresponding to CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE
|
||||
in user space, along with the equivalent boottime/tai/raw
|
||||
timebase not available in user space.
|
||||
|
||||
|
@@ -30,27 +30,27 @@ it called marks. Each mark may be set or cleared independently of
|
||||
the others. You can iterate over entries which are marked.
|
||||
|
||||
Normal pointers may be stored in the XArray directly. They must be 4-byte
|
||||
aligned, which is true for any pointer returned from :c:func:`kmalloc` and
|
||||
:c:func:`alloc_page`. It isn't true for arbitrary user-space pointers,
|
||||
aligned, which is true for any pointer returned from kmalloc() and
|
||||
alloc_page(). It isn't true for arbitrary user-space pointers,
|
||||
nor for function pointers. You can store pointers to statically allocated
|
||||
objects, as long as those objects have an alignment of at least 4.
|
||||
|
||||
You can also store integers between 0 and ``LONG_MAX`` in the XArray.
|
||||
You must first convert it into an entry using :c:func:`xa_mk_value`.
|
||||
You must first convert it into an entry using xa_mk_value().
|
||||
When you retrieve an entry from the XArray, you can check whether it is
|
||||
a value entry by calling :c:func:`xa_is_value`, and convert it back to
|
||||
an integer by calling :c:func:`xa_to_value`.
|
||||
a value entry by calling xa_is_value(), and convert it back to
|
||||
an integer by calling xa_to_value().
|
||||
|
||||
Some users want to store tagged pointers instead of using the marks
|
||||
described above. They can call :c:func:`xa_tag_pointer` to create an
|
||||
entry with a tag, :c:func:`xa_untag_pointer` to turn a tagged entry
|
||||
back into an untagged pointer and :c:func:`xa_pointer_tag` to retrieve
|
||||
described above. They can call xa_tag_pointer() to create an
|
||||
entry with a tag, xa_untag_pointer() to turn a tagged entry
|
||||
back into an untagged pointer and xa_pointer_tag() to retrieve
|
||||
the tag of an entry. Tagged pointers use the same bits that are used
|
||||
to distinguish value entries from normal pointers, so each user must
|
||||
decide whether they want to store value entries or tagged pointers in
|
||||
any particular XArray.
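The alignment rule and the conflict between value entries and tagged pointers are easier to see as bit arithmetic. The sketch below is a Python model only, not kernel code: pointers are plain integers, and the bit layout (bit 0 marks a value entry, the low two bits carry a tag, usable tags are 0, 1 and 3) is an assumption based on a reading of ``include/linux/xarray.h`` rather than something stated in the text above.

```python
WORD_MASK = (1 << 64) - 1  # assumption: model a 64-bit machine word


def xa_mk_value(v):
    # Shift the integer up by one and set bit 0 to mark it as a value entry.
    return ((v << 1) | 1) & WORD_MASK


def xa_is_value(entry):
    return (entry & 1) == 1


def xa_to_value(entry):
    return entry >> 1


def xa_tag_pointer(p, tag):
    # The tag lives in the low two bits, so they must be free in the pointer;
    # this is why stored pointers must be at least 4-byte aligned.
    if p % 4 != 0 or tag not in (0, 1, 3):
        raise ValueError("pointer must be 4-byte aligned, tag must be 0, 1 or 3")
    return p | tag


def xa_untag_pointer(entry):
    return entry & ~3 & WORD_MASK


def xa_pointer_tag(entry):
    return entry & 3
```

In this model a pointer tagged with 1 or 3 has bit 0 set, so ``xa_is_value()`` would report it as a value entry. That is the overlap the paragraph above warns about.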
|
||||
|
||||
The XArray does not support storing :c:func:`IS_ERR` pointers as some
|
||||
The XArray does not support storing IS_ERR() pointers as some
|
||||
conflict with value entries or internal entries.
|
||||
|
||||
An unusual feature of the XArray is the ability to create entries which
|
||||
@@ -64,89 +64,89 @@ entry will cause the XArray to forget about the range.
|
||||
Normal API
|
||||
==========
|
||||
|
||||
Start by initialising an XArray, either with :c:func:`DEFINE_XARRAY`
|
||||
for statically allocated XArrays or :c:func:`xa_init` for dynamically
|
||||
Start by initialising an XArray, either with DEFINE_XARRAY()
|
||||
for statically allocated XArrays or xa_init() for dynamically
|
||||
allocated ones. A freshly-initialised XArray contains a ``NULL``
|
||||
pointer at every index.
|
||||
|
||||
You can then set entries using :c:func:`xa_store` and get entries
|
||||
using :c:func:`xa_load`. xa_store will overwrite any entry with the
|
||||
You can then set entries using xa_store() and get entries
|
||||
using xa_load(). xa_store will overwrite any entry with the
|
||||
new entry and return the previous entry stored at that index. You can
|
||||
use :c:func:`xa_erase` instead of calling :c:func:`xa_store` with a
|
||||
use xa_erase() instead of calling xa_store() with a
|
||||
``NULL`` entry. There is no difference between an entry that has never
|
||||
been stored to, one that has been erased and one that has most recently
|
||||
had ``NULL`` stored to it.
|
||||
|
||||
You can conditionally replace an entry at an index by using
|
||||
:c:func:`xa_cmpxchg`. Like :c:func:`cmpxchg`, it will only succeed if
|
||||
xa_cmpxchg(). Like cmpxchg(), it will only succeed if
|
||||
the entry at that index has the 'old' value. It also returns the entry
|
||||
which was at that index; if it returns the same entry which was passed as
|
||||
'old', then :c:func:`xa_cmpxchg` succeeded.
|
||||
'old', then xa_cmpxchg() succeeded.
|
||||
|
||||
If you want to only store a new entry to an index if the current entry
|
||||
at that index is ``NULL``, you can use :c:func:`xa_insert` which
|
||||
at that index is ``NULL``, you can use xa_insert() which
|
||||
returns ``-EBUSY`` if the entry is not empty.
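The return-value rules for xa_store(), xa_erase(), xa_cmpxchg() and xa_insert() described above can be summarised with a toy model. This is a Python sketch under stated assumptions, not the kernel implementation: ``ToyXArray`` is a hypothetical class backed by a dictionary, ``None`` stands in for ``NULL``, and reservation, marks, locking and memory allocation are all ignored.

```python
import errno


class ToyXArray:
    """Dictionary-backed model of the normal API's return values."""

    def __init__(self):
        self._slots = {}  # index -> entry; a missing key means NULL

    def load(self, index):
        return self._slots.get(index)

    def store(self, index, entry):
        # Overwrites unconditionally and returns the previous entry.
        old = self._slots.get(index)
        if entry is None:
            self._slots.pop(index, None)
        else:
            self._slots[index] = entry
        return old

    def erase(self, index):
        # Same effect as storing NULL.
        return self.store(index, None)

    def cmpxchg(self, index, old, entry):
        # Replaces only if the current entry is 'old'; always returns the
        # entry that was found, so the caller compares it with 'old'.
        current = self._slots.get(index)
        if current is old:
            self.store(index, entry)
        return current

    def insert(self, index, entry):
        # Stores only into an empty slot, otherwise reports -EBUSY.
        if self._slots.get(index) is not None:
            return -errno.EBUSY
        self.store(index, entry)
        return 0

    def empty(self):
        return not self._slots
```

Identity comparison (``is``) is used in ``cmpxchg`` to mirror the pointer comparison the real function performs.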
|
||||
|
||||
You can enquire whether a mark is set on an entry by using
|
||||
:c:func:`xa_get_mark`. If the entry is not ``NULL``, you can set a mark
|
||||
on it by using :c:func:`xa_set_mark` and remove the mark from an entry by
|
||||
calling :c:func:`xa_clear_mark`. You can ask whether any entry in the
|
||||
XArray has a particular mark set by calling :c:func:`xa_marked`.
|
||||
xa_get_mark(). If the entry is not ``NULL``, you can set a mark
|
||||
on it by using xa_set_mark() and remove the mark from an entry by
|
||||
calling xa_clear_mark(). You can ask whether any entry in the
|
||||
XArray has a particular mark set by calling xa_marked().
|
||||
|
||||
You can copy entries out of the XArray into a plain array by calling
|
||||
:c:func:`xa_extract`. Or you can iterate over the present entries in
|
||||
the XArray by calling :c:func:`xa_for_each`. You may prefer to use
|
||||
:c:func:`xa_find` or :c:func:`xa_find_after` to move to the next present
|
||||
xa_extract(). Or you can iterate over the present entries in
|
||||
the XArray by calling xa_for_each(). You may prefer to use
|
||||
xa_find() or xa_find_after() to move to the next present
|
||||
entry in the XArray.
|
||||
|
||||
Calling :c:func:`xa_store_range` stores the same entry in a range
|
||||
Calling xa_store_range() stores the same entry in a range
|
||||
of indices. If you do this, some of the other operations will behave
|
||||
in a slightly odd way. For example, marking the entry at one index
|
||||
may result in the entry being marked at some, but not all of the other
|
||||
indices. Storing into one index may result in the entry retrieved by
|
||||
some, but not all of the other indices changing.
|
||||
|
||||
Sometimes you need to ensure that a subsequent call to :c:func:`xa_store`
|
||||
will not need to allocate memory. The :c:func:`xa_reserve` function
|
||||
Sometimes you need to ensure that a subsequent call to xa_store()
|
||||
will not need to allocate memory. The xa_reserve() function
|
||||
will store a reserved entry at the indicated index. Users of the
|
||||
normal API will see this entry as containing ``NULL``. If you do
|
||||
not need to use the reserved entry, you can call :c:func:`xa_release`
|
||||
not need to use the reserved entry, you can call xa_release()
|
||||
to remove the unused entry. If another user has stored to the entry
|
||||
in the meantime, :c:func:`xa_release` will do nothing; if instead you
|
||||
want the entry to become ``NULL``, you should use :c:func:`xa_erase`.
|
||||
Using :c:func:`xa_insert` on a reserved entry will fail.
|
||||
in the meantime, xa_release() will do nothing; if instead you
|
||||
want the entry to become ``NULL``, you should use xa_erase().
|
||||
Using xa_insert() on a reserved entry will fail.
|
||||
|
||||
If all entries in the array are ``NULL``, the :c:func:`xa_empty` function
|
||||
If all entries in the array are ``NULL``, the xa_empty() function
|
||||
will return ``true``.
|
||||
|
||||
Finally, you can remove all entries from an XArray by calling
|
||||
:c:func:`xa_destroy`. If the XArray entries are pointers, you may wish
|
||||
xa_destroy(). If the XArray entries are pointers, you may wish
|
||||
to free the entries first. You can do this by iterating over all present
|
||||
entries in the XArray using the :c:func:`xa_for_each` iterator.
|
||||
entries in the XArray using the xa_for_each() iterator.
|
||||
|
||||
Allocating XArrays
|
||||
------------------
|
||||
|
||||
If you use :c:func:`DEFINE_XARRAY_ALLOC` to define the XArray, or
|
||||
initialise it by passing ``XA_FLAGS_ALLOC`` to :c:func:`xa_init_flags`,
|
||||
If you use DEFINE_XARRAY_ALLOC() to define the XArray, or
|
||||
initialise it by passing ``XA_FLAGS_ALLOC`` to xa_init_flags(),
|
||||
the XArray changes to track whether entries are in use or not.
|
||||
|
||||
You can call :c:func:`xa_alloc` to store the entry at an unused index
|
||||
You can call xa_alloc() to store the entry at an unused index
|
||||
in the XArray. If you need to modify the array from interrupt context,
|
||||
you can use :c:func:`xa_alloc_bh` or :c:func:`xa_alloc_irq` to disable
|
||||
you can use xa_alloc_bh() or xa_alloc_irq() to disable
|
||||
interrupts while allocating the ID.
|
||||
|
||||
Using :c:func:`xa_store`, :c:func:`xa_cmpxchg` or :c:func:`xa_insert` will
|
||||
Using xa_store(), xa_cmpxchg() or xa_insert() will
|
||||
also mark the entry as being allocated. Unlike a normal XArray, storing
|
||||
``NULL`` will mark the entry as being in use, like :c:func:`xa_reserve`.
|
||||
To free an entry, use :c:func:`xa_erase` (or :c:func:`xa_release` if
|
||||
``NULL`` will mark the entry as being in use, like xa_reserve().
|
||||
To free an entry, use xa_erase() (or xa_release() if
|
||||
you only want to free the entry if it's ``NULL``).
|
||||
|
||||
By default, the lowest free entry is allocated starting from 0. If you
|
||||
want to allocate entries starting at 1, it is more efficient to use
|
||||
:c:func:`DEFINE_XARRAY_ALLOC1` or ``XA_FLAGS_ALLOC1``. If you want to
|
||||
DEFINE_XARRAY_ALLOC1() or ``XA_FLAGS_ALLOC1``. If you want to
|
||||
allocate IDs up to a maximum, then wrap back around to the lowest free
|
||||
ID, you can use :c:func:`xa_alloc_cyclic`.
|
||||
ID, you can use xa_alloc_cyclic().
|
||||
|
||||
You cannot use ``XA_MARK_0`` with an allocating XArray as this mark
|
||||
is used to track whether an entry is free or not. The other marks are
|
||||
@@ -155,17 +155,17 @@ available for your use.
|
||||
Memory allocation
|
||||
-----------------
|
||||
|
||||
The :c:func:`xa_store`, :c:func:`xa_cmpxchg`, :c:func:`xa_alloc`,
|
||||
:c:func:`xa_reserve` and :c:func:`xa_insert` functions take a gfp_t
|
||||
The xa_store(), xa_cmpxchg(), xa_alloc(),
|
||||
xa_reserve() and xa_insert() functions take a gfp_t
|
||||
parameter in case the XArray needs to allocate memory to store this entry.
|
||||
If the entry is being deleted, no memory allocation needs to be performed,
|
||||
and the GFP flags specified will be ignored.
|
||||
|
||||
It is possible for no memory to be allocatable, particularly if you pass
|
||||
a restrictive set of GFP flags. In that case, the functions return a
|
||||
special value which can be turned into an errno using :c:func:`xa_err`.
|
||||
special value which can be turned into an errno using xa_err().
|
||||
If you don't need to know exactly which error occurred, using
|
||||
:c:func:`xa_is_err` is slightly more efficient.
|
||||
xa_is_err() is slightly more efficient.
|
||||
|
||||
Locking
|
||||
-------
|
||||
@@ -174,54 +174,54 @@ When using the Normal API, you do not have to worry about locking.
|
||||
The XArray uses RCU and an internal spinlock to synchronise access:
|
||||
|
||||
No lock needed:
|
||||
* :c:func:`xa_empty`
|
||||
* :c:func:`xa_marked`
|
||||
* xa_empty()
|
||||
* xa_marked()
|
||||
|
||||
Takes RCU read lock:
|
||||
* :c:func:`xa_load`
|
||||
* :c:func:`xa_for_each`
|
||||
* :c:func:`xa_find`
|
||||
* :c:func:`xa_find_after`
|
||||
* :c:func:`xa_extract`
|
||||
* :c:func:`xa_get_mark`
|
||||
* xa_load()
|
||||
* xa_for_each()
|
||||
* xa_find()
|
||||
* xa_find_after()
|
||||
* xa_extract()
|
||||
* xa_get_mark()
|
||||
|
||||
Takes xa_lock internally:
|
||||
* :c:func:`xa_store`
|
||||
* :c:func:`xa_store_bh`
|
||||
* :c:func:`xa_store_irq`
|
||||
* :c:func:`xa_insert`
|
||||
* :c:func:`xa_insert_bh`
|
||||
* :c:func:`xa_insert_irq`
|
||||
* :c:func:`xa_erase`
|
||||
* :c:func:`xa_erase_bh`
|
||||
* :c:func:`xa_erase_irq`
|
||||
* :c:func:`xa_cmpxchg`
|
||||
* :c:func:`xa_cmpxchg_bh`
|
||||
* :c:func:`xa_cmpxchg_irq`
|
||||
* :c:func:`xa_store_range`
|
||||
* :c:func:`xa_alloc`
|
||||
* :c:func:`xa_alloc_bh`
|
||||
* :c:func:`xa_alloc_irq`
|
||||
* :c:func:`xa_reserve`
|
||||
* :c:func:`xa_reserve_bh`
|
||||
* :c:func:`xa_reserve_irq`
|
||||
* :c:func:`xa_destroy`
|
||||
* :c:func:`xa_set_mark`
|
||||
* :c:func:`xa_clear_mark`
|
||||
* xa_store()
|
||||
* xa_store_bh()
|
||||
* xa_store_irq()
|
||||
* xa_insert()
|
||||
* xa_insert_bh()
|
||||
* xa_insert_irq()
|
||||
* xa_erase()
|
||||
* xa_erase_bh()
|
||||
* xa_erase_irq()
|
||||
* xa_cmpxchg()
|
||||
* xa_cmpxchg_bh()
|
||||
* xa_cmpxchg_irq()
|
||||
* xa_store_range()
|
||||
* xa_alloc()
|
||||
* xa_alloc_bh()
|
||||
* xa_alloc_irq()
|
||||
* xa_reserve()
|
||||
* xa_reserve_bh()
|
||||
* xa_reserve_irq()
|
||||
* xa_destroy()
|
||||
* xa_set_mark()
|
||||
* xa_clear_mark()
|
||||
|
||||
Assumes xa_lock held on entry:
|
||||
* :c:func:`__xa_store`
|
||||
* :c:func:`__xa_insert`
|
||||
* :c:func:`__xa_erase`
|
||||
* :c:func:`__xa_cmpxchg`
|
||||
* :c:func:`__xa_alloc`
|
||||
* :c:func:`__xa_set_mark`
|
||||
* :c:func:`__xa_clear_mark`
|
||||
* __xa_store()
|
||||
* __xa_insert()
|
||||
* __xa_erase()
|
||||
* __xa_cmpxchg()
|
||||
* __xa_alloc()
|
||||
* __xa_set_mark()
|
||||
* __xa_clear_mark()
|
||||
|
||||
If you want to take advantage of the lock to protect the data structures
|
||||
that you are storing in the XArray, you can call :c:func:`xa_lock`
|
||||
before calling :c:func:`xa_load`, then take a reference count on the
|
||||
object you have found before calling :c:func:`xa_unlock`. This will
|
||||
that you are storing in the XArray, you can call xa_lock()
|
||||
before calling xa_load(), then take a reference count on the
|
||||
object you have found before calling xa_unlock(). This will
|
||||
prevent stores from removing the object from the array between looking
|
||||
up the object and incrementing the refcount. You can also use RCU to
|
||||
avoid dereferencing freed memory, but an explanation of that is beyond
|
||||
@@ -261,7 +261,7 @@ context and then erase them in softirq context, you can do that this way::
|
||||
}
|
||||
|
||||
If you are going to modify the XArray from interrupt or softirq context,
|
||||
you need to initialise the array using :c:func:`xa_init_flags`, passing
|
||||
you need to initialise the array using xa_init_flags(), passing
|
||||
``XA_FLAGS_LOCK_IRQ`` or ``XA_FLAGS_LOCK_BH``.
|
||||
|
||||
The above example also shows a common pattern of wanting to extend the
|
||||
@@ -269,20 +269,20 @@ coverage of the xa_lock on the store side to protect some statistics
|
||||
associated with the array.
|
||||
|
||||
Sharing the XArray with interrupt context is also possible, either
|
||||
using :c:func:`xa_lock_irqsave` in both the interrupt handler and process
|
||||
context, or :c:func:`xa_lock_irq` in process context and :c:func:`xa_lock`
|
||||
using xa_lock_irqsave() in both the interrupt handler and process
|
||||
context, or xa_lock_irq() in process context and xa_lock()
|
||||
in the interrupt handler. Some of the more common patterns have helper
|
||||
functions such as :c:func:`xa_store_bh`, :c:func:`xa_store_irq`,
|
||||
:c:func:`xa_erase_bh`, :c:func:`xa_erase_irq`, :c:func:`xa_cmpxchg_bh`
|
||||
and :c:func:`xa_cmpxchg_irq`.
|
||||
functions such as xa_store_bh(), xa_store_irq(),
|
||||
xa_erase_bh(), xa_erase_irq(), xa_cmpxchg_bh()
|
||||
and xa_cmpxchg_irq().
|
||||
|
||||
Sometimes you need to protect access to the XArray with a mutex because
|
||||
that lock sits above another mutex in the locking hierarchy. That does
|
||||
not entitle you to use functions like :c:func:`__xa_erase` without taking
|
||||
not entitle you to use functions like __xa_erase() without taking
|
||||
the xa_lock; the xa_lock is used for lockdep validation and will be used
|
||||
for other purposes in the future.
|
||||
|
||||
The :c:func:`__xa_set_mark` and :c:func:`__xa_clear_mark` functions are also
|
||||
The __xa_set_mark() and __xa_clear_mark() functions are also
|
||||
available for situations where you look up an entry and want to atomically
|
||||
set or clear a mark. It may be more efficient to use the advanced API
|
||||
in this case, as it will save you from walking the tree twice.
|
||||
@@ -300,27 +300,27 @@ indeed the normal API is implemented in terms of the advanced API. The
|
||||
advanced API is only available to modules with a GPL-compatible license.
|
||||
|
||||
The advanced API is based around the xa_state. This is an opaque data
|
||||
structure which you declare on the stack using the :c:func:`XA_STATE`
|
||||
structure which you declare on the stack using the XA_STATE()
|
||||
macro. This macro initialises the xa_state ready to start walking
|
||||
around the XArray. It is used as a cursor to maintain the position
|
||||
in the XArray and let you compose various operations together without
|
||||
having to restart from the top every time.
|
||||
|
||||
The xa_state is also used to store errors. You can call
|
||||
:c:func:`xas_error` to retrieve the error. All operations check whether
|
||||
xas_error() to retrieve the error. All operations check whether
|
||||
the xa_state is in an error state before proceeding, so there's no need
|
||||
for you to check for an error after each call; you can make multiple
|
||||
calls in succession and only check at a convenient point. The only
|
||||
errors currently generated by the XArray code itself are ``ENOMEM`` and
|
||||
``EINVAL``, but it supports arbitrary errors in case you want to call
|
||||
:c:func:`xas_set_err` yourself.
|
||||
xas_set_err() yourself.
|
||||
|
||||
If the xa_state is holding an ``ENOMEM`` error, calling :c:func:`xas_nomem`
|
||||
If the xa_state is holding an ``ENOMEM`` error, calling xas_nomem()
|
||||
will attempt to allocate more memory using the specified gfp flags and
|
||||
cache it in the xa_state for the next attempt. The idea is that you take
|
||||
the xa_lock, attempt the operation and drop the lock. The operation
|
||||
attempts to allocate memory while holding the lock, but it is more
|
||||
likely to fail. Once you have dropped the lock, :c:func:`xas_nomem`
|
||||
likely to fail. Once you have dropped the lock, xas_nomem()
|
||||
can try harder to allocate more memory. It will return ``true`` if it
|
||||
is worth retrying the operation (i.e. that there was a memory error *and*
|
||||
more memory was allocated). If it has previously allocated memory, and
|
||||
@@ -333,7 +333,7 @@ Internal Entries
|
||||
The XArray reserves some entries for its own purposes. These are never
|
||||
exposed through the normal API, but when using the advanced API, it's
|
||||
possible to see them. Usually the best way to handle them is to pass them
|
||||
to :c:func:`xas_retry`, and retry the operation if it returns ``true``.
|
||||
to xas_retry(), and retry the operation if it returns ``true``.
|
||||
|
||||
.. flat-table::
|
||||
:widths: 1 1 6
|
||||
@@ -343,89 +343,89 @@ to :c:func:`xas_retry`, and retry the operation if it returns ``true``.
|
||||
- Usage
|
||||
|
||||
* - Node
|
||||
- :c:func:`xa_is_node`
|
||||
- xa_is_node()
|
||||
- An XArray node. May be visible when using a multi-index xa_state.
|
||||
|
||||
* - Sibling
|
||||
- :c:func:`xa_is_sibling`
|
||||
- xa_is_sibling()
|
||||
- A non-canonical entry for a multi-index entry. The value indicates
|
||||
which slot in this node has the canonical entry.
|
||||
|
||||
* - Retry
|
||||
- :c:func:`xa_is_retry`
|
||||
- xa_is_retry()
|
||||
- This entry is currently being modified by a thread which has the
|
||||
xa_lock. The node containing this entry may be freed at the end
|
||||
of this RCU period. You should restart the lookup from the head
|
||||
of the array.
|
||||
|
||||
* - Zero
|
||||
- :c:func:`xa_is_zero`
|
||||
- xa_is_zero()
|
||||
- Zero entries appear as ``NULL`` through the Normal API, but occupy
|
||||
an entry in the XArray which can be used to reserve the index for
|
||||
future use. This is used by allocating XArrays for allocated entries
|
||||
which are ``NULL``.
|
||||
|
||||
Other internal entries may be added in the future. As far as possible, they
|
||||
will be handled by :c:func:`xas_retry`.
|
||||
will be handled by xas_retry().
|
||||
|
||||
Additional functionality
|
||||
------------------------
|
||||
|
||||
The :c:func:`xas_create_range` function allocates all the necessary memory
|
||||
The xas_create_range() function allocates all the necessary memory
|
||||
to store every entry in a range. It will set ENOMEM in the xa_state if
|
||||
it cannot allocate memory.
|
||||
|
||||
You can use :c:func:`xas_init_marks` to reset the marks on an entry
|
||||
You can use xas_init_marks() to reset the marks on an entry
|
||||
to their default state. This is usually all marks clear, unless the
|
||||
XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case mark 0 is set
|
||||
and all other marks are clear. Replacing one entry with another using
|
||||
:c:func:`xas_store` will not reset the marks on that entry; if you want
|
||||
xas_store() will not reset the marks on that entry; if you want
|
||||
the marks reset, you should do that explicitly.
|
||||
|
||||
The :c:func:`xas_load` will walk the xa_state as close to the entry
|
||||
The xas_load() will walk the xa_state as close to the entry
|
||||
as it can. If you know the xa_state has already been walked to the
|
||||
entry and need to check that the entry hasn't changed, you can use
|
||||
:c:func:`xas_reload` to save a function call.
|
||||
xas_reload() to save a function call.
|
||||
|
||||
If you need to move to a different index in the XArray, call
|
||||
:c:func:`xas_set`. This resets the cursor to the top of the tree, which
|
||||
xas_set(). This resets the cursor to the top of the tree, which
|
||||
will generally make the next operation walk the cursor to the desired
|
||||
spot in the tree. If you want to move to the next or previous index,
|
||||
call :c:func:`xas_next` or :c:func:`xas_prev`. Setting the index does
|
||||
call xas_next() or xas_prev(). Setting the index does
|
||||
not walk the cursor around the array so does not require a lock to be
|
||||
held, while moving to the next or previous index does.
|
||||
|
||||
You can search for the next present entry using :c:func:`xas_find`. This
|
||||
is the equivalent of both :c:func:`xa_find` and :c:func:`xa_find_after`;
|
||||
You can search for the next present entry using xas_find(). This
|
||||
is the equivalent of both xa_find() and xa_find_after();
|
||||
if the cursor has been walked to an entry, then it will find the next
|
||||
entry after the one currently referenced. If not, it will return the
|
||||
entry at the index of the xa_state. Using :c:func:`xas_next_entry` to
|
||||
move to the next present entry instead of :c:func:`xas_find` will save
|
||||
entry at the index of the xa_state. Using xas_next_entry() to
|
||||
move to the next present entry instead of xas_find() will save
|
||||
a function call in the majority of cases at the expense of emitting more
|
||||
inline code.
|
||||
|
||||
The :c:func:`xas_find_marked` function is similar. If the xa_state has
|
||||
The xas_find_marked() function is similar. If the xa_state has
|
||||
not been walked, it will return the entry at the index of the xa_state,
|
||||
if it is marked. Otherwise, it will return the first marked entry after
|
||||
the entry referenced by the xa_state. The :c:func:`xas_next_marked`
|
||||
function is the equivalent of :c:func:`xas_next_entry`.
|
||||
the entry referenced by the xa_state. The xas_next_marked()
|
||||
function is the equivalent of xas_next_entry().
|
||||
|
||||
When iterating over a range of the XArray using :c:func:`xas_for_each`
|
||||
or :c:func:`xas_for_each_marked`, it may be necessary to temporarily stop
|
||||
the iteration. The :c:func:`xas_pause` function exists for this purpose.
|
||||
When iterating over a range of the XArray using xas_for_each()
|
||||
or xas_for_each_marked(), it may be necessary to temporarily stop
|
||||
the iteration. The xas_pause() function exists for this purpose.
|
||||
After you have done the necessary work and wish to resume, the xa_state
|
||||
is in an appropriate state to continue the iteration after the entry
|
||||
you last processed. If you have interrupts disabled while iterating,
|
||||
then it is good manners to pause the iteration and reenable interrupts
|
||||
every ``XA_CHECK_SCHED`` entries.
|
||||
|
||||
The :c:func:`xas_get_mark`, :c:func:`xas_set_mark` and
|
||||
:c:func:`xas_clear_mark` functions require the xa_state cursor to have
|
||||
The xas_get_mark(), xas_set_mark() and
|
||||
xas_clear_mark() functions require the xa_state cursor to have
|
||||
been moved to the appropriate location in the xarray; they will do
|
||||
nothing if you have called :c:func:`xas_pause` or :c:func:`xas_set`
|
||||
nothing if you have called xas_pause() or xas_set()
|
||||
immediately before.
|
||||
|
||||
You can call :c:func:`xas_set_update` to have a callback function
|
||||
You can call xas_set_update() to have a callback function
|
||||
called each time the XArray updates a node. This is used by the page
|
||||
cache workingset code to maintain its list of nodes which contain only
|
||||
shadow entries.
|
||||
@@ -443,25 +443,25 @@ eg indices 64-127 may be tied together, but 2-6 may not be. This may
|
||||
save substantial quantities of memory; for example tying 512 entries
|
||||
together will save over 4kB.
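The example in the hunk context (indices 64-127 may be tied together, 2-6 may not) is consistent with a rule that a tied range must be a power of two in size and aligned to that size. The sketch below checks that rule in Python; the rule itself is an assumption inferred from the example, and the helper names ``can_tie`` and ``order_of`` are hypothetical.

```python
def can_tie(first, last):
    """True if indices first..last form an aligned power-of-two range."""
    count = last - first + 1
    is_power_of_two = count > 0 and (count & (count - 1)) == 0
    return is_power_of_two and first % count == 0


def order_of(first, last):
    """Return log2 of the number of indices in a tieable range."""
    if not can_tie(first, last):
        raise ValueError("range is not an aligned power of two")
    return (last - first + 1).bit_length() - 1
```

Under this rule ``can_tie(64, 127)`` holds with order 6, ``can_tie(2, 6)`` fails because five indices are not a power of two, and ``can_tie(2, 5)`` fails because the range is not aligned to its size.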
|
||||
|
||||
You can create a multi-index entry by using :c:func:`XA_STATE_ORDER`
|
||||
or :c:func:`xas_set_order` followed by a call to :c:func:`xas_store`.
|
||||
Calling :c:func:`xas_load` with a multi-index xa_state will walk the
|
||||
You can create a multi-index entry by using XA_STATE_ORDER()
|
||||
or xas_set_order() followed by a call to xas_store().
|
||||
Calling xas_load() with a multi-index xa_state will walk the
|
||||
xa_state to the right location in the tree, but the return value is not
|
||||
meaningful, potentially being an internal entry or ``NULL`` even when there
|
||||
is an entry stored within the range. Calling :c:func:`xas_find_conflict`
|
||||
is an entry stored within the range. Calling xas_find_conflict()
|
||||
will return the first entry within the range or ``NULL`` if there are no
|
||||
entries in the range. The :c:func:`xas_for_each_conflict` iterator will
|
||||
entries in the range. The xas_for_each_conflict() iterator will
|
||||
iterate over every entry which overlaps the specified range.
|
||||
|
||||
If :c:func:`xas_load` encounters a multi-index entry, the xa_index
|
||||
If xas_load() encounters a multi-index entry, the xa_index
|
||||
in the xa_state will not be changed. When iterating over an XArray
|
||||
or calling :c:func:`xas_find`, if the initial index is in the middle
|
||||
or calling xas_find(), if the initial index is in the middle
|
||||
of a multi-index entry, it will not be altered. Subsequent calls
|
||||
or iterations will move the index to the first index in the range.
|
||||
Each entry will only be returned once, no matter how many indices it
|
||||
occupies.
|
||||
|
||||
Using :c:func:`xas_next` or :c:func:`xas_prev` with a multi-index xa_state
|
||||
Using xas_next() or xas_prev() with a multi-index xa_state
|
||||
is not supported. Using either of these functions on a multi-index entry
|
||||
will reveal sibling entries; these should be skipped over by the caller.
|
||||
|
||||
|
@@ -1,3 +1,4 @@
|
||||
=============================
|
||||
Guidance for writing policies
|
||||
=============================
|
||||
|
||||
@@ -30,7 +31,7 @@ multiqueue (mq)
|
||||
|
||||
This policy is now an alias for smq (see below).
|
||||
|
||||
The following tunables are accepted, but have no effect:
|
||||
The following tunables are accepted, but have no effect::
|
||||
|
||||
'sequential_threshold <#nr_sequential_ios>'
|
||||
'random_threshold <#nr_random_ios>'
|
||||
@@ -56,7 +57,9 @@ mq policy's hints to be dropped. Also, performance of the cache may
|
||||
degrade slightly until smq recalculates the origin device's hotspots
|
||||
that should be cached.
|
||||
|
||||
Memory usage:
|
||||
Memory usage
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The mq policy used a lot of memory; 88 bytes per cache block on a 64
|
||||
bit machine.
|
||||
|
||||
@@ -69,7 +72,9 @@ cache block).
|
||||
All this means smq uses ~25bytes per cache block. Still a lot of
|
||||
memory, but a substantial improvement nonetheless.
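To put the per-block figures in perspective, the sketch below multiplies them out for one cache geometry. The 88-byte and roughly 25-byte figures come from the text; the 1 TiB cache size and 256 KiB block size are hypothetical values chosen only for illustration, and ``policy_memory`` is a hypothetical helper.

```python
MQ_BYTES_PER_BLOCK = 88   # from the text, on a 64-bit machine
SMQ_BYTES_PER_BLOCK = 25  # approximate figure from the text


def policy_memory(cache_bytes, block_bytes, bytes_per_block):
    """Policy memory in bytes for a cache split into fixed-size blocks."""
    blocks = cache_bytes // block_bytes
    return blocks * bytes_per_block


cache_bytes = 1 << 40      # hypothetical 1 TiB cache device
block_bytes = 256 * 1024   # hypothetical 256 KiB cache block

mq = policy_memory(cache_bytes, block_bytes, MQ_BYTES_PER_BLOCK)
smq = policy_memory(cache_bytes, block_bytes, SMQ_BYTES_PER_BLOCK)
print(mq // 2**20, "MiB for mq")    # 352 MiB
print(smq // 2**20, "MiB for smq")  # 100 MiB
```

For this geometry the cache has 4194304 blocks, so the difference between the two policies is about 252 MiB of kernel memory.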
|
||||
|
||||
Level balancing:
|
||||
Level balancing
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
mq placed entries in different levels of the multiqueue structures
|
||||
based on their hit count (~ln(hit count)). This meant the bottom
|
||||
levels generally had the most entries, and the top ones had very
|
||||
@@ -94,7 +99,9 @@ is used to decide which blocks to promote. If the hotspot queue is
|
||||
performing badly then it starts moving entries more quickly between
|
||||
levels. This lets it adapt to new IO patterns very quickly.
|
||||
|
||||
Performance:
|
||||
Performance
|
||||
^^^^^^^^^^^
|
||||
|
||||
Testing smq shows substantially better performance than mq.
|
||||
|
||||
cleaner
|
||||
@@ -105,16 +112,19 @@ The cleaner writes back all dirty blocks in a cache to decommission it.
|
||||
Examples
|
||||
========
|
||||
|
||||
The syntax for a table is:
|
||||
The syntax for a table is::
|
||||
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature_args> [<feature arg>]*
|
||||
<policy> <#policy_args> [<policy arg>]*
|
||||
|
||||
The syntax to send a message using the dmsetup command is:
|
||||
The syntax to send a message using the dmsetup command is::
|
||||
|
||||
dmsetup message <mapped device> 0 sequential_threshold 1024
|
||||
dmsetup message <mapped device> 0 random_threshold 8
|
||||
|
||||
Using dmsetup:
|
||||
Using dmsetup::
|
||||
|
||||
dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
|
||||
/dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
|
||||
creates a 128GB large mapped device named 'blah' with the
|
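The 128GB figure in the dmsetup example follows from the table length, which device-mapper expresses in 512-byte sectors. A short check of that arithmetic, with ``table_length_bytes`` as a hypothetical helper name:

```python
SECTOR_BYTES = 512  # device-mapper tables count 512-byte sectors


def table_length_bytes(sectors):
    """Convert a device-mapper length in sectors to bytes."""
    return sectors * SECTOR_BYTES


device_bytes = table_length_bytes(268435456)  # length field from the example
block_bytes = table_length_bytes(512)         # block size field from the example

print(device_bytes // 2**30, "GiB mapped device")  # 128
print(block_bytes // 2**10, "KiB cache block")     # 256
```

The same conversion applies to the block size argument, so the ``512`` in the example corresponds to a 256 KiB cache block.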
@@ -1,3 +1,7 @@
|
||||
=====
|
||||
Cache
|
||||
=====
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
@@ -24,10 +28,13 @@ scenarios (eg. a vm image server).
|
||||
Glossary
|
||||
========
|
||||
|
||||
Migration - Movement of the primary copy of a logical block from one
|
||||
Migration
|
||||
Movement of the primary copy of a logical block from one
|
||||
device to the other.
|
||||
Promotion - Migration from slow device to fast device.
|
||||
Demotion - Migration from fast device to slow device.
|
||||
Promotion
|
||||
Migration from slow device to fast device.
|
||||
Demotion
|
||||
Migration from fast device to slow device.
|
||||
|
||||
The origin device always contains a copy of the logical block, which
|
||||
may be out of date or kept in sync with the copy on the cache device
|
||||
@@ -169,45 +176,53 @@ Target interface
|
||||
Constructor
|
||||
-----------
|
||||
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature args> [<feature arg>]*
|
||||
<policy> <#policy args> [policy args]*
|
||||
::
|
||||
|
||||
metadata dev : fast device holding the persistent metadata
|
||||
cache dev : fast device holding cached data blocks
|
||||
origin dev : slow device holding original data blocks
|
||||
block size : cache unit size in sectors
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature args> [<feature arg>]*
|
||||
<policy> <#policy args> [policy args]*
|
||||
|
||||
#feature args : number of feature arguments passed
|
||||
feature args : writethrough or passthrough (The default is writeback.)
|
||||
================ =======================================================
|
||||
metadata dev fast device holding the persistent metadata
|
||||
cache dev fast device holding cached data blocks
|
||||
origin dev slow device holding original data blocks
|
||||
block size cache unit size in sectors
|
||||
|
||||
policy : the replacement policy to use
|
||||
#policy args : an even number of arguments corresponding to
|
||||
key/value pairs passed to the policy
|
||||
policy args : key/value pairs passed to the policy
|
||||
E.g. 'sequential_threshold 1024'
|
||||
See cache-policies.txt for details.
|
||||
#feature args number of feature arguments passed
|
||||
feature args writethrough or passthrough (The default is writeback.)
|
||||
|
||||
policy the replacement policy to use
|
||||
#policy args an even number of arguments corresponding to
|
||||
key/value pairs passed to the policy
|
||||
policy args key/value pairs passed to the policy
|
||||
E.g. 'sequential_threshold 1024'
|
||||
See cache-policies.txt for details.
|
||||
================ =======================================================
|
||||
|
||||
Optional feature arguments are:
|
||||
writethrough : write through caching that prohibits cache block
|
||||
content from being different from origin block content.
|
||||
Without this argument, the default behaviour is to write
|
||||
back cache block contents later for performance reasons,
|
||||
so they may differ from the corresponding origin blocks.
|
||||
|
||||
passthrough : a degraded mode useful for various cache coherency
|
||||
situations (e.g., rolling back snapshots of
|
||||
underlying storage). Reads and writes always go to
|
||||
the origin. If a write goes to a cached origin
|
||||
block, then the cache block is invalidated.
|
||||
To enable passthrough mode the cache must be clean.
|
||||
|
||||
metadata2 : use version 2 of the metadata. This stores the dirty bits
|
||||
in a separate btree, which improves speed of shutting
|
||||
down the cache.
|
||||
==================== ========================================================
|
||||
writethrough write through caching that prohibits cache block
|
||||
content from being different from origin block content.
|
||||
Without this argument, the default behaviour is to write
|
||||
back cache block contents later for performance reasons,
|
||||
so they may differ from the corresponding origin blocks.
|
||||
|
||||
no_discard_passdown : disable passing down discards from the cache
|
||||
to the origin's data device.
|
||||
passthrough a degraded mode useful for various cache coherency
|
||||
situations (e.g., rolling back snapshots of
|
||||
underlying storage). Reads and writes always go to
|
||||
the origin. If a write goes to a cached origin
|
||||
block, then the cache block is invalidated.
|
||||
To enable passthrough mode the cache must be clean.
|
||||
|
||||
metadata2 use version 2 of the metadata. This stores the dirty
|
||||
bits in a separate btree, which improves speed of
|
||||
shutting down the cache.
|
||||
|
||||
no_discard_passdown disable passing down discards from the cache
|
||||
to the origin's data device.
|
||||
==================== ========================================================
|
||||
|
||||
A policy called 'default' is always registered. This is an alias for
|
||||
the policy we currently think is giving best all round performance.
|
||||
@ -218,54 +233,61 @@ the characteristics of a specific policy, always request it by name.
|
||||
Status
|
||||
------
|
||||
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<cache block size> <#used cache blocks>/<#total cache blocks>
|
||||
<#read hits> <#read misses> <#write hits> <#write misses>
|
||||
<#demotions> <#promotions> <#dirty> <#features> <features>*
|
||||
<#core args> <core args>* <policy name> <#policy args> <policy args>*
|
||||
<cache metadata mode>
|
||||
::
|
||||
|
||||
metadata block size : Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks : Number of metadata blocks used
|
||||
#total metadata blocks : Total number of metadata blocks
|
||||
cache block size : Configurable block size for the cache device
|
||||
in sectors
|
||||
#used cache blocks : Number of blocks resident in the cache
|
||||
#total cache blocks : Total number of cache blocks
|
||||
#read hits : Number of times a READ bio has been mapped
|
||||
to the cache
|
||||
#read misses : Number of times a READ bio has been mapped
|
||||
to the origin
|
||||
#write hits : Number of times a WRITE bio has been mapped
|
||||
to the cache
|
||||
#write misses : Number of times a WRITE bio has been
|
||||
mapped to the origin
|
||||
#demotions : Number of times a block has been removed
|
||||
from the cache
|
||||
#promotions : Number of times a block has been moved to
|
||||
the cache
|
||||
#dirty : Number of blocks in the cache that differ
|
||||
from the origin
|
||||
#feature args : Number of feature args to follow
|
||||
feature args : 'writethrough' (optional)
|
||||
#core args : Number of core arguments (must be even)
|
||||
core args : Key/value pairs for tuning the core
|
||||
e.g. migration_threshold
|
||||
policy name : Name of the policy
|
||||
#policy args : Number of policy arguments to follow (must be even)
|
||||
policy args : Key/value pairs e.g. sequential_threshold
|
||||
cache metadata mode : ro if read-only, rw if read-write
|
||||
In serious cases where even a read-only mode is deemed unsafe
|
||||
no further I/O will be permitted and the status will just
|
||||
contain the string 'Fail'. The userspace recovery tools
|
||||
should then be used.
|
||||
needs_check : 'needs_check' if set, '-' if not set
|
||||
A metadata operation has failed, resulting in the needs_check
|
||||
flag being set in the metadata's superblock. The metadata
|
||||
device must be deactivated and checked/repaired before the
|
||||
cache can be made fully operational again. '-' indicates
|
||||
needs_check is not set.
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<cache block size> <#used cache blocks>/<#total cache blocks>
|
||||
<#read hits> <#read misses> <#write hits> <#write misses>
|
||||
<#demotions> <#promotions> <#dirty> <#features> <features>*
|
||||
<#core args> <core args>* <policy name> <#policy args> <policy args>*
|
||||
<cache metadata mode>
|
||||
|
||||
|
||||
========================= =====================================================
|
||||
metadata block size Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks Number of metadata blocks used
|
||||
#total metadata blocks Total number of metadata blocks
|
||||
cache block size Configurable block size for the cache device
|
||||
in sectors
|
||||
#used cache blocks Number of blocks resident in the cache
|
||||
#total cache blocks Total number of cache blocks
|
||||
#read hits Number of times a READ bio has been mapped
|
||||
to the cache
|
||||
#read misses Number of times a READ bio has been mapped
|
||||
to the origin
|
||||
#write hits Number of times a WRITE bio has been mapped
|
||||
to the cache
|
||||
#write misses Number of times a WRITE bio has been
|
||||
mapped to the origin
|
||||
#demotions Number of times a block has been removed
|
||||
from the cache
|
||||
#promotions Number of times a block has been moved to
|
||||
the cache
|
||||
#dirty Number of blocks in the cache that differ
|
||||
from the origin
|
||||
#feature args Number of feature args to follow
|
||||
feature args 'writethrough' (optional)
|
||||
#core args Number of core arguments (must be even)
|
||||
core args Key/value pairs for tuning the core
|
||||
e.g. migration_threshold
|
||||
policy name Name of the policy
|
||||
#policy args Number of policy arguments to follow (must be even)
|
||||
policy args Key/value pairs e.g. sequential_threshold
|
||||
cache metadata mode ro if read-only, rw if read-write
|
||||
|
||||
In serious cases where even a read-only mode is
|
||||
deemed unsafe no further I/O will be permitted and
|
||||
the status will just contain the string 'Fail'.
|
||||
The userspace recovery tools should then be used.
|
||||
needs_check 'needs_check' if set, '-' if not set
|
||||
A metadata operation has failed, resulting in the
|
||||
needs_check flag being set in the metadata's
|
||||
superblock. The metadata device must be
|
||||
deactivated and checked/repaired before the
|
||||
cache can be made fully operational again.
|
||||
'-' indicates needs_check is not set.
|
||||
========================= =====================================================
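The status fields above can be turned into a quick health figure. The following is a minimal sketch, not part of the original document: the helper name `cache_read_hit_pct` is hypothetical, and it assumes the status line still carries the `<start> <length> cache` prefix that `dmsetup status` normally prints, which would put `<#read hits>` in field 8 and `<#read misses>` in field 9.

```shell
#!/bin/sh
# Hypothetical helper: read one dm-cache status line on stdin and print the
# read hit percentage (integer, truncated), or "n/a" if no reads were counted.
# Assumption: the line starts with "<start> <length> cache", so the status
# fields documented above begin at field 4.
cache_read_hit_pct() {
    awk '$3 == "cache" {
        hits = $8      # <#read hits>
        misses = $9    # <#read misses>
        total = hits + misses
        if (total == 0)
            print "n/a"
        else
            printf "%d\n", int(hits * 100 / total)
    }'
}
```

On a live system this could be fed from `dmsetup status my_cache | cache_read_hit_pct`, where `my_cache` is the device name used in the examples of this document.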
|
||||
|
||||
Messages
|
||||
--------
|
||||
@ -274,11 +296,12 @@ Policies will have different tunables, specific to each one, so we
|
||||
need a generic way of getting and setting these. Device-mapper
|
||||
messages are used. (A sysfs interface would also be possible.)
|
||||
|
||||
The message format is:
|
||||
The message format is::
|
||||
|
||||
<key> <value>
|
||||
|
||||
E.g.
|
||||
E.g.::
|
||||
|
||||
dmsetup message my_cache 0 sequential_threshold 1024
|
||||
|
||||
|
||||
@ -290,11 +313,12 @@ of values from 5 to 9. Each cblock must be expressed as a decimal
|
||||
value, in the future a variant message that takes cblock ranges
|
||||
expressed in hexadecimal may be needed to better support efficient
|
||||
invalidation of larger caches. The cache must be in passthrough mode
|
||||
when invalidate_cblocks is used.
|
||||
when invalidate_cblocks is used::
|
||||
|
||||
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
|
||||
|
||||
E.g.
|
||||
E.g.::
|
||||
|
||||
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
|
||||
|
||||
Examples
|
||||
@ -304,8 +328,10 @@ The test suite can be found here:
|
||||
|
||||
https://github.com/jthornber/device-mapper-test-suite
|
||||
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
|
||||
mq 4 sequential_threshold 1024 random_threshold 8'
|
||||
::
|
||||
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
|
||||
mq 4 sequential_threshold 1024 random_threshold 8'
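The length field in these tables is given in 512-byte sectors, so `41943040` corresponds to a 20 GiB mapped device. As a small sketch of that arithmetic (the helper name `gib_to_sectors` is hypothetical and not part of dmsetup):

```shell
#!/bin/sh
# Hypothetical helper: convert a size in GiB to a count of 512-byte sectors.
# 1 GiB = 2^30 bytes = 2^21 sectors of 512 bytes, i.e. 2097152 sectors.
gib_to_sectors() {
    echo $(( $1 * 2097152 ))
}
```

For instance, `gib_to_sectors 20` yields the `41943040` used above, and `gib_to_sectors 128` yields the `268435456` used for the 128GB device in the cache policy example.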
|
@ -1,10 +1,12 @@
|
||||
========
|
||||
dm-delay
|
||||
========
|
||||
|
||||
Device-Mapper's "delay" target delays reads and/or writes
|
||||
and maps them to different devices.
|
||||
|
||||
Parameters:
|
||||
Parameters::
|
||||
|
||||
<device> <offset> <delay> [<write_device> <write_offset> <write_delay>
|
||||
[<flush_device> <flush_offset> <flush_delay>]]
|
||||
|
||||
@ -14,15 +16,16 @@ Delays are specified in milliseconds.
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create device delaying rw operation for 500ms
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
|
||||
]]
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create device delaying only write operation for 500ms and
|
||||
# splitting reads and writes to different devices $1 $2
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
|
||||
]]
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create device delaying rw operation for 500ms
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create device delaying only write operation for 500ms and
|
||||
# splitting reads and writes to different devices $1 $2
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
|
@ -1,5 +1,6 @@
|
||||
========
|
||||
dm-crypt
|
||||
=========
|
||||
========
|
||||
|
||||
Device-Mapper's "crypt" target provides transparent encryption of block devices
|
||||
using the kernel crypto API.
|
||||
@ -7,15 +8,20 @@ using the kernel crypto API.
|
||||
For a more detailed description of supported parameters see:
|
||||
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
|
||||
|
||||
Parameters: <cipher> <key> <iv_offset> <device path> \
|
||||
Parameters::
|
||||
|
||||
<cipher> <key> <iv_offset> <device path> \
|
||||
<offset> [<#opt_params> <opt_params>]
|
||||
|
||||
<cipher>
|
||||
Encryption cipher, encryption mode and Initial Vector (IV) generator.
|
||||
|
||||
The cipher specifications format is:
|
||||
The cipher specifications format is::
|
||||
|
||||
cipher[:keycount]-chainmode-ivmode[:ivopts]
|
||||
Examples:
|
||||
|
||||
Examples::
|
||||
|
||||
aes-cbc-essiv:sha256
|
||||
aes-xts-plain64
|
||||
serpent-xts-plain64
|
||||
@ -25,12 +31,17 @@ Parameters: <cipher> <key> <iv_offset> <device path> \
|
||||
as for the first format type.
|
||||
This format is mainly used for specification of authenticated modes.
|
||||
|
||||
The crypto API cipher specifications format is:
|
||||
The crypto API cipher specifications format is::
|
||||
|
||||
capi:cipher_api_spec-ivmode[:ivopts]
|
||||
Examples:
|
||||
|
||||
Examples::
|
||||
|
||||
capi:cbc(aes)-essiv:sha256
|
||||
capi:xts(aes)-plain64
|
||||
Examples of authenticated modes:
|
||||
|
||||
Examples of authenticated modes::
|
||||
|
||||
capi:gcm(aes)-random
|
||||
capi:authenc(hmac(sha256),xts(aes))-random
|
||||
capi:rfc7539(chacha20,poly1305)-random
|
||||
@ -142,21 +153,21 @@ LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
|
||||
encryption with dm-crypt using the 'cryptsetup' utility, see
|
||||
https://gitlab.com/cryptsetup/cryptsetup
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup
|
||||
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
|
||||
]]
|
||||
::
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup when encryption key is stored in keyring service
|
||||
dmsetup create crypt2 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
|
||||
]]
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup
|
||||
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using cryptsetup and LUKS header with default cipher
|
||||
cryptsetup luksFormat $1
|
||||
cryptsetup luksOpen $1 crypt1
|
||||
]]
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup when encryption key is stored in keyring service
|
||||
dmsetup create crypt2 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create a crypt device using cryptsetup and LUKS header with default cipher
|
||||
cryptsetup luksFormat $1
|
||||
cryptsetup luksOpen $1 crypt1
|
@ -1,3 +1,4 @@
|
||||
=========
|
||||
dm-flakey
|
||||
=========
|
||||
|
||||
@ -15,17 +16,26 @@ underlying devices.
|
||||
|
||||
Table parameters
|
||||
----------------
|
||||
|
||||
::
|
||||
|
||||
<dev path> <offset> <up interval> <down interval> \
|
||||
[<num_features> [<feature arguments>]]
|
||||
|
||||
Mandatory parameters:
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
<up interval>: Number of seconds device is available.
|
||||
<down interval>: Number of seconds device returns errors.
|
||||
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
<up interval>:
|
||||
Number of seconds device is available.
|
||||
<down interval>:
|
||||
Number of seconds device returns errors.
|
||||
|
||||
Optional feature parameters:
|
||||
|
||||
If no feature parameters are present, during the periods of
|
||||
unreliability, all I/O returns errors.
|
||||
|
||||
@ -41,17 +51,24 @@ Optional feature parameters:
|
||||
During <down interval>, replace <Nth_byte> of the data of
|
||||
each matching bio with <value>.
|
||||
|
||||
<Nth_byte>: The offset of the byte to replace.
|
||||
Counting starts at 1, to replace the first byte.
|
||||
<direction>: Either 'r' to corrupt reads or 'w' to corrupt writes.
|
||||
'w' is incompatible with drop_writes.
|
||||
<value>: The value (from 0-255) to write.
|
||||
<flags>: Perform the replacement only if bio->bi_opf has all the
|
||||
selected flags set.
|
||||
<Nth_byte>:
|
||||
The offset of the byte to replace.
|
||||
Counting starts at 1, to replace the first byte.
|
||||
<direction>:
|
||||
Either 'r' to corrupt reads or 'w' to corrupt writes.
|
||||
'w' is incompatible with drop_writes.
|
||||
<value>:
|
||||
The value (from 0-255) to write.
|
||||
<flags>:
|
||||
Perform the replacement only if bio->bi_opf has all the
|
||||
selected flags set.
|
||||
|
||||
Examples:
|
||||
|
||||
Replaces the 32nd byte of READ bios with the value 1::
|
||||
|
||||
corrupt_bio_byte 32 r 1 0
|
||||
- replaces the 32nd byte of READ bios with the value 1
|
||||
|
||||
Replaces the 224th byte of REQ_META (=32) bios with the value 0::
|
||||
|
||||
corrupt_bio_byte 224 w 0 32
|
||||
- replaces the 224th byte of REQ_META (=32) bios with the value 0
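To see how a feature such as `corrupt_bio_byte` fits into a complete table line, here is a hedged sketch. The helper name `flakey_corrupt_table` and the interval values in the usage note are illustrative only, and the sketch assumes that `<num_features>` counts every feature word, so the keyword plus its four values gives 5.

```shell
#!/bin/sh
# Hypothetical helper: print a dm-flakey table line that corrupts READ bios.
# Usage: flakey_corrupt_table <dev> <sectors> <up seconds> <down seconds>
flakey_corrupt_table() {
    dev=$1; sectors=$2; up=$3; down=$4
    # "5" is the assumed feature-argument count: the keyword plus its 4 values.
    printf '0 %s flakey %s 0 %s %s 5 corrupt_bio_byte 32 r 1 0\n' \
        "$sectors" "$dev" "$up" "$down"
}
```

The printed line could then be piped to `dmsetup create` as root, for example `flakey_corrupt_table "$1" "$(blockdev --getsz "$1")" 30 10 | dmsetup create flaky`.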
|
@ -1,5 +1,6 @@
|
||||
================================
|
||||
Early creation of mapped devices
|
||||
====================================
|
||||
================================
|
||||
|
||||
It is possible to configure a device-mapper device to act as the root device for
|
||||
your system in two ways.
|
||||
@ -12,15 +13,17 @@ The second is to create one or more device-mappers using the module parameter
|
||||
|
||||
The format is specified as a string of data separated by commas and optionally
|
||||
semi-colons, where:
|
||||
|
||||
- a comma is used to separate fields like name, uuid, flags and table
|
||||
(specifies one device)
|
||||
- a semi-colon is used to separate devices.
|
||||
|
||||
So the format will look like this:
|
||||
So the format will look like this::
|
||||
|
||||
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
|
||||
|
||||
Where,
|
||||
Where::
|
||||
|
||||
<name> ::= The device name.
|
||||
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
|
||||
<minor> ::= The device minor number | ""
|
||||
@ -29,7 +32,7 @@ Where,
|
||||
<target_type> ::= "verity" | "linear" | ... (see list below)
|
||||
|
||||
The dm line should be equivalent to the one used by the dmsetup tool with the
|
||||
--concise argument.
|
||||
`--concise` argument.
|
||||
|
||||
Target types
|
||||
============
|
||||
@ -38,32 +41,34 @@ Not all target types are available as there are serious risks in allowing
|
||||
activation of certain DM targets without first using userspace tools to check
|
||||
the validity of associated metadata.
|
||||
|
||||
"cache": constrained, userspace should verify cache device
|
||||
"crypt": allowed
|
||||
"delay": allowed
|
||||
"era": constrained, userspace should verify metadata device
|
||||
"flakey": constrained, meant for test
|
||||
"linear": allowed
|
||||
"log-writes": constrained, userspace should verify metadata device
|
||||
"mirror": constrained, userspace should verify main/mirror device
|
||||
"raid": constrained, userspace should verify metadata device
|
||||
"snapshot": constrained, userspace should verify src/dst device
|
||||
"snapshot-origin": allowed
|
||||
"snapshot-merge": constrained, userspace should verify src/dst device
|
||||
"striped": allowed
|
||||
"switch": constrained, userspace should verify dev path
|
||||
"thin": constrained, requires dm target message from userspace
|
||||
"thin-pool": constrained, requires dm target message from userspace
|
||||
"verity": allowed
|
||||
"writecache": constrained, userspace should verify cache device
|
||||
"zero": constrained, not meant for rootfs
|
||||
======================= =======================================================
|
||||
`cache` constrained, userspace should verify cache device
|
||||
`crypt` allowed
|
||||
`delay` allowed
|
||||
`era` constrained, userspace should verify metadata device
|
||||
`flakey` constrained, meant for test
|
||||
`linear` allowed
|
||||
`log-writes` constrained, userspace should verify metadata device
|
||||
`mirror` constrained, userspace should verify main/mirror device
|
||||
`raid` constrained, userspace should verify metadata device
|
||||
`snapshot` constrained, userspace should verify src/dst device
|
||||
`snapshot-origin` allowed
|
||||
`snapshot-merge` constrained, userspace should verify src/dst device
|
||||
`striped` allowed
|
||||
`switch` constrained, userspace should verify dev path
|
||||
`thin` constrained, requires dm target message from userspace
|
||||
`thin-pool` constrained, requires dm target message from userspace
|
||||
`verity` allowed
|
||||
`writecache` constrained, userspace should verify cache device
|
||||
`zero` constrained, not meant for rootfs
|
||||
======================= =======================================================
|
||||
|
||||
If the target is not listed above, it is constrained by default (not tested).
|
||||
|
||||
Examples
|
||||
========
|
||||
An example of booting to a linear array made up of user-mode linux block
|
||||
devices:
|
||||
devices::
|
||||
|
||||
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
|
||||
|
||||
@ -71,43 +76,49 @@ This will boot to a rw dm-linear target of 8192 sectors split across two block
|
||||
devices identified by their major:minor numbers. After boot, udev will rename
|
||||
this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
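The comma-separated layout described above can be assembled mechanically. The following sketch is an illustration only: `dm_create_entry` is a hypothetical helper that joins the name, uuid, minor and flags fields with one or more table lines to form a single device entry. Several entries would then be joined with semi-colons.

```shell
#!/bin/sh
# Hypothetical helper: build one device entry for dm-mod.create=.
# Usage: dm_create_entry <name> <uuid> <minor> <flags> <table> [<table>...]
# Empty uuid or minor fields are passed as "".
dm_create_entry() {
    name=$1; uuid=$2; minor=$3; flags=$4
    shift 4
    entry="$name,$uuid,$minor,$flags"
    # Each remaining argument is one table line, appended after a comma.
    for table in "$@"; do
        entry="$entry,$table"
    done
    printf '%s\n' "$entry"
}
```

Calling `dm_create_entry lroot "" "" rw "0 4096 linear 98:16 0" "4096 4096 linear 98:32 0"` reproduces the content of the linear example, apart from the optional space after the commas.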
|
||||
|
||||
An example of multiple device-mappers, with the dm-mod.create="..." contents is shown here
|
||||
split on multiple lines for readability:
|
||||
An example of multiple device-mappers, with the dm-mod.create="..." contents
|
||||
is shown here split on multiple lines for readability::
|
||||
|
||||
vroot,,,ro,
|
||||
0 1740800 verity 254:0 254:0 1740800 sha1
|
||||
76e9be054b15884a9fa85973e9cb274c93afadb6
|
||||
5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe;
|
||||
vram,,,rw,
|
||||
0 32768 linear 1:0 0,
|
||||
32768 32768 linear 1:1 0
|
||||
dm-linear,,1,rw,
|
||||
0 32768 linear 8:1 0,
|
||||
32768 1024000 linear 8:2 0;
|
||||
dm-verity,,3,ro,
|
||||
0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256
|
||||
ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42
|
||||
5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51
|
||||
|
||||
Other examples (per target):
|
||||
|
||||
"crypt":
|
||||
"crypt"::
|
||||
|
||||
dm-crypt,,8,ro,
|
||||
0 1048576 crypt aes-xts-plain64
|
||||
babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
|
||||
/dev/sda 0 1 allow_discards
|
||||
|
||||
"delay":
|
||||
"delay"::
|
||||
|
||||
dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
|
||||
|
||||
"linear":
|
||||
"linear"::
|
||||
|
||||
dm-linear,,,rw,
|
||||
0 32768 linear /dev/sda1 0,
|
||||
32768 1024000 linear /dev/sda2 0,
|
||||
1056768 204800 linear /dev/sda3 0,
|
||||
1261568 512000 linear /dev/sda4 0
|
||||
|
||||
"snapshot-origin":
|
||||
"snapshot-origin"::
|
||||
|
||||
dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
|
||||
|
||||
"striped":
|
||||
"striped"::
|
||||
|
||||
dm-striped,,4,ro,0 1638400 striped 4 4096
|
||||
/dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
|
||||
|
||||
"verity":
|
||||
"verity"::
|
||||
|
||||
dm-verity,,4,ro,
|
||||
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
|
||||
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
|
@ -1,3 +1,7 @@
|
||||
============
|
||||
dm-integrity
|
||||
============
|
||||
|
||||
The dm-integrity target emulates a block device that has additional
|
||||
per-sector tags that can be used for storing integrity information.
|
||||
|
||||
@ -35,15 +39,16 @@ zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
|
||||
target can't be loaded.
|
||||
|
||||
To use the target for the first time:
|
||||
|
||||
1. overwrite the superblock with zeroes
|
||||
2. load the dm-integrity target with one-sector size, the kernel driver
|
||||
will format the device
|
||||
will format the device
|
||||
3. unload the dm-integrity target
|
||||
4. read the "provided_data_sectors" value from the superblock
|
||||
5. load the dm-integrity target with the target size
|
||||
"provided_data_sectors"
|
||||
"provided_data_sectors"
|
||||
6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
|
||||
with the size "provided_data_sectors"
|
||||
with the size "provided_data_sectors"
|
||||
|
||||
|
||||
Target arguments:
|
||||
@ -51,17 +56,20 @@ Target arguments:
|
||||
1. the underlying block device
|
||||
|
||||
2. the number of reserved sectors at the beginning of the device - the
|
||||
dm-integrity won't read or write these sectors
|
||||
dm-integrity won't read or write these sectors
|
||||
|
||||
3. the size of the integrity tag (if "-" is used, the size is taken from
|
||||
the internal-hash algorithm)
|
||||
the internal-hash algorithm)
|
||||
|
||||
4. mode:
|
||||
D - direct writes (without journal) - in this mode, journaling is
|
||||
|
||||
D - direct writes (without journal)
|
||||
in this mode, journaling is
|
||||
not used and data sectors and integrity tags are written
|
||||
separately. In case of crash, it is possible that the data
|
||||
and integrity tag don't match.
|
||||
J - journaled writes - data and integrity tags are written to the
|
||||
J - journaled writes
|
||||
data and integrity tags are written to the
|
||||
journal and atomicity is guaranteed. In case of crash,
|
||||
either both data and tag or none of them are written. The
|
||||
journaled mode degrades write throughput twice because the
|
||||
@ -178,9 +186,12 @@ and the reloaded target would be non-functional.
|
||||
|
||||
|
||||
The layout of the formatted block device:
|
||||
* reserved sectors (they are not used by this target, they can be used for
|
||||
storing LUKS metadata or for other purposes), the size of the reserved
|
||||
area is specified in the target arguments
|
||||
|
||||
* reserved sectors
|
||||
(they are not used by this target, they can be used for
|
||||
storing LUKS metadata or for other purposes), the size of the reserved
|
||||
area is specified in the target arguments
|
||||
|
||||
* superblock (4kiB)
|
||||
* magic string - identifies that the device was formatted
|
||||
* version
|
||||
@ -192,40 +203,55 @@ The layout of the formatted block device:
|
||||
metadata and padding). The user of this target should not send
|
||||
bios that access data beyond the "provided data sectors" limit.
|
||||
* flags
|
||||
SB_FLAG_HAVE_JOURNAL_MAC - a flag is set if journal_mac is used
|
||||
SB_FLAG_RECALCULATING - recalculating is in progress
|
||||
SB_FLAG_DIRTY_BITMAP - journal area contains the bitmap of dirty
|
||||
blocks
|
||||
SB_FLAG_HAVE_JOURNAL_MAC
|
||||
- a flag is set if journal_mac is used
|
||||
SB_FLAG_RECALCULATING
|
||||
- recalculating is in progress
|
||||
SB_FLAG_DIRTY_BITMAP
|
||||
- journal area contains the bitmap of dirty
|
||||
blocks
|
||||
* log2(sectors per block)
|
||||
* a position where recalculating finished
|
||||
* journal
|
||||
The journal is divided into sections, each section contains:
|
||||
|
||||
* metadata area (4kiB), it contains journal entries
|
||||
every journal entry contains:
|
||||
|
||||
- every journal entry contains:
|
||||
|
||||
* logical sector (specifies where the data and tag should
|
||||
be written)
|
||||
* last 8 bytes of data
|
||||
* integrity tag (the size is specified in the superblock)
|
||||
every metadata sector ends with
|
||||
|
||||
- every metadata sector ends with
|
||||
|
||||
* mac (8-bytes), all the macs in 8 metadata sectors form a
|
||||
64-byte value. It is used to store hmac of sector
|
||||
numbers in the journal section, to protect against a
|
||||
possibility that the attacker tampers with sector
|
||||
numbers in the journal.
|
||||
* commit id
|
||||
|
||||
* data area (the size is variable; it depends on how many journal
|
||||
entries fit into the metadata area)
|
||||
every sector in the data area contains:
|
||||
|
||||
- every sector in the data area contains:
|
||||
|
||||
* data (504 bytes of data, the last 8 bytes are stored in
|
||||
the journal entry)
|
||||
* commit id
|
||||
|
||||
To test if the whole journal section was written correctly, every
|
||||
512-byte sector of the journal ends with 8-byte commit id. If the
|
||||
commit id matches on all sectors in a journal section, then it is
|
||||
assumed that the section was written correctly. If the commit id
|
||||
doesn't match, the section was written partially and it should not
|
||||
be replayed.
|
||||
* one or more runs of interleaved tags and data. Each run contains:
|
||||
|
||||
* one or more runs of interleaved tags and data.
|
||||
Each run contains:
|
||||
|
||||
* tag area - it contains integrity tags. There is one tag for each
|
||||
sector in the data area
|
||||
* data area - it contains data sectors. The number of data sectors
|
@ -1,3 +1,4 @@
|
||||
=====
|
||||
dm-io
|
||||
=====
|
||||
|
||||
@ -7,7 +8,7 @@ version.
|
||||
|
||||
The user must set up an io_region structure to describe the desired location
|
||||
of the I/O. Each io_region indicates a block-device along with the starting
|
||||
sector and size of the region.
|
||||
sector and size of the region::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
@ -19,7 +20,7 @@ Dm-io can read from one io_region or write to one or more io_regions. Writes
|
||||
to multiple regions are specified by an array of io_region structures.
|
||||
|
||||
The first I/O service type takes a list of memory pages as the data buffer for
|
||||
the I/O, along with an offset into the first page.
|
||||
the I/O, along with an offset into the first page::
|
||||
|
||||
struct page_list {
|
||||
struct page_list *next;
|
||||
@ -35,7 +36,7 @@ the I/O, along with an offset into the first page.
|
||||
|
||||
The second I/O service type takes an array of bio vectors as the data buffer
|
||||
for the I/O. This service can be handy if the caller has a pre-assembled bio,
|
||||
but wants to direct different portions of the bio to different devices.
|
||||
but wants to direct different portions of the bio to different devices::
|
||||
|
||||
int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
|
||||
int rw, struct bio_vec *bvec,
|
||||
@ -47,7 +48,7 @@ but wants to direct different portions of the bio to different devices.
|
||||
The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
|
||||
data buffer for the I/O. This service can be handy if the caller needs to do
|
||||
I/O to a large region but doesn't want to allocate a large number of individual
|
||||
memory pages.
|
||||
memory pages::
|
||||
|
||||
int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
|
||||
void *data, unsigned long *error_bits);
|
||||
@ -55,11 +56,11 @@ memory pages.
|
||||
void *data, io_notify_fn fn, void *context);
|
||||
|
||||
Callers of the asynchronous I/O services must include the name of a completion
|
||||
callback routine and a pointer to some context data for the I/O.
|
||||
callback routine and a pointer to some context data for the I/O::
|
||||
|
||||
typedef void (*io_notify_fn)(unsigned long error, void *context);
|
||||
|
||||
The "error" parameter in this callback, as well as the "*error" parameter in
|
||||
The "error" parameter in this callback, as well as the `*error` parameter in
|
||||
all of the synchronous versions, is a bitset (instead of a simple error value).
|
||||
In the case of a write I/O to multiple regions, this bitset allows dm-io to
|
||||
indicate success or failure on each individual region.
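As an illustration of the per-region bitset described above, the following is a minimal userspace sketch of how a caller might interpret it. It assumes that bit N of the bitset corresponds to region N and that a set bit means the I/O to that region failed. The helper names `region_failed()` and `count_failed_regions()` are hypothetical and are not part of the dm-io interface.

```c
#include <limits.h>

/*
 * Hypothetical helper (not a dm-io function): report whether the I/O to
 * region 'region' failed, assuming bit N of the bitset maps to region N
 * and a set bit means failure.
 */
static int region_failed(unsigned long error_bits, unsigned int region)
{
    /* A bitset held in one unsigned long cannot describe more regions. */
    if (region >= sizeof(error_bits) * CHAR_BIT)
        return 0;
    return (int)((error_bits >> region) & 1UL);
}

/* Hypothetical helper: count the failed regions among the first num_regions. */
static unsigned int count_failed_regions(unsigned long error_bits,
                                         unsigned int num_regions)
{
    unsigned int i, failed = 0;

    for (i = 0; i < num_regions; i++)
        failed += (unsigned int)region_failed(error_bits, i);
    return failed;
}
```

With a bitset of 0x5 after a write to three regions, this sketch would report regions 0 and 2 as failed and region 1 as successful.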
|
||||
@ -72,4 +73,3 @@ always available in order to avoid unnecessary waiting while performing I/O.
|
||||
When the user is finished using the dm-io services, they should call
|
||||
dm_io_put() and specify the same number of pages that were given on the
|
||||
dm_io_get() call.
|
||||
|
@ -1,3 +1,4 @@
|
||||
=====================
|
||||
Device-Mapper Logging
|
||||
=====================
|
||||
The device-mapper logging code is used by some of the device-mapper
|
||||
@ -16,11 +17,13 @@ dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
|
||||
logging implementations are available and provide different
|
||||
capabilities. The list includes:
|
||||
|
||||
============== ==============================================================
|
||||
Type Files
|
||||
==== =====
|
||||
============== ==============================================================
|
||||
disk drivers/md/dm-log.c
|
||||
core drivers/md/dm-log.c
|
||||
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
|
||||
============== ==============================================================
|
||||
|
||||
The "disk" log type
|
||||
-------------------
|
@ -1,3 +1,4 @@
|
||||
===============
|
||||
dm-queue-length
|
||||
===============
|
||||
|
||||
@ -6,12 +7,18 @@ which selects a path with the least number of in-flight I/Os.
|
||||
The path selector name is 'queue-length'.
|
||||
|
||||
Table parameters for each path: [<repeat_count>]
|
||||
|
||||
::
|
||||
|
||||
<repeat_count>: The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
|
||||
Status for each path: <status> <fail-count> <in-flight>
|
||||
|
||||
::
|
||||
|
||||
<status>: 'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>: The number of path failures.
|
||||
<in-flight>: The number of in-flight I/Os on the path.
|
||||
@ -29,11 +36,13 @@ Examples
|
||||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128.
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
|
||||
::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
|
@ -1,3 +1,4 @@
|
||||
=======
|
||||
dm-raid
|
||||
=======
|
||||
|
||||
@ -8,49 +9,66 @@ interface.
|
||||
|
||||
Mapping Table Interface
|
||||
-----------------------
|
||||
The target is named "raid" and it accepts the following parameters:
|
||||
The target is named "raid" and it accepts the following parameters::
|
||||
|
||||
<raid_type> <#raid_params> <raid_params> \
|
||||
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
|
||||
|
||||
<raid_type>:
|
||||
|
||||
============= ===============================================================
|
||||
raid0 RAID0 striping (no resilience)
|
||||
raid1 RAID1 mirroring
|
||||
raid4 RAID4 with dedicated last parity disk
|
||||
raid5_n RAID5 with dedicated last parity disk supporting takeover
|
||||
Same as raid4
|
||||
-Transitory layout
|
||||
|
||||
- Transitory layout
|
||||
raid5_la RAID5 left asymmetric
|
||||
|
||||
- rotating parity 0 with data continuation
|
||||
raid5_ra RAID5 right asymmetric
|
||||
|
||||
- rotating parity N with data continuation
|
||||
raid5_ls RAID5 left symmetric
|
||||
|
||||
- rotating parity 0 with data restart
|
||||
raid5_rs RAID5 right symmetric
|
||||
|
||||
- rotating parity N with data restart
|
||||
raid6_zr RAID6 zero restart
|
||||
|
||||
- rotating parity zero (left-to-right) with data restart
|
||||
raid6_nr RAID6 N restart
|
||||
|
||||
- rotating parity N (right-to-left) with data restart
|
||||
raid6_nc RAID6 N continue
|
||||
|
||||
- rotating parity N (right-to-left) with data continuation
|
||||
raid6_n_6 RAID6 with dedicated parity disks
|
||||
|
||||
- parity and Q-syndrome on the last 2 disks;
|
||||
layout for takeover from/to raid4/raid5_n
|
||||
raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_la from/to raid6
|
||||
raid6_ra_6 Same as "raid5_ra" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ra from/to raid6
|
||||
raid6_ls_6 Same as "raid5_ls" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ls from/to raid6
|
||||
raid6_rs_6 Same as "raid5_rs" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_rs from/to raid6
|
||||
raid10 Various RAID10 inspired algorithms chosen by additional params
|
||||
(see raid10_format and raid10_copies below)
|
||||
|
||||
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
|
||||
- RAID1E: Integrated Adjacent Stripe Mirroring
|
||||
- RAID1E: Integrated Offset Stripe Mirroring
|
||||
- and other similar RAID10 variants
|
||||
- and other similar RAID10 variants
|
||||
============= ===============================================================
|
||||
|
||||
Reference: Chapter 4 of
|
||||
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
|
||||
@ -58,33 +76,41 @@ The target is named "raid" and it accepts the following parameters:
|
||||
<#raid_params>: The number of parameters that follow.
|
||||
|
||||
<raid_params> consists of
|
||||
|
||||
Mandatory parameters:
|
||||
<chunk_size>: Chunk size in sectors. This parameter is often known as
|
||||
<chunk_size>:
|
||||
Chunk size in sectors. This parameter is often known as
|
||||
"stripe size". It is the only mandatory parameter and
|
||||
is placed first.
|
||||
|
||||
followed by optional parameters (in any order):
|
||||
[sync|nosync] Force or prevent RAID initialization.
|
||||
[sync|nosync]
|
||||
Force or prevent RAID initialization.
|
||||
|
||||
[rebuild <idx>] Rebuild drive number 'idx' (first drive is 0).
|
||||
[rebuild <idx>]
|
||||
Rebuild drive number 'idx' (first drive is 0).
|
||||
|
||||
[daemon_sleep <ms>]
|
||||
Interval between runs of the bitmap daemon that
|
||||
clear bits. A longer interval means less bitmap I/O but
|
||||
resyncing after a failure is likely to take longer.
|
||||
|
||||
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization
|
||||
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization
|
||||
[write_mostly <idx>] Mark drive index 'idx' write-mostly.
|
||||
[max_write_behind <sectors>] See '--write-behind=' (man mdadm)
|
||||
[stripe_cache <sectors>] Stripe cache size (RAID 4/5/6 only)
|
||||
[min_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[max_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[write_mostly <idx>]
|
||||
Mark drive index 'idx' write-mostly.
|
||||
[max_write_behind <sectors>]
|
||||
See '--write-behind=' (man mdadm)
|
||||
[stripe_cache <sectors>]
|
||||
Stripe cache size (RAID 4/5/6 only)
|
||||
[region_size <sectors>]
|
||||
The region_size multiplied by the number of regions is the
|
||||
logical size of the array. The bitmap records the device
|
||||
synchronisation state for each region.
|
||||
|
||||
[raid10_copies <# copies>]
|
||||
[raid10_format <near|far|offset>]
|
||||
[raid10_copies <# copies>], [raid10_format <near|far|offset>]
|
||||
These two options are used to alter the default layout of
|
||||
a RAID10 configuration. The number of copies can be
|
||||
specified, but the default is 2. There are also three
|
||||
@ -93,13 +119,17 @@ The target is named "raid" and it accepts the following parameters:
|
||||
respect to mirroring. If these options are left unspecified,
|
||||
or 'raid10_copies 2' and/or 'raid10_format near' are given,
|
||||
then the layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ==============
|
||||
2 drives 3 drives 4 drives
|
||||
-------- ---------- --------------
|
||||
======== ========== ==============
|
||||
A1 A1 A1 A1 A2 A1 A1 A2 A2
|
||||
A2 A2 A2 A3 A3 A3 A3 A4 A4
|
||||
A3 A3 A4 A4 A5 A5 A5 A6 A6
|
||||
A4 A4 A5 A6 A6 A7 A7 A8 A8
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ==============
|
||||
|
||||
The 2-device layout is equivalent to 2-way RAID1. The 4-device
|
||||
layout is what a traditional RAID10 would look like. The
|
||||
3-device layout is what might be called a 'RAID1E - Integrated
|
||||
@ -107,8 +137,10 @@ The target is named "raid" and it accepts the following parameters:
|
||||
|
||||
If 'raid10_copies 2' and 'raid10_format far', then the layouts
|
||||
for 2, 3 and 4 devices are:
|
||||
|
||||
======== ============ ===================
|
||||
2 drives 3 drives 4 drives
|
||||
-------- -------------- --------------------
|
||||
======== ============ ===================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
@ -117,11 +149,14 @@ The target is named "raid" and it accepts the following parameters:
|
||||
A4 A3 A6 A4 A5 A6 A5 A8 A7
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ============ ===================
|
||||
|
||||
If 'raid10_copies 2' and 'raid10_format offset', then the
|
||||
layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ================
|
||||
2 drives 3 drives 4 drives
|
||||
-------- ------------ -----------------
|
||||
======== ========== ================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A2 A1 A3 A1 A2 A2 A1 A4 A3
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
@ -129,6 +164,8 @@ The target is named "raid" and it accepts the following parameters:
|
||||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ================
|
||||
|
||||
Here we see layouts closely akin to 'RAID1E - Integrated
|
||||
Offset Stripe Mirroring'.
|
||||
|
||||
@ -190,22 +227,25 @@ The target is named "raid" and it accepts the following parameters:
|
||||
|
||||
Example Tables
|
||||
--------------
|
||||
# RAID4 - 4 data drives, 1 parity (no metadata devices)
|
||||
# No metadata devices specified to hold superblock/bitmap info
|
||||
# Chunk size of 1MiB
|
||||
# (Lines separated for easy reading)
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 1 2048 \
|
||||
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
|
||||
::
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (with metadata devices)
|
||||
# Chunk size of 1MiB, force RAID initialization,
|
||||
# min recovery rate at 20 kiB/sec/disk
|
||||
# RAID4 - 4 data drives, 1 parity (no metadata devices)
|
||||
# No metadata devices specified to hold superblock/bitmap info
|
||||
# Chunk size of 1MiB
|
||||
# (Lines separated for easy reading)
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 4 2048 sync min_recovery_rate 20 \
|
||||
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
|
||||
0 1960893648 raid \
|
||||
raid4 1 2048 \
|
||||
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (with metadata devices)
|
||||
# Chunk size of 1MiB, force RAID initialization,
|
||||
# min recovery rate at 20 kiB/sec/disk
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 4 2048 sync min_recovery_rate 20 \
|
||||
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
|
||||
|
||||
|
||||
Status Output
|
||||
@ -219,41 +259,58 @@ Arguments that can be repeated are ordered by value.
|
||||
|
||||
'dmsetup status' yields information on the state and health of the array.
|
||||
The output is as follows (normally a single line, but expanded here for
|
||||
clarity):
|
||||
1: <s> <l> raid \
|
||||
2: <raid_type> <#devices> <health_chars> \
|
||||
3: <sync_ratio> <sync_action> <mismatch_cnt>
|
||||
clarity)::
|
||||
|
||||
1: <s> <l> raid \
|
||||
2: <raid_type> <#devices> <health_chars> \
|
||||
3: <sync_ratio> <sync_action> <mismatch_cnt>
|
||||
|
||||
Line 1 is the standard output produced by device-mapper.
|
||||
Line 2 & 3 are produced by the raid target and are best explained by example:
|
||||
|
||||
Line 2 & 3 are produced by the raid target and are best explained by example::
|
||||
|
||||
0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
|
||||
|
||||
Here we can see the RAID type is raid4, there are 5 devices - all of
|
||||
which are 'A'live, and the array is 2/490221568 complete with its initial
|
||||
recovery. Here is a fuller description of the individual fields:
|
||||
|
||||
=============== =========================================================
|
||||
<raid_type> Same as the <raid_type> used to create the array.
|
||||
<health_chars> One char for each device, indicating: 'A' = alive and
|
||||
in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
|
||||
<health_chars> One char for each device, indicating:
|
||||
|
||||
- 'A' = alive and in-sync
|
||||
- 'a' = alive but not in-sync
|
||||
- 'D' = dead/failed.
|
||||
<sync_ratio> The ratio indicating how much of the array has undergone
|
||||
the process described by 'sync_action'. If the
|
||||
'sync_action' is "check" or "repair", then the process
|
||||
of "resync" or "recover" can be considered complete.
|
||||
<sync_action> One of the following possible states:
|
||||
idle - No synchronization action is being performed.
|
||||
frozen - The current action has been halted.
|
||||
resync - Array is undergoing its initial synchronization
|
||||
|
||||
idle
|
||||
- No synchronization action is being performed.
|
||||
frozen
|
||||
- The current action has been halted.
|
||||
resync
|
||||
- Array is undergoing its initial synchronization
|
||||
or is resynchronizing after an unclean shutdown
|
||||
(possibly aided by a bitmap).
|
||||
recover - A device in the array is being rebuilt or
|
||||
recover
|
||||
- A device in the array is being rebuilt or
|
||||
replaced.
|
||||
check - A user-initiated full check of the array is
|
||||
check
|
||||
- A user-initiated full check of the array is
|
||||
being performed. All blocks are read and
|
||||
checked for consistency. The number of
|
||||
discrepancies found are recorded in
|
||||
<mismatch_cnt>. No changes are made to the
|
||||
array by this action.
|
||||
repair - The same as "check", but discrepancies are
|
||||
repair
|
||||
- The same as "check", but discrepancies are
|
||||
corrected.
|
||||
reshape - The array is undergoing a reshape.
|
||||
reshape
|
||||
- The array is undergoing a reshape.
|
||||
<mismatch_cnt> The number of discrepancies found between mirror copies
|
||||
in RAID1/10 or wrong parity values found in RAID4/5/6.
|
||||
This value is valid only after a "check" of the array
|
||||
@ -261,10 +318,11 @@ recovery. Here is a fuller description of the individual fields:
|
||||
<data_offset> The current data offset to the start of the user data on
|
||||
each component device of a raid set (see the respective
|
||||
raid parameter to support out-of-place reshaping).
|
||||
<journal_char> 'A' - active write-through journal device.
|
||||
'a' - active write-back journal device.
|
||||
'D' - dead journal device.
|
||||
'-' - no journal device.
|
||||
<journal_char> - 'A' - active write-through journal device.
|
||||
- 'a' - active write-back journal device.
|
||||
- 'D' - dead journal device.
|
||||
- '-' - no journal device.
|
||||
=============== =========================================================
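The fields described above can be picked apart with a few lines of C. The following is a minimal userspace sketch, not code from the raid target: `struct raid_status`, `parse_raid_status()` and `count_health()` are hypothetical names, and the buffer sizes are arbitrary assumptions. It reads only the fields up to <mismatch_cnt> and ignores anything that follows, such as <data_offset> and <journal_char>.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the leading fields of a raid status line. */
struct raid_status {
    unsigned long long start, len;     /* <s> <l> from device-mapper */
    char raid_type[16];                /* <raid_type> */
    unsigned int devices;              /* <#devices> */
    char health[65];                   /* <health_chars>, one per device */
    unsigned long long synced, total;  /* <sync_ratio> as synced/total */
    char sync_action[16];              /* <sync_action> */
    unsigned long long mismatch_cnt;   /* <mismatch_cnt> */
};

/* Parse one status line; returns 0 on success, -1 if it does not match. */
static int parse_raid_status(const char *line, struct raid_status *st)
{
    int n = sscanf(line, "%llu %llu raid %15s %u %64s %llu/%llu %15s %llu",
                   &st->start, &st->len, st->raid_type, &st->devices,
                   st->health, &st->synced, &st->total,
                   st->sync_action, &st->mismatch_cnt);

    if (n != 9)
        return -1;
    /* There must be exactly one health character per device. */
    if (strlen(st->health) != st->devices)
        return -1;
    return 0;
}

/* Count the devices whose health character equals 'state' (e.g. 'A'). */
static unsigned int count_health(const struct raid_status *st, char state)
{
    unsigned int i, n = 0;

    for (i = 0; st->health[i] != '\0'; i++)
        if (st->health[i] == state)
            n++;
    return n;
}
```

Applied to the example line "0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0", the sketch yields raid4, 5 devices that are all 'A', and a sync ratio of 2 out of 490221568.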
|
||||
|
||||
|
||||
Message Interface
|
||||
@ -272,12 +330,15 @@ Message Interface
|
||||
The dm-raid target will accept certain actions through the 'message' interface.
|
||||
('man dmsetup' for more information on the message interface.) These actions
|
||||
include:
|
||||
"idle" - Halt the current sync action.
|
||||
"frozen" - Freeze the current sync action.
|
||||
"resync" - Initiate/continue a resync.
|
||||
"recover"- Initiate/continue a recover process.
|
||||
"check" - Initiate a check (i.e. a "scrub") of the array.
|
||||
"repair" - Initiate a repair of the array.
|
||||
|
||||
========= ================================================
|
||||
"idle" Halt the current sync action.
|
||||
"frozen" Freeze the current sync action.
|
||||
"resync" Initiate/continue a resync.
|
||||
"recover" Initiate/continue a recover process.
|
||||
"check" Initiate a check (i.e. a "scrub") of the array.
|
||||
"repair" Initiate a repair of the array.
|
||||
========= ================================================
|
||||
|
||||
|
||||
Discard Support
|
||||
@ -307,48 +368,52 @@ increasingly whitelisted in the kernel and can thus be trusted.
|
||||
|
||||
For trusted devices, the following dm-raid module parameter can be set
|
||||
to safely enable discard support for RAID 4/5/6:
|
||||
|
||||
'devices_handle_discard_safely'
|
||||
|
||||
|
||||
Version History
|
||||
---------------
|
||||
1.0.0 Initial version. Support for RAID 4/5/6
|
||||
1.1.0 Added support for RAID 1
|
||||
1.2.0 Handle creation of arrays that contain failed devices.
|
||||
1.3.0 Added support for RAID 10
|
||||
1.3.1 Allow device replacement/rebuild for RAID 10
|
||||
1.3.2 Fix/improve redundancy checking for RAID10
|
||||
1.4.0 Non-functional change. Removes arg from mapping function.
|
||||
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
|
||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||
|
||||
::
|
||||
|
||||
1.0.0 Initial version. Support for RAID 4/5/6
|
||||
1.1.0 Added support for RAID 1
|
||||
1.2.0 Handle creation of arrays that contain failed devices.
|
||||
1.3.0 Added support for RAID 10
|
||||
1.3.1 Allow device replacement/rebuild for RAID 10
|
||||
1.3.2 Fix/improve redundancy checking for RAID10
|
||||
1.4.0 Non-functional change. Removes arg from mapping function.
|
||||
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
|
||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
|
||||
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||
1.6.0 Add discard support (and devices_handle_discard_safely module param).
|
||||
1.7.0 Add support for MD RAID0 mappings.
|
||||
1.8.0 Explicitly check for compatible flags in the superblock metadata
|
||||
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||
1.6.0 Add discard support (and devices_handle_discard_safely module param).
|
||||
1.7.0 Add support for MD RAID0 mappings.
|
||||
1.8.0 Explicitly check for compatible flags in the superblock metadata
|
||||
and reject to start the raid set if any are set by a newer
|
||||
target version, thus avoiding data corruption on a raid set
|
||||
with a reshape in progress.
|
||||
1.9.0 Add support for RAID level takeover/reshape/region size
|
||||
1.9.0 Add support for RAID level takeover/reshape/region size
|
||||
and set size reduction.
|
||||
1.9.1 Fix activation of existing RAID 4/10 mapped devices
|
||||
1.9.2 Don't emit '- -' on the status table line in case the constructor
|
||||
1.9.1 Fix activation of existing RAID 4/10 mapped devices
|
||||
1.9.2 Don't emit '- -' on the status table line in case the constructor
|
||||
fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
|
||||
'D' on the status line. If '- -' is passed into the constructor, emit
|
||||
'- -' on the table line and '-' as the status line health character.
|
||||
1.10.0 Add support for raid4/5/6 journal device
|
||||
1.10.1 Fix data corruption on reshape request
|
||||
1.11.0 Fix table line argument order
|
||||
1.10.0 Add support for raid4/5/6 journal device
|
||||
1.10.1 Fix data corruption on reshape request
|
||||
1.11.0 Fix table line argument order
|
||||
(wrong raid10_copies/raid10_format sequence)
|
||||
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
|
||||
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
|
||||
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
|
||||
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
|
||||
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
|
||||
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
|
||||
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
|
||||
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
|
||||
state races.
|
||||
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
|
||||
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
|
||||
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
|
||||
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
|
||||
deadlock/potential data corruption. Update superblock when
|
||||
specific devices are requested via rebuild. Fix RAID leg
|
||||
rebuild errors.
|
@ -1,3 +1,4 @@
|
||||
===============
|
||||
dm-service-time
|
||||
===============
|
||||
|
||||
@ -12,25 +13,34 @@ in a path-group, and it can be specified as a table argument.
|
||||
|
||||
The path selector name is 'service-time'.
|
||||
|
||||
Table parameters for each path: [<repeat_count> [<relative_throughput>]]
|
||||
<repeat_count>: The number of I/Os to dispatch using the selected
|
||||
Table parameters for each path:
|
||||
|
||||
[<repeat_count> [<relative_throughput>]]
|
||||
<repeat_count>:
|
||||
The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
<relative_throughput>: The relative throughput value of the path
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
The valid range is 0-100.
|
||||
If not given, minimum value '1' is used.
|
||||
If '0' is given, the path isn't selected while
|
||||
other paths having a positive value are available.
|
||||
|
||||
Status for each path: <status> <fail-count> <in-flight-size> \
|
||||
<relative_throughput>
|
||||
<status>: 'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>: The number of path failures.
|
||||
<in-flight-size>: The size of in-flight I/Os on the path.
|
||||
<relative_throughput>: The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
Status for each path:
|
||||
|
||||
<status> <fail-count> <in-flight-size> <relative_throughput>
|
||||
<status>:
|
||||
'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>:
|
||||
The number of path failures.
|
||||
<in-flight-size>:
|
||||
The size of in-flight I/Os on the path.
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
|
||||
|
||||
Algorithm
|
||||
@ -39,7 +49,7 @@ Algorithm
|
||||
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
|
||||
dispatched and subtracts when completed.
|
||||
Basically, dm-service-time selects a path having minimum service time
|
||||
which is calculated by:
|
||||
which is calculated by::
|
||||
|
||||
('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
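The selection rule above can be sketched in userspace C as follows. This is an illustration of the formula only, not the kernel implementation: `struct st_path` and `st_select_path()` are hypothetical names. The comparison is done by cross-multiplication so that no division is needed. A path with a relative throughput of 0 is chosen only when no path with a positive value exists; picking the zero-throughput path with the least in-flight size in that case is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical per-path state mirroring the status fields. */
struct st_path {
    unsigned long long in_flight_size;   /* size of I/Os currently in flight */
    unsigned int relative_throughput;    /* 0-100, as given in the table */
};

/*
 * Return the index of the path with the minimum estimated service time
 * (in_flight_size + incoming) / relative_throughput, or -1 if n is 0.
 */
static int st_select_path(const struct st_path *paths, size_t n,
                          unsigned long long incoming)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct st_path *p = &paths[i];
        const struct st_path *b;

        if (best < 0) {
            best = (int)i;
            continue;
        }
        b = &paths[best];

        if (p->relative_throughput == 0 || b->relative_throughput == 0) {
            /*
             * A positive throughput always beats zero. Between two
             * zero-throughput paths, prefer the lower in-flight size
             * (assumption of this sketch).
             */
            if (b->relative_throughput == 0 &&
                (p->relative_throughput > 0 ||
                 p->in_flight_size < b->in_flight_size))
                best = (int)i;
            continue;
        }

        /* a/x < b/y  <=>  a*y < b*x  for positive x and y. */
        if ((p->in_flight_size + incoming) * b->relative_throughput <
            (b->in_flight_size + incoming) * p->relative_throughput)
            best = (int)i;
    }
    return best;
}
```

With two idle paths whose relative throughput values are 1 and 4, the second path is selected because its estimated service time for the incoming I/O is a quarter of the first.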
|
||||
|
||||
@ -67,25 +77,25 @@ Examples
|
||||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128
|
||||
and sda has an average throughput 1GB/s and sdb has 4GB/s,
|
||||
'relative_throughput' value may be '1' for sda and '4' for sdb.
|
||||
'relative_throughput' value may be '1' for sda and '4' for sdb::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
|
||||
|
||||
|
||||
Or '2' for sda and '8' for sdb would be also true.
|
||||
Or '2' for sda and '8' for sdb would be also true::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
|
||||
| dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
|
110
Documentation/device-mapper/dm-uevent.rst
Normal file
@ -0,0 +1,110 @@
|
||||
====================
|
||||
device-mapper uevent
|
||||
====================
|
||||
|
||||
The device-mapper uevent code adds the capability to device-mapper to create
|
||||
and send kobject uevents (uevents). Previously device-mapper events were only
|
||||
available through the ioctl interface. The advantage of the uevents interface
|
||||
is the event contains environment attributes providing increased context for
|
||||
the event avoiding the need to query the state of the device-mapper device after
|
||||
the event is received.
|
||||
|
||||
There are two functions currently for device-mapper events. The first function
|
||||
listed creates the event and the second function sends the event(s)::
|
||||
|
||||
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
|
||||
const char *path, unsigned nr_valid_paths)
|
||||
|
||||
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
|
||||
|
||||
|
||||
The variables added to the uevent environment are:
|
||||
|
||||
Variable Name: DM_TARGET
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Name of device-mapper target that generated the event.
|
||||
|
||||
Variable Name: DM_ACTION
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Device-mapper specific action that caused the uevent action.
|
||||
PATH_FAILED - A path has failed;
|
||||
PATH_REINSTATED - A path has been reinstated.
|
||||
|
||||
Variable Name: DM_SEQNUM
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description: A sequence number for this specific device-mapper device.
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_PATH
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Major and minor number of the path device pertaining to this
|
||||
event.
|
||||
:Value: Path name in the form of "Major:Minor"
|
||||
|
||||
Variable Name: DM_NR_VALID_PATHS
|
||||
--------------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description:
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_NAME
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Name of the device-mapper device.
|
||||
:Value: Name
|
||||
|
||||
Variable Name: DM_UUID
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: UUID of the device-mapper device.
|
||||
:Value: UUID. (Empty string if there isn't one.)
|
||||
|
||||
An example of the uevents generated as captured by udevmonitor is shown
|
||||
below
|
||||
|
||||
1.) Path failure::
|
||||
|
||||
UEVENT[1192521009.711215] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_FAILED
|
||||
DM_SEQNUM=1
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=0
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1130
|
||||
|
||||
2.) Path reinstate::
|
||||
|
||||
UEVENT[1192521132.989927] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_REINSTATED
|
||||
DM_SEQNUM=2
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=1
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1131
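A listener that receives such an event could pull the DM_* variables out of the environment as sketched below. This is a hedged userspace illustration, not device-mapper code: it assumes the environment has already been split into a NULL-terminated array of "KEY=VALUE" strings, and `uevent_get()` and `uevent_all_paths_lost()` are hypothetical helper names.

```c
#include <string.h>

/*
 * Hypothetical helper: return the value for 'key' from a NULL-terminated
 * array of "KEY=VALUE" strings, or NULL if the key is absent.
 */
static const char *uevent_get(const char *const *env, const char *key)
{
    size_t klen = strlen(key);

    for (; *env != NULL; env++) {
        /* Require an exact key match followed directly by '='. */
        if (strncmp(*env, key, klen) == 0 && (*env)[klen] == '=')
            return *env + klen + 1;
    }
    return NULL;
}

/*
 * Hypothetical helper: true if the event is a PATH_FAILED event that
 * reports zero remaining valid paths.
 */
static int uevent_all_paths_lost(const char *const *env)
{
    const char *action = uevent_get(env, "DM_ACTION");
    const char *valid = uevent_get(env, "DM_NR_VALID_PATHS");

    return action != NULL && valid != NULL &&
           strcmp(action, "PATH_FAILED") == 0 &&
           strcmp(valid, "0") == 0;
}
```

For the path failure event shown above, the sketch returns "8:32" for DM_PATH and reports that all paths are lost, since DM_NR_VALID_PATHS is 0.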
|
@ -1,97 +0,0 @@
|
||||
The device-mapper uevent code adds the capability to device-mapper to create
|
||||
and send kobject uevents (uevents). Previously device-mapper events were only
|
||||
available through the ioctl interface. The advantage of the uevents interface
|
||||
is the event contains environment attributes providing increased context for
|
||||
the event avoiding the need to query the state of the device-mapper device after
|
||||
the event is received.
|
||||
|
||||
There are two functions currently for device-mapper events. The first function
|
||||
listed creates the event and the second function sends the event(s).
|
||||
|
||||
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
|
||||
const char *path, unsigned nr_valid_paths)
|
||||
|
||||
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
|
||||
|
||||
|
||||
The variables added to the uevent environment are:
|
||||
|
||||
Variable Name: DM_TARGET
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description:
|
||||
Value: Name of device-mapper target that generated the event.
|
||||
|
||||
Variable Name: DM_ACTION
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description:
|
||||
Value: Device-mapper specific action that caused the uevent action.
|
||||
PATH_FAILED - A path has failed.
|
||||
PATH_REINSTATED - A path has been reinstated.
|
||||
|
||||
Variable Name: DM_SEQNUM
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: unsigned integer
|
||||
Description: A sequence number for this specific device-mapper device.
|
||||
Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_PATH
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: Major and minor number of the path device pertaining to this
|
||||
event.
|
||||
Value: Path name in the form of "Major:Minor"
|
||||
|
||||
Variable Name: DM_NR_VALID_PATHS
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: unsigned integer
|
||||
Description:
|
||||
Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_NAME
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: Name of the device-mapper device.
|
||||
Value: Name
|
||||
|
||||
Variable Name: DM_UUID
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: UUID of the device-mapper device.
|
||||
Value: UUID. (Empty string if there isn't one.)
|
||||
|
||||
An example of the uevents generated as captured by udevmonitor is shown
|
||||
below.
|
||||
|
||||
1.) Path failure.
|
||||
UEVENT[1192521009.711215] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_FAILED
|
||||
DM_SEQNUM=1
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=0
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1130
|
||||
|
||||
2.) Path reinstate.
|
||||
UEVENT[1192521132.989927] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_REINSTATED
|
||||
DM_SEQNUM=2
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=1
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1131
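Since each captured event is a header line followed by KEY=VALUE pairs, a consumer can load it into a dictionary and act on the DM_* variables without querying the device. The Python sketch below is illustrative only and is not part of the kernel documentation: `parse_uevent` and `all_paths_lost` are hypothetical helper names, and the sample text is the path-failure event shown above.

```python
def parse_uevent(block):
    """Split a udevmonitor-style dump into (header line, environment dict)."""
    header = None
    env = {}
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("UEVENT["):
            header = line
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore anything that is not KEY=VALUE
            env[key] = value
    return header, env


def all_paths_lost(env):
    """True when a multipath PATH_FAILED event reports zero valid paths."""
    return (env.get("DM_TARGET") == "multipath"
            and env.get("DM_ACTION") == "PATH_FAILED"
            and int(env.get("DM_NR_VALID_PATHS", "-1")) == 0)


SAMPLE = """UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_FAILED
DM_SEQNUM=1
DM_PATH=8:32
DM_NR_VALID_PATHS=0
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1130
"""

header, env = parse_uevent(SAMPLE)
print(env["DM_NAME"], env["DM_PATH"], all_paths_lost(env))
```

All values stay strings after parsing; only DM_NR_VALID_PATHS is converted, because the variable list above documents it as an unsigned integer.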
|
@ -1,3 +1,4 @@
|
||||
========
|
||||
dm-zoned
|
||||
========
|
||||
|
||||
@ -133,12 +134,13 @@ A zoned block device must first be formatted using the dmzadm tool. This
|
||||
will analyze the device zone configuration, determine where to place the
|
||||
metadata sets on the device and initialize the metadata sets.
|
||||
|
||||
Ex:
|
||||
Ex::
|
||||
|
||||
dmzadm --format /dev/sdxx
|
||||
dmzadm --format /dev/sdxx
|
||||
|
||||
For a formatted device, the target can be created normally with the
|
||||
dmsetup utility. The only parameter that dm-zoned requires is the
|
||||
underlying zoned block device name. Ex:
|
||||
underlying zoned block device name. Ex::
|
||||
|
||||
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
|
||||
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
|
||||
dmsetup create dmz-`basename ${dev}`
|
@ -1,3 +1,7 @@
|
||||
======
|
||||
dm-era
|
||||
======
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
@ -14,12 +18,14 @@ coherency after rolling back a vendor snapshot.
|
||||
Constructor
|
||||
===========
|
||||
|
||||
era <metadata dev> <origin dev> <block size>
|
||||
era <metadata dev> <origin dev> <block size>
|
||||
|
||||
metadata dev : fast device holding the persistent metadata
|
||||
origin dev : device holding data blocks that may change
|
||||
block size : block size of origin data device, granularity that is
|
||||
tracked by the target
|
||||
================ ======================================================
|
||||
metadata dev fast device holding the persistent metadata
|
||||
origin dev device holding data blocks that may change
|
||||
block size block size of origin data device, granularity that is
|
||||
tracked by the target
|
||||
================ ======================================================
|
||||
|
||||
Messages
|
||||
========
|
||||
@ -49,14 +55,16 @@ Status
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<current era> <held metadata root | '-'>
|
||||
|
||||
metadata block size : Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks : Number of metadata blocks used
|
||||
#total metadata blocks : Total number of metadata blocks
|
||||
current era : The current era
|
||||
held metadata root : The location, in blocks, of the metadata root
|
||||
that has been 'held' for userspace read
|
||||
access. '-' indicates there is no held root
|
||||
========================= ==============================================
|
||||
metadata block size Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks Number of metadata blocks used
|
||||
#total metadata blocks Total number of metadata blocks
|
||||
current era The current era
|
||||
held metadata root The location, in blocks, of the metadata root
|
||||
that has been 'held' for userspace read
|
||||
access. '-' indicates there is no held root
|
||||
========================= ==============================================
|
||||
|
||||
Detailed use case
|
||||
=================
|
||||
@ -88,7 +96,7 @@ Memory usage
|
||||
|
||||
The target uses a bitset to record writes in the current era. It also
|
||||
has a spare bitset ready for switching over to a new era. Other than
|
||||
that it uses a few 4k blocks for updating metadata.
|
||||
that it uses a few 4k blocks for updating metadata::
|
||||
|
||||
(4 * nr_blocks) bytes + buffers
|
||||
|
44
Documentation/device-mapper/index.rst
Normal file
@ -0,0 +1,44 @@
|
||||
:orphan:
|
||||
|
||||
=============
|
||||
Device Mapper
|
||||
=============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
cache-policies
|
||||
cache
|
||||
delay
|
||||
dm-crypt
|
||||
dm-flakey
|
||||
dm-init
|
||||
dm-integrity
|
||||
dm-io
|
||||
dm-log
|
||||
dm-queue-length
|
||||
dm-raid
|
||||
dm-service-time
|
||||
dm-uevent
|
||||
dm-zoned
|
||||
era
|
||||
kcopyd
|
||||
linear
|
||||
log-writes
|
||||
persistent-data
|
||||
snapshot
|
||||
statistics
|
||||
striped
|
||||
switch
|
||||
thin-provisioning
|
||||
unstriped
|
||||
verity
|
||||
writecache
|
||||
zero
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
@ -1,3 +1,4 @@
|
||||
======
|
||||
kcopyd
|
||||
======
|
||||
|
||||
@ -7,7 +8,7 @@ notification. It is used by dm-snapshot and dm-mirror.
|
||||
|
||||
Users of kcopyd must first create a client and indicate how many memory pages
|
||||
to set aside for their copy jobs. This is done with a call to
|
||||
kcopyd_client_create().
|
||||
kcopyd_client_create()::
|
||||
|
||||
int kcopyd_client_create(unsigned int num_pages,
|
||||
struct kcopyd_client **result);
|
||||
@ -16,7 +17,7 @@ To start a copy job, the user must set up io_region structures to describe
|
||||
the source and destinations of the copy. Each io_region indicates a
|
||||
block-device along with the starting sector and size of the region. The source
|
||||
of the copy is given as one io_region structure, and the destinations of the
|
||||
copy are given as an array of io_region structures.
|
||||
copy are given as an array of io_region structures::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
@ -26,7 +27,7 @@ copy are given as an array of io_region structures.
|
||||
|
||||
To start the copy, the user calls kcopyd_copy(), passing in the client
|
||||
pointer, pointers to the source and destination io_regions, the name of a
|
||||
completion callback routine, and a pointer to some context data for the copy.
|
||||
completion callback routine, and a pointer to some context data for the copy::
|
||||
|
||||
int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
|
||||
unsigned int num_dests, struct io_region *dests,
|
||||
@ -41,7 +42,6 @@ write error occurred during the copy.
|
||||
|
||||
When a user is done with all their copy jobs, they should call
|
||||
kcopyd_client_destroy() to delete the kcopyd client, which will release the
|
||||
associated memory pages.
|
||||
associated memory pages::
|
||||
|
||||
void kcopyd_client_destroy(struct kcopyd_client *kc);
|
||||
|
63
Documentation/device-mapper/linear.rst
Normal file
@ -0,0 +1,63 @@
|
||||
=========
|
||||
dm-linear
|
||||
=========
|
||||
|
||||
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
|
||||
device onto a linear range of another device. This is the basic building
|
||||
block of logical volume managers.
|
||||
|
||||
Parameters: <dev path> <offset>
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create an identity mapping for a device
|
||||
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Join 2 devices together
|
||||
size1=`blockdev --getsz $1`
|
||||
size2=`blockdev --getsz $2`
|
||||
echo "0 $size1 linear $1 0
|
||||
$size1 $size2 linear $2 0" | dmsetup create joined
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Split a device into 4M chunks and then join them together in reverse order.
|
||||
|
||||
my $name = "reverse";
|
||||
my $extent_size = 4 * 1024 * 2;
|
||||
my $dev = $ARGV[0];
|
||||
my $table = "";
|
||||
my $count = 0;
|
||||
|
||||
if (!defined($dev)) {
|
||||
die("Please specify a device.\n");
|
||||
}
|
||||
|
||||
my $dev_size = `blockdev --getsz $dev`;
|
||||
my $extents = int($dev_size / $extent_size) -
|
||||
(($dev_size % $extent_size) ? 1 : 0);
|
||||
|
||||
while ($extents > 0) {
|
||||
my $this_start = $count * $extent_size;
|
||||
$extents--;
|
||||
$count++;
|
||||
my $this_offset = $extents * $extent_size;
|
||||
|
||||
$table .= "$this_start $extent_size linear $dev $this_offset\n";
|
||||
}
|
||||
|
||||
`echo \"$table\" | dmsetup create $name`;
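The table arithmetic in the Perl script above can be checked without touching a real device. The Python sketch below is an illustrative port, not part of the documentation: `reverse_table` is a hypothetical name, and the device size is passed in as a number of 512-byte sectors instead of being read with `blockdev --getsz`. It deliberately mirrors the Perl arithmetic, including dropping one further extent when the size is not an exact multiple of the extent size.

```python
def reverse_table(dev, dev_size, extent_size=4 * 1024 * 2):
    """Build a dm table mapping 4M extents of `dev` in reverse order.

    dev_size and extent_size are in 512-byte sectors.
    """
    extents = dev_size // extent_size - (1 if dev_size % extent_size else 0)
    count = 0
    lines = []
    while extents > 0:
        this_start = count * extent_size    # position in the new device
        extents -= 1
        count += 1
        this_offset = extents * extent_size  # position in the source device
        lines.append(f"{this_start} {extent_size} linear {dev} {this_offset}")
    return "\n".join(lines)


# /dev/sdX is a placeholder device name; 32768 sectors is 16 MiB.
print(reverse_table("/dev/sdX", 32768))
```

The resulting text is what the script pipes into `dmsetup create`; each line has the form `<start> <length> linear <dev path> <offset>`.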
|
@ -1,61 +0,0 @@
|
||||
dm-linear
|
||||
=========
|
||||
|
||||
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
|
||||
device onto a linear range of another device. This is the basic building
|
||||
block of logical volume managers.
|
||||
|
||||
Parameters: <dev path> <offset>
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create an identity mapping for a device
|
||||
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
|
||||
]]
|
||||
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Join 2 devices together
|
||||
size1=`blockdev --getsz $1`
|
||||
size2=`blockdev --getsz $2`
|
||||
echo "0 $size1 linear $1 0
|
||||
$size1 $size2 linear $2 0" | dmsetup create joined
|
||||
]]
|
||||
|
||||
|
||||
[[
|
||||
#!/usr/bin/perl -w
|
||||
# Split a device into 4M chunks and then join them together in reverse order.
|
||||
|
||||
my $name = "reverse";
|
||||
my $extent_size = 4 * 1024 * 2;
|
||||
my $dev = $ARGV[0];
|
||||
my $table = "";
|
||||
my $count = 0;
|
||||
|
||||
if (!defined($dev)) {
|
||||
die("Please specify a device.\n");
|
||||
}
|
||||
|
||||
my $dev_size = `blockdev --getsz $dev`;
|
||||
my $extents = int($dev_size / $extent_size) -
|
||||
(($dev_size % $extent_size) ? 1 : 0);
|
||||
|
||||
while ($extents > 0) {
|
||||
my $this_start = $count * $extent_size;
|
||||
$extents--;
|
||||
$count++;
|
||||
my $this_offset = $extents * $extent_size;
|
||||
|
||||
$table .= "$this_start $extent_size linear $dev $this_offset\n";
|
||||
}
|
||||
|
||||
`echo \"$table\" | dmsetup create $name`;
|
||||
]]
|
@ -1,3 +1,4 @@
|
||||
=============
|
||||
dm-log-writes
|
||||
=============
|
||||
|
||||
@ -25,11 +26,11 @@ completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
|
||||
simulate the worst case scenario with regard to power failures. Consider the
|
||||
following example (W means write, C means complete):
|
||||
|
||||
W1,W2,W3,C3,C2,Wflush,C1,Cflush
|
||||
W1,W2,W3,C3,C2,Wflush,C1,Cflush
|
||||
|
||||
The log would show the following
|
||||
The log would show the following:
|
||||
|
||||
W3,W2,flush,W1....
|
||||
W3,W2,flush,W1....
|
||||
|
||||
Again this is to simulate what is actually on disk, this allows us to detect
|
||||
cases where a power failure at a particular point in time would create an
|
||||
@ -42,11 +43,11 @@ Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
|
||||
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
|
||||
request. Consider the following example:
|
||||
|
||||
WRITE block 1, DISCARD block 1, FLUSH
|
||||
WRITE block 1, DISCARD block 1, FLUSH
|
||||
|
||||
If we logged DISCARD when it completed, the replay would look like this
|
||||
If we logged DISCARD when it completed, the replay would look like this:
|
||||
|
||||
DISCARD 1, WRITE 1, FLUSH
|
||||
DISCARD 1, WRITE 1, FLUSH
|
||||
|
||||
which isn't quite what happened and wouldn't be caught during the log replay.
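The first ordering example (W1,W2,W3,C3,C2,Wflush,C1,Cflush) can be reproduced with a toy model. The Python sketch below is a simplification written for illustration, not the target's actual implementation: it assumes a write is appended to the log when it completes and a flush marker is appended when the flush is issued, which is enough to reproduce the log order quoted above. `simulate_log` is a hypothetical name.

```python
def simulate_log(events):
    """Return the log order for a sequence of W<n>/C<n>/Wflush/Cflush events."""
    log = []
    for ev in events:
        if ev == "Wflush":
            # The flush is recorded when it is issued.
            log.append("flush")
        elif ev.startswith("C") and ev[1:].isdigit():
            # A write is recorded only once it has completed.
            log.append("W" + ev[1:])
        # W<n> submissions and Cflush add nothing to the log in this model.
    return log


events = "W1,W2,W3,C3,C2,Wflush,C1,Cflush".split(",")
print(",".join(simulate_log(events)))
```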
|
||||
|
||||
@ -57,15 +58,19 @@ i) Constructor
|
||||
|
||||
log-writes <dev_path> <log_dev_path>
|
||||
|
||||
dev_path : Device that all of the IO will go to normally.
|
||||
log_dev_path : Device where the log entries are written to.
|
||||
============= ==============================================
|
||||
dev_path Device that all of the IO will go to normally.
|
||||
log_dev_path Device where the log entries are written to.
|
||||
============= ==============================================
|
||||
|
||||
ii) Status
|
||||
|
||||
<#logged entries> <highest allocated sector>
|
||||
|
||||
#logged entries : Number of logged entries
|
||||
highest allocated sector : Highest allocated sector
|
||||
=========================== ========================
|
||||
#logged entries Number of logged entries
|
||||
highest allocated sector Highest allocated sector
|
||||
=========================== ========================
|
||||
|
||||
iii) Messages
|
||||
|
||||
@ -75,15 +80,15 @@ iii) Messages
|
||||
For example say you want to fsck a file system after every
|
||||
write, but first you need to replay up to the mkfs to make sure
|
||||
we're fsck'ing something reasonable, you would do something like
|
||||
this:
|
||||
this::
|
||||
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
<run test>
|
||||
|
||||
This would allow you to replay the log up to the mkfs mark and
|
||||
then replay from that point on doing the fsck check in the
|
||||
interval that you want.
|
||||
This would allow you to replay the log up to the mkfs mark and
|
||||
then replay from that point on doing the fsck check in the
|
||||
interval that you want.
|
||||
|
||||
Every log has a mark at the end labeled "dm-log-writes-end".
|
||||
|
||||
@ -97,42 +102,42 @@ Example usage
|
||||
=============
|
||||
|
||||
Say you want to test fsync on your file system. You would do something like
|
||||
this:
|
||||
this::
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<some test that does fsync at the end>
|
||||
dmsetup message log 0 mark fsync
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
umount /mnt/btrfs-test
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<some test that does fsync at the end>
|
||||
dmsetup message log 0 mark fsync
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
umount /mnt/btrfs-test
|
||||
|
||||
dmsetup remove log
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
|
||||
mount /dev/sdb /mnt/btrfs-test
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
<verify md5sum's are correct>
|
||||
dmsetup remove log
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
|
||||
mount /dev/sdb /mnt/btrfs-test
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
<verify md5sum's are correct>
|
||||
|
||||
Another option is to do a complicated file system operation and verify the file
|
||||
system is consistent during the entire operation. You could do this with:
|
||||
Another option is to do a complicated file system operation and verify the file
|
||||
system is consistent during the entire operation. You could do this with:
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<fsstress to dirty the fs>
|
||||
btrfs filesystem balance /mnt/btrfs-test
|
||||
umount /mnt/btrfs-test
|
||||
dmsetup remove log
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<fsstress to dirty the fs>
|
||||
btrfs filesystem balance /mnt/btrfs-test
|
||||
umount /mnt/btrfs-test
|
||||
dmsetup remove log
|
||||
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
|
||||
btrfsck /dev/sdb
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
|
||||
btrfsck /dev/sdb
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
|
||||
--fsck "btrfsck /dev/sdb" --check fua
|
||||
|
||||
And that will replay the log until it sees a FUA request, run the fsck command
|
@ -1,3 +1,7 @@
|
||||
===============
|
||||
Persistent data
|
||||
===============
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
@ -1,15 +1,16 @@
|
||||
==============================
|
||||
Device-mapper snapshot support
|
||||
==============================
|
||||
|
||||
Device-mapper allows you, without massive data copying:
|
||||
|
||||
*) To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
*) To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
*) To merge a snapshot of a block device back into the snapshot's origin
|
||||
device.
|
||||
- To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
- To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
- To merge a snapshot of a block device back into the snapshot's origin
|
||||
device.
|
||||
|
||||
In the first two cases, dm copies only the chunks of data that get
|
||||
changed and uses a separate copy-on-write (COW) block device for
|
||||
@ -22,7 +23,7 @@ the origin device.
|
||||
There are three dm targets available:
|
||||
snapshot, snapshot-origin, and snapshot-merge.
|
||||
|
||||
*) snapshot-origin <origin>
|
||||
- snapshot-origin <origin>
|
||||
|
||||
which will normally have one or more snapshots based on it.
|
||||
Reads will be mapped directly to the backing device. For each write, the
|
||||
@ -30,7 +31,7 @@ original data will be saved in the <COW device> of each snapshot to keep
|
||||
its visible content unchanged, at least until the <COW device> fills up.
|
||||
|
||||
|
||||
*) snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
- snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
|
||||
A snapshot of the <origin> block device is created. Changed chunks of
|
||||
<chunksize> sectors will be stored on the <COW device>. Writes will
|
||||
@ -83,25 +84,25 @@ When you create the first LVM2 snapshot of a volume, four dm devices are used:
|
||||
source volume), whose table is replaced by a "snapshot-origin" mapping
|
||||
from device #1.
|
||||
|
||||
A fixed naming scheme is used, so with the following commands:
|
||||
A fixed naming scheme is used, so with the following commands::
|
||||
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
|
||||
we'll have this situation (with volumes in above order):
|
||||
we'll have this situation (with volumes in above order)::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How snapshot-merge is used by LVM2
|
||||
@ -114,27 +115,28 @@ merging snapshot after it completes. The "snapshot" that hands over its
|
||||
COW device to the "snapshot-merge" is deactivated (unless using lvchange
|
||||
--refresh); but if it is left active it will simply return I/O errors.
|
||||
|
||||
A snapshot will merge into its origin with the following command:
|
||||
A snapshot will merge into its origin with the following command::
|
||||
|
||||
lvconvert --merge volumeGroup/snap
|
||||
lvconvert --merge volumeGroup/snap
|
||||
|
||||
we'll now have this situation:
|
||||
we'll now have this situation::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
|
||||
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
|
||||
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How to determine when a merging is complete
|
||||
===========================================
|
||||
The snapshot-merge and snapshot status lines end with:
|
||||
|
||||
<sectors_allocated>/<total_sectors> <metadata_sectors>
|
||||
|
||||
Both <sectors_allocated> and <total_sectors> include both data and metadata.
|
||||
@ -142,35 +144,37 @@ During merging, the number of sectors allocated gets smaller and
|
||||
smaller. Merging has finished when the number of sectors holding data
|
||||
is zero, in other words <sectors_allocated> == <metadata_sectors>.
|
||||
|
||||
Here is a practical example (using a hybrid of lvm and dmsetup commands):
|
||||
Here is a practical example (using a hybrid of lvm and dmsetup commands)::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
snap volumeGroup swi-a- 1.00g base 18.97
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
snap volumeGroup swi-a- 1.00g base 18.97
|
||||
|
||||
# dmsetup status volumeGroup-snap
|
||||
0 8388608 snapshot 397896/2097152 1560
|
||||
^^^^ metadata sectors
|
||||
# dmsetup status volumeGroup-snap
|
||||
0 8388608 snapshot 397896/2097152 1560
|
||||
^^^^ metadata sectors
|
||||
|
||||
# lvconvert --merge -b volumeGroup/snap
|
||||
Merging of volume snap started.
|
||||
# lvconvert --merge -b volumeGroup/snap
|
||||
Merging of volume snap started.
|
||||
|
||||
# lvs volumeGroup/snap
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup Owi-a- 4.00g 17.23
|
||||
# lvs volumeGroup/snap
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup Owi-a- 4.00g 17.23
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 281688/2097152 1104
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 281688/2097152 1104
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 180480/2097152 712
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 180480/2097152 712
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 16/2097152 16
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 16/2097152 16
|
||||
|
||||
Merging has finished.
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
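The completion test described above (<sectors_allocated> == <metadata_sectors>) is easy to apply to a captured `dmsetup status` line. The Python sketch below is illustrative only: `parse_snapshot_status` and `merge_finished` are hypothetical names, and the code assumes the line ends with the `<sectors_allocated>/<total_sectors> <metadata_sectors>` fields described above, so any other status text is not handled.

```python
def parse_snapshot_status(line):
    """Return (sectors_allocated, total_sectors, metadata_sectors)."""
    fields = line.split()
    allocated, total = fields[-2].split("/")
    return int(allocated), int(total), int(fields[-1])


def merge_finished(line):
    """True when no sectors holding data remain in the COW device."""
    allocated, _total, metadata = parse_snapshot_status(line)
    return allocated == metadata


# Status lines taken from the example session above.
for status in ("0 8388608 snapshot-merge 281688/2097152 1104",
               "0 8388608 snapshot-merge 16/2097152 16"):
    print(status, "->", merge_finished(status))
```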
|
@ -1,3 +1,4 @@
|
||||
=============
|
||||
DM statistics
|
||||
=============
|
||||
|
||||
@ -11,7 +12,7 @@ Individual statistics will be collected for each step-sized area within
|
||||
the range specified.
|
||||
|
||||
The I/O statistics counters for each step-sized area of a region are
|
||||
in the same format as /sys/block/*/stat or /proc/diskstats (see:
|
||||
in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
|
||||
Documentation/iostats.txt). But two extra counters (12 and 13) are
|
||||
provided: total time spent reading and writing. When the histogram
|
||||
argument is used, the 14th parameter is reported that represents the
|
||||
@ -32,40 +33,45 @@ on each other's data.
|
||||
The creation of DM statistics will allocate memory via kmalloc or
|
||||
fallback to using vmalloc space. At most, 1/4 of the overall system
|
||||
memory may be allocated by DM statistics. The admin can see how much
|
||||
memory is used by reading
|
||||
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
|
||||
memory is used by reading:
|
||||
|
||||
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
|
||||
|
||||
Messages
|
||||
========
|
||||
|
||||
@stats_create <range> <step>
|
||||
[<number_of_optional_arguments> <optional_arguments>...]
|
||||
[<program_id> [<aux_data>]]
|
||||
|
||||
@stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
|
||||
Create a new region and return the region_id.
|
||||
|
||||
<range>
|
||||
"-" - whole device
|
||||
"<start_sector>+<length>" - a range of <length> 512-byte sectors
|
||||
starting with <start_sector>.
|
||||
"-"
|
||||
whole device
|
||||
"<start_sector>+<length>"
|
||||
a range of <length> 512-byte sectors
|
||||
starting with <start_sector>.
|
||||
|
||||
<step>
|
||||
"<area_size>" - the range is subdivided into areas each containing
|
||||
<area_size> sectors.
|
||||
"/<number_of_areas>" - the range is subdivided into the specified
|
||||
number of areas.
|
||||
"<area_size>"
|
||||
the range is subdivided into areas each containing
|
||||
<area_size> sectors.
|
||||
"/<number_of_areas>"
|
||||
the range is subdivided into the specified
|
||||
number of areas.
|
||||
|
||||
<number_of_optional_arguments>
|
||||
The number of optional arguments
|
||||
|
||||
<optional_arguments>
|
||||
The following optional arguments are supported
|
||||
precise_timestamps - use precise timer with nanosecond resolution
|
||||
The following optional arguments are supported:
|
||||
|
||||
precise_timestamps
|
||||
use precise timer with nanosecond resolution
|
||||
instead of the "jiffies" variable. When this argument is
|
||||
used, the resulting times are in nanoseconds instead of
|
||||
milliseconds. Precise timestamps are a little bit slower
|
||||
to obtain than jiffies-based timestamps.
|
||||
histogram:n1,n2,n3,n4,... - collect histogram of latencies. The
|
||||
histogram:n1,n2,n3,n4,...
|
||||
collect histogram of latencies. The
|
||||
numbers n1, n2, etc are times that represent the boundaries
|
||||
of the histogram. If precise_timestamps is not used, the
|
||||
times are in milliseconds, otherwise they are in
|
||||
@ -96,21 +102,18 @@ Messages
|
||||
@stats_list message, but it doesn't use this value for anything.
|
||||
|
||||
@stats_delete <region_id>
|
||||
|
||||
Delete the region with the specified id.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_clear <region_id>
|
||||
|
||||
Clear all the counters except the in-flight i/o counters.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_list [<program_id>]
|
||||
|
||||
List all regions registered with @stats_create.
|
||||
|
||||
<program_id>
|
||||
@ -127,7 +130,6 @@ Messages
|
||||
if they were specified when creating the region.
|
||||
|
||||
@stats_print <region_id> [<starting_line> <number_of_lines>]
|
||||
|
||||
Print counters for each step-sized area of a region.
|
||||
|
||||
<region_id>
|
||||
@ -143,10 +145,11 @@ Messages
|
||||
|
||||
Output format for each step-sized area of a region:
|
||||
|
||||
<start_sector>+<length> counters
|
||||
<start_sector>+<length>
|
||||
counters
|
||||
|
||||
The first 11 counters have the same meaning as
|
||||
/sys/block/*/stat or /proc/diskstats.
|
||||
`/sys/block/*/stat or /proc/diskstats`.
|
||||
|
||||
Please refer to Documentation/iostats.txt for details.
|
||||
|
||||
@ -163,11 +166,11 @@ Messages
|
||||
11. the weighted number of milliseconds spent doing I/Os
|
||||
|
||||
Additional counters:
|
||||
|
||||
12. the total time spent reading in milliseconds
|
||||
13. the total time spent writing in milliseconds
|
||||
|
||||
@stats_print_clear <region_id> [<starting_line> <number_of_lines>]
|
||||
|
||||
Atomically print and then clear all the counters except the
|
||||
in-flight i/o counters. Useful when the client consuming the
|
||||
statistics does not want to lose any statistics (those updated
|
||||
@ -185,7 +188,6 @@ Messages
|
||||
If omitted, all lines are printed and then cleared.
|
||||
|
||||
@stats_set_aux <region_id> <aux_data>
|
||||
|
||||
Store auxiliary data aux_data for the specified region.
|
||||
|
||||
<region_id>
|
||||
@ -201,23 +203,23 @@ Examples
|
||||
========
|
||||
|
||||
Subdivide the DM device 'vol' into 100 pieces and start collecting
|
||||
statistics on them:
|
||||
statistics on them::
|
||||
|
||||
dmsetup message vol 0 @stats_create - /100
|
||||
|
||||
Set the auxiliary data string to "foo bar baz" (the escape for each
|
||||
space must also be escaped, otherwise the shell will consume them):
|
||||
space must also be escaped, otherwise the shell will consume them)::
|
||||
|
||||
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
|
||||
|
||||
List the statistics:
|
||||
List the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_list
|
||||
|
||||
Print the statistics:
|
||||
Print the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_print 0
|
||||
|
||||
Delete the statistics:
|
||||
Delete the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_delete 0
|
61
Documentation/device-mapper/striped.rst
Normal file
@ -0,0 +1,61 @@
|
||||
=========
|
||||
dm-stripe
|
||||
=========
|
||||
|
||||
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
|
||||
device across one or more underlying devices. Data is written in "chunks",
|
||||
with consecutive chunks rotating among the underlying devices. This can
|
||||
potentially provide improved I/O throughput by utilizing several physical
|
||||
devices in parallel.
|
||||
|
||||
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
|
||||
<num devs>:
|
||||
Number of underlying devices.
|
||||
<chunk size>:
|
||||
Size of each chunk of data. Must be at least as
|
||||
large as the system's PAGE_SIZE.
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
One or more underlying devices can be specified. The striped device size must
|
||||
be a multiple of the chunk size multiplied by the number of underlying devices.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Create a striped device across any number of underlying devices. The device
|
||||
# will be called "stripe_dev" and have a chunk-size of 128k.
|
||||
|
||||
my $chunk_size = 128 * 2;
|
||||
my $dev_name = "stripe_dev";
|
||||
my $num_devs = @ARGV;
|
||||
my @devs = @ARGV;
|
||||
my ($min_dev_size, $stripe_dev_size, $i);
|
||||
|
||||
if (!$num_devs) {
|
||||
die("Specify at least one device\n");
|
||||
}
|
||||
|
||||
$min_dev_size = `blockdev --getsz $devs[0]`;
|
||||
for ($i = 1; $i < $num_devs; $i++) {
|
||||
my $this_size = `blockdev --getsz $devs[$i]`;
|
||||
$min_dev_size = ($min_dev_size < $this_size) ?
|
||||
$min_dev_size : $this_size;
|
||||
}
|
||||
|
||||
$stripe_dev_size = $min_dev_size * $num_devs;
|
||||
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
|
||||
|
||||
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
|
||||
for ($i = 0; $i < $num_devs; $i++) {
|
||||
$table .= " $devs[$i] 0";
|
||||
}
|
||||
|
||||
`echo $table | dmsetup create $dev_name`;
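The size arithmetic in the script above can be sketched on its own. This is a minimal illustration, assuming all sizes are given in 512-byte sectors (as ``blockdev --getsz`` reports them) and that every member starts at offset 0; ``striped_size`` and ``striped_table`` are hypothetical helper names.

```python
def striped_size(dev_sizes, chunk_size):
    """Largest usable striped-device size, in 512-byte sectors."""
    num_devs = len(dev_sizes)
    if num_devs == 0:
        raise ValueError("specify at least one device")
    # Every member contributes as much as the smallest one can hold.
    size = min(dev_sizes) * num_devs
    # Round down to a multiple of chunk_size * num_devs.
    return size - size % (chunk_size * num_devs)


def striped_table(dev_paths, dev_sizes, chunk_size=128 * 2):
    """Build the table line that would be piped to 'dmsetup create'."""
    size = striped_size(dev_sizes, chunk_size)
    members = " ".join(f"{path} 0" for path in dev_paths)
    return f"0 {size} striped {len(dev_paths)} {chunk_size} {members}"
```

For example, three members of 1000, 1200 and 1100 sectors with a 256-sector chunk yield a 2304-sector striped device: 3000 rounded down to a multiple of 768.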
|
@ -1,57 +0,0 @@
|
||||
dm-stripe
|
||||
=========
|
||||
|
||||
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
|
||||
device across one or more underlying devices. Data is written in "chunks",
|
||||
with consecutive chunks rotating among the underlying devices. This can
|
||||
potentially provide improved I/O throughput by utilizing several physical
|
||||
devices in parallel.
|
||||
|
||||
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
|
||||
<num devs>: Number of underlying devices.
|
||||
<chunk size>: Size of each chunk of data. Must be at least as
|
||||
large as the system's PAGE_SIZE.
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
|
||||
One or more underlying devices can be specified. The striped device size must
|
||||
be a multiple of the chunk size multiplied by the number of underlying devices.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
[[
|
||||
#!/usr/bin/perl -w
|
||||
# Create a striped device across any number of underlying devices. The device
|
||||
# will be called "stripe_dev" and have a chunk-size of 128k.
|
||||
|
||||
my $chunk_size = 128 * 2;
|
||||
my $dev_name = "stripe_dev";
|
||||
my $num_devs = @ARGV;
|
||||
my @devs = @ARGV;
|
||||
my ($min_dev_size, $stripe_dev_size, $i);
|
||||
|
||||
if (!$num_devs) {
|
||||
die("Specify at least one device\n");
|
||||
}
|
||||
|
||||
$min_dev_size = `blockdev --getsz $devs[0]`;
|
||||
for ($i = 1; $i < $num_devs; $i++) {
|
||||
my $this_size = `blockdev --getsz $devs[$i]`;
|
||||
$min_dev_size = ($min_dev_size < $this_size) ?
|
||||
$min_dev_size : $this_size;
|
||||
}
|
||||
|
||||
$stripe_dev_size = $min_dev_size * $num_devs;
|
||||
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
|
||||
|
||||
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
|
||||
for ($i = 0; $i < $num_devs; $i++) {
|
||||
$table .= " $devs[$i] 0";
|
||||
}
|
||||
|
||||
`echo $table | dmsetup create $dev_name`;
|
||||
]]
|
||||
|
@ -1,3 +1,4 @@
|
||||
=========
|
||||
dm-switch
|
||||
=========
|
||||
|
||||
@ -67,27 +68,25 @@ b-tree can achieve.
|
||||
Construction Parameters
|
||||
=======================
|
||||
|
||||
<num_paths> <region_size> <num_optional_args> [<optional_args>...]
|
||||
[<dev_path> <offset>]+
|
||||
<num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
|
||||
<num_paths>
|
||||
The number of paths across which to distribute the I/O.
|
||||
|
||||
<num_paths>
|
||||
The number of paths across which to distribute the I/O.
|
||||
<region_size>
|
||||
The number of 512-byte sectors in a region. Each region can be redirected
|
||||
to any of the available paths.
|
||||
|
||||
<region_size>
|
||||
The number of 512-byte sectors in a region. Each region can be redirected
|
||||
to any of the available paths.
|
||||
<num_optional_args>
|
||||
The number of optional arguments. Currently, no optional arguments
|
||||
are supported and so this must be zero.
|
||||
|
||||
<num_optional_args>
|
||||
The number of optional arguments. Currently, no optional arguments
|
||||
are supported and so this must be zero.
|
||||
<dev_path>
|
||||
The block device that represents a specific path to the device.
|
||||
|
||||
<dev_path>
|
||||
The block device that represents a specific path to the device.
|
||||
|
||||
<offset>
|
||||
The offset of the start of data on the specific <dev_path> (in units
|
||||
of 512-byte sectors). This number is added to the sector number when
|
||||
forwarding the request to the specific path. Typically it is zero.
|
||||
<offset>
|
||||
The offset of the start of data on the specific <dev_path> (in units
|
||||
of 512-byte sectors). This number is added to the sector number when
|
||||
forwarding the request to the specific path. Typically it is zero.
|
||||
|
||||
Messages
|
||||
========
|
||||
@ -122,17 +121,21 @@ Example
|
||||
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
|
||||
the same size.
|
||||
|
||||
Create a switch device with 64kB region size:
|
||||
Create a switch device with 64kB region size::
|
||||
|
||||
dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
|
||||
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
|
||||
|
||||
Set mappings for the first 7 entries to point to devices switch0, switch1,
|
||||
switch2, switch0, switch1, switch2, switch1:
|
||||
switch2, switch0, switch1, switch2, switch1::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
|
||||
|
||||
Set repetitive mapping. This command:
|
||||
Set repetitive mapping. This command::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
|
||||
is equivalent to:
|
||||
|
||||
is equivalent to::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
|
||||
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
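The equivalence above can be checked with a short sketch. It assumes, following the ``set_region_mappings`` description elsewhere in the dm-switch document (not shown in this hunk), that ``R<n>,<m>`` repeats the last ``<n>`` mappings over the next ``<m>`` slots and that both numbers are hexadecimal. It also assumes the preceding mappings are contiguous; ``expand_mappings`` is a hypothetical helper.

```python
def expand_mappings(args):
    """Expand R<n>,<m> arguments into explicit ':<path_nr>' entries."""
    paths = []  # path numbers seen so far, in region order
    out = []
    for arg in args:
        if arg.startswith("R"):
            n_hex, m_hex = arg[1:].split(",")
            n, m = int(n_hex, 16), int(m_hex, 16)
            if n == 0 or n > len(paths):
                raise ValueError("not enough previous mappings to repeat")
            for _ in range(m):
                path = paths[-n]  # the mapping n slots back
                paths.append(path)
                out.append(":" + path)
        else:
            _index, path = arg.split(":")
            paths.append(path)
            out.append(arg)
    return out


expanded = expand_mappings(["1000:1", ":2", "R2,10"])
```

Under these assumptions ``R2,10`` fills 0x10 = 16 further slots, which matches the 18 entries written out in the long form.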
|
||||
|
@ -1,3 +1,7 @@
|
||||
=================
|
||||
Thin provisioning
|
||||
=================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
@ -95,6 +99,8 @@ previously.)
|
||||
Using an existing pool device
|
||||
-----------------------------
|
||||
|
||||
::
|
||||
|
||||
dmsetup create pool \
|
||||
--table "0 20971520 thin-pool $metadata_dev $data_dev \
|
||||
$data_block_size $low_water_mark"
|
||||
@ -154,7 +160,7 @@ Thin provisioning
|
||||
i) Creating a new thinly-provisioned volume.
|
||||
|
||||
To create a new thinly-provisioned volume you must send a message to an
|
||||
active pool device, /dev/mapper/pool in this example.
|
||||
active pool device, /dev/mapper/pool in this example::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
@ -164,7 +170,7 @@ i) Creating a new thinly-provisioned volume.
|
||||
|
||||
ii) Using a thinly-provisioned volume.
|
||||
|
||||
Thinly-provisioned volumes are activated using the 'thin' target:
|
||||
Thinly-provisioned volumes are activated using the 'thin' target::
|
||||
|
||||
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
|
||||
|
||||
@ -181,6 +187,8 @@ i) Creating an internal snapshot.
|
||||
must suspend it before creating the snapshot to avoid corruption.
|
||||
This is NOT enforced at the moment, so please be careful!
|
||||
|
||||
::
|
||||
|
||||
dmsetup suspend /dev/mapper/thin
|
||||
dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
|
||||
dmsetup resume /dev/mapper/thin
|
||||
@ -198,14 +206,14 @@ ii) Using an internal snapshot.
|
||||
activating or removing them both. (This differs from conventional
|
||||
device-mapper snapshots.)
|
||||
|
||||
Activate it exactly the same way as any other thinly-provisioned volume:
|
||||
Activate it exactly the same way as any other thinly-provisioned volume::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
|
||||
|
||||
External snapshots
|
||||
------------------
|
||||
|
||||
You can use an external _read only_ device as an origin for a
|
||||
You can use an external **read only** device as an origin for a
|
||||
thinly-provisioned volume. Any read to an unprovisioned area of the
|
||||
thin device will be passed through to the origin. Writes trigger
|
||||
the allocation of new blocks as usual.
|
||||
@ -223,11 +231,13 @@ i) Creating a snapshot of an external device
|
||||
This is the same as creating a thin device.
|
||||
You don't mention the origin at this stage.
|
||||
|
||||
::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
ii) Using a snapshot of an external device.
|
||||
|
||||
Append an extra parameter to the thin target specifying the origin:
|
||||
Append an extra parameter to the thin target specifying the origin::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
|
||||
|
||||
@ -240,6 +250,8 @@ Deactivation
|
||||
All devices using a pool must be deactivated before the pool itself
|
||||
can be.
|
||||
|
||||
::
|
||||
|
||||
dmsetup remove thin
|
||||
dmsetup remove snap
|
||||
dmsetup remove pool
|
||||
@ -252,25 +264,32 @@ Reference
|
||||
|
||||
i) Constructor
|
||||
|
||||
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||||
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||||
::
|
||||
|
||||
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||||
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||||
|
||||
Optional feature arguments:
|
||||
|
||||
skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.
|
||||
skip_block_zeroing:
|
||||
Skip the zeroing of newly-provisioned blocks.
|
||||
|
||||
ignore_discard: Disable discard support.
|
||||
ignore_discard:
|
||||
Disable discard support.
|
||||
|
||||
no_discard_passdown: Don't pass discards down to the underlying
|
||||
data device, but just remove the mapping.
|
||||
no_discard_passdown:
|
||||
Don't pass discards down to the underlying
|
||||
data device, but just remove the mapping.
|
||||
|
||||
read_only: Don't allow any changes to be made to the pool
|
||||
read_only:
|
||||
Don't allow any changes to be made to the pool
|
||||
metadata. This mode is only available after the
|
||||
thin-pool has been created and first used in full
|
||||
read/write mode. It cannot be specified on initial
|
||||
thin-pool creation.
|
||||
|
||||
error_if_no_space: Error IOs, instead of queueing, if no space.
|
||||
error_if_no_space:
|
||||
Error IOs, instead of queueing, if no space.
|
||||
|
||||
Data block size must be between 64KB (128 sectors) and 1GB
|
||||
(2097152 sectors) inclusive.
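The sector figures quoted for the data block size follow from 512-byte sectors. A minimal sketch of the range check, covering only the bounds stated here; ``data_block_size_ok`` is a hypothetical helper, not a kernel or dmsetup function.

```python
SECTOR_SIZE = 512                   # bytes per sector
MIN_DATA_BLOCK_SECTORS = 128        # 64KB
MAX_DATA_BLOCK_SECTORS = 2097152    # 1GB


def data_block_size_ok(sectors):
    """True if the data block size lies within the documented inclusive range."""
    return MIN_DATA_BLOCK_SECTORS <= sectors <= MAX_DATA_BLOCK_SECTORS
```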
|
||||
@ -278,10 +297,12 @@ i) Constructor
|
||||
|
||||
ii) Status
|
||||
|
||||
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||||
<used data blocks>/<total data blocks> <held metadata root>
|
||||
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
|
||||
needs_check|- metadata_low_watermark
|
||||
::
|
||||
|
||||
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||||
<used data blocks>/<total data blocks> <held metadata root>
|
||||
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
|
||||
needs_check|- metadata_low_watermark
|
||||
|
||||
transaction id:
|
||||
A 64-bit number used by userspace to help synchronise with metadata
|
||||
@ -336,13 +357,11 @@ ii) Status
|
||||
iii) Messages
|
||||
|
||||
create_thin <dev id>
|
||||
|
||||
Create a new thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
|
||||
create_snap <dev id> <origin id>
|
||||
|
||||
Create a new snapshot of another thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
@ -350,11 +369,9 @@ iii) Messages
|
||||
of which the new device will be a snapshot.
|
||||
|
||||
delete <dev id>
|
||||
|
||||
Deletes a thin device. Irreversible.
|
||||
|
||||
set_transaction_id <current id> <new id>
|
||||
|
||||
Userland volume managers, such as LVM, need a way to
|
||||
synchronise their external metadata with the internal metadata of the
|
||||
pool target. The thin-pool target offers to store an
|
||||
@ -364,14 +381,12 @@ iii) Messages
|
||||
compare-and-swap message.
|
||||
|
||||
reserve_metadata_snap
|
||||
|
||||
Reserve a copy of the data mapping btree for use by userland.
|
||||
This allows userland to inspect the mappings as they were when
|
||||
this message was executed. Use the pool's status command to
|
||||
get the root block associated with the metadata snapshot.
|
||||
|
||||
release_metadata_snap
|
||||
|
||||
Release a previously reserved copy of the data mapping btree.
|
||||
|
||||
'thin' target
|
||||
@ -379,7 +394,9 @@ iii) Messages
|
||||
|
||||
i) Constructor
|
||||
|
||||
thin <pool dev> <dev id> [<external origin dev>]
|
||||
::
|
||||
|
||||
thin <pool dev> <dev id> [<external origin dev>]
|
||||
|
||||
pool dev:
|
||||
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
|
||||
@ -401,8 +418,7 @@ provisioned as and when needed.
|
||||
|
||||
ii) Status
|
||||
|
||||
<nr mapped sectors> <highest mapped sector>
|
||||
|
||||
<nr mapped sectors> <highest mapped sector>
|
||||
If the pool has encountered device errors and failed, the status
|
||||
will just contain the string 'Fail'. The userspace recovery
|
||||
tools should then be used.
|
@ -1,3 +1,7 @@
|
||||
================================
|
||||
Device-mapper "unstriped" target
|
||||
================================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
@ -34,46 +38,46 @@ striped target to combine the 4 devices into one. It then will use
|
||||
the unstriped target ontop of the striped device to access the
|
||||
individual backing loop devices. We write data to the newly exposed
|
||||
unstriped devices and verify the data written matches the correct
|
||||
underlying device on the striped array.
|
||||
underlying device on the striped array::
|
||||
|
||||
#!/bin/bash
|
||||
#!/bin/bash
|
||||
|
||||
MEMBER_SIZE=$((128 * 1024 * 1024))
|
||||
NUM=4
|
||||
SEQ_END=$((${NUM}-1))
|
||||
CHUNK=256
|
||||
BS=4096
|
||||
MEMBER_SIZE=$((128 * 1024 * 1024))
|
||||
NUM=4
|
||||
SEQ_END=$((${NUM}-1))
|
||||
CHUNK=256
|
||||
BS=4096
|
||||
|
||||
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
|
||||
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
|
||||
COUNT=$((${MEMBER_SIZE} / ${BS}))
|
||||
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
|
||||
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
|
||||
COUNT=$((${MEMBER_SIZE} / ${BS}))
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
|
||||
losetup /dev/loop${i} member-${i}
|
||||
DM_PARMS+=" /dev/loop${i} 0"
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
|
||||
losetup /dev/loop${i} member-${i}
|
||||
DM_PARMS+=" /dev/loop${i} 0"
|
||||
done
|
||||
|
||||
echo $DM_PARMS | dmsetup create raid0
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
|
||||
done;
|
||||
echo $DM_PARMS | dmsetup create raid0
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
|
||||
diff /dev/mapper/set-${i} member-${i}
|
||||
done;
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
|
||||
diff /dev/mapper/set-${i} member-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dmsetup remove set-${i}
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dmsetup remove set-${i}
|
||||
done
|
||||
|
||||
dmsetup remove raid0
|
||||
dmsetup remove raid0
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
losetup -d /dev/loop${i}
|
||||
rm -f member-${i}
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
losetup -d /dev/loop${i}
|
||||
rm -f member-${i}
|
||||
done
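The arithmetic behind the script can be restated as a short sketch. The constants mirror the script; the sector mapping assumes the usual RAID-0 layout (consecutive chunks rotating among members, all member offsets 0), and ``striped_to_member`` is a hypothetical helper.

```python
MEMBER_SIZE = 128 * 1024 * 1024   # bytes per backing file
NUM = 4                           # number of members
CHUNK = 256                       # chunk size in 512-byte sectors
BS = 4096                         # dd block size in bytes

RAID_SIZE = MEMBER_SIZE * NUM // 512   # striped device size in sectors
COUNT = MEMBER_SIZE // BS              # dd blocks needed to fill one member


def striped_to_member(sector, num=NUM, chunk=CHUNK):
    """Map a sector of the striped device to (member index, member sector)."""
    chunk_index, offset = divmod(sector, chunk)
    return chunk_index % num, (chunk_index // num) * chunk + offset
```

Each unstriped device exposes exactly the sectors for which ``striped_to_member`` returns its member index, which is why the written data is expected to match the corresponding ``member-${i}`` file.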
|
||||
|
||||
Another example
|
||||
---------------
|
||||
@ -81,7 +85,7 @@ Another example
|
||||
Intel NVMe drives contain two cores on the physical device.
|
||||
Each core of the drive has segregated access to its LBA range.
|
||||
The current LBA model has a RAID 0 128k chunk on each core, resulting
|
||||
in a 256k stripe across the two cores:
|
||||
in a 256k stripe across the two cores::
|
||||
|
||||
Core 0: Core 1:
|
||||
__________ __________
|
||||
@ -108,17 +112,24 @@ Example dmsetup usage
|
||||
|
||||
unstriped ontop of Intel NVMe device that has 2 cores
|
||||
-----------------------------------------------------
|
||||
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
|
||||
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
|
||||
|
||||
::
|
||||
|
||||
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
|
||||
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
|
||||
|
||||
There will now be two devices that expose Intel NVMe core 0 and 1
|
||||
respectively:
|
||||
/dev/mapper/nvmset0
|
||||
/dev/mapper/nvmset1
|
||||
respectively::
|
||||
|
||||
/dev/mapper/nvmset0
|
||||
/dev/mapper/nvmset1
|
||||
|
||||
unstriped ontop of striped with 4 drives using 128K chunk size
|
||||
--------------------------------------------------------------
|
||||
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
|
||||
|
||||
::
|
||||
|
||||
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
|
@ -1,5 +1,6 @@
|
||||
=========
|
||||
dm-verity
|
||||
==========
|
||||
=========
|
||||
|
||||
Device-Mapper's "verity" target provides transparent integrity checking of
|
||||
block devices using a cryptographic digest provided by the kernel crypto API.
|
||||
@ -7,6 +8,9 @@ This target is read-only.
|
||||
|
||||
Construction Parameters
|
||||
=======================
|
||||
|
||||
::
|
||||
|
||||
<version> <dev> <hash_dev>
|
||||
<data_block_size> <hash_block_size>
|
||||
<num_data_blocks> <hash_start_block>
|
||||
@ -160,7 +164,9 @@ calculating the parent node.
|
||||
|
||||
The tree looks something like:
|
||||
|
||||
alg = sha256, num_blocks = 32768, block_size = 4096
|
||||
alg = sha256, num_blocks = 32768, block_size = 4096
|
||||
|
||||
::
|
||||
|
||||
[ root ]
|
||||
/ . . . \
|
||||
@ -189,6 +195,7 @@ block boundary) are the hash blocks which are stored a depth at a time
|
||||
|
||||
The full specification of kernel parameters and on-disk metadata format
|
||||
is available at the cryptsetup project's wiki page
|
||||
|
||||
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
|
||||
|
||||
Status
|
||||
@ -198,7 +205,8 @@ If any check failed, C (for Corruption) is returned.
|
||||
|
||||
Example
|
||||
=======
|
||||
Set up a device:
|
||||
Set up a device::
|
||||
|
||||
# dmsetup create vroot --readonly --table \
|
||||
"0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
|
||||
"4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
|
||||
@ -209,11 +217,13 @@ the hash tree or activate the kernel device. This is available from
|
||||
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
|
||||
(as a libcryptsetup extension).
|
||||
|
||||
Create hash on the device:
|
||||
Create hash on the device::
|
||||
|
||||
# veritysetup format /dev/sda1 /dev/sda2
|
||||
...
|
||||
Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
|
||||
|
||||
Activate the device:
|
||||
Activate the device::
|
||||
|
||||
# veritysetup create vroot /dev/sda1 /dev/sda2 \
|
||||
4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
|
@ -1,3 +1,7 @@
|
||||
=================
|
||||
Writecache target
|
||||
=================
|
||||
|
||||
The writecache target caches writes on persistent memory or on SSD. It
|
||||
doesn't cache reads because reads are supposed to be cached in page cache
|
||||
in normal RAM.
|
||||
@ -6,15 +10,18 @@ When the device is constructed, the first sector should be zeroed or the
|
||||
first sector should contain valid superblock from previous invocation.
|
||||
|
||||
Constructor parameters:
|
||||
|
||||
1. type of the cache device - "p" or "s"
|
||||
p - persistent memory
|
||||
s - SSD
|
||||
|
||||
- p - persistent memory
|
||||
- s - SSD
|
||||
2. the underlying device that will be cached
|
||||
3. the cache device
|
||||
4. block size (4096 is recommended; the maximum block size is the page
|
||||
size)
|
||||
5. the number of optional parameters (the parameters with an argument
|
||||
count as two)
|
||||
|
||||
start_sector n (default: 0)
|
||||
offset from the start of cache device in 512-byte sectors
|
||||
high_watermark n (default: 50)
|
||||
@ -43,6 +50,7 @@ Constructor parameters:
|
||||
applicable only to persistent memory - don't use the FUA
|
||||
flag when writing back data and send the FLUSH request
|
||||
afterwards
|
||||
|
||||
- some underlying devices perform better with fua, some
|
||||
with nofua. The user should test it
|
||||
|
||||
@ -60,6 +68,7 @@ Messages:
|
||||
flush the cache device on next suspend. Use this message
|
||||
when you are going to remove the cache device. The proper
|
||||
sequence for removing the cache device is:
|
||||
|
||||
1. send the "flush_on_suspend" message
|
||||
2. load an inactive table with a linear target that maps
|
||||
to the underlying device
|
@ -1,3 +1,4 @@
|
||||
=======
|
||||
dm-zero
|
||||
=======
|
||||
|
||||
@ -18,20 +19,19 @@ filesystem limitations.
|
||||
|
||||
To create a sparse device, start by creating a dm-zero device that's the
|
||||
desired size of the sparse device. For this example, we'll assume a 10TB
|
||||
sparse device.
|
||||
sparse device::
|
||||
|
||||
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
|
||||
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
|
||||
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
|
||||
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
|
||||
|
||||
Then create a snapshot of the zero device, using any available block-device as
|
||||
the COW device. The size of the COW device will determine the amount of real
|
||||
space available to the sparse device. For this example, we'll assume /dev/sdb1
|
||||
is an available 10GB partition.
|
||||
is an available 10GB partition::
|
||||
|
||||
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
|
||||
dmsetup create sparse1
|
||||
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
|
||||
dmsetup create sparse1
|
||||
|
||||
This will create a 10TB sparse device called /dev/mapper/sparse1 that has
|
||||
10GB of actual storage space available. If more than 10GB of data is written
|
||||
to this device, it will start returning I/O errors.
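The ``expr`` arithmetic in the example converts a size into 512-byte sectors. A minimal sketch of the same calculation, assuming "10TB" means 10 * 1024^4 bytes as the ``expr`` line implies; ``tib_to_sectors`` is a hypothetical helper.

```python
SECTOR_SIZE = 512  # bytes per sector


def tib_to_sectors(tib):
    """Convert a size in units of 1024**4 bytes to 512-byte sectors."""
    return tib * 1024 ** 4 // SECTOR_SIZE


ten_terabytes = tib_to_sectors(10)
# Table line for the zero target, as piped to 'dmsetup create zero1'.
zero_table = f"0 {ten_terabytes} zero"
```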
|
||||
|
@ -16,8 +16,8 @@ Required properties:
|
||||
In this case, the ENETC node should include a "mdio" sub-node
|
||||
that in turn should contain the "ethernet-phy" node describing the
|
||||
external phy. Below properties are required, their bindings
|
||||
already defined in ethernet.txt or phy.txt, under
|
||||
Documentation/devicetree/bindings/net/*.
|
||||
already defined in Documentation/devicetree/bindings/net/ethernet.txt or
|
||||
Documentation/devicetree/bindings/net/phy.txt.
|
||||
|
||||
Required:
|
||||
|
||||
@ -51,8 +51,7 @@ Example:
|
||||
connection:
|
||||
|
||||
In this case, the ENETC port node defines a fixed link connection,
|
||||
as specified by "fixed-link.txt", under
|
||||
Documentation/devicetree/bindings/net/*.
|
||||
as specified by Documentation/devicetree/bindings/net/fixed-link.txt.
|
||||
|
||||
Required:
|
||||
|
||||
|
@ -3,7 +3,7 @@ Amlogic Meson AXG DWC PCIE SoC controller
|
||||
Amlogic Meson PCIe host controller is based on the Synopsys DesignWare PCI core.
|
||||
It shares common functions with the PCIe DesignWare core driver and
|
||||
inherits common properties defined in
|
||||
Documentation/devicetree/bindings/pci/designware-pci.txt.
|
||||
Documentation/devicetree/bindings/pci/designware-pcie.txt.
|
||||
|
||||
Additional properties are described here:
|
||||
|
||||
|
@ -97,7 +97,7 @@ Second Level Nodes - Regulators
|
||||
sent for this regulator including those which are for a
|
||||
strictly lower power state.
|
||||
|
||||
Other properties defined in Documentation/devicetree/bindings/regulator.txt
|
||||
Other properties defined in Documentation/devicetree/bindings/regulator/regulator.txt
|
||||
may also be used. regulator-initial-mode and regulator-allowed-modes may be
|
||||
specified for VRM regulators using mode values from
|
||||
include/dt-bindings/regulator/qcom,rpmh-regulator.h. regulator-allow-bypass
|
||||
|
@ -277,7 +277,7 @@ it with special cases.
|
||||
the decompressor (the real mode entry point goes to the same 32bit
|
||||
entry point once it switched into protected mode). That entry point
|
||||
supports one calling convention which is documented in
|
||||
Documentation/x86/boot.txt
|
||||
Documentation/x86/boot.rst
|
||||
The physical pointer to the device-tree block (defined in chapter II)
|
||||
is passed via setup_data which requires at least boot protocol 2.09.
|
||||
The type field is defined as
|
||||
|
@ -359,7 +359,7 @@ Domain`_ references.
|
||||
``monospaced font``.
|
||||
|
||||
Useful if you need to use special characters that would otherwise have some
|
||||
meaning either by kernel-doc script of by reStructuredText.
|
||||
meaning either by kernel-doc script or by reStructuredText.
|
||||
|
||||
This is particularly useful if you need to use things like ``%ph`` inside
|
||||
a function description.
|
||||
|
@ -27,8 +27,7 @@ Sphinx Install
|
||||
==============
|
||||
|
||||
The ReST markups currently used by the Documentation/ files are meant to be
|
||||
built with ``Sphinx`` version 1.3 or higher. If you desire to build
|
||||
PDF output, it is recommended to use version 1.4.6 or higher.
|
||||
built with ``Sphinx`` version 1.3 or higher.
|
||||
|
||||
There's a script that checks for the Sphinx requirements. Please see
|
||||
:ref:`sphinx-pre-install` for further details.
|
||||
@ -56,13 +55,13 @@ or ``virtualenv``, depending on how your distribution packaged Python 3.
|
||||
those expressions are written using LaTeX notation. It needs texlive
|
||||
installed with amsfonts and amsmath in order to evaluate them.
|
||||
|
||||
In summary, if you want to install Sphinx version 1.4.9, you should do::
|
||||
In summary, if you want to install Sphinx version 1.7.9, you should do::
|
||||
|
||||
$ virtualenv sphinx_1.4
|
||||
$ . sphinx_1.4/bin/activate
|
||||
(sphinx_1.4) $ pip install -r Documentation/sphinx/requirements.txt
|
||||
$ virtualenv sphinx_1.7.9
|
||||
$ . sphinx_1.7.9/bin/activate
|
||||
(sphinx_1.7.9) $ pip install -r Documentation/sphinx/requirements.txt
|
||||
|
||||
After running ``. sphinx_1.4/bin/activate``, the prompt will change,
|
||||
After running ``. sphinx_1.7.9/bin/activate``, the prompt will change,
|
||||
in order to indicate that you're using the new environment. If you
|
||||
open a new shell, you need to rerun this command to enter again at
|
||||
the virtual environment before building the documentation.
|
||||
@ -105,8 +104,8 @@ command line options for your distro::
|
||||
You should run:
|
||||
|
||||
sudo dnf install -y texlive-luatex85
|
||||
/usr/bin/virtualenv sphinx_1.4
|
||||
. sphinx_1.4/bin/activate
|
||||
/usr/bin/virtualenv sphinx_1.7.9
|
||||
. sphinx_1.7.9/bin/activate
|
||||
pip install -r Documentation/sphinx/requirements.txt
|
||||
|
||||
Can't build as 1 mandatory dependency is missing at ./scripts/sphinx-pre-install line 468.
|
||||
@ -218,7 +217,7 @@ Here are some specific guidelines for the kernel documentation:
|
||||
examples, etc.), use ``::`` for anything that doesn't really benefit
|
||||
from syntax highlighting, especially short snippets. Use
|
||||
``.. code-block:: <language>`` for longer code blocks that benefit
|
||||
from highlighting.
|
||||
from highlighting. For a short snippet of code embedded in the text, use \`\`.
|
||||
|
||||
|
||||
the C domain
|
||||
@ -242,11 +241,14 @@ The C domain of the kernel-doc has some additional features. E.g. you can
|
||||
|
||||
The func-name (e.g. ioctl) remains in the output but the ref-name changed from
|
||||
``ioctl`` to ``VIDIOC_LOG_STATUS``. The index entry for this function is also
|
||||
changed to ``VIDIOC_LOG_STATUS`` and the function can now referenced by:
|
||||
changed to ``VIDIOC_LOG_STATUS``.
|
||||
|
||||
.. code-block:: rst
|
||||
|
||||
:c:func:`VIDIOC_LOG_STATUS`
|
||||
Please note that there is no need to use ``c:func:`` to generate cross
|
||||
references to function documentation. Due to some Sphinx extension magic,
|
||||
the documentation build system will automatically turn a reference to
|
||||
``function()`` into a cross reference if an index entry for the given
|
||||
function name exists. If you see ``c:func:`` use in a kernel document,
|
||||
please feel free to remove it.
|
||||
|
||||
|
||||
list tables
|
||||
|
Some files were not shown because too many files have changed in this diff