mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
synced 2025-01-17 13:58:46 +00:00
KVM: Add documentation for VCPU requests
Signed-off-by: Andrew Jones <drjones@redhat.com> Acked-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org>
This commit is contained in:
parent
2fa6e1e12a
commit
3bb96149f2
307
Documentation/virtual/kvm/vcpu-requests.rst
Normal file
307
Documentation/virtual/kvm/vcpu-requests.rst
Normal file
@ -0,0 +1,307 @@
|
||||
=================
|
||||
KVM VCPU Requests
|
||||
=================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
KVM supports an internal API enabling threads to request a VCPU thread to
|
||||
perform some activity. For example, a thread may request a VCPU to flush
|
||||
its TLB with a VCPU request. The API consists of the following functions::
|
||||
|
||||
/* Check if any requests are pending for VCPU @vcpu. */
|
||||
bool kvm_request_pending(struct kvm_vcpu *vcpu);
|
||||
|
||||
/* Check if VCPU @vcpu has request @req pending. */
|
||||
bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
|
||||
|
||||
/* Clear request @req for VCPU @vcpu. */
|
||||
void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
|
||||
|
||||
/*
|
||||
* Check if VCPU @vcpu has request @req pending. When the request is
|
||||
* pending it will be cleared and a memory barrier, which pairs with
|
||||
* another in kvm_make_request(), will be issued.
|
||||
*/
|
||||
bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
|
||||
|
||||
/*
|
||||
* Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
|
||||
* with another in kvm_check_request(), prior to setting the request.
|
||||
*/
|
||||
void kvm_make_request(int req, struct kvm_vcpu *vcpu);
|
||||
|
||||
/* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
|
||||
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
|
||||
|
||||
Typically a requester wants the VCPU to perform the activity as soon
|
||||
as possible after making the request. This means most requests
|
||||
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
|
||||
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
|
||||
into it.
|
||||
|
||||
VCPU Kicks
|
||||
----------
|
||||
|
||||
The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
|
||||
order to perform some KVM maintenance. To do so, an IPI is sent, forcing
|
||||
a guest mode exit. However, a VCPU thread may not be in guest mode at the
|
||||
time of the kick. Therefore, depending on the mode and state of the VCPU
|
||||
thread, there are two other actions a kick may take. All three actions
|
||||
are listed below:
|
||||
|
||||
1) Send an IPI. This forces a guest mode exit.
|
||||
2) Waking a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
|
||||
mode that wait on waitqueues. Waking them removes the threads from
|
||||
the waitqueues, allowing the threads to run again. This behavior
|
||||
may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
|
||||
3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not
|
||||
sleeping, then there is nothing to do.
|
||||
|
||||
VCPU Mode
|
||||
---------
|
||||
|
||||
VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
|
||||
guest is running in guest mode or not, as well as some specific
|
||||
outside guest mode states. The architecture may use ``vcpu->mode`` to
|
||||
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
|
||||
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
|
||||
even to ensure IPI acknowledgements are waited upon (see "Waiting for
|
||||
Acknowledgements"). The following modes are defined:
|
||||
|
||||
OUTSIDE_GUEST_MODE
|
||||
|
||||
The VCPU thread is outside guest mode.
|
||||
|
||||
IN_GUEST_MODE
|
||||
|
||||
The VCPU thread is in guest mode.
|
||||
|
||||
EXITING_GUEST_MODE
|
||||
|
||||
The VCPU thread is transitioning from IN_GUEST_MODE to
|
||||
OUTSIDE_GUEST_MODE.
|
||||
|
||||
READING_SHADOW_PAGE_TABLES
|
||||
|
||||
The VCPU thread is outside guest mode, but it wants the sender of
|
||||
certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
|
||||
thread is done reading the page tables.
|
||||
|
||||
VCPU Request Internals
|
||||
======================
|
||||
|
||||
VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
|
||||
This means general bitops, like those documented in [atomic-ops]_ could
|
||||
also be used, e.g. ::
|
||||
|
||||
clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
|
||||
|
||||
However, VCPU request users should refrain from doing so, as it would
|
||||
break the abstraction. The first 8 bits are reserved for architecture
|
||||
independent requests, all additional bits are available for architecture
|
||||
dependent requests.
|
||||
|
||||
Architecture Independent Requests
|
||||
---------------------------------
|
||||
|
||||
KVM_REQ_TLB_FLUSH
|
||||
|
||||
KVM's common MMU notifier may need to flush all of a guest's TLB
|
||||
entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
|
||||
choose to use the common kvm_flush_remote_tlbs() implementation will
|
||||
need to handle this VCPU request.
|
||||
|
||||
KVM_REQ_MMU_RELOAD
|
||||
|
||||
When shadow page tables are used and memory slots are removed it's
|
||||
necessary to inform each VCPU to completely refresh the tables. This
|
||||
request is used for that.
|
||||
|
||||
KVM_REQ_PENDING_TIMER
|
||||
|
||||
This request may be made from a timer handler run on the host on behalf
|
||||
of a VCPU. It informs the VCPU thread to inject a timer interrupt.
|
||||
|
||||
KVM_REQ_UNHALT
|
||||
|
||||
This request may be made from the KVM common function kvm_vcpu_block(),
|
||||
which is used to emulate an instruction that causes a CPU to halt until
|
||||
one of an architectural specific set of events and/or interrupts is
|
||||
received (determined by checking kvm_arch_vcpu_runnable()). When that
|
||||
event or interrupt arrives kvm_vcpu_block() makes the request. This is
|
||||
in contrast to when kvm_vcpu_block() returns due to any other reason,
|
||||
such as a pending signal, which does not indicate the VCPU's halt
|
||||
emulation should stop, and therefore does not make the request.
|
||||
|
||||
KVM_REQUEST_MASK
|
||||
----------------
|
||||
|
||||
VCPU requests should be masked by KVM_REQUEST_MASK before using them with
|
||||
bitops. This is because only the lower 8 bits are used to represent the
|
||||
request's number. The upper bits are used as flags. Currently only two
|
||||
flags are defined.
|
||||
|
||||
VCPU Request Flags
|
||||
------------------
|
||||
|
||||
KVM_REQUEST_NO_WAKEUP
|
||||
|
||||
This flag is applied to requests that only need immediate attention
|
||||
from VCPUs running in guest mode. That is, sleeping VCPUs do not need
|
||||
to be awaken for these requests. Sleeping VCPUs will handle the
|
||||
requests when they are awaken later for some other reason.
|
||||
|
||||
KVM_REQUEST_WAIT
|
||||
|
||||
When requests with this flag are made with kvm_make_all_cpus_request(),
|
||||
then the caller will wait for each VCPU to acknowledge its IPI before
|
||||
proceeding. This flag only applies to VCPUs that would receive IPIs.
|
||||
If, for example, the VCPU is sleeping, so no IPI is necessary, then
|
||||
the requesting thread does not wait. This means that this flag may be
|
||||
safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
|
||||
Acknowledgements" for more information about requests with
|
||||
KVM_REQUEST_WAIT.
|
||||
|
||||
VCPU Requests with Associated State
|
||||
===================================
|
||||
|
||||
Requesters that want the receiving VCPU to handle new state need to ensure
|
||||
the newly written state is observable to the receiving VCPU thread's CPU
|
||||
by the time it observes the request. This means a write memory barrier
|
||||
must be inserted after writing the new state and before setting the VCPU
|
||||
request bit. Additionally, on the receiving VCPU thread's side, a
|
||||
corresponding read barrier must be inserted after reading the request bit
|
||||
and before proceeding to read the new state associated with it. See
|
||||
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
|
||||
[memory-barriers]_.
|
||||
|
||||
The pair of functions, kvm_check_request() and kvm_make_request(), provide
|
||||
the memory barriers, allowing this requirement to be handled internally by
|
||||
the API.
|
||||
|
||||
Ensuring Requests Are Seen
|
||||
==========================
|
||||
|
||||
When making requests to VCPUs, we want to avoid the receiving VCPU
|
||||
executing in guest mode for an arbitrary long time without handling the
|
||||
request. We can be sure this won't happen as long as we ensure the VCPU
|
||||
thread checks kvm_request_pending() before entering guest mode and that a
|
||||
kick will send an IPI to force an exit from guest mode when necessary.
|
||||
Extra care must be taken to cover the period after the VCPU thread's last
|
||||
kvm_request_pending() check and before it has entered guest mode, as kick
|
||||
IPIs will only trigger guest mode exits for VCPU threads that are in guest
|
||||
mode or at least have already disabled interrupts in order to prepare to
|
||||
enter guest mode. This means that an optimized implementation (see "IPI
|
||||
Reduction") must be certain when it's safe to not send the IPI. One
|
||||
solution, which all architectures except s390 apply, is to:
|
||||
|
||||
- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
|
||||
the last kvm_request_pending() check;
|
||||
- enable interrupts atomically when entering the guest.
|
||||
|
||||
This solution also requires memory barriers to be placed carefully in both
|
||||
the requesting thread and the receiving VCPU. With the memory barriers we
|
||||
can exclude the possibility of a VCPU thread observing
|
||||
!kvm_request_pending() on its last check and then not receiving an IPI for
|
||||
the next request made of it, even if the request is made immediately after
|
||||
the check. This is done by way of the Dekker memory barrier pattern
|
||||
(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
|
||||
this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
|
||||
them into the pattern gives::
|
||||
|
||||
CPU1 CPU2
|
||||
================= =================
|
||||
local_irq_disable();
|
||||
WRITE_ONCE(vcpu->mode, IN_GUEST_MODE); kvm_make_request(REQ, vcpu);
|
||||
smp_mb(); smp_mb();
|
||||
if (kvm_request_pending(vcpu)) { if (READ_ONCE(vcpu->mode) ==
|
||||
IN_GUEST_MODE) {
|
||||
...abort guest entry... ...send IPI...
|
||||
} }
|
||||
|
||||
As stated above, the IPI is only useful for VCPU threads in guest mode or
|
||||
that have already disabled interrupts. This is why this specific case of
|
||||
the Dekker pattern has been extended to disable interrupts before setting
|
||||
``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
|
||||
pedantically implement the memory barrier pattern, guaranteeing the
|
||||
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
|
||||
accesses.
|
||||
|
||||
IPI Reduction
|
||||
-------------
|
||||
|
||||
As only one IPI is needed to get a VCPU to check for any/all requests,
|
||||
then they may be coalesced. This is easily done by having the first IPI
|
||||
sending kick also change the VCPU mode to something !IN_GUEST_MODE. The
|
||||
transitional state, EXITING_GUEST_MODE, is used for this purpose.
|
||||
|
||||
Waiting for Acknowledgements
|
||||
----------------------------
|
||||
|
||||
Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
|
||||
be sent, and the acknowledgements to be waited upon, even when the target
|
||||
VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
|
||||
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
|
||||
is set after disabling interrupts. To support these cases, the
|
||||
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
|
||||
checking that the VCPU is IN_GUEST_MODE to checking that it is not
|
||||
OUTSIDE_GUEST_MODE.
|
||||
|
||||
Request-less VCPU Kicks
|
||||
-----------------------
|
||||
|
||||
As the determination of whether or not to send an IPI depends on the
|
||||
two-variable Dekker memory barrier pattern, then it's clear that
|
||||
request-less VCPU kicks are almost never correct. Without the assurance
|
||||
that a non-IPI generating kick will still result in an action by the
|
||||
receiving VCPU, as the final kvm_request_pending() check does for
|
||||
request-accompanying kicks, then the kick may not do anything useful at
|
||||
all. If, for instance, a request-less kick was made to a VCPU that was
|
||||
just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
|
||||
the VCPU thread may continue its entry without actually having done
|
||||
whatever it was the kick was meant to initiate.
|
||||
|
||||
One exception is x86's posted interrupt mechanism. In this case, however,
|
||||
even the request-less VCPU kick is coupled with the same
|
||||
local_irq_disable() + smp_mb() pattern described above; the ON bit
|
||||
(Outstanding Notification) in the posted interrupt descriptor takes the
|
||||
role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
|
||||
set before reading ``vcpu->mode``; dually, in the VCPU thread,
|
||||
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
|
||||
IN_GUEST_MODE.
|
||||
|
||||
Additional Considerations
|
||||
=========================
|
||||
|
||||
Sleeping VCPUs
|
||||
--------------
|
||||
|
||||
VCPU threads may need to consider requests before and/or after calling
|
||||
functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
|
||||
do or not, and, if they do, which requests need consideration, is
|
||||
architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
|
||||
to check if it should awaken. One reason to do so is to provide
|
||||
architectures a function where requests may be checked if necessary.
|
||||
|
||||
Clearing Requests
|
||||
-----------------
|
||||
|
||||
Generally it only makes sense for the receiving VCPU thread to clear a
|
||||
request. However, in some circumstances, such as when the requesting
|
||||
thread and the receiving VCPU thread are executed serially, such as when
|
||||
they are the same thread, or when they are using some form of concurrency
|
||||
control to temporarily execute synchronously, then it's possible to know
|
||||
that the request may be cleared immediately, rather than waiting for the
|
||||
receiving VCPU thread to handle the request in VCPU RUN. The only current
|
||||
examples of this are kvm_vcpu_block() calls made by VCPUs to block
|
||||
themselves. A possible side-effect of that call is to make the
|
||||
KVM_REQ_UNHALT request, which may then be cleared immediately when the
|
||||
VCPU returns from the call.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [atomic-ops] Documentation/core-api/atomic_ops.rst
|
||||
.. [memory-barriers] Documentation/memory-barriers.txt
|
||||
.. [lwn-mb] https://lwn.net/Articles/573436/
|
Loading…
x
Reference in New Issue
Block a user