2019-06-04 08:11:32 +00:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0-only */
|
2007-12-16 09:02:48 +00:00
|
|
|
#ifndef __KVM_HOST_H
|
|
|
|
#define __KVM_HOST_H
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
|
|
|
|
|
|
|
#include <linux/types.h>
|
2007-10-18 12:39:10 +00:00
|
|
|
#include <linux/hardirq.h>
|
2006-12-10 10:21:36 +00:00
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/mutex.h>
|
|
|
|
#include <linux/spinlock.h>
|
2007-05-27 07:46:52 +00:00
|
|
|
#include <linux/signal.h>
|
|
|
|
#include <linux/sched.h>
|
2021-05-18 12:00:31 +00:00
|
|
|
#include <linux/sched/stat.h>
|
2011-11-24 01:12:59 +00:00
|
|
|
#include <linux/bug.h>
|
2021-02-22 02:45:22 +00:00
|
|
|
#include <linux/minmax.h>
|
2006-12-10 10:21:36 +00:00
|
|
|
#include <linux/mm.h>
|
2011-10-10 15:46:15 +00:00
|
|
|
#include <linux/mmu_notifier.h>
|
2007-07-11 15:17:21 +00:00
|
|
|
#include <linux/preempt.h>
|
2008-11-24 06:32:53 +00:00
|
|
|
#include <linux/msi.h>
|
2010-11-09 16:02:49 +00:00
|
|
|
#include <linux/slab.h>
|
2018-05-15 11:37:37 +00:00
|
|
|
#include <linux/vmalloc.h>
|
2010-11-18 17:09:08 +00:00
|
|
|
#include <linux/rcupdate.h>
|
2011-09-12 09:26:22 +00:00
|
|
|
#include <linux/ratelimit.h>
|
2012-08-03 07:39:59 +00:00
|
|
|
#include <linux/err.h>
|
2013-01-20 23:50:22 +00:00
|
|
|
#include <linux/irqflags.h>
|
2013-05-15 23:21:38 +00:00
|
|
|
#include <linux/context_tracking.h>
|
2015-09-18 14:29:43 +00:00
|
|
|
#include <linux/irqbypass.h>
|
2020-04-24 05:48:37 +00:00
|
|
|
#include <linux/rcuwait.h>
|
2017-02-20 11:06:21 +00:00
|
|
|
#include <linux/refcount.h>
|
2019-04-11 09:16:47 +00:00
|
|
|
#include <linux/nospec.h>
|
2021-06-06 02:10:44 +00:00
|
|
|
#include <linux/notifier.h>
|
kvm: add guest_state_{enter,exit}_irqoff()
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
management. These are intended to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
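The ordering described in this commit message can be illustrated with a small userspace sketch. The helpers below are hypothetical stand-ins that only record their names, since the real guest_timing_*_irqoff() and guest_state_*_irqoff() helpers are kernel-only; run_vcpu(), handle_pending_irqs(), and vcpu_enter_exit_sketch() are assumed names for illustration and do not come from this header.

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the stand-in helpers are called. */
static char trace_log[128];

static void record(const char *step)
{
	/* Room for the step name, the ';' separator, and the NUL byte. */
	assert(strlen(trace_log) + strlen(step) + 2 <= sizeof(trace_log));
	strcat(trace_log, step);
	strcat(trace_log, ";");
}

/* Hypothetical userspace stubs; the kernel versions do the real work. */
static void guest_timing_enter_irqoff(void) { record("timing_enter"); }
static void guest_state_enter_irqoff(void)  { record("state_enter"); }
static void run_vcpu(void)                  { record("run_vcpu"); }
static void guest_state_exit_irqoff(void)   { record("state_exit"); }
static void handle_pending_irqs(void)       { record("pending_irqs"); }
static void guest_timing_exit_irqoff(void)  { record("timing_exit"); }

/*
 * Timing accounting forms the outer bracket, state tracking the inner
 * one. Pending IRQs are taken after the state exit but before the
 * timing exit, so they fall inside the guest timing window.
 */
const char *vcpu_enter_exit_sketch(void)
{
	trace_log[0] = '\0';

	guest_timing_enter_irqoff();
	guest_state_enter_irqoff();
	run_vcpu();
	guest_state_exit_irqoff();
	handle_pending_irqs();
	guest_timing_exit_irqoff();

	return trace_log;
}
```

Calling vcpu_enter_exit_sketch() returns the recorded call order as a single string, which makes the nesting of the timing and state brackets easy to check by eye.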
2022-02-01 13:29:22 +00:00
|
|
|
#include <linux/ftrace.h>
|
2021-12-06 19:54:27 +00:00
|
|
|
#include <linux/hashtable.h>
|
2022-02-01 13:29:22 +00:00
|
|
|
#include <linux/instrumentation.h>
|
2021-12-06 19:54:28 +00:00
|
|
|
#include <linux/interval_tree.h>
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
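The "two sets, one copy of the memslot data" idea from this commit message can be sketched in plain C. This is a simplified, single-threaded illustration under stated assumptions: the sets are small arrays of pointers rather than the rbtree the commit describes, every name prefixed with sketch_ is hypothetical, and the real code must also wait for readers of the old set before touching it.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NR_SLOTS 4

struct sketch_memslot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

struct sketch_vm {
	/* Two sets of pointers that normally refer to the same slot objects. */
	struct sketch_memslot *sets[2][SKETCH_NR_SLOTS];
	int active;	/* index of the set the VM currently runs on */
};

/* Change the flags of slot 'id'. Returns 0 on success, -1 on failure. */
int sketch_set_flags(struct sketch_vm *vm, int id, unsigned int flags)
{
	int inactive = 1 - vm->active;
	struct sketch_memslot *old, *copy;

	assert(id >= 0 && id < SKETCH_NR_SLOTS);

	old = vm->sets[vm->active][id];
	if (!old)
		return -1;

	/* Copy only the slot being modified, not the whole set. */
	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;
	*copy = *old;
	copy->flags = flags;

	/* Desynchronize: the inactive set now refers to the copy. */
	vm->sets[inactive][id] = copy;

	/* Switch the VM over to the updated set. */
	vm->active = inactive;

	/* Resynchronize the previously active set, then drop the old slot. */
	vm->sets[1 - vm->active][id] = copy;
	free(old);

	return 0;
}
```

After the call both sets point at the same updated slot again, and only one slot-sized allocation was needed regardless of how many slots exist.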
2021-12-06 19:54:30 +00:00
|
|
|
#include <linux/rbtree.h>
|
2021-11-16 16:04:01 +00:00
|
|
|
#include <linux/xarray.h>
|
Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-20 21:22:52 +00:00
|
|
|
#include <asm/signal.h>
|
2006-12-10 10:21:36 +00:00
|
|
|
|
|
|
|
#include <linux/kvm.h>
|
2007-02-19 12:37:47 +00:00
|
|
|
#include <linux/kvm_para.h>
|
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-12-16 09:02:48 +00:00
|
|
|
#include <linux/kvm_types.h>
|
2007-12-03 21:30:23 +00:00
|
|
|
|
2007-12-16 09:02:48 +00:00
|
|
|
#include <asm/kvm_host.h>
|
2020-10-01 01:22:22 +00:00
|
|
|
#include <linux/kvm_dirty_ring.h>
|
2007-12-14 01:41:22 +00:00
|
|
|
|
2021-09-13 13:57:44 +00:00
|
|
|
#ifndef KVM_MAX_VCPU_IDS
|
|
|
|
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
|
2016-05-09 16:13:37 +00:00
|
|
|
#endif
|
|
|
|
|
2012-08-21 02:58:45 +00:00
|
|
|
/*
|
2022-12-02 10:50:10 +00:00
|
|
|
* The bit 16 ~ bit 31 of kvm_userspace_memory_region::flags are internally
|
|
|
|
* used in kvm, other bits are visible for userspace which are defined in
|
2012-08-21 02:58:45 +00:00
|
|
|
* include/uapi/linux/kvm.h.
|
|
|
|
*/
|
|
|
|
#define KVM_MEMSLOT_INVALID (1UL << 16)
|
|
|
|
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of an awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 21:01:14 +00:00
|
|
|
/*
|
2019-02-05 21:01:18 +00:00
|
|
|
* Bit 63 of the memslot generation number is an "update in-progress flag",
|
2023-02-23 05:28:51 +00:00
|
|
|
* e.g. is temporarily set for the duration of kvm_swap_active_memslots().
|
2019-02-05 21:01:14 +00:00
|
|
|
* This flag effectively creates a unique generation number that is used to
|
|
|
|
* mark cached memslot data, e.g. MMIO accesses, as potentially being stale,
|
|
|
|
* i.e. may (or may not) have come from the previous memslots generation.
|
|
|
|
*
|
|
|
|
* This is necessary because the actual memslots update is not atomic with
|
|
|
|
* respect to the generation number update. Updating the generation number
|
|
|
|
* first would allow a vCPU to cache a spte from the old memslots using the
|
|
|
|
* new generation number, and updating the generation number after switching
|
|
|
|
* to the new memslots would allow cache hits using the old generation number
|
|
|
|
* to reference the defunct memslots.
|
|
|
|
*
|
|
|
|
* This mechanism is used to prevent getting hits in KVM's caches while a
|
|
|
|
* memslot update is in-progress, and to prevent cache hits *after* updating
|
|
|
|
* the actual generation number against accesses that were inserted into the
|
|
|
|
* cache *before* the memslots were updated.
|
|
|
|
*/
|
2019-02-05 21:01:18 +00:00
|
|
|
#define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS BIT_ULL(63)
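The comment above describes a two-step protocol: set the flag before the memslots are swapped, then clear it and advance the generation afterwards. The following is a minimal user-space sketch of that idea, not KVM's implementation. The `demo_*` names and the cache entry type are hypothetical, and the increment of one is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS, i.e. bit 63. */
#define DEMO_GEN_UPDATE_IN_PROGRESS (1ULL << 63)

/* Hypothetical cache entry tagged with the generation it was filled under. */
struct demo_cache_entry {
	uint64_t gen;
	bool valid;
};

/* Step 1: mark the generation as "in flux" before swapping memslots. */
static uint64_t demo_gen_begin_update(uint64_t gen)
{
	return gen | DEMO_GEN_UPDATE_IN_PROGRESS;
}

/* Step 2: once the new memslots are visible, clear the flag and advance. */
static uint64_t demo_gen_end_update(uint64_t gen)
{
	return (gen & ~DEMO_GEN_UPDATE_IN_PROGRESS) + 1;
}

/* Refuse to cache anything while an update is in progress. */
static void demo_cache_fill(struct demo_cache_entry *e, uint64_t cur_gen)
{
	e->valid = !(cur_gen & DEMO_GEN_UPDATE_IN_PROGRESS);
	e->gen = cur_gen;
}

/* A hit requires a stable current generation that matches the entry. */
static bool demo_cache_hit(const struct demo_cache_entry *e, uint64_t cur_gen)
{
	if (cur_gen & DEMO_GEN_UPDATE_IN_PROGRESS)
		return false;
	return e->valid && e->gen == cur_gen;
}

/* Small self-check walking through one update cycle. */
static void demo_gen_selfcheck(void)
{
	struct demo_cache_entry e;
	uint64_t gen = 4;

	demo_cache_fill(&e, gen);
	assert(demo_cache_hit(&e, gen));
	gen = demo_gen_begin_update(gen);
	assert(!demo_cache_hit(&e, gen));
	gen = demo_gen_end_update(gen);
	assert(!demo_cache_hit(&e, gen));
}
```

An entry cached before the update misses both while the flag is set and after the generation advances, which are the two windows the comment calls out.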
|
2019-02-05 21:01:14 +00:00
|
|
|
|
2012-10-24 06:07:59 +00:00
|
|
|
/* Two fragments for an MMIO access that crosses a page boundary. */
|
|
|
|
#define KVM_MAX_MMIO_FRAGMENTS 2
|
2012-04-18 16:22:47 +00:00
|
|
|
|
2023-10-27 18:22:04 +00:00
|
|
|
#ifndef KVM_MAX_NR_ADDRESS_SPACES
|
|
|
|
#define KVM_MAX_NR_ADDRESS_SPACES 1
|
2015-05-17 15:30:37 +00:00
|
|
|
#endif
|
|
|
|
|
2012-08-03 07:43:51 +00:00
|
|
|
/*
|
|
|
|
* For the normal pfn, the highest 12 bits should be zero,
|
2012-10-16 12:10:59 +00:00
|
|
|
* so we can mask bit 62 ~ bit 52 to indicate the error pfn,
|
|
|
|
* mask bit 63 to indicate the noslot pfn.
|
2012-08-03 07:43:51 +00:00
|
|
|
*/
|
2012-10-16 12:10:59 +00:00
|
|
|
#define KVM_PFN_ERR_MASK (0x7ffULL << 52)
|
|
|
|
#define KVM_PFN_ERR_NOSLOT_MASK (0xfffULL << 52)
|
|
|
|
#define KVM_PFN_NOSLOT (0x1ULL << 63)
|
2012-08-03 07:43:51 +00:00
|
|
|
|
|
|
|
#define KVM_PFN_ERR_FAULT (KVM_PFN_ERR_MASK)
|
|
|
|
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
|
2012-10-16 12:10:59 +00:00
|
|
|
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
|
2022-10-11 19:58:07 +00:00
|
|
|
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
|
2024-10-10 18:23:18 +00:00
|
|
|
#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
|
2012-08-03 07:37:54 +00:00
|
|
|
|
2012-10-16 12:10:59 +00:00
|
|
|
/*
|
|
|
|
* error pfns indicate that the gfn is in slot but failed to
|
|
|
|
* translate it to pfn on host.
|
|
|
|
*/
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
static inline bool is_error_pfn(kvm_pfn_t pfn)
|
2012-08-03 07:39:59 +00:00
|
|
|
{
|
2012-08-03 07:43:51 +00:00
|
|
|
return !!(pfn & KVM_PFN_ERR_MASK);
|
2012-08-03 07:39:59 +00:00
|
|
|
}
|
|
|
|
|
2022-10-11 19:58:07 +00:00
|
|
|
/*
|
|
|
|
* KVM_PFN_ERR_SIGPENDING indicates that fetching the PFN was interrupted
|
|
|
|
* by a pending signal. Note, the signal may or may not be fatal.
|
|
|
|
*/
|
|
|
|
static inline bool is_sigpending_pfn(kvm_pfn_t pfn)
|
|
|
|
{
|
|
|
|
return pfn == KVM_PFN_ERR_SIGPENDING;
|
|
|
|
}
|
|
|
|
|
2012-10-16 12:10:59 +00:00
|
|
|
/*
|
|
|
|
* error_noslot pfns indicate that the gfn can not be
|
|
|
|
* translated to pfn - it is not in slot or failed to
|
|
|
|
* translate it to pfn.
|
|
|
|
*/
|
2016-01-16 00:56:11 +00:00
|
|
|
static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
|
2012-08-03 07:39:59 +00:00
|
|
|
{
|
2012-10-16 12:10:59 +00:00
|
|
|
return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
|
2012-08-03 07:39:59 +00:00
|
|
|
}
|
|
|
|
|
2012-10-16 12:10:59 +00:00
|
|
|
/* noslot pfn indicates that the gfn is not in slot. */
|
2016-01-16 00:56:11 +00:00
|
|
|
static inline bool is_noslot_pfn(kvm_pfn_t pfn)
|
2012-08-03 07:39:59 +00:00
|
|
|
{
|
2012-10-16 12:10:59 +00:00
|
|
|
return pfn == KVM_PFN_NOSLOT;
|
2012-08-03 07:39:59 +00:00
|
|
|
}
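The three predicates above differ only in which high bits they test. As a stand-alone sketch, the constants below are copied from the definitions above, while the `demo_` names and the `uint64_t` typedef are stand-ins chosen so the snippet compiles outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_pfn_t;	/* stand-in for kvm_pfn_t */

#define DEMO_PFN_ERR_MASK		(0x7ffULL << 52)	/* bits 52..62 */
#define DEMO_PFN_ERR_NOSLOT_MASK	(0xfffULL << 52)	/* bits 52..63 */
#define DEMO_PFN_NOSLOT			(0x1ULL << 63)		/* bit 63 only */
#define DEMO_PFN_ERR_FAULT		(DEMO_PFN_ERR_MASK)
#define DEMO_PFN_ERR_HWPOISON		(DEMO_PFN_ERR_MASK + 1)

/* The gfn is in a slot, but translating it to a host pfn failed. */
static bool demo_is_error_pfn(demo_pfn_t pfn)
{
	return !!(pfn & DEMO_PFN_ERR_MASK);
}

/* Either an error pfn or the noslot pfn. */
static bool demo_is_error_noslot_pfn(demo_pfn_t pfn)
{
	return !!(pfn & DEMO_PFN_ERR_NOSLOT_MASK);
}

/* The gfn is not covered by any slot. */
static bool demo_is_noslot_pfn(demo_pfn_t pfn)
{
	return pfn == DEMO_PFN_NOSLOT;
}

/* Self-check: the noslot pfn is not an "error" pfn, but it is "error or noslot". */
static void demo_pfn_selfcheck(void)
{
	assert(!demo_is_error_noslot_pfn(0x12345));
	assert(demo_is_error_pfn(DEMO_PFN_ERR_HWPOISON));
	assert(!demo_is_error_pfn(DEMO_PFN_NOSLOT));
	assert(demo_is_error_noslot_pfn(DEMO_PFN_NOSLOT));
}
```

Because the error codes are `KVM_PFN_ERR_MASK` plus a small offset, every error value keeps bits 52 to 62 set, so a single mask test identifies all of them.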
|
|
|
|
|
2013-07-26 13:04:07 +00:00
|
|
|
/*
|
|
|
|
* architectures with KVM_HVA_ERR_BAD other than PAGE_OFFSET (e.g. s390)
|
|
|
|
* provide their own defines and kvm_is_error_hva()
|
|
|
|
*/
|
|
|
|
#ifndef KVM_HVA_ERR_BAD
|
|
|
|
|
2012-08-21 03:02:22 +00:00
|
|
|
#define KVM_HVA_ERR_BAD (PAGE_OFFSET)
|
|
|
|
#define KVM_HVA_ERR_RO_BAD (PAGE_OFFSET + PAGE_SIZE)
|
2012-08-21 03:01:50 +00:00
|
|
|
|
|
|
|
static inline bool kvm_is_error_hva(unsigned long addr)
|
|
|
|
{
|
2012-08-21 03:02:22 +00:00
|
|
|
return addr >= PAGE_OFFSET;
|
2012-08-21 03:01:50 +00:00
|
|
|
}
|
|
|
|
|
2013-07-26 13:04:07 +00:00
|
|
|
#endif
|
|
|
|
|
2024-02-15 15:29:04 +00:00
|
|
|
static inline bool kvm_is_error_gpa(gpa_t gpa)
|
|
|
|
{
|
|
|
|
return gpa == INVALID_GPA;
|
|
|
|
}
|
|
|
|
|
2017-04-26 20:32:22 +00:00
|
|
|
#define KVM_REQUEST_MASK GENMASK(7,0)
|
|
|
|
#define KVM_REQUEST_NO_WAKEUP BIT(8)
|
2017-04-27 12:33:43 +00:00
|
|
|
#define KVM_REQUEST_WAIT BIT(9)
|
2022-02-23 16:53:02 +00:00
|
|
|
#define KVM_REQUEST_NO_ACTION BIT(10)
|
2007-06-07 16:18:30 +00:00
|
|
|
/*
|
2016-01-07 14:05:10 +00:00
|
|
|
* Architecture-independent vcpu->requests bit members
|
2022-09-21 00:32:01 +00:00
|
|
|
* Bits 4-7 are reserved for more arch-independent bits.
|
2007-06-07 16:18:30 +00:00
|
|
|
*/
|
2022-11-10 10:49:08 +00:00
|
|
|
#define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
#define KVM_REQ_UNBLOCK 2
|
|
|
|
#define KVM_REQ_DIRTY_RING_SOFT_FULL 3
|
|
|
|
#define KVM_REQUEST_ARCH_BASE 8
|
2017-06-04 12:43:51 +00:00
|
|
|
|
2022-02-23 16:53:02 +00:00
|
|
|
/*
|
|
|
|
* KVM_REQ_OUTSIDE_GUEST_MODE exists purely as a way to force the vCPU to
|
|
|
|
* OUTSIDE_GUEST_MODE. KVM_REQ_OUTSIDE_GUEST_MODE differs from a vCPU "kick"
|
|
|
|
* in that it ensures the vCPU has reached OUTSIDE_GUEST_MODE before continuing
|
|
|
|
* on. A kick only guarantees that the vCPU is on its way out, e.g. a previous
|
|
|
|
* kick may have set vcpu->mode to EXITING_GUEST_MODE, and so there's no
|
|
|
|
* guarantee the vCPU received an IPI and has actually exited guest mode.
|
|
|
|
*/
|
|
|
|
#define KVM_REQ_OUTSIDE_GUEST_MODE (KVM_REQUEST_NO_ACTION | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
|
2017-06-04 12:43:51 +00:00
|
|
|
#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
|
2019-12-09 18:31:43 +00:00
|
|
|
BUILD_BUG_ON((unsigned)(nr) >= (sizeof_field(struct kvm_vcpu, requests) * 8) - KVM_REQUEST_ARCH_BASE); \
|
2017-06-04 12:43:51 +00:00
|
|
|
(unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
|
|
|
|
})
|
|
|
|
#define KVM_ARCH_REQ(nr) KVM_ARCH_REQ_FLAGS(nr, 0)
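A request value packs a bit index into its low 8 bits (`KVM_REQUEST_MASK`) and behaviour flags into the bits above. The sketch below reproduces that layout in plain C. The `demo_` helpers are hypothetical, and the compile-time bounds check done by `BUILD_BUG_ON()` in the real macro is left out.

```c
#include <assert.h>

#define DEMO_REQUEST_MASK	0xffu		/* GENMASK(7,0) */
#define DEMO_REQUEST_NO_WAKEUP	(1u << 8)	/* BIT(8) */
#define DEMO_REQUEST_WAIT	(1u << 9)	/* BIT(9) */
#define DEMO_REQUEST_ARCH_BASE	8u

/* Same arithmetic as KVM_ARCH_REQ_FLAGS(), without the build-time check. */
static unsigned int demo_arch_req_flags(unsigned int nr, unsigned int flags)
{
	return (nr + DEMO_REQUEST_ARCH_BASE) | flags;
}

/* Bit index of the request; assumed here to be the index into vcpu->requests. */
static unsigned int demo_req_bit(unsigned int req)
{
	return req & DEMO_REQUEST_MASK;
}

/* Behaviour flags carried alongside the bit index. */
static unsigned int demo_req_flags(unsigned int req)
{
	return req & ~DEMO_REQUEST_MASK;
}

/* Self-check: arch request 3 with the WAIT flag lands on bit 11. */
static void demo_req_selfcheck(void)
{
	unsigned int req = demo_arch_req_flags(3, DEMO_REQUEST_WAIT);

	assert(demo_req_bit(req) == 11);
	assert(demo_req_flags(req) == DEMO_REQUEST_WAIT);
}
```

Arch requests start at bit 8, which keeps bits 0 to 7 free for the architecture-independent requests defined above.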
|
2016-01-07 14:00:53 +00:00
|
|
|
|
2021-07-02 22:04:24 +00:00
|
|
|
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
|
2021-09-03 07:51:41 +00:00
|
|
|
unsigned long *vcpu_bitmap);
|
2021-07-02 22:04:24 +00:00
|
|
|
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
|
|
|
|
|
2012-09-21 17:58:03 +00:00
|
|
|
#define KVM_USERSPACE_IRQ_SOURCE_ID 0
|
|
|
|
#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1
|
2008-10-15 12:15:06 +00:00
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
extern struct mutex kvm_lock;
|
2013-04-05 19:20:30 +00:00
|
|
|
extern struct list_head vm_list;
|
|
|
|
|
2011-07-27 13:00:48 +00:00
|
|
|
struct kvm_io_range {
|
|
|
|
gpa_t addr;
|
|
|
|
int len;
|
|
|
|
struct kvm_io_device *dev;
|
|
|
|
};
|
|
|
|
|
2012-03-09 04:17:40 +00:00
|
|
|
#define NR_IOBUS_DEVS 1000
|
2012-03-09 04:17:32 +00:00
|
|
|
|
2007-05-31 18:08:53 +00:00
|
|
|
struct kvm_io_bus {
|
2013-05-24 22:44:15 +00:00
|
|
|
int dev_count;
|
|
|
|
int ioeventfd_count;
|
2012-03-09 04:17:32 +00:00
|
|
|
struct kvm_io_range range[];
|
2007-05-31 18:08:53 +00:00
|
|
|
};
|
|
|
|
|
2009-12-23 16:35:24 +00:00
|
|
|
enum kvm_bus {
|
|
|
|
KVM_MMIO_BUS,
|
|
|
|
KVM_PIO_BUS,
|
2013-02-28 11:33:19 +00:00
|
|
|
KVM_VIRTIO_CCW_NOTIFY_BUS,
|
KVM: VMX: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.
Unfortunately, this only works if userspace does not need to match on
access length and data. The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with length = 0
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write, this
way we don't need to touch all devices to teach them to handle
an invalid length)
At the moment, this optimization only has effect for EPT on x86.
It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.
With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measurable slowdown to non-eventfd MMIO.
Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.
The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-03-31 18:50:44 +00:00
|
|
|
KVM_FAST_MMIO_BUS,
|
2024-11-13 08:18:26 +00:00
|
|
|
KVM_IOCSR_BUS,
|
2009-12-23 16:35:24 +00:00
|
|
|
KVM_NR_BUSES
|
|
|
|
};
|
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
2009-12-23 16:35:24 +00:00
|
|
|
int len, const void *val);
|
2015-03-26 14:39:28 +00:00
|
|
|
int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr, int len, const void *val, long cookie);
|
|
|
|
int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
|
|
|
int len, void *val);
|
2011-07-27 13:00:48 +00:00
|
|
|
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
|
|
|
|
int len, struct kvm_io_device *dev);
|
2021-04-12 22:20:49 +00:00
|
|
|
int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
struct kvm_io_device *dev);
|
2016-07-15 11:43:26 +00:00
|
|
|
struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr);
|
2007-05-31 18:08:53 +00:00
|
|
|
|
2010-10-14 09:22:46 +00:00
|
|
|
#ifdef CONFIG_KVM_ASYNC_PF
|
|
|
|
struct kvm_async_pf {
|
|
|
|
struct work_struct work;
|
|
|
|
struct list_head link;
|
|
|
|
struct list_head queue;
|
|
|
|
struct kvm_vcpu *vcpu;
|
2019-12-06 23:57:14 +00:00
|
|
|
gpa_t cr2_or_gpa;
|
2010-10-14 09:22:46 +00:00
|
|
|
unsigned long addr;
|
|
|
|
struct kvm_arch_async_pf arch;
|
2013-10-14 14:22:33 +00:00
|
|
|
bool wakeup_all;
|
2020-06-10 17:55:32 +00:00
|
|
|
bool notpresent_injected;
|
2010-10-14 09:22:46 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
|
2020-06-15 12:13:34 +00:00
|
|
|
bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
|
|
|
|
unsigned long hva, struct kvm_arch_async_pf *arch);
|
2010-10-14 09:22:50 +00:00
|
|
|
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
|
2010-10-14 09:22:46 +00:00
|
|
|
#endif
|
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2023-07-29 00:41:44 +00:00
|
|
|
union kvm_mmu_notifier_arg {
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
unsigned long attributes;
|
2023-07-29 00:41:44 +00:00
|
|
|
};
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
struct kvm_gfn_range {
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
gfn_t start;
|
|
|
|
gfn_t end;
|
2023-07-29 00:41:44 +00:00
|
|
|
union kvm_mmu_notifier_arg arg;
|
2021-04-02 00:56:50 +00:00
|
|
|
bool may_block;
|
|
|
|
};
|
|
|
|
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
|
|
|
|
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
|
|
|
|
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
|
2021-03-26 02:19:47 +00:00
|
|
|
#endif
|
|
|
|
|
2011-01-12 07:40:31 +00:00
|
|
|
enum {
|
|
|
|
OUTSIDE_GUEST_MODE,
|
|
|
|
IN_GUEST_MODE,
|
2012-05-14 12:44:06 +00:00
|
|
|
EXITING_GUEST_MODE,
|
|
|
|
READING_SHADOW_PAGE_TABLES,
|
2011-01-12 07:40:31 +00:00
|
|
|
};
|
|
|
|
|
2019-01-31 20:24:34 +00:00
|
|
|
struct kvm_host_map {
|
|
|
|
/*
|
|
|
|
* Only valid if the 'pfn' is managed by the host kernel (i.e. there is
|
|
|
|
* a 'struct page' for it. When using the mem= kernel parameter, some memory
|
|
|
|
* can be used as guest memory but is not managed by the host
|
|
|
|
* kernel).
|
|
|
|
*/
|
2024-10-10 18:23:33 +00:00
|
|
|
struct page *pinned_page;
|
2019-01-31 20:24:34 +00:00
|
|
|
struct page *page;
|
|
|
|
void *hva;
|
|
|
|
kvm_pfn_t pfn;
|
|
|
|
kvm_pfn_t gfn;
|
2024-10-10 18:23:35 +00:00
|
|
|
bool writable;
|
2019-01-31 20:24:34 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Used to check if the mapping is valid or not. Never use 'kvm_host_map'
|
|
|
|
* directly to check for that.
|
|
|
|
*/
|
|
|
|
static inline bool kvm_vcpu_mapped(struct kvm_host_map *map)
|
|
|
|
{
|
|
|
|
return !!map->hva;
|
|
|
|
}
|
|
|
|
|
2021-05-18 12:00:31 +00:00
|
|
|
static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
|
|
|
|
{
|
|
|
|
return single_task_running() && !need_resched() && ktime_before(cur, stop);
|
|
|
|
}
|
|
|
|
|
2012-04-18 16:22:47 +00:00
|
|
|
/*
|
|
|
|
* Sometimes a large or cross-page mmio needs to be broken up into separate
|
|
|
|
* exits for userspace servicing.
|
|
|
|
*/
|
|
|
|
struct kvm_mmio_fragment {
|
|
|
|
gpa_t gpa;
|
|
|
|
void *data;
|
|
|
|
unsigned len;
|
|
|
|
};
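To illustrate why two fragments are enough for an access that straddles a page boundary, here is a minimal sketch that splits a guest-physical range at page boundaries. It is not the kernel's code. `demo_split_mmio()`, the 4 KiB page size and the `demo_` struct are assumptions for illustration, and any further splitting KVM may do for oversized accesses is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE		4096u
#define DEMO_MAX_FRAGMENTS	2

/* Stand-in for struct kvm_mmio_fragment. */
struct demo_mmio_fragment {
	uint64_t gpa;
	void *data;
	unsigned int len;
};

/*
 * Split [gpa, gpa + len) at page boundaries. Returns the number of fragments
 * written, or -1 if more than DEMO_MAX_FRAGMENTS would be needed.
 */
static int demo_split_mmio(uint64_t gpa, void *data, unsigned int len,
			   struct demo_mmio_fragment frags[DEMO_MAX_FRAGMENTS])
{
	unsigned char *p = data;
	int n = 0;

	while (len) {
		/* Bytes left before the end of the current page. */
		unsigned int room = DEMO_PAGE_SIZE -
				    (unsigned int)(gpa & (DEMO_PAGE_SIZE - 1));
		unsigned int chunk = len < room ? len : room;

		if (n == DEMO_MAX_FRAGMENTS)
			return -1;

		frags[n].gpa = gpa;
		frags[n].data = p;
		frags[n].len = chunk;
		n++;

		gpa += chunk;
		p += chunk;
		len -= chunk;
	}
	return n;
}

/* Self-check: an 8-byte access ending 4 bytes into the next page. */
static void demo_mmio_selfcheck(void)
{
	struct demo_mmio_fragment frags[DEMO_MAX_FRAGMENTS];
	unsigned char buf[8];

	assert(demo_split_mmio(0x1ffc, buf, sizeof(buf), frags) == 2);
	assert(frags[0].len == 4 && frags[1].gpa == 0x2000);
}
```

An access no larger than a page can touch at most two pages, which is consistent with `KVM_MAX_MMIO_FRAGMENTS` being 2.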
|
|
|
|
|
2007-12-14 01:45:31 +00:00
|
|
|
struct kvm_vcpu {
|
|
|
|
struct kvm *kvm;
|
2008-01-28 23:42:34 +00:00
|
|
|
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
2007-12-14 01:45:31 +00:00
|
|
|
struct preempt_notifier preempt_notifier;
|
2008-01-28 23:42:34 +00:00
|
|
|
#endif
|
2011-01-12 07:40:31 +00:00
|
|
|
int cpu;
|
2019-11-07 12:53:42 +00:00
|
|
|
int vcpu_id; /* id given by userspace at creation */
|
2023-02-02 08:13:42 +00:00
|
|
|
int vcpu_idx; /* index into kvm->vcpu_array */
|
2022-04-15 00:43:43 +00:00
|
|
|
int ____srcu_idx; /* Don't use this directly. You've been warned. */
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
int srcu_depth;
|
|
|
|
#endif
|
2011-01-12 07:40:31 +00:00
|
|
|
int mode;
|
2018-07-10 09:27:19 +00:00
|
|
|
u64 requests;
|
2008-12-15 12:52:10 +00:00
|
|
|
unsigned long guest_debug;
|
2011-01-12 07:40:31 +00:00
|
|
|
|
|
|
|
struct mutex mutex;
|
|
|
|
struct kvm_run *run;
|
2009-12-23 16:35:25 +00:00
|
|
|
|
2021-10-09 02:11:57 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_WQP
|
2020-04-24 05:48:37 +00:00
|
|
|
struct rcuwait wait;
|
2021-10-09 02:11:57 +00:00
|
|
|
#endif
|
2024-08-02 20:01:36 +00:00
|
|
|
struct pid *pid;
|
|
|
|
rwlock_t pid_lock;
|
2007-12-14 01:45:31 +00:00
|
|
|
int sigset_active;
|
|
|
|
sigset_t sigset;
|
2015-09-03 14:07:37 +00:00
|
|
|
unsigned int halt_poll_ns;
|
2016-05-13 10:16:35 +00:00
|
|
|
bool valid_wakeup;
|
2007-12-14 01:45:31 +00:00
|
|
|
|
2007-10-20 07:34:38 +00:00
|
|
|
#ifdef CONFIG_HAS_IOMEM
|
2007-12-14 01:45:31 +00:00
|
|
|
int mmio_needed;
|
|
|
|
int mmio_read_completed;
|
|
|
|
int mmio_is_write;
|
2012-04-18 16:22:47 +00:00
|
|
|
int mmio_cur_fragment;
|
|
|
|
int mmio_nr_fragments;
|
|
|
|
struct kvm_mmio_fragment mmio_fragments[KVM_MAX_MMIO_FRAGMENTS];
|
2007-10-20 07:34:38 +00:00
|
|
|
#endif
|
2007-04-19 14:27:43 +00:00
|
|
|
|
2010-10-14 09:22:46 +00:00
|
|
|
#ifdef CONFIG_KVM_ASYNC_PF
|
|
|
|
struct {
|
|
|
|
u32 queued;
|
|
|
|
struct list_head queue;
|
|
|
|
struct list_head done;
|
|
|
|
spinlock_t lock;
|
|
|
|
} async_pf;
|
|
|
|
#endif
|
|
|
|
|
2012-07-18 13:37:46 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
|
|
|
|
/*
|
|
|
|
* Cpu relax intercept or pause loop exit optimization
|
|
|
|
* in_spin_loop: set when a vcpu does a pause loop exit
|
|
|
|
* or cpu relax intercepted.
|
|
|
|
* dy_eligible: indicates whether vcpu is eligible for directed yield.
|
|
|
|
*/
|
|
|
|
struct {
|
|
|
|
bool in_spin_loop;
|
|
|
|
bool dy_eligible;
|
|
|
|
} spin_loop;
|
|
|
|
#endif
|
2024-05-03 18:17:32 +00:00
|
|
|
bool wants_to_run;
|
2013-03-04 18:02:07 +00:00
|
|
|
bool preempted;
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
bool ready;
|
KVM: Add a flag to track if a loaded vCPU is scheduled out
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
being scheduled out (vCPU put path), or if the vCPU is being reloaded
after being scheduled out (vCPU load path). In the short term, this will
allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
during kvm_arch_vcpu_load().
Longer term, scheduled_out opens up other potential optimizations, without
creating subtle/brittle dependencies. E.g. it allows KVM to keep guest
state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
exact ordering of calls into arch code.
Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
hook, e.g. if arch code needs to do something novel when putting vCPU
state.
And even if KVM never uses scheduled_out for anything beyond dropping
kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
it worth adding the flag.
Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
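
The load/put behaviour described in this commit message can be modelled in user space. The following is a minimal sketch, not the kernel implementation: every `demo_` name is hypothetical, and it assumes the flag is set before the put on the schedule-out path and cleared after the load on the schedule-in path, as the message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu; only the flag is modelled. */
struct demo_vcpu {
	bool scheduled_out;
	int sched_in_loads;	/* loads that followed a schedule-out */
	int fresh_loads;	/* loads from any other path */
};

/* Models arch code: query the flag at load time instead of using a separate hook. */
static void demo_arch_vcpu_load(struct demo_vcpu *vcpu)
{
	assert(vcpu);
	if (vcpu->scheduled_out)
		vcpu->sched_in_loads++;
	else
		vcpu->fresh_loads++;
}

/* Models arch code on the put path; nothing to do in this sketch. */
static void demo_arch_vcpu_put(struct demo_vcpu *vcpu)
{
	assert(vcpu);
}

/* Models the schedule-out path: the flag is visible while state is being put. */
static void demo_sched_out(struct demo_vcpu *vcpu)
{
	vcpu->scheduled_out = true;
	demo_arch_vcpu_put(vcpu);
}

/* Models the schedule-in path: the flag is still set while state is reloaded. */
static void demo_sched_in(struct demo_vcpu *vcpu)
{
	demo_arch_vcpu_load(vcpu);
	vcpu->scheduled_out = false;
}
```

In this model a load that follows `demo_sched_out()` is distinguishable from any other load without a dedicated sched-in hook, which is the property the commit relies on.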
2024-05-22 01:40:08 +00:00
|
|
|
bool scheduled_out;
|
2007-12-14 01:41:22 +00:00
|
|
|
struct kvm_vcpu_arch arch;
|
2021-06-18 22:27:06 +00:00
|
|
|
struct kvm_vcpu_stat stat;
|
|
|
|
char stats_id[KVM_STATS_NAME_SIZE];
|
2020-10-01 01:22:22 +00:00
|
|
|
struct kvm_dirty_ring dirty_ring;
|
2021-08-04 22:28:40 +00:00
|
|
|
|
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
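
The per-vCPU "last used slot" mini-cache mentioned above can be sketched as a pointer tagged with the generation of the slot set it was looked up in. This is an illustrative user-space model under stated assumptions: the `demo_` types and functions are hypothetical, and a linear scan stands in for the rbtree walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; the real structures carry far more state. */
struct demo_slot {
	uint64_t base_gfn;
	uint64_t npages;
};

struct demo_slots {
	uint64_t generation;	/* bumped on every memslot update */
	struct demo_slot *slots;
	size_t nr;
};

struct demo_vcpu_cache {
	struct demo_slot *last_used_slot;
	uint64_t last_used_slot_gen;
};

static int demo_slot_contains(const struct demo_slot *s, uint64_t gfn)
{
	return gfn >= s->base_gfn && gfn < s->base_gfn + s->npages;
}

/* Slow path: a linear scan stands in for the gfn tree lookup. */
static struct demo_slot *demo_search(struct demo_slots *slots, uint64_t gfn)
{
	for (size_t i = 0; i < slots->nr; i++)
		if (demo_slot_contains(&slots->slots[i], gfn))
			return &slots->slots[i];
	return NULL;
}

static struct demo_slot *demo_gfn_to_slot(struct demo_vcpu_cache *c,
					  struct demo_slots *slots, uint64_t gfn)
{
	struct demo_slot *s;

	assert(c && slots);

	/* A generation mismatch means the cached pointer may be stale. */
	if (c->last_used_slot_gen != slots->generation) {
		c->last_used_slot = NULL;
		c->last_used_slot_gen = slots->generation;
	}

	/* Fast path: the previous hit still covers this gfn. */
	if (c->last_used_slot && demo_slot_contains(c->last_used_slot, gfn))
		return c->last_used_slot;

	s = demo_search(slots, gfn);
	if (s)
		c->last_used_slot = s;
	return s;
}
```

Tagging the cached pointer with a generation lets a slot update invalidate every vCPU's cache without touching the vCPUs: each one notices the mismatch on its next lookup.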
2021-12-06 19:54:30 +00:00
|
|
|
* The most recently used memslot by this vCPU and the slots generation
|
|
|
|
* for which it is valid.
|
|
|
|
* No wraparound protection is needed since generations won't overflow in
|
|
|
|
* thousands of years, even assuming 1M memslot operations per second.
|
2021-08-04 22:28:40 +00:00
|
|
|
*/
|
2021-12-06 19:54:30 +00:00
|
|
|
struct kvm_memory_slot *last_used_slot;
|
|
|
|
u64 last_used_slot_gen;
|
2007-12-14 01:41:22 +00:00
|
|
|
};
|
|
|
|
|
kvm: add guest_state_{enter,exit}_irqoff()
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
management. These are intended to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
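
The balanced entry/exit sequence from this commit message can be checked with a small user-space mock. This is only a sketch: the `demo_` stubs are hypothetical stand-ins that log their names, not the kernel helpers, and no real lockdep, tracing, RCU or vtime work is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trace buffer recording the order in which the stubs run. */
#define DEMO_MAX_STEPS 8
static const char *demo_trace[DEMO_MAX_STEPS];
static size_t demo_nr_steps;

static void demo_record(const char *step)
{
	assert(demo_nr_steps < DEMO_MAX_STEPS);
	demo_trace[demo_nr_steps++] = step;
}

/* Returns non-zero if step i in the trace has the given name. */
static int demo_step_is(size_t i, const char *name)
{
	return i < demo_nr_steps && strcmp(demo_trace[i], name) == 0;
}

/* Stubs standing in for the helpers named in the commit message. */
static void demo_guest_timing_enter_irqoff(void) { demo_record("timing_enter"); }
static void demo_guest_state_enter_irqoff(void)  { demo_record("state_enter"); }
static void demo_run_vcpu(void)                  { demo_record("run_vcpu"); }
static void demo_guest_state_exit_irqoff(void)   { demo_record("state_exit"); }
static void demo_handle_pending_irqs(void)       { demo_record("pending_irqs"); }
static void demo_guest_timing_exit_irqoff(void)  { demo_record("timing_exit"); }

/* One pass through the sequence: timing wraps state, state wraps the guest run. */
static void demo_vcpu_enter_exit(void)
{
	demo_guest_timing_enter_irqoff();
	demo_guest_state_enter_irqoff();
	demo_run_vcpu();
	demo_guest_state_exit_irqoff();
	demo_handle_pending_irqs();
	demo_guest_timing_exit_irqoff();
}
```

The nesting is the point of the sketch: pending IRQs are taken after the state exit but before the timing exit, so in the real helpers they would still fall inside the window accounted as guest time.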
2022-02-01 13:29:22 +00:00
|
|
|
/*
|
|
|
|
* Start accounting time towards a guest.
|
|
|
|
* Must be called before entering guest context.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_timing_enter_irqoff(void)
|
2021-05-05 00:27:34 +00:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This is running in ioctl context so it's safe to assume that it's the
|
|
|
|
* stime pending cputime to flush.
|
|
|
|
*/
|
|
|
|
instrumentation_begin();
|
|
|
|
vtime_account_guest_enter();
|
|
|
|
instrumentation_end();
|
2022-02-01 13:29:22 +00:00
|
|
|
}
|
2021-05-05 00:27:34 +00:00
|
|
|
|
2022-02-01 13:29:22 +00:00
|
|
|
/*
|
|
|
|
* Enter guest context and enter an RCU extended quiescent state.
|
|
|
|
*
|
|
|
|
* Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
|
|
|
|
* unsafe to use any code which may directly or indirectly use RCU, tracing
|
|
|
|
* (including IRQ flag tracing), or lockdep. All code in this period must be
|
|
|
|
* non-instrumentable.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_context_enter_irqoff(void)
|
|
|
|
{
|
2021-05-05 00:27:34 +00:00
|
|
|
/*
|
|
|
|
* KVM does not hold any references to rcu protected data when it
|
|
|
|
* switches CPU into a guest mode. In fact switching to a guest mode
|
|
|
|
* is very similar to exiting to userspace from rcu point of view. In
|
|
|
|
* addition CPU may stay in a guest mode for quite a long time (up to
|
|
|
|
* one time slice). Let's treat guest mode as quiescent state, just like
|
|
|
|
* we do with user-mode execution.
|
|
|
|
*/
|
|
|
|
if (!context_tracking_guest_enter()) {
|
|
|
|
instrumentation_begin();
|
2022-09-15 08:38:24 +00:00
|
|
|
rcu_virt_note_context_switch();
|
2021-05-05 00:27:34 +00:00
|
|
|
instrumentation_end();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-02-01 13:29:22 +00:00
|
|
|
/*
|
|
|
|
* Deprecated. Architectures should move to guest_timing_enter_irqoff() and
|
|
|
|
* guest_state_enter_irqoff().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_enter_irqoff(void)
|
|
|
|
{
|
|
|
|
guest_timing_enter_irqoff();
|
|
|
|
guest_context_enter_irqoff();
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* guest_state_enter_irqoff - Fixup state when entering a guest
|
|
|
|
*
|
|
|
|
* Entry to a guest will enable interrupts, but the kernel state is interrupts
|
|
|
|
* disabled when this is invoked. Also tell RCU about it.
|
|
|
|
*
|
|
|
|
* 1) Trace interrupts on state
|
|
|
|
* 2) Invoke context tracking if enabled to adjust RCU state
|
|
|
|
* 3) Tell lockdep that interrupts are enabled
|
|
|
|
*
|
|
|
|
* Invoked from architecture specific code before entering a guest.
|
|
|
|
* Must be called with interrupts disabled and the caller must be
|
|
|
|
* non-instrumentable.
|
|
|
|
* The caller has to invoke guest_timing_enter_irqoff() before this.
|
|
|
|
*
|
|
|
|
* Note: this is analogous to exit_to_user_mode().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_state_enter_irqoff(void)
|
|
|
|
{
|
|
|
|
instrumentation_begin();
|
|
|
|
trace_hardirqs_on_prepare();
|
2022-03-14 22:19:03 +00:00
|
|
|
lockdep_hardirqs_on_prepare();
|
2022-02-01 13:29:22 +00:00
|
|
|
instrumentation_end();
|
|
|
|
|
|
|
|
guest_context_enter_irqoff();
|
|
|
|
lockdep_hardirqs_on(CALLER_ADDR0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Exit guest context and exit an RCU extended quiescent state.
|
|
|
|
*
|
|
|
|
* Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
|
|
|
|
* unsafe to use any code which may directly or indirectly use RCU, tracing
|
|
|
|
* (including IRQ flag tracing), or lockdep. All code in this period must be
|
|
|
|
* non-instrumentable.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_context_exit_irqoff(void)
|
2021-05-05 00:27:34 +00:00
|
|
|
{
|
2024-05-11 02:05:56 +00:00
|
|
|
/*
|
|
|
|
* Guest mode is treated as a quiescent state, see
|
|
|
|
* guest_context_enter_irqoff() for more details.
|
|
|
|
*/
|
|
|
|
if (!context_tracking_guest_exit()) {
|
|
|
|
instrumentation_begin();
|
|
|
|
rcu_virt_note_context_switch();
|
|
|
|
instrumentation_end();
|
|
|
|
}
|
2022-02-01 13:29:22 +00:00
|
|
|
}
|
2021-05-05 00:27:34 +00:00
|
|
|
|
2022-02-01 13:29:22 +00:00
|
|
|
/*
|
|
|
|
* Stop accounting time towards a guest.
|
|
|
|
* Must be called after exiting guest context.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_timing_exit_irqoff(void)
|
|
|
|
{
|
2021-05-05 00:27:34 +00:00
|
|
|
instrumentation_begin();
|
|
|
|
/* Flush the guest cputime we spent on the guest */
|
|
|
|
vtime_account_guest_exit();
|
|
|
|
instrumentation_end();
|
|
|
|
}
|
|
|
|
|
kvm: add guest_state_{enter,exit}_irqoff()
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and a have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
heleprs are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
manageent. These are inteneded to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
  guest_timing_enter_irqoff();
  guest_state_enter_irqoff();
  < run the vcpu >
  guest_state_exit_irqoff();
  < take any pending IRQs >
  guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
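The sequence described above can be sketched in user space. This is only an illustration of the call ordering: the helper bodies below are hypothetical stand-ins that record their names, and run_vcpu(), handle_pending_irqs() and vcpu_run_once() are invented for the example rather than taken from the kernel.

```c
#include <string.h>

/* Hypothetical user-space stand-ins; the real helpers are kernel code. */
static char order_log[128];

/* Append one step name to the comma-separated log. */
static void log_step(const char *step)
{
	if (order_log[0] != '\0')
		strncat(order_log, ",", sizeof(order_log) - strlen(order_log) - 1);
	strncat(order_log, step, sizeof(order_log) - strlen(order_log) - 1);
}

static void guest_timing_enter_irqoff(void) { log_step("timing_enter"); }
static void guest_state_enter_irqoff(void)  { log_step("state_enter"); }
static void guest_state_exit_irqoff(void)   { log_step("state_exit"); }
static void guest_timing_exit_irqoff(void)  { log_step("timing_exit"); }

/* Assumed placeholders for "run the vcpu" and "take any pending IRQs". */
static void run_vcpu(void)            { log_step("run"); }
static void handle_pending_irqs(void) { log_step("irqs"); }

/* One guest entry/exit cycle; returns the recorded call order. */
const char *vcpu_run_once(void)
{
	order_log[0] = '\0';

	guest_timing_enter_irqoff();	/* open the vtime accounting window */
	guest_state_enter_irqoff();	/* lockdep, tracing, RCU: enter guest */
	run_vcpu();
	guest_state_exit_irqoff();	/* lockdep, RCU, tracing: back on host */
	handle_pending_irqs();		/* still inside the timing window */
	guest_timing_exit_irqoff();	/* close the vtime accounting window */

	return order_log;
}
```

The timing helpers bracket the state helpers, so interrupts handled right after the guest exits still fall inside the accounting window.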
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-01 13:29:22 +00:00
/*
 * Deprecated. Architectures should move to guest_state_exit_irqoff() and
 * guest_timing_exit_irqoff().
 */
static __always_inline void guest_exit_irqoff(void)
{
	guest_context_exit_irqoff();
	guest_timing_exit_irqoff();
}

2021-05-05 00:27:34 +00:00
static inline void guest_exit(void)
{
	unsigned long flags;

	local_irq_save(flags);
	guest_exit_irqoff();
	local_irq_restore(flags);
}

2022-02-01 13:29:22 +00:00
/**
 * guest_state_exit_irqoff - Establish state when returning from guest mode
 *
 * Entry from a guest disables interrupts, but guest mode is traced as
 * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
 *
 * 1) Tell lockdep that interrupts are disabled
 * 2) Invoke context tracking if enabled to reactivate RCU
 * 3) Trace interrupts off state
 *
 * Invoked from architecture specific code after exiting a guest.
 * Must be invoked with interrupts disabled and the caller must be
 * non-instrumentable.
 * The caller has to invoke guest_timing_exit_irqoff() after this.
 *
 * Note: this is analogous to enter_from_user_mode().
 */
static __always_inline void guest_state_exit_irqoff(void)
{
	lockdep_hardirqs_off(CALLER_ADDR0);
	guest_context_exit_irqoff();

	instrumentation_begin();
	trace_hardirqs_off_finish();
	instrumentation_end();
}

2011-01-12 07:40:31 +00:00
static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
{
2017-04-26 20:32:24 +00:00
	/*
	 * The memory barrier ensures a previous write to vcpu->requests cannot
	 * be reordered with the read of vcpu->mode. It pairs with the general
	 * memory barrier following the write of vcpu->mode in VCPU RUN.
	 */
	smp_mb__before_atomic();
2011-01-12 07:40:31 +00:00
	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
}

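The compare-and-exchange in kvm_vcpu_exiting_guest_mode() can be approximated in user space with C11 atomics. This is a rough sketch, not the kernel implementation: struct vcpu_sketch and vcpu_exiting_guest_mode() are hypothetical names, and a sequentially consistent C11 operation is only assumed to be a reasonable stand-in for the kernel's smp_mb__before_atomic() plus cmpxchg() pairing.

```c
#include <stdatomic.h>

/* Mode values named after the kernel's; the numbering here is arbitrary. */
enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, EXITING_GUEST_MODE };

/* Hypothetical reduced vCPU holding only the mode word. */
struct vcpu_sketch {
	atomic_int mode;
};

/*
 * Try to move the vCPU from IN_GUEST_MODE to EXITING_GUEST_MODE.
 * Like cmpxchg(), returns the mode observed before the attempt, so the
 * caller can tell whether the vCPU really was in guest mode.
 */
int vcpu_exiting_guest_mode(struct vcpu_sketch *vcpu)
{
	int expected = IN_GUEST_MODE;

	/* On failure, 'expected' is overwritten with the current mode. */
	atomic_compare_exchange_strong(&vcpu->mode, &expected,
				       EXITING_GUEST_MODE);
	return expected;
}
```

A caller would compare the return value against IN_GUEST_MODE to decide whether the target vCPU needs to be kicked out of the guest.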
2010-04-13 13:47:24 +00:00
/*
 * Some of the bitops functions do not support too long bitmaps.
 * This number must be determined not to exceed such limits.
 */
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

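To make the limit concrete, the following sketch checks a requested region size against KVM_MEM_MAX_NR_PAGES. It assumes 4 KiB pages (SKETCH_PAGE_SHIFT is a made-up constant) and slot_size_ok() is a hypothetical helper; it only illustrates the arithmetic and is not the kernel's own validation code.

```c
#include <stdbool.h>

#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

/* Assumption for the example: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* True if a region of size_bytes fits within the per-slot page limit. */
bool slot_size_ok(unsigned long long size_bytes)
{
	return (size_bytes >> SKETCH_PAGE_SHIFT) <= KVM_MEM_MAX_NR_PAGES;
}
```

Under that page-size assumption, the limit corresponds to just under 8 TiB per memslot.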
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
/*
 * Since at idle each memslot belongs to two memslot sets it has to contain
 * two embedded nodes for each data structure that it forms a part of.
 *
 * Two memslot sets (one active and one inactive) are necessary so the VM
 * continues to run on one memslot set while the other is being modified.
 *
 * These two memslot sets normally point to the same set of memslots.
 * They can, however, be desynchronized when performing a memslot management
 * operation by replacing the memslot to be modified by its copy.
 * After the operation is complete, both memslot sets once again point to
 * the same, common set of memslot data.
 *
 * The memslots themselves are independent of each other so they can be
 * individually added or deleted.
 */
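The "two embedded nodes" idea in the comment above can be illustrated with a much simpler structure than the kernel uses. In this sketch a singly linked list stands in for the hash list, interval tree and rbtree; struct slot_sketch, struct slot_set_sketch, set_insert() and set_find() are all hypothetical names made up for the example.

```c
#include <stddef.h>

/* A slot carries one link per memslot set, so it can sit in both sets. */
struct slot_sketch {
	int id;
	struct slot_sketch *next[2];
};

/* Each set only ever touches the link selected by its own node_idx. */
struct slot_set_sketch {
	int node_idx;
	struct slot_sketch *head;
};

/* Link a slot into one set without disturbing its membership in the other. */
void set_insert(struct slot_set_sketch *set, struct slot_sketch *slot)
{
	slot->next[set->node_idx] = set->head;
	set->head = slot;
}

/* Look a slot up by id within a single set. */
struct slot_sketch *set_find(const struct slot_set_sketch *set, int id)
{
	for (struct slot_sketch *s = set->head; s; s = s->next[set->node_idx])
		if (s->id == id)
			return s;
	return NULL;
}
```

Because each set walks only its own link, a slot can be added to the inactive set while readers of the active set never see it, which is the temporary desynchronization the comment describes.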
2006-12-10 10:21:36 +00:00
struct kvm_memory_slot {
2021-12-06 19:54:30 +00:00
	struct hlist_node id_node[2];
	struct interval_tree_node hva_node[2];
	struct rb_node gfn_node[2];
2006-12-10 10:21:36 +00:00
	gfn_t base_gfn;
	unsigned long npages;
	unsigned long *dirty_bitmap;
2012-02-08 04:02:18 +00:00
	struct kvm_arch_memory_slot arch;
2007-10-18 09:09:33 +00:00
	unsigned long userspace_addr;
2012-12-10 17:33:26 +00:00
	u32 flags;
2012-12-10 17:33:32 +00:00
	short id;
2020-10-14 18:26:46 +00:00
	u16 as_id;
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00

#ifdef CONFIG_KVM_PRIVATE_MEM
	struct {
		struct file __rcu *file;
		pgoff_t pgoff;
	} gmem;
#endif
2006-12-10 10:21:36 +00:00
};

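As an illustration of how base_gfn, npages and userspace_addr relate, the sketch below maps a guest frame number to a host virtual address within one slot. It is a simplified user-space model: struct memslot_sketch keeps only three of the fields, SKETCH_PAGE_SHIFT assumes 4 KiB pages, and both helper names are hypothetical rather than the kernel's own.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption for the example: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical reduced slot with only the fields used below. */
struct memslot_sketch {
	gfn_t base_gfn;			/* first guest frame covered */
	unsigned long npages;		/* number of pages in the slot */
	unsigned long userspace_addr;	/* host virtual address of page 0 */
};

/* True if gfn lies in [base_gfn, base_gfn + npages). */
bool slot_contains_gfn(const struct memslot_sketch *slot, gfn_t gfn)
{
	return gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages;
}

/* Host virtual address backing gfn; the caller must check containment first. */
unsigned long slot_gfn_to_hva(const struct memslot_sketch *slot, gfn_t gfn)
{
	unsigned long page_index = (unsigned long)(gfn - slot->base_gfn);

	return slot->userspace_addr + (page_index << SKETCH_PAGE_SHIFT);
}
```

The translation is a plain offset because a slot describes one contiguous range of guest frames backed by one contiguous range of host virtual addresses.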
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
{
	return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
}

2021-11-15 23:45:58 +00:00
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
2020-10-01 01:22:26 +00:00
{
	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
}

2010-04-12 10:35:35 +00:00
static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
{
	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
}

2018-04-30 16:33:24 +00:00
static inline unsigned long *kvm_second_dirty_bitmap(struct kvm_memory_slot *memslot)
{
	unsigned long len = kvm_dirty_bitmap_bytes(memslot);

	return memslot->dirty_bitmap + len / sizeof(*memslot->dirty_bitmap);
}

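The arithmetic in kvm_dirty_bitmap_bytes() and kvm_second_dirty_bitmap() can be tried out in userspace with the sketch below. It is only an illustration: the demo_ and DEMO_ names are stand-ins rather than kernel definitions, and the allocation helper assumes that both bitmaps live back to back in one buffer, which is what the pointer arithmetic in the second helper implies.

```c
#include <limits.h>
#include <stdlib.h>

/* Stand-ins for the kernel macros; the DEMO_ names are illustrative only. */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_ALIGN(x, a)   (((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))

struct demo_memslot {
	unsigned long npages;
	unsigned long *dirty_bitmap;
};

/* One bit per page, rounded up to a whole number of longs, in bytes. */
static unsigned long demo_dirty_bitmap_bytes(const struct demo_memslot *slot)
{
	return DEMO_ALIGN(slot->npages, DEMO_BITS_PER_LONG) / 8;
}

/* The second bitmap starts immediately after the first one. */
static unsigned long *demo_second_dirty_bitmap(const struct demo_memslot *slot)
{
	unsigned long len = demo_dirty_bitmap_bytes(slot);

	return slot->dirty_bitmap + len / sizeof(*slot->dirty_bitmap);
}

/* Allocate room for both bitmaps in one zeroed buffer; returns 0 on success. */
static int demo_alloc_dirty_bitmaps(struct demo_memslot *slot)
{
	slot->dirty_bitmap = calloc(2, demo_dirty_bitmap_bytes(slot));
	return slot->dirty_bitmap ? 0 : -1;
}
```

For a slot of 100 pages the page count is rounded up to 128 bits, so each bitmap takes 16 bytes and the second bitmap begins 16 bytes into the buffer.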
2020-02-27 01:32:27 +00:00
#ifndef KVM_DIRTY_LOG_MANUAL_CAPS
#define KVM_DIRTY_LOG_MANUAL_CAPS KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE
#endif

2013-07-15 11:36:01 +00:00
struct kvm_s390_adapter_int {
	u64 ind_addr;
	u64 summary_addr;
	u64 ind_offset;
	u32 summary_offset;
	u32 adapter_id;
};

2015-11-10 12:36:34 +00:00
struct kvm_hv_sint {
	u32 vcpu;
	u32 sint;
};

2021-12-10 16:36:23 +00:00
struct kvm_xen_evtchn {
	u32 port;
2022-03-03 15:41:17 +00:00
	u32 vcpu_id;
	int vcpu_idx;
2021-12-10 16:36:23 +00:00
	u32 priority;
};

2008-11-19 11:58:46 +00:00
struct kvm_kernel_irq_routing_entry {
	u32 gsi;
2009-07-26 14:10:01 +00:00
	u32 type;
2009-02-04 15:28:14 +00:00
	int (*set)(struct kvm_kernel_irq_routing_entry *e,
2013-04-11 11:21:40 +00:00
		   struct kvm *kvm, int irq_source_id, int level,
		   bool line_status);
2008-11-19 11:58:46 +00:00
	union {
		struct {
			unsigned irqchip;
			unsigned pin;
		} irqchip;
2016-07-22 16:20:38 +00:00
		struct {
			u32 address_lo;
			u32 address_hi;
			u32 data;
			u32 flags;
			u32 devid;
		} msi;
2013-07-15 11:36:01 +00:00
		struct kvm_s390_adapter_int adapter;
2015-11-10 12:36:34 +00:00
		struct kvm_hv_sint hv_sint;
2021-12-10 16:36:23 +00:00
		struct kvm_xen_evtchn xen_evtchn;
2008-11-19 11:58:46 +00:00
	};
2009-08-24 08:54:20 +00:00
	struct hlist_node link;
};

2015-07-30 06:32:35 +00:00
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
struct kvm_irq_routing_table {
	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
	u32 nr_rt_entries;
	/*
	 * Array indexed by gsi. Each entry contains list of irq chips
	 * the gsi is connected to.
	 */
2023-09-22 17:51:21 +00:00
	struct hlist_head map[] __counted_by(nr_rt_entries);
2015-07-30 06:32:35 +00:00
};
#endif

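The map[] member of struct kvm_irq_routing_table is a flexible array whose length is recorded in nr_rt_entries, which is the relationship the __counted_by() annotation declares. The userspace sketch below shows that layout in plain C. All demo_ names are hypothetical, the list head is a simplified stand-in for struct hlist_head, and the bounds check is written by hand rather than relying on any compiler support.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-in for a list head; not the kernel's struct hlist_head. */
struct demo_list_head {
	void *first;
};

struct demo_routing_table {
	uint32_t nr_rt_entries;
	/* One list head per gsi; the length is recorded in nr_rt_entries. */
	struct demo_list_head map[];
};

/* Allocate a zeroed table with room for nr_entries list heads. */
static struct demo_routing_table *demo_table_alloc(uint32_t nr_entries)
{
	struct demo_routing_table *t;

	t = calloc(1, sizeof(*t) + (size_t)nr_entries * sizeof(t->map[0]));
	if (t)
		t->nr_rt_entries = nr_entries;
	return t;
}

/* Bounds-checked access: returns NULL when gsi is outside the table. */
static struct demo_list_head *demo_table_lookup(struct demo_routing_table *t,
						uint32_t gsi)
{
	return gsi < t->nr_rt_entries ? &t->map[gsi] : NULL;
}
```

Keeping the count in the same allocation as the array means a single pointer carries everything needed to validate an index.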
2022-11-03 14:44:10 +00:00
bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
2015-07-30 06:32:35 +00:00

2022-08-16 12:53:21 +00:00
#ifndef KVM_INTERNAL_MEM_SLOTS
#define KVM_INTERNAL_MEM_SLOTS 0
2012-12-10 17:33:15 +00:00
#endif

2021-01-28 18:01:31 +00:00
#define KVM_MEM_SLOTS_NUM SHRT_MAX
2022-08-16 12:53:21 +00:00
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
2011-11-24 09:37:48 +00:00

2023-10-27 18:22:04 +00:00
#if KVM_MAX_NR_ADDRESS_SPACES == 1
static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
{
	return KVM_MAX_NR_ADDRESS_SPACES;
}

2015-05-17 15:30:37 +00:00
static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
{
	return 0;
}
#endif

2023-11-13 10:42:34 +00:00
/*
 * Arch code must define kvm_arch_has_private_mem if support for private memory
 * is enabled.
 */
#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
{
	return false;
}
#endif

KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-09 19:02:58 +00:00
#ifndef kvm_arch_has_readonly_mem
static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
{
	return IS_ENABLED(CONFIG_HAVE_KVM_READONLY_MEM);
}
#endif

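The #ifndef guard around kvm_arch_has_readonly_mem() lets architecture code supply its own version, for example one that refuses read-only memslots for VMs with protected state, as the commit message describes. The sketch below shows the general override pattern in a self-contained form. The demo_ names and the has_protected_state field are hypothetical, and the "arch" half is an assumption about how such an override could look; it is not copied from any architecture's header.

```c
#include <stdbool.h>

struct demo_vm {
	bool has_protected_state;
};

/*
 * "Arch" side (hypothetical): provide the helper and define a macro of the
 * same name so that the generic header can detect the override.
 */
static inline bool demo_arch_has_readonly_mem(struct demo_vm *vm)
{
	return !vm->has_protected_state;
}
#define demo_arch_has_readonly_mem demo_arch_has_readonly_mem

/* "Generic" side: the fallback is compiled only when no override exists. */
#ifndef demo_arch_has_readonly_mem
static inline bool demo_arch_has_readonly_mem(struct demo_vm *vm)
{
	return true;
}
#endif
```

Defining a macro that expands to its own name is harmless at call sites, because the preprocessor does not re-expand it, yet it makes the function's presence visible to #ifndef.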
2009-12-23 16:35:16 +00:00
struct kvm_memslots {
2010-10-18 13:22:23 +00:00
	u64 generation;
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
	atomic_long_t last_used_slot;
2021-12-06 19:54:28 +00:00
	struct rb_root_cached hva_tree;
2021-12-06 19:54:30 +00:00
	struct rb_root gfn_tree;
2021-12-06 19:54:27 +00:00
	/*
2021-12-06 19:54:30 +00:00
	 * The mapping table from slot id to memslot.
2021-12-06 19:54:27 +00:00
	 *
	 * 7-bit bucket count matches the size of the old id to index array for
	 * 512 slots, while giving good performance with this slot count.
	 * Higher bucket counts bring only small performance improvements but
	 * always result in higher memory usage (even for lower memslot counts).
	 */
	DECLARE_HASHTABLE(id_hash, 7);
2021-12-06 19:54:30 +00:00
	int node_idx;
2009-12-23 16:35:16 +00:00
};

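The tree-based memslots commit message describes two memslot sets that normally point at the same slot data and are only briefly out of sync while one slot is replaced by a modified copy. The sketch below illustrates that idea in a much simplified form. The demo_ names, the fixed-size pointer array (instead of trees and a hash table), and the single-threaded swap are all assumptions made for illustration.

```c
#include <stdlib.h>

#define DEMO_NR_SLOTS 4

struct demo_slot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

/* A set only holds pointers; the slot data lives outside the sets. */
struct demo_slot_set {
	struct demo_slot *slots[DEMO_NR_SLOTS];
};

struct demo_vm {
	struct demo_slot_set sets[2];
	int active;	/* index of the set readers currently use */
};

/*
 * Change the flags of one slot by modifying a copy of just that slot.
 * Returns 0 on success, -1 if the slot is missing or allocation fails.
 */
static int demo_set_flags(struct demo_vm *vm, int id, unsigned int flags)
{
	struct demo_slot *old, *new;

	if (id < 0 || id >= DEMO_NR_SLOTS)
		return -1;
	old = vm->sets[vm->active].slots[id];
	if (!old)
		return -1;
	new = malloc(sizeof(*new));
	if (!new)
		return -1;

	*new = *old;				/* copy only the slot being modified */
	new->flags = flags;
	vm->sets[!vm->active].slots[id] = new;	/* sets are briefly out of sync */

	vm->active = !vm->active;		/* readers now see the updated set */

	/*
	 * A real implementation must wait for readers of the old set before
	 * touching it (the kernel uses SRCU for this); that step is omitted.
	 */
	vm->sets[!vm->active].slots[id] = new;	/* both sets agree again */
	free(old);
	return 0;
}
```

Only the modified slot is duplicated; every other slot pointer is shared by both sets throughout the operation.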
2006-12-10 10:21:36 +00:00
struct kvm {
2021-02-02 18:57:24 +00:00
#ifdef KVM_HAVE_MMU_RWLOCK
	rwlock_t mmu_lock;
#else
2007-12-21 00:18:26 +00:00
	spinlock_t mmu_lock;
2021-02-02 18:57:24 +00:00
#endif /* KVM_HAVE_MMU_RWLOCK */

2009-12-23 16:35:26 +00:00
	struct mutex slots_lock;
2021-05-18 17:34:11 +00:00

	/*
	 * Protects the arch-specific fields of struct kvm_memory_slots in
	 * use by the VM. To be used under the slots_lock (above) or in a
	 * kvm->srcu critical section where acquiring the slots_lock would
	 * lead to deadlock with the synchronize_srcu in
2023-02-23 05:28:51 +00:00
	 * kvm_swap_active_memslots().
2021-05-18 17:34:11 +00:00
	 */
	struct mutex slots_arch_lock;
2007-11-21 14:41:05 +00:00
	struct mm_struct *mm; /* userspace tied to this vm */
KVM: Require total number of memslot pages to fit in an unsigned long
Explicitly disallow creating more memslot pages than can fit in an
unsigned long, KVM doesn't correctly handle a total number of memslot
pages that doesn't fit in an unsigned long and remedying that would be a
waste of time.
For a 64-bit kernel, this is a nop as memslots are not allowed to overlap
in the gfn address space.
With a 32-bit kernel, userspace can at most address 3gb of virtual memory,
whereas wrapping the total number of pages would require 4tb+ of guest
physical memory. Even with x86's second address space for SMM, userspace
would need to alias all of guest memory more than one _thousand_ times.
And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't
actually access any of those aliases even if userspace lied about
guest.MAXPHYADDR.
On 390 and arm64, this is a nop as they don't support 32-bit hosts.
On x86, practically speaking this is simply acknowledging reality as the
existing kvm_mmu_calculate_default_mmu_pages() assumes the total number
of pages fits in an "unsigned long".
On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and
gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes
a step further and fails the build if CONFIG_PTE_64BIT=y, which
presumably means that it doesn't support 64-bit physical addresses.
On MIPS, this is also likely a nop as the core MMU helpers assume gpas
fit in unsigned long, e.g. see kvm_mips_##name##_pte.
And finally, RISC-V is a "don't care" as it doesn't exist in any release,
i.e. there is no established ABI to break.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:07 +00:00
	unsigned long nr_memslot_pages;
2021-12-06 19:54:30 +00:00
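The "two sets, one copy of the slot data" idea from the commit message above can be illustrated with a deliberately simplified sketch. Everything named demo_* is hypothetical: plain pointer arrays stand in for the rbtree, and the RCU/SRCU synchronization that the kernel needs between publishing the new set and touching the old one is left out entirely.

```c
#define DEMO_NR_SLOTS 4

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct demo_slot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

struct demo_slot_set {
	struct demo_slot *slots[DEMO_NR_SLOTS];	/* pointers, not copies */
};

struct demo_vm {
	struct demo_slot_set sets[2];
	int active;	/* index of the set the VM currently runs on */
};

/*
 * Change the flags of slot @id.  Only that one slot is copied (into
 * caller-provided storage @copy); every other slot stays shared.
 * Returns the replaced slot so the caller can release it.
 */
static struct demo_slot *demo_set_flags(struct demo_vm *vm, int id,
					struct demo_slot *copy,
					unsigned int flags)
{
	struct demo_slot *old = vm->sets[vm->active].slots[id];
	int inactive = !vm->active;

	*copy = *old;
	copy->flags = flags;

	/* The two sets are desynchronized from here ... */
	vm->sets[inactive].slots[id] = copy;
	vm->active = inactive;	/* publish the modified set */

	/* ... until the formerly active set picks up the copy as well. */
	vm->sets[!vm->active].slots[id] = copy;
	return old;
}
```

After the call both sets again point at the same slot objects, and only the modified slot was ever duplicated.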
|
|
|
/* The two memslot sets - active and inactive (per address space) */
|
2023-10-27 18:22:04 +00:00
|
|
|
struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
|
2021-12-06 19:54:30 +00:00
|
|
|
/* The current active memslot set for each address space */
|
2023-10-27 18:22:04 +00:00
|
|
|
struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
|
2021-11-16 16:04:01 +00:00
|
|
|
struct xarray vcpu_array;
|
2022-11-17 17:25:02 +00:00
|
|
|
/*
|
|
|
|
* Protected by slots_lock, but can be read outside if an
|
|
|
|
* incorrect answer is acceptable.
|
|
|
|
*/
|
|
|
|
atomic_t nr_memslots_dirty_logging;
|
2016-06-13 12:48:25 +00:00
|
|
|
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier runs, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
  invalidate_range_start
    take pseudo lock
    down_read()           (*)
    release pseudo lock
  invalidate_range_end
    take pseudo lock      (**)
    up_read()
    release pseudo lock
At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
  held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
  held till (another) invalidate_range_end finishes
- invalidate_range_end sits waiting on the pseudo lock, held by
  invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical sections in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
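The open-coded wait described in the commit message above (a count of in-flight invalidations plus a lock, with no fairness toward the updater) might look roughly like this in userspace. This is only a sketch under stated assumptions: a pthread mutex and condition variable stand in for the spinlock and rcuwait, and all demo_* names are hypothetical.

```c
#include <pthread.h>

/* Hypothetical userspace stand-in for the spinlock, count and waiter. */
struct demo_mn {
	pthread_mutex_t lock;
	pthread_cond_t idle;
	unsigned long active;	/* invalidations between start and end */
};

#define DEMO_MN_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* range_start: only bumps the count, never waits for an updater. */
static void demo_range_start(struct demo_mn *mn)
{
	pthread_mutex_lock(&mn->lock);
	mn->active++;
	pthread_mutex_unlock(&mn->lock);
}

/* range_end: drops the count and wakes the updater when it hits zero. */
static void demo_range_end(struct demo_mn *mn)
{
	pthread_mutex_lock(&mn->lock);
	if (--mn->active == 0)
		pthread_cond_broadcast(&mn->idle);
	pthread_mutex_unlock(&mn->lock);
}

/* Memslot update: wait for a moment with no invalidation in flight. */
static void demo_memslots_update(struct demo_mn *mn,
				 void (*install)(void *), void *arg)
{
	pthread_mutex_lock(&mn->lock);
	while (mn->active)
		pthread_cond_wait(&mn->idle, &mn->lock);
	install(arg);	/* runs with the count known to be zero */
	pthread_mutex_unlock(&mn->lock);
}

/* Example install callback: records that the update ran. */
static void demo_mark_installed(void *arg)
{
	*(int *)arg = 1;
}
```

Because demo_range_start() never waits, a steady stream of invalidations can starve demo_memslots_update(), which is the theoretical unfairness the commit message acknowledges.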
|
|
|
/* Used to wait for completion of MMU notifiers. */
|
|
|
|
spinlock_t mn_invalidate_lock;
|
|
|
|
unsigned long mn_active_invalidate_count;
|
|
|
|
struct rcuwait mn_memslots_update_rcuwait;
|
|
|
|
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 16:36:21 +00:00
|
|
|
/* For management / invalidation of gfn_to_pfn_caches */
|
|
|
|
spinlock_t gpc_lock;
|
|
|
|
struct list_head gpc_list;
|
|
|
|
|
2016-06-13 12:48:25 +00:00
|
|
|
/*
|
|
|
|
* created_vcpus is protected by kvm->lock, and is incremented
|
|
|
|
* at the beginning of KVM_CREATE_VCPU. online_vcpus is only
|
|
|
|
* incremented after storing the kvm_vcpu pointer in vcpus,
|
|
|
|
* and is accessed atomically.
|
|
|
|
*/
|
2009-06-09 12:56:28 +00:00
|
|
|
atomic_t online_vcpus;
|
2022-03-04 19:48:38 +00:00
|
|
|
int max_vcpus;
|
2016-06-13 12:48:25 +00:00
|
|
|
int created_vcpus;
|
2011-02-01 14:53:28 +00:00
|
|
|
int last_boosted_vcpu;
|
2007-02-12 08:54:44 +00:00
|
|
|
struct list_head vm_list;
|
2009-06-04 18:08:23 +00:00
|
|
|
struct mutex lock;
|
2017-07-07 08:51:38 +00:00
|
|
|
struct kvm_io_bus __rcu *buses[KVM_NR_BUSES];
|
2023-10-18 16:18:00 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2009-05-20 14:30:49 +00:00
|
|
|
struct {
|
|
|
|
spinlock_t lock;
|
|
|
|
struct list_head items;
|
2023-03-22 20:43:43 +00:00
|
|
|
/* resampler_list update side is protected by resampler_lock. */
|
2012-09-21 17:58:03 +00:00
|
|
|
struct list_head resampler_list;
|
|
|
|
struct mutex resampler_lock;
|
2009-05-20 14:30:49 +00:00
|
|
|
} irqfds;
|
2023-10-18 16:18:00 +00:00
|
|
|
#endif
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio:      110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio:  367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate from the NULLIO runs, which showed +2.56us for MMIO and
-350ns for HC (both relative to PIO). That gives:
qemu-pio:       153139 iops, 6.53us rtt
ioeventfd-hc:   412585 iops, 2.37us rtt
These are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
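The signalling primitive that ioeventfd builds on can be tried from plain userspace. The sketch below only exercises eventfd(2) itself on Linux; it does not register anything with KVM, and demo_ring_and_drain is a hypothetical helper name.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Signal an eventfd @rings times, then drain it once.
 * Returns the counter value read back, or -1 on error
 * (including the case where nothing was signalled).
 */
static int64_t demo_ring_and_drain(unsigned int rings)
{
	uint64_t one = 1, seen = 0;
	int64_t ret = -1;
	int efd = eventfd(0, EFD_NONBLOCK);

	if (efd < 0)
		return -1;

	/* Each 8-byte write adds its value to the kernel-side counter. */
	for (unsigned int i = 0; i < rings; i++)
		if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
			goto out;

	/* One read returns the accumulated count and resets it to zero. */
	if (read(efd, &seen, sizeof(seen)) == (ssize_t)sizeof(seen))
		ret = (int64_t)seen;
out:
	close(efd);
	return ret;
}
```

Several signals coalesce into a single wakeup on the reader side, which is part of why an eventfd works as a cheap doorbell.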
|
|
|
struct list_head ioeventfds;
|
2007-11-18 14:24:12 +00:00
|
|
|
struct kvm_vm_stat stat;
|
2007-12-14 01:54:20 +00:00
|
|
|
struct kvm_arch arch;
|
2017-02-20 11:06:21 +00:00
|
|
|
refcount_t users_count;
|
2017-03-31 11:53:23 +00:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2008-05-30 14:05:54 +00:00
|
|
|
struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
|
2011-07-20 17:59:00 +00:00
|
|
|
spinlock_t ring_lock;
|
|
|
|
struct list_head coalesced_zones;
|
2008-05-30 14:05:54 +00:00
|
|
|
#endif
|
2008-07-25 14:24:52 +00:00
|
|
|
|
2009-06-04 18:08:23 +00:00
|
|
|
struct mutex irq_lock;
|
2009-01-04 15:10:50 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2010-11-18 17:09:08 +00:00
|
|
|
/*
|
2014-06-30 10:51:11 +00:00
|
|
|
* Update side is protected by irq_lock.
|
2010-11-18 17:09:08 +00:00
|
|
|
*/
|
2010-03-04 14:59:23 +00:00
|
|
|
struct kvm_irq_routing_table __rcu *irq_routing;
|
2023-10-18 16:07:32 +00:00
|
|
|
|
2009-08-24 08:54:23 +00:00
|
|
|
struct hlist_head irq_ack_notifier_list;
|
2009-01-04 15:10:50 +00:00
|
|
|
#endif
|
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2008-07-25 14:24:52 +00:00
|
|
|
struct mmu_notifier mmu_notifier;
|
2022-08-16 12:53:22 +00:00
|
|
|
unsigned long mmu_invalidate_seq;
|
|
|
|
long mmu_invalidate_in_progress;
|
2023-10-27 18:21:45 +00:00
|
|
|
gfn_t mmu_invalidate_range_start;
|
|
|
|
gfn_t mmu_invalidate_range_end;
|
2008-07-25 14:24:52 +00:00
|
|
|
#endif
|
2013-04-25 14:11:23 +00:00
|
|
|
struct list_head devices;
|
2020-02-27 01:32:27 +00:00
|
|
|
u64 manual_dirty_log_protect;
|
2016-05-18 11:26:23 +00:00
|
|
|
struct dentry *debugfs_dentry;
|
|
|
|
struct kvm_stat_data **debugfs_stat_data;
|
2017-04-21 00:30:06 +00:00
|
|
|
struct srcu_struct srcu;
|
|
|
|
struct srcu_struct irq_srcu;
|
2017-07-24 11:40:03 +00:00
|
|
|
pid_t userspace_pid;
|
2022-11-17 00:16:57 +00:00
|
|
|
bool override_halt_poll_ns;
|
2020-04-17 22:14:46 +00:00
|
|
|
unsigned int max_halt_poll_ns;
|
2020-10-01 01:22:22 +00:00
|
|
|
u32 dirty_ring_size;
|
2022-11-10 10:49:10 +00:00
|
|
|
bool dirty_ring_with_bitmap;
|
2021-07-02 22:04:23 +00:00
|
|
|
bool vm_bugged;
|
2021-11-11 15:13:38 +00:00
|
|
|
bool vm_dead;
|
2021-06-06 02:10:44 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
|
|
|
|
struct notifier_block pm_notifier;
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
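The bit layout from the commit message above (bit 3 for PRIVATE, bits 0-2 held back for possible RWX use) can be sketched as follows. The DEMO_* names are hypothetical, the R/W/X bits are not defined by this commit, and a flat array stands in for the xarray purely to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bits 0-2 are only reserved by the commit. */
#define DEMO_ATTR_READ    (1ULL << 0)
#define DEMO_ATTR_WRITE   (1ULL << 1)
#define DEMO_ATTR_EXEC    (1ULL << 2)
#define DEMO_ATTR_PRIVATE (1ULL << 3)	/* bit 3, as in the commit message */

#define DEMO_NR_GFNS 64

/* A flat array stands in for the xarray: one attribute word per gfn. */
struct demo_attrs {
	uint64_t attr[DEMO_NR_GFNS];
};

/* Set @attrs on every page in [start, end); false if out of range. */
static bool demo_set_attrs(struct demo_attrs *a, uint64_t start,
			   uint64_t end, uint64_t attrs)
{
	if (start >= end || end > DEMO_NR_GFNS)
		return false;
	for (uint64_t gfn = start; gfn < end; gfn++)
		a->attr[gfn] = attrs;
	return true;
}

/* True only if every page in [start, end) has the PRIVATE bit set. */
static bool demo_range_is_private(const struct demo_attrs *a,
				  uint64_t start, uint64_t end)
{
	if (start >= end || end > DEMO_NR_GFNS)
		return false;
	for (uint64_t gfn = start; gfn < end; gfn++)
		if (!(a->attr[gfn] & DEMO_ATTR_PRIVATE))
			return false;
	return true;
}
```

The per-page loop mirrors the "naive, not fully optimized" approach the commit message opts for.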
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
/* Protected by slots_lock (for writes) and RCU (for reads) */
|
|
|
|
struct xarray mem_attr_array;
|
2021-06-06 02:10:44 +00:00
|
|
|
#endif
|
2021-06-18 22:27:05 +00:00
|
|
|
char stats_id[KVM_STATS_NAME_SIZE];
|
2006-12-10 10:21:36 +00:00
|
|
|
};
|
|
|
|
|
KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.
Functions introduced or modified are:
- kvm_err(fmt, ...)
- kvm_info(fmt, ...)
- kvm_debug(fmt, ...)
- kvm_pr_unimpl(fmt, ...)
- pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-03 18:17:48 +00:00
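The wrapper pattern this commit introduces (prefix the format string, forward the remaining arguments) can be reproduced outside the kernel. In this sketch snprintf replaces pr_err so the result can be inspected, demo_pid is a hypothetical stand-in for task_pid_nr(current), and the macro names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical stand-in for task_pid_nr(current). */
static int demo_pid = 42;

/*
 * Format into @buf with a "kvm [pid]: " prefix.  The ", ## __VA_ARGS__"
 * form is a GNU extension that drops the comma when no arguments follow.
 */
#define demo_kvm_fmt(buf, len, fmt, ...) \
	snprintf(buf, len, "kvm [%i]: " fmt, demo_pid, ## __VA_ARGS__)

/* Layered like vcpu_err(): add a per-vCPU prefix, reuse the base macro. */
#define demo_vcpu_fmt(buf, len, id, fmt, ...) \
	demo_kvm_fmt(buf, len, "vcpu%i " fmt, id, ## __VA_ARGS__)
```

Adjacent string literals are concatenated at compile time, so @fmt has to be a literal for these macros to work.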
|
|
|
#define kvm_err(fmt, ...) \
|
|
|
|
pr_err("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
|
|
|
#define kvm_info(fmt, ...) \
|
|
|
|
pr_info("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
|
|
|
#define kvm_debug(fmt, ...) \
|
|
|
|
pr_debug("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
2016-11-15 06:36:18 +00:00
|
|
|
#define kvm_debug_ratelimited(fmt, ...) \
|
|
|
|
pr_debug_ratelimited("kvm [%i]: " fmt, task_pid_nr(current), \
|
|
|
|
## __VA_ARGS__)
|
2012-06-03 18:17:48 +00:00
|
|
|
#define kvm_pr_unimpl(fmt, ...) \
|
|
|
|
pr_err_ratelimited("kvm [%i]: " fmt, \
|
|
|
|
task_tgid_nr(current), ## __VA_ARGS__)
|
2007-08-01 00:48:02 +00:00
|
|
|
|
2012-06-03 18:17:48 +00:00
|
|
|
/* The guest did something we don't support. */
|
|
|
|
#define vcpu_unimpl(vcpu, fmt, ...) \
|
2015-11-20 18:52:12 +00:00
|
|
|
kvm_pr_unimpl("vcpu%i, guest rIP: 0x%lx " fmt, \
|
|
|
|
(vcpu)->vcpu_id, kvm_rip_read(vcpu), ## __VA_ARGS__)
|
2006-12-10 10:21:36 +00:00
|
|
|
|
2015-07-03 12:01:35 +00:00
|
|
|
#define vcpu_debug(vcpu, fmt, ...) \
|
|
|
|
kvm_debug("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
|
2016-11-15 06:36:18 +00:00
|
|
|
#define vcpu_debug_ratelimited(vcpu, fmt, ...) \
|
|
|
|
kvm_debug_ratelimited("vcpu%i " fmt, (vcpu)->vcpu_id, \
|
|
|
|
## __VA_ARGS__)
|
2015-11-30 16:22:20 +00:00
|
|
|
#define vcpu_err(vcpu, fmt, ...) \
|
|
|
|
kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
|
2015-07-03 12:01:35 +00:00
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
static inline void kvm_vm_dead(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->vm_dead = true;
|
|
|
|
kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
|
|
|
|
}
|
|
|
|
|
2021-07-02 22:04:23 +00:00
|
|
|
static inline void kvm_vm_bugged(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->vm_bugged = true;
|
2021-11-11 15:13:38 +00:00
|
|
|
kvm_vm_dead(kvm);
|
2021-07-02 22:04:23 +00:00
|
|
|
}
|
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
|
2021-07-02 22:04:23 +00:00
|
|
|
#define KVM_BUG(cond, kvm, fmt...) \
|
|
|
|
({ \
|
2023-03-07 13:52:33 +00:00
|
|
|
bool __ret = !!(cond); \
|
2021-07-02 22:04:23 +00:00
|
|
|
\
|
|
|
|
if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \
|
|
|
|
kvm_vm_bugged(kvm); \
|
|
|
|
unlikely(__ret); \
|
|
|
|
})
|
|
|
|
|
|
|
|
#define KVM_BUG_ON(cond, kvm) \
|
|
|
|
({ \
|
2023-03-07 13:52:33 +00:00
|
|
|
bool __ret = !!(cond); \
|
2021-07-02 22:04:23 +00:00
|
|
|
\
|
|
|
|
if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
|
|
|
|
kvm_vm_bugged(kvm); \
|
|
|
|
unlikely(__ret); \
|
|
|
|
})
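The shape of KVM_BUG_ON() (evaluate the condition once, warn and mark the VM only the first time, still yield the condition's value) can be sketched in userspace. The demo_* names are hypothetical, fprintf replaces WARN_ON_ONCE, and a per-VM counter is added only so the "warn once" behaviour is observable.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical miniature of struct kvm: just the state the macro touches. */
struct demo_vm {
	bool bugged;
	int warnings;	/* how many times a warning was emitted */
};

static void demo_vm_bugged(struct demo_vm *vm)
{
	vm->bugged = true;
}

/*
 * Evaluates @cond exactly once and yields its truth value, so it can sit
 * inside an if ().  Warns and marks the VM only on the first failure.
 * Uses a GNU statement expression, like the kernel macro.
 */
#define DEMO_BUG_ON(cond, vm) \
({ \
	bool ret_ = !!(cond); \
 \
	if (ret_ && !(vm)->bugged) { \
		(vm)->warnings++; \
		fprintf(stderr, "demo: BUG_ON(%s)\n", #cond); \
		demo_vm_bugged(vm); \
	} \
	ret_; \
})
```

A caller would typically write `if (DEMO_BUG_ON(bad, vm)) return -1;`, bailing out of the operation while leaving the rest of the host untouched.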
|
|
|
|
|
2023-07-29 00:47:22 +00:00
|
|
|
/*
|
|
|
|
* Note, "data corruption" refers to corruption of host kernel data structures,
|
|
|
|
* not guest data. Guest data corruption, suspected or confirmed, that is tied
|
|
|
|
* and contained to a single VM should *never* BUG() and potentially panic the
|
|
|
|
* host, i.e. use this variant of KVM_BUG() if and only if a KVM data structure
|
|
|
|
* is corrupted and that corruption can have a cascading effect to other parts
|
|
|
|
* of the hosts and/or to other VMs.
|
|
|
|
*/
|
|
|
|
#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm) \
|
|
|
|
({ \
|
|
|
|
bool __ret = !!(cond); \
|
|
|
|
\
|
|
|
|
if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) \
|
|
|
|
BUG_ON(__ret); \
|
|
|
|
else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
|
|
|
|
kvm_vm_bugged(kvm); \
|
|
|
|
unlikely(__ret); \
|
|
|
|
})
|
|
|
|
|
2022-04-15 00:43:43 +00:00
|
|
|
static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
WARN_ONCE(vcpu->srcu_depth++,
|
|
|
|
"KVM: Illegal vCPU srcu_idx LOCK, depth=%d", vcpu->srcu_depth - 1);
|
|
|
|
#endif
|
|
|
|
vcpu->____srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_srcu_read_unlock(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->____srcu_idx);
|
|
|
|
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
WARN_ONCE(--vcpu->srcu_depth,
|
|
|
|
"KVM: Illegal vCPU srcu_idx UNLOCK, depth=%d", vcpu->srcu_depth);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2020-02-27 01:32:27 +00:00
|
|
|
static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
|
|
|
|
}
|
|
|
|
|
2017-07-07 08:51:38 +00:00
|
|
|
static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
|
|
|
|
{
|
|
|
|
return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
|
2017-08-02 15:55:54 +00:00
|
|
|
lockdep_is_held(&kvm->slots_lock) ||
|
|
|
|
!refcount_read(&kvm->users_count));
|
2017-07-07 08:51:38 +00:00
|
|
|
}
|
|
|
|
|
2009-06-09 12:56:29 +00:00
|
|
|
static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
|
|
|
|
{
|
2019-04-11 09:16:47 +00:00
|
|
|
int num_vcpus = atomic_read(&kvm->online_vcpus);
|
|
|
|
i = array_index_nospec(i, num_vcpus);
|
|
|
|
|
|
|
|
/* Pairs with smp_wmb() in kvm_vm_ioctl_create_vcpu. */
|
2009-06-09 12:56:29 +00:00
|
|
|
smp_rmb();
|
2021-11-16 16:04:01 +00:00
|
|
|
return xa_load(&kvm->vcpu_array, i);
|
2009-06-09 12:56:29 +00:00
|
|
|
}
|
|
|
|
|
2021-11-16 16:04:03 +00:00
|
|
|
#define kvm_for_each_vcpu(idx, vcpup, kvm) \
|
|
|
|
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
|
|
|
|
(atomic_read(&kvm->online_vcpus) - 1))
|
2009-06-09 12:56:29 +00:00
|
|
|
|
2015-11-05 08:03:50 +00:00
|
|
|
static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
|
|
|
|
{
|
2016-05-09 16:11:54 +00:00
|
|
|
struct kvm_vcpu *vcpu = NULL;
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2015-11-05 08:03:50 +00:00
|
|
|
|
2016-05-09 16:11:54 +00:00
|
|
|
if (id < 0)
|
2015-11-05 08:55:08 +00:00
|
|
|
return NULL;
|
2016-05-09 16:11:54 +00:00
|
|
|
if (id < KVM_MAX_VCPUS)
|
|
|
|
vcpu = kvm_get_vcpu(kvm, id);
|
2015-11-05 08:55:08 +00:00
|
|
|
if (vcpu && vcpu->vcpu_id == id)
|
|
|
|
return vcpu;
|
2015-11-05 08:03:50 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
|
|
|
if (vcpu->vcpu_id == id)
|
|
|
|
return vcpu;
|
|
|
|
return NULL;
|
|
|
|
}
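The lookup above first tries the entry whose index equals the requested ID and only then falls back to a full scan. A minimal userspace sketch of that two-step pattern, assuming a plain array instead of the xarray; `struct ex_ent`, `ex_table`, and `ex_find_by_id` are hypothetical names introduced for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU: only the ID matters here. */
struct ex_ent {
	int id;
};

/* Example table: entries 0 and 1 sit at their own index, 5 and 2 do not. */
const struct ex_ent ex_table[4] = { { 0 }, { 1 }, { 5 }, { 2 } };

const struct ex_ent *ex_find_by_id(const struct ex_ent *tab, int n, int id)
{
	int i;

	assert(tab != NULL && n >= 0);

	if (id < 0)
		return NULL;

	/* Fast path: the common case is that index and ID coincide. */
	if (id < n && tab[id].id == id)
		return &tab[id];

	/* Slow path: scan every entry. */
	for (i = 0; i < n; i++)
		if (tab[i].id == id)
			return &tab[i];

	return NULL;
}
```

With the example table, IDs 0 and 1 are found on the fast path, IDs 5 and 2 need the scan, and an absent or negative ID yields NULL.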
|
|
|
|
|
2021-11-16 16:03:57 +00:00
|
|
|
void kvm_destroy_vcpus(struct kvm *kvm);
|
2007-07-27 07:16:56 +00:00
|
|
|
|
2017-12-04 20:35:23 +00:00
|
|
|
void vcpu_load(struct kvm_vcpu *vcpu);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
void vcpu_put(struct kvm_vcpu *vcpu);
|
|
|
|
|
2014-11-20 12:45:31 +00:00
|
|
|
#ifdef __KVM_HAVE_IOAPIC
|
2017-04-07 08:50:33 +00:00
|
|
|
void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
|
2015-11-10 12:36:31 +00:00
|
|
|
void kvm_arch_post_irq_routing_update(struct kvm *kvm);
|
2014-11-20 12:45:31 +00:00
|
|
|
#else
|
2017-04-07 08:50:33 +00:00
|
|
|
static inline void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm)
|
2014-11-20 12:45:31 +00:00
|
|
|
{
|
|
|
|
}
|
2015-11-10 12:36:31 +00:00
|
|
|
static inline void kvm_arch_post_irq_routing_update(struct kvm *kvm)
|
2015-07-30 06:32:35 +00:00
|
|
|
{
|
|
|
|
}
|
2014-11-20 12:45:31 +00:00
|
|
|
#endif
|
|
|
|
|
2023-10-18 16:07:32 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2013-02-28 11:33:18 +00:00
|
|
|
int kvm_irqfd_init(void);
|
|
|
|
void kvm_irqfd_exit(void);
|
|
|
|
#else
|
|
|
|
static inline int kvm_irqfd_init(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_irqfd_exit(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
2022-11-30 23:09:16 +00:00
|
|
|
int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module);
|
2007-11-14 12:39:31 +00:00
|
|
|
void kvm_exit(void);
|
2006-12-10 10:21:36 +00:00
|
|
|
|
2008-03-30 13:01:25 +00:00
|
|
|
void kvm_get_kvm(struct kvm *kvm);
|
2021-06-25 15:32:07 +00:00
|
|
|
bool kvm_get_kvm_safe(struct kvm *kvm);
|
2008-03-30 13:01:25 +00:00
|
|
|
void kvm_put_kvm(struct kvm *kvm);
|
2021-04-08 22:32:14 +00:00
|
|
|
bool file_is_kvm(struct file *file);
|
2019-10-21 22:58:42 +00:00
|
|
|
void kvm_put_kvm_no_destroy(struct kvm *kvm);
|
2008-03-30 13:01:25 +00:00
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
|
2010-04-19 09:41:23 +00:00
|
|
|
{
|
2023-10-27 18:22:04 +00:00
|
|
|
as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
|
2017-07-07 13:49:00 +00:00
|
|
|
return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
|
2017-08-02 15:55:54 +00:00
|
|
|
lockdep_is_held(&kvm->slots_lock) ||
|
|
|
|
!refcount_read(&kvm->users_count));
|
2010-04-19 09:41:23 +00:00
|
|
|
}
|
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return __kvm_memslots(kvm, 0);
|
|
|
|
}
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2015-05-17 15:30:37 +00:00
|
|
|
int as_id = kvm_arch_vcpu_memslots_id(vcpu);
|
|
|
|
|
|
|
|
return __kvm_memslots(vcpu->kvm, as_id);
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
|
|
|
|
{
|
|
|
|
return RB_EMPTY_ROOT(&slots->gfn_tree);
|
|
|
|
}
|
|
|
|
|
2023-04-26 17:23:22 +00:00
|
|
|
bool kvm_are_all_memslots_empty(struct kvm *kvm);
|
|
|
|
|
2021-12-06 19:54:30 +00:00
|
|
|
#define kvm_for_each_memslot(memslot, bkt, slots) \
|
|
|
|
hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
|
|
|
|
if (WARN_ON_ONCE(!memslot->npages)) { \
|
|
|
|
} else
|
|
|
|
|
2020-02-18 21:07:31 +00:00
|
|
|
static inline
|
|
|
|
struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
|
2011-11-24 11:04:35 +00:00
|
|
|
{
|
2011-11-24 09:41:54 +00:00
|
|
|
struct kvm_memory_slot *slot;
|
2021-12-06 19:54:30 +00:00
|
|
|
int idx = slots->node_idx;
|
2011-11-24 09:40:57 +00:00
|
|
|
|
2021-12-06 19:54:30 +00:00
|
|
|
hash_for_each_possible(slots->id_hash, slot, id_node[idx], id) {
|
2021-12-06 19:54:27 +00:00
|
|
|
if (slot->id == id)
|
|
|
|
return slot;
|
|
|
|
}
|
2011-11-24 09:40:57 +00:00
|
|
|
|
2021-12-06 19:54:27 +00:00
|
|
|
return NULL;
|
2011-11-24 11:04:35 +00:00
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:32 +00:00
|
|
|
/* Iterator used for walking memslots that overlap a gfn range. */
|
|
|
|
struct kvm_memslot_iter {
|
|
|
|
struct kvm_memslots *slots;
|
|
|
|
struct rb_node *node;
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void kvm_memslot_iter_next(struct kvm_memslot_iter *iter)
|
|
|
|
{
|
|
|
|
iter->node = rb_next(iter->node);
|
|
|
|
if (!iter->node)
|
|
|
|
return;
|
|
|
|
|
|
|
|
iter->slot = container_of(iter->node, struct kvm_memory_slot, gfn_node[iter->slots->node_idx]);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_memslot_iter_start(struct kvm_memslot_iter *iter,
|
|
|
|
struct kvm_memslots *slots,
|
|
|
|
gfn_t start)
|
|
|
|
{
|
|
|
|
int idx = slots->node_idx;
|
|
|
|
struct rb_node *tmp;
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
|
|
|
|
iter->slots = slots;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the so-called "upper bound" of a key - the first node that has
|
|
|
|
* its key strictly greater than the searched one (the start gfn in our case).
|
|
|
|
*/
|
|
|
|
iter->node = NULL;
|
|
|
|
for (tmp = slots->gfn_tree.rb_node; tmp; ) {
|
|
|
|
slot = container_of(tmp, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
if (start < slot->base_gfn) {
|
|
|
|
iter->node = tmp;
|
|
|
|
tmp = tmp->rb_left;
|
|
|
|
} else {
|
|
|
|
tmp = tmp->rb_right;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the slot with the lowest gfn that can possibly intersect with
|
|
|
|
* the range, so we'll ideally have slot start <= range start
|
|
|
|
*/
|
|
|
|
if (iter->node) {
|
|
|
|
/*
|
|
|
|
* A NULL previous node means that the very first slot
|
|
|
|
* already has a higher start gfn.
|
|
|
|
* In this case slot start > range start.
|
|
|
|
*/
|
|
|
|
tmp = rb_prev(iter->node);
|
|
|
|
if (tmp)
|
|
|
|
iter->node = tmp;
|
|
|
|
} else {
|
|
|
|
/* a NULL node below means no slots */
|
|
|
|
iter->node = rb_last(&slots->gfn_tree);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (iter->node) {
|
|
|
|
iter->slot = container_of(iter->node, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It is possible in the slot start < range start case that the
|
|
|
|
* found slot ends before or at range start (slot end <= range start)
|
|
|
|
* and so it does not overlap the requested range.
|
|
|
|
*
|
|
|
|
* In such non-overlapping case the next slot (if it exists) will
|
|
|
|
* already have slot start > range start, otherwise the logic above
|
|
|
|
* would have found it instead of the current slot.
|
|
|
|
*/
|
|
|
|
if (iter->slot->base_gfn + iter->slot->npages <= start)
|
|
|
|
kvm_memslot_iter_next(iter);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_t end)
|
|
|
|
{
|
|
|
|
if (!iter->node)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this slot starts at or beyond the end of the range, so does
|
|
|
|
* every next one
|
|
|
|
*/
|
|
|
|
return iter->slot->base_gfn < end;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Iterate over each memslot at least partially intersecting [start, end) range */
|
|
|
|
#define kvm_for_each_memslot_in_gfn_range(iter, slots, start, end) \
|
|
|
|
for (kvm_memslot_iter_start(iter, slots, start); \
|
|
|
|
kvm_memslot_iter_is_valid(iter, end); \
|
|
|
|
kvm_memslot_iter_next(iter))
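The iterator above combines an "upper bound" search, one step back, and an end check to visit only the slots that intersect [start, end). A minimal userspace sketch of the same logic on a sorted array rather than an rbtree; `struct ex_range`, `ex_slots`, `ex_first_overlap`, and `ex_count_overlaps` are hypothetical names, and the array is assumed to be sorted by base with no overlapping entries:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memslot: covers [base, base + npages). */
struct ex_range {
	uint64_t base;
	uint64_t npages;
};

/* Example set: [0, 10), [20, 30), [40, 50). */
const struct ex_range ex_slots[3] = { { 0, 10 }, { 20, 10 }, { 40, 10 } };

/* Index of the first range that can overlap a range starting at "start". */
size_t ex_first_overlap(const struct ex_range *r, size_t n, uint64_t start)
{
	size_t lo = 0, hi = n;

	/* Upper bound: first index whose base is strictly greater than start. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (start < r[mid].base)
			hi = mid;
		else
			lo = mid + 1;
	}

	/* Step back: the previous entry is the only one with base <= start. */
	if (lo > 0)
		lo--;

	/* Skip it if it ends at or before start, so it cannot overlap. */
	if (lo < n && r[lo].base + r[lo].npages <= start)
		lo++;

	return lo;
}

/* Count the ranges that at least partially intersect [start, end). */
size_t ex_count_overlaps(const struct ex_range *r, size_t n,
			 uint64_t start, uint64_t end)
{
	size_t i, count = 0;

	assert(start < end);
	/* Once a range starts at or beyond end, so does every later one. */
	for (i = ex_first_overlap(r, n, start); i < n && r[i].base < end; i++)
		count++;

	return count;
}
```

On the example set, the query [5, 25) touches the first two ranges, while [10, 20) falls entirely in a gap and touches none.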
|
|
|
|
|
2024-10-10 18:23:44 +00:00
|
|
|
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
|
|
|
|
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
|
|
|
|
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
|
|
|
|
|
2013-02-27 10:43:44 +00:00
|
|
|
/*
|
|
|
|
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
|
|
|
|
* - create a new memory slot
|
|
|
|
* - delete an existing memory slot
|
|
|
|
* - modify an existing memory slot
|
|
|
|
* -- move it in the guest physical memory space
|
|
|
|
* -- just change its flags
|
|
|
|
*
|
|
|
|
* Since flags can be changed by some of these operations, the following
|
|
|
|
* differentiation is the best we can do for __kvm_set_memory_region():
|
|
|
|
*/
|
|
|
|
enum kvm_mr_change {
|
|
|
|
KVM_MR_CREATE,
|
|
|
|
KVM_MR_DELETE,
|
|
|
|
KVM_MR_MOVE,
|
|
|
|
KVM_MR_FLAGS_ONLY,
|
|
|
|
};
|
|
|
|
|
2007-10-24 21:52:57 +00:00
|
|
|
int kvm_set_memory_region(struct kvm *kvm,
|
2023-10-27 18:21:50 +00:00
|
|
|
const struct kvm_userspace_memory_region2 *mem);
|
2007-10-29 01:40:42 +00:00
|
|
|
int __kvm_set_memory_region(struct kvm *kvm,
|
2023-10-27 18:21:50 +00:00
|
|
|
const struct kvm_userspace_memory_region2 *mem);
|
2020-02-18 21:07:27 +00:00
|
|
|
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
|
2019-02-05 20:54:17 +00:00
|
|
|
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
|
2009-12-23 16:35:18 +00:00
|
|
|
int kvm_arch_prepare_memory_region(struct kvm *kvm,
|
2021-12-06 19:54:11 +00:00
|
|
|
const struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new,
|
2013-02-27 10:44:34 +00:00
|
|
|
enum kvm_mr_change change);
|
2009-12-23 16:35:18 +00:00
|
|
|
void kvm_arch_commit_memory_region(struct kvm *kvm,
|
2020-02-18 21:07:24 +00:00
|
|
|
struct kvm_memory_slot *old,
|
2015-05-18 11:20:23 +00:00
|
|
|
const struct kvm_memory_slot *new,
|
2013-02-27 10:45:25 +00:00
|
|
|
enum kvm_mr_change change);
|
2012-08-24 18:54:57 +00:00
|
|
|
/* flush all memory translations */
|
|
|
|
void kvm_arch_flush_shadow_all(struct kvm *kvm);
|
|
|
|
/* flush memory translations pointing to 'slot' */
|
|
|
|
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *slot);
|
2009-12-23 16:35:23 +00:00
|
|
|
|
2024-10-10 18:23:13 +00:00
|
|
|
int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
struct page **pages, int nr_pages);
|
2010-08-22 11:11:43 +00:00
|
|
|
|
2024-10-10 18:24:18 +00:00
|
|
|
struct page *__gfn_to_page(struct kvm *kvm, gfn_t gfn, bool write);
|
|
|
|
static inline struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return __gfn_to_page(kvm, gfn, true);
|
|
|
|
}
|
|
|
|
|
2008-02-23 14:44:30 +00:00
|
|
|
unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
|
2013-09-09 11:52:33 +00:00
|
|
|
unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
|
2012-08-21 03:02:51 +00:00
|
|
|
unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
|
2014-08-19 10:15:00 +00:00
|
|
|
unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
bool *writable);
|
2024-10-10 18:23:05 +00:00
|
|
|
|
|
|
|
static inline void kvm_release_page_unused(struct page *page)
|
|
|
|
{
|
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
|
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
|
2007-11-20 09:49:33 +00:00
|
|
|
void kvm_release_page_clean(struct page *page);
|
|
|
|
void kvm_release_page_dirty(struct page *page);
|
2008-04-02 19:46:56 +00:00
|
|
|
|
2024-10-10 18:23:51 +00:00
|
|
|
static inline void kvm_release_faultin_page(struct kvm *kvm, struct page *page,
|
|
|
|
bool unused, bool dirty)
|
|
|
|
{
|
|
|
|
lockdep_assert_once(lockdep_is_held(&kvm->mmu_lock) || unused);
|
|
|
|
|
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page that KVM got from the *primary MMU* is writable, and KVM
|
|
|
|
* installed or reused a SPTE, mark the page/folio dirty. Note, this
|
|
|
|
* may mark a folio dirty even if KVM created a read-only SPTE, e.g. if
|
|
|
|
* the GFN is write-protected. Folios can't be safely marked dirty
|
|
|
|
* outside of mmu_lock as doing so could race with writeback on the
|
|
|
|
* folio. As a result, KVM can't mark folios dirty in the fast page
|
|
|
|
* fault handler, and so KVM must (somewhat) speculatively mark the
|
|
|
|
* folio dirty if KVM could locklessly make the SPTE writable.
|
|
|
|
*/
|
|
|
|
if (unused)
|
|
|
|
kvm_release_page_unused(page);
|
|
|
|
else if (dirty)
|
|
|
|
kvm_release_page_dirty(page);
|
|
|
|
else
|
|
|
|
kvm_release_page_clean(page);
|
|
|
|
}
|
|
|
|
|
2024-10-10 18:23:45 +00:00
|
|
|
kvm_pfn_t __kvm_faultin_pfn(const struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
unsigned int foll, bool *writable,
|
|
|
|
struct page **refcounted_page);
|
|
|
|
|
|
|
|
static inline kvm_pfn_t kvm_faultin_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
|
|
|
|
bool write, bool *writable,
|
|
|
|
struct page **refcounted_page)
|
|
|
|
{
|
|
|
|
return __kvm_faultin_pfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn,
|
|
|
|
write ? FOLL_WRITE : 0, writable, refcounted_page);
|
|
|
|
}
|
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
|
|
|
|
int len);
|
|
|
|
int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len);
|
2020-05-25 14:41:19 +00:00
|
|
|
int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len);
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
|
|
|
|
int offset, int len);
|
|
|
|
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
|
|
|
|
unsigned long len);
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len);
|
|
|
|
int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
2018-12-14 22:34:43 +00:00
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len);
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
gpa_t gpa, unsigned long len);
|
2019-10-21 15:28:17 +00:00
|
|
|
|
2020-08-04 17:06:02 +00:00
|
|
|
#define __kvm_get_guest(kvm, gfn, offset, v) \
|
|
|
|
({ \
|
|
|
|
unsigned long __addr = gfn_to_hva(kvm, gfn); \
|
|
|
|
typeof(v) __user *__uaddr = (typeof(__uaddr))(__addr + offset); \
|
|
|
|
int __ret = -EFAULT; \
|
|
|
|
\
|
|
|
|
if (!kvm_is_error_hva(__addr)) \
|
|
|
|
__ret = get_user(v, __uaddr); \
|
|
|
|
__ret; \
|
|
|
|
})
|
|
|
|
|
|
|
|
#define kvm_get_guest(kvm, gpa, v) \
|
|
|
|
({ \
|
|
|
|
gpa_t __gpa = gpa; \
|
|
|
|
struct kvm *__kvm = kvm; \
|
|
|
|
\
|
|
|
|
__kvm_get_guest(__kvm, __gpa >> PAGE_SHIFT, \
|
|
|
|
offset_in_page(__gpa), v); \
|
|
|
|
})
|
|
|
|
|
2020-08-04 17:06:01 +00:00
|
|
|
#define __kvm_put_guest(kvm, gfn, offset, v) \
|
2019-10-21 15:28:17 +00:00
|
|
|
({ \
|
|
|
|
unsigned long __addr = gfn_to_hva(kvm, gfn); \
|
2020-08-04 17:06:01 +00:00
|
|
|
typeof(v) __user *__uaddr = (typeof(__uaddr))(__addr + offset); \
|
2019-10-21 15:28:17 +00:00
|
|
|
int __ret = -EFAULT; \
|
|
|
|
\
|
|
|
|
if (!kvm_is_error_hva(__addr)) \
|
2020-08-04 17:06:01 +00:00
|
|
|
__ret = put_user(v, __uaddr); \
|
2019-10-21 15:28:17 +00:00
|
|
|
if (!__ret) \
|
|
|
|
mark_page_dirty(kvm, gfn); \
|
|
|
|
__ret; \
|
|
|
|
})
|
|
|
|
|
2020-08-04 17:06:01 +00:00
|
|
|
#define kvm_put_guest(kvm, gpa, v) \
|
2019-10-21 15:28:17 +00:00
|
|
|
({ \
|
|
|
|
gpa_t __gpa = gpa; \
|
|
|
|
struct kvm *__kvm = kvm; \
|
2020-08-04 17:06:01 +00:00
|
|
|
\
|
2019-10-21 15:28:17 +00:00
|
|
|
__kvm_put_guest(__kvm, __gpa >> PAGE_SHIFT, \
|
2020-08-04 17:06:01 +00:00
|
|
|
offset_in_page(__gpa), v); \
|
2019-10-21 15:28:17 +00:00
|
|
|
})
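The get/put macros above split a guest physical address into a frame number (`gpa >> PAGE_SHIFT`) and an offset within the page. A minimal userspace sketch of that arithmetic, assuming 4 KiB pages; the `EX_`/`ex_` names are hypothetical, and in the kernel `PAGE_SHIFT` and `offset_in_page()` come from the architecture headers:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch only: 4 KiB pages. */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (UINT64_C(1) << EX_PAGE_SHIFT)

/* Guest frame number: the address with the in-page bits shifted out. */
uint64_t ex_gpa_to_gfn(uint64_t gpa)
{
	return gpa >> EX_PAGE_SHIFT;
}

/* Byte offset of the address within its page. */
uint64_t ex_offset_in_page(uint64_t gpa)
{
	return gpa & (EX_PAGE_SIZE - 1);
}

/* Inverse operation: rebuild the address from frame number and offset. */
uint64_t ex_gfn_to_gpa(uint64_t gfn, uint64_t offset)
{
	assert(offset < EX_PAGE_SIZE);
	return (gfn << EX_PAGE_SHIFT) | offset;
}
```

Under the 4 KiB assumption, address 0x12345 splits into frame 0x12 and offset 0x345, and recombining the two gives back the original address.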
|
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
|
2015-11-14 03:21:06 +00:00
|
|
|
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
|
2020-07-08 14:00:23 +00:00
|
|
|
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
|
2020-01-08 20:24:37 +00:00
|
|
|
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
|
2021-11-15 23:45:58 +00:00
|
|
|
void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
|
2006-12-10 10:21:36 +00:00
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);

2024-10-10 18:23:35 +00:00
int __kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map,
		   bool writable);
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map);

static inline int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa,
			       struct kvm_host_map *map)
{
	return __kvm_vcpu_map(vcpu, gpa, map, true);
}

static inline int kvm_vcpu_map_readonly(struct kvm_vcpu *vcpu, gpa_t gpa,
					struct kvm_host_map *map)
{
	return __kvm_vcpu_map(vcpu, gpa, map, false);
}

2015-05-17 11:58:53 +00:00
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
			     int len);
int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
			       unsigned long len);
int kvm_vcpu_read_guest(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
			unsigned long len);
int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, const void *data,
			      int offset, int len);
int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
			 unsigned long len);
void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);

KVM: Reinstate gfn_to_pfn_cache with invalidation support

This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.

For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.

Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.

Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.

This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin; fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.

As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 16:36:21 +00:00
/**
2022-10-13 21:12:19 +00:00
 * kvm_gpc_init - initialize gfn_to_pfn_cache.
 *
 * @gpc: struct gfn_to_pfn_cache object.
2021-12-10 16:36:21 +00:00
 * @kvm: pointer to kvm instance.
2022-10-13 21:12:24 +00:00
 *
 * This sets up a gfn_to_pfn_cache by initializing locks and assigning the
 * immutable attributes. Note, the cache must be zero-allocated (or zeroed by
 * the caller before init).
 */
2024-02-15 15:29:00 +00:00
void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm);
2022-10-13 21:12:24 +00:00

/**
 * kvm_gpc_activate - prepare a cached kernel mapping and HPA for a given guest
 *                    physical address.
 *
 * @gpc: struct gfn_to_pfn_cache object.
2021-12-10 16:36:21 +00:00
 * @gpa: guest physical address to map.
 * @len: sanity check; the range being accessed must fit a single page.
 *
 * @return: 0 for success.
 *          -EINVAL for a mapping which would cross a page boundary.
2022-10-13 21:12:24 +00:00
 *          -EFAULT for an untranslatable guest physical address.
2021-12-10 16:36:21 +00:00
 *
2022-10-13 21:12:24 +00:00
 * This primes a gfn_to_pfn_cache and links it into the @gpc->kvm's list for
2022-10-13 21:12:22 +00:00
 * invalidations to be processed. Callers are required to use kvm_gpc_check()
 * to ensure that the cache is valid before accessing the target page.
2021-12-10 16:36:21 +00:00
 */
2022-10-13 21:12:24 +00:00
int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len);
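The @len sanity check documented for kvm_gpc_activate() can be illustrated with a small userspace sketch. This is not the kernel's implementation: SKETCH_PAGE_SIZE and the sketch_* helpers are hypothetical names introduced here, and a 4 KiB page is assumed purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real page size depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096UL

/* Byte offset of @addr within its page. */
static unsigned long sketch_offset_in_page(uint64_t addr)
{
	/* The mask below only works when the page size is a power of two. */
	assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
	return (unsigned long)(addr & (SKETCH_PAGE_SIZE - 1));
}

/* True if the @len bytes starting at @addr all lie within one page. */
static bool sketch_fits_single_page(uint64_t addr, unsigned long len)
{
	/* Written as a subtraction so that a huge @len cannot overflow. */
	return len <= SKETCH_PAGE_SIZE - sketch_offset_in_page(addr);
}

/* Hypothetical front end: reject a range that crosses a page boundary. */
static int sketch_check_len(uint64_t addr, unsigned long len)
{
	return sketch_fits_single_page(addr, len) ? 0 : -EINVAL;
}
```

Under these assumptions an 8-byte access at 0x1ff8 ends exactly at the page boundary and is accepted, while the same access at 0x1ffc spills into the next page and is rejected with -EINVAL.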
2021-12-10 16:36:21 +00:00

2024-02-15 15:29:04 +00:00
/**
 * kvm_gpc_activate_hva - prepare a cached kernel mapping and HPA for a given HVA.
 *
 * @gpc: struct gfn_to_pfn_cache object.
 * @hva: userspace virtual address to map.
 * @len: sanity check; the range being accessed must fit a single page.
 *
 * @return: 0 for success.
 *          -EINVAL for a mapping which would cross a page boundary.
 *          -EFAULT for an untranslatable guest physical address.
 *
 * The semantics of this function are the same as those of kvm_gpc_activate(). It
 * merely bypasses a layer of address translation.
 */
int kvm_gpc_activate_hva(struct gfn_to_pfn_cache *gpc, unsigned long hva, unsigned long len);

2021-12-10 16:36:21 +00:00
/**
2022-10-13 21:12:22 +00:00
 * kvm_gpc_check - check validity of a gfn_to_pfn_cache.
2021-12-10 16:36:21 +00:00
 *
 * @gpc: struct gfn_to_pfn_cache object.
 * @len: sanity check; the range being accessed must fit a single page.
 *
 * @return: %true if the cache is still valid and the address matches.
 *          %false if the cache is not valid.
 *
 * Callers outside IN_GUEST_MODE context should hold a read lock on @gpc->lock
 * while calling this function, and then continue to hold the lock until the
 * access is complete.
 *
 * Callers in IN_GUEST_MODE may do so without locking, although they should
 * still hold a read lock on kvm->srcu for the memslot checks.
 */
2022-10-13 21:12:31 +00:00
bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len);
2021-12-10 16:36:21 +00:00

/**
2022-10-13 21:12:22 +00:00
 * kvm_gpc_refresh - update a previously initialized cache.
2021-12-10 16:36:21 +00:00
 *
 * @gpc: struct gfn_to_pfn_cache object.
 * @len: sanity check; the range being accessed must fit a single page.
 *
 * @return: 0 for success.
 *          -EINVAL for a mapping which would cross a page boundary.
2022-10-13 21:12:28 +00:00
 *          -EFAULT for an untranslatable guest physical address.
2021-12-10 16:36:21 +00:00
 *
 * This will attempt to refresh a gfn_to_pfn_cache. Note that a successful
2022-10-13 21:12:28 +00:00
 * return from this function does not mean the page can be immediately
2021-12-10 16:36:21 +00:00
 * accessed because it may have raced with an invalidation. Callers must
 * still lock and check the cache status, as this function does not return
 * with the lock still held to permit access.
 */
2022-10-13 21:12:31 +00:00
int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len);
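The calling convention described for kvm_gpc_check() and kvm_gpc_refresh() (check under the read lock, drop the lock to refresh, then lock and check again) can be sketched as a retry loop. The code below is a userspace mock under stated assumptions, not kernel code: struct mock_gpc and every mock_* function are hypothetical stand-ins, the lock is modelled by a simple counter rather than a real rwlock, and the 64-byte buffer merely plays the role of the mapped page.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct gfn_to_pfn_cache. */
struct mock_gpc {
	bool valid;		/* cleared by an (imaginary) invalidation */
	int read_locked;	/* models holding a read lock on gpc->lock */
	int refreshes;		/* how often the mapping was re-established */
	unsigned char page[64];	/* plays the role of the mapped page */
};

static void mock_read_lock(struct mock_gpc *gpc)   { gpc->read_locked++; }
static void mock_read_unlock(struct mock_gpc *gpc) { gpc->read_locked--; }

/* Models the check step: the caller must already hold the read lock. */
static bool mock_gpc_check(struct mock_gpc *gpc, unsigned long len)
{
	assert(gpc->read_locked > 0);
	return gpc->valid && len <= sizeof(gpc->page);
}

/* Models the refresh step: called unlocked, returns 0 or a negative errno. */
static int mock_gpc_refresh(struct mock_gpc *gpc, unsigned long len)
{
	assert(gpc->read_locked == 0);
	if (len > sizeof(gpc->page))
		return -EINVAL;
	gpc->valid = true;
	gpc->refreshes++;
	return 0;
}

/* Copy @len bytes out of the cached page, refreshing the cache if needed. */
static int mock_read_cached(struct mock_gpc *gpc, void *dst, unsigned long len)
{
	int ret;

	mock_read_lock(gpc);
	while (!mock_gpc_check(gpc, len)) {
		/* The refresh step must not run with the lock held. */
		mock_read_unlock(gpc);
		ret = mock_gpc_refresh(gpc, len);
		if (ret)
			return ret;
		/* A successful refresh is not enough: lock and check again. */
		mock_read_lock(gpc);
	}
	memcpy(dst, gpc->page, len);	/* access only while the lock is held */
	mock_read_unlock(gpc);
	return 0;
}
```

The loop re-checks after every refresh because, as the comment on kvm_gpc_refresh() notes, a successful refresh may already have raced with another invalidation by the time the lock is taken again.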
2021-12-10 16:36:21 +00:00

/**
2022-10-13 21:12:19 +00:00
 * kvm_gpc_deactivate - deactivate and unlink a gfn_to_pfn_cache.
2021-12-10 16:36:21 +00:00
 *
 * @gpc: struct gfn_to_pfn_cache object.
 *
2022-10-13 21:12:24 +00:00
 * This removes a cache from the VM's list to be processed on MMU notifier
2021-12-10 16:36:21 +00:00
 * invocation.
 */
2022-10-13 21:12:24 +00:00
void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc);
|
2021-12-10 16:36:21 +00:00
|
|
|
|
2024-02-15 15:29:04 +00:00
|
|
|
static inline bool kvm_gpc_is_gpa_active(struct gfn_to_pfn_cache *gpc)
|
|
|
|
{
|
|
|
|
return gpc->active && !kvm_is_error_gpa(gpc->gpa);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_gpc_is_hva_active(struct gfn_to_pfn_cache *gpc)
|
|
|
|
{
|
|
|
|
return gpc->active && kvm_is_error_gpa(gpc->gpa);
|
|
|
|
}
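The two predicates above partition an active cache into a GPA-based case and an HVA-based case by testing whether the stored GPA is the error sentinel. The following is a minimal userspace sketch of that partition, not the kernel's implementation: `mock_gpc`, `MOCK_INVALID_GPA`, and every `mock_*` helper are hypothetical stand-ins, and the all-ones sentinel is an assumption about how `kvm_is_error_gpa()` decides, since its definition is not part of this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* Assumption: an all-ones value marks "no valid GPA" in this sketch. */
#define MOCK_INVALID_GPA (~(gpa_t)0)

/* Hypothetical stand-in holding only the two fields the predicates read. */
struct mock_gpc {
	bool active;
	gpa_t gpa;
};

static bool mock_is_error_gpa(gpa_t gpa)
{
	return gpa == MOCK_INVALID_GPA;
}

/* Active and keyed by a guest physical address. */
static bool mock_gpc_is_gpa_active(const struct mock_gpc *gpc)
{
	return gpc->active && !mock_is_error_gpa(gpc->gpa);
}

/* Active, but with no valid GPA recorded (keyed by a host address). */
static bool mock_gpc_is_hva_active(const struct mock_gpc *gpc)
{
	return gpc->active && mock_is_error_gpa(gpc->gpa);
}

/* The two predicates can never both hold for the same cache state. */
static void mock_gpc_check_exclusive(const struct mock_gpc *gpc)
{
	assert(!(mock_gpc_is_gpa_active(gpc) && mock_gpc_is_hva_active(gpc)));
}
```

An inactive cache satisfies neither predicate, so callers that need "active in any mode" would still have to test the `active` flag directly.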
|
|
|
|
|
2017-11-24 21:39:01 +00:00
|
|
|
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
|
|
|
|
|
2021-10-09 02:12:06 +00:00
|
|
|
void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
|
2021-10-09 02:12:07 +00:00
|
|
|
bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
|
2015-08-27 14:41:15 +00:00
|
|
|
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
|
2017-04-26 20:32:26 +00:00
|
|
|
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
|
2012-03-08 21:44:24 +00:00
|
|
|
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
|
2014-05-23 10:20:42 +00:00
|
|
|
int kvm_vcpu_yield_to(struct kvm_vcpu *target);
|
2022-09-20 06:02:10 +00:00
|
|
|
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
|
2010-11-23 03:13:00 +00:00
|
|
|
|
2007-06-07 16:18:30 +00:00
|
|
|
void kvm_flush_remote_tlbs(struct kvm *kvm);
|
2023-08-11 04:51:18 +00:00
|
|
|
void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
|
2023-08-11 04:51:19 +00:00
|
|
|
void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *memslot);
|
2018-05-16 15:21:28 +00:00
|
|
|
|
2020-07-03 02:35:39 +00:00
|
|
|
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
|
|
|
|
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
|
2022-06-22 19:27:08 +00:00
|
|
|
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
|
2020-07-03 02:35:39 +00:00
|
|
|
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
|
|
|
|
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
|
|
|
|
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
|
|
|
|
#endif
|
|
|
|
|
2023-10-27 18:21:45 +00:00
|
|
|
void kvm_mmu_invalidate_begin(struct kvm *kvm);
|
|
|
|
void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
|
|
|
|
void kvm_mmu_invalidate_end(struct kvm *kvm);
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides a clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
|
2021-08-10 20:52:39 +00:00
|
|
|
|
2007-10-10 15:16:19 +00:00
|
|
|
long kvm_arch_dev_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture-independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
long kvm_arch_vcpu_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
2018-04-18 19:19:58 +00:00
|
|
|
vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
|
2007-11-15 15:07:47 +00:00
|
|
|
|
2014-07-14 16:27:35 +00:00
|
|
|
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
|
2007-11-15 15:07:47 +00:00
|
|
|
|
2015-01-28 02:54:23 +00:00
|
|
|
void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
|
2015-01-15 23:58:53 +00:00
|
|
|
struct kvm_memory_slot *slot,
|
|
|
|
gfn_t gfn_offset,
|
|
|
|
unsigned long mask);
|
2020-02-18 21:07:29 +00:00
|
|
|
void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
|
|
|
|
|
2023-08-11 04:51:19 +00:00
|
|
|
#ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
2020-02-18 21:07:29 +00:00
|
|
|
int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log);
|
|
|
|
int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
int *is_dirty, struct kvm_memory_slot **memslot);
|
2020-02-18 21:07:29 +00:00
|
|
|
#endif
|
2007-11-18 12:29:43 +00:00
|
|
|
|
2013-04-11 11:21:40 +00:00
|
|
|
int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
|
|
|
|
bool line_status);
|
2017-02-16 09:40:56 +00:00
|
|
|
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
|
|
|
|
struct kvm_enable_cap *cap);
|
2023-02-08 14:01:05 +00:00
|
|
|
int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
|
2022-10-17 18:45:39 +00:00
|
|
|
long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2007-10-11 17:16:52 +00:00
|
|
|
|
2007-10-31 22:24:25 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu);
|
|
|
|
|
2007-11-16 05:05:55 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_translation *tr);
|
|
|
|
|
2007-11-01 19:16:10 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
|
|
|
|
int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_sregs *sregs);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_sregs *sregs);
|
2008-04-11 16:24:45 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_mp_state *mp_state);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_mp_state *mp_state);
|
2008-12-15 12:52:10 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_guest_debug *dbg);
|
2020-04-16 05:10:57 +00:00
|
|
|
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
|
2007-11-01 19:16:10 +00:00
|
|
|
|
2007-11-14 12:38:21 +00:00
|
|
|
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
|
|
|
|
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
|
2019-12-18 21:55:09 +00:00
|
|
|
int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id);
|
2019-12-18 21:55:15 +00:00
|
|
|
int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
|
2014-12-04 14:47:07 +00:00
|
|
|
void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
|
2007-11-19 20:04:43 +00:00
|
|
|
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
|
2007-11-14 12:38:21 +00:00
|
|
|
|
2021-06-06 02:10:44 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
|
|
|
|
int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state);
|
|
|
|
#endif
|
|
|
|
|
2019-08-03 06:14:25 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
|
2020-06-04 13:16:52 +00:00
|
|
|
void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
|
2022-05-23 19:03:27 +00:00
|
|
|
#else
|
|
|
|
static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
|
2019-08-03 06:14:25 +00:00
|
|
|
#endif
|
2016-09-07 18:47:23 +00:00
|
|
|
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
2024-08-30 04:35:58 +00:00
|
|
|
/*
|
|
|
|
* kvm_arch_{enable,disable}_virtualization() are called on one CPU, under
|
|
|
|
* kvm_usage_lock, immediately after/before 0=>1 and 1=>0 transitions of
|
|
|
|
* kvm_usage_count, i.e. at the beginning of the generic hardware enabling
|
|
|
|
* sequence, and at the end of the generic hardware disabling sequence.
|
|
|
|
*/
|
|
|
|
void kvm_arch_enable_virtualization(void);
|
|
|
|
void kvm_arch_disable_virtualization(void);
|
|
|
|
/*
|
|
|
|
* kvm_arch_{enable,disable}_virtualization_cpu() are called on "every" CPU to
|
|
|
|
* do the actual twiddling of hardware bits. The hooks are called on all
|
|
|
|
* online CPUs when KVM enables/disables virtualization, and on a single CPU
|
|
|
|
* when that CPU is onlined/offlined (including for Resume/Suspend).
|
|
|
|
*/
|
2024-08-30 04:35:54 +00:00
|
|
|
int kvm_arch_enable_virtualization_cpu(void);
|
|
|
|
void kvm_arch_disable_virtualization_cpu(void);
|
2022-11-30 23:09:33 +00:00
|
|
|
#endif
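The comment above ties the VM-wide hooks to the 0=>1 and 1=>0 transitions of a usage count taken under a lock. A minimal userspace sketch of that pattern follows, assuming a pthread mutex in place of kvm_usage_lock. All `mock_*` names and the call counters are hypothetical, and the per-CPU work is only indicated by comments.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for kvm_usage_lock and kvm_usage_count. */
static pthread_mutex_t usage_lock = PTHREAD_MUTEX_INITIALIZER;
static int usage_count;

/* Counters so the sketch can observe how often each hook runs. */
static int arch_enable_calls;
static int arch_disable_calls;

static void mock_arch_enable_virtualization(void)
{
	arch_enable_calls++;
}

static void mock_arch_disable_virtualization(void)
{
	arch_disable_calls++;
}

/* Only the first user (0 => 1) triggers the VM-wide enable hook. */
static void mock_enable_virtualization(void)
{
	pthread_mutex_lock(&usage_lock);
	if (usage_count++ == 0) {
		mock_arch_enable_virtualization();
		/* Per-CPU enabling would follow here. */
	}
	pthread_mutex_unlock(&usage_lock);
}

/* Only the last user (1 => 0) triggers the VM-wide disable hook. */
static void mock_disable_virtualization(void)
{
	pthread_mutex_lock(&usage_lock);
	assert(usage_count > 0);
	if (--usage_count == 0) {
		/* Per-CPU disabling would come first. */
		mock_arch_disable_virtualization();
	}
	pthread_mutex_unlock(&usage_lock);
}
```

Placing the VM-wide enable hook before the per-CPU step, and the disable hook after it, mirrors the "beginning of the enabling sequence" and "end of the disabling sequence" wording in the comment.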
|
2007-12-14 01:35:10 +00:00
|
|
|
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
|
2017-08-08 04:05:32 +00:00
|
|
|
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
|
2012-03-08 21:44:24 +00:00
|
|
|
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
|
2019-08-05 02:03:19 +00:00
|
|
|
bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
|
2021-04-16 03:08:10 +00:00
|
|
|
bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
|
2024-01-10 00:39:35 +00:00
|
|
|
bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu);
|
2020-02-13 17:22:55 +00:00
|
|
|
int kvm_arch_post_init_vm(struct kvm *kvm);
|
|
|
|
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
|
2024-02-16 15:59:41 +00:00
|
|
|
void kvm_arch_create_vm_debugfs(struct kvm *kvm);
|
2007-11-14 12:38:21 +00:00
|
|
|
|
2010-11-09 16:02:49 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
|
2018-05-15 11:37:37 +00:00
|
|
|
/*
|
|
|
|
* All architectures that want to use vzalloc currently also
|
|
|
|
* need their own kvm_arch_alloc_vm implementation.
|
|
|
|
*/
|
2010-11-09 16:02:49 +00:00
|
|
|
static inline struct kvm *kvm_arch_alloc_vm(void)
|
|
|
|
{
|
2022-11-17 20:34:19 +00:00
|
|
|
return kzalloc(sizeof(struct kvm), GFP_KERNEL_ACCOUNT);
|
2010-11-09 16:02:49 +00:00
|
|
|
}
|
2021-09-03 13:08:05 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline void __kvm_arch_free_vm(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvfree(kvm);
|
|
|
|
}
|
2010-11-09 16:02:49 +00:00
|
|
|
|
2021-09-03 13:08:05 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_VM_FREE
|
2010-11-09 16:02:49 +00:00
|
|
|
static inline void kvm_arch_free_vm(struct kvm *kvm)
|
|
|
|
{
|
2021-09-03 13:08:05 +00:00
|
|
|
__kvm_arch_free_vm(kvm);
|
2010-11-09 16:02:49 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2023-08-11 04:51:14 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
|
|
|
|
static inline int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
|
2018-07-19 08:40:17 +00:00
|
|
|
{
|
|
|
|
return -ENOTSUPP;
|
|
|
|
}
|
2023-08-11 04:51:15 +00:00
|
|
|
#else
|
|
|
|
int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
|
2018-07-19 08:40:17 +00:00
|
|
|
#endif
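The default stub above returns an error code so that a caller can tell that no architecture-specific remote flush exists. One plausible way for a generic caller to consume such a hook is sketched below in userspace C. This is an illustration under stated assumptions, not the kernel's flush path: the `mock_*` functions are hypothetical, a function pointer stands in for the compile-time `#ifndef` selection, and `EOPNOTSUPP` from `<errno.h>` is used because the kernel-internal `ENOTSUPP` is not available to userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Counts how many times the generic fallback ran. */
static int generic_flushes;

/* Default stub: the architecture provides no optimized remote flush. */
static int mock_arch_flush_default(void)
{
	return -EOPNOTSUPP;
}

/* A hypothetical architecture-specific implementation that succeeds. */
static int mock_arch_flush_supported(void)
{
	return 0;
}

/* Hypothetical generic path, e.g. asking every vCPU to flush. */
static void mock_generic_flush(void)
{
	generic_flushes++;
}

/*
 * Try the arch hook first and fall back when it reports an error.
 * Returns true if the generic fallback was used.
 */
static bool mock_flush_remote_tlbs(int (*arch_flush)(void))
{
	if (arch_flush() == 0)
		return false;
	mock_generic_flush();
	return true;
}
```

Returning an error from the stub, rather than making it a silent no-op, keeps the fallback decision in the caller.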
|
|
|
|
|
2023-08-11 04:51:18 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
|
|
|
|
static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
|
|
|
|
gfn_t gfn, u64 nr_pages)
|
|
|
|
{
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
|
|
|
|
#endif
|
|
|
|
|
2013-10-30 17:02:30 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_NONCOHERENT_DMA
|
|
|
|
void kvm_arch_register_noncoherent_dma(struct kvm *kvm);
|
|
|
|
void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm);
|
|
|
|
bool kvm_arch_has_noncoherent_dma(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
2015-07-07 13:41:58 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_ASSIGNED_DEVICE
|
|
|
|
void kvm_arch_start_assignment(struct kvm *kvm);
|
|
|
|
void kvm_arch_end_assignment(struct kvm *kvm);
|
|
|
|
bool kvm_arch_has_assigned_device(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline void kvm_arch_start_assignment(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_arch_end_assignment(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2022-06-14 21:15:32 +00:00
|
|
|
static __always_inline bool kvm_arch_has_assigned_device(struct kvm *kvm)
|
2015-07-07 13:41:58 +00:00
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
2013-10-30 17:02:30 +00:00
|
|
|
|
2020-04-24 05:48:37 +00:00
|
|
|
static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
|
2012-03-08 21:44:24 +00:00
|
|
|
{
|
2012-03-13 21:35:01 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_WQP
|
2020-04-24 05:48:37 +00:00
|
|
|
return vcpu->arch.waitp;
|
2012-03-13 21:35:01 +00:00
|
|
|
#else
|
2020-04-24 05:48:37 +00:00
|
|
|
return &vcpu->wait;
|
2012-03-08 21:44:24 +00:00
|
|
|
#endif
|
2012-03-13 21:35:01 +00:00
|
|
|
}
|
2012-03-08 21:44:24 +00:00
|
|
|
|
2021-10-09 02:12:12 +00:00
|
|
|
/*
|
|
|
|
* Wake a vCPU if necessary, but don't do any stats/metadata updates. Returns
|
|
|
|
* true if the vCPU was blocking and was awakened, false otherwise.
|
|
|
|
*/
|
|
|
|
static inline bool __kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return !!rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
|
|
|
|
}
|
|
|
|
|
2015-03-04 10:14:33 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
|
|
|
|
/*
|
|
|
|
* Returns true if the virtual interrupt controller is initialized and
|
|
|
|
* ready to accept virtual IRQs. On some architectures the virtual interrupt
|
|
|
|
* controller is dynamically instantiated and this is not always true.
|
|
|
|
*/
|
|
|
|
bool kvm_arch_intc_initialized(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-11-11 02:07:33 +00:00
|
|
|
#ifdef CONFIG_GUEST_PERF_EVENTS
|
|
|
|
unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
|
|
|
|
|
|
|
|
void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
|
|
|
|
void kvm_unregister_perf_callbacks(void);
|
|
|
|
#else
|
|
|
|
static inline void kvm_register_perf_callbacks(void *ign) {}
|
|
|
|
static inline void kvm_unregister_perf_callbacks(void) {}
|
|
|
|
#endif /* CONFIG_GUEST_PERF_EVENTS */
|
|
|
|
|
2012-01-04 09:25:20 +00:00
|
|
|
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
|
2007-11-18 10:43:45 +00:00
|
|
|
void kvm_arch_destroy_vm(struct kvm *kvm);
|
2009-01-06 02:03:02 +00:00
|
|
|
void kvm_arch_sync_events(struct kvm *kvm);
|
2007-11-14 12:38:21 +00:00
|
|
|
|
2008-04-11 17:53:26 +00:00
|
|
|
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
|
2007-12-11 12:36:00 +00:00
|
|
|
|
2008-09-14 00:48:28 +00:00
|
|
|
struct kvm_irq_ack_notifier {
|
|
|
|
struct hlist_node link;
|
|
|
|
unsigned gsi;
|
|
|
|
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
|
|
|
|
};
|
|
|
|
|
2014-06-30 10:51:11 +00:00
|
|
|
int kvm_irq_map_gsi(struct kvm *kvm,
|
|
|
|
struct kvm_kernel_irq_routing_entry *entries, int gsi);
|
|
|
|
int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2014-06-30 10:51:10 +00:00
|
|
|
|
2013-04-11 11:21:40 +00:00
|
|
|
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
|
|
|
|
bool line_status);
|
2010-11-18 17:09:08 +00:00
|
|
|
int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
|
2013-04-11 11:21:40 +00:00
|
|
|
int irq_source_id, int level, bool line_status);
|
2015-10-28 18:16:47 +00:00
|
|
|
int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
|
|
|
|
struct kvm *kvm, int irq_source_id,
|
|
|
|
int level, bool line_status);
|
2013-01-25 02:18:51 +00:00
|
|
|
bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2015-10-16 07:07:46 +00:00
|
|
|
void kvm_notify_acked_gsi(struct kvm *kvm, int gsi);
|
2009-01-27 17:12:38 +00:00
|
|
|
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2008-10-06 05:48:45 +00:00
|
|
|
void kvm_register_irq_ack_notifier(struct kvm *kvm,
|
|
|
|
struct kvm_irq_ack_notifier *kian);
|
2009-06-04 18:08:24 +00:00
|
|
|
void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
|
|
|
|
struct kvm_irq_ack_notifier *kian);
|
2008-10-15 12:15:06 +00:00
|
|
|
int kvm_request_irq_source_id(struct kvm *kvm);
|
|
|
|
void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
|
2019-07-10 00:24:03 +00:00
|
|
|
bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
|
2008-09-14 00:48:28 +00:00
|
|
|
|
2012-01-12 20:09:51 +00:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* Returns a pointer to the memslot if it contains gfn.
|
2021-08-04 22:28:39 +00:00
|
|
|
* Otherwise returns NULL.
|
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-06 19:54:30 +00:00
|
|
|
try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
|
2021-08-04 22:28:39 +00:00
|
|
|
{
|
2021-12-06 19:54:30 +00:00
|
|
|
if (!slot)
|
2021-08-04 22:28:39 +00:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
|
|
|
|
return slot;
|
|
|
|
else
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-12-06 19:54:30 +00:00
|
|
|
* Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
|
2020-02-18 21:07:31 +00:00
|
|
|
*
|
2021-12-06 19:54:25 +00:00
|
|
|
* With "approx" set returns the memslot also when the address falls
|
|
|
|
* in a hole. In that case one of the memslots bordering the hole is
|
|
|
|
* returned.
|
2012-01-12 20:09:51 +00:00
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-06 19:54:30 +00:00
|
|
|
search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
|
2012-01-12 20:09:51 +00:00
|
|
|
{
|
2021-08-04 22:28:39 +00:00
|
|
|
struct kvm_memory_slot *slot;
|
2021-12-06 19:54:30 +00:00
|
|
|
struct rb_node *node;
|
|
|
|
int idx = slots->node_idx;
|
|
|
|
|
|
|
|
slot = NULL;
|
|
|
|
for (node = slots->gfn_tree.rb_node; node; ) {
|
|
|
|
slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
if (gfn >= slot->base_gfn) {
|
|
|
|
if (gfn < slot->base_gfn + slot->npages)
|
|
|
|
return slot;
|
|
|
|
node = node->rb_right;
|
|
|
|
} else
|
|
|
|
node = node->rb_left;
|
2021-12-06 19:54:25 +00:00
|
|
|
}
|
2012-01-12 20:09:51 +00:00
|
|
|
|
2021-12-06 19:54:30 +00:00
|
|
|
return approx ? slot : NULL;
|
2012-01-12 20:09:51 +00:00
|
|
|
}
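The descent in search_memslots() can be tried outside the kernel. The following is a minimal userspace sketch under stated assumptions, not the kernel implementation: the rbtree (struct rb_node and container_of) is replaced by plain left/right pointers, and the names demo_slot, demo_slots and demo_search are hypothetical. It shows how the loop either returns the slot whose half-open range [base_gfn, base_gfn + npages) contains gfn or, with approx set, falls back to the last slot visited, which borders the hole.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for struct kvm_memory_slot plus its tree linkage. */
struct demo_slot {
	gfn_t base_gfn;
	uint64_t npages;
	struct demo_slot *left, *right;
};

/* Three non-overlapping slots; demo_slots[1] is the root of the tree. */
static struct demo_slot demo_slots[3] = {
	{ .base_gfn = 0x100, .npages = 0x10 },
	{ .base_gfn = 0x200, .npages = 0x20,
	  .left = &demo_slots[0], .right = &demo_slots[2] },
	{ .base_gfn = 0x400, .npages = 0x1 },
};

static struct demo_slot *demo_search(struct demo_slot *root, gfn_t gfn,
				     bool approx)
{
	struct demo_slot *slot = NULL;
	struct demo_slot *node = root;

	while (node) {
		/* Remember the last visited slot for the approx case. */
		slot = node;
		if (gfn >= slot->base_gfn) {
			if (gfn < slot->base_gfn + slot->npages)
				return slot;	/* exact hit */
			node = node->right;
		} else {
			node = node->left;
		}
	}

	/* No slot contains gfn: the address is in a hole (or the tree is empty). */
	return approx ? slot : NULL;
}
```

With the three slots above, demo_search(&demo_slots[1], 0x150, true) ends at the slot based at 0x100, while the same call with approx cleared returns NULL.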
|
|
|
|
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-06 19:54:25 +00:00
|
|
|
____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
|
2012-01-12 20:09:51 +00:00
|
|
|
{
|
2021-08-04 22:28:39 +00:00
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
|
2021-12-06 19:54:30 +00:00
|
|
|
slot = (struct kvm_memory_slot *)atomic_long_read(&slots->last_used_slot);
|
|
|
|
slot = try_get_memslot(slot, gfn);
|
2021-08-04 22:28:39 +00:00
|
|
|
if (slot)
|
|
|
|
return slot;
|
|
|
|
|
2021-12-06 19:54:30 +00:00
|
|
|
slot = search_memslots(slots, gfn, approx);
|
2021-08-04 22:28:39 +00:00
|
|
|
if (slot) {
|
2021-12-06 19:54:30 +00:00
|
|
|
atomic_long_set(&slots->last_used_slot, (unsigned long)slot);
|
2021-08-04 22:28:39 +00:00
|
|
|
return slot;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
2012-01-12 20:09:51 +00:00
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:25 +00:00
|
|
|
/*
|
|
|
|
* __gfn_to_memslot() and its descendants are here to allow arch code to inline
|
|
|
|
* the lookups in hot paths. gfn_to_memslot() itself isn't here as an inline
|
|
|
|
* because that would bloat other code too much.
|
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
|
|
|
__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return ____gfn_to_memslot(slots, gfn, false);
|
|
|
|
}
|
|
|
|
|
2012-08-24 08:50:28 +00:00
|
|
|
static inline unsigned long
|
2021-04-01 23:37:24 +00:00
|
|
|
__gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
|
2012-08-24 08:50:28 +00:00
|
|
|
{
|
kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot. The translation is
performed in __gfn_to_hva_memslot using the following formula:
hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
It is expected that gfn falls within the boundaries of the guest's
physical memory. However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.
__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva. If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.
Right now it's not clear if there are any cases in which this is
exploitable. One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86. Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier. However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space. Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets. Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached. This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.
Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 19:31:42 +00:00
|
|
|
/*
|
|
|
|
* The index was originally checked in search_memslots. To prevent
|
|
|
|
* a malicious guest from building a Spectre gadget out of e.g. page
|
|
|
|
* table walks, do not let the processor speculate loads outside
|
|
|
|
* the guest's registered memslots.
|
|
|
|
*/
|
2021-06-09 05:49:13 +00:00
|
|
|
unsigned long offset = gfn - slot->base_gfn;
|
|
|
|
offset = array_index_nospec(offset, slot->npages);
|
2021-06-08 19:31:42 +00:00
|
|
|
return slot->userspace_addr + offset * PAGE_SIZE;
|
2012-08-24 08:50:28 +00:00
|
|
|
}
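The translation hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE, with the offset forced into range first, can be sketched in userspace. This is only an illustration under assumptions: demo_slot_map, demo_index_clamp and demo_gfn_to_hva are hypothetical names, a 4 KiB page size is assumed, and demo_index_clamp is a plain C stand-in for array_index_nospec(). It reproduces the arithmetic result (an out-of-range index becomes 0), but a userspace compiler is free to transform it, so it should not be read as a speculation barrier.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Hypothetical subset of the fields used by the translation. */
struct demo_slot_map {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;
};

/*
 * Returns index when index < size and 0 otherwise, without an if/else.
 * Assumes size is at most 2^63, so the top bit of the OR below is set
 * exactly when index is out of range.
 */
static uint64_t demo_index_clamp(uint64_t index, uint64_t size)
{
	uint64_t in_range = ~(index | (size - 1 - index)) >> 63;
	uint64_t mask = UINT64_C(0) - in_range;	/* all ones or zero */

	return index & mask;
}

static uint64_t demo_gfn_to_hva(const struct demo_slot_map *slot, uint64_t gfn)
{
	/* A gfn below base_gfn wraps to a huge offset and is clamped too. */
	uint64_t offset = demo_index_clamp(gfn - slot->base_gfn, slot->npages);

	return slot->userspace_addr + offset * DEMO_PAGE_SIZE;
}
```

For a slot with base_gfn 0x100, npages 0x10 and userspace_addr 0x7f0000000000, gfn 0x105 maps five pages into the slot, whereas an out-of-range gfn such as 0x200 collapses to the first page of the slot instead of producing an address outside it.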
|
|
|
|
|
2011-03-09 07:41:59 +00:00
|
|
|
static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return gfn_to_memslot(kvm, gfn)->id;
|
|
|
|
}
|
|
|
|
|
2012-07-02 08:54:30 +00:00
|
|
|
static inline gfn_t
|
|
|
|
hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
|
2010-08-22 11:10:28 +00:00
|
|
|
{
|
2012-07-02 08:54:30 +00:00
|
|
|
gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
return slot->base_gfn + gfn_offset;
|
2010-08-22 11:10:28 +00:00
|
|
|
}
|
|
|
|
|
2007-11-21 12:44:45 +00:00
|
|
|
static inline gpa_t gfn_to_gpa(gfn_t gfn)
|
|
|
|
{
|
|
|
|
return (gpa_t)gfn << PAGE_SHIFT;
|
|
|
|
}
|
2006-12-10 10:21:36 +00:00
|
|
|
|
2010-09-10 15:30:48 +00:00
|
|
|
static inline gfn_t gpa_to_gfn(gpa_t gpa)
|
|
|
|
{
|
|
|
|
return (gfn_t)(gpa >> PAGE_SHIFT);
|
|
|
|
}
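The two conversions above are inverses up to the in-page offset, which a small userspace sketch makes concrete. The demo_ prefixed names, the 4 KiB page size and the demo_gpa_page_offset helper are assumptions made for illustration and are not part of this header.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

typedef uint64_t demo_gpa_t;	/* guest physical address */
typedef uint64_t demo_gfn_t;	/* guest frame number */

static demo_gpa_t demo_gfn_to_gpa(demo_gfn_t gfn)
{
	/* Widen before shifting so no high bits are lost. */
	return (demo_gpa_t)gfn << DEMO_PAGE_SHIFT;
}

static demo_gfn_t demo_gpa_to_gfn(demo_gpa_t gpa)
{
	return (demo_gfn_t)(gpa >> DEMO_PAGE_SHIFT);
}

/* Hypothetical helper: the low bits that gpa_to_gfn discards. */
static uint64_t demo_gpa_page_offset(demo_gpa_t gpa)
{
	return gpa & ((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1);
}
```

For gpa 0x12345678 the frame number is 0x12345 and the offset is 0x678; converting the frame number back yields 0x12345000, the start of the page.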
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
|
2008-09-14 00:48:28 +00:00
|
|
|
{
|
|
|
|
return (hpa_t)pfn << PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2024-02-15 15:29:03 +00:00
|
|
|
static inline bool kvm_is_gpa_in_memslot(struct kvm *kvm, gpa_t gpa)
|
2014-01-01 15:09:21 +00:00
|
|
|
{
|
|
|
|
unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));
|
|
|
|
|
2024-02-15 15:29:03 +00:00
|
|
|
return !kvm_is_error_hva(hva);
|
2014-01-01 15:09:21 +00:00
|
|
|
}
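The two helpers above rely on plain shift arithmetic between frame numbers and byte addresses. A minimal userspace sketch of that arithmetic follows; the `demo_` names and the 4 KiB page size (`DEMO_PAGE_SHIFT` of 12) are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Frame number to byte address: widen first, then shift left. */
static uint64_t demo_pfn_to_hpa(uint64_t pfn)
{
	/* Guard against shifting significant bits out of the 64-bit result. */
	assert(pfn <= (UINT64_MAX >> DEMO_PAGE_SHIFT));
	return pfn << DEMO_PAGE_SHIFT;
}

/* Byte address to frame number: drop the offset-within-page bits. */
static uint64_t demo_gpa_to_gfn(uint64_t gpa)
{
	return gpa >> DEMO_PAGE_SHIFT;
}
```

The `(hpa_t)` cast in `pfn_to_hpa()` serves the same purpose as using a 64-bit type throughout this sketch: the shift is performed at the width of the result, not at the width of the frame number.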
|
|
|
|
|
2024-02-15 15:28:59 +00:00
|
|
|
static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc)
|
|
|
|
{
|
|
|
|
lockdep_assert_held(&gpc->lock);
|
|
|
|
|
|
|
|
if (!gpc->memslot)
|
|
|
|
return;
|
|
|
|
|
|
|
|
mark_page_dirty_in_slot(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa));
|
2014-01-01 15:09:21 +00:00
|
|
|
}
|
|
|
|
|
2007-11-18 14:24:12 +00:00
|
|
|
enum kvm_stat_kind {
|
|
|
|
KVM_STAT_VM,
|
|
|
|
KVM_STAT_VCPU,
|
|
|
|
};
|
|
|
|
|
2016-05-18 11:26:23 +00:00
|
|
|
struct kvm_stat_data {
|
|
|
|
struct kvm *kvm;
|
2021-06-23 21:28:46 +00:00
|
|
|
const struct _kvm_stats_desc *desc;
|
2007-11-18 14:24:12 +00:00
|
|
|
enum kvm_stat_kind kind;
|
2007-10-31 22:24:23 +00:00
|
|
|
};
|
2019-12-13 13:07:21 +00:00
|
|
|
|
2021-06-18 22:27:04 +00:00
|
|
|
struct _kvm_stats_desc {
|
|
|
|
struct kvm_stats_desc desc;
|
|
|
|
char name[KVM_STATS_NAME_SIZE];
|
|
|
|
};
|
|
|
|
|
2021-08-02 16:56:29 +00:00
|
|
|
#define STATS_DESC_COMMON(type, unit, base, exp, sz, bsz) \
|
2021-06-18 22:27:04 +00:00
|
|
|
.flags = type | unit | base | \
|
|
|
|
BUILD_BUG_ON_ZERO(type & ~KVM_STATS_TYPE_MASK) | \
|
|
|
|
BUILD_BUG_ON_ZERO(unit & ~KVM_STATS_UNIT_MASK) | \
|
|
|
|
BUILD_BUG_ON_ZERO(base & ~KVM_STATS_BASE_MASK), \
|
|
|
|
.exponent = exp, \
|
2021-08-02 16:56:29 +00:00
|
|
|
.size = sz, \
|
|
|
|
.bucket_size = bsz
|
2021-06-18 22:27:04 +00:00
|
|
|
|
2021-08-02 16:56:29 +00:00
|
|
|
#define VM_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-18 22:27:04 +00:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-18 22:27:04 +00:00
|
|
|
.offset = offsetof(struct kvm_vm_stat, generic.stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-02 16:56:29 +00:00
|
|
|
#define VCPU_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-18 22:27:04 +00:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-18 22:27:04 +00:00
|
|
|
.offset = offsetof(struct kvm_vcpu_stat, generic.stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-02 16:56:29 +00:00
|
|
|
#define VM_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-18 22:27:04 +00:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-18 22:27:04 +00:00
|
|
|
.offset = offsetof(struct kvm_vm_stat, stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-02 16:56:29 +00:00
|
|
|
#define VCPU_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-18 22:27:04 +00:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-18 22:27:04 +00:00
|
|
|
.offset = offsetof(struct kvm_vcpu_stat, stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
|
|
|
/* SCOPE: VM, VM_GENERIC, VCPU, VCPU_GENERIC */
|
2021-08-02 16:56:29 +00:00
|
|
|
#define STATS_DESC(SCOPE, stat, type, unit, base, exp, sz, bsz) \
|
|
|
|
SCOPE##_STATS_DESC(stat, type, unit, base, exp, sz, bsz)
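The descriptor macros above combine three preprocessor techniques: token pasting (`SCOPE##_STATS_DESC`) to pick a per-scope macro, `offsetof()` to record where a counter lives, and stringification (`#stat`) to derive its name. The following userspace sketch shows the same pattern in reduced form; every `demo_`/`DEMO_` identifier is hypothetical and the structures are stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-VM stats structure. */
struct demo_vm_stat {
	unsigned long long remote_tlb_flush;
	unsigned long long pages_mapped;
};

struct demo_desc {
	size_t offset;    /* byte offset of the counter in demo_vm_stat */
	const char *name; /* stringified field name */
};

#define DEMO_VM_STATS_DESC(stat) \
	{ .offset = offsetof(struct demo_vm_stat, stat), .name = #stat }

/* SCOPE is pasted onto _STATS_DESC to select the per-scope macro. */
#define DEMO_STATS_DESC(SCOPE, stat) SCOPE##_STATS_DESC(stat)

static const struct demo_desc demo_descs[] = {
	DEMO_STATS_DESC(DEMO_VM, remote_tlb_flush),
	DEMO_STATS_DESC(DEMO_VM, pages_mapped),
};

/* Look a counter up by name using only the descriptor table. */
static unsigned long long demo_read_stat(const struct demo_vm_stat *s,
					 const char *name)
{
	for (size_t i = 0; i < sizeof(demo_descs) / sizeof(demo_descs[0]); i++) {
		if (strcmp(demo_descs[i].name, name) == 0) {
			unsigned long long v;

			memcpy(&v, (const char *)s + demo_descs[i].offset, sizeof(v));
			return v;
		}
	}
	assert(!"unknown stat name");
	return 0;
}
```

Because the table stores offsets instead of pointers, one descriptor array can describe every instance of the stats structure, which is presumably why the kernel macros take the same approach.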
|
2021-06-18 22:27:04 +00:00
|
|
|
|
|
|
|
#define STATS_DESC_CUMULATIVE(SCOPE, name, unit, base, exponent) \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_CUMULATIVE, \
|
|
|
|
unit, base, exponent, 1, 0)
|
2021-06-18 22:27:04 +00:00
|
|
|
#define STATS_DESC_INSTANT(SCOPE, name, unit, base, exponent) \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_INSTANT, \
|
|
|
|
unit, base, exponent, 1, 0)
|
2021-06-18 22:27:04 +00:00
|
|
|
#define STATS_DESC_PEAK(SCOPE, name, unit, base, exponent) \
|
2021-08-02 16:56:29 +00:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_PEAK, \
|
|
|
|
unit, base, exponent, 1, 0)
|
|
|
|
#define STATS_DESC_LINEAR_HIST(SCOPE, name, unit, base, exponent, sz, bsz) \
|
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LINEAR_HIST, \
|
|
|
|
unit, base, exponent, sz, bsz)
|
|
|
|
#define STATS_DESC_LOG_HIST(SCOPE, name, unit, base, exponent, sz) \
|
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LOG_HIST, \
|
|
|
|
unit, base, exponent, sz, 0)
|
2021-06-18 22:27:04 +00:00
|
|
|
|
|
|
|
/* Cumulative counter, read/write */
|
|
|
|
#define STATS_DESC_COUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_CUMULATIVE(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Instantaneous counter, read only */
|
|
|
|
#define STATS_DESC_ICOUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_INSTANT(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Peak counter, read/write */
|
|
|
|
#define STATS_DESC_PCOUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_PEAK(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
|
2022-07-14 11:27:31 +00:00
|
|
|
/* Instantaneous boolean value, read only */
|
|
|
|
#define STATS_DESC_IBOOLEAN(SCOPE, name) \
|
|
|
|
STATS_DESC_INSTANT(SCOPE, name, KVM_STATS_UNIT_BOOLEAN, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Peak (sticky) boolean value, read/write */
|
|
|
|
#define STATS_DESC_PBOOLEAN(SCOPE, name) \
|
|
|
|
STATS_DESC_PEAK(SCOPE, name, KVM_STATS_UNIT_BOOLEAN, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
|
2021-06-18 22:27:04 +00:00
|
|
|
/* Cumulative time in nanoseconds */
|
|
|
|
#define STATS_DESC_TIME_NSEC(SCOPE, name) \
|
|
|
|
STATS_DESC_CUMULATIVE(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9)
|
2021-08-02 16:56:29 +00:00
|
|
|
/* Linear histogram for time in nanoseconds */
|
|
|
|
#define STATS_DESC_LINHIST_TIME_NSEC(SCOPE, name, sz, bsz) \
|
|
|
|
STATS_DESC_LINEAR_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9, sz, bsz)
|
|
|
|
/* Logarithmic histogram for time in nanoseconds */
|
|
|
|
#define STATS_DESC_LOGHIST_TIME_NSEC(SCOPE, name, sz) \
|
|
|
|
STATS_DESC_LOG_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9, sz)
|
2021-06-18 22:27:04 +00:00
|
|
|
|
2021-06-18 22:27:05 +00:00
|
|
|
#define KVM_GENERIC_VM_STATS() \
|
2021-08-17 00:26:39 +00:00
|
|
|
STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush), \
|
|
|
|
STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush_requests)
|
2021-06-18 22:27:05 +00:00
|
|
|
|
2021-06-18 22:27:06 +00:00
|
|
|
#define KVM_GENERIC_VCPU_STATS() \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_successful_poll), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_attempted_poll), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_poll_invalid), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_wakeup), \
|
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_success_ns), \
|
2021-08-02 16:56:32 +00:00
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_ns), \
|
2021-08-02 16:56:33 +00:00
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_wait_ns), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_success_hist, \
|
|
|
|
HALT_POLL_HIST_COUNT), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_hist, \
|
|
|
|
HALT_POLL_HIST_COUNT), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_wait_hist, \
|
2021-10-09 02:12:08 +00:00
|
|
|
HALT_POLL_HIST_COUNT), \
|
2022-07-14 11:27:31 +00:00
|
|
|
STATS_DESC_IBOOLEAN(VCPU_GENERIC, blocking)
|
2021-06-18 22:27:06 +00:00
|
|
|
|
2021-06-18 22:27:04 +00:00
|
|
|
ssize_t kvm_stats_read(char *id, const struct kvm_stats_header *header,
|
|
|
|
const struct _kvm_stats_desc *desc,
|
|
|
|
void *stats, size_t size_stats,
|
|
|
|
char __user *user_buffer, size_t size, loff_t *offset);
|
2021-08-02 16:56:29 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_stats_linear_hist_update() - Update bucket value for linear histogram
|
|
|
|
* statistics data.
|
|
|
|
*
|
|
|
|
* @data: start address of the stats data
|
|
|
|
* @size: the number of buckets in the stats data
|
|
|
|
* @value: the new value used to update the linear histogram's bucket
|
|
|
|
* @bucket_size: the size (width) of a bucket
|
|
|
|
*/
|
|
|
|
static inline void kvm_stats_linear_hist_update(u64 *data, size_t size,
|
|
|
|
u64 value, size_t bucket_size)
|
|
|
|
{
|
|
|
|
size_t index = div64_u64(value, bucket_size);
|
|
|
|
|
|
|
|
index = min(index, size - 1);
|
|
|
|
++data[index];
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_stats_log_hist_update() - Update bucket value for logarithmic histogram
|
|
|
|
* statistics data.
|
|
|
|
*
|
|
|
|
* @data: start address of the stats data
|
|
|
|
* @size: the number of buckets in the stats data
|
|
|
|
* @value: the new value used to update the logarithmic histogram's bucket
|
|
|
|
*/
|
|
|
|
static inline void kvm_stats_log_hist_update(u64 *data, size_t size, u64 value)
|
|
|
|
{
|
|
|
|
size_t index = fls64(value);
|
|
|
|
|
|
|
|
index = min(index, size - 1);
|
|
|
|
++data[index];
|
|
|
|
}
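To make the bucket selection in the two histogram helpers concrete, here is a userspace sketch that computes only the bucket index. `demo_fls64()` is a portable stand-in written for this example (assumed to match the kernel's `fls64()` convention of returning the 1-based position of the highest set bit, and 0 for an input of 0); the other `demo_` names are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1-based index of the highest set bit; 0 when v == 0. */
static unsigned int demo_fls64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}

/* Linear histogram: bucket i covers [i * bucket_size, (i + 1) * bucket_size). */
static size_t demo_linear_bucket(uint64_t value, uint64_t bucket_size,
				 size_t nbuckets)
{
	uint64_t index;

	assert(nbuckets > 0 && bucket_size > 0);
	index = value / bucket_size;
	/* Out-of-range values are clamped into the last bucket. */
	return index < nbuckets - 1 ? (size_t)index : nbuckets - 1;
}

/* Log histogram: bucket 0 holds 0, bucket i (i >= 1) covers [2^(i-1), 2^i). */
static size_t demo_log_bucket(uint64_t value, size_t nbuckets)
{
	size_t index;

	assert(nbuckets > 0);
	index = demo_fls64(value);
	return index < nbuckets - 1 ? index : nbuckets - 1;
}
```

In both cases the last bucket doubles as an overflow bucket, so its count covers every value at or beyond its lower bound.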
|
|
|
|
|
|
|
|
#define KVM_STATS_LINEAR_HIST_UPDATE(array, value, bsize) \
|
|
|
|
kvm_stats_linear_hist_update(array, ARRAY_SIZE(array), value, bsize)
|
|
|
|
#define KVM_STATS_LOG_HIST_UPDATE(array, value) \
|
|
|
|
kvm_stats_log_hist_update(array, ARRAY_SIZE(array), value)
|
|
|
|
|
|
|
|
|
2021-06-18 22:27:05 +00:00
|
|
|
extern const struct kvm_stats_header kvm_vm_stats_header;
|
|
|
|
extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
|
2021-06-18 22:27:06 +00:00
|
|
|
extern const struct kvm_stats_header kvm_vcpu_stats_header;
|
|
|
|
extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
|
2008-04-10 12:47:53 +00:00
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2022-08-16 12:53:22 +00:00
|
|
|
static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
|
2008-07-25 14:24:52 +00:00
|
|
|
{
|
2022-08-16 12:53:22 +00:00
|
|
|
if (unlikely(kvm->mmu_invalidate_in_progress))
|
2008-07-25 14:24:52 +00:00
|
|
|
return 1;
|
|
|
|
/*
|
2022-08-16 12:53:22 +00:00
|
|
|
* Ensure the read of mmu_invalidate_in_progress happens before
|
|
|
|
* the read of mmu_invalidate_seq. This interacts with the
|
|
|
|
* smp_wmb() in mmu_notifier_invalidate_range_end to make sure
|
|
|
|
* that the caller either sees the old (non-zero) value of
|
|
|
|
* mmu_invalidate_in_progress or the new (incremented) value of
|
|
|
|
* mmu_invalidate_seq.
|
|
|
|
*
|
|
|
|
* PowerPC Book3s HV KVM calls this under a per-page lock rather
|
|
|
|
* than under kvm->mmu_lock, for scalability, so can't rely on
|
|
|
|
* kvm->mmu_lock to keep things ordered.
|
2008-07-25 14:24:52 +00:00
|
|
|
*/
|
2011-12-12 12:37:21 +00:00
|
|
|
smp_rmb();
|
2022-08-16 12:53:22 +00:00
|
|
|
if (kvm->mmu_invalidate_seq != mmu_seq)
|
2008-07-25 14:24:52 +00:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
2021-02-22 02:45:22 +00:00
|
|
|
|
2023-10-27 18:21:45 +00:00
|
|
|
static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
|
2022-08-16 12:53:22 +00:00
|
|
|
unsigned long mmu_seq,
|
2023-10-27 18:21:45 +00:00
|
|
|
gfn_t gfn)
|
2021-02-22 02:45:22 +00:00
|
|
|
{
|
|
|
|
lockdep_assert_held(&kvm->mmu_lock);
|
|
|
|
/*
|
2022-08-16 12:53:22 +00:00
|
|
|
* If mmu_invalidate_in_progress is non-zero, then the range maintained
|
|
|
|
* by kvm_mmu_notifier_invalidate_range_start contains all addresses
|
|
|
|
* that might be being invalidated. Note that it may include some false
|
2021-02-22 02:45:22 +00:00
|
|
|
* positives, due to shortcuts when handling concurrent invalidations.
|
|
|
|
*/
|
2023-10-27 18:21:45 +00:00
|
|
|
if (unlikely(kvm->mmu_invalidate_in_progress)) {
|
|
|
|
/*
|
|
|
|
* Dropping mmu_lock after bumping mmu_invalidate_in_progress
|
|
|
|
* but before updating the range is a KVM bug.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
|
|
|
|
kvm->mmu_invalidate_range_end == INVALID_GPA))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (gfn >= kvm->mmu_invalidate_range_start &&
|
|
|
|
gfn < kvm->mmu_invalidate_range_end)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2022-08-16 12:53:22 +00:00
|
|
|
if (kvm->mmu_invalidate_seq != mmu_seq)
|
2021-02-22 02:45:22 +00:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation. Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault. And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.
Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).
Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.
Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).
Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.
Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.
Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock. This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.
Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 01:26:40 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This lockless version of the range-based retry check *must* be paired with a
|
|
|
|
* call to the locked version after acquiring mmu_lock, i.e. this is safe to
|
|
|
|
* use only as a pre-check to avoid contending mmu_lock. This version *will*
|
|
|
|
* get false negatives and false positives.
|
|
|
|
*/
|
|
|
|
static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
|
|
|
|
unsigned long mmu_seq,
|
|
|
|
gfn_t gfn)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Use READ_ONCE() to ensure the in-progress flag and sequence counter
|
|
|
|
* are always read from memory, e.g. so that checking for retry in a
|
|
|
|
* loop won't result in an infinite retry loop. Don't force loads for
|
|
|
|
* start+end, as the key to avoiding infinite retry loops is observing
|
|
|
|
* the 1=>0 transition of in-progress, i.e. getting false negatives
|
|
|
|
* due to stale start+end values is acceptable.
|
|
|
|
*/
|
|
|
|
if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
|
|
|
|
gfn >= kvm->mmu_invalidate_range_start &&
|
|
|
|
gfn < kvm->mmu_invalidate_range_end)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
|
|
|
|
}
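The decision logic of the range-based retry check can be modelled in userspace as follows. This is a single-threaded sketch under stated assumptions: `struct demo_kvm` and `demo_retry_gfn()` are hypothetical names, the locking, `READ_ONCE()` and memory barriers of the real helpers are deliberately omitted, and only the "retry or proceed" outcome is reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the invalidation bookkeeping kept per VM. */
struct demo_kvm {
	long in_progress;     /* number of invalidations currently running */
	unsigned long seq;    /* bumped each time an invalidation completes */
	uint64_t range_start; /* inclusive start of the range being invalidated */
	uint64_t range_end;   /* exclusive end of that range */
};

/*
 * Returns true if a fault on @gfn that snapshotted @snap before resolving
 * the page must be retried.
 */
static bool demo_retry_gfn(const struct demo_kvm *k, unsigned long snap,
			   uint64_t gfn)
{
	if (k->in_progress) {
		assert(k->range_start <= k->range_end);
		/* An active invalidation covers this gfn: the mapping is changing. */
		if (gfn >= k->range_start && gfn < k->range_end)
			return true;
	}
	/* An invalidation completed after the snapshot was taken. */
	return k->seq != snap;
}
```

A caller would snapshot `seq`, resolve the page, run this check as a cheap pre-check, and then repeat the check under the lock before installing the mapping, since only the locked check is authoritative.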
|
2008-07-25 14:24:52 +00:00
|
|
|
#endif
|
|
|
|
|
2013-04-17 11:29:30 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
|
2008-11-19 11:58:46 +00:00
|
|
|
|
2018-04-27 00:55:03 +00:00
|
|
|
#define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
|
2008-11-19 11:58:46 +00:00
|
|
|
|
2017-04-28 15:06:20 +00:00
|
|
|
bool kvm_arch_can_set_irq_routing(struct kvm *kvm);
|
2008-11-19 11:58:46 +00:00
|
|
|
int kvm_set_irq_routing(struct kvm *kvm,
|
|
|
|
const struct kvm_irq_routing_entry *entries,
|
|
|
|
unsigned nr,
|
|
|
|
unsigned flags);
|
2024-05-06 10:17:49 +00:00
|
|
|
int kvm_init_irq_routing(struct kvm *kvm);
|
2016-07-12 20:09:26 +00:00
|
|
|
int kvm_set_routing_entry(struct kvm *kvm,
|
|
|
|
struct kvm_kernel_irq_routing_entry *e,
|
2013-04-15 21:23:21 +00:00
|
|
|
const struct kvm_irq_routing_entry *ue);
|
2008-11-19 11:58:46 +00:00
|
|
|
void kvm_free_irq_routing(struct kvm *kvm);
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline void kvm_free_irq_routing(struct kvm *kvm) {}
|
|
|
|
|
2024-05-06 10:17:49 +00:00
|
|
|
static inline int kvm_init_irq_routing(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-11-19 11:58:46 +00:00
|
|
|
#endif
|
|
|
|
|
2014-06-30 10:51:13 +00:00
|
|
|
int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi);
|
|
|
|
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure that ensures
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
|
|
|
void kvm_eventfd_init(struct kvm *kvm);
|
2012-10-08 22:22:59 +00:00
|
|
|
int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
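The ioeventfd description above rests on the counter semantics of an eventfd: each signal adds to a 64-bit counter, and a read returns and resets the accumulated value. The Linux-only userspace sketch below shows just that behaviour with the standard `eventfd(2)` interface. It does not touch KVM at all, and `demo_ring_and_drain()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Signal a fresh eventfd @count times, then drain it once.
 * Returns the accumulated counter value, or 0 on any error.
 */
static uint64_t demo_ring_and_drain(unsigned int count)
{
	uint64_t one = 1, total = 0;
	int fd;

	/* eventfd reads and writes always transfer exactly 8 bytes. */
	assert(sizeof(one) == 8);

	fd = eventfd(0, EFD_NONBLOCK);
	if (fd < 0)
		return 0;

	for (unsigned int i = 0; i < count; i++) {
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return 0;
		}
	}

	/* A single read returns the sum of all signals and resets the counter. */
	if (read(fd, &total, sizeof(total)) != (ssize_t)sizeof(total))
		total = 0;

	close(fd);
	return total;
}
```

With ioeventfd, the "write" side of this picture is performed by the kernel when the guest writes to the registered PIO/MMIO address, so the consumer sees coalesced notifications without a round trip through the userspace device model.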
|
|
|
|
|
2023-10-18 16:07:32 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2012-06-29 15:56:08 +00:00
|
|
|
int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
|
2009-05-20 14:30:49 +00:00
|
|
|
void kvm_irqfd_release(struct kvm *kvm);
|
KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking
KVM irqfd based emulation of level-triggered interrupts doesn't work
quite correctly in some cases, particularly in the case of interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of hardirq), not earlier.
Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we always notify resamplefd at the EOI, so vfio prematurely unmasks the
host physical IRQ, thus a new physical interrupt is fired in the host.
This extra interrupt in the host is not a problem per se. The problem is
that it is unconditionally queued for injection into the guest, so the
guest sees an extra bogus interrupt. [*]
At least 2 user-visible issues caused by those
extra erroneous interrupts for a oneshot irq in the guest have been observed:
1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.
The core issue here is that by the time the guest unmasks the IRQ,
the physical IRQ line is no longer asserted (since the guest has
acked the interrupt to the device in the meantime), yet we
unconditionally inject the interrupt queued into the guest by the
previous resampling. So to fix the issue, we need a way to detect that
the IRQ is no longer pending, and cancel the queued interrupt in this
case.
With IOAPIC we are not able to probe the physical IRQ line state
directly (at least not if the underlying physical interrupt controller
is an IOAPIC too), so in this patch we use irqfd resampler for that.
Namely, instead of injecting the queued interrupt, we just notify the
resampler that this interrupt is done. If the IRQ line is actually
already deasserted, we are done. If it is still asserted, a new
interrupt will be shortly triggered through irqfd and injected into the
guest.
If there is no irqfd resampler registered for this IRQ, we
cannot fix the issue, so we keep the existing behavior: immediately
unconditionally inject the queued interrupt.
This patch fixes the issue for x86 IOAPIC only. In the long run, we can
fix it for other irqchips and other architectures too, possibly taking
advantage of reading the physical state of the IRQ line, which is
possible with some other irqchips (e.g. with arm64 GIC, maybe even with
the legacy x86 PIC).
[*] In this description we assume that the interrupt is a physical host
interrupt forwarded to the guest e.g. by vfio. Potentially the same
issue may occur also with a purely virtual interrupt from an
emulated device, e.g. if the guest handles this interrupt, again, as
a oneshot interrupt.
Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-22 20:43:44 +00:00
|
|
|
bool kvm_notify_irqfd_resampler(struct kvm *kvm,
|
|
|
|
unsigned int irqchip,
|
|
|
|
unsigned int pin);
|
2014-06-30 10:51:11 +00:00
|
|
|
void kvm_irq_routing_update(struct kvm *);
|
2012-10-08 22:22:59 +00:00
|
|
|
#else
|
|
|
|
static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_irqfd_release(struct kvm *kvm) {}
|
2023-03-22 20:43:44 +00:00
|
|
|
|
|
|
|
static inline bool kvm_notify_irqfd_resampler(struct kvm *kvm,
|
|
|
|
unsigned int irqchip,
|
|
|
|
unsigned int pin)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2023-10-18 16:07:32 +00:00
|
|
|
#endif /* CONFIG_HAVE_KVM_IRQCHIP */
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2018-02-22 12:05:41 +00:00
|
|
|
void kvm_arch_irq_routing_update(struct kvm *kvm);
|
|
|
|
|
2022-02-23 16:53:02 +00:00
|
|
|
static inline void __kvm_make_request(int req, struct kvm_vcpu *vcpu)
|
2010-05-10 09:34:53 +00:00
|
|
|
{
|
2016-03-10 15:30:22 +00:00
|
|
|
/*
|
|
|
|
* Ensure the rest of the request is published to kvm_check_request's
|
|
|
|
* caller. Paired with the smp_mb__after_atomic in kvm_check_request.
|
|
|
|
*/
|
|
|
|
smp_wmb();
|
2018-07-10 09:27:19 +00:00
|
|
|
set_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2010-05-10 09:34:53 +00:00
|
|
|
}
|
|
|
|
|
2022-02-23 16:53:02 +00:00
|
|
|
static __always_inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Requests that don't require vCPU action should never be logged in
|
|
|
|
* vcpu->requests. The vCPU won't clear the request, so it will stay
|
|
|
|
* logged indefinitely and prevent the vCPU from entering the guest.
|
|
|
|
*/
|
|
|
|
BUILD_BUG_ON(!__builtin_constant_p(req) ||
|
|
|
|
(req & KVM_REQUEST_NO_ACTION));
|
|
|
|
|
|
|
|
__kvm_make_request(req, vcpu);
|
|
|
|
}
|
|
|
|
|
2017-06-04 12:43:52 +00:00
|
|
|
static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return READ_ONCE(vcpu->requests);
|
|
|
|
}
|
|
|
|
|
2017-04-26 20:32:19 +00:00
|
|
|
static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-07-10 09:27:19 +00:00
|
|
|
return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2017-04-26 20:32:19 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-07-10 09:27:19 +00:00
|
|
|
clear_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2017-04-26 20:32:19 +00:00
|
|
|
}
|
|
|
|
|
2010-05-10 09:34:53 +00:00
|
|
|
static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2017-04-26 20:32:19 +00:00
|
|
|
if (kvm_test_request(req, vcpu)) {
|
|
|
|
kvm_clear_request(req, vcpu);
|
2016-03-10 15:30:22 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure the rest of the request is visible to kvm_check_request's
|
|
|
|
* caller. Paired with the smp_wmb in kvm_make_request.
|
|
|
|
*/
|
|
|
|
smp_mb__after_atomic();
|
2010-05-10 10:08:26 +00:00
|
|
|
return true;
|
|
|
|
} else {
|
|
|
|
return false;
|
|
|
|
}
|
2010-05-10 09:34:53 +00:00
|
|
|
}
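The make/check pairing above is a publish-then-consume handshake on a bitmask. A rough userspace approximation using C11 atomics is sketched below; the `demo_` names are hypothetical, and release/acquire ordering is used here as a stand-in for the kernel's `smp_wmb()` and `smp_mb__after_atomic()`, which is an assumption of this sketch and not a statement about how the kernel primitives are implemented.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vCPU request bitmask. */
struct demo_vcpu {
	atomic_ulong requests;
};

/* Producer: publish any prior writes, then set the request bit. */
static void demo_make_request(unsigned int req, struct demo_vcpu *v)
{
	assert(req < sizeof(unsigned long) * CHAR_BIT);
	atomic_fetch_or_explicit(&v->requests, 1UL << req, memory_order_release);
}

/* Consumer: test the bit and, if set, clear it and observe the payload. */
static bool demo_check_request(unsigned int req, struct demo_vcpu *v)
{
	unsigned long bit;

	assert(req < sizeof(unsigned long) * CHAR_BIT);
	bit = 1UL << req;

	if (!(atomic_load_explicit(&v->requests, memory_order_relaxed) & bit))
		return false;

	/* Clearing with acquire ordering pairs with the producer's release. */
	atomic_fetch_and_explicit(&v->requests, ~bit, memory_order_acquire);
	return true;
}
```

As in the original, a request is consumed at most once per set: the first successful check clears the bit, and later checks return false until the request is made again.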
|
|
|
|
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
2013-04-05 19:20:30 +00:00
|
|
|
extern bool kvm_rebooting;
|
2022-11-30 23:09:33 +00:00
|
|
|
#endif
|
2013-04-05 19:20:30 +00:00
|
|
|
|
2016-10-14 00:53:19 +00:00
|
|
|
extern unsigned int halt_poll_ns;
|
|
|
|
extern unsigned int halt_poll_ns_grow;
|
2019-01-27 10:17:15 +00:00
|
|
|
extern unsigned int halt_poll_ns_grow_start;
|
2016-10-14 00:53:19 +00:00
|
|
|
extern unsigned int halt_poll_ns_shrink;
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
struct kvm_device {
|
2019-10-21 15:28:19 +00:00
|
|
|
const struct kvm_device_ops *ops;
|
2013-04-12 14:08:42 +00:00
|
|
|
struct kvm *kvm;
|
|
|
|
void *private;
|
2013-04-25 14:11:23 +00:00
|
|
|
struct list_head vm_node;
|
2013-04-12 14:08:42 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
/* create, destroy, and name are mandatory */
|
|
|
|
struct kvm_device_ops {
|
|
|
|
const char *name;
|
2016-08-09 17:13:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* create is called holding kvm->lock and any operations not suitable
|
|
|
|
* to do while holding the lock should be deferred to init (see
|
|
|
|
* below).
|
|
|
|
*/
|
2013-04-12 14:08:42 +00:00
|
|
|
int (*create)(struct kvm_device *dev, u32 type);
|
|
|
|
|
2016-08-09 17:13:00 +00:00
|
|
|
/*
|
|
|
|
* init is called after create if create is successful and is called
|
|
|
|
* outside of holding kvm->lock.
|
|
|
|
*/
|
|
|
|
void (*init)(struct kvm_device *dev);
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
/*
|
|
|
|
* Destroy is responsible for freeing dev.
|
|
|
|
*
|
|
|
|
* Destroy may be called before or after destructors are called
|
|
|
|
* on emulated I/O regions, depending on whether a reference is
|
|
|
|
* held by a vcpu or other kvm component that gets destroyed
|
|
|
|
* after the emulated I/O.
|
|
|
|
*/
|
|
|
|
void (*destroy)(struct kvm_device *dev);
|
|
|
|
|
2019-04-18 10:39:41 +00:00
|
|
|
/*
|
|
|
|
* Release is an alternative method to free the device. It is
|
|
|
|
* called when the device file descriptor is closed. Once
|
|
|
|
* release is called, the destroy method will not be called
|
|
|
|
* anymore as the device is removed from the device list of
|
|
|
|
* the VM. kvm->lock is held.
|
|
|
|
*/
|
|
|
|
void (*release)(struct kvm_device *dev);
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2019-04-18 10:39:36 +00:00
|
|
|
int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma);
|
2013-04-12 14:08:42 +00:00
|
|
|
};
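As a rough illustration of the ops-table contract described in the comments above (name, create, and destroy mandatory; init optional and run after a successful create; destroy frees the device), here is a self-contained userspace sketch. All `sketch_` and `demo_` names are hypothetical stand-ins, and the validation in sketch_device_new() is an assumption of this sketch rather than a statement about what KVM itself checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_device;

/* Hypothetical, reduced version of an ops table. */
struct sketch_device_ops {
	const char *name;					/* mandatory */
	int (*create)(struct sketch_device *dev, uint32_t type);	/* mandatory */
	void (*init)(struct sketch_device *dev);			/* optional */
	void (*destroy)(struct sketch_device *dev);		/* mandatory */
};

struct sketch_device {
	const struct sketch_device_ops *ops;
	void *private;
};

/* Example backend: remembers the type it was created with. */
static int demo_create(struct sketch_device *dev, uint32_t type)
{
	int *state;

	assert(dev != NULL);
	state = malloc(sizeof(*state));
	if (!state)
		return -ENOMEM;
	*state = (int)type;
	dev->private = state;
	return 0;
}

/* destroy is responsible for freeing dev, mirroring the comment above. */
static void demo_destroy(struct sketch_device *dev)
{
	free(dev->private);
	free(dev);
}

static const struct sketch_device_ops demo_ops = {
	.name = "sketch-demo",
	.create = demo_create,
	.destroy = demo_destroy,
};

/* Allocate a device, run create, then the optional init. NULL on failure. */
static struct sketch_device *
sketch_device_new(const struct sketch_device_ops *ops, uint32_t type)
{
	struct sketch_device *dev;

	if (!ops || !ops->name || !ops->create || !ops->destroy)
		return NULL;

	dev = calloc(1, sizeof(*dev));
	if (!dev)
		return NULL;

	dev->ops = ops;
	if (ops->create(dev, type)) {
		free(dev);
		return NULL;
	}
	if (ops->init)
		ops->init(dev);
	return dev;
}
```

Splitting create from init lets the first step run under a lock while anything unsuitable for that context is deferred to the second, which is the motivation the comments in the structure give.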
|
|
|
|
|
|
|
|
struct kvm_device *kvm_device_from_filp(struct file *filp);
|
2019-10-21 15:28:19 +00:00
|
|
|
int kvm_register_device_ops(const struct kvm_device_ops *ops, u32 type);
|
2014-10-09 10:30:08 +00:00
|
|
|
void kvm_unregister_device_ops(u32 type);
|
2013-04-12 14:08:42 +00:00
|
|
|
|
2013-04-12 14:08:46 +00:00
|
|
|
extern struct kvm_device_ops kvm_mpic_ops;
|
2014-10-26 23:17:00 +00:00
|
|
|
extern struct kvm_device_ops kvm_arm_vgic_v2_ops;
|
2014-06-06 22:54:51 +00:00
|
|
|
extern struct kvm_device_ops kvm_arm_vgic_v3_ops;
|
2013-04-12 14:08:46 +00:00
|
|
|
|
2012-07-18 13:37:46 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
vcpu->spin_loop.in_spin_loop = val;
|
|
|
|
}
|
|
|
|
static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
vcpu->spin_loop.dy_eligible = val;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* !CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
|
2015-09-18 14:29:43 +00:00
|
|
|
|
2020-04-16 13:48:07 +00:00
|
|
|
static inline bool kvm_is_visible_memslot(struct kvm_memory_slot *memslot)
|
|
|
|
{
|
|
|
|
return (memslot && memslot->id < KVM_USER_MEM_SLOTS &&
|
|
|
|
!(memslot->flags & KVM_MEMSLOT_INVALID));
|
|
|
|
}
|
|
|
|
|
2020-01-09 14:57:19 +00:00
|
|
|
struct kvm_vcpu *kvm_get_running_vcpu(void);
|
2020-02-28 08:49:41 +00:00
|
|
|
struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
|
2020-01-09 14:57:19 +00:00
|
|
|
|
2015-09-18 14:29:43 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
|
2016-05-05 17:58:35 +00:00
|
|
|
bool kvm_arch_has_irq_bypass(void);
|
2015-09-18 14:29:43 +00:00
|
|
|
int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
|
|
|
|
struct irq_bypass_producer *);
|
|
|
|
void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
|
|
|
|
struct irq_bypass_producer *);
|
|
|
|
void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
|
|
|
|
void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
|
2015-09-18 14:29:53 +00:00
|
|
|
int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
|
|
|
|
uint32_t guest_irq, bool set);
|
2021-08-27 08:00:03 +00:00
|
|
|
bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
|
|
|
|
struct kvm_kernel_irq_routing_entry *);
|
2015-09-18 14:29:43 +00:00
|
|
|
#endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
|
2015-10-20 07:39:03 +00:00
|
|
|
|
2016-05-13 10:16:35 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_INVALID_WAKEUPS
|
|
|
|
/* If we wake up during the poll time, was it a successful poll? */
|
|
|
|
static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return vcpu->valid_wakeup;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_INVALID_WAKEUPS */
|
|
|
|
|
2019-03-05 10:30:01 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_NO_POLL
|
|
|
|
/* Callback that tells if we must not poll */
|
|
|
|
bool kvm_arch_no_poll(struct kvm_vcpu *vcpu);
|
|
|
|
#else
|
|
|
|
static inline bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_NO_POLL */
|
|
|
|
|
2017-12-12 16:41:34 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL
|
|
|
|
long kvm_arch_vcpu_async_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
|
|
|
#else
|
|
|
|
static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOIOCTLCMD;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
|
|
|
|
|
2022-04-21 03:14:07 +00:00
|
|
|
void kvm_arch_guest_memory_reclaimed(struct kvm *kvm);
|
|
|
|
|
2018-02-23 16:23:57 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
|
|
|
|
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
|
|
|
|
#else
|
|
|
|
static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
|
|
|
|
|
2020-07-22 21:59:59 +00:00
|
|
|
#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
|
|
|
|
static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
vcpu->run->exit_reason = KVM_EXIT_INTR;
|
|
|
|
vcpu->stat.signal_exits++;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
|
|
|
|
|
2022-08-23 00:46:37 +00:00
|
|
|
/*
|
|
|
|
* If more than one page is being (un)accounted, @virt must be the address of
|
|
|
|
* the first page of a block of pages that were allocated together (i.e.
|
|
|
|
* accounted together).
|
|
|
|
*
|
|
|
|
* kvm_account_pgtable_pages() is thread-safe because mod_lruvec_page_state()
|
|
|
|
* is thread-safe.
|
|
|
|
*/
|
|
|
|
static inline void kvm_account_pgtable_pages(void *virt, int nr)
|
|
|
|
{
|
|
|
|
mod_lruvec_page_state(virt_to_page(virt), NR_SECONDARY_PAGETABLE, nr);
|
|
|
|
}
|
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
/*
|
|
|
|
* This defines how many reserved entries we want to keep before we
|
|
|
|
* kick the vcpu out to userspace to avoid a full dirty ring. This
|
|
|
|
* value can be tuned to higher if e.g. PML is enabled on the host.
|
|
|
|
*/
|
|
|
|
#define KVM_DIRTY_RING_RSVD_ENTRIES 64
|
|
|
|
|
|
|
|
/* Max number of entries allowed for each kvm dirty ring */
|
|
|
|
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
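One way to picture how a reserve like KVM_DIRTY_RING_RSVD_ENTRIES could be used is a "soft limit" that trips before the ring is truly full. The sketch below is a simplified userspace model under stated assumptions: the `sketch_` names, the power-of-two size rule, and the exact soft-limit formula are illustrative choices, not taken from this header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RSVD_ENTRIES	64	/* mirrors KVM_DIRTY_RING_RSVD_ENTRIES */
#define SKETCH_MAX_ENTRIES	65536	/* mirrors KVM_DIRTY_RING_MAX_ENTRIES */

/* Hypothetical ring bookkeeping; the entries themselves are not modeled. */
struct sketch_ring {
	uint32_t size;		/* total number of entries */
	uint32_t soft_limit;	/* usage at which the vcpu should exit */
	uint32_t used;		/* entries currently in use */
};

/*
 * Assumption: sizes must be a power of two, larger than the reserve and
 * no larger than the maximum. Returns false if the size is rejected.
 */
static bool sketch_ring_init(struct sketch_ring *r, uint32_t size)
{
	if (size <= SKETCH_RSVD_ENTRIES || size > SKETCH_MAX_ENTRIES ||
	    (size & (size - 1)))
		return false;

	r->size = size;
	r->soft_limit = size - SKETCH_RSVD_ENTRIES;
	r->used = 0;
	return true;
}

/* True once only the reserved entries remain. */
static bool sketch_ring_soft_full(const struct sketch_ring *r)
{
	return r->used >= r->soft_limit;
}

/* Record one dirty page; returns true if the vcpu should exit to userspace. */
static bool sketch_ring_push(struct sketch_ring *r)
{
	assert(r->used < r->size);	/* the reserve should prevent overflow */
	r->used++;
	return sketch_ring_soft_full(r);
}
```

The reserve exists so that pages dirtied between hitting the soft limit and actually leaving the guest still have somewhere to go; a larger reserve trades usable ring space for more headroom.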
|
|
|
|
|
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).
KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kinds of memory conversions:
- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)
- implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)
On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.
KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.
Note! To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'! Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.
Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages. It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.
Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-10-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:51 +00:00
|
|
|
static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
|
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:22:02 +00:00
|
|
|
gpa_t gpa, gpa_t size,
|
|
|
|
bool is_write, bool is_exec,
|
|
|
|
bool is_private)
|
2023-10-27 18:21:51 +00:00
|
|
|
{
|
|
|
|
vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
|
|
|
|
vcpu->run->memory_fault.gpa = gpa;
|
|
|
|
vcpu->run->memory_fault.size = size;
|
|
|
|
|
2023-10-27 18:22:02 +00:00
|
|
|
/* RWX flags are not (yet) defined or communicated to userspace. */
|
2023-10-27 18:21:51 +00:00
|
|
|
vcpu->run->memory_fault.flags = 0;
|
2023-10-27 18:22:02 +00:00
|
|
|
if (is_private)
|
|
|
|
vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
|
2023-10-27 18:21:51 +00:00
|
|
|
}
|
|
|
|
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
|
|
|
|
}
|
|
|
|
|
|
|
|
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
|
2024-07-11 22:27:54 +00:00
|
|
|
unsigned long mask, unsigned long attrs);
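The per-gfn lookup and the mask/attrs range check can be modeled with a flat array in place of the xarray. This is a sketch under assumptions: the `sketch_` names are hypothetical, the range is treated as half-open [start, end), and unset gfns are assumed to read as 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 3 is used for the PRIVATE attribute, leaving bits 0-2 for RWX. */
#define SKETCH_ATTR_PRIVATE	(1UL << 3)

/* Hypothetical store: a flat table indexed by gfn instead of an xarray. */
struct sketch_attrs {
	const unsigned long *per_gfn;
	size_t nr_gfns;
};

/* Look up one gfn; anything outside the table has no attributes set. */
static unsigned long sketch_get_attrs(const struct sketch_attrs *a,
				      uint64_t gfn)
{
	return gfn < a->nr_gfns ? a->per_gfn[gfn] : 0;
}

/*
 * True if every gfn in [start, end) has exactly 'attrs' set among the
 * bits selected by 'mask'. An empty range trivially matches.
 */
static bool sketch_range_has_attrs(const struct sketch_attrs *a,
				   uint64_t start, uint64_t end,
				   unsigned long mask, unsigned long attrs)
{
	uint64_t gfn;

	assert(start <= end);
	if (attrs & ~mask)
		return false;	/* asking for bits outside the mask */

	for (gfn = start; gfn < end; gfn++) {
		if ((sketch_get_attrs(a, gfn) & mask) != attrs)
			return false;
	}
	return true;
}
```

Passing a mask separately from the wanted value lets a caller ask "is this whole range private?" or "is this whole range shared?" with the same helper, by using attrs equal to the mask or to 0 respectively.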
|
2023-10-27 18:21:55 +00:00
|
|
|
bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
|
|
|
|
struct kvm_gfn_range *range);
|
|
|
|
bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
|
|
|
|
struct kvm_gfn_range *range);
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides a clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
|
|
|
|
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
|
|
|
|
kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
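The bit layout described above can be illustrated with a minimal sketch. Only the position of the PRIVATE bit (bit 3) comes from the commit message; the SKETCH_* names and the R/W/X assignments for bits 0-2 are hypothetical, since the message only says those bits are kept free for possible future use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 3 carries the PRIVATE attribute, as stated in the commit message. */
#define SKETCH_ATTR_PRIVATE	(1ULL << 3)

/*
 * Hypothetical names: bits 0-2 are only reserved for possible future RWX
 * protections; none of these three are existing KVM definitions.
 */
#define SKETCH_ATTR_R		(1ULL << 0)
#define SKETCH_ATTR_W		(1ULL << 1)
#define SKETCH_ATTR_X		(1ULL << 2)
#define SKETCH_ATTR_RWX		(SKETCH_ATTR_R | SKETCH_ATTR_W | SKETCH_ATTR_X)

/* True if the PRIVATE bit is set, regardless of the low three bits. */
static inline bool sketch_attrs_private(uint64_t attrs)
{
	return (attrs & SKETCH_ATTR_PRIVATE) != 0;
}

/* Extract only the (hypothetical) RWX portion of an attribute word. */
static inline uint64_t sketch_attrs_rwx(uint64_t attrs)
{
	return attrs & SKETCH_ATTR_RWX;
}
```

Keeping PRIVATE out of the low three bits means a later RWX encoding could be added without renumbering the existing attribute.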
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
|
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
2023-11-13 10:42:34 +00:00
|
|
|
#ifdef CONFIG_KVM_PRIVATE_MEM
|
|
|
|
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
|
2024-10-10 18:23:48 +00:00
|
|
|
gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
|
|
|
|
int *max_order);
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
2023-11-13 10:42:34 +00:00
|
|
|
#else
|
|
|
|
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *slot, gfn_t gfn,
|
2024-10-10 18:23:48 +00:00
|
|
|
kvm_pfn_t *pfn, struct page **page,
|
|
|
|
int *max_order)
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
2023-11-13 10:42:34 +00:00
|
|
|
{
|
|
|
|
KVM_BUG_ON(1, kvm);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_PRIVATE_MEM */
|
|
|
|
|
2024-07-11 22:27:47 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
|
2024-05-07 16:54:03 +00:00
|
|
|
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
|
|
|
|
#endif
|
|
|
|
|
2024-07-11 22:27:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
|
2024-02-14 17:09:06 +00:00
|
|
|
/**
|
|
|
|
* kvm_gmem_populate() - Populate/prepare a GPA range with guest data
|
|
|
|
*
|
|
|
|
* @kvm: KVM instance
|
|
|
|
* @gfn: starting GFN to be populated
|
|
|
|
* @src: userspace-provided buffer containing data to copy into GFN range
|
|
|
|
* (passed to @post_populate, and incremented on each iteration
|
|
|
|
* if not NULL)
|
|
|
|
* @npages: number of pages to copy from userspace-buffer
|
|
|
|
* @post_populate: callback to issue for each gmem page that backs the GPA
|
|
|
|
* range
|
|
|
|
* @opaque: opaque data to pass to @post_populate callback
|
|
|
|
*
|
|
|
|
* This is primarily intended for cases where a gmem-backed GPA range needs
|
|
|
|
* to be initialized with userspace-provided data prior to being mapped into
|
|
|
|
* the guest as a private page. This should be called with the slots->lock
|
|
|
|
* held so that caller-enforced invariants regarding the expected memory
|
|
|
|
* attributes of the GPA range do not race with KVM_SET_MEMORY_ATTRIBUTES.
|
|
|
|
*
|
|
|
|
* Returns the number of pages that were populated.
|
|
|
|
*/
|
|
|
|
typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
|
|
|
|
void __user *src, int order, void *opaque);
|
|
|
|
|
|
|
|
long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
|
|
|
|
kvm_gmem_populate_cb post_populate, void *opaque);
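The calling convention documented in the kernel-doc comment above can be sketched in plain userspace C. This is not the kernel implementation: every sketch_* name is hypothetical, pages are assumed to be handled one at a time (the real callback also receives a pfn and an order), and the choice to return the error only when no page was populated is an assumption made for illustration.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL		/* assumed page size for the sketch */

typedef unsigned long long sketch_gfn_t;	/* stand-in for gfn_t */

/* Simplified stand-in for kvm_gmem_populate_cb: no pfn or order argument. */
typedef int (*sketch_populate_cb)(sketch_gfn_t gfn, const unsigned char *src,
				  void *opaque);

/*
 * Invoke the callback once per page. @src advances by one page per
 * iteration unless it is NULL. Returns the number of pages handled, or
 * the callback's error if the very first page failed (an assumption).
 */
static long sketch_populate(sketch_gfn_t gfn, const unsigned char *src,
			    long npages, sketch_populate_cb post_populate,
			    void *opaque)
{
	long i;

	for (i = 0; i < npages; i++) {
		const unsigned char *p =
			src ? src + (size_t)i * SKETCH_PAGE_SIZE : NULL;
		int ret = post_populate(gfn + i, p, opaque);

		if (ret)
			return i ? i : ret;
	}
	return i;
}

/* Example callback state: count successful calls, fail at a chosen index. */
struct sketch_counter {
	long calls;
	long fail_at;	/* -1 means never fail */
};

static int sketch_count_cb(sketch_gfn_t gfn, const unsigned char *src,
			   void *opaque)
{
	struct sketch_counter *c = opaque;

	(void)gfn;
	(void)src;
	if (c->calls == c->fail_at)
		return -1;
	c->calls++;
	return 0;
}
```

With a counter that never fails, sketch_populate() over four pages reports four pages handled; with fail_at set to 2 it stops early and reports two.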
|
2024-07-11 22:27:55 +00:00
|
|
|
#endif
|
2024-02-14 17:09:06 +00:00
|
|
|
|
2024-07-11 22:27:47 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
|
2023-12-30 17:23:21 +00:00
|
|
|
void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
|
|
|
|
#endif
|
|
|
|
|
2024-04-10 22:07:28 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
|
|
|
|
long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_pre_fault_memory *range);
|
|
|
|
#endif
|
|
|
|
|
2009-08-26 11:57:50 +00:00
|
|
|
#endif
|