2019-06-04 08:11:32 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
/*
|
|
|
|
* Kernel-based Virtual Machine driver for Linux
|
|
|
|
*
|
|
|
|
* This module enables machines with Intel VT-x extensions to run virtual
|
|
|
|
* machines without emulation or binary translation.
|
|
|
|
*
|
|
|
|
* Copyright (C) 2006 Qumranet, Inc.
|
2010-10-06 12:23:22 +00:00
|
|
|
* Copyright 2010 Red Hat, Inc. and/or its affiliates.
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
* Avi Kivity <avi@qumranet.com>
|
|
|
|
* Yaniv Kamay <yaniv@qumranet.com>
|
|
|
|
*/
|
|
|
|
|
2015-03-26 14:39:29 +00:00
|
|
|
#include <kvm/iodev.h>
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-12-16 09:02:48 +00:00
|
|
|
#include <linux/kvm_host.h>
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
#include <linux/kvm.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/errno.h>
|
|
|
|
#include <linux/percpu.h>
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/miscdevice.h>
|
|
|
|
#include <linux/vmalloc.h>
|
|
|
|
#include <linux/reboot.h>
|
|
|
|
#include <linux/debugfs.h>
|
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/file.h>
|
2011-03-23 21:16:23 +00:00
|
|
|
#include <linux/syscore_ops.h>
|
2007-02-12 08:54:47 +00:00
|
|
|
#include <linux/cpu.h>
|
2017-02-02 18:15:33 +00:00
|
|
|
#include <linux/sched/signal.h>
|
2017-02-08 17:51:29 +00:00
|
|
|
#include <linux/sched/mm.h>
|
2017-02-08 17:51:35 +00:00
|
|
|
#include <linux/sched/stat.h>
|
2007-06-07 16:18:30 +00:00
|
|
|
#include <linux/cpumask.h>
|
|
|
|
#include <linux/smp.h>
|
2007-06-28 12:38:16 +00:00
|
|
|
#include <linux/anon_inodes.h>
|
2007-09-10 15:10:54 +00:00
|
|
|
#include <linux/profile.h>
|
2007-09-17 19:57:50 +00:00
|
|
|
#include <linux/kvm_para.h>
|
2007-10-09 17:20:39 +00:00
|
|
|
#include <linux/pagemap.h>
|
2007-10-18 14:59:34 +00:00
|
|
|
#include <linux/mman.h>
|
2008-04-02 19:46:56 +00:00
|
|
|
#include <linux/swap.h>
|
2009-03-12 13:45:39 +00:00
|
|
|
#include <linux/bitops.h>
|
2009-05-07 20:55:13 +00:00
|
|
|
#include <linux/spinlock.h>
|
2009-10-22 12:19:27 +00:00
|
|
|
#include <linux/compat.h>
|
2009-12-23 16:35:21 +00:00
|
|
|
#include <linux/srcu.h>
|
2010-01-28 11:37:56 +00:00
|
|
|
#include <linux/hugetlb.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
|
|
|
#include <linux/slab.h>
|
2011-07-27 13:00:48 +00:00
|
|
|
#include <linux/sort.h>
|
|
|
|
#include <linux/bsearch.h>
|
2019-05-17 12:08:53 +00:00
|
|
|
#include <linux/io.h>
|
2019-05-17 08:49:49 +00:00
|
|
|
#include <linux/lockdep.h>
|
2019-11-04 11:22:02 +00:00
|
|
|
#include <linux/kthread.h>
|
2021-06-06 02:10:44 +00:00
|
|
|
#include <linux/suspend.h>
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-06-28 18:15:57 +00:00
|
|
|
#include <asm/processor.h>
|
2014-09-19 23:03:25 +00:00
|
|
|
#include <asm/ioctl.h>
|
2016-12-24 19:46:01 +00:00
|
|
|
#include <linux/uaccess.h>
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2008-05-30 14:05:54 +00:00
|
|
|
#include "coalesced_mmio.h"
|
2010-10-14 09:22:46 +00:00
|
|
|
#include "async_pf.h"
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 16:36:21 +00:00
|
|
|
#include "kvm_mm.h"
|
2014-09-24 11:02:46 +00:00
|
|
|
#include "vfio.h"
|
2008-05-30 14:05:54 +00:00
|
|
|
|
2023-03-07 14:35:56 +00:00
|
|
|
#include <trace/events/ipi.h>
|
|
|
|
|
2009-06-17 12:22:14 +00:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/kvm.h>
|
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
#include <linux/kvm_dirty_ring.h>
|
|
|
|
|
2023-03-07 14:35:56 +00:00
|
|
|
|
2016-05-18 11:26:23 +00:00
|
|
|
/* Worst case buffer size needed for holding an integer. */
|
|
|
|
#define ITOA_MAX_LEN 12
|
|
|
|
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
MODULE_AUTHOR("Qumranet");
|
|
|
|
MODULE_LICENSE("GPL");
|
|
|
|
|
2015-09-18 10:34:53 +00:00
|
|
|
/* Architectures should define their poll value according to the halt latency */
|
2016-10-14 00:53:19 +00:00
|
|
|
unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT;
|
2017-06-27 09:51:18 +00:00
|
|
|
module_param(halt_poll_ns, uint, 0644);
|
2016-10-14 00:53:19 +00:00
|
|
|
EXPORT_SYMBOL_GPL(halt_poll_ns);
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
|
2015-09-03 14:07:38 +00:00
|
|
|
/* Default doubles per-vcpu halt_poll_ns. */
|
2016-10-14 00:53:19 +00:00
|
|
|
unsigned int halt_poll_ns_grow = 2;
|
2017-06-27 09:51:18 +00:00
|
|
|
module_param(halt_poll_ns_grow, uint, 0644);
|
2016-10-14 00:53:19 +00:00
|
|
|
EXPORT_SYMBOL_GPL(halt_poll_ns_grow);
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2019-01-27 10:17:15 +00:00
|
|
|
/* The start value to grow halt_poll_ns from */
|
|
|
|
unsigned int halt_poll_ns_grow_start = 10000; /* 10us */
|
|
|
|
module_param(halt_poll_ns_grow_start, uint, 0644);
|
|
|
|
EXPORT_SYMBOL_GPL(halt_poll_ns_grow_start);
|
|
|
|
|
2023-11-02 15:46:27 +00:00
|
|
|
/* Default halves per-vcpu halt_poll_ns. */
|
|
|
|
unsigned int halt_poll_ns_shrink = 2;
|
2017-06-27 09:51:18 +00:00
|
|
|
module_param(halt_poll_ns_shrink, uint, 0644);
|
2016-10-14 00:53:19 +00:00
|
|
|
EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2009-06-04 18:08:24 +00:00
|
|
|
/*
|
|
|
|
* Ordering of locks:
|
|
|
|
*
|
2015-02-26 06:58:24 +00:00
|
|
|
* kvm->lock --> kvm->slots_lock --> kvm->irq_lock
|
2009-06-04 18:08:24 +00:00
|
|
|
*/
|
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
DEFINE_MUTEX(kvm_lock);
|
2007-11-14 12:38:21 +00:00
|
|
|
LIST_HEAD(vm_list);
|
2007-02-12 08:54:44 +00:00
|
|
|
|
2019-12-18 21:55:16 +00:00
|
|
|
static struct kmem_cache *kvm_vcpu_cache;
|
2007-04-19 14:27:43 +00:00
|
|
|
|
2007-07-11 15:17:21 +00:00
|
|
|
static __read_mostly struct preempt_ops kvm_preempt_ops;
|
2020-01-09 14:57:19 +00:00
|
|
|
static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
|
2007-07-11 15:17:21 +00:00
|
|
|
|
2024-05-15 15:08:04 +00:00
|
|
|
static struct dentry *kvm_debugfs_dir;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static const struct file_operations stat_fops_per_vm;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2007-02-21 16:04:26 +00:00
|
|
|
static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2015-02-03 08:35:15 +00:00
|
|
|
#ifdef CONFIG_KVM_COMPAT
|
2011-06-08 00:45:37 +00:00
|
|
|
static long kvm_vcpu_compat_ioctl(struct file *file, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2018-06-17 09:16:21 +00:00
|
|
|
#define KVM_COMPAT(c) .compat_ioctl = (c)
|
|
|
|
#else
|
2019-11-14 13:17:39 +00:00
|
|
|
/*
|
|
|
|
* For architectures that don't implement a compat infrastructure,
|
|
|
|
* adopt a double line of defense:
|
|
|
|
* - Prevent a compat task from opening /dev/kvm
|
|
|
|
* - If the open has been done by a 64bit task, and the KVM fd
|
|
|
|
* passed to a compat task, let the ioctls fail.
|
|
|
|
*/
|
2018-06-17 09:16:21 +00:00
|
|
|
static long kvm_no_compat_ioctl(struct file *file, unsigned int ioctl,
|
|
|
|
unsigned long arg) { return -EINVAL; }
|
2019-11-13 16:05:23 +00:00
|
|
|
|
|
|
|
static int kvm_no_compat_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
return is_compat_task() ? -ENODEV : 0;
|
|
|
|
}
|
|
|
|
#define KVM_COMPAT(c) .compat_ioctl = kvm_no_compat_ioctl, \
|
|
|
|
.open = kvm_no_compat_open
|
2011-06-08 00:45:37 +00:00
|
|
|
#endif
|
2009-09-15 09:37:46 +00:00
|
|
|
static int hardware_enable_all(void);
|
|
|
|
static void hardware_disable_all(void);
|
2007-02-21 16:04:26 +00:00
|
|
|
|
2009-12-23 16:35:24 +00:00
|
|
|
static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
|
2013-12-29 20:12:29 +00:00
|
|
|
|
2017-07-12 15:56:44 +00:00
|
|
|
#define KVM_EVENT_CREATE_VM 0
|
|
|
|
#define KVM_EVENT_DESTROY_VM 1
|
|
|
|
static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
|
|
|
|
static unsigned long long kvm_createvm_count;
|
|
|
|
static unsigned long long kvm_active_vms;
|
|
|
|
|
2021-09-03 07:51:40 +00:00
|
|
|
static DEFINE_PER_CPU(cpumask_var_t, cpu_kick_mask);
|
|
|
|
|
2022-04-21 03:14:07 +00:00
|
|
|
__weak void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2022-04-29 01:04:14 +00:00
|
|
|
bool kvm_is_zone_device_page(struct page *page)
|
2019-11-11 22:12:27 +00:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The metadata used by is_zone_device_page() to determine whether or
|
|
|
|
* not a page is ZONE_DEVICE is guaranteed to be valid if and only if
|
|
|
|
* the device has been pinned, e.g. by get_user_pages(). WARN if the
|
|
|
|
* page_count() is zero to help detect bad usage of this helper.
|
|
|
|
*/
|
2022-04-29 01:04:14 +00:00
|
|
|
if (WARN_ON_ONCE(!page_count(page)))
|
2019-11-11 22:12:27 +00:00
|
|
|
return false;
|
|
|
|
|
2022-04-29 01:04:14 +00:00
|
|
|
return is_zone_device_page(page);
|
2019-11-11 22:12:27 +00:00
|
|
|
}
|
|
|
|
|
2022-04-29 01:04:15 +00:00
|
|
|
/*
|
|
|
|
* Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
|
|
|
|
* page, NULL otherwise. Note, the list of refcounted PG_reserved page types
|
|
|
|
* is likely incomplete, it has been compiled purely through people wanting to
|
|
|
|
* back guest with a certain type of memory and encountering issues.
|
|
|
|
*/
|
|
|
|
struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
|
2008-07-28 16:26:24 +00:00
|
|
|
{
|
2022-04-29 01:04:15 +00:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
if (!pfn_valid(pfn))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
page = pfn_to_page(pfn);
|
|
|
|
if (!PageReserved(page))
|
|
|
|
return page;
|
|
|
|
|
|
|
|
/* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
|
|
|
|
if (is_zero_pfn(pfn))
|
|
|
|
return page;
|
|
|
|
|
2019-11-11 22:12:27 +00:00
|
|
|
/*
|
|
|
|
* ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
|
|
|
|
* perspective they are "normal" pages, albeit with slightly different
|
|
|
|
* usage rules.
|
|
|
|
*/
|
2022-04-29 01:04:15 +00:00
|
|
|
if (kvm_is_zone_device_page(page))
|
|
|
|
return page;
|
2008-07-28 16:26:24 +00:00
|
|
|
|
2022-04-29 01:04:15 +00:00
|
|
|
return NULL;
|
2008-07-28 16:26:24 +00:00
|
|
|
}
|
|
|
|
|
2007-02-21 16:04:26 +00:00
|
|
|
/*
|
|
|
|
* Switches to specified vcpu, until a matching vcpu_put()
|
|
|
|
*/
|
2017-12-04 20:35:23 +00:00
|
|
|
void vcpu_load(struct kvm_vcpu *vcpu)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2017-12-04 20:35:23 +00:00
|
|
|
int cpu = get_cpu();
|
2020-01-09 14:57:19 +00:00
|
|
|
|
|
|
|
__this_cpu_write(kvm_running_vcpu, vcpu);
|
2007-07-11 15:17:21 +00:00
|
|
|
preempt_notifier_register(&vcpu->preempt_notifier);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
kvm_arch_vcpu_load(vcpu, cpu);
|
2007-07-11 15:17:21 +00:00
|
|
|
put_cpu();
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
2016-07-08 22:36:06 +00:00
|
|
|
EXPORT_SYMBOL_GPL(vcpu_load);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
void vcpu_put(struct kvm_vcpu *vcpu)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2007-07-11 15:17:21 +00:00
|
|
|
preempt_disable();
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
kvm_arch_vcpu_put(vcpu);
|
2007-07-11 15:17:21 +00:00
|
|
|
preempt_notifier_unregister(&vcpu->preempt_notifier);
|
2020-01-09 14:57:19 +00:00
|
|
|
__this_cpu_write(kvm_running_vcpu, NULL);
|
2007-07-11 15:17:21 +00:00
|
|
|
preempt_enable();
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
2016-07-08 22:36:06 +00:00
|
|
|
EXPORT_SYMBOL_GPL(vcpu_put);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2017-04-27 12:33:43 +00:00
|
|
|
/* TODO: merge with kvm_arch_vcpu_should_kick */
|
|
|
|
static bool kvm_request_needs_ipi(struct kvm_vcpu *vcpu, unsigned req)
|
|
|
|
{
|
|
|
|
int mode = kvm_vcpu_exiting_guest_mode(vcpu);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to wait for the VCPU to reenable interrupts and get out of
|
|
|
|
* READING_SHADOW_PAGE_TABLES mode.
|
|
|
|
*/
|
|
|
|
if (req & KVM_REQUEST_WAIT)
|
|
|
|
return mode != OUTSIDE_GUEST_MODE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Need to kick a running VCPU, but otherwise there is nothing to do.
|
|
|
|
*/
|
|
|
|
return mode == IN_GUEST_MODE;
|
|
|
|
}
|
|
|
|
|
2022-06-05 06:34:15 +00:00
|
|
|
static void ack_kick(void *_completed)
|
2007-06-07 16:18:30 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2021-09-03 07:51:41 +00:00
|
|
|
static inline bool kvm_kick_many_cpus(struct cpumask *cpus, bool wait)
|
2017-06-30 11:25:45 +00:00
|
|
|
{
|
|
|
|
if (cpumask_empty(cpus))
|
|
|
|
return false;
|
|
|
|
|
2022-06-05 06:34:15 +00:00
|
|
|
smp_call_function_many(cpus, ack_kick, NULL, wait);
|
2017-06-30 11:25:45 +00:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2022-01-25 09:59:08 +00:00
|
|
|
static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
|
|
|
|
struct cpumask *tmp, int current_cpu)
|
2021-09-03 07:51:37 +00:00
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
2022-02-23 16:53:02 +00:00
|
|
|
if (likely(!(req & KVM_REQUEST_NO_ACTION)))
|
|
|
|
__kvm_make_request(req, vcpu);
|
2021-09-03 07:51:37 +00:00
|
|
|
|
|
|
|
if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note, the vCPU could get migrated to a different pCPU at any point
|
|
|
|
* after kvm_request_needs_ipi(), which could result in sending an IPI
|
|
|
|
* to the previous pCPU. But, that's OK because the purpose of the IPI
|
|
|
|
* is to ensure the vCPU returns to OUTSIDE_GUEST_MODE, which is
|
|
|
|
* satisfied if the vCPU migrates. Entering READING_SHADOW_PAGE_TABLES
|
|
|
|
* after this point is also OK, as the requirement is only that KVM wait
|
|
|
|
* for vCPUs that were reading SPTEs _before_ any changes were
|
|
|
|
* finalized. See kvm_vcpu_kick() for more details on handling requests.
|
|
|
|
*/
|
|
|
|
if (kvm_request_needs_ipi(vcpu, req)) {
|
|
|
|
cpu = READ_ONCE(vcpu->cpu);
|
|
|
|
if (cpu != -1 && cpu != current_cpu)
|
|
|
|
__cpumask_set_cpu(cpu, tmp);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-05-16 15:21:28 +00:00
|
|
|
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
|
2021-09-03 07:51:41 +00:00
|
|
|
unsigned long *vcpu_bitmap)
|
2007-06-07 16:18:30 +00:00
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu;
|
2021-09-03 07:51:41 +00:00
|
|
|
struct cpumask *cpus;
|
2021-09-03 07:51:37 +00:00
|
|
|
int i, me;
|
2018-05-16 15:21:28 +00:00
|
|
|
bool called;
|
2008-12-08 09:58:04 +00:00
|
|
|
|
2011-01-12 07:41:22 +00:00
|
|
|
me = get_cpu();
|
2018-05-16 15:21:28 +00:00
|
|
|
|
2021-09-03 07:51:41 +00:00
|
|
|
cpus = this_cpu_cpumask_var_ptr(cpu_kick_mask);
|
|
|
|
cpumask_clear(cpus);
|
|
|
|
|
2021-09-03 07:51:37 +00:00
|
|
|
for_each_set_bit(i, vcpu_bitmap, KVM_MAX_VCPUS) {
|
|
|
|
vcpu = kvm_get_vcpu(kvm, i);
|
2021-09-03 07:51:38 +00:00
|
|
|
if (!vcpu)
|
2018-05-16 15:21:28 +00:00
|
|
|
continue;
|
2022-01-25 09:59:08 +00:00
|
|
|
kvm_make_vcpu_request(vcpu, req, cpus, me);
|
2008-12-08 09:56:24 +00:00
|
|
|
}
|
2018-05-16 15:21:28 +00:00
|
|
|
|
2021-09-03 07:51:41 +00:00
|
|
|
called = kvm_kick_many_cpus(cpus, !!(req & KVM_REQUEST_WAIT));
|
2011-01-12 07:41:22 +00:00
|
|
|
put_cpu();
|
2018-05-16 15:21:28 +00:00
|
|
|
|
|
|
|
return called;
|
|
|
|
}
|
|
|
|
|
2024-04-04 23:26:51 +00:00
|
|
|
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
|
2018-05-16 15:21:28 +00:00
|
|
|
{
|
2021-09-03 07:51:37 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
2021-09-03 07:51:40 +00:00
|
|
|
struct cpumask *cpus;
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2018-05-16 15:21:28 +00:00
|
|
|
bool called;
|
2021-11-16 16:04:02 +00:00
|
|
|
int me;
|
2018-05-16 15:21:28 +00:00
|
|
|
|
2021-09-03 07:51:37 +00:00
|
|
|
me = get_cpu();
|
|
|
|
|
2021-09-03 07:51:40 +00:00
|
|
|
cpus = this_cpu_cpumask_var_ptr(cpu_kick_mask);
|
|
|
|
cpumask_clear(cpus);
|
|
|
|
|
2024-04-04 23:26:51 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
2022-01-25 09:59:08 +00:00
|
|
|
kvm_make_vcpu_request(vcpu, req, cpus, me);
|
2021-09-03 07:51:37 +00:00
|
|
|
|
|
|
|
called = kvm_kick_many_cpus(cpus, !!(req & KVM_REQUEST_WAIT));
|
|
|
|
put_cpu();
|
2018-05-16 15:21:28 +00:00
|
|
|
|
2008-12-08 09:56:24 +00:00
|
|
|
return called;
|
2007-06-07 16:18:30 +00:00
|
|
|
}
|
KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
For VMX, when a vcpu enters HLT emulation, pi_post_block will:
1) Add vcpu to per-cpu list of blocked vcpus.
2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR
With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.
This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.
Problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. Its possible for the
following to happen:
1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.
To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is
properly reprogrammed to the wakeup vector.
Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-26 17:20:14 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_make_all_cpus_request);
|
2020-05-06 13:17:53 +00:00
|
|
|
|
2008-12-08 09:56:24 +00:00
|
|
|
void kvm_flush_remote_tlbs(struct kvm *kvm)
|
2008-02-20 19:47:24 +00:00
|
|
|
{
|
2021-08-17 00:26:39 +00:00
|
|
|
++kvm->stat.generic.remote_tlb_flush_requests;
|
2021-09-18 00:56:29 +00:00
|
|
|
|
2016-03-13 03:10:28 +00:00
|
|
|
/*
|
|
|
|
* We want to publish modifications to the page tables before reading
|
|
|
|
* mode. Pairs with a memory barrier in arch-specific code.
|
|
|
|
* - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest
|
|
|
|
* and smp_mb in walk_shadow_page_lockless_begin/end.
|
|
|
|
* - powerpc: smp_mb in kvmppc_prepare_to_enter.
|
|
|
|
*
|
|
|
|
* There is already an smp_mb__after_atomic() before
|
|
|
|
* kvm_make_all_cpus_request() reads vcpu->mode. We reuse that
|
|
|
|
* barrier here.
|
|
|
|
*/
|
2023-08-11 04:51:14 +00:00
|
|
|
if (!kvm_arch_flush_remote_tlbs(kvm)
|
2018-07-19 08:40:17 +00:00
|
|
|
|| kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
|
2021-06-18 22:27:03 +00:00
|
|
|
++kvm->stat.generic.remote_tlb_flush;
|
2008-02-20 19:47:24 +00:00
|
|
|
}
|
2013-10-07 16:47:59 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
|
2008-02-20 19:47:24 +00:00
|
|
|
|
2023-08-11 04:51:18 +00:00
|
|
|
void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
|
|
|
|
{
|
|
|
|
if (!kvm_arch_flush_remote_tlbs_range(kvm, gfn, nr_pages))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fall back to a flushing entire TLBs if the architecture range-based
|
|
|
|
* TLB invalidation is unsupported or can't be performed for whatever
|
|
|
|
* reason.
|
|
|
|
*/
|
|
|
|
kvm_flush_remote_tlbs(kvm);
|
|
|
|
}
|
|
|
|
|
2023-08-11 04:51:19 +00:00
|
|
|
void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *memslot)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* All current use cases for flushing the TLBs for a specific memslot
|
|
|
|
* are related to dirty logging, and many do the TLB flush out of
|
|
|
|
* mmu_lock. The interaction between the various operations on memslot
|
|
|
|
* must be serialized by slots_locks to ensure the TLB flush from one
|
|
|
|
* operation is observed by any other operation on the same memslot.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&kvm->slots_lock);
|
|
|
|
kvm_flush_remote_tlbs_range(kvm, memslot->base_gfn, memslot->npages);
|
|
|
|
}
|
2008-02-20 19:47:24 +00:00
|
|
|
|
2022-04-21 03:14:07 +00:00
|
|
|
static void kvm_flush_shadow_all(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm_arch_flush_shadow_all(kvm);
|
|
|
|
kvm_arch_guest_memory_reclaimed(kvm);
|
|
|
|
}
|
|
|
|
|
2020-07-03 02:35:39 +00:00
|
|
|
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
|
|
|
|
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
|
|
|
|
gfp_t gfp_flags)
|
|
|
|
{
|
2024-01-22 23:53:11 +00:00
|
|
|
void *page;
|
|
|
|
|
2020-07-03 02:35:39 +00:00
|
|
|
gfp_flags |= mc->gfp_zero;
|
|
|
|
|
|
|
|
if (mc->kmem_cache)
|
|
|
|
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
|
2024-01-22 23:53:11 +00:00
|
|
|
|
|
|
|
page = (void *)__get_free_page(gfp_flags);
|
|
|
|
if (page && mc->init_value)
|
|
|
|
memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
|
|
|
|
return page;
|
2020-07-03 02:35:39 +00:00
|
|
|
}
|
|
|
|
|
2022-06-22 19:27:08 +00:00
|
|
|
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
|
2020-07-03 02:35:39 +00:00
|
|
|
{
|
2022-07-29 13:46:01 +00:00
|
|
|
gfp_t gfp = mc->gfp_custom ? mc->gfp_custom : GFP_KERNEL_ACCOUNT;
|
2020-07-03 02:35:39 +00:00
|
|
|
void *obj;
|
|
|
|
|
|
|
|
if (mc->nobjs >= min)
|
|
|
|
return 0;
|
2022-06-22 19:27:08 +00:00
|
|
|
|
|
|
|
if (unlikely(!mc->objects)) {
|
|
|
|
if (WARN_ON_ONCE(!capacity))
|
|
|
|
return -EIO;
|
|
|
|
|
2024-01-22 23:53:11 +00:00
|
|
|
/*
|
|
|
|
* Custom init values can be used only for page allocations,
|
|
|
|
* and obviously conflict with __GFP_ZERO.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
|
|
|
|
return -EIO;
|
|
|
|
|
2024-02-12 11:24:10 +00:00
|
|
|
mc->objects = kvmalloc_array(capacity, sizeof(void *), gfp);
|
2022-06-22 19:27:08 +00:00
|
|
|
if (!mc->objects)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
mc->capacity = capacity;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* It is illegal to request a different capacity across topups. */
|
|
|
|
if (WARN_ON_ONCE(mc->capacity != capacity))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
while (mc->nobjs < mc->capacity) {
|
|
|
|
obj = mmu_memory_cache_alloc_obj(mc, gfp);
|
2020-07-03 02:35:39 +00:00
|
|
|
if (!obj)
|
|
|
|
return mc->nobjs >= min ? 0 : -ENOMEM;
|
|
|
|
mc->objects[mc->nobjs++] = obj;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-06-22 19:27:08 +00:00
|
|
|
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
|
|
|
|
{
|
|
|
|
return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
|
|
|
|
}
|
|
|
|
|
2020-07-03 02:35:39 +00:00
|
|
|
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
|
|
|
|
{
|
|
|
|
return mc->nobjs;
|
|
|
|
}
|
|
|
|
|
|
|
|
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
|
|
|
|
{
|
|
|
|
while (mc->nobjs) {
|
|
|
|
if (mc->kmem_cache)
|
|
|
|
kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
|
|
|
|
else
|
|
|
|
free_page((unsigned long)mc->objects[--mc->nobjs]);
|
|
|
|
}
|
2022-06-22 19:27:08 +00:00
|
|
|
|
|
|
|
kvfree(mc->objects);
|
|
|
|
|
|
|
|
mc->objects = NULL;
|
|
|
|
mc->capacity = 0;
|
2020-07-03 02:35:39 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
|
|
|
|
{
|
|
|
|
void *p;
|
|
|
|
|
|
|
|
if (WARN_ON(!mc->nobjs))
|
|
|
|
p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
|
|
|
|
else
|
|
|
|
p = mc->objects[--mc->nobjs];
|
|
|
|
BUG_ON(!p);
|
|
|
|
return p;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-12-18 21:55:30 +00:00
|
|
|
static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
|
2007-07-27 07:16:56 +00:00
|
|
|
{
|
|
|
|
mutex_init(&vcpu->mutex);
|
|
|
|
vcpu->cpu = -1;
|
|
|
|
vcpu->kvm = kvm;
|
|
|
|
vcpu->vcpu_id = id;
|
2011-02-01 14:52:41 +00:00
|
|
|
vcpu->pid = NULL;
|
2021-10-09 02:11:57 +00:00
|
|
|
#ifndef __KVM_HAVE_ARCH_WQP
|
2020-04-24 05:48:37 +00:00
|
|
|
rcuwait_init(&vcpu->wait);
|
2021-10-09 02:11:57 +00:00
|
|
|
#endif
|
2010-10-14 09:22:46 +00:00
|
|
|
kvm_async_pf_vcpu_init(vcpu);
|
2007-07-27 07:16:56 +00:00
|
|
|
|
2012-07-18 13:37:46 +00:00
|
|
|
kvm_vcpu_set_in_spin_loop(vcpu, false);
|
|
|
|
kvm_vcpu_set_dy_eligible(vcpu, false);
|
2013-03-04 18:02:07 +00:00
|
|
|
vcpu->preempted = false;
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
vcpu->ready = false;
|
2019-12-18 21:55:17 +00:00
|
|
|
preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
vcpu->last_used_slot = NULL;
|
2022-07-20 09:22:48 +00:00
|
|
|
|
|
|
|
/* Fill the stats id string for the vcpu */
|
|
|
|
snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
|
|
|
|
task_pid_nr(current), id);
|
2007-07-27 07:16:56 +00:00
|
|
|
}
|
|
|
|
|
2021-11-16 16:03:57 +00:00
|
|
|
static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
|
2019-12-18 21:55:14 +00:00
|
|
|
{
|
|
|
|
kvm_arch_vcpu_destroy(vcpu);
|
2022-04-06 17:13:42 +00:00
|
|
|
kvm_dirty_ring_free(&vcpu->dirty_ring);
|
2019-12-18 21:55:15 +00:00
|
|
|
|
2019-12-18 21:55:29 +00:00
|
|
|
/*
|
|
|
|
* No need for rcu_read_lock as VCPU_RUN is the only place that changes
|
|
|
|
* the vcpu->pid pointer, and at destruction time all file descriptors
|
|
|
|
* are already gone.
|
|
|
|
*/
|
|
|
|
put_pid(rcu_dereference_protected(vcpu->pid, 1));
|
|
|
|
|
2019-12-18 21:55:30 +00:00
|
|
|
free_page((unsigned long)vcpu->run);
|
2019-12-18 21:55:15 +00:00
|
|
|
kmem_cache_free(kvm_vcpu_cache, vcpu);
|
2019-12-18 21:55:14 +00:00
|
|
|
}
|
2021-11-16 16:03:57 +00:00
|
|
|
|
|
|
|
void kvm_destroy_vcpus(struct kvm *kvm)
|
|
|
|
{
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2021-11-16 16:03:57 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
|
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm) {
|
|
|
|
kvm_vcpu_destroy(vcpu);
|
2021-11-16 16:04:01 +00:00
|
|
|
xa_erase(&kvm->vcpu_array, i);
|
2021-11-16 16:03:57 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
atomic_set(&kvm->online_vcpus, 0);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
|
2019-12-18 21:55:14 +00:00
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2008-07-25 14:24:52 +00:00
|
|
|
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
|
|
|
|
{
|
|
|
|
return container_of(mn, struct kvm, mmu_notifier);
|
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:43 +00:00
|
|
|
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
|
2023-10-27 18:21:45 +00:00
|
|
|
typedef void (*on_lock_fn_t)(struct kvm *kvm);
|
2022-04-21 03:14:07 +00:00
|
|
|
|
2023-10-27 18:21:43 +00:00
|
|
|
struct kvm_mmu_notifier_range {
|
|
|
|
/*
|
|
|
|
* 64-bit addresses, as KVM notifiers can operate on host virtual
|
|
|
|
* addresses (unsigned long) and guest physical addresses (64-bit).
|
|
|
|
*/
|
|
|
|
u64 start;
|
|
|
|
u64 end;
|
2023-07-29 00:41:44 +00:00
|
|
|
union kvm_mmu_notifier_arg arg;
|
2023-10-27 18:21:43 +00:00
|
|
|
gfn_handler_t handler;
|
2021-04-02 00:56:55 +00:00
|
|
|
on_lock_fn_t on_lock;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
bool flush_on_ret;
|
|
|
|
bool may_block;
|
|
|
|
};
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
/*
|
|
|
|
* The inner-most helper returns a tuple containing the return value from the
|
|
|
|
* arch- and action-specific handler, plus a flag indicating whether or not at
|
|
|
|
* least one memslot was found, i.e. if the handler found guest memory.
|
|
|
|
*
|
|
|
|
* Note, most notifiers are averse to booleans, so even though KVM tracks the
|
|
|
|
* return from arch code as a bool, outer helpers will cast it to an int. :-(
|
|
|
|
*/
|
|
|
|
typedef struct kvm_mmu_notifier_return {
|
|
|
|
bool ret;
|
|
|
|
bool found_memslot;
|
|
|
|
} kvm_mn_ret_t;
|
|
|
|
|
2021-04-02 00:56:55 +00:00
|
|
|
/*
|
|
|
|
* Use a dedicated stub instead of NULL to indicate that there is no callback
|
|
|
|
* function/handler. The compiler technically can't guarantee that a real
|
|
|
|
* function will have a non-zero address, and so it will generate code to
|
|
|
|
* check for !NULL, whereas comparing against a stub will be elided at compile
|
|
|
|
* time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
|
|
|
|
*/
|
|
|
|
static void kvm_null_fn(void)
|
|
|
|
{
|
|
|
|
|
|
|
|
}
|
|
|
|
#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
|
|
|
|
|
2021-12-06 19:54:28 +00:00
|
|
|
/* Iterate over each memslot intersecting [start, last] (inclusive) range */
|
|
|
|
#define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
|
|
|
|
for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
|
|
|
|
node; \
|
|
|
|
node = interval_tree_iter_next(node, start, last)) \
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
|
|
|
|
const struct kvm_mmu_notifier_range *range)
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
{
|
2023-10-27 18:21:52 +00:00
|
|
|
struct kvm_mmu_notifier_return r = {
|
|
|
|
.ret = false,
|
|
|
|
.found_memslot = false,
|
|
|
|
};
|
2021-04-02 00:56:55 +00:00
|
|
|
struct kvm_gfn_range gfn_range;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
struct kvm_memslots *slots;
|
|
|
|
int i, idx;
|
|
|
|
|
2021-12-06 19:54:28 +00:00
|
|
|
if (WARN_ON_ONCE(range->end <= range->start))
|
2023-10-27 18:21:52 +00:00
|
|
|
return r;
|
2021-12-06 19:54:28 +00:00
|
|
|
|
2021-04-02 00:56:55 +00:00
|
|
|
/* A null handler is allowed if and only if on_lock() is provided. */
|
|
|
|
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
|
|
|
|
IS_KVM_NULL_FN(range->handler)))
|
2023-10-27 18:21:52 +00:00
|
|
|
return r;
|
2021-04-02 00:56:55 +00:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
idx = srcu_read_lock(&kvm->srcu);
|
|
|
|
|
2023-10-27 18:22:04 +00:00
|
|
|
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
|
2021-12-06 19:54:28 +00:00
|
|
|
struct interval_tree_node *node;
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
slots = __kvm_memslots(kvm, i);
|
2021-12-06 19:54:28 +00:00
|
|
|
kvm_for_each_memslot_in_hva_range(node, slots,
|
|
|
|
range->start, range->end - 1) {
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
unsigned long hva_start, hva_end;
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
|
2023-10-27 18:21:43 +00:00
|
|
|
hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
|
|
|
|
hva_end = min_t(unsigned long, range->end,
|
|
|
|
slot->userspace_addr + (slot->npages << PAGE_SHIFT));
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* To optimize for the likely case where the address
|
|
|
|
* range is covered by zero or one memslots, don't
|
|
|
|
* bother making these conditional (to avoid writes on
|
|
|
|
* the second or later invocation of the handler).
|
|
|
|
*/
|
2023-07-29 00:41:44 +00:00
|
|
|
gfn_range.arg = range->arg;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
gfn_range.may_block = range->may_block;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* {gfn(page) | page intersects with [hva_start, hva_end)} =
|
|
|
|
* {gfn_start, gfn_start+1, ..., gfn_end-1}.
|
|
|
|
*/
|
|
|
|
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
|
|
|
|
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
|
|
|
|
gfn_range.slot = slot;
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
if (!r.found_memslot) {
|
|
|
|
r.found_memslot = true;
|
KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.
For small VMs, spurious locking is a minor annoyance. And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.
But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space. In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults. In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.
x86's TDP MMU implementation is especially susceptible to spurious
locking due it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), invalidate_range_end(), and
.change_pte(). Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.
Note, this optimizes only the unpaired callbacks. Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:56 +00:00
|
|
|
KVM_MMU_LOCK(kvm);
|
2021-08-03 07:45:41 +00:00
|
|
|
if (!IS_KVM_NULL_FN(range->on_lock))
|
2023-10-27 18:21:45 +00:00
|
|
|
range->on_lock(kvm);
|
|
|
|
|
2021-08-03 07:45:41 +00:00
|
|
|
if (IS_KVM_NULL_FN(range->handler))
|
|
|
|
break;
|
KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.
For small VMs, spurious locking is a minor annoyance. And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.
But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space. In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults. In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.
x86's TDP MMU implementation is especially susceptible to spurious
locking due it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), invalidate_range_end(), and
.change_pte(). Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.
Note, this optimizes only the unpaired callbacks. Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:56 +00:00
|
|
|
}
|
2023-10-27 18:21:52 +00:00
|
|
|
r.ret |= range->handler(kvm, &gfn_range);
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
if (range->flush_on_ret && r.ret)
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
kvm_flush_remote_tlbs(kvm);
|
|
|
|
|
2023-10-27 18:21:53 +00:00
|
|
|
if (r.found_memslot)
|
KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.
For small VMs, spurious locking is a minor annoyance. And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.
But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space. In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults. In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.
x86's TDP MMU implementation is especially susceptible to spurious
locking due it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), invalidate_range_end(), and
.change_pte(). Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.
Note, this optimizes only the unpaired callbacks. Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:56 +00:00
|
|
|
KVM_MMU_UNLOCK(kvm);
|
2021-04-02 00:56:55 +00:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
srcu_read_unlock(&kvm->srcu, idx);
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
return r;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
|
|
|
|
unsigned long start,
|
|
|
|
unsigned long end,
|
2023-10-27 18:21:43 +00:00
|
|
|
gfn_handler_t handler)
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
{
|
|
|
|
struct kvm *kvm = mmu_notifier_to_kvm(mn);
|
2023-10-27 18:21:43 +00:00
|
|
|
const struct kvm_mmu_notifier_range range = {
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
.start = start,
|
|
|
|
.end = end,
|
|
|
|
.handler = handler,
|
2021-04-02 00:56:55 +00:00
|
|
|
.on_lock = (void *)kvm_null_fn,
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
.flush_on_ret = true,
|
|
|
|
.may_block = false,
|
|
|
|
};
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
return __kvm_handle_hva_range(kvm, &range).ret;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
|
|
|
|
unsigned long start,
|
|
|
|
unsigned long end,
|
2023-10-27 18:21:43 +00:00
|
|
|
gfn_handler_t handler)
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
{
|
|
|
|
struct kvm *kvm = mmu_notifier_to_kvm(mn);
|
2023-10-27 18:21:43 +00:00
|
|
|
const struct kvm_mmu_notifier_range range = {
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
.start = start,
|
|
|
|
.end = end,
|
|
|
|
.handler = handler,
|
2021-04-02 00:56:55 +00:00
|
|
|
.on_lock = (void *)kvm_null_fn,
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
.flush_on_ret = false,
|
|
|
|
.may_block = false,
|
|
|
|
};
|
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
return __kvm_handle_hva_range(kvm, &range).ret;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
}
|
KVM: Avoid illegal stage2 mapping on invalid memory slot
We run into guest hang in edk2 firmware when KSM is kept as running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
buffered write. The status is returned by reading the memory region of
the pflash device and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and wrong status is fetched. The edk2
firmware runs into an infinite loop with the wrong status.
The illegal stage2 mapping is populated due to same page sharing by KSM
at (C) even the associated memory slot has been marked as invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped when we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
to zero again. Besides, the swapping from the active to the inactive memory
slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
CPU-A CPU-B
----- -----
ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
kvm_vm_ioctl_set_memory_region
kvm_set_memory_region
__kvm_set_memory_region
kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
kvm_invalidate_memslot
kvm_copy_memslot
kvm_replace_memslot
kvm_swap_active_memslots (A)
kvm_arch_flush_shadow_memslot (B)
same page sharing by KSM
kvm_mmu_notifier_invalidate_range_start
:
kvm_mmu_notifier_change_pte
kvm_handle_hva_range
__kvm_handle_hva_range
kvm_set_spte_gfn (C)
:
kvm_mmu_notifier_invalidate_range_end
Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.
We tried a git-bisect and the first problematic commit is cd4c71835228 ("
KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
before the iteration on the memory slots before this commit. This change
literally enlarges the racy window between kvm_mmu_notifier_change_pte()
and memory slot removal so that we're able to reproduce the issue in a
practical test case. However, the issue exists since commit d5d8184d35c9
("KVM: ARM: Memory virtualization setup").
Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-06-15 05:42:59 +00:00
|
|
|
|
2023-10-27 18:21:45 +00:00
|
|
|
void kvm_mmu_invalidate_begin(struct kvm *kvm)
|
2008-07-25 14:24:52 +00:00
|
|
|
{
|
2023-10-27 18:21:45 +00:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2008-07-25 14:24:52 +00:00
|
|
|
/*
|
|
|
|
* The count increase must become visible at unlock time as no
|
|
|
|
* spte can be established without taking the mmu_lock and
|
|
|
|
* count is also read inside the mmu_lock critical section.
|
|
|
|
*/
|
2022-08-16 12:53:22 +00:00
|
|
|
kvm->mmu_invalidate_in_progress++;
|
2023-10-27 18:21:45 +00:00
|
|
|
|
2022-08-16 12:53:22 +00:00
|
|
|
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
|
2023-10-27 18:21:45 +00:00
|
|
|
kvm->mmu_invalidate_range_start = INVALID_GPA;
|
|
|
|
kvm->mmu_invalidate_range_end = INVALID_GPA;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
|
|
|
|
{
|
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
|
|
|
|
|
|
|
|
if (likely(kvm->mmu_invalidate_range_start == INVALID_GPA)) {
|
2022-08-16 12:53:22 +00:00
|
|
|
kvm->mmu_invalidate_range_start = start;
|
|
|
|
kvm->mmu_invalidate_range_end = end;
|
2021-02-22 02:45:22 +00:00
|
|
|
} else {
|
|
|
|
/*
|
2022-04-10 15:38:40 +00:00
|
|
|
* Fully tracking multiple concurrent ranges has diminishing
|
2021-02-22 02:45:22 +00:00
|
|
|
* returns. Keep things simple and just find the minimal range
|
|
|
|
* which includes the current and new ranges. As there won't be
|
|
|
|
* enough information to subtract a range after its invalidate
|
|
|
|
* completes, any ranges invalidated concurrently will
|
|
|
|
* accumulate and persist until all outstanding invalidates
|
|
|
|
* complete.
|
|
|
|
*/
|
2022-08-16 12:53:22 +00:00
|
|
|
kvm->mmu_invalidate_range_start =
|
|
|
|
min(kvm->mmu_invalidate_range_start, start);
|
|
|
|
kvm->mmu_invalidate_range_end =
|
|
|
|
max(kvm->mmu_invalidate_range_end, end);
|
2021-02-22 02:45:22 +00:00
|
|
|
}
|
2021-04-02 00:56:55 +00:00
|
|
|
}
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
|
2023-10-27 18:21:45 +00:00
|
|
|
{
|
|
|
|
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
|
|
|
|
return kvm_unmap_gfn_range(kvm, range);
|
|
|
|
}
|
|
|
|
|
2021-04-02 00:56:55 +00:00
|
|
|
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
|
|
|
|
const struct mmu_notifier_range *range)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = mmu_notifier_to_kvm(mn);
|
2023-10-27 18:21:43 +00:00
|
|
|
const struct kvm_mmu_notifier_range hva_range = {
|
2021-04-02 00:56:55 +00:00
|
|
|
.start = range->start,
|
|
|
|
.end = range->end,
|
2023-10-27 18:21:45 +00:00
|
|
|
.handler = kvm_mmu_unmap_gfn_range,
|
2022-08-16 12:53:22 +00:00
|
|
|
.on_lock = kvm_mmu_invalidate_begin,
|
2021-04-02 00:56:55 +00:00
|
|
|
.flush_on_ret = true,
|
|
|
|
.may_block = mmu_notifier_range_blockable(range),
|
|
|
|
};
|
2012-02-10 06:28:31 +00:00
|
|
|
|
2021-04-02 00:56:55 +00:00
|
|
|
trace_kvm_unmap_hva_range(range->start, range->end);
|
|
|
|
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
/*
|
|
|
|
* Prevent memslot modification between range_start() and range_end()
|
|
|
|
* so that conditionally locking provides the same result in both
|
2022-08-16 12:53:22 +00:00
|
|
|
* functions. Without that guarantee, the mmu_invalidate_in_progress
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
* adjustments will be imbalanced.
|
|
|
|
*
|
|
|
|
* Pairs with the decrement in range_end().
|
|
|
|
*/
|
|
|
|
spin_lock(&kvm->mn_invalidate_lock);
|
|
|
|
kvm->mn_active_invalidate_count++;
|
|
|
|
spin_unlock(&kvm->mn_invalidate_lock);
|
|
|
|
|
KVM: Fix multiple races in gfn=>pfn cache refresh
Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
between the cache itself, and between the cache and mmu_notifier events.
The existing refresh code attempts to guard against races with the
mmu_notifier by speculatively marking the cache valid, and then marking
it invalid if a mmu_notifier invalidation occurs. That handles the case
where an invalidation occurs between dropping and re-acquiring gpc->lock,
but it doesn't handle the scenario where the cache is refreshed after the
cache was invalidated by the notifier, but before the notifier elevates
mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
invalidation occurs _before_ mmu_notifier_count is elevated and before
mmu_notifier_range_start is set/updated.
CPU0 CPU1
---- ----
gfn_to_pfn_cache_invalidate_start()
|
-> gpc->valid = false;
kvm_gfn_to_pfn_cache_refresh()
|
|-> gpc->valid = true;
hva_to_pfn_retry()
|
-> acquire kvm->mmu_lock
kvm->mmu_notifier_count == 0
mmu_seq == kvm->mmu_notifier_seq
drop kvm->mmu_lock
return pfn 'X'
acquire kvm->mmu_lock
kvm_inc_notifier_count()
drop kvm->mmu_lock()
kernel frees pfn 'X'
kvm_gfn_to_pfn_cache_check()
|
|-> gpc->valid == true
caller accesses freed pfn 'X'
Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation. While
mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.
Speculatively marking the cache valid is itself flawed, as a concurrent
kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
values. The KVM Xen use case explicitly allows/wants multiple users;
even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
can read a different vCPU (or vCPUs). Address this race by invalidating
the cache prior to dropping gpc->lock (this is made possible by fixing
the above mmu_notifier race).
Complicating all of this is the fact that both the hva=>pfn resolution
and mapping of the kernel address can sleep, i.e. must be done outside
of gpc->lock.
Fix the above races in one fell swoop, trying to fix each individual race
is largely pointless and essentially impossible to test, e.g. closing one
hole just shifts the focus to the other hole.
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429210025.3293691-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 21:00:24 +00:00
|
|
|
/*
|
|
|
|
* Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
|
|
|
|
* before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
|
|
|
|
* each cache's lock. There are relatively few caches in existence at
|
|
|
|
* any given time, and the caches themselves can check for hva overlap,
|
|
|
|
* i.e. don't need to rely on memslot overlap checks for performance.
|
|
|
|
* Because this runs without holding mmu_lock, the pfn caches must use
|
2022-08-16 12:53:22 +00:00
|
|
|
* mn_active_invalidate_count (see above) instead of
|
|
|
|
* mmu_invalidate_in_progress.
|
KVM: Fix multiple races in gfn=>pfn cache refresh
Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
between the cache itself, and between the cache and mmu_notifier events.
The existing refresh code attempts to guard against races with the
mmu_notifier by speculatively marking the cache valid, and then marking
it invalid if a mmu_notifier invalidation occurs. That handles the case
where an invalidation occurs between dropping and re-acquiring gpc->lock,
but it doesn't handle the scenario where the cache is refreshed after the
cache was invalidated by the notifier, but before the notifier elevates
mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
invalidation occurs _before_ mmu_notifier_count is elevated and before
mmu_notifier_range_start is set/updated.
CPU0 CPU1
---- ----
gfn_to_pfn_cache_invalidate_start()
|
-> gpc->valid = false;
kvm_gfn_to_pfn_cache_refresh()
|
|-> gpc->valid = true;
hva_to_pfn_retry()
|
-> acquire kvm->mmu_lock
kvm->mmu_notifier_count == 0
mmu_seq == kvm->mmu_notifier_seq
drop kvm->mmu_lock
return pfn 'X'
acquire kvm->mmu_lock
kvm_inc_notifier_count()
drop kvm->mmu_lock()
kernel frees pfn 'X'
kvm_gfn_to_pfn_cache_check()
|
|-> gpc->valid == true
caller accesses freed pfn 'X'
Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation. While
mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.
Speculatively marking the cache valid is itself flawed, as a concurrent
kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
values. The KVM Xen use case explicitly allows/wants multiple users;
even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
can read a different vCPU (or vCPUs). Address this race by invalidating
the cache prior to dropping gpc->lock (this is made possible by fixing
the above mmu_notifier race).
Complicating all of this is the fact that both the hva=>pfn resolution
and mapping of the kernel address can sleep, i.e. must be done outside
of gpc->lock.
Fix the above races in one fell swoop, trying to fix each individual race
is largely pointless and essentially impossible to test, e.g. closing one
hole just shifts the focus to the other hole.
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429210025.3293691-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 21:00:24 +00:00
|
|
|
*/
|
2024-03-05 00:37:42 +00:00
|
|
|
gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 16:36:21 +00:00
|
|
|
|
2023-10-27 18:21:52 +00:00
|
|
|
/*
|
|
|
|
* If one or more memslots were found and thus zapped, notify arch code
|
|
|
|
* that guest memory has been reclaimed. This needs to be done *after*
|
|
|
|
* dropping mmu_lock, as x86's reclaim path is slooooow.
|
|
|
|
*/
|
|
|
|
if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
|
|
|
|
kvm_arch_guest_memory_reclaimed(kvm);
|
2018-08-22 04:52:33 +00:00
|
|
|
|
2020-06-06 04:26:27 +00:00
|
|
|
return 0;
|
2008-07-25 14:24:52 +00:00
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:45 +00:00
|
|
|
void kvm_mmu_invalidate_end(struct kvm *kvm)
|
2008-07-25 14:24:52 +00:00
|
|
|
{
|
2023-10-27 18:21:45 +00:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
|
|
|
|
2008-07-25 14:24:52 +00:00
|
|
|
/*
|
|
|
|
* This sequence increase will notify the kvm page fault that
|
|
|
|
* the page that is going to be mapped in the spte could have
|
|
|
|
* been freed.
|
|
|
|
*/
|
2022-08-16 12:53:22 +00:00
|
|
|
kvm->mmu_invalidate_seq++;
|
2011-12-12 12:37:21 +00:00
|
|
|
smp_wmb();
|
2008-07-25 14:24:52 +00:00
|
|
|
/*
|
|
|
|
* The above sequence increase must be visible before the
|
2011-12-12 12:37:21 +00:00
|
|
|
* below count decrease, which is ensured by the smp_wmb above
|
2022-08-16 12:53:22 +00:00
|
|
|
* in conjunction with the smp_rmb in mmu_invalidate_retry().
|
2008-07-25 14:24:52 +00:00
|
|
|
*/
|
2022-08-16 12:53:22 +00:00
|
|
|
kvm->mmu_invalidate_in_progress--;
|
2023-10-27 18:21:44 +00:00
|
|
|
KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
|
2023-10-27 18:21:45 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Assert that at least one range was added between start() and end().
|
|
|
|
* Not adding a range isn't fatal, but it is a KVM bug.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);
|
2021-04-02 00:56:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
|
|
|
|
const struct mmu_notifier_range *range)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = mmu_notifier_to_kvm(mn);
|
2023-10-27 18:21:43 +00:00
|
|
|
const struct kvm_mmu_notifier_range hva_range = {
|
2021-04-02 00:56:55 +00:00
|
|
|
.start = range->start,
|
|
|
|
.end = range->end,
|
|
|
|
.handler = (void *)kvm_null_fn,
|
2022-08-16 12:53:22 +00:00
|
|
|
.on_lock = kvm_mmu_invalidate_end,
|
2021-04-02 00:56:55 +00:00
|
|
|
.flush_on_ret = false,
|
|
|
|
.may_block = mmu_notifier_range_blockable(range),
|
|
|
|
};
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
bool wake;
|
2021-04-02 00:56:55 +00:00
|
|
|
|
|
|
|
__kvm_handle_hva_range(kvm, &hva_range);
|
2008-07-25 14:24:52 +00:00
|
|
|
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
/* Pairs with the increment in range_start(). */
|
|
|
|
spin_lock(&kvm->mn_invalidate_lock);
|
KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls
When handling the end of an mmu_notifier invalidation, WARN if
mn_active_invalidate_count is already 0 do not decrement it further, i.e.
avoid causing mn_active_invalidate_count to underflow/wrap. In the worst
case scenario, effectively corrupting mn_active_invalidate_count could
cause kvm_swap_active_memslots() to hang indefinitely.
end() calls are *supposed* to be paired with start(), i.e. underflow can
only happen if there is a bug elsewhere in the kernel, but due to lack of
lockdep assertions in the mmu_notifier helpers, it's all too easy for a
bug to go unnoticed for some time, e.g. see the recently introduced
PAGEMAP_SCAN ioctl().
Ideally, mmu_notifiers would incorporate lockdep assertions, but users of
mmu_notifiers aren't required to hold any one specific lock, i.e. adding
the necessary annotations to make lockdep aware of all locks that are
mutally exclusive with mm_take_all_locks() isn't trivial.
Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com
Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-10 00:42:39 +00:00
|
|
|
if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
|
|
|
|
--kvm->mn_active_invalidate_count;
|
|
|
|
wake = !kvm->mn_active_invalidate_count;
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
spin_unlock(&kvm->mn_invalidate_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There can only be one waiter, since the wait happens under
|
|
|
|
* slots_lock.
|
|
|
|
*/
|
|
|
|
if (wake)
|
|
|
|
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
|
2008-07-25 14:24:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
|
|
|
|
struct mm_struct *mm,
|
2014-09-22 21:54:42 +00:00
|
|
|
unsigned long start,
|
|
|
|
unsigned long end)
|
2008-07-25 14:24:52 +00:00
|
|
|
{
|
2021-03-26 02:19:48 +00:00
|
|
|
trace_kvm_age_hva(start, end);
|
|
|
|
|
2024-04-05 11:58:13 +00:00
|
|
|
return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
|
2008-07-25 14:24:52 +00:00
|
|
|
}
|
|
|
|
|
2015-09-09 22:35:41 +00:00
|
|
|
static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
|
|
|
|
struct mm_struct *mm,
|
|
|
|
unsigned long start,
|
|
|
|
unsigned long end)
|
|
|
|
{
|
2021-03-26 02:19:48 +00:00
|
|
|
trace_kvm_age_hva(start, end);
|
|
|
|
|
2015-09-09 22:35:41 +00:00
|
|
|
/*
|
|
|
|
* Even though we do not flush TLB, this will still adversely
|
|
|
|
* affect performance on pre-Haswell Intel EPT, where there is
|
|
|
|
* no EPT Access Bit to clear so that we have to tear down EPT
|
|
|
|
* tables instead. If we find this unacceptable, we can always
|
|
|
|
* add a parameter to kvm_age_hva so that it effectively doesn't
|
|
|
|
* do anything on clear_young.
|
|
|
|
*
|
|
|
|
* Also note that currently we never issue secondary TLB flushes
|
|
|
|
* from clear_young, leaving this job up to the regular system
|
|
|
|
* cadence. If we find this inaccurate, we might come up with a
|
|
|
|
* more sophisticated heuristic later.
|
|
|
|
*/
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
|
2015-09-09 22:35:41 +00:00
|
|
|
}
|
|
|
|
|
2011-01-13 23:47:10 +00:00
|
|
|
static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
|
|
|
|
struct mm_struct *mm,
|
|
|
|
unsigned long address)
|
|
|
|
{
|
2021-03-26 02:19:48 +00:00
|
|
|
trace_kvm_test_age_hva(address);
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 00:56:50 +00:00
|
|
|
return kvm_handle_hva_range_no_flush(mn, address, address + 1,
|
|
|
|
kvm_test_age_gfn);
|
2011-01-13 23:47:10 +00:00
|
|
|
}
|
|
|
|
|
2008-12-10 20:23:26 +00:00
|
|
|
static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
|
|
|
|
struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = mmu_notifier_to_kvm(mn);
|
2010-04-20 06:29:29 +00:00
|
|
|
int idx;
|
|
|
|
|
|
|
|
idx = srcu_read_lock(&kvm->srcu);
|
2022-04-21 03:14:07 +00:00
|
|
|
kvm_flush_shadow_all(kvm);
|
2010-04-20 06:29:29 +00:00
|
|
|
srcu_read_unlock(&kvm->srcu, idx);
|
2008-12-10 20:23:26 +00:00
|
|
|
}
|
|
|
|
|
2008-07-25 14:24:52 +00:00
|
|
|
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
|
|
|
|
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
|
|
|
|
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
|
|
|
|
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
|
2015-09-09 22:35:41 +00:00
|
|
|
.clear_young = kvm_mmu_notifier_clear_young,
|
2011-01-13 23:47:10 +00:00
|
|
|
.test_young = kvm_mmu_notifier_test_young,
|
2008-12-10 20:23:26 +00:00
|
|
|
.release = kvm_mmu_notifier_release,
|
2008-07-25 14:24:52 +00:00
|
|
|
};
|
2009-12-20 12:54:04 +00:00
|
|
|
|
|
|
|
static int kvm_init_mmu_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
|
|
|
|
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
|
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#else /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
|
2009-12-20 12:54:04 +00:00
|
|
|
|
|
|
|
static int kvm_init_mmu_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:49 +00:00
|
|
|
#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
|
2008-07-25 14:24:52 +00:00
|
|
|
|
2021-06-06 02:10:44 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
|
|
|
|
static int kvm_pm_notifier_call(struct notifier_block *bl,
|
|
|
|
unsigned long state,
|
|
|
|
void *unused)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
|
|
|
|
|
|
|
|
return kvm_arch_pm_notifier(kvm, state);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_init_pm_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->pm_notifier.notifier_call = kvm_pm_notifier_call;
|
|
|
|
/* Suspend KVM before we suspend ftrace, RCU, etc. */
|
|
|
|
kvm->pm_notifier.priority = INT_MAX;
|
|
|
|
register_pm_notifier(&kvm->pm_notifier);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_destroy_pm_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
unregister_pm_notifier(&kvm->pm_notifier);
|
|
|
|
}
|
|
|
|
#else /* !CONFIG_HAVE_KVM_PM_NOTIFIER */
|
|
|
|
static void kvm_init_pm_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_destroy_pm_notifier(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
|
|
|
|
|
2015-05-17 09:41:37 +00:00
|
|
|
static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
|
|
|
|
{
|
|
|
|
if (!memslot->dirty_bitmap)
|
|
|
|
return;
|
|
|
|
|
2024-01-31 01:23:57 +00:00
|
|
|
vfree(memslot->dirty_bitmap);
|
2015-05-17 09:41:37 +00:00
|
|
|
memslot->dirty_bitmap = NULL;
|
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/* This does not remove the slot from struct kvm_memslots data structures */
|
2020-02-18 21:07:27 +00:00
|
|
|
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
|
2015-05-17 09:41:37 +00:00
|
|
|
{
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
if (slot->flags & KVM_MEM_GUEST_MEMFD)
|
|
|
|
kvm_gmem_unbind(slot);
|
|
|
|
|
2020-02-18 21:07:27 +00:00
|
|
|
kvm_destroy_dirty_bitmap(slot);
|
2015-05-17 09:41:37 +00:00
|
|
|
|
2020-02-18 21:07:27 +00:00
|
|
|
kvm_arch_free_memslot(kvm, slot);
|
2015-05-17 09:41:37 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
kfree(slot);
|
2015-05-17 09:41:37 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
|
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct hlist_node *idnode;
|
2015-05-17 09:41:37 +00:00
|
|
|
struct kvm_memory_slot *memslot;
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
int bkt;
|
2015-05-17 09:41:37 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/*
|
|
|
|
* The same memslot objects live in both active and inactive sets,
|
|
|
|
* arbitrarily free using index '1' so the second invocation of this
|
|
|
|
* function isn't operating over a structure with dangling pointers
|
|
|
|
* (even though this function isn't actually touching them).
|
|
|
|
*/
|
|
|
|
if (!slots->node_idx)
|
2015-05-17 09:41:37 +00:00
|
|
|
return;
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
hash_for_each_safe(slots->id_hash, bkt, idnode, memslot, id_node[1])
|
2020-02-18 21:07:27 +00:00
|
|
|
kvm_free_memslot(kvm, memslot);
|
2011-11-24 09:40:57 +00:00
|
|
|
}
|
|
|
|
|
2021-06-23 21:28:46 +00:00
|
|
|
static umode_t kvm_stats_debugfs_mode(const struct _kvm_stats_desc *pdesc)
|
|
|
|
{
|
|
|
|
switch (pdesc->desc.flags & KVM_STATS_TYPE_MASK) {
|
|
|
|
case KVM_STATS_TYPE_INSTANT:
|
|
|
|
return 0444;
|
|
|
|
case KVM_STATS_TYPE_CUMULATIVE:
|
|
|
|
case KVM_STATS_TYPE_PEAK:
|
|
|
|
default:
|
|
|
|
return 0644;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2016-05-18 11:26:23 +00:00
|
|
|
static void kvm_destroy_vm_debugfs(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
int i;
|
2021-06-23 21:28:46 +00:00
|
|
|
int kvm_debugfs_num_entries = kvm_vm_stats_header.num_desc +
|
|
|
|
kvm_vcpu_stats_header.num_desc;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2022-04-06 23:56:13 +00:00
|
|
|
if (IS_ERR(kvm->debugfs_dentry))
|
2016-05-18 11:26:23 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
debugfs_remove_recursive(kvm->debugfs_dentry);
|
|
|
|
|
2016-09-07 18:47:21 +00:00
|
|
|
if (kvm->debugfs_stat_data) {
|
|
|
|
for (i = 0; i < kvm_debugfs_num_entries; i++)
|
|
|
|
kfree(kvm->debugfs_stat_data[i]);
|
|
|
|
kfree(kvm->debugfs_stat_data);
|
|
|
|
}
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
|
|
|
|
2022-07-20 09:22:50 +00:00
|
|
|
static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
|
2016-05-18 11:26:23 +00:00
|
|
|
{
|
2021-08-04 09:28:52 +00:00
|
|
|
static DEFINE_MUTEX(kvm_debugfs_lock);
|
|
|
|
struct dentry *dent;
|
2016-05-18 11:26:23 +00:00
|
|
|
char dir_name[ITOA_MAX_LEN * 2];
|
|
|
|
struct kvm_stat_data *stat_data;
|
2021-06-23 21:28:46 +00:00
|
|
|
const struct _kvm_stats_desc *pdesc;
|
2022-07-20 09:22:51 +00:00
|
|
|
int i, ret = -ENOMEM;
|
2021-06-23 21:28:46 +00:00
|
|
|
int kvm_debugfs_num_entries = kvm_vm_stats_header.num_desc +
|
|
|
|
kvm_vcpu_stats_header.num_desc;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
|
|
|
if (!debugfs_initialized())
|
|
|
|
return 0;
|
|
|
|
|
2022-07-20 09:22:50 +00:00
|
|
|
snprintf(dir_name, sizeof(dir_name), "%d-%s", task_pid_nr(current), fdname);
|
2021-08-04 09:28:52 +00:00
|
|
|
mutex_lock(&kvm_debugfs_lock);
|
|
|
|
dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
|
|
|
|
if (dent) {
|
|
|
|
pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n", dir_name);
|
|
|
|
dput(dent);
|
|
|
|
mutex_unlock(&kvm_debugfs_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
|
|
|
|
mutex_unlock(&kvm_debugfs_lock);
|
|
|
|
if (IS_ERR(dent))
|
|
|
|
return 0;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2021-08-04 09:28:52 +00:00
|
|
|
kvm->debugfs_dentry = dent;
|
2016-05-18 11:26:23 +00:00
|
|
|
kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries,
|
|
|
|
sizeof(*kvm->debugfs_stat_data),
|
2019-02-11 19:02:49 +00:00
|
|
|
GFP_KERNEL_ACCOUNT);
|
2016-05-18 11:26:23 +00:00
|
|
|
if (!kvm->debugfs_stat_data)
|
2022-07-20 09:22:51 +00:00
|
|
|
goto out_err;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2021-06-23 21:28:46 +00:00
|
|
|
for (i = 0; i < kvm_vm_stats_header.num_desc; ++i) {
|
|
|
|
pdesc = &kvm_vm_stats_desc[i];
|
2019-02-11 19:02:49 +00:00
|
|
|
stat_data = kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);
|
2016-05-18 11:26:23 +00:00
|
|
|
if (!stat_data)
|
2022-07-20 09:22:51 +00:00
|
|
|
goto out_err;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
|
|
|
stat_data->kvm = kvm;
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc = pdesc;
|
|
|
|
stat_data->kind = KVM_STAT_VM;
|
|
|
|
kvm->debugfs_stat_data[i] = stat_data;
|
|
|
|
debugfs_create_file(pdesc->name, kvm_stats_debugfs_mode(pdesc),
|
|
|
|
kvm->debugfs_dentry, stat_data,
|
|
|
|
&stat_fops_per_vm);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < kvm_vcpu_stats_header.num_desc; ++i) {
|
|
|
|
pdesc = &kvm_vcpu_stats_desc[i];
|
2019-02-11 19:02:49 +00:00
|
|
|
stat_data = kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);
|
2016-05-18 11:26:23 +00:00
|
|
|
if (!stat_data)
|
2022-07-20 09:22:51 +00:00
|
|
|
goto out_err;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
|
|
|
stat_data->kvm = kvm;
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc = pdesc;
|
|
|
|
stat_data->kind = KVM_STAT_VCPU;
|
2021-07-01 19:55:00 +00:00
|
|
|
kvm->debugfs_stat_data[i + kvm_vm_stats_header.num_desc] = stat_data;
|
2021-06-23 21:28:46 +00:00
|
|
|
debugfs_create_file(pdesc->name, kvm_stats_debugfs_mode(pdesc),
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm->debugfs_dentry, stat_data,
|
|
|
|
&stat_fops_per_vm);
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
2021-07-30 22:04:49 +00:00
|
|
|
|
2024-02-16 15:59:41 +00:00
|
|
|
kvm_arch_create_vm_debugfs(kvm);
|
2016-05-18 11:26:23 +00:00
|
|
|
return 0;
|
2022-07-20 09:22:51 +00:00
|
|
|
out_err:
|
|
|
|
kvm_destroy_vm_debugfs(kvm);
|
|
|
|
return ret;
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
|
|
|
|
2019-11-04 19:26:00 +00:00
|
|
|
/*
|
|
|
|
* Called after the VM is otherwise initialized, but just before adding it to
|
|
|
|
* the vm_list.
|
|
|
|
*/
|
|
|
|
int __weak kvm_arch_post_init_vm(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called just after removing the VM from the vm_list, but before doing any
|
|
|
|
* other destruction.
|
|
|
|
*/
|
|
|
|
void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2021-07-30 22:04:49 +00:00
|
|
|
/*
|
|
|
|
* Called after per-vm debugfs created. When called kvm->debugfs_dentry should
|
|
|
|
* be setup already, so we can create arch-specific debugfs entries under it.
|
|
|
|
* Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so
|
|
|
|
* a per-arch destroy interface is not needed.
|
|
|
|
*/
|
2024-02-16 15:59:41 +00:00
|
|
|
void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
|
2021-07-30 22:04:49 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2022-07-20 09:22:51 +00:00
|
|
|
static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2010-11-09 16:02:49 +00:00
|
|
|
struct kvm *kvm = kvm_arch_alloc_vm();
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct kvm_memslots *slots;
|
2024-06-13 14:33:16 +00:00
|
|
|
int r, i, j;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2010-11-09 16:02:49 +00:00
|
|
|
if (!kvm)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2021-02-02 18:57:24 +00:00
|
|
|
KVM_MMU_LOCK_INIT(kvm);
|
2017-02-27 22:30:07 +00:00
|
|
|
mmgrab(current->mm);
|
2016-03-21 09:15:25 +00:00
|
|
|
kvm->mm = current->mm;
|
|
|
|
kvm_eventfd_init(kvm);
|
|
|
|
mutex_init(&kvm->lock);
|
|
|
|
mutex_init(&kvm->irq_lock);
|
|
|
|
mutex_init(&kvm->slots_lock);
|
2021-05-18 17:34:11 +00:00
|
|
|
mutex_init(&kvm->slots_arch_lock);
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
spin_lock_init(&kvm->mn_invalidate_lock);
|
|
|
|
rcuwait_init(&kvm->mn_memslots_update_rcuwait);
|
2021-11-16 16:04:01 +00:00
|
|
|
xa_init(&kvm->vcpu_array);
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
xa_init(&kvm->mem_attr_array);
|
|
|
|
#endif
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 16:36:21 +00:00
|
|
|
INIT_LIST_HEAD(&kvm->gpc_list);
|
|
|
|
spin_lock_init(&kvm->gpc_lock);
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
|
2016-03-21 09:15:25 +00:00
|
|
|
INIT_LIST_HEAD(&kvm->devices);
|
2022-03-04 19:48:38 +00:00
|
|
|
kvm->max_vcpus = KVM_MAX_VCPUS;
|
2016-03-21 09:15:25 +00:00
|
|
|
|
2012-12-10 17:33:32 +00:00
|
|
|
BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
|
|
|
|
|
2022-04-15 00:46:22 +00:00
|
|
|
/*
|
|
|
|
* Force subsequent debugfs file creations to fail if the VM directory
|
|
|
|
* is not created (by kvm_create_vm_debugfs()).
|
|
|
|
*/
|
|
|
|
kvm->debugfs_dentry = ERR_PTR(-ENOENT);
|
|
|
|
|
2022-07-20 09:22:47 +00:00
|
|
|
snprintf(kvm->stats_id, sizeof(kvm->stats_id), "kvm-%d",
|
|
|
|
task_pid_nr(current));
|
|
|
|
|
2024-06-13 14:33:16 +00:00
|
|
|
r = -ENOMEM;
|
2019-11-04 11:16:49 +00:00
|
|
|
if (init_srcu_struct(&kvm->srcu))
|
|
|
|
goto out_err_no_srcu;
|
|
|
|
if (init_srcu_struct(&kvm->irq_srcu))
|
|
|
|
goto out_err_no_irq_srcu;
|
|
|
|
|
2024-05-06 10:17:49 +00:00
|
|
|
r = kvm_init_irq_routing(kvm);
|
|
|
|
if (r)
|
|
|
|
goto out_err_no_irq_routing;
|
|
|
|
|
2019-11-04 12:23:53 +00:00
|
|
|
refcount_set(&kvm->users_count, 1);
|
2024-05-06 10:17:49 +00:00
|
|
|
|
2023-10-27 18:22:04 +00:00
|
|
|
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
for (j = 0; j < 2; j++) {
|
|
|
|
slots = &kvm->__memslots[i][j];
|
2019-10-24 23:03:26 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
atomic_long_set(&slots->last_used_slot, (unsigned long)NULL);
|
|
|
|
slots->hva_tree = RB_ROOT_CACHED;
|
|
|
|
slots->gfn_tree = RB_ROOT;
|
|
|
|
hash_init(slots->id_hash);
|
|
|
|
slots->node_idx = j;
|
|
|
|
|
|
|
|
/* Generations must be different for each address space. */
|
|
|
|
slots->generation = i;
|
|
|
|
}
|
|
|
|
|
|
|
|
rcu_assign_pointer(kvm->memslots[i], &kvm->__memslots[i][0]);
|
2015-05-17 15:30:37 +00:00
|
|
|
}
|
2014-08-20 12:29:21 +00:00
|
|
|
|
2024-06-13 14:33:16 +00:00
|
|
|
r = -ENOMEM;
|
2009-12-23 16:35:24 +00:00
|
|
|
for (i = 0; i < KVM_NR_BUSES; i++) {
|
2017-07-07 08:51:38 +00:00
|
|
|
rcu_assign_pointer(kvm->buses[i],
|
2019-02-11 19:02:49 +00:00
|
|
|
kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
|
2010-11-09 11:42:12 +00:00
|
|
|
if (!kvm->buses[i])
|
2019-10-25 11:34:58 +00:00
|
|
|
goto out_err_no_arch_destroy_vm;
|
2009-12-23 16:35:24 +00:00
|
|
|
}
|
2008-07-25 14:24:52 +00:00
|
|
|
|
2012-01-04 09:25:20 +00:00
|
|
|
r = kvm_arch_init_vm(kvm, type);
|
2010-11-09 16:02:49 +00:00
|
|
|
if (r)
|
2019-10-25 11:34:58 +00:00
|
|
|
goto out_err_no_arch_destroy_vm;
|
2009-09-15 09:37:46 +00:00
|
|
|
|
|
|
|
r = hardware_enable_all();
|
|
|
|
if (r)
|
2014-01-16 12:44:20 +00:00
|
|
|
goto out_err_no_disable;
|
2009-09-15 09:37:46 +00:00
|
|
|
|
2023-10-18 16:07:32 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2009-08-24 08:54:23 +00:00
|
|
|
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
|
2009-01-04 15:10:50 +00:00
|
|
|
#endif
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2011-06-03 20:04:53 +00:00
|
|
|
r = kvm_init_mmu_notifier(kvm);
|
2019-11-04 19:26:00 +00:00
|
|
|
if (r)
|
|
|
|
goto out_err_no_mmu_notifier;
|
|
|
|
|
2022-08-16 05:39:37 +00:00
|
|
|
r = kvm_coalesced_mmio_init(kvm);
|
|
|
|
if (r < 0)
|
|
|
|
goto out_no_coalesced_mmio;
|
|
|
|
|
2022-08-16 05:39:35 +00:00
|
|
|
r = kvm_create_vm_debugfs(kvm, fdname);
|
|
|
|
if (r)
|
|
|
|
goto out_err_no_debugfs;
|
|
|
|
|
2019-11-04 19:26:00 +00:00
|
|
|
r = kvm_arch_post_init_vm(kvm);
|
2011-06-03 20:04:53 +00:00
|
|
|
if (r)
|
2022-08-16 05:39:35 +00:00
|
|
|
goto out_err;
|
2011-06-03 20:04:53 +00:00
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2007-07-23 07:08:21 +00:00
|
|
|
list_add(&kvm->vm_list, &vm_list);
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2010-11-09 16:02:49 +00:00
|
|
|
|
2015-07-03 16:53:58 +00:00
|
|
|
preempt_notifier_inc();
|
2021-06-06 02:10:44 +00:00
|
|
|
kvm_init_pm_notifier(kvm);
|
2015-07-03 16:53:58 +00:00
|
|
|
|
2007-02-21 17:28:04 +00:00
|
|
|
return kvm;
|
2009-09-15 09:37:46 +00:00
|
|
|
|
|
|
|
out_err:
|
2022-08-16 05:39:35 +00:00
|
|
|
kvm_destroy_vm_debugfs(kvm);
|
|
|
|
out_err_no_debugfs:
|
2022-08-16 05:39:37 +00:00
|
|
|
kvm_coalesced_mmio_free(kvm);
|
|
|
|
out_no_coalesced_mmio:
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2019-11-04 19:26:00 +00:00
|
|
|
if (kvm->mmu_notifier.ops)
|
|
|
|
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
|
|
|
|
#endif
|
|
|
|
out_err_no_mmu_notifier:
|
2009-09-15 09:37:46 +00:00
|
|
|
hardware_disable_all();
|
2014-01-16 12:44:20 +00:00
|
|
|
out_err_no_disable:
|
2019-10-25 11:34:58 +00:00
|
|
|
kvm_arch_destroy_vm(kvm);
|
|
|
|
out_err_no_arch_destroy_vm:
|
2019-11-04 12:23:53 +00:00
|
|
|
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
|
2009-12-23 16:35:24 +00:00
|
|
|
for (i = 0; i < KVM_NR_BUSES; i++)
|
2017-08-02 15:55:54 +00:00
|
|
|
kfree(kvm_get_bus(kvm, i));
|
2024-05-06 10:17:49 +00:00
|
|
|
kvm_free_irq_routing(kvm);
|
|
|
|
out_err_no_irq_routing:
|
2019-11-04 11:16:49 +00:00
|
|
|
cleanup_srcu_struct(&kvm->irq_srcu);
|
|
|
|
out_err_no_irq_srcu:
|
|
|
|
cleanup_srcu_struct(&kvm->srcu);
|
|
|
|
out_err_no_srcu:
|
2010-11-09 16:02:49 +00:00
|
|
|
kvm_arch_free_vm(kvm);
|
2016-03-21 09:15:25 +00:00
|
|
|
mmdrop(current->mm);
|
2009-09-15 09:37:46 +00:00
|
|
|
return ERR_PTR(r);
|
2007-02-21 17:28:04 +00:00
|
|
|
}
|
|
|
|
|
2013-04-25 14:11:23 +00:00
|
|
|
static void kvm_destroy_devices(struct kvm *kvm)
|
|
|
|
{
|
2016-01-01 11:47:12 +00:00
|
|
|
struct kvm_device *dev, *tmp;
|
2013-04-25 14:11:23 +00:00
|
|
|
|
2016-08-09 17:13:01 +00:00
|
|
|
/*
|
|
|
|
* We do not need to take the kvm->lock here, because nobody else
|
|
|
|
* has a reference to the struct kvm at this point and therefore
|
|
|
|
* cannot access the devices list anyhow.
|
2024-04-22 20:01:40 +00:00
|
|
|
*
|
|
|
|
* The device list is generally managed as an rculist, but list_del()
|
|
|
|
* is used intentionally here. If a bug in KVM introduced a reader that
|
|
|
|
* was not backed by a reference on the kvm struct, the hope is that
|
|
|
|
* it'd consume the poisoned forward pointer instead of suffering a
|
|
|
|
* use-after-free, even though this cannot be guaranteed.
|
2016-08-09 17:13:01 +00:00
|
|
|
*/
|
2016-01-01 11:47:12 +00:00
|
|
|
list_for_each_entry_safe(dev, tmp, &kvm->devices, vm_node) {
|
|
|
|
list_del(&dev->vm_node);
|
2013-04-25 14:11:23 +00:00
|
|
|
dev->ops->destroy(dev);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-02-21 17:28:04 +00:00
|
|
|
static void kvm_destroy_vm(struct kvm *kvm)
|
|
|
|
{
|
2009-12-23 16:35:24 +00:00
|
|
|
int i;
|
2007-11-21 14:41:05 +00:00
|
|
|
struct mm_struct *mm = kvm->mm;
|
|
|
|
|
2021-06-06 02:10:44 +00:00
|
|
|
kvm_destroy_pm_notifier(kvm);
|
2017-07-12 15:56:44 +00:00
|
|
|
kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
|
2016-05-18 11:26:23 +00:00
|
|
|
kvm_destroy_vm_debugfs(kvm);
|
2009-01-06 02:03:02 +00:00
|
|
|
kvm_arch_sync_events(kvm);
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2007-02-12 08:54:44 +00:00
|
|
|
list_del(&kvm->vm_list);
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2019-11-04 19:26:00 +00:00
|
|
|
kvm_arch_pre_destroy_vm(kvm);
|
|
|
|
|
2008-11-19 11:58:46 +00:00
|
|
|
kvm_free_irq_routing(kvm);
|
2017-03-15 08:01:17 +00:00
|
|
|
for (i = 0; i < KVM_NR_BUSES; i++) {
|
2017-08-02 15:55:54 +00:00
|
|
|
struct kvm_io_bus *bus = kvm_get_bus(kvm, i);
|
2017-07-07 08:51:38 +00:00
|
|
|
|
|
|
|
if (bus)
|
|
|
|
kvm_io_bus_destroy(bus);
|
2017-03-15 08:01:17 +00:00
|
|
|
kvm->buses[i] = NULL;
|
|
|
|
}
|
2009-12-20 13:13:43 +00:00
|
|
|
kvm_coalesced_mmio_free(kvm);
|
2023-10-27 18:21:49 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
|
2008-07-25 14:24:52 +00:00
|
|
|
mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
/*
|
|
|
|
* At this point, pending calls to invalidate_range_start()
|
|
|
|
* have completed but no more MMU notifiers will run, so
|
|
|
|
* mn_active_invalidate_count may remain unbalanced.
|
2023-02-23 05:28:51 +00:00
|
|
|
* No threads can be waiting in kvm_swap_active_memslots() as the
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
* last reference on KVM has been dropped, but freeing
|
|
|
|
* memslots would deadlock without this manual intervention.
|
2023-10-27 18:21:46 +00:00
|
|
|
*
|
|
|
|
* If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
|
|
|
|
* notifier between a start() and end(), then there shouldn't be any
|
|
|
|
* in-progress invalidations.
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
*/
|
|
|
|
WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
|
2023-10-27 18:21:46 +00:00
|
|
|
if (kvm->mn_active_invalidate_count)
|
|
|
|
kvm->mn_active_invalidate_count = 0;
|
|
|
|
else
|
|
|
|
WARN_ON(kvm->mmu_invalidate_in_progress);
|
2009-03-19 10:20:36 +00:00
|
|
|
#else
|
2022-04-21 03:14:07 +00:00
|
|
|
kvm_flush_shadow_all(kvm);
|
2008-05-30 14:05:54 +00:00
|
|
|
#endif
|
2007-11-18 10:43:45 +00:00
|
|
|
kvm_arch_destroy_vm(kvm);
|
2013-04-25 14:11:23 +00:00
|
|
|
kvm_destroy_devices(kvm);
|
2023-10-27 18:22:04 +00:00
|
|
|
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
|
|
|
|
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
|
|
|
|
}
|
2014-06-03 11:44:17 +00:00
|
|
|
cleanup_srcu_struct(&kvm->irq_srcu);
|
2010-11-09 16:02:49 +00:00
|
|
|
cleanup_srcu_struct(&kvm->srcu);
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
xa_destroy(&kvm->mem_attr_array);
|
|
|
|
#endif
|
2010-11-09 16:02:49 +00:00
|
|
|
kvm_arch_free_vm(kvm);
|
2015-07-03 16:53:58 +00:00
|
|
|
preempt_notifier_dec();
|
2009-09-15 09:37:46 +00:00
|
|
|
hardware_disable_all();
|
2007-11-21 14:41:05 +00:00
|
|
|
mmdrop(mm);
|
2007-02-21 17:28:04 +00:00
|
|
|
}
|
|
|
|
|
2008-03-30 13:01:25 +00:00
|
|
|
void kvm_get_kvm(struct kvm *kvm)
|
|
|
|
{
|
2017-02-20 11:06:21 +00:00
|
|
|
refcount_inc(&kvm->users_count);
|
2008-03-30 13:01:25 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_get_kvm);
|
|
|
|
|
2021-06-25 15:32:07 +00:00
|
|
|
/*
|
|
|
|
* Make sure the vm is not during destruction, which is a safe version of
|
|
|
|
* kvm_get_kvm(). Return true if kvm referenced successfully, false otherwise.
|
|
|
|
*/
|
|
|
|
bool kvm_get_kvm_safe(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return refcount_inc_not_zero(&kvm->users_count);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_get_kvm_safe);
|
|
|
|
|
2008-03-30 13:01:25 +00:00
|
|
|
void kvm_put_kvm(struct kvm *kvm)
|
|
|
|
{
|
2017-02-20 11:06:21 +00:00
|
|
|
if (refcount_dec_and_test(&kvm->users_count))
|
2008-03-30 13:01:25 +00:00
|
|
|
kvm_destroy_vm(kvm);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_put_kvm);
|
|
|
|
|
2019-10-21 22:58:42 +00:00
|
|
|
/*
|
|
|
|
* Used to put a reference that was taken on behalf of an object associated
|
|
|
|
* with a user-visible file descriptor, e.g. a vcpu or device, if installation
|
|
|
|
* of the new file descriptor fails and the reference cannot be transferred to
|
|
|
|
* its final owner. In such cases, the caller is still actively using @kvm and
|
|
|
|
* will fail miserably if the refcount unexpectedly hits zero.
|
|
|
|
*/
|
|
|
|
void kvm_put_kvm_no_destroy(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
WARN_ON(refcount_dec_and_test(&kvm->users_count));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_put_kvm_no_destroy);
|
2008-03-30 13:01:25 +00:00
|
|
|
|
2007-02-21 17:28:04 +00:00
|
|
|
static int kvm_vm_release(struct inode *inode, struct file *filp)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = filp->private_data;
|
|
|
|
|
2009-05-20 14:30:49 +00:00
|
|
|
kvm_irqfd_release(kvm);
|
|
|
|
|
2008-03-30 13:01:25 +00:00
|
|
|
kvm_put_kvm(kvm);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-10-27 09:23:54 +00:00
|
|
|
/*
|
|
|
|
* Allocation size is twice as large as the actual dirty bitmap size.
|
2020-02-18 21:07:29 +00:00
|
|
|
* See kvm_vm_ioctl_get_dirty_log() why this is needed.
|
2010-10-27 09:23:54 +00:00
|
|
|
*/
|
2020-02-27 01:32:27 +00:00
|
|
|
static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
|
2010-10-27 09:22:19 +00:00
|
|
|
{
|
2022-03-08 09:49:37 +00:00
|
|
|
unsigned long dirty_bytes = kvm_dirty_bitmap_bytes(memslot);
|
2010-10-27 09:22:19 +00:00
|
|
|
|
2022-03-08 09:49:37 +00:00
|
|
|
memslot->dirty_bitmap = __vcalloc(2, dirty_bytes, GFP_KERNEL_ACCOUNT);
|
2010-10-27 09:22:19 +00:00
|
|
|
if (!memslot->dirty_bitmap)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static struct kvm_memslots *kvm_get_inactive_memslots(struct kvm *kvm, int as_id)
|
2011-11-24 09:40:57 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct kvm_memslots *active = __kvm_memslots(kvm, as_id);
|
|
|
|
int node_idx_inactive = active->node_idx ^ 1;
|
2014-12-01 17:29:26 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
return &kvm->__memslots[as_id][node_idx_inactive];
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* Helper to get the address space ID when one of memslot pointers may be NULL.
|
|
|
|
* This also serves as a sanity that at least one of the pointers is non-NULL,
|
|
|
|
* and that their address space IDs don't diverge.
|
2020-02-18 21:07:31 +00:00
|
|
|
*/
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static int kvm_memslots_get_as_id(struct kvm_memory_slot *a,
|
|
|
|
struct kvm_memory_slot *b)
|
2020-02-18 21:07:31 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if (WARN_ON_ONCE(!a && !b))
|
|
|
|
return 0;
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if (!a)
|
|
|
|
return b->as_id;
|
|
|
|
if (!b)
|
|
|
|
return a->as_id;
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
WARN_ON_ONCE(a->as_id != b->as_id);
|
|
|
|
return a->as_id;
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
2014-12-27 17:01:00 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_insert_gfn_node(struct kvm_memslots *slots,
|
|
|
|
struct kvm_memory_slot *slot)
|
2020-02-18 21:07:31 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct rb_root *gfn_tree = &slots->gfn_tree;
|
|
|
|
struct rb_node **node, *parent;
|
|
|
|
int idx = slots->node_idx;
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
parent = NULL;
|
|
|
|
for (node = &gfn_tree->rb_node; *node; ) {
|
|
|
|
struct kvm_memory_slot *tmp;
|
2011-11-24 09:41:54 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
tmp = container_of(*node, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
parent = *node;
|
|
|
|
if (slot->base_gfn < tmp->base_gfn)
|
|
|
|
node = &(*node)->rb_left;
|
|
|
|
else if (slot->base_gfn > tmp->base_gfn)
|
|
|
|
node = &(*node)->rb_right;
|
|
|
|
else
|
|
|
|
BUG();
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
|
|
|
|
rb_link_node(&slot->gfn_node[idx], parent, node);
|
|
|
|
rb_insert_color(&slot->gfn_node[idx], gfn_tree);
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_erase_gfn_node(struct kvm_memslots *slots,
|
|
|
|
struct kvm_memory_slot *slot)
|
2020-02-18 21:07:31 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
rb_erase(&slot->gfn_node[slots->node_idx], &slots->gfn_tree);
|
|
|
|
}
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_replace_gfn_node(struct kvm_memslots *slots,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new)
|
|
|
|
{
|
|
|
|
int idx = slots->node_idx;
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
WARN_ON_ONCE(old->base_gfn != new->base_gfn);
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
rb_replace_node(&old->gfn_node[idx], &new->gfn_node[idx],
|
|
|
|
&slots->gfn_tree);
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* Replace @old with @new in the inactive memslots.
|
2020-02-18 21:07:31 +00:00
|
|
|
*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* With NULL @old this simply adds @new.
|
|
|
|
* With NULL @new this simply removes @old.
|
2020-02-18 21:07:31 +00:00
|
|
|
*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* If @new is non-NULL its hva_node[slots_idx] range has to be set
|
|
|
|
* appropriately.
|
2020-02-18 21:07:31 +00:00
|
|
|
*/
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_replace_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new)
|
2020-02-18 21:07:31 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
int as_id = kvm_memslots_get_as_id(old, new);
|
|
|
|
struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
|
|
|
|
int idx = slots->node_idx;
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if (old) {
|
|
|
|
hash_del(&old->id_node[idx]);
|
|
|
|
interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if ((long)old == atomic_long_read(&slots->last_used_slot))
|
|
|
|
atomic_long_set(&slots->last_used_slot, (long)new);
|
2020-02-18 21:07:31 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if (!new) {
|
|
|
|
kvm_erase_gfn_node(slots, old);
|
2021-12-06 19:54:26 +00:00
|
|
|
return;
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
}
|
|
|
|
}
|
2021-12-06 19:54:26 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/*
|
|
|
|
* Initialize @new's hva range. Do this even when replacing an @old
|
|
|
|
* slot, kvm_copy_memslot() deliberately does not touch node data.
|
|
|
|
*/
|
|
|
|
new->hva_node[idx].start = new->userspace_addr;
|
|
|
|
new->hva_node[idx].last = new->userspace_addr +
|
|
|
|
(new->npages << PAGE_SHIFT) - 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* (Re)Add the new memslot. There is no O(1) interval_tree_replace(),
|
|
|
|
* hva_node needs to be swapped with remove+insert even though hva can't
|
|
|
|
* change when replacing an existing slot.
|
|
|
|
*/
|
|
|
|
hash_add(slots->id_hash, &new->id_node[idx], new->id);
|
|
|
|
interval_tree_insert(&new->hva_node[idx], &slots->hva_tree);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the memslot gfn is unchanged, rb_replace_node() can be used to
|
|
|
|
* switch the node in the gfn tree instead of removing the old and
|
|
|
|
* inserting the new as two separate operations. Replacement is a
|
|
|
|
* single O(1) operation versus two O(log(n)) operations for
|
|
|
|
* remove+insert.
|
|
|
|
*/
|
|
|
|
if (old && old->base_gfn == new->base_gfn) {
|
|
|
|
kvm_replace_gfn_node(slots, old, new);
|
|
|
|
} else {
|
|
|
|
if (old)
|
|
|
|
kvm_erase_gfn_node(slots, old);
|
|
|
|
kvm_insert_gfn_node(slots, new);
|
2020-02-18 21:07:31 +00:00
|
|
|
}
|
2011-11-24 09:40:57 +00:00
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:50 +00:00
|
|
|
/*
|
|
|
|
* Flags that do not access any of the extra space of struct
|
|
|
|
* kvm_userspace_memory_region2. KVM_SET_USER_MEMORY_REGION_V1_FLAGS
|
|
|
|
* only allows these.
|
|
|
|
*/
|
|
|
|
#define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
|
|
|
|
(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
|
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
static int check_memory_region_flags(struct kvm *kvm,
|
|
|
|
const struct kvm_userspace_memory_region2 *mem)
|
2012-08-21 02:58:13 +00:00
|
|
|
{
|
2012-08-21 03:02:51 +00:00
|
|
|
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
|
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
if (kvm_arch_has_private_mem(kvm))
|
|
|
|
valid_flags |= KVM_MEM_GUEST_MEMFD;
|
|
|
|
|
|
|
|
/* Dirty logging private memory is not currently supported. */
|
|
|
|
if (mem->flags & KVM_MEM_GUEST_MEMFD)
|
|
|
|
valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
|
|
|
|
|
2024-01-11 08:00:34 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_READONLY_MEM
|
2024-02-22 19:06:08 +00:00
|
|
|
/*
|
|
|
|
* GUEST_MEMFD is incompatible with read-only memslots, as writes to
|
|
|
|
* read-only memslots have emulated MMIO, not page fault, semantics,
|
|
|
|
* and KVM doesn't allow emulated MMIO for private memory.
|
|
|
|
*/
|
|
|
|
if (!(mem->flags & KVM_MEM_GUEST_MEMFD))
|
|
|
|
valid_flags |= KVM_MEM_READONLY;
|
2012-08-21 03:02:51 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
if (mem->flags & ~valid_flags)
|
2012-08-21 02:58:13 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
|
2012-12-24 15:49:30 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
|
|
|
|
|
|
|
|
/* Grab the generation from the activate memslots. */
|
|
|
|
u64 gen = __kvm_memslots(kvm, as_id)->generation;
|
2012-12-24 15:49:30 +00:00
|
|
|
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 21:01:14 +00:00
|
|
|
WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
|
|
|
|
slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
|
2014-08-18 22:46:06 +00:00
|
|
|
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
/*
|
|
|
|
* Do not store the new memslots while there are invalidations in
|
2021-08-03 07:45:41 +00:00
|
|
|
* progress, otherwise the locking in invalidate_range_start and
|
|
|
|
* invalidate_range_end will be unbalanced.
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
*/
|
|
|
|
spin_lock(&kvm->mn_invalidate_lock);
|
|
|
|
prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
|
|
|
|
while (kvm->mn_active_invalidate_count) {
|
|
|
|
set_current_state(TASK_UNINTERRUPTIBLE);
|
|
|
|
spin_unlock(&kvm->mn_invalidate_lock);
|
|
|
|
schedule();
|
|
|
|
spin_lock(&kvm->mn_invalidate_lock);
|
|
|
|
}
|
|
|
|
finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
|
2015-05-17 15:30:37 +00:00
|
|
|
rcu_assign_pointer(kvm->memslots[as_id], slots);
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 12:09:15 +00:00
|
|
|
spin_unlock(&kvm->mn_invalidate_lock);
|
2021-05-18 17:34:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Acquired in kvm_set_memslot. Must be released before synchronize
|
|
|
|
* SRCU below in order to avoid deadlock with another thread
|
|
|
|
* acquiring the slots_arch_lock in an srcu critical section.
|
|
|
|
*/
|
|
|
|
mutex_unlock(&kvm->slots_arch_lock);
|
|
|
|
|
2012-12-24 15:49:30 +00:00
|
|
|
synchronize_srcu_expedited(&kvm->srcu);
|
2013-07-04 04:40:29 +00:00
|
|
|
|
2014-08-18 22:46:06 +00:00
|
|
|
/*
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 21:01:14 +00:00
|
|
|
* Increment the new memslot generation a second time, dropping the
|
2019-12-11 06:26:23 +00:00
|
|
|
* update in-progress flag and incrementing the generation based on
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 21:01:14 +00:00
|
|
|
* the number of address spaces. This provides a unique and easily
|
|
|
|
* identifiable generation number while the memslots are in flux.
|
|
|
|
*/
|
|
|
|
gen = slots->generation & ~KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
|
|
|
|
|
|
|
|
/*
|
2017-02-04 04:44:51 +00:00
|
|
|
* Generations must be unique even across address spaces. We do not need
|
|
|
|
* a global counter for that, instead the generation space is evenly split
|
|
|
|
* across address spaces. For example, with two address spaces, address
|
2019-02-05 21:01:18 +00:00
|
|
|
* space 0 will use generations 0, 2, 4, ... while address space 1 will
|
|
|
|
* use generations 1, 3, 5, ...
|
2014-08-18 22:46:06 +00:00
|
|
|
*/
|
2023-10-27 18:22:04 +00:00
|
|
|
gen += kvm_arch_nr_memslot_as_ids(kvm);
|
2014-08-18 22:46:06 +00:00
|
|
|
|
2019-02-05 20:54:17 +00:00
|
|
|
kvm_arch_memslots_updated(kvm, gen);
|
2014-08-18 22:46:06 +00:00
|
|
|
|
2019-02-05 20:54:17 +00:00
|
|
|
slots->generation = gen;
|
2012-12-24 15:49:30 +00:00
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:19 +00:00
|
|
|
static int kvm_prepare_memory_region(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new,
|
|
|
|
enum kvm_mr_change change)
|
2021-05-18 17:34:10 +00:00
|
|
|
{
|
2021-12-06 19:54:19 +00:00
|
|
|
int r;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If dirty logging is disabled, nullify the bitmap; the old bitmap
|
|
|
|
* will be freed on "commit". If logging is enabled in both old and
|
|
|
|
* new, reuse the existing bitmap. If logging is enabled only in the
|
|
|
|
* new and KVM isn't using a ring buffer, allocate and initialize a
|
|
|
|
* new bitmap.
|
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
if (change != KVM_MR_DELETE) {
|
|
|
|
if (!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
|
|
|
|
new->dirty_bitmap = NULL;
|
|
|
|
else if (old && old->dirty_bitmap)
|
|
|
|
new->dirty_bitmap = old->dirty_bitmap;
|
2022-11-10 10:49:10 +00:00
|
|
|
else if (kvm_use_dirty_bitmap(kvm)) {
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
r = kvm_alloc_dirty_bitmap(new);
|
|
|
|
if (r)
|
|
|
|
return r;
|
|
|
|
|
|
|
|
if (kvm_dirty_log_manual_protect_and_init_set(kvm))
|
|
|
|
bitmap_set(new->dirty_bitmap, 0, new->npages);
|
|
|
|
}
|
2021-12-06 19:54:19 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
r = kvm_arch_prepare_memory_region(kvm, old, new, change);
|
|
|
|
|
|
|
|
/* Free the bitmap on failure if it was allocated above. */
|
KVM: Free new dirty bitmap if creating a new memslot fails
Fix a goof in kvm_prepare_memory_region() where KVM fails to free the
new memslot's dirty bitmap during a CREATE action if
kvm_arch_prepare_memory_region() fails. The logic is supposed to detect
if the bitmap was allocated and thus needs to be freed, versus if the
bitmap was inherited from the old memslot and thus needs to be kept. If
there is no old memslot, then obviously the bitmap can't have been
inherited
The bug was exposed by commit 86931ff7207b ("KVM: x86/mmu: Do not create
SPTEs for GFNs that exceed host.MAXPHYADDR"), which made it trivally easy
for syzkaller to trigger failure during kvm_arch_prepare_memory_region(),
but the bug can be hit other ways too, e.g. due to -ENOMEM when
allocating x86's memslot metadata.
The backtrace from kmemleak:
__vmalloc_node_range+0xb40/0xbd0 mm/vmalloc.c:3195
__vmalloc_node mm/vmalloc.c:3232 [inline]
__vmalloc+0x49/0x50 mm/vmalloc.c:3246
__vmalloc_array mm/util.c:671 [inline]
__vcalloc+0x49/0x70 mm/util.c:694
kvm_alloc_dirty_bitmap virt/kvm/kvm_main.c:1319
kvm_prepare_memory_region virt/kvm/kvm_main.c:1551
kvm_set_memslot+0x1bd/0x690 virt/kvm/kvm_main.c:1782
__kvm_set_memory_region+0x689/0x750 virt/kvm/kvm_main.c:1949
kvm_set_memory_region virt/kvm/kvm_main.c:1962
kvm_vm_ioctl_set_memory_region virt/kvm/kvm_main.c:1974
kvm_vm_ioctl+0x377/0x13a0 virt/kvm/kvm_main.c:4528
vfs_ioctl fs/ioctl.c:51
__do_sys_ioctl fs/ioctl.c:870
__se_sys_ioctl fs/ioctl.c:856
__x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
And the relevant sequence of KVM events:
ioctl(3, KVM_CREATE_VM, 0) = 4
ioctl(4, KVM_SET_USER_MEMORY_REGION, {slot=0,
flags=KVM_MEM_LOG_DIRTY_PAGES,
guest_phys_addr=0x10000000000000,
memory_size=4096,
userspace_addr=0x20fe8000}
) = -1 EINVAL (Invalid argument)
Fixes: 244893fa2859 ("KVM: Dynamically allocate "new" memslots from the get-go")
Cc: stable@vger.kernel.org
Reported-by: syzbot+8606b8a9cc97a63f1c87@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220518003842.1341782-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-18 00:38:42 +00:00
|
|
|
if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
|
2021-12-06 19:54:19 +00:00
|
|
|
kvm_destroy_dirty_bitmap(new);
|
|
|
|
|
|
|
|
return r;
|
2021-05-18 17:34:10 +00:00
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:19 +00:00
|
|
|
static void kvm_commit_memory_region(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
const struct kvm_memory_slot *new,
|
|
|
|
enum kvm_mr_change change)
|
2021-05-18 17:34:10 +00:00
|
|
|
{
|
2022-11-17 17:25:02 +00:00
|
|
|
int old_flags = old ? old->flags : 0;
|
|
|
|
int new_flags = new ? new->flags : 0;
|
2021-12-06 19:54:19 +00:00
|
|
|
/*
|
|
|
|
* Update the total number of memslot pages before calling the arch
|
|
|
|
* hook so that architectures can consume the result directly.
|
|
|
|
*/
|
|
|
|
if (change == KVM_MR_DELETE)
|
|
|
|
kvm->nr_memslot_pages -= old->npages;
|
|
|
|
else if (change == KVM_MR_CREATE)
|
|
|
|
kvm->nr_memslot_pages += new->npages;
|
|
|
|
|
2022-11-17 17:25:02 +00:00
|
|
|
if ((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES) {
|
|
|
|
int change = (new_flags & KVM_MEM_LOG_DIRTY_PAGES) ? 1 : -1;
|
|
|
|
atomic_set(&kvm->nr_memslots_dirty_logging,
|
|
|
|
atomic_read(&kvm->nr_memslots_dirty_logging) + change);
|
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:19 +00:00
|
|
|
kvm_arch_commit_memory_region(kvm, old, new, change);
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
switch (change) {
|
|
|
|
case KVM_MR_CREATE:
|
|
|
|
/* Nothing more to do. */
|
|
|
|
break;
|
|
|
|
case KVM_MR_DELETE:
|
|
|
|
/* Free the old memslot and all its metadata. */
|
|
|
|
kvm_free_memslot(kvm, old);
|
|
|
|
break;
|
|
|
|
case KVM_MR_MOVE:
|
|
|
|
case KVM_MR_FLAGS_ONLY:
|
|
|
|
/*
|
|
|
|
* Free the dirty bitmap as needed; the below check encompasses
|
|
|
|
* both the flags and whether a ring buffer is being used)
|
|
|
|
*/
|
|
|
|
if (old->dirty_bitmap && !new->dirty_bitmap)
|
|
|
|
kvm_destroy_dirty_bitmap(old);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The final quirk. Free the detached, old slot, but only its
|
|
|
|
* memory, not any metadata. Metadata, including arch specific
|
|
|
|
* data, may be reused by @new.
|
|
|
|
*/
|
|
|
|
kfree(old);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
2021-05-18 17:34:10 +00:00
|
|
|
}
|
|
|
|
|
2020-02-18 21:07:32 +00:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* Activate @new, which must be installed in the inactive slots by the caller,
|
|
|
|
* by swapping the active slots and then propagating @new to @old once @old is
|
|
|
|
* unreachable and can be safely modified.
|
|
|
|
*
|
|
|
|
* With NULL @old this simply adds @new to @active (while swapping the sets).
|
|
|
|
* With NULL @new this simply removes @old from @active and frees it
|
|
|
|
* (while also swapping the sets).
|
2020-02-18 21:07:32 +00:00
|
|
|
*/
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_activate_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new)
|
2020-02-18 21:07:32 +00:00
|
|
|
{
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
int as_id = kvm_memslots_get_as_id(old, new);
|
2020-02-18 21:07:32 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
kvm_swap_active_memslots(kvm, as_id);
|
|
|
|
|
|
|
|
/* Propagate the new memslot to the now inactive memslots. */
|
|
|
|
kvm_replace_memslot(kvm, old, new);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_copy_memslot(struct kvm_memory_slot *dest,
|
|
|
|
const struct kvm_memory_slot *src)
|
|
|
|
{
|
|
|
|
dest->base_gfn = src->base_gfn;
|
|
|
|
dest->npages = src->npages;
|
|
|
|
dest->dirty_bitmap = src->dirty_bitmap;
|
|
|
|
dest->arch = src->arch;
|
|
|
|
dest->userspace_addr = src->userspace_addr;
|
|
|
|
dest->flags = src->flags;
|
|
|
|
dest->id = src->id;
|
|
|
|
dest->as_id = src->as_id;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_invalidate_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
struct kvm_memory_slot *invalid_slot)
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
{
|
2021-12-06 19:54:19 +00:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* Mark the current slot INVALID. As with all memslot modifications,
|
|
|
|
* this must be done on an unreachable slot to avoid modifying the
|
|
|
|
* current slot in the active tree.
|
2021-12-06 19:54:19 +00:00
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_copy_memslot(invalid_slot, old);
|
|
|
|
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
|
|
|
|
kvm_replace_memslot(kvm, old, invalid_slot);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Activate the slot that is now marked INVALID, but don't propagate
|
|
|
|
* the slot to the now inactive slots. The slot is either going to be
|
|
|
|
* deleted or recreated as a new slot.
|
|
|
|
*/
|
|
|
|
kvm_swap_active_memslots(kvm, old->as_id);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* From this point no new shadow pages pointing to a deleted, or moved,
|
|
|
|
* memslot will be created. Validation of sp->gfn happens in:
|
|
|
|
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
|
|
|
|
* - kvm_is_visible_gfn (mmu_check_root)
|
|
|
|
*/
|
2021-12-06 19:54:31 +00:00
|
|
|
kvm_arch_flush_shadow_memslot(kvm, old);
|
2022-04-21 03:14:07 +00:00
|
|
|
kvm_arch_guest_memory_reclaimed(kvm);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
|
2023-02-23 05:28:51 +00:00
|
|
|
/* Was released by kvm_swap_active_memslots(), reacquire. */
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
mutex_lock(&kvm->slots_arch_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy the arch-specific field of the newly-installed slot back to the
|
|
|
|
* old slot as the arch data could have changed between releasing
|
2023-02-23 05:28:51 +00:00
|
|
|
* slots_arch_lock in kvm_swap_active_memslots() and re-acquiring the lock
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* above. Writers are required to retrieve memslots *after* acquiring
|
|
|
|
* slots_arch_lock, thus the active slot's data is guaranteed to be fresh.
|
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
old->arch = invalid_slot->arch;
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_create_memslot(struct kvm *kvm,
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
struct kvm_memory_slot *new)
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
{
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
/* Add the new memslot to the inactive set and activate. */
|
|
|
|
kvm_replace_memslot(kvm, NULL, new);
|
|
|
|
kvm_activate_memslot(kvm, NULL, new);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_delete_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *invalid_slot)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Remove the old memslot (in the inactive memslots) by passing NULL as
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
* the "new" slot, and for the invalid version in the active slots.
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
*/
|
|
|
|
kvm_replace_memslot(kvm, old, NULL);
|
|
|
|
kvm_activate_memslot(kvm, invalid_slot, NULL);
|
|
|
|
}
|
2020-02-18 21:07:32 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
static void kvm_move_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new,
|
|
|
|
struct kvm_memory_slot *invalid_slot)
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
{
|
|
|
|
/*
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
* Replace the old memslot in the inactive slots, and then swap slots
|
|
|
|
* and replace the current INVALID with the new as well.
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_replace_memslot(kvm, old, new);
|
|
|
|
kvm_activate_memslot(kvm, invalid_slot, new);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
}
|
2020-02-18 21:07:32 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
static void kvm_update_flags_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *old,
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
struct kvm_memory_slot *new)
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Similar to the MOVE case, but the slot doesn't need to be zapped as
|
|
|
|
* an intermediate step. Instead, the old memslot is simply replaced
|
|
|
|
* with a new, updated copy in both memslot sets.
|
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_replace_memslot(kvm, old, new);
|
|
|
|
kvm_activate_memslot(kvm, old, new);
|
2020-02-18 21:07:32 +00:00
|
|
|
}
|
|
|
|
|
2020-02-18 21:07:23 +00:00
|
|
|
static int kvm_set_memslot(struct kvm *kvm,
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
struct kvm_memory_slot *old,
|
2021-12-06 19:54:10 +00:00
|
|
|
struct kvm_memory_slot *new,
|
2020-02-18 21:07:23 +00:00
|
|
|
enum kvm_mr_change change)
|
|
|
|
{
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
struct kvm_memory_slot *invalid_slot;
|
2020-02-18 21:07:23 +00:00
|
|
|
int r;
|
|
|
|
|
2021-05-18 17:34:11 +00:00
|
|
|
/*
|
2023-02-23 05:28:51 +00:00
|
|
|
* Released in kvm_swap_active_memslots().
|
2021-05-18 17:34:11 +00:00
|
|
|
*
|
2023-02-23 05:28:51 +00:00
|
|
|
* Must be held from before the current memslots are copied until after
|
|
|
|
* the new memslots are installed with rcu_assign_pointer, then
|
|
|
|
* released before the synchronize srcu in kvm_swap_active_memslots().
|
2021-05-18 17:34:11 +00:00
|
|
|
*
|
|
|
|
* When modifying memslots outside of the slots_lock, must be held
|
|
|
|
* before reading the pointer to the current memslots until after all
|
|
|
|
* changes to those memslots are complete.
|
|
|
|
*
|
|
|
|
* These rules ensure that installing new memslots does not lose
|
|
|
|
* changes made to the previous memslots.
|
|
|
|
*/
|
|
|
|
mutex_lock(&kvm->slots_arch_lock);
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/*
|
|
|
|
* Invalidate the old slot if it's being deleted or moved. This is
|
|
|
|
* done prior to actually deleting/moving the memslot to allow vCPUs to
|
|
|
|
* continue running by ensuring there are no mappings or shadow pages
|
|
|
|
* for the memslot when it is deleted/moved. Without pre-invalidation
|
|
|
|
* (and without a lock), a window would exist between effecting the
|
|
|
|
* delete/move and committing the changes in arch code where KVM or a
|
|
|
|
* guest could access a non-existent memslot.
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
*
|
|
|
|
* Modifications are done on a temporary, unreachable slot. The old
|
|
|
|
* slot needs to be preserved in case a later step fails and the
|
|
|
|
* invalidation needs to be reverted.
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
*/
|
2020-02-18 21:07:23 +00:00
|
|
|
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
|
|
|
|
if (!invalid_slot) {
|
|
|
|
mutex_unlock(&kvm->slots_arch_lock);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
kvm_invalidate_memslot(kvm, old, invalid_slot);
|
|
|
|
}
|
2021-05-18 17:34:11 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
r = kvm_prepare_memory_region(kvm, old, new, change);
|
|
|
|
if (r) {
|
2021-05-18 17:34:11 +00:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* For DELETE/MOVE, revert the above INVALID change. No
|
|
|
|
* modifications required since the original slot was preserved
|
|
|
|
* in the inactive slots. Changing the active memslots also
|
|
|
|
* release slots_arch_lock.
|
2021-05-18 17:34:11 +00:00
|
|
|
*/
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
|
|
|
|
kvm_activate_memslot(kvm, invalid_slot, old);
|
|
|
|
kfree(invalid_slot);
|
|
|
|
} else {
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
mutex_unlock(&kvm->slots_arch_lock);
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
}
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
return r;
|
2020-02-18 21:07:23 +00:00
|
|
|
}
|
|
|
|
|
2021-11-04 00:25:02 +00:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* For DELETE and MOVE, the working slot is now active as the INVALID
|
|
|
|
* version of the old slot. MOVE is particularly special as it reuses
|
|
|
|
* the old slot and returns a copy of the old slot (in working_slot).
|
|
|
|
* For CREATE, there is no old slot. For DELETE and FLAGS_ONLY, the
|
|
|
|
* old slot is detached but otherwise preserved.
|
2021-11-04 00:25:02 +00:00
|
|
|
*/
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
if (change == KVM_MR_CREATE)
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_create_memslot(kvm, new);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
else if (change == KVM_MR_DELETE)
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_delete_memslot(kvm, old, invalid_slot);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
else if (change == KVM_MR_MOVE)
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_move_memslot(kvm, old, new, invalid_slot);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
else if (change == KVM_MR_FLAGS_ONLY)
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
kvm_update_flags_memslot(kvm, old, new);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
else
|
|
|
|
BUG();
|
2020-02-18 21:07:23 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
/* Free the temporary INVALID slot used for DELETE and MOVE. */
|
|
|
|
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
|
|
|
|
kfree(invalid_slot);
|
2021-11-04 00:25:02 +00:00
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/*
|
|
|
|
* No need to refresh new->arch, changes after dropping slots_arch_lock
|
2022-04-10 15:38:40 +00:00
|
|
|
* will directly hit the final, active memslot. Architectures are
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* responsible for knowing that new->arch may be stale.
|
|
|
|
*/
|
|
|
|
kvm_commit_memory_region(kvm, old, new, change);
|
2020-02-18 21:07:23 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-12-06 19:54:33 +00:00
|
|
|
static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
|
|
|
|
gfn_t start, gfn_t end)
|
2020-02-18 21:07:26 +00:00
|
|
|
{
|
2021-12-06 19:54:33 +00:00
|
|
|
struct kvm_memslot_iter iter;
|
2020-02-18 21:07:26 +00:00
|
|
|
|
2021-12-06 19:54:33 +00:00
|
|
|
kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
|
|
|
|
if (iter.slot->id != id)
|
|
|
|
return true;
|
|
|
|
}
|
2020-02-18 21:07:26 +00:00
|
|
|
|
2021-12-06 19:54:33 +00:00
|
|
|
return false;
|
2020-02-18 21:07:26 +00:00
|
|
|
}
|
|
|
|
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
/*
|
|
|
|
* Allocate some memory and give it an address in the guest physical address
|
|
|
|
* space.
|
|
|
|
*
|
|
|
|
* Discontiguous memory is allowed, mostly for framebuffers.
|
2007-10-29 01:40:42 +00:00
|
|
|
*
|
2014-10-27 15:22:56 +00:00
|
|
|
* Must be called holding kvm->slots_lock for write.
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
*/
|
2007-10-29 01:40:42 +00:00
|
|
|
int __kvm_set_memory_region(struct kvm *kvm,
|
2023-10-27 18:21:50 +00:00
|
|
|
const struct kvm_userspace_memory_region2 *mem)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
struct kvm_memory_slot *old, *new;
|
2021-12-06 19:54:33 +00:00
|
|
|
struct kvm_memslots *slots;
|
2013-01-29 02:00:07 +00:00
|
|
|
enum kvm_mr_change change;
|
2021-12-06 19:54:34 +00:00
|
|
|
unsigned long npages;
|
|
|
|
gfn_t base_gfn;
|
2020-02-18 21:07:28 +00:00
|
|
|
int as_id, id;
|
|
|
|
int r;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
r = check_memory_region_flags(kvm, mem);
|
2012-08-21 02:58:13 +00:00
|
|
|
if (r)
|
2020-02-18 21:07:22 +00:00
|
|
|
return r;
|
2012-08-21 02:58:13 +00:00
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
as_id = mem->slot >> 16;
|
|
|
|
id = (u16)mem->slot;
|
|
|
|
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
/* General sanity checks */
|
2021-11-04 00:25:03 +00:00
|
|
|
if ((mem->memory_size & (PAGE_SIZE - 1)) ||
|
|
|
|
(mem->memory_size != (unsigned long)mem->memory_size))
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
2011-05-07 07:35:38 +00:00
|
|
|
/* We can read the guest memory with __xxx_user() later on. */
|
2020-06-01 08:17:45 +00:00
|
|
|
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
|
2021-01-21 12:08:15 +00:00
|
|
|
(mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
|
Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 02:57:57 +00:00
|
|
|
!access_ok((void __user *)(unsigned long)mem->userspace_addr,
|
2020-06-01 08:17:45 +00:00
|
|
|
mem->memory_size))
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
if (mem->flags & KVM_MEM_GUEST_MEMFD &&
|
|
|
|
(mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
|
|
|
|
mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
|
|
|
|
return -EINVAL;
|
2023-10-27 18:22:04 +00:00
|
|
|
if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
2021-12-06 19:54:34 +00:00
|
|
|
if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
|
|
|
|
return -EINVAL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2021-12-06 19:54:33 +00:00
|
|
|
slots = __kvm_memslots(kvm, as_id);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2020-02-18 21:07:26 +00:00
|
|
|
/*
|
2021-12-06 19:54:22 +00:00
|
|
|
* Note, the old memslot (and the pointer itself!) may be invalidated
|
|
|
|
* and/or destroyed by kvm_set_memslot().
|
2020-02-18 21:07:26 +00:00
|
|
|
*/
|
2021-12-06 19:54:33 +00:00
|
|
|
old = id_to_memslot(slots, id);
|
2020-02-18 21:07:28 +00:00
|
|
|
|
2021-12-06 19:54:08 +00:00
|
|
|
if (!mem->memory_size) {
|
2021-12-06 19:54:22 +00:00
|
|
|
if (!old || !old->npages)
|
2021-12-06 19:54:08 +00:00
|
|
|
return -EINVAL;
|
2020-02-18 21:07:26 +00:00
|
|
|
|
2021-12-06 19:54:22 +00:00
|
|
|
if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
|
2021-12-06 19:54:08 +00:00
|
|
|
return -EIO;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
return kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE);
|
2021-12-06 19:54:08 +00:00
|
|
|
}
|
2020-02-18 21:07:26 +00:00
|
|
|
|
2021-12-06 19:54:34 +00:00
|
|
|
base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
|
|
|
|
npages = (mem->memory_size >> PAGE_SHIFT);
|
2020-02-18 21:07:28 +00:00
|
|
|
|
2021-12-06 19:54:22 +00:00
|
|
|
if (!old || !old->npages) {
|
2020-02-18 21:07:26 +00:00
|
|
|
change = KVM_MR_CREATE;
|
KVM: Require total number of memslot pages to fit in an unsigned long
Explicitly disallow creating more memslot pages than can fit in an
unsigned long, KVM doesn't correctly handle a total number of memslot
pages that doesn't fit in an unsigned long and remedying that would be a
waste of time.
For a 64-bit kernel, this is a nop as memslots are not allowed to overlap
in the gfn address space.
With a 32-bit kernel, userspace can at most address 3gb of virtual memory,
whereas wrapping the total number of pages would require 4tb+ of guest
physical memory. Even with x86's second address space for SMM, userspace
would need to alias all of guest memory more than one _thousand_ times.
And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't
actually access any of those aliases even if userspace lied about
guest.MAXPHYADDR.
On 390 and arm64, this is a nop as they don't support 32-bit hosts.
On x86, practically speaking this is simply acknowledging reality as the
existing kvm_mmu_calculate_default_mmu_pages() assumes the total number
of pages fits in an "unsigned long".
On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and
gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes
a step further and fails the build if CONFIG_PTE_64BIT=y, which
presumably means that it does't support 64-bit physical addresses.
On MIPS, this is also likely a nop as the core MMU helpers assume gpas
fit in unsigned long, e.g. see kvm_mips_##name##_pte.
And finally, RISC-V is a "don't care" as it doesn't exist in any release,
i.e. there is no established ABI to break.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* To simplify KVM internals, the total number of pages across
|
|
|
|
* all memslots must fit in an unsigned long.
|
|
|
|
*/
|
2021-12-06 19:54:34 +00:00
|
|
|
if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
|
KVM: Require total number of memslot pages to fit in an unsigned long
Explicitly disallow creating more memslot pages than can fit in an
unsigned long, KVM doesn't correctly handle a total number of memslot
pages that doesn't fit in an unsigned long and remedying that would be a
waste of time.
For a 64-bit kernel, this is a nop as memslots are not allowed to overlap
in the gfn address space.
With a 32-bit kernel, userspace can at most address 3gb of virtual memory,
whereas wrapping the total number of pages would require 4tb+ of guest
physical memory. Even with x86's second address space for SMM, userspace
would need to alias all of guest memory more than one _thousand_ times.
And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't
actually access any of those aliases even if userspace lied about
guest.MAXPHYADDR.
On 390 and arm64, this is a nop as they don't support 32-bit hosts.
On x86, practically speaking this is simply acknowledging reality as the
existing kvm_mmu_calculate_default_mmu_pages() assumes the total number
of pages fits in an "unsigned long".
On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and
gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes
a step further and fails the build if CONFIG_PTE_64BIT=y, which
presumably means that it does't support 64-bit physical addresses.
On MIPS, this is also likely a nop as the core MMU helpers assume gpas
fit in unsigned long, e.g. see kvm_mips_##name##_pte.
And finally, RISC-V is a "don't care" as it doesn't exist in any release,
i.e. there is no established ABI to break.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:07 +00:00
|
|
|
return -EINVAL;
|
2020-02-18 21:07:26 +00:00
|
|
|
} else { /* Modify an existing slot. */
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
/* Private memslots are immutable, they can only be deleted. */
|
|
|
|
if (mem->flags & KVM_MEM_GUEST_MEMFD)
|
|
|
|
return -EINVAL;
|
2021-12-06 19:54:34 +00:00
|
|
|
if ((mem->userspace_addr != old->userspace_addr) ||
|
|
|
|
(npages != old->npages) ||
|
|
|
|
((mem->flags ^ old->flags) & KVM_MEM_READONLY))
|
2020-02-18 21:07:22 +00:00
|
|
|
return -EINVAL;
|
2015-05-18 11:59:39 +00:00
|
|
|
|
2021-12-06 19:54:34 +00:00
|
|
|
if (base_gfn != old->base_gfn)
|
2020-02-18 21:07:26 +00:00
|
|
|
change = KVM_MR_MOVE;
|
2021-12-06 19:54:34 +00:00
|
|
|
else if (mem->flags != old->flags)
|
2020-02-18 21:07:26 +00:00
|
|
|
change = KVM_MR_FLAGS_ONLY;
|
|
|
|
else /* Nothing to change. */
|
|
|
|
return 0;
|
2015-05-18 11:59:39 +00:00
|
|
|
}
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2021-12-06 19:54:33 +00:00
|
|
|
if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
|
2021-12-06 19:54:34 +00:00
|
|
|
kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
|
2021-12-06 19:54:33 +00:00
|
|
|
return -EEXIST;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
/* Allocate a slot that will persist in the memslot. */
|
|
|
|
new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
|
|
|
|
if (!new)
|
|
|
|
return -ENOMEM;
|
2020-02-27 01:32:27 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
new->as_id = as_id;
|
|
|
|
new->id = id;
|
|
|
|
new->base_gfn = base_gfn;
|
|
|
|
new->npages = npages;
|
|
|
|
new->flags = mem->flags;
|
|
|
|
new->userspace_addr = mem->userspace_addr;
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
if (mem->flags & KVM_MEM_GUEST_MEMFD) {
|
|
|
|
r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
|
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
}
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Dynamically allocate "new" memslots from the get-go
Allocate the "new" memslot for !DELETE memslot updates straight away
instead of filling an intermediate on-stack object and forcing
kvm_set_memslot() to juggle the allocation and do weird things like reuse
the old memslot object in MOVE.
In the MOVE case, this results in an "extra" memslot allocation due to
allocating both the "new" slot and the "invalid" slot, but that's a
temporary and not-huge allocation, and MOVE is a relatively rare memslot
operation.
Regarding MOVE, drop the open-coded management of the gfn tree with a
call to kvm_replace_memslot(), which already handles the case where
new->base_gfn != old->base_gfn. This is made possible by virtue of not
having to copy the "new" memslot data after erasing the old memslot from
the gfn tree. Using kvm_replace_memslot(), and more specifically not
reusing the old memslot, means the MOVE case now does hva tree and hash
list updates, but that's a small price to pay for simplifying the code
and making MOVE align with all the other flavors of updates. The "extra"
updates are firmly in the noise from a performance perspective, e.g. the
"move (in)active area" selfttests show a (very, very) slight improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:35 +00:00
|
|
|
r = kvm_set_memslot(kvm, old, new, change);
|
2020-02-18 21:07:23 +00:00
|
|
|
if (r)
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
goto out_unbind;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unbind:
|
|
|
|
if (mem->flags & KVM_MEM_GUEST_MEMFD)
|
|
|
|
kvm_gmem_unbind(new);
|
|
|
|
out:
|
|
|
|
kfree(new);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
return r;
|
2007-10-24 21:52:57 +00:00
|
|
|
}
|
2007-10-29 01:40:42 +00:00
|
|
|
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
|
|
|
|
|
|
|
|
int kvm_set_memory_region(struct kvm *kvm,
|
2023-10-27 18:21:50 +00:00
|
|
|
const struct kvm_userspace_memory_region2 *mem)
|
2007-10-29 01:40:42 +00:00
|
|
|
{
|
|
|
|
int r;
|
|
|
|
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_lock(&kvm->slots_lock);
|
2013-02-27 10:43:00 +00:00
|
|
|
r = __kvm_set_memory_region(kvm, mem);
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_unlock(&kvm->slots_lock);
|
2007-10-29 01:40:42 +00:00
|
|
|
return r;
|
|
|
|
}
|
2007-10-24 21:52:57 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_set_memory_region);
|
|
|
|
|
2013-12-29 20:12:29 +00:00
|
|
|
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
|
2023-10-27 18:21:50 +00:00
|
|
|
struct kvm_userspace_memory_region2 *mem)
|
2007-10-24 21:52:57 +00:00
|
|
|
{
|
2015-05-17 15:30:37 +00:00
|
|
|
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
|
2007-10-24 21:57:46 +00:00
|
|
|
return -EINVAL;
|
2015-05-18 11:59:39 +00:00
|
|
|
|
2013-02-27 10:43:00 +00:00
|
|
|
return kvm_set_memory_region(kvm, mem);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
#ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
/**
|
|
|
|
* kvm_get_dirty_log - get a snapshot of dirty pages
|
|
|
|
* @kvm: pointer to kvm instance
|
|
|
|
* @log: slot id and address to which we copy the log
|
|
|
|
* @is_dirty: set to '1' if any dirty pages were found
|
|
|
|
* @memslot: set to the associated memslot, always valid on success
|
|
|
|
*/
|
|
|
|
int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
|
|
|
|
int *is_dirty, struct kvm_memory_slot **memslot)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2015-05-17 14:20:07 +00:00
|
|
|
struct kvm_memslots *slots;
|
2017-01-22 16:41:07 +00:00
|
|
|
int i, as_id, id;
|
2010-04-12 10:35:35 +00:00
|
|
|
unsigned long n;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
unsigned long any = 0;
|
|
|
|
|
2022-11-10 10:49:10 +00:00
|
|
|
/* Dirty ring tracking may be exclusive to dirty log tracking */
|
|
|
|
if (!kvm_use_dirty_bitmap(kvm))
|
2020-10-01 01:22:24 +00:00
|
|
|
return -ENXIO;
|
|
|
|
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
*memslot = NULL;
|
|
|
|
*is_dirty = 0;
|
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
as_id = log->slot >> 16;
|
|
|
|
id = (u16)log->slot;
|
2023-10-27 18:22:04 +00:00
|
|
|
if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
|
2017-01-22 16:41:07 +00:00
|
|
|
return -EINVAL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
slots = __kvm_memslots(kvm, as_id);
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
*memslot = id_to_memslot(slots, id);
|
2020-02-18 21:07:31 +00:00
|
|
|
if (!(*memslot) || !(*memslot)->dirty_bitmap)
|
2017-01-22 16:41:07 +00:00
|
|
|
return -ENOENT;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
kvm_arch_sync_dirty_log(kvm, *memslot);
|
|
|
|
|
|
|
|
n = kvm_dirty_bitmap_bytes(*memslot);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-02-22 14:43:09 +00:00
|
|
|
for (i = 0; !any && i < n/sizeof(long); ++i)
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
any = (*memslot)->dirty_bitmap[i];
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-18 21:07:30 +00:00
|
|
|
if (copy_to_user(log->dirty_bitmap, (*memslot)->dirty_bitmap, n))
|
2017-01-22 16:41:07 +00:00
|
|
|
return -EFAULT;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-11-18 12:29:43 +00:00
|
|
|
if (any)
|
|
|
|
*is_dirty = 1;
|
2017-01-22 16:41:07 +00:00
|
|
|
return 0;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
2013-10-07 16:47:59 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_get_dirty_log);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
#else /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
|
2015-01-15 23:58:53 +00:00
|
|
|
/**
|
2019-04-23 11:40:30 +00:00
|
|
|
* kvm_get_dirty_log_protect - get a snapshot of dirty pages
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
* and reenable dirty page tracking for the corresponding pages.
|
2015-01-15 23:58:53 +00:00
|
|
|
* @kvm: pointer to kvm instance
|
|
|
|
* @log: slot id and address to which we copy the log
|
|
|
|
*
|
|
|
|
* We need to keep it in mind that VCPU threads can write to the bitmap
|
|
|
|
* concurrently. So, to avoid losing track of dirty pages we keep the
|
|
|
|
* following order:
|
|
|
|
*
|
|
|
|
* 1. Take a snapshot of the bit and clear it if needed.
|
|
|
|
* 2. Write protect the corresponding page.
|
|
|
|
* 3. Copy the snapshot to the userspace.
|
|
|
|
* 4. Upon return caller flushes TLB's if needed.
|
|
|
|
*
|
|
|
|
* Between 2 and 4, the guest may write to the page using the remaining TLB
|
|
|
|
* entry. This is not a problem because the page is reported dirty using
|
|
|
|
* the snapshot taken before and step 4 ensures that writes done after
|
|
|
|
* exiting to userspace will be logged for the next call.
|
|
|
|
*
|
|
|
|
*/
|
2020-02-18 21:07:29 +00:00
|
|
|
static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
|
2015-01-15 23:58:53 +00:00
|
|
|
{
|
2015-05-17 14:20:07 +00:00
|
|
|
struct kvm_memslots *slots;
|
2015-01-15 23:58:53 +00:00
|
|
|
struct kvm_memory_slot *memslot;
|
2017-01-22 16:30:16 +00:00
|
|
|
int i, as_id, id;
|
2015-01-15 23:58:53 +00:00
|
|
|
unsigned long n;
|
|
|
|
unsigned long *dirty_bitmap;
|
|
|
|
unsigned long *dirty_bitmap_buffer;
|
2020-02-18 21:07:29 +00:00
|
|
|
bool flush;
|
2015-01-15 23:58:53 +00:00
|
|
|
|
2022-11-10 10:49:10 +00:00
|
|
|
/* Dirty ring tracking may be exclusive to dirty log tracking */
|
|
|
|
if (!kvm_use_dirty_bitmap(kvm))
|
2020-10-01 01:22:24 +00:00
|
|
|
return -ENXIO;
|
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
as_id = log->slot >> 16;
|
|
|
|
id = (u16)log->slot;
|
2023-10-27 18:22:04 +00:00
|
|
|
if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
|
2017-01-22 16:30:16 +00:00
|
|
|
return -EINVAL;
|
2015-01-15 23:58:53 +00:00
|
|
|
|
2015-05-17 15:30:37 +00:00
|
|
|
slots = __kvm_memslots(kvm, as_id);
|
|
|
|
memslot = id_to_memslot(slots, id);
|
2020-02-18 21:07:31 +00:00
|
|
|
if (!memslot || !memslot->dirty_bitmap)
|
|
|
|
return -ENOENT;
|
2015-01-15 23:58:53 +00:00
|
|
|
|
|
|
|
dirty_bitmap = memslot->dirty_bitmap;
|
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
kvm_arch_sync_dirty_log(kvm, memslot);
|
|
|
|
|
2015-01-15 23:58:53 +00:00
|
|
|
n = kvm_dirty_bitmap_bytes(memslot);
|
2020-02-18 21:07:29 +00:00
|
|
|
flush = false;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
if (kvm->manual_dirty_log_protect) {
|
|
|
|
/*
|
|
|
|
* Unlike kvm_get_dirty_log, we always return false in *flush,
|
|
|
|
* because no flush is needed until KVM_CLEAR_DIRTY_LOG. There
|
|
|
|
* is some code duplication between this function and
|
|
|
|
* kvm_get_dirty_log, but hopefully all architecture
|
|
|
|
* transition to kvm_get_dirty_log_protect and kvm_get_dirty_log
|
|
|
|
* can be eliminated.
|
|
|
|
*/
|
|
|
|
dirty_bitmap_buffer = dirty_bitmap;
|
|
|
|
} else {
|
|
|
|
dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
|
|
|
|
memset(dirty_bitmap_buffer, 0, n);
|
2015-01-15 23:58:53 +00:00
|
|
|
|
2021-02-02 18:57:24 +00:00
|
|
|
KVM_MMU_LOCK(kvm);
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
for (i = 0; i < n / sizeof(long); i++) {
|
|
|
|
unsigned long mask;
|
|
|
|
gfn_t offset;
|
2015-01-15 23:58:53 +00:00
|
|
|
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
if (!dirty_bitmap[i])
|
|
|
|
continue;
|
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
flush = true;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
mask = xchg(&dirty_bitmap[i], 0);
|
|
|
|
dirty_bitmap_buffer[i] = mask;
|
|
|
|
|
2019-02-02 09:20:27 +00:00
|
|
|
offset = i * BITS_PER_LONG;
|
|
|
|
kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
|
|
|
|
offset, mask);
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
}
|
2021-02-02 18:57:24 +00:00
|
|
|
KVM_MMU_UNLOCK(kvm);
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
}
|
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
if (flush)
|
2023-08-11 04:51:19 +00:00
|
|
|
kvm_flush_remote_tlbs_memslot(kvm, memslot);
|
2020-02-18 21:07:29 +00:00
|
|
|
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
2020-02-18 21:07:29 +00:00
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
|
|
|
|
* @kvm: kvm instance
|
|
|
|
* @log: slot id and address to which we copy the log
|
|
|
|
*
|
|
|
|
* Steps 1-4 below provide general overview of dirty page logging. See
|
|
|
|
* kvm_get_dirty_log_protect() function description for additional details.
|
|
|
|
*
|
|
|
|
* We call kvm_get_dirty_log_protect() to handle steps 1-3, upon return we
|
|
|
|
* always flush the TLB (step 4) even if previous step failed and the dirty
|
|
|
|
* bitmap may be corrupt. Regardless of previous outcome the KVM logging API
|
|
|
|
* does not preclude user space subsequent dirty log read. Flushing TLB ensures
|
|
|
|
* writes will be marked dirty for next log read.
|
|
|
|
*
|
|
|
|
* 1. Take a snapshot of the bit and clear it if needed.
|
|
|
|
* 2. Write protect the corresponding page.
|
|
|
|
* 3. Copy the snapshot to the userspace.
|
|
|
|
* 4. Flush TLB's if needed.
|
|
|
|
*/
|
|
|
|
static int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
|
|
|
|
struct kvm_dirty_log *log)
|
|
|
|
{
|
|
|
|
int r;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
r = kvm_get_dirty_log_protect(kvm, log);
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->slots_lock);
|
|
|
|
return r;
|
|
|
|
}
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_clear_dirty_log_protect - clear dirty bits in the bitmap
|
|
|
|
* and reenable dirty page tracking for the corresponding pages.
|
|
|
|
* @kvm: pointer to kvm instance
|
|
|
|
* @log: slot id and address from which to fetch the bitmap of dirty pages
|
|
|
|
*/
|
2020-02-18 21:07:29 +00:00
|
|
|
static int kvm_clear_dirty_log_protect(struct kvm *kvm,
|
|
|
|
struct kvm_clear_dirty_log *log)
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
{
|
|
|
|
struct kvm_memslots *slots;
|
|
|
|
struct kvm_memory_slot *memslot;
|
2019-01-02 17:29:37 +00:00
|
|
|
int as_id, id;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
gfn_t offset;
|
2019-01-02 17:29:37 +00:00
|
|
|
unsigned long i, n;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
unsigned long *dirty_bitmap;
|
|
|
|
unsigned long *dirty_bitmap_buffer;
|
2020-02-18 21:07:29 +00:00
|
|
|
bool flush;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
|
2022-11-10 10:49:10 +00:00
|
|
|
/* Dirty ring tracking may be exclusive to dirty log tracking */
|
|
|
|
if (!kvm_use_dirty_bitmap(kvm))
|
2020-10-01 01:22:24 +00:00
|
|
|
return -ENXIO;
|
|
|
|
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
as_id = log->slot >> 16;
|
|
|
|
id = (u16)log->slot;
|
2023-10-27 18:22:04 +00:00
|
|
|
if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-04-17 13:28:44 +00:00
|
|
|
if (log->first_page & 63)
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
slots = __kvm_memslots(kvm, as_id);
|
|
|
|
memslot = id_to_memslot(slots, id);
|
2020-02-18 21:07:31 +00:00
|
|
|
if (!memslot || !memslot->dirty_bitmap)
|
|
|
|
return -ENOENT;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
|
|
|
|
dirty_bitmap = memslot->dirty_bitmap;
|
|
|
|
|
2019-05-08 09:15:45 +00:00
|
|
|
n = ALIGN(log->num_pages, BITS_PER_LONG) / 8;
|
2019-01-02 17:29:37 +00:00
|
|
|
|
|
|
|
if (log->first_page > memslot->npages ||
|
2019-04-17 13:28:44 +00:00
|
|
|
log->num_pages > memslot->npages - log->first_page ||
|
|
|
|
(log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
|
|
|
|
return -EINVAL;
|
2019-01-02 17:29:37 +00:00
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
kvm_arch_sync_dirty_log(kvm, memslot);
|
|
|
|
|
|
|
|
flush = false;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
|
|
|
|
if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
|
|
|
|
return -EFAULT;
|
2015-01-15 23:58:53 +00:00
|
|
|
|
2021-02-02 18:57:24 +00:00
|
|
|
KVM_MMU_LOCK(kvm);
|
2019-05-08 09:15:46 +00:00
|
|
|
for (offset = log->first_page, i = offset / BITS_PER_LONG,
|
|
|
|
n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
i++, offset += BITS_PER_LONG) {
|
|
|
|
unsigned long mask = *dirty_bitmap_buffer++;
|
|
|
|
atomic_long_t *p = (atomic_long_t *) &dirty_bitmap[i];
|
|
|
|
if (!mask)
|
2015-01-15 23:58:53 +00:00
|
|
|
continue;
|
|
|
|
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
mask &= atomic_long_fetch_andnot(mask, p);
|
2015-01-15 23:58:53 +00:00
|
|
|
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
/*
|
|
|
|
* mask contains the bits that really have been cleared. This
|
|
|
|
* never includes any bits beyond the length of the memslot (if
|
|
|
|
* the length is not aligned to 64 pages), therefore it is not
|
|
|
|
* a problem if userspace sets them in log->dirty_bitmap.
|
|
|
|
*/
|
2015-03-17 07:19:58 +00:00
|
|
|
if (mask) {
|
2020-02-18 21:07:29 +00:00
|
|
|
flush = true;
|
2015-03-17 07:19:58 +00:00
|
|
|
kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
|
|
|
|
offset, mask);
|
|
|
|
}
|
2015-01-15 23:58:53 +00:00
|
|
|
}
|
2021-02-02 18:57:24 +00:00
|
|
|
KVM_MMU_UNLOCK(kvm);
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
|
2020-02-18 21:07:29 +00:00
|
|
|
if (flush)
|
2023-08-11 04:51:19 +00:00
|
|
|
kvm_flush_remote_tlbs_memslot(kvm, memslot);
|
2020-02-18 21:07:29 +00:00
|
|
|
|
2017-01-22 16:30:16 +00:00
|
|
|
return 0;
|
2015-01-15 23:58:53 +00:00
|
|
|
}
|
2020-02-18 21:07:29 +00:00
|
|
|
|
|
|
|
static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
|
|
|
|
struct kvm_clear_dirty_log *log)
|
|
|
|
{
|
|
|
|
int r;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
r = kvm_clear_dirty_log_protect(kvm, log);
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->slots_lock);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
|
2015-01-15 23:58:53 +00:00
|
|
|
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
/*
|
|
|
|
* Returns true if _all_ gfns in the range [@start, @end) have attributes
|
|
|
|
* matching @attrs.
|
|
|
|
*/
|
|
|
|
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
|
|
|
|
unsigned long attrs)
|
|
|
|
{
|
|
|
|
XA_STATE(xas, &kvm->mem_attr_array, start);
|
|
|
|
unsigned long index;
|
|
|
|
bool has_attrs;
|
|
|
|
void *entry;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
if (!attrs) {
|
|
|
|
has_attrs = !xas_find(&xas, end - 1);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
has_attrs = true;
|
|
|
|
for (index = start; index < end; index++) {
|
|
|
|
do {
|
|
|
|
entry = xas_next(&xas);
|
|
|
|
} while (xas_retry(&xas, entry));
|
|
|
|
|
|
|
|
if (xas.xa_index != index || xa_to_value(entry) != attrs) {
|
|
|
|
has_attrs = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return has_attrs;
|
|
|
|
}
|
|
|
|
|
|
|
|
static u64 kvm_supported_mem_attributes(struct kvm *kvm)
|
|
|
|
{
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
if (!kvm || kvm_arch_has_private_mem(kvm))
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
return KVM_MEMORY_ATTRIBUTE_PRIVATE;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
|
|
|
|
struct kvm_mmu_notifier_range *range)
|
|
|
|
{
|
|
|
|
struct kvm_gfn_range gfn_range;
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
struct kvm_memslots *slots;
|
|
|
|
struct kvm_memslot_iter iter;
|
|
|
|
bool found_memslot = false;
|
|
|
|
bool ret = false;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
gfn_range.arg = range->arg;
|
|
|
|
gfn_range.may_block = range->may_block;
|
|
|
|
|
2023-10-27 18:22:04 +00:00
|
|
|
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
slots = __kvm_memslots(kvm, i);
|
|
|
|
|
|
|
|
kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
|
|
|
|
slot = iter.slot;
|
|
|
|
gfn_range.slot = slot;
|
|
|
|
|
|
|
|
gfn_range.start = max(range->start, slot->base_gfn);
|
|
|
|
gfn_range.end = min(range->end, slot->base_gfn + slot->npages);
|
|
|
|
if (gfn_range.start >= gfn_range.end)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!found_memslot) {
|
|
|
|
found_memslot = true;
|
|
|
|
KVM_MMU_LOCK(kvm);
|
|
|
|
if (!IS_KVM_NULL_FN(range->on_lock))
|
|
|
|
range->on_lock(kvm);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret |= range->handler(kvm, &gfn_range);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (range->flush_on_ret && ret)
|
|
|
|
kvm_flush_remote_tlbs(kvm);
|
|
|
|
|
|
|
|
if (found_memslot)
|
|
|
|
KVM_MMU_UNLOCK(kvm);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
|
|
|
|
struct kvm_gfn_range *range)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Unconditionally add the range to the invalidation set, regardless of
|
|
|
|
* whether or not the arch callback actually needs to zap SPTEs. E.g.
|
|
|
|
* if KVM supports RWX attributes in the future and the attributes are
|
|
|
|
* going from R=>RW, zapping isn't strictly necessary. Unconditionally
|
|
|
|
* adding the range allows KVM to require that MMU invalidations add at
|
|
|
|
* least one range between begin() and end(), e.g. allows KVM to detect
|
|
|
|
* bugs where the add() is missed. Relaxing the rule *might* be safe,
|
|
|
|
* but it's not obvious that allowing new mappings while the attributes
|
|
|
|
* are in flux is desirable or worth the complexity.
|
|
|
|
*/
|
|
|
|
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
|
|
|
|
|
|
|
|
return kvm_arch_pre_set_memory_attributes(kvm, range);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Set @attributes for the gfn range [@start, @end). */
|
|
|
|
static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
|
|
|
|
unsigned long attributes)
|
|
|
|
{
|
|
|
|
struct kvm_mmu_notifier_range pre_set_range = {
|
|
|
|
.start = start,
|
|
|
|
.end = end,
|
|
|
|
.handler = kvm_pre_set_memory_attributes,
|
|
|
|
.on_lock = kvm_mmu_invalidate_begin,
|
|
|
|
.flush_on_ret = true,
|
|
|
|
.may_block = true,
|
|
|
|
};
|
|
|
|
struct kvm_mmu_notifier_range post_set_range = {
|
|
|
|
.start = start,
|
|
|
|
.end = end,
|
|
|
|
.arg.attributes = attributes,
|
|
|
|
.handler = kvm_arch_post_set_memory_attributes,
|
|
|
|
.on_lock = kvm_mmu_invalidate_end,
|
|
|
|
.may_block = true,
|
|
|
|
};
|
|
|
|
unsigned long i;
|
|
|
|
void *entry;
|
|
|
|
int r = 0;
|
|
|
|
|
|
|
|
entry = attributes ? xa_mk_value(attributes) : NULL;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
/* Nothing to do if the entire range as the desired attributes. */
|
|
|
|
if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reserve memory ahead of time to avoid having to deal with failures
|
|
|
|
* partway through setting the new attributes.
|
|
|
|
*/
|
|
|
|
for (i = start; i < end; i++) {
|
|
|
|
r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
|
|
|
|
if (r)
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
kvm_handle_gfn_range(kvm, &pre_set_range);
|
|
|
|
|
|
|
|
for (i = start; i < end; i++) {
|
|
|
|
r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
|
|
|
|
GFP_KERNEL_ACCOUNT));
|
|
|
|
KVM_BUG_ON(r, kvm);
|
|
|
|
}
|
|
|
|
|
|
|
|
kvm_handle_gfn_range(kvm, &post_set_range);
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
|
|
|
|
struct kvm_memory_attributes *attrs)
|
|
|
|
{
|
|
|
|
gfn_t start, end;
|
|
|
|
|
|
|
|
/* flags is currently not used. */
|
|
|
|
if (attrs->flags)
|
|
|
|
return -EINVAL;
|
|
|
|
if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
|
|
|
|
return -EINVAL;
|
|
|
|
if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
start = attrs->address >> PAGE_SHIFT;
|
|
|
|
end = (attrs->address + attrs->size) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* xarray tracks data using "unsigned long", and as a result so does
|
|
|
|
* KVM. For simplicity, supports generic attributes only on 64-bit
|
|
|
|
* architectures.
|
|
|
|
*/
|
|
|
|
BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
|
|
|
|
|
|
|
|
return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
|
|
|
|
|
2010-10-18 13:22:23 +00:00
|
|
|
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return __gfn_to_memslot(kvm_memslots(kvm), gfn);
|
|
|
|
}
|
2010-06-21 08:44:20 +00:00
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_memslot);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
|
|
|
|
{
|
2021-08-04 22:28:40 +00:00
|
|
|
struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
u64 gen = slots->generation;
|
2021-08-04 22:28:40 +00:00
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
/*
|
|
|
|
* This also protects against using a memslot from a different address space,
|
|
|
|
* since different address spaces have different generation numbers.
|
|
|
|
*/
|
|
|
|
if (unlikely(gen != vcpu->last_used_slot_gen)) {
|
|
|
|
vcpu->last_used_slot = NULL;
|
|
|
|
vcpu->last_used_slot_gen = gen;
|
|
|
|
}
|
|
|
|
|
|
|
|
slot = try_get_memslot(vcpu->last_used_slot, gfn);
|
2021-08-04 22:28:40 +00:00
|
|
|
if (slot)
|
|
|
|
return slot;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fall back to searching all memslots. We purposely use
|
|
|
|
* search_memslots() instead of __gfn_to_memslot() to avoid
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
* thrashing the VM-wide last_used_slot in kvm_memslots.
|
2021-08-04 22:28:40 +00:00
|
|
|
*/
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
slot = search_memslots(slots, gfn, false);
|
2021-08-04 22:28:40 +00:00
|
|
|
if (slot) {
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 19:54:30 +00:00
|
|
|
vcpu->last_used_slot = slot;
|
2021-08-04 22:28:40 +00:00
|
|
|
return slot;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
|
|
|
|
2015-11-14 03:21:06 +00:00
|
|
|
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
|
2007-10-24 21:57:46 +00:00
|
|
|
{
|
2011-11-24 09:40:57 +00:00
|
|
|
struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);
|
2007-10-24 21:57:46 +00:00
|
|
|
|
2020-04-16 13:48:07 +00:00
|
|
|
return kvm_is_visible_memslot(memslot);
|
2007-10-24 21:57:46 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
|
|
|
|
|
2020-07-08 14:00:23 +00:00
|
|
|
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
|
|
|
|
|
|
|
return kvm_is_visible_memslot(memslot);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_is_visible_gfn);
|
|
|
|
|
2020-01-08 20:24:37 +00:00
|
|
|
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
|
2010-01-28 11:37:56 +00:00
|
|
|
{
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
unsigned long addr, size;
|
|
|
|
|
|
|
|
size = PAGE_SIZE;
|
|
|
|
|
2020-01-08 20:24:38 +00:00
|
|
|
addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);
|
2010-01-28 11:37:56 +00:00
|
|
|
if (kvm_is_error_hva(addr))
|
|
|
|
return PAGE_SIZE;
|
|
|
|
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_read_lock(current->mm);
|
2010-01-28 11:37:56 +00:00
|
|
|
vma = find_vma(current->mm, addr);
|
|
|
|
if (!vma)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
size = vma_kernel_pagesize(vma);
|
|
|
|
|
|
|
|
out:
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_read_unlock(current->mm);
|
2010-01-28 11:37:56 +00:00
|
|
|
|
|
|
|
return size;
|
|
|
|
}
|
|
|
|
|
2021-11-15 23:45:58 +00:00
|
|
|
static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
|
2012-08-21 03:02:51 +00:00
|
|
|
{
|
|
|
|
return slot->flags & KVM_MEM_READONLY;
|
|
|
|
}
|
|
|
|
|
2021-11-15 23:45:58 +00:00
|
|
|
static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
|
2012-08-21 03:02:51 +00:00
|
|
|
gfn_t *nr_pages, bool write)
|
2007-11-11 20:05:04 +00:00
|
|
|
{
|
2009-12-23 16:35:21 +00:00
|
|
|
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
|
2012-08-21 03:01:50 +00:00
|
|
|
return KVM_HVA_ERR_BAD;
|
2010-08-22 11:11:43 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
if (memslot_is_readonly(slot) && write)
|
|
|
|
return KVM_HVA_ERR_RO_BAD;
|
2010-08-22 11:11:43 +00:00
|
|
|
|
|
|
|
if (nr_pages)
|
|
|
|
*nr_pages = slot->npages - (gfn - slot->base_gfn);
|
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
return __gfn_to_hva_memslot(slot, gfn);
|
2007-11-11 20:05:04 +00:00
|
|
|
}
|
2010-08-22 11:11:43 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
gfn_t *nr_pages)
|
|
|
|
{
|
|
|
|
return __gfn_to_hva_many(slot, gfn, nr_pages, true);
|
2007-11-11 20:05:04 +00:00
|
|
|
}
|
2010-08-22 11:11:43 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot,
|
2013-12-29 20:12:29 +00:00
|
|
|
gfn_t gfn)
|
2012-08-21 03:02:51 +00:00
|
|
|
{
|
|
|
|
return gfn_to_hva_many(slot, gfn, NULL);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_hva_memslot);
|
|
|
|
|
2010-08-22 11:11:43 +00:00
|
|
|
unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
2010-10-18 13:22:23 +00:00
|
|
|
return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
|
2010-08-22 11:11:43 +00:00
|
|
|
}
|
2008-04-25 13:44:50 +00:00
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_hva);
|
2007-11-11 20:05:04 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return gfn_to_hva_many(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, NULL);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_hva);
|
|
|
|
|
2012-08-21 02:59:53 +00:00
|
|
|
/*
|
2018-10-09 02:41:15 +00:00
|
|
|
* Return the hva of a @gfn and the R/W attribute if possible.
|
|
|
|
*
|
|
|
|
* @slot: the kvm_memory_slot which contains @gfn
|
|
|
|
* @gfn: the gfn to be translated
|
|
|
|
* @writable: used to return the read/write attribute of the @slot if the hva
|
|
|
|
* is valid and @writable is not NULL
|
2012-08-21 02:59:53 +00:00
|
|
|
*/
|
2014-08-19 10:15:00 +00:00
|
|
|
unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot,
|
|
|
|
gfn_t gfn, bool *writable)
|
2012-08-21 02:59:53 +00:00
|
|
|
{
|
2013-10-01 16:58:36 +00:00
|
|
|
unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
|
|
|
|
|
|
|
|
if (!kvm_is_error_hva(hva) && writable)
|
2013-09-09 11:52:33 +00:00
|
|
|
*writable = !memslot_is_readonly(slot);
|
|
|
|
|
2013-10-01 16:58:36 +00:00
|
|
|
return hva;
|
2012-08-21 02:59:53 +00:00
|
|
|
}
|
|
|
|
|
2014-08-19 10:15:00 +00:00
|
|
|
unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
|
|
|
|
|
|
|
|
return gfn_to_hva_memslot_prot(slot, gfn, writable);
|
|
|
|
}
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
|
|
|
|
|
|
|
return gfn_to_hva_memslot_prot(slot, gfn, writable);
|
|
|
|
}
|
|
|
|
|
2011-01-30 03:15:49 +00:00
|
|
|
static inline int check_user_page_hwpoison(unsigned long addr)
|
|
|
|
{
|
mm: unexport __get_user_pages()
This patch unexports the low-level __get_user_pages() function.
Recent refactoring of the get_user_pages* functions allow flags to be
passed through get_user_pages() which eliminates the need for access to
this function from its one user, kvm.
We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:
get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-24 09:57:25 +00:00
|
|
|
int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
|
2011-01-30 03:15:49 +00:00
|
|
|
|
2023-05-17 19:25:33 +00:00
|
|
|
rc = get_user_pages(addr, 1, flags, NULL);
|
2011-01-30 03:15:49 +00:00
|
|
|
return rc == -EHWPOISON;
|
|
|
|
}
|
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
/*
|
2018-07-27 15:44:41 +00:00
|
|
|
* The fast path to get the writable pfn which will be stored in @pfn,
|
|
|
|
* true indicates success, otherwise false is returned. It's also the
|
2019-12-11 06:26:25 +00:00
|
|
|
* only part that runs if we can in atomic context.
|
2012-08-21 03:00:22 +00:00
|
|
|
*/
|
2018-07-27 15:44:41 +00:00
|
|
|
static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
|
|
|
|
bool *writable, kvm_pfn_t *pfn)
|
2007-03-30 11:02:32 +00:00
|
|
|
{
|
2007-10-18 14:59:34 +00:00
|
|
|
struct page *page[1];
|
2007-03-30 11:02:32 +00:00
|
|
|
|
2012-08-21 03:00:49 +00:00
|
|
|
/*
|
|
|
|
* Fast pin a writable pfn only if it is a write fault request
|
|
|
|
* or the caller allows to map a writable pfn for a read fault
|
|
|
|
* request.
|
|
|
|
*/
|
|
|
|
if (!(write_fault || writable))
|
|
|
|
return false;
|
2010-10-22 16:18:18 +00:00
|
|
|
|
2020-06-08 04:40:55 +00:00
|
|
|
if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
|
2012-08-21 03:00:22 +00:00
|
|
|
*pfn = page_to_pfn(page[0]);
|
2010-10-22 16:18:18 +00:00
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
if (writable)
|
|
|
|
*writable = true;
|
|
|
|
return true;
|
|
|
|
}
|
2010-10-14 09:22:46 +00:00
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
return false;
|
|
|
|
}
|
2010-10-22 16:18:18 +00:00
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
/*
|
|
|
|
* The slow path to get the pfn of the specified host virtual address,
|
|
|
|
* 1 indicates success, -errno is returned if error is detected.
|
|
|
|
*/
|
|
|
|
static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
|
2022-10-11 19:58:08 +00:00
|
|
|
bool interruptible, bool *writable, kvm_pfn_t *pfn)
|
2012-08-21 03:00:22 +00:00
|
|
|
{
|
2023-08-03 14:32:04 +00:00
|
|
|
/*
|
|
|
|
* When a VCPU accesses a page that is not mapped into the secondary
|
|
|
|
* MMU, we lookup the page using GUP to map it, so the guest VCPU can
|
|
|
|
* make progress. We always want to honor NUMA hinting faults in that
|
|
|
|
* case, because GUP usage corresponds to memory accesses from the VCPU.
|
|
|
|
* Otherwise, we'd not trigger NUMA hinting faults once a page is
|
|
|
|
* mapped into the secondary MMU and gets accessed by a VCPU.
|
|
|
|
*
|
|
|
|
* Note that get_user_page_fast_only() and FOLL_WRITE for now
|
|
|
|
* implicitly honor NUMA hinting faults and don't need this flag.
|
|
|
|
*/
|
|
|
|
unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;
|
2017-11-19 22:47:33 +00:00
|
|
|
struct page *page;
|
2022-08-19 02:28:04 +00:00
|
|
|
int npages;
|
2010-10-22 16:18:18 +00:00
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
might_sleep();
|
|
|
|
|
|
|
|
if (writable)
|
|
|
|
*writable = write_fault;
|
|
|
|
|
2017-11-19 22:47:33 +00:00
|
|
|
if (write_fault)
|
|
|
|
flags |= FOLL_WRITE;
|
|
|
|
if (async)
|
|
|
|
flags |= FOLL_NOWAIT;
|
2022-10-11 19:58:08 +00:00
|
|
|
if (interruptible)
|
|
|
|
flags |= FOLL_INTERRUPTIBLE;
|
2016-10-13 00:20:12 +00:00
|
|
|
|
2017-11-19 22:47:33 +00:00
|
|
|
npages = get_user_pages_unlocked(addr, 1, &page, flags);
|
2012-08-21 03:00:22 +00:00
|
|
|
if (npages != 1)
|
|
|
|
return npages;
|
|
|
|
|
|
|
|
/* map read fault as writable if possible */
|
2012-08-21 03:00:49 +00:00
|
|
|
if (unlikely(!write_fault) && writable) {
|
2017-11-19 22:47:33 +00:00
|
|
|
struct page *wpage;
|
2012-08-21 03:00:22 +00:00
|
|
|
|
2020-06-08 04:40:55 +00:00
|
|
|
if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
|
2012-08-21 03:00:22 +00:00
|
|
|
*writable = true;
|
2017-11-19 22:47:33 +00:00
|
|
|
put_page(page);
|
|
|
|
page = wpage;
|
2010-10-22 16:18:18 +00:00
|
|
|
}
|
2010-08-22 11:10:28 +00:00
|
|
|
}
|
2017-11-19 22:47:33 +00:00
|
|
|
*pfn = page_to_pfn(page);
|
2012-08-21 03:00:22 +00:00
|
|
|
return npages;
|
|
|
|
}
|
2007-11-11 20:05:04 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
|
|
|
|
{
|
|
|
|
if (unlikely(!(vma->vm_flags & VM_READ)))
|
|
|
|
return false;
|
2008-04-30 20:37:07 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
if (write_fault && (unlikely(!(vma->vm_flags & VM_WRITE))))
|
|
|
|
return false;
|
2010-08-22 11:10:28 +00:00
|
|
|
|
2012-08-21 03:02:51 +00:00
|
|
|
return true;
|
|
|
|
}
|
2010-05-31 06:28:19 +00:00
|
|
|
|
2021-06-24 12:29:04 +00:00
|
|
|
static int kvm_try_get_pfn(kvm_pfn_t pfn)
|
|
|
|
{
|
2022-04-29 01:04:15 +00:00
|
|
|
struct page *page = kvm_pfn_to_refcounted_page(pfn);
|
|
|
|
|
|
|
|
if (!page)
|
2021-06-24 12:29:04 +00:00
|
|
|
return 1;
|
2022-04-29 01:04:15 +00:00
|
|
|
|
|
|
|
return get_page_unless_zero(page);
|
2021-06-24 12:29:04 +00:00
|
|
|
}
|
|
|
|
|
2016-06-07 14:22:47 +00:00
|
|
|
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
|
2022-01-24 02:04:56 +00:00
|
|
|
unsigned long addr, bool write_fault,
|
|
|
|
bool *writable, kvm_pfn_t *p_pfn)
|
2016-06-07 14:22:47 +00:00
|
|
|
{
|
2021-02-08 20:19:40 +00:00
|
|
|
kvm_pfn_t pfn;
|
2021-02-01 10:12:11 +00:00
|
|
|
pte_t *ptep;
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
pte_t pte;
|
2021-02-01 10:12:11 +00:00
|
|
|
spinlock_t *ptl;
|
2016-06-07 15:51:18 +00:00
|
|
|
int r;
|
|
|
|
|
2024-04-10 15:55:26 +00:00
|
|
|
r = follow_pte(vma, addr, &ptep, &ptl);
|
2016-06-07 15:51:18 +00:00
|
|
|
if (r) {
|
|
|
|
/*
|
|
|
|
* get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
|
|
|
|
* not call the fault handler, so do it here.
|
|
|
|
*/
|
|
|
|
bool unlocked = false;
|
2020-08-12 01:39:01 +00:00
|
|
|
r = fixup_user_fault(current->mm, addr,
|
2016-06-07 15:51:18 +00:00
|
|
|
(write_fault ? FAULT_FLAG_WRITE : 0),
|
|
|
|
&unlocked);
|
2020-05-29 09:42:55 +00:00
|
|
|
if (unlocked)
|
|
|
|
return -EAGAIN;
|
2016-06-07 15:51:18 +00:00
|
|
|
if (r)
|
|
|
|
return r;
|
|
|
|
|
2024-04-10 15:55:26 +00:00
|
|
|
r = follow_pte(vma, addr, &ptep, &ptl);
|
2016-06-07 15:51:18 +00:00
|
|
|
if (r)
|
|
|
|
return r;
|
2021-02-01 10:12:11 +00:00
|
|
|
}
|
2016-06-07 15:51:18 +00:00
|
|
|
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
pte = ptep_get(ptep);
|
|
|
|
|
|
|
|
if (write_fault && !pte_write(pte)) {
|
2021-02-01 10:12:11 +00:00
|
|
|
pfn = KVM_PFN_ERR_RO_FAULT;
|
|
|
|
goto out;
|
2016-06-07 15:51:18 +00:00
|
|
|
}
|
|
|
|
|
2018-01-17 18:18:56 +00:00
|
|
|
if (writable)
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
*writable = pte_write(pte);
|
|
|
|
pfn = pte_pfn(pte);
|
2016-06-07 15:51:18 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a reference here because callers of *hva_to_pfn* and
|
|
|
|
* *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
|
|
|
|
* returned pfn. This is only needed if the VMA has VM_MIXEDMAP
|
2021-07-26 15:35:52 +00:00
|
|
|
* set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
|
2016-06-07 15:51:18 +00:00
|
|
|
* simply do nothing for reserved pfns.
|
|
|
|
*
|
|
|
|
* Whoever called remap_pfn_range is also going to call e.g.
|
|
|
|
* unmap_mapping_range before the underlying pages are freed,
|
|
|
|
* causing a call to our MMU notifier.
|
2021-06-24 12:29:04 +00:00
|
|
|
*
|
|
|
|
* Certain IO or PFNMAP mappings can be backed with valid
|
|
|
|
* struct pages, but be allocated without refcounting e.g.,
|
|
|
|
* tail pages of non-compound higher order allocations, which
|
|
|
|
* would then underflow the refcount when the caller does the
|
|
|
|
* required put_page. Don't allow those pages here.
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
*/
|
2021-06-24 12:29:04 +00:00
|
|
|
if (!kvm_try_get_pfn(pfn))
|
|
|
|
r = -EFAULT;
|
2016-06-07 15:51:18 +00:00
|
|
|
|
2021-02-01 10:12:11 +00:00
|
|
|
out:
|
|
|
|
pte_unmap_unlock(ptep, ptl);
|
2016-06-07 15:51:18 +00:00
|
|
|
*p_pfn = pfn;
|
2021-06-24 12:29:04 +00:00
|
|
|
|
|
|
|
return r;
|
2016-06-07 14:22:47 +00:00
|
|
|
}
|
|
|
|
|
2012-08-21 03:00:49 +00:00
|
|
|
/*
|
|
|
|
* Pin guest page in memory and return its pfn.
|
|
|
|
* @addr: host virtual address which maps memory to the guest
|
2024-02-15 23:53:52 +00:00
|
|
|
* @atomic: whether this function is forbidden from sleeping
|
2022-10-11 19:58:08 +00:00
|
|
|
* @interruptible: whether the process can be interrupted by non-fatal signals
|
2012-08-21 03:00:49 +00:00
|
|
|
* @async: whether this function need to wait IO complete if the
|
|
|
|
* host page is not in the memory
|
|
|
|
* @write_fault: whether we should get a writable host page
|
|
|
|
* @writable: whether it allows to map a writable host page for !@write_fault
|
|
|
|
*
|
|
|
|
* The function will map a writable host page for these two cases:
|
|
|
|
* 1): @write_fault = true
|
|
|
|
* 2): @write_fault = false && @writable, @writable will tell the caller
|
|
|
|
* whether the mapping is writable.
|
|
|
|
*/
|
2022-10-11 19:58:08 +00:00
|
|
|
kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
|
|
|
|
bool *async, bool write_fault, bool *writable)
|
2012-08-21 03:00:22 +00:00
|
|
|
{
|
|
|
|
struct vm_area_struct *vma;
|
2022-04-29 01:04:07 +00:00
|
|
|
kvm_pfn_t pfn;
|
2016-06-07 14:22:47 +00:00
|
|
|
int npages, r;
|
2008-04-30 20:37:07 +00:00
|
|
|
|
2012-08-21 03:00:22 +00:00
|
|
|
/* we can do it either atomically or asynchronously, not both */
|
|
|
|
BUG_ON(atomic && async);
|
2007-10-18 14:59:34 +00:00
|
|
|
|
2018-07-27 15:44:41 +00:00
|
|
|
if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
|
2012-08-21 03:00:22 +00:00
|
|
|
return pfn;
|
|
|
|
|
|
|
|
if (atomic)
|
|
|
|
return KVM_PFN_ERR_FAULT;
|
|
|
|
|
2022-10-11 19:58:08 +00:00
|
|
|
npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
|
|
|
|
writable, &pfn);
|
2012-08-21 03:00:22 +00:00
|
|
|
if (npages == 1)
|
|
|
|
return pfn;
|
2022-10-11 19:58:07 +00:00
|
|
|
if (npages == -EINTR)
|
|
|
|
return KVM_PFN_ERR_SIGPENDING;
|
2007-10-18 14:59:34 +00:00
|
|
|
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_read_lock(current->mm);
|
2012-08-21 03:00:22 +00:00
|
|
|
if (npages == -EHWPOISON ||
|
|
|
|
(!async && check_user_page_hwpoison(addr))) {
|
|
|
|
pfn = KVM_PFN_ERR_HWPOISON;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
2020-05-29 09:42:55 +00:00
|
|
|
retry:
|
2021-06-29 02:39:17 +00:00
|
|
|
vma = vma_lookup(current->mm, addr);
|
2012-08-21 03:00:22 +00:00
|
|
|
|
|
|
|
if (vma == NULL)
|
|
|
|
pfn = KVM_PFN_ERR_FAULT;
|
2016-06-07 14:22:47 +00:00
|
|
|
else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
|
2022-01-24 02:04:56 +00:00
|
|
|
r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
|
2020-05-29 09:42:55 +00:00
|
|
|
if (r == -EAGAIN)
|
|
|
|
goto retry;
|
2016-06-07 14:22:47 +00:00
|
|
|
if (r < 0)
|
|
|
|
pfn = KVM_PFN_ERR_FAULT;
|
2012-08-21 03:00:22 +00:00
|
|
|
} else {
|
2012-08-21 03:02:51 +00:00
|
|
|
if (async && vma_is_valid(vma, write_fault))
|
2012-08-21 03:00:22 +00:00
|
|
|
*async = true;
|
|
|
|
pfn = KVM_PFN_ERR_FAULT;
|
|
|
|
}
|
|
|
|
exit:
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_read_unlock(current->mm);
|
2008-04-30 20:37:07 +00:00
|
|
|
return pfn;
|
2008-04-02 19:46:56 +00:00
|
|
|
}
|
|
|
|
|
2021-11-15 23:45:58 +00:00
|
|
|
kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
|
2022-10-11 19:58:08 +00:00
|
|
|
bool atomic, bool interruptible, bool *async,
|
|
|
|
bool write_fault, bool *writable, hva_t *hva)
|
2010-08-22 11:10:28 +00:00
|
|
|
{
|
2012-08-21 03:02:51 +00:00
|
|
|
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
|
|
|
|
|
2021-02-22 02:45:22 +00:00
|
|
|
if (hva)
|
|
|
|
*hva = addr;
|
|
|
|
|
2016-02-23 14:36:01 +00:00
|
|
|
if (kvm_is_error_hva(addr)) {
|
|
|
|
if (writable)
|
|
|
|
*writable = false;
|
2024-02-15 23:53:55 +00:00
|
|
|
|
|
|
|
return addr == KVM_HVA_ERR_RO_BAD ? KVM_PFN_ERR_RO_FAULT :
|
|
|
|
KVM_PFN_NOSLOT;
|
2016-02-23 14:36:01 +00:00
|
|
|
}
|
2012-08-21 03:02:51 +00:00
|
|
|
|
|
|
|
/* Do not map writable pfn in the readonly memslot. */
|
|
|
|
if (writable && memslot_is_readonly(slot)) {
|
|
|
|
*writable = false;
|
|
|
|
writable = NULL;
|
|
|
|
}
|
|
|
|
|
2022-10-11 19:58:08 +00:00
|
|
|
return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
|
2012-08-21 03:02:51 +00:00
|
|
|
writable);
|
2010-08-22 11:10:28 +00:00
|
|
|
}
|
2015-04-02 09:20:48 +00:00
|
|
|
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
|
2010-08-22 11:10:28 +00:00
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
|
2010-10-22 16:18:18 +00:00
|
|
|
bool *writable)
|
|
|
|
{
|
2022-10-11 19:58:08 +00:00
|
|
|
return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
|
|
|
|
NULL, write_fault, writable, NULL);
|
2010-10-22 16:18:18 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
|
|
|
|
|
2021-11-15 23:45:58 +00:00
|
|
|
kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
|
2009-12-23 16:35:19 +00:00
|
|
|
{
|
2022-10-11 19:58:08 +00:00
|
|
|
return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
|
|
|
|
NULL, NULL);
|
2009-12-23 16:35:19 +00:00
|
|
|
}
|
2015-05-19 14:09:04 +00:00
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
|
2009-12-23 16:35:19 +00:00
|
|
|
|
2021-11-15 23:45:58 +00:00
|
|
|
kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
|
2009-12-23 16:35:19 +00:00
|
|
|
{
|
2022-10-11 19:58:08 +00:00
|
|
|
return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
|
|
|
|
NULL, NULL);
|
2009-12-23 16:35:19 +00:00
|
|
|
}
|
2012-08-21 02:59:12 +00:00
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
|
2009-12-23 16:35:19 +00:00
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
|
2015-05-17 11:58:53 +00:00
|
|
|
{
|
|
|
|
return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
|
2015-05-19 14:09:04 +00:00
|
|
|
{
|
|
|
|
return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_pfn);
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
|
2015-05-17 11:58:53 +00:00
|
|
|
{
|
|
|
|
return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
|
|
|
|
|
2015-05-19 14:01:50 +00:00
|
|
|
int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
struct page **pages, int nr_pages)
|
2010-08-22 11:11:43 +00:00
|
|
|
{
|
|
|
|
unsigned long addr;
|
2017-08-10 12:14:39 +00:00
|
|
|
gfn_t entry = 0;
|
2010-08-22 11:11:43 +00:00
|
|
|
|
2015-05-19 14:01:50 +00:00
|
|
|
addr = gfn_to_hva_many(slot, gfn, &entry);
|
2010-08-22 11:11:43 +00:00
|
|
|
if (kvm_is_error_hva(addr))
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (entry < nr_pages)
|
|
|
|
return 0;
|
|
|
|
|
2020-06-08 04:40:55 +00:00
|
|
|
return get_user_pages_fast_only(addr, nr_pages, FOLL_WRITE, pages);
|
2010-08-22 11:11:43 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic);
|
|
|
|
|
2022-04-29 01:04:13 +00:00
|
|
|
/*
|
|
|
|
* Do not use this helper unless you are absolutely certain the gfn _must_ be
|
|
|
|
* backed by 'struct page'. A valid example is if the backing memslot is
|
|
|
|
* controlled by KVM. Note, if the returned page is valid, it's refcount has
|
|
|
|
* been elevated by gfn_to_pfn().
|
|
|
|
*/
|
2008-04-02 19:46:56 +00:00
|
|
|
struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
2022-04-29 01:04:15 +00:00
|
|
|
struct page *page;
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
kvm_pfn_t pfn;
|
2008-04-30 20:37:07 +00:00
|
|
|
|
|
|
|
pfn = gfn_to_pfn(kvm, gfn);
|
|
|
|
|
2012-10-16 12:10:59 +00:00
|
|
|
if (is_error_noslot_pfn(pfn))
|
2012-08-03 07:42:10 +00:00
|
|
|
return KVM_ERR_PTR_BAD_PAGE;
|
2012-07-26 03:58:59 +00:00
|
|
|
|
2022-04-29 01:04:15 +00:00
|
|
|
page = kvm_pfn_to_refcounted_page(pfn);
|
|
|
|
if (!page)
|
2012-08-03 07:41:22 +00:00
|
|
|
return KVM_ERR_PTR_BAD_PAGE;
|
2012-07-26 03:58:59 +00:00
|
|
|
|
2022-04-29 01:04:15 +00:00
|
|
|
return page;
|
2007-03-30 11:02:32 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(gfn_to_page);
|
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
|
2019-12-05 01:30:51 +00:00
|
|
|
{
|
|
|
|
if (dirty)
|
|
|
|
kvm_release_pfn_dirty(pfn);
|
|
|
|
else
|
|
|
|
kvm_release_pfn_clean(pfn);
|
|
|
|
}
|
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
|
2019-01-31 20:24:34 +00:00
|
|
|
{
|
|
|
|
kvm_pfn_t pfn;
|
|
|
|
void *hva = NULL;
|
|
|
|
struct page *page = KVM_UNMAPPED_PAGE;
|
|
|
|
|
|
|
|
if (!map)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
pfn = gfn_to_pfn(vcpu->kvm, gfn);
|
2019-01-31 20:24:34 +00:00
|
|
|
if (is_error_noslot_pfn(pfn))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (pfn_valid(pfn)) {
|
|
|
|
page = pfn_to_page(pfn);
|
2021-11-15 16:50:27 +00:00
|
|
|
hva = kmap(page);
|
2019-05-20 10:06:36 +00:00
|
|
|
#ifdef CONFIG_HAS_IOMEM
|
2019-12-05 01:30:51 +00:00
|
|
|
} else {
|
2021-11-15 16:50:27 +00:00
|
|
|
hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
|
2019-05-20 10:06:36 +00:00
|
|
|
#endif
|
2019-01-31 20:24:34 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!hva)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
map->page = page;
|
|
|
|
map->hva = hva;
|
|
|
|
map->pfn = pfn;
|
|
|
|
map->gfn = gfn;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_map);
|
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
|
2019-01-31 20:24:34 +00:00
|
|
|
{
|
|
|
|
if (!map)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!map->hva)
|
|
|
|
return;
|
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
if (map->page != KVM_UNMAPPED_PAGE)
|
|
|
|
kunmap(map->page);
|
2019-05-27 08:28:25 +00:00
|
|
|
#ifdef CONFIG_HAS_IOMEM
|
2019-12-05 01:30:51 +00:00
|
|
|
else
|
2021-11-15 16:50:27 +00:00
|
|
|
memunmap(map->hva);
|
2019-05-27 08:28:25 +00:00
|
|
|
#endif
|
2019-01-31 20:24:34 +00:00
|
|
|
|
2019-12-05 01:30:51 +00:00
|
|
|
if (dirty)
|
2021-11-15 16:50:27 +00:00
|
|
|
kvm_vcpu_mark_page_dirty(vcpu, map->gfn);
|
2019-12-05 01:30:51 +00:00
|
|
|
|
2021-11-15 16:50:27 +00:00
|
|
|
kvm_release_pfn(map->pfn, dirty);
|
2019-01-31 20:24:34 +00:00
|
|
|
|
|
|
|
map->hva = NULL;
|
|
|
|
map->page = NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
|
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
static bool kvm_is_ad_tracked_page(struct page *page)
|
2015-05-17 11:58:53 +00:00
|
|
|
{
|
2022-04-29 01:04:10 +00:00
|
|
|
/*
|
|
|
|
* Per page-flags.h, pages tagged PG_reserved "should in general not be
|
|
|
|
* touched (e.g. set dirty) except by its owner".
|
|
|
|
*/
|
|
|
|
return !PageReserved(page);
|
|
|
|
}
|
2015-05-17 11:58:53 +00:00
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
static void kvm_set_page_dirty(struct page *page)
|
|
|
|
{
|
|
|
|
if (kvm_is_ad_tracked_page(page))
|
|
|
|
SetPageDirty(page);
|
|
|
|
}
|
2015-05-17 11:58:53 +00:00
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
static void kvm_set_page_accessed(struct page *page)
|
|
|
|
{
|
|
|
|
if (kvm_is_ad_tracked_page(page))
|
|
|
|
mark_page_accessed(page);
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
|
|
|
|
2007-11-20 09:49:33 +00:00
|
|
|
void kvm_release_page_clean(struct page *page)
|
|
|
|
{
|
2012-08-03 07:42:52 +00:00
|
|
|
WARN_ON(is_error_page(page));
|
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
kvm_set_page_accessed(page);
|
|
|
|
put_page(page);
|
2007-11-20 09:49:33 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_release_page_clean);
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
void kvm_release_pfn_clean(kvm_pfn_t pfn)
|
2008-04-02 19:46:56 +00:00
|
|
|
{
|
2022-04-29 01:04:15 +00:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
if (is_error_noslot_pfn(pfn))
|
|
|
|
return;
|
|
|
|
|
|
|
|
page = kvm_pfn_to_refcounted_page(pfn);
|
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
|
|
|
|
kvm_release_page_clean(page);
|
2008-04-02 19:46:56 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
|
|
|
|
|
2007-11-20 09:49:33 +00:00
|
|
|
void kvm_release_page_dirty(struct page *page)
|
2007-10-18 09:09:33 +00:00
|
|
|
{
|
2012-07-26 03:58:59 +00:00
|
|
|
WARN_ON(is_error_page(page));
|
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
kvm_set_page_dirty(page);
|
|
|
|
kvm_release_page_clean(page);
|
2008-04-02 19:46:56 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_release_page_dirty);
|
|
|
|
|
2017-09-01 15:11:43 +00:00
|
|
|
void kvm_release_pfn_dirty(kvm_pfn_t pfn)
|
2008-04-02 19:46:56 +00:00
|
|
|
{
|
2022-04-29 01:04:15 +00:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
if (is_error_noslot_pfn(pfn))
|
|
|
|
return;
|
|
|
|
|
|
|
|
page = kvm_pfn_to_refcounted_page(pfn);
|
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
|
|
|
|
kvm_release_page_dirty(page);
|
2008-04-02 19:46:56 +00:00
|
|
|
}
|
2017-09-01 15:11:43 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
|
2008-04-02 19:46:56 +00:00
|
|
|
|
2022-04-29 01:04:10 +00:00
|
|
|
/*
|
|
|
|
* Note, checking for an error/noslot pfn is the caller's responsibility when
|
|
|
|
* directly marking a page dirty/accessed. Unlike the "release" helpers, the
|
|
|
|
* "set" helpers are not to be used when the pfn might point at garbage.
|
|
|
|
*/
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
void kvm_set_pfn_dirty(kvm_pfn_t pfn)
|
2008-04-02 19:46:56 +00:00
|
|
|
{
|
2022-04-29 01:04:10 +00:00
|
|
|
if (WARN_ON(is_error_noslot_pfn(pfn)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (pfn_valid(pfn))
|
|
|
|
kvm_set_page_dirty(pfn_to_page(pfn));
|
2007-10-18 09:09:33 +00:00
|
|
|
}
|
2008-04-02 19:46:56 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
|
|
|
void kvm_set_pfn_accessed(kvm_pfn_t pfn)
|
2008-04-02 19:46:56 +00:00
|
|
|
{
|
2022-04-29 01:04:10 +00:00
|
|
|
if (WARN_ON(is_error_noslot_pfn(pfn)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (pfn_valid(pfn))
|
|
|
|
kvm_set_page_accessed(pfn_to_page(pfn));
|
2008-04-02 19:46:56 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
|
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
static int next_segment(unsigned long len, int offset)
|
|
|
|
{
|
|
|
|
if (len > PAGE_SIZE - offset)
|
|
|
|
return PAGE_SIZE - offset;
|
|
|
|
else
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2024-02-15 23:53:53 +00:00
|
|
|
/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
|
2015-05-17 11:58:53 +00:00
|
|
|
static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
void *data, int offset, int len)
|
2007-10-01 20:14:18 +00:00
|
|
|
{
|
2007-11-11 20:10:22 +00:00
|
|
|
int r;
|
|
|
|
unsigned long addr;
|
2007-10-01 20:14:18 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
|
2007-11-11 20:10:22 +00:00
|
|
|
if (kvm_is_error_hva(addr))
|
|
|
|
return -EFAULT;
|
2015-04-02 12:08:20 +00:00
|
|
|
r = __copy_from_user(data, (void __user *)addr + offset, len);
|
2007-11-11 20:10:22 +00:00
|
|
|
if (r)
|
2007-10-01 20:14:18 +00:00
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
2015-05-17 11:58:53 +00:00
|
|
|
|
|
|
|
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
|
|
|
|
int len)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
|
|
|
|
|
|
|
|
return __kvm_read_guest_page(slot, gfn, data, offset, len);
|
|
|
|
}
|
2007-10-01 20:14:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_read_guest_page);
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
|
|
|
|
int offset, int len)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
|
|
|
|
|
|
|
return __kvm_read_guest_page(slot, gfn, data, offset, len);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
|
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len)
|
|
|
|
{
|
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
|
|
|
int seg;
|
|
|
|
int offset = offset_in_page(gpa);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
while ((seg = next_segment(len, offset)) != 0) {
|
|
|
|
ret = kvm_read_guest_page(kvm, gfn, data, offset, seg);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
offset = 0;
|
|
|
|
len -= seg;
|
|
|
|
data += seg;
|
|
|
|
++gfn;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_read_guest);
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
int kvm_vcpu_read_guest(struct kvm_vcpu *vcpu, gpa_t gpa, void *data, unsigned long len)
|
2007-12-21 00:18:23 +00:00
|
|
|
{
|
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
2015-05-17 11:58:53 +00:00
|
|
|
int seg;
|
2007-12-21 00:18:23 +00:00
|
|
|
int offset = offset_in_page(gpa);
|
2015-05-17 11:58:53 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while ((seg = next_segment(len, offset)) != 0) {
|
|
|
|
ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
offset = 0;
|
|
|
|
len -= seg;
|
|
|
|
data += seg;
|
|
|
|
++gfn;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest);
|
2007-12-21 00:18:23 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
static int __kvm_read_guest_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
void *data, int offset, unsigned long len)
|
|
|
|
{
|
|
|
|
int r;
|
|
|
|
unsigned long addr;
|
|
|
|
|
|
|
|
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
|
2007-12-21 00:18:23 +00:00
|
|
|
if (kvm_is_error_hva(addr))
|
|
|
|
return -EFAULT;
|
2008-01-30 18:57:35 +00:00
|
|
|
pagefault_disable();
|
2015-04-02 12:08:20 +00:00
|
|
|
r = __copy_from_user_inatomic(data, (void __user *)addr + offset, len);
|
2008-01-30 18:57:35 +00:00
|
|
|
pagefault_enable();
|
2007-12-21 00:18:23 +00:00
|
|
|
if (r)
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
|
|
|
|
void *data, unsigned long len)
|
|
|
|
{
|
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
|
|
|
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
|
|
|
int offset = offset_in_page(gpa);
|
|
|
|
|
|
|
|
return __kvm_read_guest_atomic(slot, gfn, data, offset, len);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
|
|
|
|
|
2024-02-15 23:53:53 +00:00
|
|
|
/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
|
2020-10-01 01:20:34 +00:00
|
|
|
static int __kvm_write_guest_page(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *memslot, gfn_t gfn,
|
2015-05-17 11:58:53 +00:00
|
|
|
const void *data, int offset, int len)
|
2007-10-01 20:14:18 +00:00
|
|
|
{
|
2007-11-11 20:10:22 +00:00
|
|
|
int r;
|
|
|
|
unsigned long addr;
|
2007-10-01 20:14:18 +00:00
|
|
|
|
2015-04-10 19:47:27 +00:00
|
|
|
addr = gfn_to_hva_memslot(memslot, gfn);
|
2007-11-11 20:10:22 +00:00
|
|
|
if (kvm_is_error_hva(addr))
|
|
|
|
return -EFAULT;
|
2011-05-15 15:22:04 +00:00
|
|
|
r = __copy_to_user((void __user *)addr + offset, data, len);
|
2007-11-11 20:10:22 +00:00
|
|
|
if (r)
|
2007-10-01 20:14:18 +00:00
|
|
|
return -EFAULT;
|
2020-10-01 01:20:34 +00:00
|
|
|
mark_page_dirty_in_slot(kvm, memslot, gfn);
|
2007-10-01 20:14:18 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2015-05-17 11:58:53 +00:00
|
|
|
|
|
|
|
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
|
|
|
|
const void *data, int offset, int len)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
|
|
|
|
|
2020-10-01 01:20:34 +00:00
|
|
|
return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
2007-10-01 20:14:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_write_guest_page);
|
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
|
|
|
|
const void *data, int offset, int len)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
|
|
|
|
2020-10-01 01:20:34 +00:00
|
|
|
return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
|
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
|
|
|
int seg;
|
|
|
|
int offset = offset_in_page(gpa);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
while ((seg = next_segment(len, offset)) != 0) {
|
|
|
|
ret = kvm_write_guest_page(kvm, gfn, data, offset, seg);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
offset = 0;
|
|
|
|
len -= seg;
|
|
|
|
data += seg;
|
|
|
|
++gfn;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2014-12-11 05:52:58 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_write_guest);
|
2007-10-01 20:14:18 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
|
|
|
int seg;
|
|
|
|
int offset = offset_in_page(gpa);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
while ((seg = next_segment(len, offset)) != 0) {
|
|
|
|
ret = kvm_vcpu_write_guest_page(vcpu, gfn, data, offset, seg);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
offset = 0;
|
|
|
|
len -= seg;
|
|
|
|
data += seg;
|
|
|
|
++gfn;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest);
|
|
|
|
|
2017-02-04 04:32:28 +00:00
|
|
|
static int __kvm_gfn_to_hva_cache_init(struct kvm_memslots *slots,
|
|
|
|
struct gfn_to_hva_cache *ghc,
|
|
|
|
gpa_t gpa, unsigned long len)
|
2010-10-18 13:22:23 +00:00
|
|
|
{
|
|
|
|
int offset = offset_in_page(gpa);
|
2013-03-29 16:35:21 +00:00
|
|
|
gfn_t start_gfn = gpa >> PAGE_SHIFT;
|
|
|
|
gfn_t end_gfn = (gpa + len - 1) >> PAGE_SHIFT;
|
|
|
|
gfn_t nr_pages_needed = end_gfn - start_gfn + 1;
|
|
|
|
gfn_t nr_pages_avail;
|
2010-10-18 13:22:23 +00:00
|
|
|
|
2020-01-09 19:58:55 +00:00
|
|
|
/* Update ghc->generation before performing any error checks. */
|
2010-10-18 13:22:23 +00:00
|
|
|
ghc->generation = slots->generation;
|
2020-01-09 19:58:55 +00:00
|
|
|
|
|
|
|
if (start_gfn > end_gfn) {
|
|
|
|
ghc->hva = KVM_HVA_ERR_BAD;
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2018-12-17 21:53:33 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the requested region crosses two memslots, we still
|
|
|
|
* verify that the entire region is valid here.
|
|
|
|
*/
|
2020-01-09 19:58:55 +00:00
|
|
|
for ( ; start_gfn <= end_gfn; start_gfn += nr_pages_avail) {
|
2018-12-17 21:53:33 +00:00
|
|
|
ghc->memslot = __gfn_to_memslot(slots, start_gfn);
|
|
|
|
ghc->hva = gfn_to_hva_many(ghc->memslot, start_gfn,
|
|
|
|
&nr_pages_avail);
|
|
|
|
if (kvm_is_error_hva(ghc->hva))
|
2020-01-09 19:58:55 +00:00
|
|
|
return -EFAULT;
|
2018-12-17 21:53:33 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Use the slow path for cross page reads and writes. */
|
2020-01-09 19:58:55 +00:00
|
|
|
if (nr_pages_needed == 1)
|
2010-10-18 13:22:23 +00:00
|
|
|
ghc->hva += offset;
|
2018-12-17 21:53:33 +00:00
|
|
|
else
|
2013-03-29 16:35:21 +00:00
|
|
|
ghc->memslot = NULL;
|
2018-12-17 21:53:33 +00:00
|
|
|
|
2020-01-09 19:58:55 +00:00
|
|
|
ghc->gpa = gpa;
|
|
|
|
ghc->len = len;
|
|
|
|
return 0;
|
2010-10-18 13:22:23 +00:00
|
|
|
}
|
2017-02-04 04:32:28 +00:00
|
|
|
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
2017-02-04 04:32:28 +00:00
|
|
|
gpa_t gpa, unsigned long len)
|
|
|
|
{
|
2017-05-02 14:20:18 +00:00
|
|
|
struct kvm_memslots *slots = kvm_memslots(kvm);
|
2017-02-04 04:32:28 +00:00
|
|
|
return __kvm_gfn_to_hva_cache_init(slots, ghc, gpa, len);
|
|
|
|
}
|
2017-05-02 14:20:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
|
2010-10-18 13:22:23 +00:00
|
|
|
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
2018-12-14 22:34:43 +00:00
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len)
|
2010-10-18 13:22:23 +00:00
|
|
|
{
|
2017-05-02 14:20:18 +00:00
|
|
|
struct kvm_memslots *slots = kvm_memslots(kvm);
|
2010-10-18 13:22:23 +00:00
|
|
|
int r;
|
2016-11-02 09:08:34 +00:00
|
|
|
gpa_t gpa = ghc->gpa + offset;
|
2010-10-18 13:22:23 +00:00
|
|
|
|
2021-11-22 23:24:01 +00:00
|
|
|
if (WARN_ON_ONCE(len + offset > ghc->len))
|
|
|
|
return -EINVAL;
|
2013-03-29 16:35:21 +00:00
|
|
|
|
2020-01-09 23:56:20 +00:00
|
|
|
if (slots->generation != ghc->generation) {
|
|
|
|
if (__kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len))
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
2013-03-29 16:35:21 +00:00
|
|
|
|
2010-10-18 13:22:23 +00:00
|
|
|
if (kvm_is_error_hva(ghc->hva))
|
|
|
|
return -EFAULT;
|
|
|
|
|
KVM: Check for a bad hva before dropping into the ghc slow path
When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handing cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.
Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it sit seems
prudent to clear ghc->memslot in the event of an error".
Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.
Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
Cc: Jim Mattson <jmattson@google.com>
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-09 23:56:18 +00:00
|
|
|
if (unlikely(!ghc->memslot))
|
|
|
|
return kvm_write_guest(kvm, gpa, data, len);
|
|
|
|
|
2016-11-02 09:08:34 +00:00
|
|
|
r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
|
2010-10-18 13:22:23 +00:00
|
|
|
if (r)
|
|
|
|
return -EFAULT;
|
2020-10-01 01:20:34 +00:00
|
|
|
mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
|
2010-10-18 13:22:23 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2017-05-02 14:20:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
|
2016-11-02 09:08:34 +00:00
|
|
|
|
2017-05-02 14:20:18 +00:00
|
|
|
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len)
|
2016-11-02 09:08:34 +00:00
|
|
|
{
|
2017-05-02 14:20:18 +00:00
|
|
|
return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
|
2016-11-02 09:08:34 +00:00
|
|
|
}
|
2017-05-02 14:20:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
|
2010-10-18 13:22:23 +00:00
|
|
|
|
2020-05-25 14:41:19 +00:00
|
|
|
int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len)
|
2011-07-11 19:28:11 +00:00
|
|
|
{
|
2017-05-02 14:20:18 +00:00
|
|
|
struct kvm_memslots *slots = kvm_memslots(kvm);
|
2011-07-11 19:28:11 +00:00
|
|
|
int r;
|
2020-05-25 14:41:19 +00:00
|
|
|
gpa_t gpa = ghc->gpa + offset;
|
2011-07-11 19:28:11 +00:00
|
|
|
|
2021-11-22 23:24:01 +00:00
|
|
|
if (WARN_ON_ONCE(len + offset > ghc->len))
|
|
|
|
return -EINVAL;
|
2013-03-29 16:35:21 +00:00
|
|
|
|
2020-01-09 23:56:20 +00:00
|
|
|
if (slots->generation != ghc->generation) {
|
|
|
|
if (__kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len))
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
2013-03-29 16:35:21 +00:00
|
|
|
|
2011-07-11 19:28:11 +00:00
|
|
|
if (kvm_is_error_hva(ghc->hva))
|
|
|
|
return -EFAULT;
|
|
|
|
|
KVM: Check for a bad hva before dropping into the ghc slow path
When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handing cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.
Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it sit seems
prudent to clear ghc->memslot in the event of an error".
Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.
Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
Cc: Jim Mattson <jmattson@google.com>
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-09 23:56:18 +00:00
|
|
|
if (unlikely(!ghc->memslot))
|
2020-05-25 14:41:19 +00:00
|
|
|
return kvm_read_guest(kvm, gpa, data, len);
|
KVM: Check for a bad hva before dropping into the ghc slow path
When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handing cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.
Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it sit seems
prudent to clear ghc->memslot in the event of an error".
Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.
Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
Cc: Jim Mattson <jmattson@google.com>
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-09 23:56:18 +00:00
|
|
|
|
2020-05-25 14:41:19 +00:00
|
|
|
r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
|
2011-07-11 19:28:11 +00:00
|
|
|
if (r)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2020-05-25 14:41:19 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
|
|
|
|
|
|
|
|
int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len)
|
|
|
|
{
|
|
|
|
return kvm_read_guest_offset_cached(kvm, ghc, data, 0, len);
|
|
|
|
}
|
2017-05-02 14:20:18 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
|
2011-07-11 19:28:11 +00:00
|
|
|
|
2007-10-01 20:14:18 +00:00
|
|
|
int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
|
|
|
|
{
|
2020-11-06 10:25:09 +00:00
|
|
|
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
|
2007-10-01 20:14:18 +00:00
|
|
|
gfn_t gfn = gpa >> PAGE_SHIFT;
|
|
|
|
int seg;
|
|
|
|
int offset = offset_in_page(gpa);
|
|
|
|
int ret;
|
|
|
|
|
2015-02-20 13:21:36 +00:00
|
|
|
while ((seg = next_segment(len, offset)) != 0) {
|
2020-11-06 10:25:09 +00:00
|
|
|
ret = kvm_write_guest_page(kvm, gfn, zero_page, offset, len);
|
2007-10-01 20:14:18 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
offset = 0;
|
|
|
|
len -= seg;
|
|
|
|
++gfn;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_clear_guest);
|
|
|
|
|
2020-10-01 01:20:34 +00:00
|
|
|
void mark_page_dirty_in_slot(struct kvm *kvm,
|
2021-11-15 23:45:58 +00:00
|
|
|
const struct kvm_memory_slot *memslot,
|
2020-10-01 01:20:34 +00:00
|
|
|
gfn_t gfn)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2021-12-10 16:36:20 +00:00
|
|
|
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
|
|
|
|
|
2022-01-13 12:29:24 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_DIRTY_RING
|
2022-11-10 10:49:10 +00:00
|
|
|
if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
|
2021-12-10 16:36:20 +00:00
|
|
|
return;
|
2022-11-10 10:49:10 +00:00
|
|
|
|
KVM: Push dirty information unconditionally to backup bitmap
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists
and a running vcpu context is strictly required by architecture. It may
cause backwards compatible issue. Currently, saving vgic/its tables is
the only known case where no running vcpu context is expected. We may
have other unknown cases where no running vcpu context exists and it's
reported by the warning message and we bail out without pushing the
dirty information to the backup bitmap. For this, the application is
going to enable the backup bitmap for the unknown cases. However, the
dirty information can't be pushed to the backup bitmap even though the
backup bitmap is enabled for those unknown cases in the application,
until the unknown cases are added to the allowed list of non-running
vcpu context with extra code changes to the host kernel.
In order to make the new application, where the backup bitmap has been
enabled, to work with the unchanged host, we continue to push the dirty
information to the backup bitmap instead of bailing out early. With the
added check on 'memslot->dirty_bitmap' to mark_page_dirty_in_slot(), the
kernel crash is avoided silently by the combined conditions: no running
vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true',
and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled
yet.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
2022-11-12 09:43:22 +00:00
|
|
|
WARN_ON_ONCE(!vcpu && !kvm_arch_allow_write_without_running_vcpu(kvm));
|
2022-01-13 12:29:24 +00:00
|
|
|
#endif
|
2021-12-10 16:36:20 +00:00
|
|
|
|
2020-10-01 01:22:26 +00:00
|
|
|
if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
|
2007-07-31 10:41:14 +00:00
|
|
|
unsigned long rel_gfn = gfn - memslot->base_gfn;
|
2020-10-01 01:22:22 +00:00
|
|
|
u32 slot = (memslot->as_id << 16) | memslot->id;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2022-11-10 10:49:10 +00:00
|
|
|
if (kvm->dirty_ring_size && vcpu)
|
2022-11-10 10:49:08 +00:00
|
|
|
kvm_dirty_ring_push(vcpu, slot, rel_gfn);
|
KVM: Push dirty information unconditionally to backup bitmap
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists
and a running vcpu context is strictly required by architecture. It may
cause backwards compatible issue. Currently, saving vgic/its tables is
the only known case where no running vcpu context is expected. We may
have other unknown cases where no running vcpu context exists and it's
reported by the warning message and we bail out without pushing the
dirty information to the backup bitmap. For this, the application is
going to enable the backup bitmap for the unknown cases. However, the
dirty information can't be pushed to the backup bitmap even though the
backup bitmap is enabled for those unknown cases in the application,
until the unknown cases are added to the allowed list of non-running
vcpu context with extra code changes to the host kernel.
In order to make the new application, where the backup bitmap has been
enabled, to work with the unchanged host, we continue to push the dirty
information to the backup bitmap instead of bailing out early. With the
added check on 'memslot->dirty_bitmap' to mark_page_dirty_in_slot(), the
kernel crash is avoided silently by the combined conditions: no running
vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true',
and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled
yet.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
2022-11-12 09:43:22 +00:00
|
|
|
else if (memslot->dirty_bitmap)
|
2020-10-01 01:22:22 +00:00
|
|
|
set_bit_le(rel_gfn, memslot->dirty_bitmap);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
|
|
|
}
|
2020-10-14 18:26:55 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2010-10-18 13:22:23 +00:00
|
|
|
void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
|
|
|
|
memslot = gfn_to_memslot(kvm, gfn);
|
2020-10-01 01:20:34 +00:00
|
|
|
mark_page_dirty_in_slot(kvm, memslot, gfn);
|
2010-10-18 13:22:23 +00:00
|
|
|
}
|
2013-10-07 16:47:59 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mark_page_dirty);
|
2010-10-18 13:22:23 +00:00
|
|
|
|
2015-05-17 11:58:53 +00:00
|
|
|
void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
|
|
|
|
memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
|
2020-10-01 01:20:34 +00:00
|
|
|
mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
|
2015-05-17 11:58:53 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
|
|
|
|
|
2017-11-24 21:39:01 +00:00
|
|
|
void kvm_sigset_activate(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
if (!vcpu->sigset_active)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This does a lockless modification of ->real_blocked, which is fine
|
|
|
|
* because, only current can change ->real_blocked and all readers of
|
|
|
|
* ->real_blocked don't care as long ->real_blocked is always a subset
|
|
|
|
* of ->blocked.
|
|
|
|
*/
|
|
|
|
sigprocmask(SIG_SETMASK, &vcpu->sigset, ¤t->real_blocked);
|
|
|
|
}
|
|
|
|
|
|
|
|
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
if (!vcpu->sigset_active)
|
|
|
|
return;
|
|
|
|
|
|
|
|
sigprocmask(SIG_SETMASK, ¤t->real_blocked, NULL);
|
|
|
|
sigemptyset(¤t->real_blocked);
|
|
|
|
}
|
|
|
|
|
2015-09-03 14:07:38 +00:00
|
|
|
static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2019-01-27 10:17:16 +00:00
|
|
|
unsigned int old, val, grow, grow_start;
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2015-09-03 14:07:39 +00:00
|
|
|
old = val = vcpu->halt_poll_ns;
|
2019-01-27 10:17:16 +00:00
|
|
|
grow_start = READ_ONCE(halt_poll_ns_grow_start);
|
2016-02-09 12:47:55 +00:00
|
|
|
grow = READ_ONCE(halt_poll_ns_grow);
|
2019-01-27 10:17:14 +00:00
|
|
|
if (!grow)
|
|
|
|
goto out;
|
|
|
|
|
2019-01-27 10:17:16 +00:00
|
|
|
val *= grow;
|
|
|
|
if (val < grow_start)
|
|
|
|
val = grow_start;
|
2015-09-03 14:07:38 +00:00
|
|
|
|
|
|
|
vcpu->halt_poll_ns = val;
|
2019-01-27 10:17:14 +00:00
|
|
|
out:
|
2015-09-03 14:07:39 +00:00
|
|
|
trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
|
2015-09-03 14:07:38 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2021-09-02 03:11:00 +00:00
|
|
|
unsigned int old, val, shrink, grow_start;
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2015-09-03 14:07:39 +00:00
|
|
|
old = val = vcpu->halt_poll_ns;
|
2016-02-09 12:47:55 +00:00
|
|
|
shrink = READ_ONCE(halt_poll_ns_shrink);
|
2021-09-02 03:11:00 +00:00
|
|
|
grow_start = READ_ONCE(halt_poll_ns_grow_start);
|
2016-02-09 12:47:55 +00:00
|
|
|
if (shrink == 0)
|
2015-09-03 14:07:38 +00:00
|
|
|
val = 0;
|
|
|
|
else
|
2016-02-09 12:47:55 +00:00
|
|
|
val /= shrink;
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2021-09-02 03:11:00 +00:00
|
|
|
if (val < grow_start)
|
|
|
|
val = 0;
|
|
|
|
|
2015-09-03 14:07:38 +00:00
|
|
|
vcpu->halt_poll_ns = val;
|
2015-09-03 14:07:39 +00:00
|
|
|
trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
|
2015-09-03 14:07:38 +00:00
|
|
|
}
|
|
|
|
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-06-27 21:59:11 +00:00
|
|
|
int ret = -EINTR;
|
|
|
|
int idx = srcu_read_lock(&vcpu->kvm->srcu);
|
|
|
|
|
2022-09-21 00:32:01 +00:00
|
|
|
if (kvm_arch_vcpu_runnable(vcpu))
|
2018-06-27 21:59:11 +00:00
|
|
|
goto out;
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
if (kvm_cpu_has_pending_timer(vcpu))
|
2018-06-27 21:59:11 +00:00
|
|
|
goto out;
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
if (signal_pending(current))
|
2018-06-27 21:59:11 +00:00
|
|
|
goto out;
|
2021-05-25 13:41:17 +00:00
|
|
|
if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
|
|
|
|
goto out;
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
|
2018-06-27 21:59:11 +00:00
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
srcu_read_unlock(&vcpu->kvm->srcu, idx);
|
|
|
|
return ret;
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
}
|
|
|
|
|
2021-10-09 02:12:07 +00:00
|
|
|
/*
|
|
|
|
* Block the vCPU until the vCPU is runnable, an event arrives, or a signal is
|
|
|
|
* pending. This is mostly used when halting a vCPU, but may also be used
|
|
|
|
* directly for other vCPU non-runnable states, e.g. x86's Wait-For-SIPI.
|
|
|
|
*/
|
|
|
|
bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
|
2020-05-08 18:22:40 +00:00
|
|
|
{
|
2021-10-09 02:12:07 +00:00
|
|
|
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
|
|
|
|
bool waited = false;
|
|
|
|
|
2021-10-09 02:12:08 +00:00
|
|
|
vcpu->stat.generic.blocking = 1;
|
|
|
|
|
2022-06-06 18:08:28 +00:00
|
|
|
preempt_disable();
|
2021-10-09 02:12:07 +00:00
|
|
|
kvm_arch_vcpu_blocking(vcpu);
|
|
|
|
prepare_to_rcuwait(wait);
|
2022-06-06 18:08:28 +00:00
|
|
|
preempt_enable();
|
|
|
|
|
2021-10-09 02:12:07 +00:00
|
|
|
for (;;) {
|
|
|
|
set_current_state(TASK_INTERRUPTIBLE);
|
|
|
|
|
|
|
|
if (kvm_vcpu_check_block(vcpu) < 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
waited = true;
|
|
|
|
schedule();
|
|
|
|
}
|
|
|
|
|
2022-06-06 18:08:28 +00:00
|
|
|
preempt_disable();
|
|
|
|
finish_rcuwait(wait);
|
2021-10-09 02:12:07 +00:00
|
|
|
kvm_arch_vcpu_unblocking(vcpu);
|
2022-06-06 18:08:28 +00:00
|
|
|
preempt_enable();
|
2021-10-09 02:12:07 +00:00
|
|
|
|
2021-10-09 02:12:08 +00:00
|
|
|
vcpu->stat.generic.blocking = 0;
|
|
|
|
|
2021-10-09 02:12:07 +00:00
|
|
|
return waited;
|
|
|
|
}
|
|
|
|
|
2021-10-09 02:11:59 +00:00
|
|
|
static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
|
|
|
|
ktime_t end, bool success)
|
2020-05-08 18:22:40 +00:00
|
|
|
{
|
2021-10-09 02:12:00 +00:00
|
|
|
struct kvm_vcpu_stat_generic *stats = &vcpu->stat.generic;
|
2021-10-09 02:11:59 +00:00
|
|
|
u64 poll_ns = ktime_to_ns(ktime_sub(end, start));
|
|
|
|
|
2021-10-09 02:12:00 +00:00
|
|
|
++vcpu->stat.generic.halt_attempted_poll;
|
|
|
|
|
|
|
|
if (success) {
|
|
|
|
++vcpu->stat.generic.halt_successful_poll;
|
|
|
|
|
|
|
|
if (!vcpu_valid_wakeup(vcpu))
|
|
|
|
++vcpu->stat.generic.halt_poll_invalid;
|
|
|
|
|
|
|
|
stats->halt_poll_success_ns += poll_ns;
|
|
|
|
KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_success_hist, poll_ns);
|
|
|
|
} else {
|
|
|
|
stats->halt_poll_fail_ns += poll_ns;
|
|
|
|
KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_fail_hist, poll_ns);
|
|
|
|
}
|
2020-05-08 18:22:40 +00:00
|
|
|
}
|
|
|
|
|
2022-11-17 00:16:56 +00:00
|
|
|
static unsigned int kvm_vcpu_max_halt_poll_ns(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2022-11-17 00:16:57 +00:00
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
|
|
|
|
if (kvm->override_halt_poll_ns) {
|
|
|
|
/*
|
|
|
|
* Ensure kvm->max_halt_poll_ns is not read before
|
|
|
|
* kvm->override_halt_poll_ns.
|
|
|
|
*
|
|
|
|
* Pairs with the smp_wmb() when enabling KVM_CAP_HALT_POLL.
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
return READ_ONCE(kvm->max_halt_poll_ns);
|
|
|
|
}
|
|
|
|
|
|
|
|
return READ_ONCE(halt_poll_ns);
|
2022-11-17 00:16:56 +00:00
|
|
|
}
|
|
|
|
|
2007-07-18 09:15:21 +00:00
|
|
|
/*
|
2021-10-09 02:12:07 +00:00
|
|
|
* Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc... If halt
|
|
|
|
* polling is enabled, busy wait for a short time before blocking to avoid the
|
|
|
|
* expensive block+unblock sequence if a wake event arrives soon after the vCPU
|
|
|
|
* is halted.
|
2007-07-18 09:15:21 +00:00
|
|
|
*/
|
2021-10-09 02:12:06 +00:00
|
|
|
void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
|
2007-06-05 12:53:05 +00:00
|
|
|
{
|
2022-11-17 00:16:56 +00:00
|
|
|
unsigned int max_halt_poll_ns = kvm_vcpu_max_halt_poll_ns(vcpu);
|
2021-10-09 02:11:56 +00:00
|
|
|
bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
|
2020-05-08 18:22:40 +00:00
|
|
|
ktime_t start, cur, poll_end;
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
bool waited = false;
|
2022-11-17 00:16:55 +00:00
|
|
|
bool do_halt_poll;
|
2021-10-09 02:12:06 +00:00
|
|
|
u64 halt_ns;
|
2019-08-02 10:37:09 +00:00
|
|
|
|
2022-11-17 00:16:56 +00:00
|
|
|
if (vcpu->halt_poll_ns > max_halt_poll_ns)
|
|
|
|
vcpu->halt_poll_ns = max_halt_poll_ns;
|
2022-11-17 00:16:55 +00:00
|
|
|
|
|
|
|
do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
|
|
|
|
|
2020-05-08 18:22:40 +00:00
|
|
|
start = cur = poll_end = ktime_get();
|
2021-10-09 02:11:58 +00:00
|
|
|
if (do_halt_poll) {
|
2021-10-09 02:12:09 +00:00
|
|
|
ktime_t stop = ktime_add_ns(start, vcpu->halt_poll_ns);
|
2015-02-26 06:58:23 +00:00
|
|
|
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
do {
|
2021-10-09 02:12:00 +00:00
|
|
|
if (kvm_vcpu_check_block(vcpu) < 0)
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
goto out;
|
2021-07-27 11:12:47 +00:00
|
|
|
cpu_relax();
|
2020-05-08 18:22:40 +00:00
|
|
|
poll_end = cur = ktime_get();
|
2021-05-18 12:00:31 +00:00
|
|
|
} while (kvm_vcpu_can_poll(cur, stop));
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
}
|
2008-05-08 22:47:01 +00:00
|
|
|
|
2021-10-09 02:12:07 +00:00
|
|
|
waited = kvm_vcpu_block(vcpu);
|
2021-08-02 16:56:33 +00:00
|
|
|
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
cur = ktime_get();
|
2021-08-02 16:56:32 +00:00
|
|
|
if (waited) {
|
|
|
|
vcpu->stat.generic.halt_wait_ns +=
|
|
|
|
ktime_to_ns(cur) - ktime_to_ns(poll_end);
|
2021-08-02 16:56:33 +00:00
|
|
|
KVM_STATS_LOG_HIST_UPDATE(vcpu->stat.generic.halt_wait_hist,
|
|
|
|
ktime_to_ns(cur) - ktime_to_ns(poll_end));
|
2021-08-02 16:56:32 +00:00
|
|
|
}
|
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
|
|
|
out:
|
2021-10-09 02:12:06 +00:00
|
|
|
/* The total time the vCPU was "halted", including polling time. */
|
|
|
|
halt_ns = ktime_to_ns(cur) - ktime_to_ns(start);
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2021-10-09 02:11:59 +00:00
|
|
|
/*
|
|
|
|
* Note, halt-polling is considered successful so long as the vCPU was
|
|
|
|
* never actually scheduled out, i.e. even if the wake event arrived
|
|
|
|
* after of the halt-polling loop itself, but before the full wait.
|
|
|
|
*/
|
2021-10-09 02:11:58 +00:00
|
|
|
if (do_halt_poll)
|
2021-10-09 02:11:59 +00:00
|
|
|
update_halt_poll_stats(vcpu, start, poll_end, !waited);
|
2020-05-08 18:22:40 +00:00
|
|
|
|
2021-10-09 02:11:56 +00:00
|
|
|
if (halt_poll_allowed) {
|
2022-11-17 00:16:56 +00:00
|
|
|
/* Recompute the max halt poll time in case it changed. */
|
|
|
|
max_halt_poll_ns = kvm_vcpu_max_halt_poll_ns(vcpu);
|
|
|
|
|
2019-09-29 01:06:56 +00:00
|
|
|
if (!vcpu_valid_wakeup(vcpu)) {
|
2015-09-03 14:07:38 +00:00
|
|
|
shrink_halt_poll_ns(vcpu);
|
2022-11-17 00:16:56 +00:00
|
|
|
} else if (max_halt_poll_ns) {
|
2021-10-09 02:12:06 +00:00
|
|
|
if (halt_ns <= vcpu->halt_poll_ns)
|
2019-09-29 01:06:56 +00:00
|
|
|
;
|
|
|
|
/* we had a long block, shrink polling */
|
2020-04-17 22:14:46 +00:00
|
|
|
else if (vcpu->halt_poll_ns &&
|
2022-11-17 00:16:56 +00:00
|
|
|
halt_ns > max_halt_poll_ns)
|
2019-09-29 01:06:56 +00:00
|
|
|
shrink_halt_poll_ns(vcpu);
|
|
|
|
/* we had a short halt and our poll time is too small */
|
2022-11-17 00:16:56 +00:00
|
|
|
else if (vcpu->halt_poll_ns < max_halt_poll_ns &&
|
|
|
|
halt_ns < max_halt_poll_ns)
|
2019-09-29 01:06:56 +00:00
|
|
|
grow_halt_poll_ns(vcpu);
|
|
|
|
} else {
|
|
|
|
vcpu->halt_poll_ns = 0;
|
|
|
|
}
|
|
|
|
}
|
2015-09-03 14:07:38 +00:00
|
|
|
|
2021-10-09 02:12:06 +00:00
|
|
|
trace_kvm_vcpu_wakeup(halt_ns, waited, vcpu_valid_wakeup(vcpu));
|
2007-07-18 09:15:21 +00:00
|
|
|
}
|
2021-10-09 02:12:06 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
|
2007-07-18 09:15:21 +00:00
|
|
|
|
2017-04-26 20:32:26 +00:00
|
|
|
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
|
2012-03-08 21:44:24 +00:00
|
|
|
{
|
2021-10-09 02:12:12 +00:00
|
|
|
if (__kvm_vcpu_wake_up(vcpu)) {
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
WRITE_ONCE(vcpu->ready, true);
|
2021-06-18 22:27:03 +00:00
|
|
|
++vcpu->stat.generic.halt_wakeup;
|
2017-04-26 20:32:26 +00:00
|
|
|
return true;
|
2012-03-08 21:44:24 +00:00
|
|
|
}
|
|
|
|
|
2017-04-26 20:32:26 +00:00
|
|
|
return false;
|
2016-05-04 19:09:44 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_wake_up);
|
|
|
|
|
2017-05-04 13:14:13 +00:00
|
|
|
#ifndef CONFIG_S390
|
2016-05-04 19:09:44 +00:00
|
|
|
/*
|
|
|
|
* Kick a sleeping VCPU, or a guest VCPU in guest mode, into host kernel mode.
|
|
|
|
*/
|
|
|
|
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simpy be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
Cc: Venkatesh Srinivas <venkateshs@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-27 09:25:09 +00:00
|
|
|
int me, cpu;
|
2016-05-04 19:09:44 +00:00
|
|
|
|
2017-04-26 20:32:26 +00:00
|
|
|
if (kvm_vcpu_wake_up(vcpu))
|
|
|
|
return;
|
|
|
|
|
2021-10-20 10:38:05 +00:00
|
|
|
me = get_cpu();
|
|
|
|
/*
|
|
|
|
* The only state change done outside the vcpu mutex is IN_GUEST_MODE
|
|
|
|
* to EXITING_GUEST_MODE. Therefore the moderately expensive "should
|
|
|
|
* kick" check does not need atomic operations if kvm_vcpu_kick is used
|
|
|
|
* within the vCPU thread itself.
|
|
|
|
*/
|
|
|
|
if (vcpu == __this_cpu_read(kvm_running_vcpu)) {
|
|
|
|
if (vcpu->mode == IN_GUEST_MODE)
|
|
|
|
WRITE_ONCE(vcpu->mode, EXITING_GUEST_MODE);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simpy be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
Cc: Venkatesh Srinivas <venkateshs@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-27 09:25:09 +00:00
|
|
|
/*
|
|
|
|
* Note, the vCPU could get migrated to a different pCPU at any point
|
|
|
|
* after kvm_arch_vcpu_should_kick(), which could result in sending an
|
|
|
|
* IPI to the previous pCPU. But, that's ok because the purpose of the
|
|
|
|
* IPI is to force the vCPU to leave IN_GUEST_MODE, and migrating the
|
|
|
|
* vCPU also requires it to leave IN_GUEST_MODE.
|
|
|
|
*/
|
|
|
|
if (kvm_arch_vcpu_should_kick(vcpu)) {
|
|
|
|
cpu = READ_ONCE(vcpu->cpu);
|
|
|
|
if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
|
2012-03-08 21:44:24 +00:00
|
|
|
smp_send_reschedule(cpu);
|
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simpy be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
Cc: Venkatesh Srinivas <venkateshs@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-27 09:25:09 +00:00
|
|
|
}
|
2021-10-20 10:38:05 +00:00
|
|
|
out:
|
2012-03-08 21:44:24 +00:00
|
|
|
put_cpu();
|
|
|
|
}
|
2013-04-11 11:25:15 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
|
2017-05-04 13:14:13 +00:00
|
|
|
#endif /* !CONFIG_S390 */
|
2012-03-08 21:44:24 +00:00
|
|
|
|
2014-05-23 10:20:42 +00:00
|
|
|
int kvm_vcpu_yield_to(struct kvm_vcpu *target)
|
2012-04-25 13:30:38 +00:00
|
|
|
{
|
|
|
|
struct pid *pid;
|
|
|
|
struct task_struct *task = NULL;
|
2014-05-23 10:20:42 +00:00
|
|
|
int ret = 0;
|
2012-04-25 13:30:38 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
pid = rcu_dereference(target->pid);
|
|
|
|
if (pid)
|
2014-09-18 23:40:41 +00:00
|
|
|
task = get_pid_task(pid, PIDTYPE_PID);
|
2012-04-25 13:30:38 +00:00
|
|
|
rcu_read_unlock();
|
|
|
|
if (!task)
|
2013-01-22 07:39:24 +00:00
|
|
|
return ret;
|
|
|
|
ret = yield_to(task, 1);
|
2012-04-25 13:30:38 +00:00
|
|
|
put_task_struct(task);
|
2013-01-22 07:39:24 +00:00
|
|
|
|
|
|
|
return ret;
|
2012-04-25 13:30:38 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
|
|
|
|
|
2012-07-19 09:47:52 +00:00
|
|
|
/*
|
|
|
|
* Helper that checks whether a VCPU is eligible for directed yield.
|
|
|
|
* Most eligible candidate to yield is decided by following heuristics:
|
|
|
|
*
|
|
|
|
* (a) VCPU which has not done pl-exit or cpu relax intercepted recently
|
|
|
|
* (preempted lock holder), indicated by @in_spin_loop.
|
2020-04-01 14:03:10 +00:00
|
|
|
* Set at the beginning and cleared at the end of interception/PLE handler.
|
2012-07-19 09:47:52 +00:00
|
|
|
*
|
|
|
|
* (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
|
|
|
|
* chance last time (mostly it has become eligible now since we have probably
|
|
|
|
* yielded to lockholder in last iteration. This is done by toggling
|
|
|
|
* @dy_eligible each time a VCPU checked for eligibility.)
|
|
|
|
*
|
|
|
|
* Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
|
|
|
|
* to preempted lock-holder could result in wrong VCPU selection and CPU
|
|
|
|
* burning. Giving priority for a potential lock-holder increases lock
|
|
|
|
* progress.
|
|
|
|
*
|
|
|
|
* Since algorithm is based on heuristics, accessing another VCPU data without
|
|
|
|
* locking does not harm. It may result in trying to yield to same VCPU, fail
|
|
|
|
* and continue with next VCPU and so on.
|
|
|
|
*/
|
2013-12-29 20:12:29 +00:00
|
|
|
static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
|
2012-07-19 09:47:52 +00:00
|
|
|
{
|
2014-01-10 00:43:16 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
|
2012-07-19 09:47:52 +00:00
|
|
|
bool eligible;
|
|
|
|
|
|
|
|
eligible = !vcpu->spin_loop.in_spin_loop ||
|
2014-09-04 19:13:31 +00:00
|
|
|
vcpu->spin_loop.dy_eligible;
|
2012-07-19 09:47:52 +00:00
|
|
|
|
|
|
|
if (vcpu->spin_loop.in_spin_loop)
|
|
|
|
kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);
|
|
|
|
|
|
|
|
return eligible;
|
2014-01-10 00:43:16 +00:00
|
|
|
#else
|
|
|
|
return true;
|
2012-07-19 09:47:52 +00:00
|
|
|
#endif
|
2014-01-10 00:43:16 +00:00
|
|
|
}
|
2013-01-22 07:39:24 +00:00
|
|
|
|
2019-08-05 02:03:19 +00:00
|
|
|
/*
|
|
|
|
* Unlike kvm_arch_vcpu_runnable, this function is called outside
|
|
|
|
* a vcpu_load/vcpu_put pair. However, for most architectures
|
|
|
|
* kvm_arch_vcpu_runnable does not require vcpu_load.
|
|
|
|
*/
|
|
|
|
bool __weak kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return kvm_arch_vcpu_runnable(vcpu);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
if (kvm_arch_dy_runnable(vcpu))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
#ifdef CONFIG_KVM_ASYNC_PF
|
|
|
|
if (!list_empty_careful(&vcpu->async_pf.done))
|
|
|
|
return true;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2024-01-10 00:39:35 +00:00
|
|
|
/*
|
|
|
|
* By default, simply query the target vCPU's current mode when checking if a
|
|
|
|
* vCPU was preempted in kernel mode. All architectures except x86 (or more
|
|
|
|
* specifical, except VMX) allow querying whether or not a vCPU is in kernel
|
|
|
|
* mode even if the vCPU is NOT loaded, i.e. using kvm_arch_vcpu_in_kernel()
|
|
|
|
* directly for cross-vCPU checks is functionally correct and accurate.
|
|
|
|
*/
|
|
|
|
bool __weak kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return kvm_arch_vcpu_in_kernel(vcpu);
|
|
|
|
}
|
|
|
|
|
2021-04-16 03:08:10 +00:00
|
|
|
bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-08-08 04:05:32 +00:00
|
|
|
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
|
2009-10-09 10:03:20 +00:00
|
|
|
{
|
2011-02-01 14:53:28 +00:00
|
|
|
struct kvm *kvm = me->kvm;
|
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2011-02-01 14:53:28 +00:00
|
|
|
int yielded = 0;
|
2013-01-22 07:39:24 +00:00
|
|
|
int try = 3;
|
2011-02-01 14:53:28 +00:00
|
|
|
int pass;
|
2009-10-09 10:03:20 +00:00
|
|
|
|
2012-07-18 13:37:46 +00:00
|
|
|
kvm_vcpu_set_in_spin_loop(me, true);
|
2011-02-01 14:53:28 +00:00
|
|
|
/*
|
|
|
|
* We boost the priority of a VCPU that is runnable but not
|
|
|
|
* currently running, because it got preempted by something
|
|
|
|
* else and called schedule in __vcpu_run. Hopefully that
|
|
|
|
* VCPU is holding the lock that we need and will release it.
|
|
|
|
* We approximate round-robin by starting at the last boosted VCPU.
|
|
|
|
*/
|
2013-01-22 07:39:24 +00:00
|
|
|
for (pass = 0; pass < 2 && !yielded && try; pass++) {
|
2011-02-01 14:53:28 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm) {
|
2012-06-19 20:51:04 +00:00
|
|
|
if (!pass && i <= last_boosted_vcpu) {
|
2011-02-01 14:53:28 +00:00
|
|
|
i = last_boosted_vcpu;
|
|
|
|
continue;
|
|
|
|
} else if (pass && i > last_boosted_vcpu)
|
|
|
|
break;
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
if (!READ_ONCE(vcpu->ready))
|
2013-03-04 18:02:27 +00:00
|
|
|
continue;
|
2011-02-01 14:53:28 +00:00
|
|
|
if (vcpu == me)
|
|
|
|
continue;
|
2021-10-09 02:12:12 +00:00
|
|
|
if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
|
2011-02-01 14:53:28 +00:00
|
|
|
continue;
|
2024-01-10 00:39:38 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Treat the target vCPU as being in-kernel if it has a
|
|
|
|
* pending interrupt, as the vCPU trying to yield may
|
|
|
|
* be spinning waiting on IPI delivery, i.e. the target
|
|
|
|
* vCPU is in-kernel for the purposes of directed yield.
|
|
|
|
*/
|
2019-08-01 03:30:14 +00:00
|
|
|
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
|
2021-04-16 03:08:10 +00:00
|
|
|
!kvm_arch_dy_has_pending_interrupt(vcpu) &&
|
2024-01-10 00:39:35 +00:00
|
|
|
!kvm_arch_vcpu_preempted_in_kernel(vcpu))
|
2017-08-08 04:05:32 +00:00
|
|
|
continue;
|
2012-07-19 09:47:52 +00:00
|
|
|
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
|
|
|
|
continue;
|
2013-01-22 07:39:24 +00:00
|
|
|
|
|
|
|
yielded = kvm_vcpu_yield_to(vcpu);
|
|
|
|
if (yielded > 0) {
|
2011-02-01 14:53:28 +00:00
|
|
|
kvm->last_boosted_vcpu = i;
|
|
|
|
break;
|
2013-01-22 07:39:24 +00:00
|
|
|
} else if (yielded < 0) {
|
|
|
|
try--;
|
|
|
|
if (!try)
|
|
|
|
break;
|
2011-02-01 14:53:28 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2012-07-18 13:37:46 +00:00
|
|
|
kvm_vcpu_set_in_spin_loop(me, false);
|
2012-07-19 09:47:52 +00:00
|
|
|
|
|
|
|
/* Ensure vcpu is not eligible during next spinloop */
|
|
|
|
kvm_vcpu_set_dy_eligible(me, false);
|
2009-10-09 10:03:20 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
|
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
|
|
|
|
{
|
2021-11-21 12:54:40 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_DIRTY_RING
|
2020-10-01 01:22:22 +00:00
|
|
|
return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
|
|
|
|
(pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
|
|
|
|
kvm->dirty_ring_size / PAGE_SIZE);
|
|
|
|
#else
|
|
|
|
return false;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2018-04-18 19:19:58 +00:00
|
|
|
static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
|
2007-02-22 10:58:31 +00:00
|
|
|
{
|
2017-02-24 22:56:41 +00:00
|
|
|
struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
|
2007-02-22 10:58:31 +00:00
|
|
|
struct page *page;
|
|
|
|
|
2007-12-05 07:15:52 +00:00
|
|
|
if (vmf->pgoff == 0)
|
2007-03-20 10:46:50 +00:00
|
|
|
page = virt_to_page(vcpu->run);
|
2008-01-23 16:14:23 +00:00
|
|
|
#ifdef CONFIG_X86
|
2007-12-05 07:15:52 +00:00
|
|
|
else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
|
2007-12-13 15:50:52 +00:00
|
|
|
page = virt_to_page(vcpu->arch.pio_data);
|
2008-05-30 14:05:54 +00:00
|
|
|
#endif
|
2017-03-31 11:53:23 +00:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2008-05-30 14:05:54 +00:00
|
|
|
else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
|
|
|
|
page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
|
2008-01-23 16:14:23 +00:00
|
|
|
#endif
|
2020-10-01 01:22:22 +00:00
|
|
|
else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
|
|
|
|
page = kvm_dirty_ring_get_page(
|
|
|
|
&vcpu->dirty_ring,
|
|
|
|
vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
|
2007-03-20 10:46:50 +00:00
|
|
|
else
|
2012-01-04 09:25:23 +00:00
|
|
|
return kvm_arch_vcpu_fault(vcpu, vmf);
|
2007-02-22 10:58:31 +00:00
|
|
|
get_page(page);
|
2007-12-05 07:15:52 +00:00
|
|
|
vmf->page = page;
|
|
|
|
return 0;
|
2007-02-22 10:58:31 +00:00
|
|
|
}
|
|
|
|
|
2009-09-27 18:29:37 +00:00
|
|
|
static const struct vm_operations_struct kvm_vcpu_vm_ops = {
|
2007-12-05 07:15:52 +00:00
|
|
|
.fault = kvm_vcpu_fault,
|
2007-02-22 10:58:31 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
|
|
|
|
{
|
2020-10-01 01:22:22 +00:00
|
|
|
struct kvm_vcpu *vcpu = file->private_data;
|
2021-09-29 07:28:46 +00:00
|
|
|
unsigned long pages = vma_pages(vma);
|
2020-10-01 01:22:22 +00:00
|
|
|
|
|
|
|
if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
|
|
|
|
kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
|
|
|
|
((vma->vm_flags & VM_EXEC) || !(vma->vm_flags & VM_SHARED)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2007-02-22 10:58:31 +00:00
|
|
|
vma->vm_ops = &kvm_vcpu_vm_ops;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-02-21 16:04:26 +00:00
|
|
|
static int kvm_vcpu_release(struct inode *inode, struct file *filp)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = filp->private_data;
|
|
|
|
|
2008-04-19 19:33:56 +00:00
|
|
|
kvm_put_kvm(vcpu->kvm);
|
2007-02-21 16:04:26 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
static struct file_operations kvm_vcpu_fops = {
|
2007-02-21 16:04:26 +00:00
|
|
|
.release = kvm_vcpu_release,
|
|
|
|
.unlocked_ioctl = kvm_vcpu_ioctl,
|
2007-02-22 10:58:31 +00:00
|
|
|
.mmap = kvm_vcpu_mmap,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 16:52:59 +00:00
|
|
|
.llseek = noop_llseek,
|
2018-06-17 09:16:21 +00:00
|
|
|
KVM_COMPAT(kvm_vcpu_compat_ioctl),
|
2007-02-21 16:04:26 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocates an inode for the vcpu.
|
|
|
|
*/
|
|
|
|
static int create_vcpu_fd(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
kvm: embed vcpu id to dentry of vcpu anon inode
All d-entries for vcpu have the same, "anon_inode:kvm-vcpu". That means
it is impossible to know the mapping between fds for vcpu and vcpu
from userland.
# LC_ALL=C ls -l /proc/617/fd | grep vcpu
lrwx------. 1 qemu qemu 64 Jan 7 16:50 18 -> anon_inode:kvm-vcpu
lrwx------. 1 qemu qemu 64 Jan 7 16:50 19 -> anon_inode:kvm-vcpu
It is also impossible to know the mapping between vma for kvm_run
structure and vcpu from userland.
# LC_ALL=C grep vcpu /proc/617/maps
7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393 anon_inode:kvm-vcpu
7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393 anon_inode:kvm-vcpu
This change adds vcpu id to d-entries for vcpu. With this change
you can get the following output:
# LC_ALL=C ls -l /proc/617/fd | grep vcpu
lrwx------. 1 qemu qemu 64 Jan 7 16:50 18 -> anon_inode:kvm-vcpu:0
lrwx------. 1 qemu qemu 64 Jan 7 16:50 19 -> anon_inode:kvm-vcpu:1
# LC_ALL=C grep vcpu /proc/617/maps
7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393 anon_inode:kvm-vcpu:0
7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393 anon_inode:kvm-vcpu:1
With the mappings known from the output, a tool like strace can report more details
of qemu-kvm process activities. Here is the strace output of my local prototype:
# ./strace -KK -f -p 617 2>&1 | grep 'KVM_RUN\| K'
...
[pid 664] ioctl(18, KVM_RUN, 0) = 0 (KVM_EXIT_MMIO)
K ready_for_interrupt_injection=1, if_flag=0, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
[pid 664] ioctl(18, KVM_RUN, 0) = 0 (KVM_EXIT_MMIO)
K ready_for_interrupt_injection=1, if_flag=1, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
...
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-19 19:04:22 +00:00
|
|
|
char name[8 + 1 + ITOA_MAX_LEN + 1];
|
|
|
|
|
|
|
|
snprintf(name, sizeof(name), "kvm-vcpu:%d", vcpu->vcpu_id);
|
|
|
|
return anon_inode_getfd(name, &kvm_vcpu_fops, vcpu, O_RDWR | O_CLOEXEC);
|
2007-02-21 16:04:26 +00:00
|
|
|
}
|
|
|
|
|
2022-05-23 19:03:27 +00:00
|
|
|
#ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
|
|
|
|
static int vcpu_get_pid(void *data, u64 *val)
|
|
|
|
{
|
2022-12-13 08:02:36 +00:00
|
|
|
struct kvm_vcpu *vcpu = data;
|
2023-02-11 01:07:19 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
*val = pid_nr(rcu_dereference(vcpu->pid));
|
|
|
|
rcu_read_unlock();
|
2022-05-23 19:03:27 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
DEFINE_SIMPLE_ATTRIBUTE(vcpu_get_pid_fops, vcpu_get_pid, NULL, "%llu\n");
|
|
|
|
|
2019-07-31 18:56:20 +00:00
|
|
|
static void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
|
2016-09-16 14:27:35 +00:00
|
|
|
{
|
2020-06-04 13:16:52 +00:00
|
|
|
struct dentry *debugfs_dentry;
|
2016-09-16 14:27:35 +00:00
|
|
|
char dir_name[ITOA_MAX_LEN * 2];
|
|
|
|
|
|
|
|
if (!debugfs_initialized())
|
2019-07-31 18:56:20 +00:00
|
|
|
return;
|
2016-09-16 14:27:35 +00:00
|
|
|
|
|
|
|
snprintf(dir_name, sizeof(dir_name), "vcpu%d", vcpu->vcpu_id);
|
2020-06-04 13:16:52 +00:00
|
|
|
debugfs_dentry = debugfs_create_dir(dir_name,
|
|
|
|
vcpu->kvm->debugfs_dentry);
|
2022-05-23 19:03:27 +00:00
|
|
|
debugfs_create_file("pid", 0444, debugfs_dentry, vcpu,
|
|
|
|
&vcpu_get_pid_fops);
|
2016-09-16 14:27:35 +00:00
|
|
|
|
2020-06-04 13:16:52 +00:00
|
|
|
kvm_arch_create_vcpu_debugfs(vcpu, debugfs_dentry);
|
2016-09-16 14:27:35 +00:00
|
|
|
}
|
2022-05-23 19:03:27 +00:00
|
|
|
#endif
|
2016-09-16 14:27:35 +00:00
|
|
|
|
2007-02-20 16:41:05 +00:00
|
|
|
/*
|
|
|
|
* Creates some virtual cpus. Good luck creating more than one.
|
|
|
|
*/
|
2024-06-14 20:28:55 +00:00
|
|
|
static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
|
2007-02-20 16:41:05 +00:00
|
|
|
{
|
|
|
|
int r;
|
2015-11-05 08:03:50 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
2019-12-18 21:55:30 +00:00
|
|
|
struct page *page;
|
2007-02-20 16:41:05 +00:00
|
|
|
|
2024-06-14 20:28:55 +00:00
|
|
|
/*
|
|
|
|
* KVM tracks vCPU IDs as 'int', be kind to userspace and reject
|
|
|
|
* too-large values instead of silently truncating.
|
|
|
|
*
|
|
|
|
* Ensure KVM_MAX_VCPU_IDS isn't pushed above INT_MAX without first
|
|
|
|
* changing the storage type (at the very least, IDs should be tracked
|
|
|
|
* as unsigned ints).
|
|
|
|
*/
|
|
|
|
BUILD_BUG_ON(KVM_MAX_VCPU_IDS > INT_MAX);
|
2021-09-13 13:57:44 +00:00
|
|
|
if (id >= KVM_MAX_VCPU_IDS)
|
2013-11-19 00:09:22 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2016-06-13 12:48:25 +00:00
|
|
|
mutex_lock(&kvm->lock);
|
2022-03-04 19:48:38 +00:00
|
|
|
if (kvm->created_vcpus >= kvm->max_vcpus) {
|
2016-06-13 12:48:25 +00:00
|
|
|
mutex_unlock(&kvm->lock);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2022-04-19 15:44:09 +00:00
|
|
|
r = kvm_arch_vcpu_precreate(kvm, id);
|
|
|
|
if (r) {
|
|
|
|
mutex_unlock(&kvm->lock);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2016-06-13 12:48:25 +00:00
|
|
|
kvm->created_vcpus++;
|
|
|
|
mutex_unlock(&kvm->lock);
|
|
|
|
|
2021-04-06 19:07:40 +00:00
|
|
|
vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
|
2019-12-18 21:55:15 +00:00
|
|
|
if (!vcpu) {
|
|
|
|
r = -ENOMEM;
|
2016-06-13 12:48:25 +00:00
|
|
|
goto vcpu_decrement;
|
|
|
|
}
|
2007-02-20 16:41:05 +00:00
|
|
|
|
2020-01-09 14:57:12 +00:00
|
|
|
BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
|
2020-12-18 22:01:38 +00:00
|
|
|
page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
|
2019-12-18 21:55:30 +00:00
|
|
|
if (!page) {
|
|
|
|
r = -ENOMEM;
|
2019-12-18 21:55:15 +00:00
|
|
|
goto vcpu_free;
|
2019-12-18 21:55:30 +00:00
|
|
|
}
|
|
|
|
vcpu->run = page_address(page);
|
|
|
|
|
|
|
|
kvm_vcpu_init(vcpu, kvm, id);
|
2019-12-18 21:55:15 +00:00
|
|
|
|
|
|
|
r = kvm_arch_vcpu_create(vcpu);
|
|
|
|
if (r)
|
2019-12-18 21:55:30 +00:00
|
|
|
goto vcpu_free_run_page;
|
2019-12-18 21:55:15 +00:00
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
if (kvm->dirty_ring_size) {
|
|
|
|
r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
|
|
|
|
id, kvm->dirty_ring_size);
|
|
|
|
if (r)
|
|
|
|
goto arch_vcpu_destroy;
|
|
|
|
}
|
|
|
|
|
2007-07-23 06:51:37 +00:00
|
|
|
mutex_lock(&kvm->lock);
|
2023-01-11 18:06:50 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_LOCKDEP
|
|
|
|
/* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
|
|
|
|
mutex_lock(&vcpu->mutex);
|
|
|
|
mutex_unlock(&vcpu->mutex);
|
|
|
|
#endif
|
|
|
|
|
2015-11-05 08:03:50 +00:00
|
|
|
if (kvm_get_vcpu_by_id(kvm, id)) {
|
|
|
|
r = -EEXIST;
|
|
|
|
goto unlock_vcpu_destroy;
|
|
|
|
}
|
2009-06-09 12:56:28 +00:00
|
|
|
|
2019-11-07 12:53:42 +00:00
|
|
|
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
|
KVM: Fix vcpu_array[0] races
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
int num_vcpus = atomic_read(&kvm->online_vcpus);
i = array_index_nospec(i, num_vcpus);
return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
(atomic_read(&kvm->online_vcpus) - 1))
In both cases, such online_vcpus=0 edge case, even if leading to
unnecessary calls to XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-10 14:04:09 +00:00
|
|
|
r = xa_reserve(&kvm->vcpu_array, vcpu->vcpu_idx, GFP_KERNEL_ACCOUNT);
|
2021-11-16 16:04:01 +00:00
|
|
|
if (r)
|
|
|
|
goto unlock_vcpu_destroy;
|
2007-02-20 16:41:05 +00:00
|
|
|
|
2007-07-27 07:16:56 +00:00
|
|
|
/* Now it's all set up, let userspace reach it */
|
2008-04-19 19:33:56 +00:00
|
|
|
kvm_get_kvm(kvm);
|
2007-02-21 16:04:26 +00:00
|
|
|
r = create_vcpu_fd(vcpu);
|
KVM: Fix vcpu_array[0] races
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
int num_vcpus = atomic_read(&kvm->online_vcpus);
i = array_index_nospec(i, num_vcpus);
return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
(atomic_read(&kvm->online_vcpus) - 1))
In both cases, such online_vcpus=0 edge case, even if leading to
unnecessary calls to XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-10 14:04:09 +00:00
|
|
|
if (r < 0)
|
|
|
|
goto kvm_put_xa_release;
|
|
|
|
|
2023-06-05 11:44:19 +00:00
|
|
|
if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
|
KVM: Fix vcpu_array[0] races
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
int num_vcpus = atomic_read(&kvm->online_vcpus);
i = array_index_nospec(i, num_vcpus);
return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
(atomic_read(&kvm->online_vcpus) - 1))
In both cases, such online_vcpus=0 edge case, even if leading to
unnecessary calls to XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-10 14:04:09 +00:00
|
|
|
r = -EINVAL;
|
|
|
|
goto kvm_put_xa_release;
|
2009-06-09 12:56:28 +00:00
|
|
|
}
|
|
|
|
|
2015-07-29 09:32:20 +00:00
|
|
|
/*
|
2021-11-16 16:04:01 +00:00
|
|
|
* Pairs with smp_rmb() in kvm_get_vcpu. Store the vcpu
|
|
|
|
* pointer before kvm->online_vcpu's incremented value.
|
2015-07-29 09:32:20 +00:00
|
|
|
*/
|
2009-06-09 12:56:28 +00:00
|
|
|
smp_wmb();
|
|
|
|
atomic_inc(&kvm->online_vcpus);
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->lock);
|
2012-11-28 01:29:02 +00:00
|
|
|
kvm_arch_vcpu_postcreate(vcpu);
|
2020-03-31 22:42:22 +00:00
|
|
|
kvm_create_vcpu_debugfs(vcpu);
|
2007-07-27 07:16:56 +00:00
|
|
|
return r;
|
2007-06-07 16:11:53 +00:00
|
|
|
|
KVM: Fix vcpu_array[0] races
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
int num_vcpus = atomic_read(&kvm->online_vcpus);
i = array_index_nospec(i, num_vcpus);
return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
(atomic_read(&kvm->online_vcpus) - 1))
In both cases, such online_vcpus=0 edge case, even if leading to
unnecessary calls to XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-10 14:04:09 +00:00
|
|
|
kvm_put_xa_release:
|
|
|
|
kvm_put_kvm_no_destroy(kvm);
|
|
|
|
xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
|
2011-05-23 08:33:05 +00:00
|
|
|
unlock_vcpu_destroy:
|
2008-09-18 02:16:59 +00:00
|
|
|
mutex_unlock(&kvm->lock);
|
2020-10-01 01:22:22 +00:00
|
|
|
kvm_dirty_ring_free(&vcpu->dirty_ring);
|
|
|
|
arch_vcpu_destroy:
|
2007-11-19 20:04:43 +00:00
|
|
|
kvm_arch_vcpu_destroy(vcpu);
|
2019-12-18 21:55:30 +00:00
|
|
|
vcpu_free_run_page:
|
|
|
|
free_page((unsigned long)vcpu->run);
|
2019-12-18 21:55:15 +00:00
|
|
|
vcpu_free:
|
|
|
|
kmem_cache_free(kvm_vcpu_cache, vcpu);
|
2016-06-13 12:48:25 +00:00
|
|
|
vcpu_decrement:
|
|
|
|
mutex_lock(&kvm->lock);
|
|
|
|
kvm->created_vcpus--;
|
|
|
|
mutex_unlock(&kvm->lock);
|
2007-02-20 16:41:05 +00:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2007-03-05 17:46:05 +00:00
|
|
|
static int kvm_vcpu_ioctl_set_sigmask(struct kvm_vcpu *vcpu, sigset_t *sigset)
|
|
|
|
{
|
|
|
|
if (sigset) {
|
|
|
|
sigdelsetmask(sigset, sigmask(SIGKILL)|sigmask(SIGSTOP));
|
|
|
|
vcpu->sigset_active = 1;
|
|
|
|
vcpu->sigset = *sigset;
|
|
|
|
} else
|
|
|
|
vcpu->sigset_active = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-06-18 22:27:06 +00:00
|
|
|
static ssize_t kvm_vcpu_stats_read(struct file *file, char __user *user_buffer,
|
|
|
|
size_t size, loff_t *offset)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = file->private_data;
|
|
|
|
|
|
|
|
return kvm_stats_read(vcpu->stats_id, &kvm_vcpu_stats_header,
|
|
|
|
&kvm_vcpu_stats_desc[0], &vcpu->stat,
|
|
|
|
sizeof(vcpu->stat), user_buffer, size, offset);
|
|
|
|
}
|
|
|
|
|
2023-07-11 23:01:25 +00:00
|
|
|
static int kvm_vcpu_stats_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = file->private_data;
|
|
|
|
|
|
|
|
kvm_put_kvm(vcpu->kvm);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-06-18 22:27:06 +00:00
|
|
|
static const struct file_operations kvm_vcpu_stats_fops = {
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
.owner = THIS_MODULE,
|
2021-06-18 22:27:06 +00:00
|
|
|
.read = kvm_vcpu_stats_read,
|
2023-07-11 23:01:25 +00:00
|
|
|
.release = kvm_vcpu_stats_release,
|
2021-06-18 22:27:06 +00:00
|
|
|
.llseek = noop_llseek,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
struct file *file;
|
|
|
|
char name[15 + ITOA_MAX_LEN + 1];
|
|
|
|
|
|
|
|
snprintf(name, sizeof(name), "kvm-vcpu-stats:%d", vcpu->vcpu_id);
|
|
|
|
|
|
|
|
fd = get_unused_fd_flags(O_CLOEXEC);
|
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
|
|
|
|
|
|
|
file = anon_inode_getfile(name, &kvm_vcpu_stats_fops, vcpu, O_RDONLY);
|
|
|
|
if (IS_ERR(file)) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
return PTR_ERR(file);
|
|
|
|
}
|
2023-07-11 23:01:25 +00:00
|
|
|
|
|
|
|
kvm_get_kvm(vcpu->kvm);
|
|
|
|
|
2021-06-18 22:27:06 +00:00
|
|
|
file->f_mode |= FMODE_PREAD;
|
|
|
|
fd_install(fd, file);
|
|
|
|
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
|
2007-02-21 16:04:26 +00:00
|
|
|
static long kvm_vcpu_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2007-02-21 16:04:26 +00:00
|
|
|
struct kvm_vcpu *vcpu = filp->private_data;
|
2007-02-09 16:38:35 +00:00
|
|
|
void __user *argp = (void __user *)arg;
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
int r;
|
2008-08-11 17:01:46 +00:00
|
|
|
struct kvm_fpu *fpu = NULL;
|
|
|
|
struct kvm_sregs *kvm_sregs = NULL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
|
2007-11-21 14:41:05 +00:00
|
|
|
return -EIO;
|
2010-05-13 08:25:04 +00:00
|
|
|
|
2014-09-19 23:03:25 +00:00
|
|
|
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2010-05-13 08:25:04 +00:00
|
|
|
/*
|
2017-12-12 16:41:34 +00:00
|
|
|
* Some architectures have vcpu ioctls that are asynchronous to vcpu
|
|
|
|
* execution; mutex_lock() would break them.
|
2010-05-13 08:25:04 +00:00
|
|
|
*/
|
2017-12-12 16:41:34 +00:00
|
|
|
r = kvm_arch_vcpu_async_ioctl(filp, ioctl, arg);
|
|
|
|
if (r != -ENOIOCTLCMD)
|
2012-09-16 08:50:30 +00:00
|
|
|
return r;
|
2010-05-13 08:25:04 +00:00
|
|
|
|
2017-12-04 20:35:23 +00:00
|
|
|
if (mutex_lock_killable(&vcpu->mutex))
|
|
|
|
return -EINTR;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
switch (ioctl) {
|
2017-07-06 12:44:28 +00:00
|
|
|
case KVM_RUN: {
|
|
|
|
struct pid *oldpid;
|
2007-03-07 11:11:17 +00:00
|
|
|
r = -EINVAL;
|
|
|
|
if (arg)
|
|
|
|
goto out;
|
2017-07-06 12:44:28 +00:00
|
|
|
oldpid = rcu_access_pointer(vcpu->pid);
|
2017-07-17 02:39:32 +00:00
|
|
|
if (unlikely(oldpid != task_pid(current))) {
|
2014-08-05 14:44:14 +00:00
|
|
|
/* The thread running this VCPU changed. */
|
2018-02-23 16:23:57 +00:00
|
|
|
struct pid *newpid;
|
2015-02-26 06:58:23 +00:00
|
|
|
|
2018-02-23 16:23:57 +00:00
|
|
|
r = kvm_arch_vcpu_run_pid_change(vcpu);
|
|
|
|
if (r)
|
|
|
|
break;
|
|
|
|
|
|
|
|
newpid = get_task_pid(current, PIDTYPE_PID);
|
2014-08-05 14:44:14 +00:00
|
|
|
rcu_assign_pointer(vcpu->pid, newpid);
|
|
|
|
if (oldpid)
|
|
|
|
synchronize_rcu();
|
|
|
|
put_pid(oldpid);
|
|
|
|
}
|
2024-05-03 18:17:33 +00:00
|
|
|
vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit__unsafe);
|
2020-04-16 05:10:57 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_run(vcpu);
|
2024-05-03 18:17:32 +00:00
|
|
|
vcpu->wants_to_run = false;
|
|
|
|
|
2010-10-24 14:49:08 +00:00
|
|
|
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
2017-07-06 12:44:28 +00:00
|
|
|
}
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
case KVM_GET_REGS: {
|
2008-02-25 10:52:20 +00:00
|
|
|
struct kvm_regs *kvm_regs;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2008-02-25 10:52:20 +00:00
|
|
|
r = -ENOMEM;
|
2019-02-11 19:02:49 +00:00
|
|
|
kvm_regs = kzalloc(sizeof(struct kvm_regs), GFP_KERNEL_ACCOUNT);
|
2008-02-25 10:52:20 +00:00
|
|
|
if (!kvm_regs)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2008-02-25 10:52:20 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_get_regs(vcpu, kvm_regs);
|
|
|
|
if (r)
|
|
|
|
goto out_free1;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
r = -EFAULT;
|
2008-02-25 10:52:20 +00:00
|
|
|
if (copy_to_user(argp, kvm_regs, sizeof(struct kvm_regs)))
|
|
|
|
goto out_free1;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
r = 0;
|
2008-02-25 10:52:20 +00:00
|
|
|
out_free1:
|
|
|
|
kfree(kvm_regs);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_SET_REGS: {
|
2008-02-25 10:52:20 +00:00
|
|
|
struct kvm_regs *kvm_regs;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2011-12-04 17:36:29 +00:00
|
|
|
kvm_regs = memdup_user(argp, sizeof(*kvm_regs));
|
|
|
|
if (IS_ERR(kvm_regs)) {
|
|
|
|
r = PTR_ERR(kvm_regs);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2011-12-04 17:36:29 +00:00
|
|
|
}
|
2008-02-25 10:52:20 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_set_regs(vcpu, kvm_regs);
|
|
|
|
kfree(kvm_regs);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_GET_SREGS: {
|
2019-02-11 19:02:49 +00:00
|
|
|
kvm_sregs = kzalloc(sizeof(struct kvm_sregs),
|
|
|
|
GFP_KERNEL_ACCOUNT);
|
2008-08-11 17:01:46 +00:00
|
|
|
r = -ENOMEM;
|
|
|
|
if (!kvm_sregs)
|
|
|
|
goto out;
|
|
|
|
r = kvm_arch_vcpu_ioctl_get_sregs(vcpu, kvm_sregs);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2008-08-11 17:01:46 +00:00
|
|
|
if (copy_to_user(argp, kvm_sregs, sizeof(struct kvm_sregs)))
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_SET_SREGS: {
|
2011-12-04 17:36:29 +00:00
|
|
|
kvm_sregs = memdup_user(argp, sizeof(*kvm_sregs));
|
|
|
|
if (IS_ERR(kvm_sregs)) {
|
|
|
|
r = PTR_ERR(kvm_sregs);
|
2012-11-02 10:33:21 +00:00
|
|
|
kvm_sregs = NULL;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2011-12-04 17:36:29 +00:00
|
|
|
}
|
2008-08-11 17:01:46 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_set_sregs(vcpu, kvm_sregs);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
2008-04-11 16:24:45 +00:00
|
|
|
case KVM_GET_MP_STATE: {
|
|
|
|
struct kvm_mp_state mp_state;
|
|
|
|
|
|
|
|
r = kvm_arch_vcpu_ioctl_get_mpstate(vcpu, &mp_state);
|
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_to_user(argp, &mp_state, sizeof(mp_state)))
|
2008-04-11 16:24:45 +00:00
|
|
|
goto out;
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_SET_MP_STATE: {
|
|
|
|
struct kvm_mp_state mp_state;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&mp_state, argp, sizeof(mp_state)))
|
2008-04-11 16:24:45 +00:00
|
|
|
goto out;
|
|
|
|
r = kvm_arch_vcpu_ioctl_set_mpstate(vcpu, &mp_state);
|
|
|
|
break;
|
|
|
|
}
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
case KVM_TRANSLATE: {
|
|
|
|
struct kvm_translation tr;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&tr, argp, sizeof(tr)))
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2007-11-16 05:05:55 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_translate(vcpu, &tr);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_to_user(argp, &tr, sizeof(tr)))
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
2008-12-15 12:52:10 +00:00
|
|
|
case KVM_SET_GUEST_DEBUG: {
|
|
|
|
struct kvm_guest_debug dbg;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&dbg, argp, sizeof(dbg)))
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2008-12-15 12:52:10 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
2007-03-05 17:46:05 +00:00
|
|
|
case KVM_SET_SIGNAL_MASK: {
|
|
|
|
struct kvm_signal_mask __user *sigmask_arg = argp;
|
|
|
|
struct kvm_signal_mask kvm_sigmask;
|
|
|
|
sigset_t sigset, *p;
|
|
|
|
|
|
|
|
p = NULL;
|
|
|
|
if (argp) {
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&kvm_sigmask, argp,
|
2015-02-26 06:58:19 +00:00
|
|
|
sizeof(kvm_sigmask)))
|
2007-03-05 17:46:05 +00:00
|
|
|
goto out;
|
|
|
|
r = -EINVAL;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (kvm_sigmask.len != sizeof(sigset))
|
2007-03-05 17:46:05 +00:00
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&sigset, sigmask_arg->sigset,
|
2015-02-26 06:58:19 +00:00
|
|
|
sizeof(sigset)))
|
2007-03-05 17:46:05 +00:00
|
|
|
goto out;
|
|
|
|
p = &sigset;
|
|
|
|
}
|
2010-06-10 11:10:47 +00:00
|
|
|
r = kvm_vcpu_ioctl_set_sigmask(vcpu, p);
|
2007-03-05 17:46:05 +00:00
|
|
|
break;
|
|
|
|
}
|
2007-04-01 13:34:31 +00:00
|
|
|
case KVM_GET_FPU: {
|
2019-02-11 19:02:49 +00:00
|
|
|
fpu = kzalloc(sizeof(struct kvm_fpu), GFP_KERNEL_ACCOUNT);
|
2008-08-11 17:01:46 +00:00
|
|
|
r = -ENOMEM;
|
|
|
|
if (!fpu)
|
|
|
|
goto out;
|
|
|
|
r = kvm_arch_vcpu_ioctl_get_fpu(vcpu, fpu);
|
2007-04-01 13:34:31 +00:00
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2008-08-11 17:01:46 +00:00
|
|
|
if (copy_to_user(argp, fpu, sizeof(struct kvm_fpu)))
|
2007-04-01 13:34:31 +00:00
|
|
|
goto out;
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_SET_FPU: {
|
2011-12-04 17:36:29 +00:00
|
|
|
fpu = memdup_user(argp, sizeof(*fpu));
|
|
|
|
if (IS_ERR(fpu)) {
|
|
|
|
r = PTR_ERR(fpu);
|
2012-11-02 10:33:21 +00:00
|
|
|
fpu = NULL;
|
2007-04-01 13:34:31 +00:00
|
|
|
goto out;
|
2011-12-04 17:36:29 +00:00
|
|
|
}
|
2008-08-11 17:01:46 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl_set_fpu(vcpu, fpu);
|
2007-04-01 13:34:31 +00:00
|
|
|
break;
|
|
|
|
}
|
2021-06-18 22:27:06 +00:00
|
|
|
case KVM_GET_STATS_FD: {
|
|
|
|
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
|
|
|
|
break;
|
|
|
|
}
|
2007-02-21 16:04:26 +00:00
|
|
|
default:
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
|
2007-02-21 16:04:26 +00:00
|
|
|
}
|
|
|
|
out:
|
2017-12-04 20:35:23 +00:00
|
|
|
mutex_unlock(&vcpu->mutex);
|
2008-08-11 17:01:46 +00:00
|
|
|
kfree(fpu);
|
|
|
|
kfree(kvm_sregs);
|
2007-02-21 16:04:26 +00:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2015-02-03 08:35:15 +00:00
|
|
|
#ifdef CONFIG_KVM_COMPAT
|
2011-06-08 00:45:37 +00:00
|
|
|
static long kvm_vcpu_compat_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = filp->private_data;
|
|
|
|
void __user *argp = compat_ptr(arg);
|
|
|
|
int r;
|
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
|
2011-06-08 00:45:37 +00:00
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
switch (ioctl) {
|
|
|
|
case KVM_SET_SIGNAL_MASK: {
|
|
|
|
struct kvm_signal_mask __user *sigmask_arg = argp;
|
|
|
|
struct kvm_signal_mask kvm_sigmask;
|
|
|
|
sigset_t sigset;
|
|
|
|
|
|
|
|
if (argp) {
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&kvm_sigmask, argp,
|
2015-02-26 06:58:19 +00:00
|
|
|
sizeof(kvm_sigmask)))
|
2011-06-08 00:45:37 +00:00
|
|
|
goto out;
|
|
|
|
r = -EINVAL;
|
2017-09-04 01:45:17 +00:00
|
|
|
if (kvm_sigmask.len != sizeof(compat_sigset_t))
|
2011-06-08 00:45:37 +00:00
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2020-07-02 09:39:31 +00:00
|
|
|
if (get_compat_sigset(&sigset,
|
|
|
|
(compat_sigset_t __user *)sigmask_arg->sigset))
|
2011-06-08 00:45:37 +00:00
|
|
|
goto out;
|
2012-08-22 13:34:11 +00:00
|
|
|
r = kvm_vcpu_ioctl_set_sigmask(vcpu, &sigset);
|
|
|
|
} else
|
|
|
|
r = kvm_vcpu_ioctl_set_sigmask(vcpu, NULL);
|
2011-06-08 00:45:37 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
r = kvm_vcpu_ioctl(filp, ioctl, arg);
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-04-18 10:39:36 +00:00
|
|
|
static int kvm_device_mmap(struct file *filp, struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
struct kvm_device *dev = filp->private_data;
|
|
|
|
|
|
|
|
if (dev->ops->mmap)
|
|
|
|
return dev->ops->mmap(dev, vma);
|
|
|
|
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
static int kvm_device_ioctl_attr(struct kvm_device *dev,
|
|
|
|
int (*accessor)(struct kvm_device *dev,
|
|
|
|
struct kvm_device_attr *attr),
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
struct kvm_device_attr attr;
|
|
|
|
|
|
|
|
if (!accessor)
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return accessor(dev, &attr);
|
|
|
|
}
|
|
|
|
|
|
|
|
static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
struct kvm_device *dev = filp->private_data;
|
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
|
2019-02-15 20:48:39 +00:00
|
|
|
return -EIO;
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
switch (ioctl) {
|
|
|
|
case KVM_SET_DEVICE_ATTR:
|
|
|
|
return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
|
|
|
|
case KVM_GET_DEVICE_ATTR:
|
|
|
|
return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
|
|
|
|
case KVM_HAS_DEVICE_ATTR:
|
|
|
|
return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
|
|
|
|
default:
|
|
|
|
if (dev->ops->ioctl)
|
|
|
|
return dev->ops->ioctl(dev, ioctl, arg);
|
|
|
|
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_device_release(struct inode *inode, struct file *filp)
|
|
|
|
{
|
|
|
|
struct kvm_device *dev = filp->private_data;
|
|
|
|
struct kvm *kvm = dev->kvm;
|
|
|
|
|
2019-04-18 10:39:41 +00:00
|
|
|
if (dev->ops->release) {
|
|
|
|
mutex_lock(&kvm->lock);
|
2024-04-22 20:01:40 +00:00
|
|
|
list_del_rcu(&dev->vm_node);
|
|
|
|
synchronize_rcu();
|
2019-04-18 10:39:41 +00:00
|
|
|
dev->ops->release(dev);
|
|
|
|
mutex_unlock(&kvm->lock);
|
|
|
|
}
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
kvm_put_kvm(kvm);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
static struct file_operations kvm_device_fops = {
|
2013-04-12 14:08:42 +00:00
|
|
|
.unlocked_ioctl = kvm_device_ioctl,
|
|
|
|
.release = kvm_device_release,
|
2018-06-17 09:16:21 +00:00
|
|
|
KVM_COMPAT(kvm_device_ioctl),
|
2019-04-18 10:39:36 +00:00
|
|
|
.mmap = kvm_device_mmap,
|
2013-04-12 14:08:42 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
struct kvm_device *kvm_device_from_filp(struct file *filp)
|
|
|
|
{
|
|
|
|
if (filp->f_op != &kvm_device_fops)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return filp->private_data;
|
|
|
|
}
|
|
|
|
|
2019-10-21 15:28:19 +00:00
|
|
|
static const struct kvm_device_ops *kvm_device_ops_table[KVM_DEV_TYPE_MAX] = {
|
2013-04-12 14:08:46 +00:00
|
|
|
#ifdef CONFIG_KVM_MPIC
|
2014-09-02 09:27:33 +00:00
|
|
|
[KVM_DEV_TYPE_FSL_MPIC_20] = &kvm_mpic_ops,
|
|
|
|
[KVM_DEV_TYPE_FSL_MPIC_42] = &kvm_mpic_ops,
|
2013-04-27 00:28:37 +00:00
|
|
|
#endif
|
2014-09-02 09:27:33 +00:00
|
|
|
};
|
|
|
|
|
2019-10-21 15:28:19 +00:00
|
|
|
int kvm_register_device_ops(const struct kvm_device_ops *ops, u32 type)
|
2014-09-02 09:27:33 +00:00
|
|
|
{
|
|
|
|
if (type >= ARRAY_SIZE(kvm_device_ops_table))
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
if (kvm_device_ops_table[type] != NULL)
|
|
|
|
return -EEXIST;
|
|
|
|
|
|
|
|
kvm_device_ops_table[type] = ops;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-10-09 10:30:08 +00:00
|
|
|
void kvm_unregister_device_ops(u32 type)
|
|
|
|
{
|
|
|
|
if (kvm_device_ops_table[type] != NULL)
|
|
|
|
kvm_device_ops_table[type] = NULL;
|
|
|
|
}
|
|
|
|
|
2013-04-12 14:08:42 +00:00
|
|
|
static int kvm_ioctl_create_device(struct kvm *kvm,
|
|
|
|
struct kvm_create_device *cd)
|
|
|
|
{
|
2022-08-19 02:15:35 +00:00
|
|
|
const struct kvm_device_ops *ops;
|
2013-04-12 14:08:42 +00:00
|
|
|
struct kvm_device *dev;
|
|
|
|
bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
|
2019-04-11 09:16:47 +00:00
|
|
|
int type;
|
2013-04-12 14:08:42 +00:00
|
|
|
int ret;
|
|
|
|
|
2014-09-02 09:27:33 +00:00
|
|
|
if (cd->type >= ARRAY_SIZE(kvm_device_ops_table))
|
|
|
|
return -ENODEV;
|
|
|
|
|
2019-04-11 09:16:47 +00:00
|
|
|
type = array_index_nospec(cd->type, ARRAY_SIZE(kvm_device_ops_table));
|
|
|
|
ops = kvm_device_ops_table[type];
|
2014-09-02 09:27:33 +00:00
|
|
|
if (ops == NULL)
|
2013-04-12 14:08:42 +00:00
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
if (test)
|
|
|
|
return 0;
|
|
|
|
|
2019-02-11 19:02:49 +00:00
|
|
|
dev = kzalloc(sizeof(*dev), GFP_KERNEL_ACCOUNT);
|
2013-04-12 14:08:42 +00:00
|
|
|
if (!dev)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
dev->ops = ops;
|
|
|
|
dev->kvm = kvm;
|
|
|
|
|
2016-08-09 17:13:01 +00:00
|
|
|
mutex_lock(&kvm->lock);
|
2019-04-11 09:16:47 +00:00
|
|
|
ret = ops->create(dev, type);
|
2013-04-12 14:08:42 +00:00
|
|
|
if (ret < 0) {
|
2016-08-09 17:13:01 +00:00
|
|
|
mutex_unlock(&kvm->lock);
|
2013-04-12 14:08:42 +00:00
|
|
|
kfree(dev);
|
|
|
|
return ret;
|
|
|
|
}
|
2024-04-22 20:01:40 +00:00
|
|
|
list_add_rcu(&dev->vm_node, &kvm->devices);
|
2016-08-09 17:13:01 +00:00
|
|
|
mutex_unlock(&kvm->lock);
|
2013-04-12 14:08:42 +00:00
|
|
|
|
2016-08-09 17:13:00 +00:00
|
|
|
if (ops->init)
|
|
|
|
ops->init(dev);
|
|
|
|
|
2019-01-26 00:54:33 +00:00
|
|
|
kvm_get_kvm(kvm);
|
2013-08-24 20:14:07 +00:00
|
|
|
ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
|
2013-04-12 14:08:42 +00:00
|
|
|
if (ret < 0) {
|
2019-10-21 22:58:42 +00:00
|
|
|
kvm_put_kvm_no_destroy(kvm);
|
2016-08-09 17:13:01 +00:00
|
|
|
mutex_lock(&kvm->lock);
|
2024-04-22 20:01:40 +00:00
|
|
|
list_del_rcu(&dev->vm_node);
|
|
|
|
synchronize_rcu();
|
2022-06-01 01:43:28 +00:00
|
|
|
if (ops->release)
|
|
|
|
ops->release(dev);
|
2016-08-09 17:13:01 +00:00
|
|
|
mutex_unlock(&kvm->lock);
|
2022-06-01 01:43:28 +00:00
|
|
|
if (ops->destroy)
|
|
|
|
ops->destroy(dev);
|
2013-04-12 14:08:42 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
cd->fd = ret;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-02-08 14:01:04 +00:00
|
|
|
static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
|
2014-07-14 16:33:08 +00:00
|
|
|
{
|
|
|
|
switch (arg) {
|
|
|
|
case KVM_CAP_USER_MEMORY:
|
2023-10-27 18:21:50 +00:00
|
|
|
case KVM_CAP_USER_MEMORY2:
|
2014-07-14 16:33:08 +00:00
|
|
|
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
|
|
|
|
case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
|
|
|
|
case KVM_CAP_INTERNAL_ERROR_DATA:
|
|
|
|
#ifdef CONFIG_HAVE_KVM_MSI
|
|
|
|
case KVM_CAP_SIGNAL_MSI:
|
|
|
|
#endif
|
2023-10-18 16:07:32 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2015-03-05 10:54:46 +00:00
|
|
|
case KVM_CAP_IRQFD:
|
2014-07-14 16:33:08 +00:00
|
|
|
#endif
|
2015-09-15 06:41:59 +00:00
|
|
|
case KVM_CAP_IOEVENTFD_ANY_LENGTH:
|
2014-07-14 16:33:08 +00:00
|
|
|
case KVM_CAP_CHECK_EXTENSION_VM:
|
2017-02-16 09:40:56 +00:00
|
|
|
case KVM_CAP_ENABLE_CAP_VM:
|
2020-04-17 22:14:46 +00:00
|
|
|
case KVM_CAP_HALT_POLL:
|
2014-07-14 16:33:08 +00:00
|
|
|
return 1;
|
2017-03-31 11:53:23 +00:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2017-03-31 11:53:22 +00:00
|
|
|
case KVM_CAP_COALESCED_MMIO:
|
|
|
|
return KVM_COALESCED_MMIO_PAGE_OFFSET;
|
2018-10-13 23:09:55 +00:00
|
|
|
case KVM_CAP_COALESCED_PIO:
|
|
|
|
return 1;
|
2017-03-31 11:53:22 +00:00
|
|
|
#endif
|
2020-02-27 01:32:27 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
|
|
|
case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2:
|
|
|
|
return KVM_DIRTY_LOG_MANUAL_CAPS;
|
|
|
|
#endif
|
2014-07-14 16:33:08 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
|
|
|
|
case KVM_CAP_IRQ_ROUTING:
|
|
|
|
return KVM_MAX_IRQ_ROUTES;
|
2015-05-17 15:30:37 +00:00
|
|
|
#endif
|
2023-10-27 18:22:04 +00:00
|
|
|
#if KVM_MAX_NR_ADDRESS_SPACES > 1
|
2015-05-17 15:30:37 +00:00
|
|
|
case KVM_CAP_MULTI_ADDRESS_SPACE:
|
2023-10-27 18:22:04 +00:00
|
|
|
if (kvm)
|
|
|
|
return kvm_arch_nr_memslot_as_ids(kvm);
|
|
|
|
return KVM_MAX_NR_ADDRESS_SPACES;
|
2014-07-14 16:33:08 +00:00
|
|
|
#endif
|
2019-03-28 16:24:03 +00:00
|
|
|
case KVM_CAP_NR_MEMSLOTS:
|
|
|
|
return KVM_USER_MEM_SLOTS;
|
2020-10-01 01:22:22 +00:00
|
|
|
case KVM_CAP_DIRTY_LOG_RING:
|
2022-09-26 14:51:16 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_DIRTY_RING_TSO
|
|
|
|
return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
|
|
|
|
#else
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
|
|
|
|
#ifdef CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL
|
2020-10-01 01:22:22 +00:00
|
|
|
return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
|
|
|
|
#else
|
|
|
|
return 0;
|
2022-11-10 10:49:10 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
|
|
|
|
case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
|
2020-10-01 01:22:22 +00:00
|
|
|
#endif
|
2021-06-18 22:27:06 +00:00
|
|
|
case KVM_CAP_BINARY_STATS_FD:
|
KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
member that at the time was unused. Unfortunately this extensibility
mechanism has several issues:
- x86 is not writing the member, so it would not be possible to use it
on x86 except for new events
- the member is not aligned to 64 bits, so the definition of the
uAPI struct is incorrect for 32- on 64-bit userspace. This is a
problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
usage of flags was only introduced in 5.18.
Since padding has to be introduced, place a new field in there
that tells if the flags field is valid. To allow further extensibility,
in fact, change flags to an array of 16 values, and store how many
of the values are valid. The availability of the new ndata field
is tied to a system capability; all architectures are changed to
fill in the field.
To avoid breaking compilation of userspace that was using the flags
field, provide a userspace-only union to overlap flags with data[0].
The new field is placed at the same offset for both 32- and 64-bit
userspace.
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-22 10:30:13 +00:00
|
|
|
case KVM_CAP_SYSTEM_EVENT_DATA:
|
2023-03-15 10:16:06 +00:00
|
|
|
case KVM_CAP_DEVICE_CTRL:
|
2021-06-18 22:27:06 +00:00
|
|
|
return 1;
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
case KVM_CAP_MEMORY_ATTRIBUTES:
|
|
|
|
return kvm_supported_mem_attributes(kvm);
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_KVM_PRIVATE_MEM
|
|
|
|
case KVM_CAP_GUEST_MEMFD:
|
|
|
|
return !kvm || kvm_arch_has_private_mem(kvm);
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#endif
|
2014-07-14 16:33:08 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return kvm_vm_ioctl_check_extension(kvm, arg);
|
|
|
|
}
|
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
|
|
|
|
{
|
|
|
|
int r;
|
|
|
|
|
|
|
|
if (!KVM_DIRTY_LOG_PAGE_OFFSET)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* the size should be power of 2 */
|
|
|
|
if (!size || (size & (size - 1)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* Should be bigger to keep the reserved entries, or a page */
|
|
|
|
if (size < kvm_dirty_ring_get_rsvd_entries() *
|
|
|
|
sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (size > KVM_DIRTY_RING_MAX_ENTRIES *
|
|
|
|
sizeof(struct kvm_dirty_gfn))
|
|
|
|
return -E2BIG;
|
|
|
|
|
|
|
|
/* We only allow it to set once */
|
|
|
|
if (kvm->dirty_ring_size)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->lock);
|
|
|
|
|
|
|
|
if (kvm->created_vcpus) {
|
|
|
|
/* We don't allow to change this value after vcpu created */
|
|
|
|
r = -EINVAL;
|
|
|
|
} else {
|
|
|
|
kvm->dirty_ring_size = size;
|
|
|
|
r = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->lock);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
|
|
|
|
{
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2020-10-01 01:22:22 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
int cleared = 0;
|
|
|
|
|
|
|
|
if (!kvm->dirty_ring_size)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
|
|
|
cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
if (cleared)
|
|
|
|
kvm_flush_remote_tlbs(kvm);
|
|
|
|
|
|
|
|
return cleared;
|
|
|
|
}
|
|
|
|
|
2017-02-16 09:40:56 +00:00
|
|
|
int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
|
|
|
|
struct kvm_enable_cap *cap)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2023-04-26 17:23:22 +00:00
|
|
|
bool kvm_are_all_memslots_empty(struct kvm *kvm)
|
2022-11-10 10:49:10 +00:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
lockdep_assert_held(&kvm->slots_lock);
|
|
|
|
|
2023-10-27 18:22:04 +00:00
|
|
|
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
|
2022-11-10 10:49:10 +00:00
|
|
|
if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
2023-04-26 17:23:22 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_are_all_memslots_empty);
|
2022-11-10 10:49:10 +00:00
|
|
|
|
2017-02-16 09:40:56 +00:00
|
|
|
static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
|
|
|
|
struct kvm_enable_cap *cap)
|
|
|
|
{
|
|
|
|
switch (cap->cap) {
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
2020-02-27 01:32:27 +00:00
|
|
|
case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2: {
|
|
|
|
u64 allowed_options = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE;
|
|
|
|
|
|
|
|
if (cap->args[0] & KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE)
|
|
|
|
allowed_options = KVM_DIRTY_LOG_MANUAL_CAPS;
|
|
|
|
|
|
|
|
if (cap->flags || (cap->args[0] & ~allowed_options))
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
return -EINVAL;
|
|
|
|
kvm->manual_dirty_log_protect = cap->args[0];
|
|
|
|
return 0;
|
2020-02-27 01:32:27 +00:00
|
|
|
}
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
#endif
|
2020-04-17 22:14:46 +00:00
|
|
|
case KVM_CAP_HALT_POLL: {
|
|
|
|
if (cap->flags || cap->args[0] != (unsigned int)cap->args[0])
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
kvm->max_halt_poll_ns = cap->args[0];
|
2022-11-17 00:16:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure kvm->override_halt_poll_ns does not become visible
|
|
|
|
* before kvm->max_halt_poll_ns.
|
|
|
|
*
|
|
|
|
* Pairs with the smp_rmb() in kvm_vcpu_max_halt_poll_ns().
|
|
|
|
*/
|
|
|
|
smp_wmb();
|
|
|
|
kvm->override_halt_poll_ns = true;
|
|
|
|
|
2020-04-17 22:14:46 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2020-10-01 01:22:22 +00:00
|
|
|
case KVM_CAP_DIRTY_LOG_RING:
|
2022-09-26 14:51:16 +00:00
|
|
|
case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
|
2022-10-31 00:36:15 +00:00
|
|
|
if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-10-01 01:22:22 +00:00
|
|
|
return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
|
2022-11-10 10:49:10 +00:00
|
|
|
case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
|
|
|
|
int r = -EINVAL;
|
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP) ||
|
|
|
|
!kvm->dirty_ring_size || cap->flags)
|
|
|
|
return r;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For simplicity, allow enabling ring+bitmap if and only if
|
|
|
|
* there are no memslots, e.g. to ensure all memslots allocate
|
|
|
|
* a bitmap after the capability is enabled.
|
|
|
|
*/
|
|
|
|
if (kvm_are_all_memslots_empty(kvm)) {
|
|
|
|
kvm->dirty_ring_with_bitmap = true;
|
|
|
|
r = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->slots_lock);
|
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
2017-02-16 09:40:56 +00:00
|
|
|
default:
|
|
|
|
return kvm_vm_ioctl_enable_cap(kvm, cap);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-06-18 22:27:05 +00:00
|
|
|
static ssize_t kvm_vm_stats_read(struct file *file, char __user *user_buffer,
|
|
|
|
size_t size, loff_t *offset)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = file->private_data;
|
|
|
|
|
|
|
|
return kvm_stats_read(kvm->stats_id, &kvm_vm_stats_header,
|
|
|
|
&kvm_vm_stats_desc[0], &kvm->stat,
|
|
|
|
sizeof(kvm->stat), user_buffer, size, offset);
|
|
|
|
}
|
|
|
|
|
2023-07-11 23:01:25 +00:00
|
|
|
static int kvm_vm_stats_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = file->private_data;
|
|
|
|
|
|
|
|
kvm_put_kvm(kvm);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-06-18 22:27:05 +00:00
|
|
|
static const struct file_operations kvm_vm_stats_fops = {
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
.owner = THIS_MODULE,
|
2021-06-18 22:27:05 +00:00
|
|
|
.read = kvm_vm_stats_read,
|
2023-07-11 23:01:25 +00:00
|
|
|
.release = kvm_vm_stats_release,
|
2021-06-18 22:27:05 +00:00
|
|
|
.llseek = noop_llseek,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
struct file *file;
|
|
|
|
|
|
|
|
fd = get_unused_fd_flags(O_CLOEXEC);
|
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
|
|
|
|
|
|
|
file = anon_inode_getfile("kvm-vm-stats",
|
|
|
|
&kvm_vm_stats_fops, kvm, O_RDONLY);
|
|
|
|
if (IS_ERR(file)) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
return PTR_ERR(file);
|
|
|
|
}
|
2023-07-11 23:01:25 +00:00
|
|
|
|
|
|
|
kvm_get_kvm(kvm);
|
|
|
|
|
2021-06-18 22:27:05 +00:00
|
|
|
file->f_mode |= FMODE_PREAD;
|
|
|
|
fd_install(fd, file);
|
|
|
|
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
|
2023-10-27 18:21:50 +00:00
|
|
|
#define SANITY_CHECK_MEM_REGION_FIELD(field) \
|
|
|
|
do { \
|
|
|
|
BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) != \
|
|
|
|
offsetof(struct kvm_userspace_memory_region2, field)); \
|
|
|
|
BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) != \
|
|
|
|
sizeof_field(struct kvm_userspace_memory_region2, field)); \
|
|
|
|
} while (0)
|
|
|
|
|
2007-02-21 16:04:26 +00:00
|
|
|
static long kvm_vm_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = filp->private_data;
|
|
|
|
void __user *argp = (void __user *)arg;
|
2007-10-29 15:08:35 +00:00
|
|
|
int r;
|
2007-02-21 16:04:26 +00:00
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
if (kvm->mm != current->mm || kvm->vm_dead)
|
2007-11-21 14:41:05 +00:00
|
|
|
return -EIO;
|
2007-02-21 16:04:26 +00:00
|
|
|
switch (ioctl) {
|
|
|
|
case KVM_CREATE_VCPU:
|
|
|
|
r = kvm_vm_ioctl_create_vcpu(kvm, arg);
|
|
|
|
break;
|
2017-02-16 09:40:56 +00:00
|
|
|
case KVM_ENABLE_CAP: {
|
|
|
|
struct kvm_enable_cap cap;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&cap, argp, sizeof(cap)))
|
|
|
|
goto out;
|
|
|
|
r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
|
|
|
|
break;
|
|
|
|
}
|
2023-10-27 18:21:50 +00:00
|
|
|
case KVM_SET_USER_MEMORY_REGION2:
|
2007-10-09 17:20:39 +00:00
|
|
|
case KVM_SET_USER_MEMORY_REGION: {
|
2023-10-27 18:21:50 +00:00
|
|
|
struct kvm_userspace_memory_region2 mem;
|
|
|
|
unsigned long size;
|
|
|
|
|
|
|
|
if (ioctl == KVM_SET_USER_MEMORY_REGION) {
|
|
|
|
/*
|
|
|
|
* Fields beyond struct kvm_userspace_memory_region shouldn't be
|
|
|
|
* accessed, but avoid leaking kernel memory in case of a bug.
|
|
|
|
*/
|
|
|
|
memset(&mem, 0, sizeof(mem));
|
|
|
|
size = sizeof(struct kvm_userspace_memory_region);
|
|
|
|
} else {
|
|
|
|
size = sizeof(struct kvm_userspace_memory_region2);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Ensure the common parts of the two structs are identical. */
|
|
|
|
SANITY_CHECK_MEM_REGION_FIELD(slot);
|
|
|
|
SANITY_CHECK_MEM_REGION_FIELD(flags);
|
|
|
|
SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
|
|
|
|
SANITY_CHECK_MEM_REGION_FIELD(memory_size);
|
|
|
|
SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
|
2007-10-09 17:20:39 +00:00
|
|
|
|
|
|
|
r = -EFAULT;
|
2023-10-27 18:21:50 +00:00
|
|
|
if (copy_from_user(&mem, argp, size))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = -EINVAL;
|
|
|
|
if (ioctl == KVM_SET_USER_MEMORY_REGION &&
|
|
|
|
(mem.flags & ~KVM_SET_USER_MEMORY_REGION_V1_FLAGS))
|
2007-10-09 17:20:39 +00:00
|
|
|
goto out;
|
|
|
|
|
2023-10-27 18:21:50 +00:00
|
|
|
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_GET_DIRTY_LOG: {
|
|
|
|
struct kvm_dirty_log log;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&log, argp, sizeof(log)))
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
goto out;
|
2007-02-20 16:27:58 +00:00
|
|
|
r = kvm_vm_ioctl_get_dirty_log(kvm, &log);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
break;
|
|
|
|
}
|
kvm: introduce manual dirty log reprotect
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-23 00:36:47 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
|
|
|
case KVM_CLEAR_DIRTY_LOG: {
|
|
|
|
struct kvm_clear_dirty_log log;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&log, argp, sizeof(log)))
|
|
|
|
goto out;
|
|
|
|
r = kvm_vm_ioctl_clear_dirty_log(kvm, &log);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
2017-03-31 11:53:23 +00:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2008-05-30 14:05:54 +00:00
|
|
|
case KVM_REGISTER_COALESCED_MMIO: {
|
|
|
|
struct kvm_coalesced_mmio_zone zone;
|
2015-02-26 06:58:23 +00:00
|
|
|
|
2008-05-30 14:05:54 +00:00
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&zone, argp, sizeof(zone)))
|
2008-05-30 14:05:54 +00:00
|
|
|
goto out;
|
|
|
|
r = kvm_vm_ioctl_register_coalesced_mmio(kvm, &zone);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_UNREGISTER_COALESCED_MMIO: {
|
|
|
|
struct kvm_coalesced_mmio_zone zone;
|
2015-02-26 06:58:23 +00:00
|
|
|
|
2008-05-30 14:05:54 +00:00
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&zone, argp, sizeof(zone)))
|
2008-05-30 14:05:54 +00:00
|
|
|
goto out;
|
|
|
|
r = kvm_vm_ioctl_unregister_coalesced_mmio(kvm, &zone);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
2009-05-20 14:30:49 +00:00
|
|
|
case KVM_IRQFD: {
|
|
|
|
struct kvm_irqfd data;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&data, argp, sizeof(data)))
|
2009-05-20 14:30:49 +00:00
|
|
|
goto out;
|
2012-06-29 15:56:08 +00:00
|
|
|
r = kvm_irqfd(kvm, &data);
|
2009-05-20 14:30:49 +00:00
|
|
|
break;
|
|
|
|
}
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible. All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
|
|
|
case KVM_IOEVENTFD: {
|
|
|
|
struct kvm_ioeventfd data;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&data, argp, sizeof(data)))
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible. All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
|
|
|
goto out;
|
|
|
|
r = kvm_ioeventfd(kvm, &data);
|
|
|
|
break;
|
|
|
|
}
|
2012-03-29 19:14:12 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_MSI
|
|
|
|
case KVM_SIGNAL_MSI: {
|
|
|
|
struct kvm_msi msi;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&msi, argp, sizeof(msi)))
|
2012-03-29 19:14:12 +00:00
|
|
|
goto out;
|
|
|
|
r = kvm_send_userspace_msi(kvm, &msi);
|
|
|
|
break;
|
|
|
|
}
|
2012-07-24 12:51:20 +00:00
|
|
|
#endif
|
|
|
|
#ifdef __KVM_HAVE_IRQ_LINE
|
|
|
|
case KVM_IRQ_LINE_STATUS:
|
|
|
|
case KVM_IRQ_LINE: {
|
|
|
|
struct kvm_irq_level irq_event;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_from_user(&irq_event, argp, sizeof(irq_event)))
|
2012-07-24 12:51:20 +00:00
|
|
|
goto out;
|
|
|
|
|
2013-04-11 11:21:40 +00:00
|
|
|
r = kvm_vm_ioctl_irq_line(kvm, &irq_event,
|
|
|
|
ioctl == KVM_IRQ_LINE_STATUS);
|
2012-07-24 12:51:20 +00:00
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (ioctl == KVM_IRQ_LINE_STATUS) {
|
2015-02-26 06:58:19 +00:00
|
|
|
if (copy_to_user(argp, &irq_event, sizeof(irq_event)))
|
2012-07-24 12:51:20 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
2009-06-09 12:56:28 +00:00
|
|
|
#endif
|
2013-04-15 19:12:53 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
|
|
|
|
case KVM_SET_GSI_ROUTING: {
|
|
|
|
struct kvm_irq_routing routing;
|
|
|
|
struct kvm_irq_routing __user *urouting;
|
2016-06-01 12:09:22 +00:00
|
|
|
struct kvm_irq_routing_entry *entries = NULL;
|
2013-04-15 19:12:53 +00:00
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&routing, argp, sizeof(routing)))
|
|
|
|
goto out;
|
|
|
|
r = -EINVAL;
|
2017-04-28 15:06:20 +00:00
|
|
|
if (!kvm_arch_can_set_irq_routing(kvm))
|
|
|
|
goto out;
|
kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES
These days, we experienced one guest crash with 8 cores and 3 disks,
with qemu error logs as bellow:
qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
And then we found one patch(bdf026317d) in qemu tree, which said
could fix this bug.
Execute the following script will reproduce the BUG quickly:
irq_affinity.sh
========================================================================
vda_irq_num=25
vdb_irq_num=27
while [ 1 ]
do
for irq in {1,2,4,8,10,20,40,80}
do
echo $irq > /proc/irq/$vda_irq_num/smp_affinity
echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
done
done
========================================================================
The following qemu log is added in the qemu code and is displayed when
this bug reproduced:
kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
irq_routes->nr: 1024, gsi_count: 1024.
That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
but in the kernel code when routes->nr >= 1024, will just return -EINVAL;
The nr is the number of the routing entries which is in of
[1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].
This patch fix the BUG above.
Cc: stable@vger.kernel.org
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-15 10:00:33 +00:00
|
|
|
if (routing.nr > KVM_MAX_IRQ_ROUTES)
|
2013-04-15 19:12:53 +00:00
|
|
|
goto out;
|
|
|
|
if (routing.flags)
|
|
|
|
goto out;
|
2016-06-01 12:09:22 +00:00
|
|
|
if (routing.nr) {
|
|
|
|
urouting = argp;
|
2023-11-02 18:15:26 +00:00
|
|
|
entries = vmemdup_array_user(urouting->entries,
|
|
|
|
routing.nr, sizeof(*entries));
|
2020-06-03 10:11:31 +00:00
|
|
|
if (IS_ERR(entries)) {
|
|
|
|
r = PTR_ERR(entries);
|
|
|
|
goto out;
|
|
|
|
}
|
2016-06-01 12:09:22 +00:00
|
|
|
}
|
2013-04-15 19:12:53 +00:00
|
|
|
r = kvm_set_irq_routing(kvm, entries, routing.nr,
|
|
|
|
routing.flags);
|
2020-06-03 10:11:31 +00:00
|
|
|
kvfree(entries);
|
2013-04-15 19:12:53 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
|
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 18:21:55 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
|
|
|
|
case KVM_SET_MEMORY_ATTRIBUTES: {
|
|
|
|
struct kvm_memory_attributes attrs;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&attrs, argp, sizeof(attrs)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
|
2013-04-12 14:08:42 +00:00
|
|
|
case KVM_CREATE_DEVICE: {
|
|
|
|
struct kvm_create_device cd;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&cd, argp, sizeof(cd)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = kvm_ioctl_create_device(kvm, &cd);
|
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_to_user(argp, &cd, sizeof(cd)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
2014-07-14 16:33:08 +00:00
|
|
|
case KVM_CHECK_EXTENSION:
|
|
|
|
r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
|
|
|
|
break;
|
2020-10-01 01:22:22 +00:00
|
|
|
case KVM_RESET_DIRTY_RINGS:
|
|
|
|
r = kvm_vm_ioctl_reset_dirty_pages(kvm);
|
|
|
|
break;
|
2021-06-18 22:27:05 +00:00
|
|
|
case KVM_GET_STATS_FD:
|
|
|
|
r = kvm_vm_ioctl_get_stats_fd(kvm);
|
|
|
|
break;
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
#ifdef CONFIG_KVM_PRIVATE_MEM
|
|
|
|
case KVM_CREATE_GUEST_MEMFD: {
|
|
|
|
struct kvm_create_guest_memfd guest_memfd;
|
|
|
|
|
|
|
|
r = -EFAULT;
|
|
|
|
if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
r = kvm_gmem_create(kvm, &guest_memfd);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
2007-02-21 17:28:04 +00:00
|
|
|
default:
|
2007-10-29 15:08:35 +00:00
|
|
|
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
|
2007-02-21 17:28:04 +00:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2015-02-03 08:35:15 +00:00
|
|
|
#ifdef CONFIG_KVM_COMPAT
|
2009-10-22 12:19:27 +00:00
|
|
|
struct compat_kvm_dirty_log {
|
|
|
|
__u32 slot;
|
|
|
|
__u32 padding1;
|
|
|
|
union {
|
|
|
|
compat_uptr_t dirty_bitmap; /* one bit per page */
|
|
|
|
__u64 padding2;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
2021-07-27 12:43:10 +00:00
|
|
|
struct compat_kvm_clear_dirty_log {
|
|
|
|
__u32 slot;
|
|
|
|
__u32 num_pages;
|
|
|
|
__u64 first_page;
|
|
|
|
union {
|
|
|
|
compat_uptr_t dirty_bitmap; /* one bit per page */
|
|
|
|
__u64 padding2;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
2022-10-17 18:45:39 +00:00
|
|
|
long __weak kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
2009-10-22 12:19:27 +00:00
|
|
|
static long kvm_vm_compat_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = filp->private_data;
|
|
|
|
int r;
|
|
|
|
|
2021-11-11 15:13:38 +00:00
|
|
|
if (kvm->mm != current->mm || kvm->vm_dead)
|
2009-10-22 12:19:27 +00:00
|
|
|
return -EIO;
|
2022-10-17 18:45:39 +00:00
|
|
|
|
|
|
|
r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
|
|
|
|
if (r != -ENOTTY)
|
|
|
|
return r;
|
|
|
|
|
2009-10-22 12:19:27 +00:00
|
|
|
switch (ioctl) {
|
2021-07-27 12:43:10 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
|
|
|
case KVM_CLEAR_DIRTY_LOG: {
|
|
|
|
struct compat_kvm_clear_dirty_log compat_log;
|
|
|
|
struct kvm_clear_dirty_log log;
|
|
|
|
|
|
|
|
if (copy_from_user(&compat_log, (void __user *)arg,
|
|
|
|
sizeof(compat_log)))
|
|
|
|
return -EFAULT;
|
|
|
|
log.slot = compat_log.slot;
|
|
|
|
log.num_pages = compat_log.num_pages;
|
|
|
|
log.first_page = compat_log.first_page;
|
|
|
|
log.padding2 = compat_log.padding2;
|
|
|
|
log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
|
|
|
|
|
|
|
|
r = kvm_vm_ioctl_clear_dirty_log(kvm, &log);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
2009-10-22 12:19:27 +00:00
|
|
|
case KVM_GET_DIRTY_LOG: {
|
|
|
|
struct compat_kvm_dirty_log compat_log;
|
|
|
|
struct kvm_dirty_log log;
|
|
|
|
|
|
|
|
if (copy_from_user(&compat_log, (void __user *)arg,
|
|
|
|
sizeof(compat_log)))
|
2017-01-22 10:30:21 +00:00
|
|
|
return -EFAULT;
|
2009-10-22 12:19:27 +00:00
|
|
|
log.slot = compat_log.slot;
|
|
|
|
log.padding1 = compat_log.padding1;
|
|
|
|
log.padding2 = compat_log.padding2;
|
|
|
|
log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
|
|
|
|
|
|
|
|
r = kvm_vm_ioctl_get_dirty_log(kvm, &log);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
r = kvm_vm_ioctl(filp, ioctl, arg);
|
|
|
|
}
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
static struct file_operations kvm_vm_fops = {
|
2007-02-21 17:28:04 +00:00
|
|
|
.release = kvm_vm_release,
|
|
|
|
.unlocked_ioctl = kvm_vm_ioctl,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 16:52:59 +00:00
|
|
|
.llseek = noop_llseek,
|
2018-06-17 09:16:21 +00:00
|
|
|
KVM_COMPAT(kvm_vm_compat_ioctl),
|
2007-02-21 17:28:04 +00:00
|
|
|
};
|
|
|
|
|
2021-04-08 22:32:14 +00:00
|
|
|
bool file_is_kvm(struct file *file)
|
|
|
|
{
|
|
|
|
return file && file->f_op == &kvm_vm_fops;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(file_is_kvm);
|
|
|
|
|
2012-01-04 09:25:20 +00:00
|
|
|
static int kvm_dev_ioctl_create_vm(unsigned long type)
|
2007-02-21 17:28:04 +00:00
|
|
|
{
|
2022-07-20 09:22:50 +00:00
|
|
|
char fdname[ITOA_MAX_LEN + 1];
|
2022-07-20 09:22:49 +00:00
|
|
|
int r, fd;
|
2007-02-21 17:28:04 +00:00
|
|
|
struct kvm *kvm;
|
2016-07-14 16:54:17 +00:00
|
|
|
struct file *file;
|
2007-02-21 17:28:04 +00:00
|
|
|
|
2022-07-20 09:22:49 +00:00
|
|
|
fd = get_unused_fd_flags(O_CLOEXEC);
|
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
|
|
|
|
2022-07-20 09:22:50 +00:00
|
|
|
snprintf(fdname, sizeof(fdname), "%d", fd);
|
|
|
|
|
2022-07-20 09:22:51 +00:00
|
|
|
kvm = kvm_create_vm(type, fdname);
|
2022-07-20 09:22:49 +00:00
|
|
|
if (IS_ERR(kvm)) {
|
|
|
|
r = PTR_ERR(kvm);
|
|
|
|
goto put_fd;
|
|
|
|
}
|
|
|
|
|
2016-07-14 16:54:17 +00:00
|
|
|
file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
|
|
|
|
if (IS_ERR(file)) {
|
2017-11-21 12:40:17 +00:00
|
|
|
r = PTR_ERR(file);
|
|
|
|
goto put_kvm;
|
2016-07-14 16:54:17 +00:00
|
|
|
}
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2017-06-27 13:45:09 +00:00
|
|
|
/*
|
|
|
|
* Don't call kvm_put_kvm anymore at this point; file->f_op is
|
|
|
|
* already set, with ->release() being kvm_vm_release(). In error
|
|
|
|
* cases it will be called by the final fput(file) and will take
|
|
|
|
* care of doing kvm_put_kvm(kvm).
|
|
|
|
*/
|
2017-07-12 15:56:44 +00:00
|
|
|
kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
|
2007-02-21 17:28:04 +00:00
|
|
|
|
2022-07-20 09:22:49 +00:00
|
|
|
fd_install(fd, file);
|
|
|
|
return fd;
|
2017-11-21 12:40:17 +00:00
|
|
|
|
|
|
|
put_kvm:
|
|
|
|
kvm_put_kvm(kvm);
|
2022-07-20 09:22:49 +00:00
|
|
|
put_fd:
|
|
|
|
put_unused_fd(fd);
|
2017-11-21 12:40:17 +00:00
|
|
|
return r;
|
2007-02-21 17:28:04 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static long kvm_dev_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg)
|
|
|
|
{
|
2023-02-08 14:01:04 +00:00
|
|
|
int r = -EINVAL;
|
2007-02-21 17:28:04 +00:00
|
|
|
|
|
|
|
switch (ioctl) {
|
|
|
|
case KVM_GET_API_VERSION:
|
2007-03-07 11:11:17 +00:00
|
|
|
if (arg)
|
|
|
|
goto out;
|
2007-02-21 17:28:04 +00:00
|
|
|
r = KVM_API_VERSION;
|
|
|
|
break;
|
|
|
|
case KVM_CREATE_VM:
|
2012-01-04 09:25:20 +00:00
|
|
|
r = kvm_dev_ioctl_create_vm(arg);
|
2007-02-21 17:28:04 +00:00
|
|
|
break;
|
2007-11-15 15:07:47 +00:00
|
|
|
case KVM_CHECK_EXTENSION:
|
2014-07-14 16:27:35 +00:00
|
|
|
r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
|
2007-03-01 15:56:20 +00:00
|
|
|
break;
|
2007-03-07 11:05:38 +00:00
|
|
|
case KVM_GET_VCPU_MMAP_SIZE:
|
|
|
|
if (arg)
|
|
|
|
goto out;
|
2008-01-24 13:13:08 +00:00
|
|
|
r = PAGE_SIZE; /* struct kvm_run */
|
|
|
|
#ifdef CONFIG_X86
|
|
|
|
r += PAGE_SIZE; /* pio data page */
|
2008-05-30 14:05:54 +00:00
|
|
|
#endif
|
2017-03-31 11:53:23 +00:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2008-05-30 14:05:54 +00:00
|
|
|
r += PAGE_SIZE; /* coalesced mmio ring page */
|
2008-01-24 13:13:08 +00:00
|
|
|
#endif
|
2007-03-07 11:05:38 +00:00
|
|
|
break;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
default:
|
2007-10-10 15:16:19 +00:00
|
|
|
return kvm_arch_dev_ioctl(filp, ioctl, arg);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct file_operations kvm_chardev_ops = {
|
|
|
|
.unlocked_ioctl = kvm_dev_ioctl,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 16:52:59 +00:00
|
|
|
.llseek = noop_llseek,
|
2018-06-17 09:16:21 +00:00
|
|
|
KVM_COMPAT(kvm_dev_ioctl),
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct miscdevice kvm_dev = {
|
2007-03-04 11:27:36 +00:00
|
|
|
KVM_MINOR,
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
"kvm",
|
|
|
|
&kvm_chardev_ops,
|
|
|
|
};
|
|
|
|
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
|
|
|
__visible bool kvm_rebooting;
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_rebooting);
|
|
|
|
|
|
|
|
static DEFINE_PER_CPU(bool, hardware_enabled);
|
|
|
|
static int kvm_usage_count;
|
|
|
|
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
static int __hardware_enable_nolock(void)
|
2007-05-24 10:03:52 +00:00
|
|
|
{
|
2022-11-30 23:09:30 +00:00
|
|
|
if (__this_cpu_read(hardware_enabled))
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
return 0;
|
2009-09-15 09:37:46 +00:00
|
|
|
|
2022-11-30 23:09:30 +00:00
|
|
|
if (kvm_arch_hardware_enable()) {
|
|
|
|
pr_info("kvm: enabling virtualization on CPU%d failed\n",
|
|
|
|
raw_smp_processor_id());
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
return -EIO;
|
2009-09-15 09:37:46 +00:00
|
|
|
}
|
2022-11-30 23:09:30 +00:00
|
|
|
|
|
|
|
__this_cpu_write(hardware_enabled, true);
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void hardware_enable_nolock(void *failed)
|
|
|
|
{
|
|
|
|
if (__hardware_enable_nolock())
|
|
|
|
atomic_inc(failed);
|
2007-05-24 10:03:52 +00:00
|
|
|
}
|
|
|
|
|
2022-11-30 23:09:25 +00:00
|
|
|
static int kvm_online_cpu(unsigned int cpu)
|
2010-11-16 08:37:41 +00:00
|
|
|
{
|
2022-11-30 23:09:25 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Abort the CPU online process if hardware virtualization cannot
|
|
|
|
* be enabled. Otherwise running VMs would encounter unrecoverable
|
|
|
|
* errors when scheduled to this CPU.
|
|
|
|
*/
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
if (kvm_usage_count)
|
|
|
|
ret = __hardware_enable_nolock();
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2022-11-30 23:09:25 +00:00
|
|
|
return ret;
|
2010-11-16 08:37:41 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void hardware_disable_nolock(void *junk)
|
2007-05-24 10:03:52 +00:00
|
|
|
{
|
2022-11-30 23:09:30 +00:00
|
|
|
/*
|
|
|
|
* Note, hardware_disable_all_nolock() tells all online CPUs to disable
|
|
|
|
* hardware, not just CPUs that successfully enabled hardware!
|
|
|
|
*/
|
|
|
|
if (!__this_cpu_read(hardware_enabled))
|
2007-05-24 10:03:52 +00:00
|
|
|
return;
|
2022-11-30 23:09:30 +00:00
|
|
|
|
2014-08-28 13:13:03 +00:00
|
|
|
kvm_arch_hardware_disable();
|
2022-11-30 23:09:30 +00:00
|
|
|
|
|
|
|
__this_cpu_write(hardware_enabled, false);
|
2007-05-24 10:03:52 +00:00
|
|
|
}
|
|
|
|
|
2022-11-30 23:09:25 +00:00
|
|
|
static int kvm_offline_cpu(unsigned int cpu)
|
2010-11-16 08:37:41 +00:00
|
|
|
{
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2013-09-10 10:57:17 +00:00
|
|
|
if (kvm_usage_count)
|
|
|
|
hardware_disable_nolock(NULL);
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2016-07-13 17:16:37 +00:00
|
|
|
return 0;
|
2010-11-16 08:37:41 +00:00
|
|
|
}
|
|
|
|
|
2009-09-15 09:37:46 +00:00
|
|
|
static void hardware_disable_all_nolock(void)
|
|
|
|
{
|
|
|
|
BUG_ON(!kvm_usage_count);
|
|
|
|
|
|
|
|
kvm_usage_count--;
|
|
|
|
if (!kvm_usage_count)
|
2010-11-16 08:37:41 +00:00
|
|
|
on_each_cpu(hardware_disable_nolock, NULL, 1);
|
2009-09-15 09:37:46 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void hardware_disable_all(void)
|
|
|
|
{
|
2022-11-30 23:09:26 +00:00
|
|
|
cpus_read_lock();
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2009-09-15 09:37:46 +00:00
|
|
|
hardware_disable_all_nolock();
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2022-11-30 23:09:26 +00:00
|
|
|
cpus_read_unlock();
|
2009-09-15 09:37:46 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int hardware_enable_all(void)
|
|
|
|
{
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
atomic_t failed = ATOMIC_INIT(0);
|
KVM: Don't enable hardware after a restart/shutdown is initiated
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.
Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace will gracefully teardown userspace before
triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.
Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-12 23:31:27 +00:00
|
|
|
int r;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not enable hardware virtualization if the system is going down.
|
|
|
|
* If userspace initiated a forced reboot, e.g. reboot -f, then it's
|
|
|
|
* possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
|
|
|
|
* after kvm_reboot() is called. Note, this relies on system_state
|
|
|
|
* being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
|
|
|
|
* hook instead of registering a dedicated reboot notifier (the latter
|
|
|
|
* runs before system_state is updated).
|
|
|
|
*/
|
|
|
|
if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
|
|
|
|
system_state == SYSTEM_RESTART)
|
|
|
|
return -EBUSY;
|
2009-09-15 09:37:46 +00:00
|
|
|
|
2022-11-30 23:09:26 +00:00
|
|
|
/*
|
|
|
|
* When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
|
|
|
|
* is called, and so on_each_cpu() between them includes the CPU that
|
|
|
|
* is being onlined. As a result, hardware_enable_nolock() may get
|
|
|
|
* invoked before kvm_online_cpu(), which also enables hardware if the
|
|
|
|
* usage count is non-zero. Disable CPU hotplug to avoid attempting to
|
|
|
|
* enable hardware multiple times.
|
|
|
|
*/
|
|
|
|
cpus_read_lock();
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2009-09-15 09:37:46 +00:00
|
|
|
|
KVM: Don't enable hardware after a restart/shutdown is initiated
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.
Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace will gracefully teardown userspace before
triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.
Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-12 23:31:27 +00:00
|
|
|
r = 0;
|
|
|
|
|
2009-09-15 09:37:46 +00:00
|
|
|
kvm_usage_count++;
|
|
|
|
if (kvm_usage_count == 1) {
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
on_each_cpu(hardware_enable_nolock, &failed, 1);
|
2009-09-15 09:37:46 +00:00
|
|
|
|
KVM: Make hardware_enable_failed a local variable in the "enable all" path
Rework detecting hardware enabling errors to use a local variable in the
"enable all" path to track whether or not enabling was successful across
all CPUs. Using a global variable complicates paths that enable hardware
only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
Opportunistically add a WARN if hardware enabling fails during
kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware.
The WARN is largely futile in the current code, as KVM BUG()s on spurious
faults on VMX instructions, e.g. attempting to run a vCPU on CPU if
hardware enabling fails will explode.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/x86.c:508!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0xa/0x10
Call Trace:
vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
vmx_vcpu_load+0x16/0x60 [kvm_intel]
kvm_arch_vcpu_load+0x32/0x1f0
vcpu_load+0x2f/0x40
kvm_arch_vcpu_ioctl_run+0x19/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
But, the WARN may provide a breadcrumb to understand what went awry, and
someday KVM may fix one or both of those bugs, e.g. by finding a way to
eat spurious faults no matter the context (easier said than done due to
side effects of certain operations, e.g. Intel's VMCLEAR).
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: rebase, WARN on failure in kvm_resume()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:31 +00:00
|
|
|
if (atomic_read(&failed)) {
|
2009-09-15 09:37:46 +00:00
|
|
|
hardware_disable_all_nolock();
|
|
|
|
r = -EBUSY;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
Previously, KVM hooked the STARTING phase, which is not allowed to sleep
and thus could not take kvm_lock (a mutex). This effectively allows the
task that's initiating hardware enabling/disabling to preempted and/or
migrated.
Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
is "raw" because hardware enabling/disabling needs to be atomic with
respect to migration is wrong on multiple fronts. First, while regular
spinlocks can be preempted, the task holding the lock cannot be migrated.
Second, preventing migration is not required. on_each_cpu() disables
preemption, which ensures that cpus_hardware_enabled correctly reflects
hardware state. The task may be preempted/migrated between bumping
kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
kvm_usage_count is still protected, e.g. other tasks that call
hardware_enable_all() will be blocked until the preempted/migrated owner
exits its critical section.
KVM does have lockless accesses to kvm_usage_count in the suspend/resume
flows, but those are safe because all tasks must be frozen prior to
suspending CPUs, and a task cannot be frozen while it holds one or more
locks (userspace tasks are frozen via a fake signal).
Preemption doesn't need to be explicitly disabled in the hotplug path.
The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
only cares about having a stable CPU, i.e. to ensure hardware is enabled
on the correct CPU. Lockep, i.e. check_preemption_disabled(), plays nice
with this state too, as is_percpu_thread() is true for the hotplug thread.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2022-11-30 23:09:26 +00:00
|
|
|
cpus_read_unlock();
|
2009-09-15 09:37:46 +00:00
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2023-05-12 23:31:26 +00:00
|
|
|
static void kvm_shutdown(void)
|
2007-07-17 13:17:55 +00:00
|
|
|
{
|
2009-04-29 03:09:04 +00:00
|
|
|
/*
|
2023-05-12 23:31:26 +00:00
|
|
|
* Disable hardware virtualization and set kvm_rebooting to indicate
|
|
|
|
* that KVM has asynchronously disabled hardware virtualization, i.e.
|
|
|
|
* that relevant errors and exceptions aren't entirely unexpected.
|
|
|
|
* Some flavors of hardware virtualization need to be disabled before
|
|
|
|
* transferring control to firmware (to perform shutdown/reboot), e.g.
|
|
|
|
* on x86, virtualization can block INIT interrupts, which are used by
|
|
|
|
* firmware to pull APs back under firmware control. Note, this path
|
|
|
|
* is used for both shutdown and reboot scenarios, i.e. neither name is
|
|
|
|
* 100% comprehensive.
|
2009-04-29 03:09:04 +00:00
|
|
|
*/
|
2015-02-26 06:58:26 +00:00
|
|
|
pr_info("kvm: exiting hardware virtualization\n");
|
2009-04-29 03:09:04 +00:00
|
|
|
kvm_rebooting = true;
|
2010-11-16 08:37:41 +00:00
|
|
|
on_each_cpu(hardware_disable_nolock, NULL, 1);
|
2007-07-17 13:17:55 +00:00
|
|
|
}
|
|
|
|
|
2022-11-30 23:09:32 +00:00
|
|
|
static int kvm_suspend(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Secondary CPUs and CPU hotplug are disabled across the suspend/resume
|
|
|
|
* callbacks, i.e. no need to acquire kvm_lock to ensure the usage count
|
|
|
|
* is stable. Assert that kvm_lock is not held to ensure the system
|
|
|
|
* isn't suspended while KVM is enabling hardware. Hardware enabling
|
|
|
|
* can be preempted, but the task cannot be frozen until it has dropped
|
|
|
|
* all locks (userspace tasks are frozen via a fake signal).
|
|
|
|
*/
|
|
|
|
lockdep_assert_not_held(&kvm_lock);
|
|
|
|
lockdep_assert_irqs_disabled();
|
|
|
|
|
|
|
|
if (kvm_usage_count)
|
|
|
|
hardware_disable_nolock(NULL);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_resume(void)
|
|
|
|
{
|
|
|
|
lockdep_assert_not_held(&kvm_lock);
|
|
|
|
lockdep_assert_irqs_disabled();
|
|
|
|
|
|
|
|
if (kvm_usage_count)
|
|
|
|
WARN_ON_ONCE(__hardware_enable_nolock());
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct syscore_ops kvm_syscore_ops = {
|
|
|
|
.suspend = kvm_suspend,
|
|
|
|
.resume = kvm_resume,
|
2023-05-12 23:31:26 +00:00
|
|
|
.shutdown = kvm_shutdown,
|
2022-11-30 23:09:32 +00:00
|
|
|
};
|
2022-11-30 23:09:33 +00:00
|
|
|
#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
|
|
|
|
static int hardware_enable_all(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void hardware_disable_all(void)
|
|
|
|
{
|
|
|
|
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
|
2022-11-30 23:09:32 +00:00
|
|
|
|
2023-02-07 12:37:12 +00:00
|
|
|
static void kvm_iodevice_destructor(struct kvm_io_device *dev)
|
|
|
|
{
|
|
|
|
if (dev->ops->destructor)
|
|
|
|
dev->ops->destructor(dev);
|
|
|
|
}
|
|
|
|
|
2009-12-23 16:35:24 +00:00
|
|
|
static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
|
2007-05-31 18:08:53 +00:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < bus->dev_count; i++) {
|
2011-07-27 13:00:48 +00:00
|
|
|
struct kvm_io_device *pos = bus->range[i].dev;
|
2007-05-31 18:08:53 +00:00
|
|
|
|
|
|
|
kvm_iodevice_destructor(pos);
|
|
|
|
}
|
2009-12-23 16:35:24 +00:00
|
|
|
kfree(bus);
|
2007-05-31 18:08:53 +00:00
|
|
|
}
|
|
|
|
|
2013-08-27 13:41:41 +00:00
|
|
|
static inline int kvm_io_bus_cmp(const struct kvm_io_range *r1,
|
2015-02-26 06:58:25 +00:00
|
|
|
const struct kvm_io_range *r2)
|
2011-07-27 13:00:48 +00:00
|
|
|
{
|
2015-09-15 06:41:57 +00:00
|
|
|
gpa_t addr1 = r1->addr;
|
|
|
|
gpa_t addr2 = r2->addr;
|
|
|
|
|
|
|
|
if (addr1 < addr2)
|
2011-07-27 13:00:48 +00:00
|
|
|
return -1;
|
2015-09-15 06:41:57 +00:00
|
|
|
|
|
|
|
/* If r2->len == 0, match the exact address. If r2->len != 0,
|
|
|
|
* accept any overlapping write. Any order is acceptable for
|
|
|
|
* overlapping ranges, because kvm_io_bus_get_first_dev ensures
|
|
|
|
* we process all of them.
|
|
|
|
*/
|
|
|
|
if (r2->len) {
|
|
|
|
addr1 += r1->len;
|
|
|
|
addr2 += r2->len;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (addr1 > addr2)
|
2011-07-27 13:00:48 +00:00
|
|
|
return 1;
|
2015-09-15 06:41:57 +00:00
|
|
|
|
2011-07-27 13:00:48 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-07-16 11:03:29 +00:00
|
|
|
static int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
|
|
|
|
{
|
2013-08-27 13:41:41 +00:00
|
|
|
return kvm_io_bus_cmp(p1, p2);
|
2013-07-16 11:03:29 +00:00
|
|
|
}
|
|
|
|
|
2013-04-05 19:20:30 +00:00
|
|
|
static int kvm_io_bus_get_first_dev(struct kvm_io_bus *bus,
|
2011-07-27 13:00:48 +00:00
|
|
|
gpa_t addr, int len)
|
|
|
|
{
|
|
|
|
struct kvm_io_range *range, key;
|
|
|
|
int off;
|
|
|
|
|
|
|
|
key = (struct kvm_io_range) {
|
|
|
|
.addr = addr,
|
|
|
|
.len = len,
|
|
|
|
};
|
|
|
|
|
|
|
|
range = bsearch(&key, bus->range, bus->dev_count,
|
|
|
|
sizeof(struct kvm_io_range), kvm_io_bus_sort_cmp);
|
|
|
|
if (range == NULL)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
off = range - bus->range;
|
|
|
|
|
2013-08-27 13:41:41 +00:00
|
|
|
while (off > 0 && kvm_io_bus_cmp(&key, &bus->range[off-1]) == 0)
|
2011-07-27 13:00:48 +00:00
|
|
|
off--;
|
|
|
|
|
|
|
|
return off;
|
|
|
|
}
|
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
static int __kvm_io_bus_write(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
|
2013-07-03 14:30:53 +00:00
|
|
|
struct kvm_io_range *range, const void *val)
|
|
|
|
{
|
|
|
|
int idx;
|
|
|
|
|
|
|
|
idx = kvm_io_bus_get_first_dev(bus, range->addr, range->len);
|
|
|
|
if (idx < 0)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
while (idx < bus->dev_count &&
|
2013-08-27 13:41:41 +00:00
|
|
|
kvm_io_bus_cmp(range, &bus->range[idx]) == 0) {
|
2015-03-26 14:39:28 +00:00
|
|
|
if (!kvm_iodevice_write(vcpu, bus->range[idx].dev, range->addr,
|
2013-07-03 14:30:53 +00:00
|
|
|
range->len, val))
|
|
|
|
return idx;
|
|
|
|
idx++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
2009-06-29 19:24:32 +00:00
|
|
|
/* kvm_io_bus_write - called under kvm->slots_lock */
|
2015-03-26 14:39:28 +00:00
|
|
|
int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
2009-06-29 19:24:32 +00:00
|
|
|
int len, const void *val)
|
2007-05-31 18:08:53 +00:00
|
|
|
{
|
2010-04-19 09:41:23 +00:00
|
|
|
struct kvm_io_bus *bus;
|
2011-07-27 13:00:48 +00:00
|
|
|
struct kvm_io_range range;
|
2013-07-03 14:30:53 +00:00
|
|
|
int r;
|
2011-07-27 13:00:48 +00:00
|
|
|
|
|
|
|
range = (struct kvm_io_range) {
|
|
|
|
.addr = addr,
|
|
|
|
.len = len,
|
|
|
|
};
|
2010-04-19 09:41:23 +00:00
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
|
2017-03-23 17:24:19 +00:00
|
|
|
if (!bus)
|
|
|
|
return -ENOMEM;
|
2015-03-26 14:39:28 +00:00
|
|
|
r = __kvm_io_bus_write(vcpu, bus, &range, val);
|
2013-07-03 14:30:53 +00:00
|
|
|
return r < 0 ? r : 0;
|
|
|
|
}
|
2019-02-22 08:10:09 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_io_bus_write);
|
2013-07-03 14:30:53 +00:00
|
|
|
|
|
|
|
/* kvm_io_bus_write_cookie - called under kvm->slots_lock */
|
2015-03-26 14:39:28 +00:00
|
|
|
int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr, int len, const void *val, long cookie)
|
2013-07-03 14:30:53 +00:00
|
|
|
{
|
|
|
|
struct kvm_io_bus *bus;
|
|
|
|
struct kvm_io_range range;
|
|
|
|
|
|
|
|
range = (struct kvm_io_range) {
|
|
|
|
.addr = addr,
|
|
|
|
.len = len,
|
|
|
|
};
|
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
|
2017-03-23 17:24:19 +00:00
|
|
|
if (!bus)
|
|
|
|
return -ENOMEM;
|
2013-07-03 14:30:53 +00:00
|
|
|
|
|
|
|
/* First try the device referenced by cookie. */
|
|
|
|
if ((cookie >= 0) && (cookie < bus->dev_count) &&
|
2013-08-27 13:41:41 +00:00
|
|
|
(kvm_io_bus_cmp(&range, &bus->range[cookie]) == 0))
|
2015-03-26 14:39:28 +00:00
|
|
|
if (!kvm_iodevice_write(vcpu, bus->range[cookie].dev, addr, len,
|
2013-07-03 14:30:53 +00:00
|
|
|
val))
|
|
|
|
return cookie;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cookie contained garbage; fall back to search and return the
|
|
|
|
* correct cookie value.
|
|
|
|
*/
|
2015-03-26 14:39:28 +00:00
|
|
|
return __kvm_io_bus_write(vcpu, bus, &range, val);
|
2013-07-03 14:30:53 +00:00
|
|
|
}
|
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
static int __kvm_io_bus_read(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
|
|
|
|
struct kvm_io_range *range, void *val)
|
2013-07-03 14:30:53 +00:00
|
|
|
{
|
|
|
|
int idx;
|
|
|
|
|
|
|
|
idx = kvm_io_bus_get_first_dev(bus, range->addr, range->len);
|
2011-07-27 13:00:48 +00:00
|
|
|
if (idx < 0)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
while (idx < bus->dev_count &&
|
2013-08-27 13:41:41 +00:00
|
|
|
kvm_io_bus_cmp(range, &bus->range[idx]) == 0) {
|
2015-03-26 14:39:28 +00:00
|
|
|
if (!kvm_iodevice_read(vcpu, bus->range[idx].dev, range->addr,
|
2013-07-03 14:30:53 +00:00
|
|
|
range->len, val))
|
|
|
|
return idx;
|
2011-07-27 13:00:48 +00:00
|
|
|
idx++;
|
|
|
|
}
|
|
|
|
|
2009-06-29 19:24:32 +00:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
2007-05-31 18:08:53 +00:00
|
|
|
|
2009-06-29 19:24:32 +00:00
|
|
|
/* kvm_io_bus_read - called under kvm->slots_lock */
|
2015-03-26 14:39:28 +00:00
|
|
|
int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
2009-12-23 16:35:24 +00:00
|
|
|
int len, void *val)
|
2009-06-29 19:24:32 +00:00
|
|
|
{
|
2010-04-19 09:41:23 +00:00
|
|
|
struct kvm_io_bus *bus;
|
2011-07-27 13:00:48 +00:00
|
|
|
struct kvm_io_range range;
|
2013-07-03 14:30:53 +00:00
|
|
|
int r;
|
2011-07-27 13:00:48 +00:00
|
|
|
|
|
|
|
range = (struct kvm_io_range) {
|
|
|
|
.addr = addr,
|
|
|
|
.len = len,
|
|
|
|
};
|
2009-12-23 16:35:24 +00:00
|
|
|
|
2015-03-26 14:39:28 +00:00
|
|
|
bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
|
2017-03-23 17:24:19 +00:00
|
|
|
if (!bus)
|
|
|
|
return -ENOMEM;
|
2015-03-26 14:39:28 +00:00
|
|
|
r = __kvm_io_bus_read(vcpu, bus, &range, val);
|
2013-07-03 14:30:53 +00:00
|
|
|
return r < 0 ? r : 0;
|
|
|
|
}
|
2011-07-27 13:00:48 +00:00
|
|
|
|
|
|
|
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
|
|
|
|
int len, struct kvm_io_device *dev)
|
2009-06-29 19:24:26 +00:00
|
|
|
{
|
2018-01-16 13:34:41 +00:00
|
|
|
int i;
|
2009-12-23 16:35:24 +00:00
|
|
|
struct kvm_io_bus *new_bus, *bus;
|
2018-01-16 13:34:41 +00:00
|
|
|
struct kvm_io_range range;
|
2009-07-07 21:08:44 +00:00
|
|
|
|
2023-12-07 15:12:01 +00:00
|
|
|
lockdep_assert_held(&kvm->slots_lock);
|
|
|
|
|
2017-07-07 08:51:38 +00:00
|
|
|
bus = kvm_get_bus(kvm, bus_idx);
|
2017-03-23 17:24:19 +00:00
|
|
|
if (!bus)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2013-05-24 22:44:15 +00:00
|
|
|
/* exclude ioeventfd which is limited by maximum fd */
|
|
|
|
if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
|
2009-07-07 21:08:44 +00:00
|
|
|
return -ENOSPC;
|
2007-05-31 18:08:53 +00:00
|
|
|
|
2019-01-30 16:07:47 +00:00
|
|
|
new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1),
|
2019-02-11 19:02:49 +00:00
|
|
|
GFP_KERNEL_ACCOUNT);
|
2009-12-23 16:35:24 +00:00
|
|
|
if (!new_bus)
|
|
|
|
return -ENOMEM;
|
2018-01-16 13:34:41 +00:00
|
|
|
|
|
|
|
range = (struct kvm_io_range) {
|
|
|
|
.addr = addr,
|
|
|
|
.len = len,
|
|
|
|
.dev = dev,
|
|
|
|
};
|
|
|
|
|
|
|
|
for (i = 0; i < bus->dev_count; i++)
|
|
|
|
if (kvm_io_bus_cmp(&bus->range[i], &range) > 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
memcpy(new_bus, bus, sizeof(*bus) + i * sizeof(struct kvm_io_range));
|
|
|
|
new_bus->dev_count++;
|
|
|
|
new_bus->range[i] = range;
|
|
|
|
memcpy(new_bus->range + i + 1, bus->range + i,
|
|
|
|
(bus->dev_count - i) * sizeof(struct kvm_io_range));
|
2009-12-23 16:35:24 +00:00
|
|
|
rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
|
|
|
|
synchronize_srcu_expedited(&kvm->srcu);
|
|
|
|
kfree(bus);
|
2009-07-07 21:08:44 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-04-12 22:20:49 +00:00
|
|
|
int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
struct kvm_io_device *dev)
|
2009-07-07 21:08:44 +00:00
|
|
|
{
|
2023-02-07 12:37:12 +00:00
|
|
|
int i;
|
2009-12-23 16:35:24 +00:00
|
|
|
struct kvm_io_bus *new_bus, *bus;
|
2009-07-07 21:08:44 +00:00
|
|
|
|
2021-04-12 22:20:50 +00:00
|
|
|
lockdep_assert_held(&kvm->slots_lock);
|
|
|
|
|
2017-07-07 08:51:38 +00:00
|
|
|
bus = kvm_get_bus(kvm, bus_idx);
|
2017-03-15 08:01:17 +00:00
|
|
|
if (!bus)
|
2021-04-12 22:20:49 +00:00
|
|
|
return 0;
|
2017-03-15 08:01:17 +00:00
|
|
|
|
2021-04-12 22:20:50 +00:00
|
|
|
for (i = 0; i < bus->dev_count; i++) {
|
2012-03-09 04:17:32 +00:00
|
|
|
if (bus->range[i].dev == dev) {
|
2009-07-07 21:08:44 +00:00
|
|
|
break;
|
|
|
|
}
|
2021-04-12 22:20:50 +00:00
|
|
|
}
|
2009-12-23 16:35:24 +00:00
|
|
|
|
2017-03-23 17:24:19 +00:00
|
|
|
if (i == bus->dev_count)
|
2021-04-12 22:20:49 +00:00
|
|
|
return 0;
|
2012-03-09 04:17:32 +00:00
|
|
|
|
2019-01-30 16:07:47 +00:00
|
|
|
new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1),
|
2019-02-11 19:02:49 +00:00
|
|
|
GFP_KERNEL_ACCOUNT);
|
2020-09-07 18:55:35 +00:00
|
|
|
if (new_bus) {
|
2020-09-18 12:05:00 +00:00
|
|
|
memcpy(new_bus, bus, struct_size(bus, range, i));
|
2020-09-07 18:55:35 +00:00
|
|
|
new_bus->dev_count--;
|
|
|
|
memcpy(new_bus->range + i, bus->range + i + 1,
|
2020-09-18 12:05:00 +00:00
|
|
|
flex_array_size(new_bus, range, new_bus->dev_count - i));
|
2021-04-12 22:20:48 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
|
|
|
|
synchronize_srcu_expedited(&kvm->srcu);
|
|
|
|
|
2023-02-07 12:37:12 +00:00
|
|
|
/*
|
|
|
|
* If NULL bus is installed, destroy the old bus, including all the
|
|
|
|
* attached devices. Otherwise, destroy the caller's device only.
|
|
|
|
*/
|
2021-04-12 22:20:48 +00:00
|
|
|
if (!new_bus) {
|
2017-03-23 17:24:19 +00:00
|
|
|
pr_err("kvm: failed to shrink bus, removing it completely\n");
|
2023-02-07 12:37:12 +00:00
|
|
|
kvm_io_bus_destroy(bus);
|
|
|
|
return -ENOMEM;
|
2017-03-23 17:24:19 +00:00
|
|
|
}
|
2012-03-09 04:17:32 +00:00
|
|
|
|
2023-02-07 12:37:12 +00:00
|
|
|
kvm_iodevice_destructor(dev);
|
2009-12-23 16:35:24 +00:00
|
|
|
kfree(bus);
|
2023-02-07 12:37:12 +00:00
|
|
|
return 0;
|
2007-05-31 18:08:53 +00:00
|
|
|
}
|
|
|
|
|
2016-07-15 11:43:26 +00:00
|
|
|
struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr)
|
|
|
|
{
|
|
|
|
struct kvm_io_bus *bus;
|
|
|
|
int dev_idx, srcu_idx;
|
|
|
|
struct kvm_io_device *iodev = NULL;
|
|
|
|
|
|
|
|
srcu_idx = srcu_read_lock(&kvm->srcu);
|
|
|
|
|
|
|
|
bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
|
2017-03-23 17:24:19 +00:00
|
|
|
if (!bus)
|
|
|
|
goto out_unlock;
|
2016-07-15 11:43:26 +00:00
|
|
|
|
|
|
|
dev_idx = kvm_io_bus_get_first_dev(bus, addr, 1);
|
|
|
|
if (dev_idx < 0)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
iodev = bus->range[dev_idx].dev;
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
srcu_read_unlock(&kvm->srcu, srcu_idx);
|
|
|
|
|
|
|
|
return iodev;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_io_bus_get_dev);
|
|
|
|
|
2016-05-18 11:26:23 +00:00
|
|
|
static int kvm_debugfs_open(struct inode *inode, struct file *file,
|
|
|
|
int (*get)(void *, u64 *), int (*set)(void *, u64),
|
|
|
|
const char *fmt)
|
|
|
|
{
|
2022-10-17 03:06:10 +00:00
|
|
|
int ret;
|
2022-12-13 08:02:36 +00:00
|
|
|
struct kvm_stat_data *stat_data = inode->i_private;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2021-06-25 15:32:07 +00:00
|
|
|
/*
|
|
|
|
* The debugfs files are a reference to the kvm struct which
|
|
|
|
* is still valid when kvm_destroy_vm is called. kvm_get_kvm_safe
|
|
|
|
* avoids the race between open and the removal of the debugfs directory.
|
2016-05-18 11:26:23 +00:00
|
|
|
*/
|
2021-06-25 15:32:07 +00:00
|
|
|
if (!kvm_get_kvm_safe(stat_data->kvm))
|
2016-05-18 11:26:23 +00:00
|
|
|
return -ENOENT;
|
|
|
|
|
2022-10-17 03:06:10 +00:00
|
|
|
ret = simple_attr_open(inode, file, get,
|
|
|
|
kvm_stats_debugfs_mode(stat_data->desc) & 0222
|
|
|
|
? set : NULL, fmt);
|
|
|
|
if (ret)
|
2016-05-18 11:26:23 +00:00
|
|
|
kvm_put_kvm(stat_data->kvm);
|
|
|
|
|
2022-10-17 03:06:10 +00:00
|
|
|
return ret;
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_debugfs_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
2022-12-13 08:02:36 +00:00
|
|
|
struct kvm_stat_data *stat_data = inode->i_private;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
|
|
|
simple_attr_release(inode, file);
|
|
|
|
kvm_put_kvm(stat_data->kvm);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_get_stat_per_vm(struct kvm *kvm, size_t offset, u64 *val)
|
2016-05-18 11:26:23 +00:00
|
|
|
{
|
2021-06-23 21:28:46 +00:00
|
|
|
*val = *(u64 *)((void *)(&kvm->stat) + offset);
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_clear_stat_per_vm(struct kvm *kvm, size_t offset)
|
|
|
|
{
|
2021-06-23 21:28:46 +00:00
|
|
|
*(u64 *)((void *)(&kvm->stat) + offset) = 0;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_get_stat_per_vcpu(struct kvm *kvm, size_t offset, u64 *val)
|
2016-10-19 02:49:47 +00:00
|
|
|
{
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2019-12-13 13:07:21 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
2016-10-19 02:49:47 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
*val = 0;
|
2016-10-19 02:49:47 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
2021-06-23 21:28:46 +00:00
|
|
|
*val += *(u64 *)((void *)(&vcpu->stat) + offset);
|
2016-10-19 02:49:47 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_clear_stat_per_vcpu(struct kvm *kvm, size_t offset)
|
2016-05-18 11:26:23 +00:00
|
|
|
{
|
2021-11-16 16:04:02 +00:00
|
|
|
unsigned long i;
|
2019-12-13 13:07:21 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
2021-06-23 21:28:46 +00:00
|
|
|
*(u64 *)((void *)(&vcpu->stat) + offset) = 0;
|
2019-12-13 13:07:21 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_stat_data_get(void *data, u64 *val)
|
2016-05-18 11:26:23 +00:00
|
|
|
{
|
2019-12-13 13:07:21 +00:00
|
|
|
int r = -EFAULT;
|
2022-12-13 08:02:36 +00:00
|
|
|
struct kvm_stat_data *stat_data = data;
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2021-06-23 21:28:46 +00:00
|
|
|
switch (stat_data->kind) {
|
2019-12-13 13:07:21 +00:00
|
|
|
case KVM_STAT_VM:
|
|
|
|
r = kvm_get_stat_per_vm(stat_data->kvm,
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc->desc.offset, val);
|
2019-12-13 13:07:21 +00:00
|
|
|
break;
|
|
|
|
case KVM_STAT_VCPU:
|
|
|
|
r = kvm_get_stat_per_vcpu(stat_data->kvm,
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc->desc.offset, val);
|
2019-12-13 13:07:21 +00:00
|
|
|
break;
|
|
|
|
}
|
2016-05-18 11:26:23 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
return r;
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_stat_data_clear(void *data, u64 val)
|
2016-10-19 02:49:47 +00:00
|
|
|
{
|
2019-12-13 13:07:21 +00:00
|
|
|
int r = -EFAULT;
|
2022-12-13 08:02:36 +00:00
|
|
|
struct kvm_stat_data *stat_data = data;
|
2016-10-19 02:49:47 +00:00
|
|
|
|
|
|
|
if (val)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2021-06-23 21:28:46 +00:00
|
|
|
switch (stat_data->kind) {
|
2019-12-13 13:07:21 +00:00
|
|
|
case KVM_STAT_VM:
|
|
|
|
r = kvm_clear_stat_per_vm(stat_data->kvm,
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc->desc.offset);
|
2019-12-13 13:07:21 +00:00
|
|
|
break;
|
|
|
|
case KVM_STAT_VCPU:
|
|
|
|
r = kvm_clear_stat_per_vcpu(stat_data->kvm,
|
2021-06-23 21:28:46 +00:00
|
|
|
stat_data->desc->desc.offset);
|
2019-12-13 13:07:21 +00:00
|
|
|
break;
|
|
|
|
}
|
2016-10-19 02:49:47 +00:00
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
return r;
|
2016-10-19 02:49:47 +00:00
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static int kvm_stat_data_open(struct inode *inode, struct file *file)
|
2016-05-18 11:26:23 +00:00
|
|
|
{
|
|
|
|
__simple_attr_check_format("%llu\n", 0ull);
|
2019-12-13 13:07:21 +00:00
|
|
|
return kvm_debugfs_open(inode, file, kvm_stat_data_get,
|
|
|
|
kvm_stat_data_clear, "%llu\n");
|
2016-05-18 11:26:23 +00:00
|
|
|
}
|
|
|
|
|
2019-12-13 13:07:21 +00:00
|
|
|
static const struct file_operations stat_fops_per_vm = {
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.open = kvm_stat_data_open,
|
2016-05-18 11:26:23 +00:00
|
|
|
.release = kvm_debugfs_release,
|
2019-12-13 13:07:21 +00:00
|
|
|
.read = simple_attr_read,
|
|
|
|
.write = simple_attr_write,
|
|
|
|
.llseek = no_llseek,
|
2016-05-18 11:26:23 +00:00
|
|
|
};
|
|
|
|
|
2008-02-08 12:20:26 +00:00
|
|
|
static int vm_stat_get(void *_offset, u64 *val)
|
2007-11-18 14:24:12 +00:00
|
|
|
{
|
|
|
|
unsigned offset = (long)_offset;
|
|
|
|
struct kvm *kvm;
|
2016-05-18 11:26:23 +00:00
|
|
|
u64 tmp_val;
|
2007-11-18 14:24:12 +00:00
|
|
|
|
2008-02-08 12:20:26 +00:00
|
|
|
*val = 0;
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2016-05-18 11:26:23 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_get_stat_per_vm(kvm, offset, &tmp_val);
|
2016-05-18 11:26:23 +00:00
|
|
|
*val += tmp_val;
|
|
|
|
}
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2008-02-08 12:20:26 +00:00
|
|
|
return 0;
|
2007-11-18 14:24:12 +00:00
|
|
|
}
|
|
|
|
|
2016-10-19 02:49:47 +00:00
|
|
|
static int vm_stat_clear(void *_offset, u64 val)
|
|
|
|
{
|
|
|
|
unsigned offset = (long)_offset;
|
|
|
|
struct kvm *kvm;
|
|
|
|
|
|
|
|
if (val)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2016-10-19 02:49:47 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_clear_stat_per_vm(kvm, offset);
|
2016-10-19 02:49:47 +00:00
|
|
|
}
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2016-10-19 02:49:47 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
DEFINE_SIMPLE_ATTRIBUTE(vm_stat_fops, vm_stat_get, vm_stat_clear, "%llu\n");
|
2021-06-23 21:28:46 +00:00
|
|
|
DEFINE_SIMPLE_ATTRIBUTE(vm_stat_readonly_fops, vm_stat_get, NULL, "%llu\n");
|
2007-11-18 14:24:12 +00:00
|
|
|
|
2008-02-08 12:20:26 +00:00
|
|
|
static int vcpu_stat_get(void *_offset, u64 *val)
|
2007-04-19 14:27:43 +00:00
|
|
|
{
|
|
|
|
unsigned offset = (long)_offset;
|
|
|
|
struct kvm *kvm;
|
2016-05-18 11:26:23 +00:00
|
|
|
u64 tmp_val;
|
2007-04-19 14:27:43 +00:00
|
|
|
|
2008-02-08 12:20:26 +00:00
|
|
|
*val = 0;
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2016-05-18 11:26:23 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_get_stat_per_vcpu(kvm, offset, &tmp_val);
|
2016-05-18 11:26:23 +00:00
|
|
|
*val += tmp_val;
|
|
|
|
}
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2008-02-08 12:20:26 +00:00
|
|
|
return 0;
|
2007-04-19 14:27:43 +00:00
|
|
|
}
|
|
|
|
|
2016-10-19 02:49:47 +00:00
|
|
|
static int vcpu_stat_clear(void *_offset, u64 val)
|
|
|
|
{
|
|
|
|
unsigned offset = (long)_offset;
|
|
|
|
struct kvm *kvm;
|
|
|
|
|
|
|
|
if (val)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2016-10-19 02:49:47 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
2019-12-13 13:07:21 +00:00
|
|
|
kvm_clear_stat_per_vcpu(kvm, offset);
|
2016-10-19 02:49:47 +00:00
|
|
|
}
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2016-10-19 02:49:47 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
DEFINE_SIMPLE_ATTRIBUTE(vcpu_stat_fops, vcpu_stat_get, vcpu_stat_clear,
|
|
|
|
"%llu\n");
|
2021-06-23 21:28:46 +00:00
|
|
|
DEFINE_SIMPLE_ATTRIBUTE(vcpu_stat_readonly_fops, vcpu_stat_get, NULL, "%llu\n");
|
2007-04-19 14:27:43 +00:00
|
|
|
|
2017-07-12 15:56:44 +00:00
|
|
|
static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
|
|
|
|
{
|
|
|
|
struct kobj_uevent_env *env;
|
|
|
|
unsigned long long created, active;
|
|
|
|
|
|
|
|
if (!kvm_dev.this_device || !kvm)
|
|
|
|
return;
|
|
|
|
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_lock(&kvm_lock);
|
2017-07-12 15:56:44 +00:00
|
|
|
if (type == KVM_EVENT_CREATE_VM) {
|
|
|
|
kvm_createvm_count++;
|
|
|
|
kvm_active_vms++;
|
|
|
|
} else if (type == KVM_EVENT_DESTROY_VM) {
|
|
|
|
kvm_active_vms--;
|
|
|
|
}
|
|
|
|
created = kvm_createvm_count;
|
|
|
|
active = kvm_active_vms;
|
2019-01-04 01:14:28 +00:00
|
|
|
mutex_unlock(&kvm_lock);
|
2017-07-12 15:56:44 +00:00
|
|
|
|
2019-02-11 19:02:49 +00:00
|
|
|
env = kzalloc(sizeof(*env), GFP_KERNEL_ACCOUNT);
|
2017-07-12 15:56:44 +00:00
|
|
|
if (!env)
|
|
|
|
return;
|
|
|
|
|
|
|
|
add_uevent_var(env, "CREATED=%llu", created);
|
|
|
|
add_uevent_var(env, "COUNT=%llu", active);
|
|
|
|
|
2017-07-24 11:40:03 +00:00
|
|
|
if (type == KVM_EVENT_CREATE_VM) {
|
2017-07-12 15:56:44 +00:00
|
|
|
add_uevent_var(env, "EVENT=create");
|
2017-07-24 11:40:03 +00:00
|
|
|
kvm->userspace_pid = task_pid_nr(current);
|
|
|
|
} else if (type == KVM_EVENT_DESTROY_VM) {
|
2017-07-12 15:56:44 +00:00
|
|
|
add_uevent_var(env, "EVENT=destroy");
|
2017-07-24 11:40:03 +00:00
|
|
|
}
|
|
|
|
add_uevent_var(env, "PID=%d", kvm->userspace_pid);
|
2017-07-12 15:56:44 +00:00
|
|
|
|
2022-04-06 23:56:13 +00:00
|
|
|
if (!IS_ERR(kvm->debugfs_dentry)) {
|
2019-02-11 19:02:49 +00:00
|
|
|
char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL_ACCOUNT);
|
2017-07-24 11:40:03 +00:00
|
|
|
|
|
|
|
if (p) {
|
|
|
|
tmp = dentry_path_raw(kvm->debugfs_dentry, p, PATH_MAX);
|
|
|
|
if (!IS_ERR(tmp))
|
|
|
|
add_uevent_var(env, "STATS_PATH=%s", tmp);
|
|
|
|
kfree(p);
|
2017-07-12 15:56:44 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
/* no need for checks, since we are adding at most only 5 keys */
|
|
|
|
env->envp[env->envp_idx++] = NULL;
|
|
|
|
kobject_uevent_env(&kvm_dev.this_device->kobj, KOBJ_CHANGE, env->envp);
|
|
|
|
kfree(env);
|
|
|
|
}
|
|
|
|
|
2018-05-29 16:22:04 +00:00
|
|
|
static void kvm_init_debug(void)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2021-06-23 21:28:46 +00:00
|
|
|
const struct file_operations *fops;
|
|
|
|
const struct _kvm_stats_desc *pdesc;
|
|
|
|
int i;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2008-04-15 21:05:42 +00:00
|
|
|
kvm_debugfs_dir = debugfs_create_dir("kvm", NULL);
|
2011-12-15 06:23:16 +00:00
|
|
|
|
2021-06-23 21:28:46 +00:00
|
|
|
for (i = 0; i < kvm_vm_stats_header.num_desc; ++i) {
|
|
|
|
pdesc = &kvm_vm_stats_desc[i];
|
|
|
|
if (kvm_stats_debugfs_mode(pdesc) & 0222)
|
|
|
|
fops = &vm_stat_fops;
|
|
|
|
else
|
|
|
|
fops = &vm_stat_readonly_fops;
|
|
|
|
debugfs_create_file(pdesc->name, kvm_stats_debugfs_mode(pdesc),
|
|
|
|
kvm_debugfs_dir,
|
|
|
|
(void *)(long)pdesc->desc.offset, fops);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < kvm_vcpu_stats_header.num_desc; ++i) {
|
|
|
|
pdesc = &kvm_vcpu_stats_desc[i];
|
|
|
|
if (kvm_stats_debugfs_mode(pdesc) & 0222)
|
|
|
|
fops = &vcpu_stat_fops;
|
|
|
|
else
|
|
|
|
fops = &vcpu_stat_readonly_fops;
|
|
|
|
debugfs_create_file(pdesc->name, kvm_stats_debugfs_mode(pdesc),
|
|
|
|
kvm_debugfs_dir,
|
|
|
|
(void *)(long)pdesc->desc.offset, fops);
|
2011-12-15 06:23:16 +00:00
|
|
|
}
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
|
|
|
|
2007-07-11 15:17:21 +00:00
|
|
|
static inline
|
|
|
|
struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
|
|
|
|
{
|
|
|
|
return container_of(pn, struct kvm_vcpu, preempt_notifier);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
|
2015-02-26 06:58:23 +00:00
|
|
|
|
2019-08-01 03:30:14 +00:00
|
|
|
WRITE_ONCE(vcpu->preempted, false);
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
WRITE_ONCE(vcpu->ready, false);
|
2007-07-11 15:17:21 +00:00
|
|
|
|
2020-01-09 14:57:19 +00:00
|
|
|
__this_cpu_write(kvm_running_vcpu, vcpu);
|
2007-11-14 12:38:21 +00:00
|
|
|
kvm_arch_vcpu_load(vcpu, cpu);
|
KVM: Add a flag to track if a loaded vCPU is scheduled out
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
being scheduled out (vCPU put path), or if the vCPU is being reloaded
after being scheduled out (vCPU load path). In the short term, this will
allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
during kvm_arch_vcpu_load().
Longer term, scheduled_out opens up other potential optimizations, without
creating subtle/brittle dependencies. E.g. it allows KVM to keep guest
state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
exact ordering of calls into arch code.
Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
hook, e.g. if arch code needs to do something novel when putting vCPU
state.
And even if KVM never uses scheduled_out for anything beyond dropping
kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
it worth adding the flag.
Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-22 01:40:08 +00:00
|
|
|
|
|
|
|
WRITE_ONCE(vcpu->scheduled_out, false);
|
2007-07-11 15:17:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_sched_out(struct preempt_notifier *pn,
|
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
|
|
|
|
|
KVM: Add a flag to track if a loaded vCPU is scheduled out
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
being scheduled out (vCPU put path), or if the vCPU is being reloaded
after being scheduled out (vCPU load path). In the short term, this will
allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
during kvm_arch_vcpu_load().
Longer term, scheduled_out opens up other potential optimizations, without
creating subtle/brittle dependencies. E.g. it allows KVM to keep guest
state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
exact ordering of calls into arch code.
Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
hook, e.g. if arch code needs to do something novel when putting vCPU
state.
And even if KVM never uses scheduled_out for anything beyond dropping
kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
it worth adding the flag.
Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-22 01:40:08 +00:00
|
|
|
WRITE_ONCE(vcpu->scheduled_out, true);
|
|
|
|
|
2021-06-11 08:28:13 +00:00
|
|
|
if (current->on_rq) {
|
2019-08-01 03:30:14 +00:00
|
|
|
WRITE_ONCE(vcpu->preempted, true);
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 11:39:06 +00:00
|
|
|
WRITE_ONCE(vcpu->ready, true);
|
|
|
|
}
|
2007-11-14 12:38:21 +00:00
|
|
|
kvm_arch_vcpu_put(vcpu);
|
2020-01-09 14:57:19 +00:00
|
|
|
__this_cpu_write(kvm_running_vcpu, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_get_running_vcpu - get the vcpu running on the current CPU.
|
2020-02-07 16:34:10 +00:00
|
|
|
*
|
|
|
|
* We can disable preemption locally around accessing the per-CPU variable,
|
|
|
|
* and use the resolved vcpu pointer after enabling preemption again,
|
|
|
|
* because even if the current thread is migrated to another CPU, reading
|
|
|
|
* the per-CPU value later will give us the same value as we update the
|
|
|
|
* per-CPU variable in the preempt notifier handlers.
|
2020-01-09 14:57:19 +00:00
|
|
|
*/
|
|
|
|
struct kvm_vcpu *kvm_get_running_vcpu(void)
|
|
|
|
{
|
2020-02-07 16:34:10 +00:00
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
vcpu = __this_cpu_read(kvm_running_vcpu);
|
|
|
|
preempt_enable();
|
|
|
|
|
|
|
|
return vcpu;
|
2020-01-09 14:57:19 +00:00
|
|
|
}
|
2020-04-28 06:23:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_get_running_vcpu);
|
2020-01-09 14:57:19 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
|
|
|
|
*/
|
|
|
|
struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
|
|
|
|
{
|
|
|
|
return &kvm_running_vcpu;
|
2007-07-11 15:17:21 +00:00
|
|
|
}
|
|
|
|
|
2021-11-11 02:07:33 +00:00
|
|
|
#ifdef CONFIG_GUEST_PERF_EVENTS
|
|
|
|
static unsigned int kvm_guest_state(void)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
|
|
|
|
unsigned int state;
|
|
|
|
|
|
|
|
if (!kvm_arch_pmi_in_guest(vcpu))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
state = PERF_GUEST_ACTIVE;
|
|
|
|
if (!kvm_arch_vcpu_in_kernel(vcpu))
|
|
|
|
state |= PERF_GUEST_USER;
|
|
|
|
|
|
|
|
return state;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long kvm_guest_get_ip(void)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
|
|
|
|
|
|
|
|
/* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
|
|
|
|
if (WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return kvm_arch_vcpu_get_ip(vcpu);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct perf_guest_info_callbacks kvm_guest_cbs = {
|
|
|
|
.state = kvm_guest_state,
|
|
|
|
.get_ip = kvm_guest_get_ip,
|
|
|
|
.handle_intel_pt_intr = NULL,
|
|
|
|
};
|
|
|
|
|
|
|
|
void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void))
|
|
|
|
{
|
|
|
|
kvm_guest_cbs.handle_intel_pt_intr = pt_intr_handler;
|
|
|
|
perf_register_guest_info_callbacks(&kvm_guest_cbs);
|
|
|
|
}
|
|
|
|
void kvm_unregister_perf_callbacks(void)
|
|
|
|
{
|
|
|
|
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2022-11-30 23:09:16 +00:00
|
|
|
int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
|
2019-04-20 05:18:17 +00:00
|
|
|
{
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
int r;
|
2007-07-31 11:23:01 +00:00
|
|
|
int cpu;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
2022-11-30 23:09:25 +00:00
|
|
|
r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
|
|
|
|
kvm_online_cpu, kvm_offline_cpu);
|
2007-02-12 08:54:47 +00:00
|
|
|
if (r)
|
2022-11-30 23:09:30 +00:00
|
|
|
return r;
|
|
|
|
|
2022-11-30 23:09:32 +00:00
|
|
|
register_syscore_ops(&kvm_syscore_ops);
|
2022-11-30 23:09:33 +00:00
|
|
|
#endif
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-07-30 11:12:19 +00:00
|
|
|
/* A kmem cache lets us meet the alignment requirements of fx_save. */
|
2010-04-28 12:39:01 +00:00
|
|
|
if (!vcpu_align)
|
|
|
|
vcpu_align = __alignof__(struct kvm_vcpu);
|
2017-10-26 13:45:46 +00:00
|
|
|
kvm_vcpu_cache =
|
|
|
|
kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
|
|
|
|
SLAB_ACCOUNT,
|
|
|
|
offsetof(struct kvm_vcpu, arch),
|
2021-06-18 22:27:06 +00:00
|
|
|
offsetofend(struct kvm_vcpu, stats_id)
|
|
|
|
- offsetof(struct kvm_vcpu, arch),
|
2017-10-26 13:45:46 +00:00
|
|
|
NULL);
|
2007-07-30 11:12:19 +00:00
|
|
|
if (!kvm_vcpu_cache) {
|
|
|
|
r = -ENOMEM;
|
2022-11-30 23:09:34 +00:00
|
|
|
goto err_vcpu_cache;
|
2007-07-30 11:12:19 +00:00
|
|
|
}
|
|
|
|
|
2021-09-03 07:51:40 +00:00
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
|
|
|
|
GFP_KERNEL, cpu_to_node(cpu))) {
|
|
|
|
r = -ENOMEM;
|
2022-11-30 23:09:34 +00:00
|
|
|
goto err_cpu_kick_mask;
|
2021-09-03 07:51:40 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-11-30 23:08:46 +00:00
|
|
|
r = kvm_irqfd_init();
|
|
|
|
if (r)
|
|
|
|
goto err_irqfd;
|
|
|
|
|
2010-10-14 09:22:46 +00:00
|
|
|
r = kvm_async_pf_init();
|
|
|
|
if (r)
|
2022-11-30 23:08:46 +00:00
|
|
|
goto err_async_pf;
|
2010-10-14 09:22:46 +00:00
|
|
|
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
kvm_chardev_ops.owner = module;
|
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 20:46:22 +00:00
|
|
|
kvm_vm_fops.owner = module;
|
|
|
|
kvm_vcpu_fops.owner = module;
|
|
|
|
kvm_device_fops.owner = module;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-07-11 15:17:21 +00:00
|
|
|
kvm_preempt_ops.sched_in = kvm_sched_in;
|
|
|
|
kvm_preempt_ops.sched_out = kvm_sched_out;
|
|
|
|
|
2018-05-29 16:22:04 +00:00
|
|
|
kvm_init_debug();
|
2009-10-14 23:21:00 +00:00
|
|
|
|
2014-09-24 11:02:46 +00:00
|
|
|
r = kvm_vfio_ops_init();
|
2022-11-30 23:08:45 +00:00
|
|
|
if (WARN_ON_ONCE(r))
|
|
|
|
goto err_vfio;
|
|
|
|
|
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
|
|
|
kvm_gmem_init(module);
|
|
|
|
|
2022-11-30 23:08:45 +00:00
|
|
|
/*
|
|
|
|
* Registration _must_ be the very last thing done, as this exposes
|
|
|
|
* /dev/kvm to userspace, i.e. all infrastructure must be setup!
|
|
|
|
*/
|
|
|
|
r = misc_register(&kvm_dev);
|
|
|
|
if (r) {
|
|
|
|
pr_err("kvm: misc device register failed\n");
|
|
|
|
goto err_register;
|
|
|
|
}
|
2014-09-24 11:02:46 +00:00
|
|
|
|
KVM: Allow not-present guest page faults to bypass kvm
There are two classes of page faults trapped by kvm:
- host page faults, where the fault is needed to allow kvm to install
the shadow pte or update the guest accessed and dirty bits
- guest page faults, where the guest has faulted and kvm simply injects
the fault back into the guest to handle
The second class, guest page faults, is pure overhead. We can eliminate
some of it on vmx using the following evil trick:
- when we set up a shadow page table entry, if the corresponding guest pte
is not present, set up the shadow pte as not present
- if the guest pte _is_ present, mark the shadow pte as present but also
set one of the reserved bits in the shadow pte
- tell the vmx hardware not to trap faults which have the present bit clear
With this, normal page-not-present faults go directly to the guest,
bypassing kvm entirely.
Unfortunately, this trick only works on Intel hardware, as AMD lacks a
way to discriminate among page faults based on error code. It is also
a little risky since it uses reserved bits which might become unreserved
in the future, so a module parameter is provided to disable it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-09-16 16:58:32 +00:00
|
|
|
return 0;
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2022-11-30 23:08:45 +00:00
|
|
|
err_register:
|
|
|
|
kvm_vfio_ops_exit();
|
|
|
|
err_vfio:
|
2010-10-14 09:22:46 +00:00
|
|
|
kvm_async_pf_deinit();
|
2022-11-30 23:08:46 +00:00
|
|
|
err_async_pf:
|
|
|
|
kvm_irqfd_exit();
|
|
|
|
err_irqfd:
|
2022-11-30 23:09:34 +00:00
|
|
|
err_cpu_kick_mask:
|
2021-09-03 07:51:40 +00:00
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
|
2007-07-30 11:12:19 +00:00
|
|
|
kmem_cache_destroy(kvm_vcpu_cache);
|
2022-11-30 23:09:34 +00:00
|
|
|
err_vcpu_cache:
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
2022-11-30 23:09:32 +00:00
|
|
|
unregister_syscore_ops(&kvm_syscore_ops);
|
2022-11-30 23:09:25 +00:00
|
|
|
cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
|
2022-11-30 23:09:33 +00:00
|
|
|
#endif
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
return r;
|
|
|
|
}
|
2007-11-14 12:39:31 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_init);
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
|
2007-11-14 12:39:31 +00:00
|
|
|
void kvm_exit(void)
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
{
|
2021-09-03 07:51:40 +00:00
|
|
|
int cpu;
|
|
|
|
|
2022-11-30 23:08:45 +00:00
|
|
|
/*
|
|
|
|
* Note, unregistering /dev/kvm doesn't strictly need to come first,
|
|
|
|
* fops_get(), a.k.a. try_module_get(), prevents acquiring references
|
|
|
|
* to KVM while the module is being stopped.
|
|
|
|
*/
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
misc_deregister(&kvm_dev);
|
2022-11-30 23:08:45 +00:00
|
|
|
|
|
|
|
debugfs_remove_recursive(kvm_debugfs_dir);
|
2021-09-03 07:51:40 +00:00
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
|
2007-07-30 11:12:19 +00:00
|
|
|
kmem_cache_destroy(kvm_vcpu_cache);
|
2022-11-30 23:08:48 +00:00
|
|
|
kvm_vfio_ops_exit();
|
2010-10-14 09:22:46 +00:00
|
|
|
kvm_async_pf_deinit();
|
2022-11-30 23:09:33 +00:00
|
|
|
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
|
2011-03-23 21:16:23 +00:00
|
|
|
unregister_syscore_ops(&kvm_syscore_ops);
|
2022-11-30 23:09:25 +00:00
|
|
|
cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
|
2022-11-30 23:09:33 +00:00
|
|
|
#endif
|
2022-11-30 23:08:46 +00:00
|
|
|
kvm_irqfd_exit();
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:21:36 +00:00
|
|
|
}
|
2007-11-14 12:39:31 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_exit);
|
2019-11-04 11:22:02 +00:00
|
|
|
|
|
|
|
struct kvm_vm_worker_thread_context {
|
|
|
|
struct kvm *kvm;
|
|
|
|
struct task_struct *parent;
|
|
|
|
struct completion init_done;
|
|
|
|
kvm_vm_thread_fn_t thread_fn;
|
|
|
|
uintptr_t data;
|
|
|
|
int err;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int kvm_vm_worker_thread(void *context)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The init_context is allocated on the stack of the parent thread, so
|
|
|
|
* we have to locally copy anything that is needed beyond initialization
|
|
|
|
*/
|
|
|
|
struct kvm_vm_worker_thread_context *init_context = context;
|
2022-02-22 05:48:48 +00:00
|
|
|
struct task_struct *parent;
|
2019-11-04 11:22:02 +00:00
|
|
|
struct kvm *kvm = init_context->kvm;
|
|
|
|
kvm_vm_thread_fn_t thread_fn = init_context->thread_fn;
|
|
|
|
uintptr_t data = init_context->data;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = kthread_park(current);
|
|
|
|
/* kthread_park(current) is never supposed to return an error */
|
|
|
|
WARN_ON(err != 0);
|
|
|
|
if (err)
|
|
|
|
goto init_complete;
|
|
|
|
|
|
|
|
err = cgroup_attach_task_all(init_context->parent, current);
|
|
|
|
if (err) {
|
|
|
|
kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
|
|
|
|
__func__, err);
|
|
|
|
goto init_complete;
|
|
|
|
}
|
|
|
|
|
|
|
|
set_user_nice(current, task_nice(init_context->parent));
|
|
|
|
|
|
|
|
init_complete:
|
|
|
|
init_context->err = err;
|
|
|
|
complete(&init_context->init_done);
|
|
|
|
init_context = NULL;
|
|
|
|
|
|
|
|
if (err)
|
2022-02-22 05:48:48 +00:00
|
|
|
goto out;
|
2019-11-04 11:22:02 +00:00
|
|
|
|
|
|
|
/* Wait to be woken up by the spawner before proceeding. */
|
|
|
|
kthread_parkme();
|
|
|
|
|
|
|
|
if (!kthread_should_stop())
|
|
|
|
err = thread_fn(kvm, data);
|
|
|
|
|
2022-02-22 05:48:48 +00:00
|
|
|
out:
|
|
|
|
/*
|
|
|
|
* Move kthread back to its original cgroup to prevent it lingering in
|
|
|
|
* the cgroup of the VM process, after the latter finishes its
|
|
|
|
* execution.
|
|
|
|
*
|
|
|
|
* kthread_stop() waits on the 'exited' completion condition which is
|
|
|
|
* set in exit_mm(), via mm_release(), in do_exit(). However, the
|
|
|
|
* kthread is removed from the cgroup in the cgroup_exit() which is
|
|
|
|
* called after the exit_mm(). This causes the kthread_stop() to return
|
|
|
|
* before the kthread actually quits the cgroup.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
parent = rcu_dereference(current->real_parent);
|
|
|
|
get_task_struct(parent);
|
|
|
|
rcu_read_unlock();
|
|
|
|
cgroup_attach_task_all(parent, current);
|
|
|
|
put_task_struct(parent);
|
|
|
|
|
2019-11-04 11:22:02 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
|
|
|
|
uintptr_t data, const char *name,
|
|
|
|
struct task_struct **thread_ptr)
|
|
|
|
{
|
|
|
|
struct kvm_vm_worker_thread_context init_context = {};
|
|
|
|
struct task_struct *thread;
|
|
|
|
|
|
|
|
*thread_ptr = NULL;
|
|
|
|
init_context.kvm = kvm;
|
|
|
|
init_context.parent = current;
|
|
|
|
init_context.thread_fn = thread_fn;
|
|
|
|
init_context.data = data;
|
|
|
|
init_completion(&init_context.init_done);
|
|
|
|
|
|
|
|
thread = kthread_run(kvm_vm_worker_thread, &init_context,
|
|
|
|
"%s-%d", name, task_pid_nr(current));
|
|
|
|
if (IS_ERR(thread))
|
|
|
|
return PTR_ERR(thread);
|
|
|
|
|
|
|
|
/* kthread_run is never supposed to return NULL */
|
|
|
|
WARN_ON(thread == NULL);
|
|
|
|
|
|
|
|
wait_for_completion(&init_context.init_done);
|
|
|
|
|
|
|
|
if (!init_context.err)
|
|
|
|
*thread_ptr = thread;
|
|
|
|
|
|
|
|
return init_context.err;
|
|
|
|
}
|