2019-06-04 08:11:37 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2009-05-20 14:30:49 +00:00
|
|
|
/*
|
|
|
|
* kvm eventfd support - use eventfd objects to signal various KVM events
|
|
|
|
*
|
|
|
|
* Copyright 2009 Novell. All Rights Reserved.
|
2010-05-23 15:37:00 +00:00
|
|
|
* Copyright 2010 Red Hat, Inc. and/or its affiliates.
|
2009-05-20 14:30:49 +00:00
|
|
|
*
|
|
|
|
* Author:
|
|
|
|
* Gregory Haskins <ghaskins@novell.com>
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/kvm_host.h>
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate from the NULLIO runs (+2.56us for MMIO and -350ns for HC)
to get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
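The signalling half of this mechanism can be tried from plain userspace, without KVM. What follows is a minimal sketch, assuming a Linux host that provides eventfd(2). `ring()` and `drain()` are hypothetical helper names, not part of any kernel or QEMU API, and registering the descriptor with KVM through the KVM_IOEVENTFD ioctl is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: signal the eventfd `times` times, adding 1 each time. */
static int ring(int efd, int times)
{
	uint64_t one = 1;

	assert(times >= 0);
	for (int i = 0; i < times; i++)
		if (write(efd, &one, sizeof(one)) != sizeof(one))
			return -1;
	return 0;
}

/*
 * Hypothetical helper: read and reset the counter. Returns 0 when nothing is
 * pending (on a non-blocking descriptor) or when the read fails.
 */
static uint64_t drain(int efd)
{
	uint64_t cnt = 0;

	if (read(efd, &cnt, sizeof(cnt)) != sizeof(cnt))
		return 0;
	return cnt;
}
```

With a descriptor created by `eventfd(0, EFD_NONBLOCK)`, several `ring()` calls accumulate in a single counter, and one `drain()` returns the accumulated total and resets it. The consumer therefore learns how many signals arrived even if it only wakes up once.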
|
|
|
#include <linux/kvm.h>
|
2015-09-18 14:29:42 +00:00
|
|
|
#include <linux/kvm_irqfd.h>
|
2009-05-20 14:30:49 +00:00
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/syscalls.h>
|
|
|
|
#include <linux/wait.h>
|
|
|
|
#include <linux/poll.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/eventfd.h>
|
2009-07-07 21:08:49 +00:00
|
|
|
#include <linux/kernel.h>
|
2014-01-16 12:44:20 +00:00
|
|
|
#include <linux/srcu.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
|
|
|
#include <linux/slab.h>
|
2014-06-30 10:51:09 +00:00
|
|
|
#include <linux/seqlock.h>
|
2015-09-18 14:29:44 +00:00
|
|
|
#include <linux/irqbypass.h>
|
2014-06-30 10:51:12 +00:00
|
|
|
#include <trace/events/kvm.h>
|
2009-07-07 21:08:49 +00:00
|
|
|
|
2015-03-26 14:39:29 +00:00
|
|
|
#include <kvm/iodev.h>
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2023-10-18 16:07:32 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2016-10-26 11:35:56 +00:00
|
|
|
static struct workqueue_struct *irqfd_cleanup_wq;
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2019-05-05 08:56:42 +00:00
|
|
|
bool __attribute__((weak))
|
|
|
|
kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2009-05-20 14:30:49 +00:00
|
|
|
static void
|
|
|
|
irqfd_inject(struct work_struct *work)
|
|
|
|
{
|
2015-09-18 14:29:42 +00:00
|
|
|
struct kvm_kernel_irqfd *irqfd =
|
|
|
|
container_of(work, struct kvm_kernel_irqfd, inject);
|
2009-05-20 14:30:49 +00:00
|
|
|
struct kvm *kvm = irqfd->kvm;
|
|
|
|
|
2012-09-21 17:58:03 +00:00
|
|
|
if (!irqfd->resampler) {
|
2013-04-11 11:21:40 +00:00
|
|
|
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1,
|
|
|
|
false);
|
|
|
|
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0,
|
|
|
|
false);
|
2012-09-21 17:58:03 +00:00
|
|
|
} else
|
|
|
|
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
|
2013-04-11 11:21:40 +00:00
|
|
|
irqfd->gsi, 1, false);
|
2012-09-21 17:58:03 +00:00
|
|
|
}
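irqfd_inject() receives only a pointer to the embedded work item and recovers the enclosing irqfd with container_of(). A minimal userspace sketch of that idiom follows, assuming a simplified macro built on offsetof(). `struct demo_work` and `struct demo_inject_ctx` are hypothetical stand-ins for struct work_struct and struct kvm_kernel_irqfd, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member pointer to its parent. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct demo_work {
	int pending;
};

/* Hypothetical stand-in for struct kvm_kernel_irqfd. */
struct demo_inject_ctx {
	int gsi;
	struct demo_work inject;
	struct demo_work shutdown;
};

/* A callback that is handed only the embedded member, as irqfd_inject() is. */
static int gsi_from_inject(struct demo_work *work)
{
	struct demo_inject_ctx *ctx;

	assert(work != NULL);
	ctx = container_of(work, struct demo_inject_ctx, inject);
	return ctx->gsi;
}
```

The member name passed to the macro must match the member the pointer really refers to. That is why irqfd_inject() uses `inject` while irqfd_shutdown() uses `shutdown` on the same structure.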
|
|
|
|
|
KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking
KVM irqfd based emulation of level-triggered interrupts doesn't work
quite correctly in some cases, particularly in the case of interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of hardirq), not earlier.
Linux keeps such an interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we always notify resamplefd at the EOI, so vfio prematurely unmasks the
host physical IRQ, thus a new physical interrupt is fired in the host.
This extra interrupt in the host is not a problem per se. The problem is
that it is unconditionally queued for injection into the guest, so the
guest sees an extra bogus interrupt. [*]
At least 2 user-visible issues caused by those extra erroneous
interrupts for a oneshot irq in the guest have been observed:
1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.
The core issue here is that by the time the guest unmasks the IRQ,
the physical IRQ line is no longer asserted (since the guest has
acked the interrupt to the device in the meantime), yet we
unconditionally inject the interrupt queued into the guest by the
previous resampling. So to fix the issue, we need a way to detect that
the IRQ is no longer pending, and cancel the queued interrupt in this
case.
With IOAPIC we are not able to probe the physical IRQ line state
directly (at least not if the underlying physical interrupt controller
is an IOAPIC too), so in this patch we use irqfd resampler for that.
Namely, instead of injecting the queued interrupt, we just notify the
resampler that this interrupt is done. If the IRQ line is actually
already deasserted, we are done. If it is still asserted, a new
interrupt will be shortly triggered through irqfd and injected into the
guest.
If there is no irqfd resampler registered for this IRQ, we
cannot fix the issue, so we keep the existing behavior: immediately
unconditionally inject the queued interrupt.
This patch fixes the issue for x86 IOAPIC only. In the long run, we can
fix it for other irqchips and other architectures too, possibly taking
advantage of reading the physical state of the IRQ line, which is
possible with some other irqchips (e.g. with arm64 GIC, maybe even with
the legacy x86 PIC).
[*] In this description we assume that the interrupt is a physical host
interrupt forwarded to the guest e.g. by vfio. Potentially the same
issue may occur also with a purely virtual interrupt from an
emulated device, e.g. if the guest handles this interrupt, again, as
a oneshot interrupt.
Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-22 20:43:44 +00:00
|
|
|
static void irqfd_resampler_notify(struct kvm_kernel_irqfd_resampler *resampler)
|
|
|
|
{
|
|
|
|
struct kvm_kernel_irqfd *irqfd;
|
|
|
|
|
|
|
|
list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
|
|
|
|
srcu_read_lock_held(&resampler->kvm->irq_srcu))
|
2023-11-22 12:48:23 +00:00
|
|
|
eventfd_signal(irqfd->resamplefd);
|
2023-03-22 20:43:44 +00:00
|
|
|
}
|
|
|
|
|
2012-09-21 17:58:03 +00:00
|
|
|
/*
|
|
|
|
* Since resampler irqfds share an IRQ source ID, we de-assert once
|
|
|
|
* then notify all of the resampler irqfds using this GSI. We can't
|
|
|
|
* do multiple de-asserts or we risk racing with incoming re-asserts.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
|
|
|
|
{
|
2015-09-18 14:29:42 +00:00
|
|
|
struct kvm_kernel_irqfd_resampler *resampler;
|
2014-01-16 12:44:20 +00:00
|
|
|
struct kvm *kvm;
|
|
|
|
int idx;
|
2012-09-21 17:58:03 +00:00
|
|
|
|
2015-09-18 14:29:42 +00:00
|
|
|
resampler = container_of(kian,
|
|
|
|
struct kvm_kernel_irqfd_resampler, notifier);
|
2014-01-16 12:44:20 +00:00
|
|
|
kvm = resampler->kvm;
|
2012-09-21 17:58:03 +00:00
|
|
|
|
2014-01-16 12:44:20 +00:00
|
|
|
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
|
2013-04-11 11:21:40 +00:00
|
|
|
resampler->notifier.gsi, 0, false);
|
2012-09-21 17:58:03 +00:00
|
|
|
|
2014-01-16 12:44:20 +00:00
|
|
|
idx = srcu_read_lock(&kvm->irq_srcu);
|
2023-03-22 20:43:44 +00:00
|
|
|
irqfd_resampler_notify(resampler);
|
2014-01-16 12:44:20 +00:00
|
|
|
srcu_read_unlock(&kvm->irq_srcu, idx);
|
2012-09-21 17:58:03 +00:00
|
|
|
}
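The "de-assert once, then notify" ordering in irqfd_resampler_ack() can be pictured with a toy model. This is only a sketch under stated assumptions: `struct demo_line` and the `demo_*` functions are hypothetical, the listener on the resamplefd is reduced to a direct function call, and none of the real locking or SRCU protection is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one level-triggered GSI driven through a resampler irqfd. */
struct demo_line {
	bool asserted;       /* level currently presented to the guest irqchip */
	bool device_pending; /* does the (hypothetical) device still want service? */
	int injections;      /* number of times the line went from low to high */
};

/* Models a signal on the irqfd: raise the line if it is currently low. */
static void demo_trigger(struct demo_line *l)
{
	assert(l != NULL);
	if (!l->asserted) {
		l->asserted = true;
		l->injections++;
	}
}

/* Models the resamplefd listener: re-trigger only if the device still needs it. */
static void demo_resample(struct demo_line *l)
{
	if (l->device_pending)
		demo_trigger(l);
}

/* Models the ack path: de-assert exactly once, then notify the listener. */
static void demo_ack(struct demo_line *l)
{
	l->asserted = false;
	demo_resample(l);
}
```

In this model an ack while the device is still pending raises the line again right away, and an ack after the device has been serviced leaves the line low. Whether the line is high after the ack is decided by the listener, not by the ack path itself.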
|
|
|
|
|
|
|
|
static void
|
2015-09-18 14:29:42 +00:00
|
|
|
irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
|
2012-09-21 17:58:03 +00:00
|
|
|
{
|
2015-09-18 14:29:42 +00:00
|
|
|
struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
|
2012-09-21 17:58:03 +00:00
|
|
|
struct kvm *kvm = resampler->kvm;
|
|
|
|
|
|
|
|
mutex_lock(&kvm->irqfds.resampler_lock);
|
|
|
|
|
|
|
|
list_del_rcu(&irqfd->resampler_link);
|
|
|
|
|
|
|
|
if (list_empty(&resampler->list)) {
|
2023-03-22 20:43:43 +00:00
|
|
|
list_del_rcu(&resampler->link);
|
2012-09-21 17:58:03 +00:00
|
|
|
kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
|
2023-03-22 20:43:43 +00:00
|
|
|
/*
|
2024-07-11 12:11:30 +00:00
|
|
|
* synchronize_srcu_expedited(&kvm->irq_srcu) already called
|
2023-03-22 20:43:43 +00:00
|
|
|
* in kvm_unregister_irq_ack_notifier().
|
|
|
|
*/
|
2012-09-21 17:58:03 +00:00
|
|
|
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
|
2013-04-11 11:21:40 +00:00
|
|
|
resampler->notifier.gsi, 0, false);
|
2012-09-21 17:58:03 +00:00
|
|
|
kfree(resampler);
|
2024-07-11 12:11:30 +00:00
|
|
|
} else {
|
|
|
|
synchronize_srcu_expedited(&kvm->irq_srcu);
|
2012-09-21 17:58:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&kvm->irqfds.resampler_lock);
|
2009-05-20 14:30:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Race-free decouple logic (ordering is critical)
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
irqfd_shutdown(struct work_struct *work)
|
|
|
|
{
|
2015-09-18 14:29:42 +00:00
|
|
|
struct kvm_kernel_irqfd *irqfd =
|
|
|
|
container_of(work, struct kvm_kernel_irqfd, shutdown);
|
2017-12-22 02:10:36 +00:00
|
|
|
struct kvm *kvm = irqfd->kvm;
|
2010-01-13 17:12:30 +00:00
|
|
|
u64 cnt;
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2020-04-01 14:03:10 +00:00
|
|
|
/* Make sure irqfd has been initialized in assign path. */
|
2024-07-11 12:11:30 +00:00
|
|
|
synchronize_srcu_expedited(&kvm->irq_srcu);
|
2017-12-22 02:10:36 +00:00
|
|
|
|
2009-05-20 14:30:49 +00:00
|
|
|
/*
|
|
|
|
* Synchronize with the wait-queue and unhook ourselves to prevent
|
|
|
|
* further events.
|
|
|
|
*/
|
2010-01-13 17:12:30 +00:00
|
|
|
eventfd_ctx_remove_wait_queue(irqfd->eventfd, &irqfd->wait, &cnt);
|
2009-05-20 14:30:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We know no new events will be scheduled at this point, so block
|
|
|
|
* until all previously outstanding events have completed
|
|
|
|
*/
|
2012-08-20 21:51:24 +00:00
|
|
|
flush_work(&irqfd->inject);
|
2009-05-20 14:30:49 +00:00
|
|
|
|
2012-09-21 17:58:03 +00:00
|
|
|
if (irqfd->resampler) {
|
|
|
|
irqfd_resampler_shutdown(irqfd);
|
|
|
|
eventfd_ctx_put(irqfd->resamplefd);
|
|
|
|
}
|
|
|
|
|
2009-05-20 14:30:49 +00:00
|
|
|
/*
|
|
|
|
* It is now safe to release the object's resources
|
|
|
|
*/
|
2015-09-18 14:29:44 +00:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
|
|
|
|
irq_bypass_unregister_consumer(&irqfd->consumer);
|
|
|
|
#endif
|
2009-05-20 14:30:49 +00:00
|
|
|
eventfd_ctx_put(irqfd->eventfd);
|
|
|
|
kfree(irqfd);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* assumes kvm->irqfds.lock is held */
|
|
|
|
static bool
|
2015-09-18 14:29:42 +00:00
|
|
|
irqfd_is_active(struct kvm_kernel_irqfd *irqfd)
|
2009-05-20 14:30:49 +00:00
|
|
|
{
|
|
|
|
return !list_empty(&irqfd->list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark the irqfd as inactive and schedule it for removal
|
|
|
|
*
|
|
|
|
* assumes kvm->irqfds.lock is held
|
|
|
|
*/
|
|
|
|
static void
|
2015-09-18 14:29:42 +00:00
|
|
|
irqfd_deactivate(struct kvm_kernel_irqfd *irqfd)
|
2009-05-20 14:30:49 +00:00
|
|
|
{
|
|
|
|
BUG_ON(!irqfd_is_active(irqfd));
|
|
|
|
|
|
|
|
list_del_init(&irqfd->list);
|
|
|
|
|
2016-10-26 11:35:56 +00:00
|
|
|
queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
|
2009-05-20 14:30:49 +00:00
|
|
|
}
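irqfd_is_active() and irqfd_deactivate() rely on a property of list_del_init(): after removal the entry points back at itself, so list_empty() on the entry reports true. The sketch below reimplements just enough of a circular list in userspace to show this. It is modeled on the kernel's list_head but is not the kernel code, `struct tracked_item` and the `demo_*` helpers are hypothetical, and the deactivate helper returns false where the kernel uses BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink the entry and re-point it at itself, so list_empty(entry) is true. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical stand-in for struct kvm_kernel_irqfd. */
struct tracked_item {
	struct list_head list;
};

static bool demo_is_active(const struct tracked_item *it)
{
	assert(it != NULL);
	return !list_empty(&it->list);
}

/* Returns false when the item was already inactive. */
static bool demo_deactivate(struct tracked_item *it)
{
	if (!demo_is_active(it))
		return false;
	list_del_init(&it->list);
	return true;
}
```

Because the "active" flag is the list linkage itself, there is no separate state that could fall out of sync with list membership. The check and the removal still need to happen under the same lock, as the comments on the kernel functions state.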
|
|
|
|
|
2015-10-28 18:16:47 +00:00
|
|
|
int __attribute__((weak)) kvm_arch_set_irq_inatomic(
|
2015-10-16 07:07:47 +00:00
|
|
|
struct kvm_kernel_irq_routing_entry *irq,
|
|
|
|
struct kvm *kvm, int irq_source_id,
|
|
|
|
int level,
|
|
|
|
bool line_status)
|
|
|
|
{
|
|
|
|
return -EWOULDBLOCK;
|
|
|
|
}
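The weak default above lets an architecture provide a faster implementation without any registration step, while callers treat -EWOULDBLOCK as "defer this to process context". Here is a minimal sketch of that pattern, assuming a GCC or Clang toolchain that supports `__attribute__((weak))`. `demo_set_irq_inatomic()`, `demo_signal()` and `deferred_count` are hypothetical names, and the counter merely stands in for schedule_work().

```c
#include <assert.h>
#include <errno.h>

/* Weak default: the linker uses it only if no strong definition exists. */
int __attribute__((weak)) demo_set_irq_inatomic(int gsi, int level)
{
	(void)gsi;
	(void)level;
	return -EWOULDBLOCK;
}

/* Hypothetical stand-in for work queued with schedule_work(). */
static int deferred_count;

/*
 * Try the atomic fast path first. If it reports that it would block,
 * fall back to deferred delivery. Returns 1 if the work was deferred.
 */
static int demo_signal(int gsi)
{
	assert(gsi >= 0);
	if (demo_set_irq_inatomic(gsi, 1) == -EWOULDBLOCK) {
		deferred_count++;
		return 1;
	}
	return 0;
}
```

With only the weak definition linked in, every call takes the deferred path. Linking an object file that defines a non-weak `demo_set_irq_inatomic()` would replace the default without touching the caller.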

/*
 * Called with wqh->lock held and interrupts disabled
 */
static int
irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
	struct kvm_kernel_irqfd *irqfd =
		container_of(wait, struct kvm_kernel_irqfd, wait);
	__poll_t flags = key_to_poll(key);
	struct kvm_kernel_irq_routing_entry irq;
	struct kvm *kvm = irqfd->kvm;
	unsigned seq;
	int idx;
	int ret = 0;

	if (flags & EPOLLIN) {
		u64 cnt;
		eventfd_ctx_do_read(irqfd->eventfd, &cnt);

		idx = srcu_read_lock(&kvm->irq_srcu);
		do {
			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
			irq = irqfd->irq_entry;
		} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
		/* An event has been signaled, inject an interrupt */
		if (kvm_arch_set_irq_inatomic(&irq, kvm,
					      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
					      false) == -EWOULDBLOCK)
			schedule_work(&irqfd->inject);
		srcu_read_unlock(&kvm->irq_srcu, idx);
		ret = 1;
	}

	if (flags & EPOLLHUP) {
		/* The eventfd is closing, detach from KVM */
		unsigned long iflags;

		spin_lock_irqsave(&kvm->irqfds.lock, iflags);

		/*
		 * We must check if someone deactivated the irqfd before
		 * we could acquire the irqfds.lock since the item is
		 * deactivated from the KVM side before it is unhooked from
		 * the wait-queue.  If it is already deactivated, we can
		 * simply return knowing the other side will cleanup for us.
		 * We cannot race against the irqfd going away since the
		 * other side is required to acquire wqh->lock, which we hold
		 */
		if (irqfd_is_active(irqfd))
			irqfd_deactivate(irqfd);

		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
	}

	return ret;
}
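
The read loop in irqfd_wakeup() copies irq_entry and retries if a writer changed it in the meantime. The sketch below models that sequence-counter pattern in userspace with C11 atomics. It is an illustration under stated assumptions, not the kernel's seqcount implementation: struct guarded_route, struct route_snapshot, route_write and route_read are hypothetical names, the payload is reduced to two integers, and writers are assumed to be serialized by the caller (the kernel code relies on irqfds.lock for that).

```c
#include <stdatomic.h>

/* Hypothetical two-field payload standing in for irq_entry. */
struct route_snapshot {
	int type;
	int gsi;
};

struct guarded_route {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_int type;
	atomic_int gsi;
};

/* Writer side; the caller must serialize writers (e.g. with a lock). */
static void route_write(struct guarded_route *g, int type, int gsi)
{
	unsigned int s = atomic_load_explicit(&g->seq, memory_order_relaxed);

	/* Mark the update as in progress before touching the payload. */
	atomic_store_explicit(&g->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&g->type, type, memory_order_relaxed);
	atomic_store_explicit(&g->gsi, gsi, memory_order_relaxed);
	/* Publish the payload and return the counter to an even value. */
	atomic_store_explicit(&g->seq, s + 2, memory_order_release);
}

/* Reader side: lock-free, retries until it observes a stable snapshot. */
static struct route_snapshot route_read(struct guarded_route *g)
{
	struct route_snapshot copy;
	unsigned int begin, end;

	do {
		begin = atomic_load_explicit(&g->seq, memory_order_acquire);
		copy.type = atomic_load_explicit(&g->type, memory_order_relaxed);
		copy.gsi = atomic_load_explicit(&g->gsi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&g->seq, memory_order_relaxed);
	} while ((begin & 1u) || begin != end);

	return copy;
}
```

The reader never blocks the writer, which matters in a wakeup callback that runs with a spinlock held and interrupts disabled.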

static void
irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
			poll_table *pt)
{
	struct kvm_kernel_irqfd *irqfd =
		container_of(pt, struct kvm_kernel_irqfd, pt);
	add_wait_queue_priority(wqh, &irqfd->wait);
}

/* Must be called under irqfds.lock */
static void irqfd_update(struct kvm *kvm, struct kvm_kernel_irqfd *irqfd)
{
	struct kvm_kernel_irq_routing_entry *e;
	struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
	int n_entries;

	n_entries = kvm_irq_map_gsi(kvm, entries, irqfd->gsi);

	write_seqcount_begin(&irqfd->irq_entry_sc);

	e = entries;
	if (n_entries == 1)
		irqfd->irq_entry = *e;
	else
		irqfd->irq_entry.type = 0;

	write_seqcount_end(&irqfd->irq_entry_sc);
}

#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
void __attribute__((weak)) kvm_arch_irq_bypass_stop(
				struct irq_bypass_consumer *cons)
{
}

void __attribute__((weak)) kvm_arch_irq_bypass_start(
				struct irq_bypass_consumer *cons)
{
}

int __attribute__((weak)) kvm_arch_update_irqfd_routing(
				struct kvm *kvm, unsigned int host_irq,
				uint32_t guest_irq, bool set)
{
	return 0;
}

bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
				struct kvm_kernel_irq_routing_entry *old,
				struct kvm_kernel_irq_routing_entry *new)
{
	return true;
}
#endif

static int
kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
{
	struct kvm_kernel_irqfd *irqfd, *tmp;
	struct fd f;
	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
	int ret;
	__poll_t events;
	int idx;

	if (!kvm_arch_intc_initialized(kvm))
		return -EAGAIN;

	if (!kvm_arch_irqfd_allowed(kvm, args))
		return -EINVAL;

	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT);
	if (!irqfd)
		return -ENOMEM;

	irqfd->kvm = kvm;
	irqfd->gsi = args->gsi;
	INIT_LIST_HEAD(&irqfd->list);
	INIT_WORK(&irqfd->inject, irqfd_inject);
	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
	seqcount_spinlock_init(&irqfd->irq_entry_sc, &kvm->irqfds.lock);

	f = fdget(args->fd);
	if (!fd_file(f)) {
		ret = -EBADF;
		goto out;
	}

	eventfd = eventfd_ctx_fileget(fd_file(f));
	if (IS_ERR(eventfd)) {
		ret = PTR_ERR(eventfd);
		goto fail;
	}

	irqfd->eventfd = eventfd;

	if (args->flags & KVM_IRQFD_FLAG_RESAMPLE) {
		struct kvm_kernel_irqfd_resampler *resampler;

		resamplefd = eventfd_ctx_fdget(args->resamplefd);
		if (IS_ERR(resamplefd)) {
			ret = PTR_ERR(resamplefd);
			goto fail;
		}

		irqfd->resamplefd = resamplefd;
		INIT_LIST_HEAD(&irqfd->resampler_link);

		mutex_lock(&kvm->irqfds.resampler_lock);

		list_for_each_entry(resampler,
				    &kvm->irqfds.resampler_list, link) {
			if (resampler->notifier.gsi == irqfd->gsi) {
				irqfd->resampler = resampler;
				break;
			}
		}

		if (!irqfd->resampler) {
			resampler = kzalloc(sizeof(*resampler),
					    GFP_KERNEL_ACCOUNT);
			if (!resampler) {
				ret = -ENOMEM;
				mutex_unlock(&kvm->irqfds.resampler_lock);
				goto fail;
			}

			resampler->kvm = kvm;
			INIT_LIST_HEAD(&resampler->list);
			resampler->notifier.gsi = irqfd->gsi;
			resampler->notifier.irq_acked = irqfd_resampler_ack;
			INIT_LIST_HEAD(&resampler->link);

			list_add_rcu(&resampler->link, &kvm->irqfds.resampler_list);
			kvm_register_irq_ack_notifier(kvm,
						      &resampler->notifier);
			irqfd->resampler = resampler;
		}

		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
		synchronize_srcu_expedited(&kvm->irq_srcu);

		mutex_unlock(&kvm->irqfds.resampler_lock);
	}

	/*
	 * Install our own custom wake-up handling so we are notified via
	 * a callback whenever someone signals the underlying eventfd
	 */
	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

	spin_lock_irq(&kvm->irqfds.lock);

	ret = 0;
	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
		if (irqfd->eventfd != tmp->eventfd)
			continue;
		/* This fd is used for another irq already. */
		ret = -EBUSY;
		spin_unlock_irq(&kvm->irqfds.lock);
		goto fail;
	}

	idx = srcu_read_lock(&kvm->irq_srcu);
	irqfd_update(kvm, irqfd);

	list_add_tail(&irqfd->list, &kvm->irqfds.items);

	spin_unlock_irq(&kvm->irqfds.lock);

	/*
	 * Check if there was an event already pending on the eventfd
	 * before we registered, and trigger it as if we didn't miss it.
	 */
	events = vfs_poll(fd_file(f), &irqfd->pt);

	if (events & EPOLLIN)
		schedule_work(&irqfd->inject);

#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
	if (kvm_arch_has_irq_bypass()) {
		irqfd->consumer.token = (void *)irqfd->eventfd;
		irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
		irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
		irqfd->consumer.stop = kvm_arch_irq_bypass_stop;
		irqfd->consumer.start = kvm_arch_irq_bypass_start;
		ret = irq_bypass_register_consumer(&irqfd->consumer);
		if (ret)
			pr_info("irq bypass consumer (token %p) registration fails: %d\n",
				irqfd->consumer.token, ret);
	}
#endif

	srcu_read_unlock(&kvm->irq_srcu, idx);

	/*
	 * do not drop the file until the irqfd is fully initialized, otherwise
	 * we might race against the EPOLLHUP
	 */
	fdput(f);
	return 0;

fail:
	if (irqfd->resampler)
		irqfd_resampler_shutdown(irqfd);

	if (resamplefd && !IS_ERR(resamplefd))
		eventfd_ctx_put(resamplefd);

	if (eventfd && !IS_ERR(eventfd))
		eventfd_ctx_put(eventfd);

	fdput(f);

out:
	kfree(irqfd);
	return ret;
}

bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
	struct kvm_irq_ack_notifier *kian;
	int gsi, idx;

	idx = srcu_read_lock(&kvm->irq_srcu);
	gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
	if (gsi != -1)
		hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list,
					  link, srcu_read_lock_held(&kvm->irq_srcu))
			if (kian->gsi == gsi) {
				srcu_read_unlock(&kvm->irq_srcu, idx);
				return true;
			}

	srcu_read_unlock(&kvm->irq_srcu, idx);

	return false;
}
EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);

void kvm_notify_acked_gsi(struct kvm *kvm, int gsi)
{
	struct kvm_irq_ack_notifier *kian;

	hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list,
				  link, srcu_read_lock_held(&kvm->irq_srcu))
		if (kian->gsi == gsi)
			kian->irq_acked(kian);
}

void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
	int gsi, idx;

	trace_kvm_ack_irq(irqchip, pin);

	idx = srcu_read_lock(&kvm->irq_srcu);
	gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
	if (gsi != -1)
		kvm_notify_acked_gsi(kvm, gsi);
	srcu_read_unlock(&kvm->irq_srcu, idx);
}

void kvm_register_irq_ack_notifier(struct kvm *kvm,
				   struct kvm_irq_ack_notifier *kian)
{
	mutex_lock(&kvm->irq_lock);
	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
	mutex_unlock(&kvm->irq_lock);
	kvm_arch_post_irq_ack_notifier_list_update(kvm);
}

void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
				    struct kvm_irq_ack_notifier *kian)
{
	mutex_lock(&kvm->irq_lock);
	hlist_del_init_rcu(&kian->link);
	mutex_unlock(&kvm->irq_lock);
	synchronize_srcu_expedited(&kvm->irq_srcu);
	kvm_arch_post_irq_ack_notifier_list_update(kvm);
}

/*
 * shutdown any irqfd's that match fd+gsi
 */
static int
kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
{
	struct kvm_kernel_irqfd *irqfd, *tmp;
	struct eventfd_ctx *eventfd;

	eventfd = eventfd_ctx_fdget(args->fd);
	if (IS_ERR(eventfd))
		return PTR_ERR(eventfd);

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
		if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
			/*
			 * This clearing of irq_entry.type is needed for when
			 * another thread calls kvm_irq_routing_update before
			 * we flush workqueue below (we synchronize with
			 * kvm_irq_routing_update using irqfds.lock).
			 */
			write_seqcount_begin(&irqfd->irq_entry_sc);
			irqfd->irq_entry.type = 0;
			write_seqcount_end(&irqfd->irq_entry_sc);
			irqfd_deactivate(irqfd);
		}
	}

	spin_unlock_irq(&kvm->irqfds.lock);
	eventfd_ctx_put(eventfd);

	/*
	 * Block until we know all outstanding shutdown jobs have completed
	 * so that we guarantee there will not be any more interrupts on this
	 * gsi once this deassign function returns.
	 */
	flush_workqueue(irqfd_cleanup_wq);

	return 0;
}

int
kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
{
	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_RESAMPLE))
		return -EINVAL;

	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
		return kvm_irqfd_deassign(kvm, args);

	return kvm_irqfd_assign(kvm, args);
}
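
kvm_irqfd() rejects any flag bit it does not know about before deciding between assign and deassign. The following userspace sketch isolates that validate-then-dispatch pattern so it can be exercised without a VM. Everything prefixed with SKETCH_ or sketch_ is a hypothetical name, and the bit positions are assumptions made for illustration; real code should take the flag definitions from the KVM UAPI header.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for illustration; bit positions are assumed here. */
#define SKETCH_IRQFD_FLAG_DEASSIGN (1u << 0)
#define SKETCH_IRQFD_FLAG_RESAMPLE (1u << 1)

enum sketch_irqfd_action {
	SKETCH_IRQFD_ASSIGN,
	SKETCH_IRQFD_DEASSIGN,
};

/*
 * Returns 0 and stores the chosen action, or -EINVAL if any bit outside
 * the known mask is set. The checks run in the same order as kvm_irqfd():
 * unknown bits first, then deassign, then assign as the default.
 */
static int sketch_irqfd_classify(uint32_t flags,
				 enum sketch_irqfd_action *action)
{
	const uint32_t known = SKETCH_IRQFD_FLAG_DEASSIGN |
			       SKETCH_IRQFD_FLAG_RESAMPLE;

	if (flags & ~known)
		return -EINVAL;

	if (flags & SKETCH_IRQFD_FLAG_DEASSIGN)
		*action = SKETCH_IRQFD_DEASSIGN;
	else
		*action = SKETCH_IRQFD_ASSIGN;

	return 0;
}
```

Rejecting unknown bits up front keeps the remaining flag space available for future extensions, since old kernels then fail loudly instead of silently ignoring a new flag.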

/*
 * This function is called as the kvm VM fd is being released. Shutdown all
 * irqfds that still remain open
 */
void
kvm_irqfd_release(struct kvm *kvm)
{
	struct kvm_kernel_irqfd *irqfd, *tmp;

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
		irqfd_deactivate(irqfd);

	spin_unlock_irq(&kvm->irqfds.lock);

	/*
	 * Block until we know all outstanding shutdown jobs have completed
	 * since we do not take a kvm* reference.
	 */
	flush_workqueue(irqfd_cleanup_wq);

}

/*
 * Take note of a change in irq routing.
 * Caller must invoke synchronize_srcu_expedited(&kvm->irq_srcu) afterwards.
 */
void kvm_irq_routing_update(struct kvm *kvm)
{
	struct kvm_kernel_irqfd *irqfd;

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry(irqfd, &kvm->irqfds.items, list) {
#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
		/* Under irqfds.lock, so can read irq_entry safely */
		struct kvm_kernel_irq_routing_entry old = irqfd->irq_entry;
#endif

		irqfd_update(kvm, irqfd);

#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
		if (irqfd->producer &&
		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
			int ret = kvm_arch_update_irqfd_routing(
					irqfd->kvm, irqfd->producer->irq,
					irqfd->gsi, 1);
			WARN_ON(ret);
		}
#endif
	}

	spin_unlock_irq(&kvm->irqfds.lock);
}
|
|
|
|
|
KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking
KVM irqfd based emulation of level-triggered interrupts doesn't work
quite correctly in some cases, particularly in the case of interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of hardirq), not earlier.
Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we always notify resamplefd at the EOI, so vfio prematurely unmasks the
host physical IRQ, thus a new physical interrupt is fired in the host.
This extra interrupt in the host is not a problem per se. The problem is
that it is unconditionally queued for injection into the guest, so the
guest sees an extra bogus interrupt. [*]
There are observed at least 2 user-visible issues caused by those
extra erroneous interrupts for a oneshot irq in the guest:
1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.
The core issue here is that by the time when the guest unmasks the IRQ,
the physical IRQ line is no longer asserted (since the guest has
acked the interrupt to the device in the meantime), yet we
unconditionally inject the interrupt queued into the guest by the
previous resampling. So to fix the issue, we need a way to detect that
the IRQ is no longer pending, and cancel the queued interrupt in this
case.
With IOAPIC we are not able to probe the physical IRQ line state
directly (at least not if the underlying physical interrupt controller
is an IOAPIC too), so in this patch we use irqfd resampler for that.
Namely, instead of injecting the queued interrupt, we just notify the
resampler that this interrupt is done. If the IRQ line is actually
already deasserted, we are done. If it is still asserted, a new
interrupt will be shortly triggered through irqfd and injected into the
guest.
In the case if there is no irqfd resampler registered for this IRQ, we
cannot fix the issue, so we keep the existing behavior: immediately
unconditionally inject the queued interrupt.
This patch fixes the issue for x86 IOAPIC only. In the long run, we can
fix it for other irqchips and other architectures too, possibly taking
advantage of reading the physical state of the IRQ line, which is
possible with some other irqchips (e.g. with arm64 GIC, maybe even with
the legacy x86 PIC).
[*] In this description we assume that the interrupt is a physical host
interrupt forwarded to the guest e.g. by vfio. Potentially the same
issue may occur also with a purely virtual interrupt from an
emulated device, e.g. if the guest handles this interrupt, again, as
a oneshot interrupt.
Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-22 20:43:44 +00:00
bool kvm_notify_irqfd_resampler(struct kvm *kvm,
				unsigned int irqchip,
				unsigned int pin)
{
	struct kvm_kernel_irqfd_resampler *resampler;
	int gsi, idx;

	idx = srcu_read_lock(&kvm->irq_srcu);
	gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
	if (gsi != -1) {
		list_for_each_entry_srcu(resampler,
					 &kvm->irqfds.resampler_list, link,
					 srcu_read_lock_held(&kvm->irq_srcu)) {
			if (resampler->notifier.gsi == gsi) {
				irqfd_resampler_notify(resampler);
				srcu_read_unlock(&kvm->irq_srcu, idx);
				return true;
			}
		}
	}
	srcu_read_unlock(&kvm->irq_srcu, idx);

	return false;
}

/*
 * create a host-wide workqueue for issuing deferred shutdown requests
 * aggregated from all vm* instances. We need our own isolated
 * queue to ease flushing work items when a VM exits.
 */
int kvm_irqfd_init(void)
{
	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);
	if (!irqfd_cleanup_wq)
		return -ENOMEM;

	return 0;
}

void kvm_irqfd_exit(void)
{
	destroy_workqueue(irqfd_cleanup_wq);
}
#endif

/*
 * --------------------------------------------------------------------
 * ioeventfd: translate a PIO/MMIO memory write to an eventfd signal.
 *
 * userspace can register a PIO/MMIO address with an eventfd for receiving
 * notification when the memory has been touched.
 * --------------------------------------------------------------------
 */
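
To make "eventfd signal" concrete from the receiving side, here is a small userspace sketch of the eventfd counter semantics that a consumer of an ioeventfd would observe: each signal adds to a 64-bit counter, and one read returns the accumulated count and resets it. It uses only the standard Linux eventfd(2), read(2) and write(2) calls. The helper names doorbell_signal and doorbell_drain are hypothetical, and the writes here come from userspace rather than from a guest PIO/MMIO access.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Add n to the eventfd counter; returns 0 on success, -1 on error. */
static int doorbell_signal(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Read and reset the counter. Returns the number of signals accumulated
 * since the last drain, or 0 if nothing is pending (a non-blocking eventfd
 * fails the read with EAGAIN in that case).
 */
static uint64_t doorbell_drain(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;

	return count;
}
```

A consumer would typically create the descriptor with eventfd(0, EFD_NONBLOCK), hand it to KVM for registration, and call a drain helper like this from its poll loop. Signals that arrive between two drains are coalesced into a single count.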

struct _ioeventfd {
	struct list_head     list;
	u64                  addr;
	int                  length;
	struct eventfd_ctx  *eventfd;
	u64                  datamatch;
	struct kvm_io_device dev;
	u8                   bus_idx;
	bool                 wildcard;
};

static inline struct _ioeventfd *
to_ioeventfd(struct kvm_io_device *dev)
{
	return container_of(dev, struct _ioeventfd, dev);
}

static void
ioeventfd_release(struct _ioeventfd *p)
{
	eventfd_ctx_put(p->eventfd);
	list_del(&p->list);
	kfree(p);
}

static bool
ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int len, const void *val)
{
	u64 _val;

	if (addr != p->addr)
		/* address must be precise for a hit */
		return false;

	if (!p->length)
		/* length = 0 means only look at the address, so always a hit */
		return true;

	if (len != p->length)
2009-07-07 21:08:49 +00:00
|
|
|
/* address-range must be precise for a hit */
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (p->wildcard)
|
|
|
|
/* all else equal, wildcard is always a hit */
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* otherwise, we have to actually compare the data */
|
|
|
|
|
|
|
|
BUG_ON(!IS_ALIGNED((unsigned long)val, len));
|
|
|
|
|
|
|
|
switch (len) {
|
|
|
|
case 1:
|
|
|
|
_val = *(u8 *)val;
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
_val = *(u16 *)val;
|
|
|
|
break;
|
|
|
|
case 4:
|
|
|
|
_val = *(u32 *)val;
|
|
|
|
break;
|
|
|
|
case 8:
|
|
|
|
_val = *(u64 *)val;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2020-04-20 12:38:05 +00:00
|
|
|
return _val == p->datamatch;
|
2009-07-07 21:08:49 +00:00
|
|
|
}
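The matching order in ioeventfd_in_range() can be modelled in userspace as a rough sketch. This is not kernel code: `struct ioeventfd_model`, `model_in_range()` and `model_in_range_examples()` are hypothetical stand-ins, and the sketch uses memcpy() to load the value where the kernel uses a BUG_ON() alignment check followed by a cast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for the fields the matcher reads. */
struct ioeventfd_model {
	uint64_t addr;
	int      length;     /* 0 means "match on the address only" */
	uint64_t datamatch;
	bool     wildcard;   /* true when no datamatch was requested */
};

/* Follows the same decision order as ioeventfd_in_range(). */
static bool model_in_range(const struct ioeventfd_model *p, uint64_t addr,
			   int len, const void *val)
{
	uint64_t v;

	if (addr != p->addr)
		return false;           /* address must be exact */
	if (!p->length)
		return true;            /* length 0: address alone decides */
	if (len != p->length)
		return false;           /* access width must be exact */
	if (p->wildcard)
		return true;            /* any written value is a hit */

	/* Otherwise compare the written data, widened to 64 bits. */
	switch (len) {
	case 1: { uint8_t  t; memcpy(&t, val, sizeof(t)); v = t; break; }
	case 2: { uint16_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
	case 4: { uint32_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
	case 8: { uint64_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
	default:
		return false;
	}
	return v == p->datamatch;
}

/* Walks through each branch; aborts via assert() if the model disagrees. */
void model_in_range_examples(void)
{
	struct ioeventfd_model m = {
		.addr = 0x1000, .length = 4, .datamatch = 0xdeadbeef,
	};
	uint32_t hit = 0xdeadbeef, miss = 1;

	assert(model_in_range(&m, 0x1000, 4, &hit));
	assert(!model_in_range(&m, 0x1000, 4, &miss));  /* wrong data */
	assert(!model_in_range(&m, 0x1004, 4, &hit));   /* wrong address */
	assert(!model_in_range(&m, 0x1000, 2, &hit));   /* wrong width */

	m.wildcard = true;
	assert(model_in_range(&m, 0x1000, 4, &miss));   /* data ignored */
}
```

Note that the address and width checks run before the data is read, so a mismatched access never dereferences `val`.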
|
|
|
|
|
|
|
|
/* MMIO/PIO writes trigger an event if the addr/val match */
|
|
|
|
static int
|
2015-03-26 14:39:28 +00:00
|
|
|
ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
|
|
|
|
int len, const void *val)
|
2009-07-07 21:08:49 +00:00
|
|
|
{
|
|
|
|
struct _ioeventfd *p = to_ioeventfd(this);
|
|
|
|
|
|
|
|
if (!ioeventfd_in_range(p, addr, len, val))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2023-11-22 12:48:23 +00:00
|
|
|
eventfd_signal(p->eventfd);
|
2009-07-07 21:08:49 +00:00
|
|
|
return 0;
|
|
|
|
}
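On the consuming side, an eventfd is a 64-bit counter: each signal adds to it and a read returns the accumulated count and resets it. The sketch below shows that behaviour from userspace with eventfd(2), assuming a Linux host. Here write(2) stands in for the in-kernel eventfd_signal() call, and the helper names `ring()`, `drain()` and `coalesced_signal_count()` are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Add 1 to the eventfd counter, as one matching guest write would. */
static int ring(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Read and reset the counter; returns the number of signals seen, or -1. */
static int64_t drain(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}

/* Two signals that arrive before one read are reported as a count of 2. */
int64_t coalesced_signal_count(void)
{
	int64_t seen;
	int efd = eventfd(0, 0);    /* initial count 0, default (non-semaphore) mode */

	if (efd < 0)
		return -1;
	if (ring(efd) || ring(efd)) {
		close(efd);
		return -1;
	}
	seen = drain(efd);
	close(efd);
	return seen;
}

/* Aborts via assert() if the counter does not behave as described. */
void eventfd_coalescing_example(void)
{
	assert(coalesced_signal_count() == 2);
}
```

Because the counter accumulates, a consumer that is slow to read does not lose notifications; it only loses the distinction between individual writes.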
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is called as KVM is completely shutting down. We do not
|
|
|
|
* need to worry about locking; just nuke anything we have as quickly as possible.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ioeventfd_destructor(struct kvm_io_device *this)
|
|
|
|
{
|
|
|
|
struct _ioeventfd *p = to_ioeventfd(this);
|
|
|
|
|
|
|
|
ioeventfd_release(p);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct kvm_io_device_ops ioeventfd_ops = {
|
|
|
|
.write = ioeventfd_write,
|
|
|
|
.destructor = ioeventfd_destructor,
|
|
|
|
};
|
|
|
|
|
|
|
|
/* assumes kvm->slots_lock held */
|
|
|
|
static bool
|
|
|
|
ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
|
|
|
|
{
|
|
|
|
struct _ioeventfd *_p;
|
|
|
|
|
|
|
|
list_for_each_entry(_p, &kvm->ioeventfds, list)
|
2013-04-04 10:27:21 +00:00
|
|
|
if (_p->bus_idx == p->bus_idx &&
|
2014-03-31 18:50:38 +00:00
|
|
|
_p->addr == p->addr &&
|
|
|
|
(!_p->length || !p->length ||
|
|
|
|
(_p->length == p->length &&
|
|
|
|
(_p->wildcard || p->wildcard ||
|
|
|
|
_p->datamatch == p->datamatch))))
|
2009-07-07 21:08:49 +00:00
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
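The nested condition in ioeventfd_check_collision() is easier to follow when unrolled into early returns. The following userspace sketch is an illustration under assumptions, not the kernel implementation: `struct reg_model`, `model_collides()` and `model_collides_examples()` are hypothetical names, and the bus index is a plain int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the collision check compares. */
struct reg_model {
	int      bus_idx;
	uint64_t addr;
	int      length;     /* 0 means "match on the address only" */
	uint64_t datamatch;
	bool     wildcard;
};

/* Same predicate as the kernel's single compound condition, unrolled. */
static bool model_collides(const struct reg_model *a, const struct reg_model *b)
{
	if (a->bus_idx != b->bus_idx || a->addr != b->addr)
		return false;   /* different bus or address: never collide */
	if (!a->length || !b->length)
		return true;    /* a zero-length entry matches every width */
	if (a->length != b->length)
		return false;   /* distinct widths can coexist */
	/* Same width: collide unless both pin different data values. */
	return a->wildcard || b->wildcard || a->datamatch == b->datamatch;
}

/* Aborts via assert() if the model disagrees with the cases described. */
void model_collides_examples(void)
{
	struct reg_model q0 = { .bus_idx = 0, .addr = 0x10, .length = 2, .datamatch = 0 };
	struct reg_model q1 = q0;
	struct reg_model any = q0;

	q1.datamatch = 1;
	any.wildcard = true;

	assert(!model_collides(&q0, &q1));  /* same address, different data */
	assert(model_collides(&q0, &q0));   /* identical registration */
	assert(model_collides(&q0, &any));  /* wildcard overlaps everything */
}
```

The first case is what lets several registrations share one address and width as long as each requests a different datamatch value.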
|
|
|
|
|
2013-02-28 11:33:20 +00:00
|
|
|
static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
|
|
|
|
{
|
|
|
|
if (flags & KVM_IOEVENTFD_FLAG_PIO)
|
|
|
|
return KVM_PIO_BUS;
|
|
|
|
if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
|
|
|
|
return KVM_VIRTIO_CCW_NOTIFY_BUS;
|
|
|
|
return KVM_MMIO_BUS;
|
|
|
|
}
|
|
|
|
|
2015-09-15 06:41:55 +00:00
|
|
|
static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
|
|
|
|
enum kvm_bus bus_idx,
|
|
|
|
struct kvm_ioeventfd *args)
|
2009-07-07 21:08:49 +00:00
|
|
|
{
|
|
|
|
|
2015-09-15 06:41:55 +00:00
|
|
|
struct eventfd_ctx *eventfd;
|
|
|
|
struct _ioeventfd *p;
|
|
|
|
int ret;
|
2014-03-31 18:50:38 +00:00
|
|
|
|
2009-07-07 21:08:49 +00:00
|
|
|
eventfd = eventfd_ctx_fdget(args->fd);
|
|
|
|
if (IS_ERR(eventfd))
|
|
|
|
return PTR_ERR(eventfd);
|
|
|
|
|
2019-02-11 19:02:49 +00:00
|
|
|
p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
|
2009-07-07 21:08:49 +00:00
|
|
|
if (!p) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&p->list);
|
|
|
|
p->addr = args->addr;
|
2013-04-04 10:27:21 +00:00
|
|
|
p->bus_idx = bus_idx;
|
2009-07-07 21:08:49 +00:00
|
|
|
p->length = args->len;
|
|
|
|
p->eventfd = eventfd;
|
|
|
|
|
|
|
|
/* The datamatch feature is optional, otherwise this is a wildcard */
|
|
|
|
if (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH)
|
|
|
|
p->datamatch = args->datamatch;
|
|
|
|
else
|
|
|
|
p->wildcard = true;
|
|
|
|
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_lock(&kvm->slots_lock);
|
2009-07-07 21:08:49 +00:00
|
|
|
|
2011-03-31 01:57:33 +00:00
|
|
|
/* Verify that there isn't a match already */
|
2009-07-07 21:08:49 +00:00
|
|
|
if (ioeventfd_check_collision(kvm, p)) {
|
|
|
|
ret = -EEXIST;
|
|
|
|
goto unlock_fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
kvm_iodevice_init(&p->dev, &ioeventfd_ops);
|
|
|
|
|
2011-07-27 13:00:48 +00:00
|
|
|
ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
|
|
|
|
&p->dev);
|
2009-07-07 21:08:49 +00:00
|
|
|
if (ret < 0)
|
|
|
|
goto unlock_fail;
|
|
|
|
|
2017-07-07 08:51:38 +00:00
|
|
|
kvm_get_bus(kvm, bus_idx)->ioeventfd_count++;
|
2009-07-07 21:08:49 +00:00
|
|
|
list_add_tail(&p->list, &kvm->ioeventfds);
|
|
|
|
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_unlock(&kvm->slots_lock);
|
2009-07-07 21:08:49 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
unlock_fail:
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_unlock(&kvm->slots_lock);
|
2023-03-27 17:54:57 +00:00
|
|
|
kfree(p);
|
2009-07-07 21:08:49 +00:00
|
|
|
|
|
|
|
fail:
|
|
|
|
eventfd_ctx_put(eventfd);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
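The assign path above records either an exact `datamatch` value or a wildcard, depending on whether `KVM_IOEVENTFD_FLAG_DATAMATCH` is set. The following is a minimal standalone sketch of that choice and of how a guest write might then be matched. It is not the kernel code: the `sketch_*` names and `SKETCH_FLAG_DATAMATCH` are hypothetical stand-ins, and the matching rule (same address, same length, then wildcard or equal value) is a simplifying assumption that leaves out any special handling of zero-length registrations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_IOEVENTFD_FLAG_DATAMATCH; the real
 * definition lives in <linux/kvm.h>. */
#define SKETCH_FLAG_DATAMATCH (1u << 0)

/* Simplified model of the fields the assign path fills in. */
struct sketch_ioeventfd {
	uint64_t addr;
	uint32_t length;
	uint64_t datamatch;
	bool     wildcard;
};

/* Mirrors the flag handling shown above: with DATAMATCH set, remember the
 * value to compare against; otherwise treat every written value as a hit. */
static void sketch_init(struct sketch_ioeventfd *p, uint64_t addr,
			uint32_t len, uint32_t flags, uint64_t datamatch)
{
	p->addr = addr;
	p->length = len;
	p->datamatch = 0;
	p->wildcard = false;

	if (flags & SKETCH_FLAG_DATAMATCH)
		p->datamatch = datamatch;
	else
		p->wildcard = true;
}

/* Assumed matching rule (a simplification): the write must hit the
 * registered address with the registered length, and then either the
 * entry is a wildcard or the written value equals datamatch. */
static bool sketch_matches(const struct sketch_ioeventfd *p, uint64_t addr,
			   uint32_t len, uint64_t val)
{
	if (addr != p->addr || len != p->length)
		return false;

	return p->wildcard || val == p->datamatch;
}
```

Under these assumptions, a wildcard entry at `0x1000` fires for any 4-byte value written there, while an entry registered with datamatch `0x42` fires only when the guest writes exactly `0x42`.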
|
|
|
|
|
|
|
|
static int
|
2015-09-15 06:41:55 +00:00
|
|
|
kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
struct kvm_ioeventfd *args)
|
2009-07-07 21:08:49 +00:00
|
|
|
{
|
2023-02-07 12:37:13 +00:00
|
|
|
struct _ioeventfd *p;
|
2009-07-07 21:08:49 +00:00
|
|
|
struct eventfd_ctx *eventfd;
|
2017-07-07 08:51:38 +00:00
|
|
|
struct kvm_io_bus *bus;
|
2009-07-07 21:08:49 +00:00
|
|
|
int ret = -ENOENT;
|
2020-09-11 05:56:52 +00:00
|
|
|
bool wildcard;
|
2009-07-07 21:08:49 +00:00
|
|
|
|
|
|
|
eventfd = eventfd_ctx_fdget(args->fd);
|
|
|
|
if (IS_ERR(eventfd))
|
|
|
|
return PTR_ERR(eventfd);
|
|
|
|
|
2020-09-11 05:56:52 +00:00
|
|
|
wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);
|
|
|
|
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_lock(&kvm->slots_lock);
|
2009-07-07 21:08:49 +00:00
|
|
|
|
2023-02-07 12:37:13 +00:00
|
|
|
list_for_each_entry(p, &kvm->ioeventfds, list) {
|
2013-04-04 10:27:21 +00:00
|
|
|
if (p->bus_idx != bus_idx ||
|
|
|
|
p->eventfd != eventfd ||
|
2009-07-07 21:08:49 +00:00
|
|
|
p->addr != args->addr ||
|
|
|
|
p->length != args->len ||
|
|
|
|
p->wildcard != wildcard)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!p->wildcard && p->datamatch != args->datamatch)
|
|
|
|
continue;
|
|
|
|
|
2009-12-23 16:35:24 +00:00
|
|
|
kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
|
2017-07-07 08:51:38 +00:00
|
|
|
bus = kvm_get_bus(kvm, bus_idx);
|
|
|
|
if (bus)
|
|
|
|
bus->ioeventfd_count--;
|
2009-07-07 21:08:49 +00:00
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2009-12-23 16:35:26 +00:00
|
|
|
mutex_unlock(&kvm->slots_lock);
|
2009-07-07 21:08:49 +00:00
|
|
|
|
|
|
|
eventfd_ctx_put(eventfd);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
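The deassign loop above removes the first registered entry whose bus, eventfd, address, length and wildcard setting all equal the request, and whose datamatch value also agrees when the entry is not a wildcard. A minimal userspace sketch of that predicate follows. The struct and function names (`sketch_ioeventfd`, `sketch_request`, `sketch_entry_matches`) are hypothetical stand-ins invented for illustration, not the kernel's own types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's private per-registration record. */
struct sketch_ioeventfd {
	int         bus_idx;   /* I/O bus the entry is registered on */
	const void *eventfd;   /* identity of the eventfd context */
	uint64_t    addr;
	uint32_t    length;
	uint64_t    datamatch;
	bool        wildcard;  /* true when no DATAMATCH value is required */
};

/* Hypothetical stand-in for the fields taken from the deassign request. */
struct sketch_request {
	int         bus_idx;
	const void *eventfd;
	uint64_t    addr;
	uint32_t    len;
	uint64_t    datamatch;
	bool        wildcard;
};

/* Mirrors the two "continue" tests in the loop: true means "remove this one". */
static bool sketch_entry_matches(const struct sketch_ioeventfd *p,
				 const struct sketch_request *r)
{
	if (p->bus_idx != r->bus_idx ||
	    p->eventfd != r->eventfd ||
	    p->addr != r->addr ||
	    p->length != r->len ||
	    p->wildcard != r->wildcard)
		return false;

	/* The datamatch value only matters for non-wildcard entries. */
	if (!p->wildcard && p->datamatch != r->datamatch)
		return false;

	return true;
}
```

Because the wildcard flag is compared before datamatch, a request with DATAMATCH set never removes a wildcard registration at the same address, and vice versa.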
|
|
|
|
|
2015-09-15 06:41:55 +00:00
|
|
|
static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
|
|
|
|
{
|
|
|
|
enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
|
2015-09-15 06:41:56 +00:00
|
|
|
int ret = kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
|
|
|
|
|
|
|
|
if (!args->len && bus_idx == KVM_MMIO_BUS)
|
|
|
|
kvm_deassign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
|
2015-09-15 06:41:55 +00:00
|
|
|
|
2015-09-15 06:41:56 +00:00
|
|
|
return ret;
|
2015-09-15 06:41:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
|
|
|
|
{
|
|
|
|
enum kvm_bus bus_idx;
|
2015-09-15 06:41:56 +00:00
|
|
|
int ret;
|
2015-09-15 06:41:55 +00:00
|
|
|
|
|
|
|
bus_idx = ioeventfd_bus_from_flags(args->flags);
|
|
|
|
/* must be natural-word sized, or 0 to ignore length */
|
|
|
|
switch (args->len) {
|
|
|
|
case 0:
|
|
|
|
case 1:
|
|
|
|
case 2:
|
|
|
|
case 4:
|
|
|
|
case 8:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* check for range overflow */
|
|
|
|
if (args->addr + args->len < args->addr)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* check for extra flags that we don't understand */
|
|
|
|
if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* ioeventfd with no length can't be combined with DATAMATCH */
|
2015-09-15 06:41:59 +00:00
|
|
|
if (!args->len && (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH))
|
2015-09-15 06:41:55 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2015-09-15 06:41:56 +00:00
|
|
|
ret = kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
|
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
/* When length is ignored, MMIO is also put on a separate bus, for
|
|
|
|
* faster lookups.
|
|
|
|
*/
|
|
|
|
if (!args->len && bus_idx == KVM_MMIO_BUS) {
|
|
|
|
ret = kvm_assign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
|
|
|
|
if (ret < 0)
|
|
|
|
goto fast_fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fast_fail:
|
|
|
|
kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
|
|
|
|
fail:
|
|
|
|
return ret;
|
2015-09-15 06:41:55 +00:00
|
|
|
}
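The argument checks at the top of kvm_assign_ioeventfd() can be tried out in userspace. The sketch below reproduces the four checks in the same order: length, range overflow, unknown flags, and DATAMATCH with zero length. The `SKETCH_*` constants and `sketch_validate()` are hypothetical names, and the flag bit values are assumptions chosen for illustration rather than copied from the uapi header.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed, illustrative flag bits; the real values live in the KVM uapi header. */
#define SKETCH_FLAG_DATAMATCH  (1u << 0)
#define SKETCH_FLAG_PIO        (1u << 1)
#define SKETCH_FLAG_DEASSIGN   (1u << 2)
#define SKETCH_VALID_FLAG_MASK \
	(SKETCH_FLAG_DATAMATCH | SKETCH_FLAG_PIO | SKETCH_FLAG_DEASSIGN)

/* Returns 0 if the request would pass the checks, -EINVAL otherwise. */
static int sketch_validate(uint64_t addr, uint32_t len, uint32_t flags)
{
	/* must be natural-word sized, or 0 to ignore length */
	switch (len) {
	case 0:
	case 1:
	case 2:
	case 4:
	case 8:
		break;
	default:
		return -EINVAL;
	}

	/* unsigned wrap-around means addr + len overflowed the address space */
	if (addr + len < addr)
		return -EINVAL;

	/* reject any flag bit outside the known set */
	if (flags & ~SKETCH_VALID_FLAG_MASK)
		return -EINVAL;

	/* a zero-length registration cannot also ask for a data match */
	if (!len && (flags & SKETCH_FLAG_DATAMATCH))
		return -EINVAL;

	return 0;
}
```

The overflow test relies on unsigned arithmetic wrapping, so it needs no wider type. An address within 8 bytes of the top of the 64-bit range combined with len 8 is rejected.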
|
|
|
|
|
2009-07-07 21:08:49 +00:00
|
|
|
int
|
|
|
|
kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
|
|
|
|
{
|
|
|
|
if (args->flags & KVM_IOEVENTFD_FLAG_DEASSIGN)
|
|
|
|
return kvm_deassign_ioeventfd(kvm, args);
|
|
|
|
|
|
|
|
return kvm_assign_ioeventfd(kvm, args);
|
|
|
|
}
|
2023-10-18 16:18:00 +00:00
|
|
|
|
|
|
|
void
|
|
|
|
kvm_eventfd_init(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
|
|
|
spin_lock_init(&kvm->irqfds.lock);
|
|
|
|
INIT_LIST_HEAD(&kvm->irqfds.items);
|
|
|
|
INIT_LIST_HEAD(&kvm->irqfds.resampler_list);
|
|
|
|
mutex_init(&kvm->irqfds.resampler_lock);
|
|
|
|
#endif
|
|
|
|
INIT_LIST_HEAD(&kvm->ioeventfds);
|
|
|
|
}
|