linux/io_uring
Paolo Bonzini 6c370dc653 Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgy
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which,
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences from memfd files (and every other memory subsystem)
are that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do, however, support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.
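
As a rough illustration of the ioctl() described above, the sketch below
wraps KVM_CREATE_GUEST_MEMFD in a small helper.  The helper name
guest_memfd_create() is hypothetical, and the code assumes the installed
<linux/kvm.h> is new enough to define struct kvm_create_guest_memfd; on
older headers it simply fails with ENOSYS.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: create a guest_memfd of `size` bytes bound to the
 * VM behind vm_fd (a descriptor returned by KVM_CREATE_VM).
 * Returns the new file descriptor, or -1 with errno set.
 */
static int guest_memfd_create(int vm_fd, uint64_t size)
{
#ifdef KVM_CREATE_GUEST_MEMFD
	struct kvm_create_guest_memfd args;

	/* flags and the reserved fields are left zeroed */
	memset(&args, 0, sizeof(args));
	args.size = size;	/* cannot be changed later: no resizing */

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
#else
	/* Assumption: headers predate guest_memfd, so report "unsupported". */
	(void)vm_fd;
	(void)size;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would typically open /dev/kvm, create a VM with KVM_CREATE_VM,
and pass the resulting VM fd to the helper with a page-aligned size.
The returned fd cannot be mmap()ed, read, or written; punching a hole is
expected to go through fallocate() with FALLOC_FL_PUNCH_HOLE combined
with FALLOC_FL_KEEP_SIZE.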

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.
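
A minimal sketch of how userspace might flip a range of guest memory
between shared and private with this ioctl() follows.  The helper name
set_guest_range_private() is hypothetical, and the code assumes headers
that define struct kvm_memory_attributes and
KVM_MEMORY_ATTRIBUTE_PRIVATE; without them it fails with ENOSYS.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: mark the guest physical range [gpa, gpa + size)
 * as private (make_private != 0) or shared (make_private == 0).
 * Returns 0 on success, or -1 with errno set.
 */
static int set_guest_range_private(int vm_fd, uint64_t gpa, uint64_t size,
				   int make_private)
{
#if defined(KVM_SET_MEMORY_ATTRIBUTES) && defined(KVM_MEMORY_ATTRIBUTE_PRIVATE)
	struct kvm_memory_attributes attr;

	memset(&attr, 0, sizeof(attr));	/* flags stays zero */
	attr.address = gpa;
	attr.size = size;
	attr.attributes = make_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0;

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
#else
	/* Assumption: headers predate memory attributes support. */
	(void)vm_fd;
	(void)gpa;
	(void)size;
	(void)make_private;
	errno = ENOSYS;
	return -1;
#endif
}
```

Both the address and the size are expected to be page-aligned.  Whether
a given VM supports the private attribute can be probed beforehand with
KVM_CHECK_EXTENSION and KVM_CAP_MEMORY_ATTRIBUTES.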

The immediate and driving use case for guest_memfd is Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allowed guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mapping sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides a clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14 08:31:31 -05:00
..
advise.c io_uring: always go async for unsupported fadvise flags 2023-01-29 15:18:26 -07:00
advise.h io_uring: split out fadvise/madvise operations 2022-07-24 18:39:11 -06:00
alloc_cache.h io_uring/rsrc: consolidate node caching 2023-04-12 12:09:41 -06:00
cancel.c io_uring: add support for futex wake and wait 2023-09-29 02:36:57 -06:00
cancel.h io_uring: add support for futex wake and wait 2023-09-29 02:36:57 -06:00
epoll.c io_uring: undeprecate epoll_ctl support 2023-05-26 20:22:41 -06:00
epoll.h io_uring: move epoll handler to its own file 2022-07-24 18:39:11 -06:00
fdinfo.c io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid 2023-10-25 07:44:14 -06:00
fdinfo.h io_uring: move fdinfo helpers to its own file 2022-07-24 18:39:12 -06:00
filetable.c io_uring: add helpers to decode the fixed file file_ptr 2023-06-20 09:36:22 -06:00
filetable.h io_uring: add helpers to decode the fixed file file_ptr 2023-06-20 09:36:22 -06:00
fs.c io_uring/fs: remove sqe->rw_flags checking from LINKAT 2023-09-29 03:07:09 -06:00
fs.h io_uring: split out filesystem related operations 2022-07-24 18:39:11 -06:00
futex.c io_uring: add support for vectored futex waits 2023-09-29 02:37:08 -06:00
futex.h io_uring: add support for vectored futex waits 2023-09-29 02:37:08 -06:00
io_uring.c Merge branch 'kvm-guestmemfd' into HEAD 2023-11-14 08:31:31 -05:00
io_uring.h for-6.7/io_uring-2023-10-30 2023-11-01 11:09:19 -10:00
io-wq.c io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls() 2023-10-05 14:11:18 -06:00
io-wq.h io_uring: break out of iowq iopoll on teardown 2023-09-07 09:02:27 -06:00
kbuf.c io_uring: indicate if io_kbuf_recycle did recycle anything 2023-11-06 13:41:58 -07:00
kbuf.h io_uring: indicate if io_kbuf_recycle did recycle anything 2023-11-06 13:41:58 -07:00
Makefile io_uring: add support for futex wake and wait 2023-09-29 02:36:57 -06:00
msg_ring.c io_uring: use io_file_from_index in io_msg_grab_file 2023-06-20 09:36:22 -06:00
msg_ring.h io_uring: get rid of double locking 2022-12-07 06:47:13 -07:00
net.c io_uring/net: ensure socket is marked connected on connect retry 2023-11-03 13:25:50 -06:00
net.h io_uring: Add KASAN support for alloc_caches 2023-04-03 07:16:14 -06:00
nop.c io_uring: kill extra io_uring_types.h includes 2022-07-24 18:39:14 -06:00
nop.h io_uring: move nop into its own file 2022-07-24 18:39:11 -06:00
notif.c io_uring/notif: add constant for ubuf_info flags 2023-04-15 14:21:04 -06:00
notif.h io_uring/notif: add constant for ubuf_info flags 2023-04-15 14:21:04 -06:00
opdef.c io_uring/rw: add separate prep handler for fixed read/write 2023-11-06 07:43:16 -07:00
opdef.h io_uring/rw: mark readv/writev as vectored in the opcode definition 2023-09-21 12:00:46 -06:00
openclose.c io_uring: use files_lookup_fd_locked() 2023-10-19 11:02:49 +02:00
openclose.h io_uring: split out fixed file installation and removal 2022-07-24 18:39:16 -06:00
poll.c io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups 2023-10-19 06:42:29 -06:00
poll.h io_uring: avoid indirect function calls for the hottest task_work 2023-06-02 08:55:37 -06:00
refs.h io_uring: make io_uring_types.h public 2022-07-24 18:39:14 -06:00
rsrc.c io_uring/rsrc: cleanup io_pin_pages() 2023-10-02 18:25:23 -06:00
rsrc.h io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by 2023-08-17 19:14:47 -06:00
rw.c io_uring: do not clamp read length for multishot read 2023-11-06 13:41:58 -07:00
rw.h io_uring/rw: add separate prep handler for fixed read/write 2023-11-06 07:43:16 -07:00
slist.h io_uring: silence variable ‘prev’ set but not used warning 2023-03-09 10:10:58 -07:00
splice.c io_uring/splice: use fput() directly 2023-08-10 10:24:25 -06:00
splice.h io_uring: split out splice related operations 2022-07-24 18:39:11 -06:00
sqpoll.c io_uring: Don't set affinity on a dying sqpoll thread 2023-08-30 09:53:44 -06:00
sqpoll.h io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used 2023-08-16 13:40:28 -06:00
statx.c io_uring: for requests that require async, force it 2023-01-29 15:18:26 -07:00
statx.h io_uring: move statx handling to its own file 2022-07-24 18:39:11 -06:00
sync.c io_uring: for requests that require async, force it 2023-01-29 15:18:26 -07:00
sync.h io_uring: split out fs related sync/fallocate functions 2022-07-24 18:39:11 -06:00
tctx.c io_uring: Add io_uring_setup flag to pre-register ring fd and never install it 2023-05-16 08:06:00 -06:00
tctx.h io_uring: simplify __io_uring_add_tctx_node 2022-10-07 12:25:30 -06:00
timeout.c io_uring: never overflow io_aux_cqe 2023-08-11 10:42:57 -06:00
timeout.h io_uring: remove unused return from io_disarm_next 2022-09-21 13:15:01 -06:00
uring_cmd.c io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT 2023-10-19 16:42:03 -06:00
uring_cmd.h io_uring: Remove unnecessary BUILD_BUG_ON 2023-05-04 08:19:05 -06:00
waitid.c io_uring: add IORING_OP_WAITID support 2023-09-21 12:04:45 -06:00
waitid.h io_uring: add IORING_OP_WAITID support 2023-09-21 12:04:45 -06:00
xattr.c io_uring: for requests that require async, force it 2023-01-29 15:18:26 -07:00
xattr.h io_uring: move xattr related opcodes to its own file 2022-07-24 18:39:11 -06:00