Commit Graph

517 Commits

Author SHA1 Message Date
Jens Axboe
49c5c63d48 io_uring: fix task_work cap overshooting
A previous commit fixed task_work overrunning by a lot more than what
the user asked for, by adding a retry list. However, it didn't cap the
overall count, hence for multiple task_work runs inside the same wait
loop, it'd still overshoot the target by potentially a large amount.

Cap it generally inside the wait path. Note that this will still
overshoot the default limit of 20, but should overshoot by no more than
limit-1 in addition to the limit. That still provides a ceiling over how
much task_work will be run, rather than still having gaps where it was
uncapped essentially.

Fixes: f46b9cdb22 ("io_uring: limit local tw done")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-26 13:42:27 -07:00
David Wei
f46b9cdb22 io_uring: limit local tw done
Instead of eagerly running all available local tw, limit the amount of
local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
value of 20 is chosen as a reasonable heuristic to allow enough work
batching but also keep latency down.

Add a retry_llist that maintains a list of local tw that couldn't be
done in time. No synchronisation is needed since it is only modified
within the task context.

Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/20241120221452.3762588-3-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-21 07:11:00 -07:00
David Wei
40cfe55324 io_uring: add io_local_work_pending()
In preparation for adding a new llist of tw to retry due to hitting the
tw limit, add a helper io_local_work_pending(). This function returns
true if there is any local tw pending. For now it only checks
ctx->work_llist.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20241120221452.3762588-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-21 07:11:00 -07:00
Pavel Begunkov
d617b3147d io_uring: restore back registered wait arguments
Now we've got a more generic region registration API, place
IORING_ENTER_EXT_ARG_REG and re-enable it.

First, the user has to register a region with the
IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a
ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races
with already running waiters. With that we should have stable constant
values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no
READ_ONCE required.

The other API difference is that we're now passing byte offsets instead
of indexes. The user _must_ align all offsets / pointers to the native
word size, failing to do so might but not necessarily has to lead to a
failure usually returned as -EFAULT. liburing will be hiding this
details from users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15 12:28:38 -07:00
Pavel Begunkov
93238e6618 io_uring: add memory region registration
Regions will serve multiple purposes. First, with it we can decouple
ring/etc. object creation from registration / mapping of the memory they
will be placed in. We already have hacks that allow to put both SQ and
CQ into the same huge page, in the future we should be able to:

region = create_region(io_ring);
create_pbuf_ring(io_uring, region, offset=0);
create_pbuf_ring(io_uring, region, offset=N);

The second use case is efficiently passing parameters. The following
patch enables back on top of regions IORING_ENTER_EXT_ARG_REG, which
optimises wait arguments. It'll also be useful for request arguments
replacing iovecs, msghdr, etc. pointers. Eventually it would also be
handy for BPF as well if it comes to fruition.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15 09:58:34 -07:00
Pavel Begunkov
83e041522e io_uring: temporarily disable registered waits
Disable wait argument registration as it'll be replaced with a more
generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing
in a few commits so leave it be.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15 09:58:34 -07:00
Pavel Begunkov
3730aebbda io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
IOPOLL doesn't use the extended arguments, no need for it to support
IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL, if anything it
leaves more space for future extensions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a35ecd919dbdc17bd5b7932273e317832c531b45.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15 09:58:34 -07:00
Pavel Begunkov
b9d69371e8 io_uring: fix invalid hybrid polling ctx leaks
It has already allocated the ctx by the point where it checks the hybrid
poll configuration, plain return leaks the memory.

Fixes: 01ee194d1a ("io_uring: add support for hybrid IOPOLL")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/b57f2608088020501d352fcdeebdb949e281d65b.1731468230.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 07:38:04 -07:00
Pavel Begunkov
af0a2ffef0 io_uring: avoid normal tw intermediate fallback
When a DEFER_TASKRUN io_uring is terminating it requeues deferred task
work items as normal tw, which can further fallback to kthread
execution. Avoid this extra step and always push them to the fallback
kthread.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Pavel Begunkov
483242714f io_uring: prevent speculating sq_array indexing
The SQ index array consists of user provided indexes, which io_uring
then uses to index the SQ, and so it's susceptible to speculation. For
all other queues io_uring tracks heads and tails in kernel, and they
shouldn't need any special care.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Jens Axboe
b6f58a3f4a io_uring: move struct io_kiocb from task_struct to io_uring_task
Rather than store the task_struct itself in struct io_kiocb, store
the io_uring specific task_struct. The life times are the same in terms
of io_uring, and this avoids doing some dereferences through the
task_struct. For the hot path of putting local task references, we can
deref req->tctx instead, which we'll need anyway in that function
regardless of whether it's local or remote references.

This is mostly straight forward, except the original task PF_EXITING
check needs a bit of tweaking. task_work is _always_ run from the
originating task, except in the fallback case, where it's run from a
kernel thread. Replace the potentially racy (in case of fallback work)
checks for req->task->flags with current->flags. It's either the still
the original task, in which case PF_EXITING will be sane, or it has
PF_KTHREAD set, in which case it's fallback work. Both cases should
prevent moving forward with the given request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Jens Axboe
6ed368cc5d io_uring: remove task ref helpers
They are only used right where they are defined, just open-code them
inside io_put_task().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Jens Axboe
f03baece08 io_uring: move cancelations to be io_uring_task based
Right now the task_struct pointer is used as the key to match a task,
but in preparation for some io_kiocb changes, move it to using struct
io_uring_task instead. No functional changes intended in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Jens Axboe
6f94cbc29a io_uring/rsrc: split io_kiocb node type assignments
Currently the io_rsrc_node assignment in io_kiocb is an array of two
pointers, as two nodes may be assigned to a request - one file node,
and one buffer node. However, the buffer node can co-exist with the
provided buffers, as currently it's not supported to use both provided
and registered buffers at the same time.

This crucially brings struct io_kiocb down to 4 cache lines again, as
before it spilled into the 5th cacheline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:36 -07:00
hexue
01ee194d1a io_uring: add support for hybrid IOPOLL
A new hybrid poll is implemented on the io_uring layer. Once an IO is
issued, it will not poll immediately, but rather block first and re-run
before IO complete, then poll to reap IO. While this poll method could
be a suboptimal solution when running on a single thread, it offers
performance lower than regular polling but higher than IRQ, and CPU
utilization is also lower than polling.

To use hybrid polling, the ring must be setup with both the
IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid
polling has the same restrictions as IOPOLL, in that commands must
explicitly support it.

Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:30 -06:00
Jens Axboe
d50f94d761 io_uring/rsrc: get rid of the empty node and dummy_ubuf
The empty node was used as a placeholder for a sparse entry, but it
didn't really solve any issues. The caller still has to check for
whether it's the empty node or not, it may as well just check for a NULL
return instead.

The dummy_ubuf was used for a sparse buffer entry, but NULL will serve
the same purpose there of ensuring an -EFAULT on attempted import.

Just use NULL for a sparse node, regardless of whether or not it's a
file or buffer resource.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:30 -06:00
Jens Axboe
b54a14041e io_uring/rsrc: add io_rsrc_node_lookup() helper
There are lots of spots open-coding this functionality, add a generic
helper that does the node lookup in a speculation safe way.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:30 -06:00
Jens Axboe
3597f2786b io_uring/rsrc: unify file and buffer resource tables
For files, there's nr_user_files/file_table/file_data, and buffers have
nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and
file_data can't be the same thing, and ditto for the buffer side. That
gets rid of more io_ring_ctx state that's in two spots rather than just
being in one spot, as it should be. Put all the registered file data in
one locations, and ditto on the buffer front.

This also avoids having both io_rsrc_data->nodes being an allocated
array, and ->user_bufs[] or ->file_table.nodes. There's no reason to
have this information duplicated. Keep it in one spot, io_rsrc_data,
along with how many resources are available.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:23 -06:00
Jens Axboe
f38f284764 io_uring: only initialize io_kiocb rsrc_nodes when needed
Add the empty node initializing to the preinit part of the io_kiocb
allocation, and reset them if they have been used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:44:30 -06:00
Jens Axboe
0701db7439 io_uring/rsrc: add an empty io_rsrc_node for sparse buffer entries
Rather than allocate an io_rsrc_node for an empty/sparse buffer entry,
add a const entry that can be used for that. This just needs checking
for writing the tag, and the put check needs to check for that sparse
node rather than NULL for validity.

This avoids allocating rsrc nodes for sparse buffer entries.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:44:30 -06:00
Jens Axboe
fbbb8e991d io_uring/rsrc: get rid of io_rsrc_node allocation cache
It's not going to be needed in the fast path going forward, so kill it
off.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:44:30 -06:00
Jens Axboe
7029acd8a9 io_uring/rsrc: get rid of per-ring io_rsrc_node list
Work in progress, but get rid of the per-ring serialization of resource
nodes, like registered buffers and files. Main issue here is that one
node can otherwise hold up a bunch of other nodes from getting freed,
which is especially a problem for file resource nodes and networked
workloads where some descriptors may not see activity in a long time.

As an example, instantiate an io_uring ring fd and create a sparse
registered file table. Even 2 will do. Then create a socket and register
it as fixed file 0, F0. The number of open files in the app is now 5,
with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4
being the socket. Register this socket (eg "the listener") in slot 0 of
the registered file table. Now add an operation on the socket that uses
slot 0. Finally, loop N times, where each loop creates a new socket,
registers said socket as a file, then unregisters the socket, and
finally closes the socket. This is roughly similar to what a basic
accept loop would look like.

At the end of this loop, it's not unreasonable to expect that there
would still be 5 open files. Each socket created and registered in the
loop is also unregistered and closed. But since the listener socket
registered first still has references to its resource node due to still
being active, each subsequent socket unregistration is stuck behind it
for reclaim. Hence 5 + N files are still open at that point, where N is
awaiting the final put held up by the listener socket.

Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct
io_kiocb now gets explicit resource nodes assigned, with each holding a
reference to the parent node. A parent node is either of type FILE or
BUFFER, which are the two types of nodes that exist. A request can have
two nodes assigned, if it's using both registered files and buffers.
Since request issue and task_work completion is both under the ring
private lock, no atomics are needed to handle these references. It's a
simple unlocked inc/dec. As before, the registered buffer or file table
each hold a reference as well to the registered nodes. Final put of the
node will remove the node and free the underlying resource, eg unmap the
buffer or put the file.

Outside of removing the stall in resource reclaim described above, it
has the following advantages:

1) It's a lot simpler than the previous scheme, and easier to follow.
   No need to specific quiesce handling anymore.

2) There are no resource node allocations in the fast path, all of that
   happens at resource registration time.

3) The structs related to resource handling can all get simplified
   quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can
   go away completely.

4) Handling of resource tags is much simpler, and doesn't require
   persistent storage as it can simply get assigned up front at
   registration time. Just copy them in one-by-one at registration time
   and assign to the resource node.

The only real downside is that a request is now explicitly limited to
pinning 2 resources, one file and one buffer, where before just
assigning a resource node to a request would pin all of them. The upside
is that it's easier to follow now, as an individual resource is
explicitly referenced and assigned to the request.

With this in place, the above mentioned example will be using exactly 5
files at the end of the loop, not N.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:44:18 -06:00
Jens Axboe
aaa736b186 io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache
Doesn't matter right now as there's still some bytes left for it, but
let's prepare for the io_kiocb potentially growing and add a specific
freeptr offset for it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:28 -06:00
Jens Axboe
aa00f67adc io_uring: add support for fixed wait regions
Generally applications have 1 or a few waits of waiting, yet they pass
in a struct io_uring_getevents_arg every time. This needs to get copied
and, in turn, the timeout value needs to get copied.

Rather than do this for every invocation, allow the application to
register a fixed set of wait regions that can simply be indexed when
asking the kernel to wait on events.

At ring setup time, the application can register a number of these wait
regions and initialize region/index 0 upfront:

	struct io_uring_reg_wait *reg;

	reg = io_uring_setup_reg_wait(ring, nr_regions, &ret);

	/* set timeout and mark as set, sigmask/sigmask_sz as needed */
	reg->ts.tv_sec = 0;
	reg->ts.tv_nsec = 100000;
	reg->flags = IORING_REG_WAIT_TS;

where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The
above initializes index 0, but 63 other regions can be initialized,
if needed. Now, instead of doing:

	struct __kernel_timespec timeout = { .tv_nsec = 100000, };

	io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL);

to wait for events for each submit_and_wait, or just wait, operation, it
can just reference the above region at offset 0 and do:

	io_uring_submit_and_wait_reg(ring, &cqe, nr, 0);

to achieve the same goal of waiting 100usec without needing to copy
both struct io_uring_getevents_arg (24b) and struct __kernel_timeout
(16b) for each invocation. Struct io_uring_reg_wait looks as follows:

struct io_uring_reg_wait {
	struct __kernel_timespec	ts;
	__u32				min_wait_usec;
	__u32				flags;
	__u64				sigmask;
	__u32				sigmask_sz;
	__u32				pad[3];
	__u64				pad2[2];
};

embedding the timeout itself in the region, rather than passing it as
a pointer as well. Note that the signal mask is still passed as a
pointer, both for compatability reasons, but also because there doesn't
seem to be a lot of high frequency waits scenarios that involve setting
and resetting the signal mask for each wait.

The application is free to modify any region before a wait call, or it
can use keep multiple regions with different settings to avoid needing to
modify the same one for wait calls. Up to a page size of regions is mapped
by default, allowing PAGE_SIZE / 64 available regions for use.

The registered region must fit within a page. On a 4kb page size system,
that allows for 64 wait regions if a full page is used, as the size of
struct io_uring_reg_wait is 64b. The region registered must be aligned
to io_uring_reg_wait in size. It's valid to register less than 64
entries.

In network performance testing with zero-copy, this reduced the time
spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4%
to 0.3%.

Wait regions are fixed for the lifetime of the ring - once registered,
they are persistent until the ring is torn down. The regions support
minimum wait timeout as well as the regular waits.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:28 -06:00
Jens Axboe
371b47da25 io_uring: change io_get_ext_arg() to use uaccess begin + end
In scenarios where a high frequency of wait events are seen, the copy
of the struct io_uring_getevents_arg is quite noticeable in the
profiles in terms of time spent. It can be seen as up to 3.5-4.5%.
Rewrite the copy-in logic, saving about 0.5% of the time.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
0a54a7dd0a io_uring: switch struct ext_arg from __kernel_timespec to timespec64
This avoids intermediate storage for turning a __kernel_timespec
user pointer into an on-stack struct timespec64, only then to turn it
into a ktime_t.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
79cfe9e59c io_uring/register: add IORING_REGISTER_RESIZE_RINGS
Once a ring has been created, the size of the CQ and SQ rings are fixed.
Usually this isn't a problem on the SQ ring side, as it merely controls
the available number of requests that can be submitted in a single
system call, and there's rarely a need to change that.

For the CQ ring, it's a different story. For most efficient use of
io_uring, it's important that the CQ ring never overflows. This means
that applications must size it for the worst case scenario, which can
be wasteful.

Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
the existing rings. It takes a struct io_uring_params argument, the same
one which is used to setup the ring initially, and resizes rings
according to the sizes given.

Certain properties are always inherited from the original ring setup,
like SQE128/CQE32 and other setup options. The implementation only
allows flag associated with how the CQ ring is sized and clamped.

Existing unconsumed SQE and CQE entries are copied as part of the
process. If either the SQ or CQ resized destination ring cannot hold the
entries already present in the source rings, then the operation is failed
with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new
submissions, and the internal mapping holds the completion lock as well
across moving CQ ring state.

To prevent races between mmap and ring resizing, add a mutex that's
solely used to serialize ring resize and mmap. mmap_sem can't be used
here, as as fork'ed process may be doing mmaps on the ring as well.
The ctx->resize_lock is held across mmap operations, and the resize
will grab it before swapping out the already mapped new data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
81d8191eb9 io_uring: abstract out a bit of the ring filling logic
Abstract out a io_uring_fill_params() helper, which fills out the
necessary bits of struct io_uring_params. Add it to io_uring.h as well,
in preparation for having another internal user of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
09d0a8ea7f io_uring: move max entry definition and ring sizing into header
In preparation for needing this somewhere else, move the definitions
for the maximum CQ and SQ ring size into io_uring.h. Make the
rings_size() helper available as well, and have it take just the setup
flags argument rather than the fill ring pointer. That's all that is
needed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Pavel Begunkov
2946f08ae9 io_uring: clean up cqe trace points
We have too many helpers posting CQEs, instead of tracing completion
events before filling in a CQE and thus having to pass all the data,
set the CQE first, pass it to the tracing helper and let it extract
everything it needs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Pavel Begunkov
9b296c625a io_uring: static_key for !IORING_SETUP_NO_SQARRAY
IORING_SETUP_NO_SQARRAY should be preferred and used by default by
liburing, optimise flag checking in io_get_sqe() with a static key.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c164a48542fbb080115e2377ecf160c758562742.1729264988.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Pavel Begunkov
1e6e7602cc io_uring: kill io_llist_xchg
io_llist_xchg is only used to set the list to NULL, which can also be
done with llist_del_all(). Use the latter and kill io_llist_xchg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6765112680d2e86a58b76166b7513391ff4e5d7.1729264960.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
b6b3eb19dd io_uring: move cancel hash tables to kvmalloc/kvfree
Convert to using kvmalloc/kfree() for the hash tables, and while at it,
make it handle low memory situations better.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
8abf47a8d6 io_uring/cancel: get rid of init_hash_table() helper
All it does is initialize the lists, just move the INIT_HLIST_HEAD()
into the one caller.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
085268829b io_uring/poll: get rid of unlocked cancel hash
io_uring maintains two hash lists of inflight requests:

1) ctx->cancel_table_locked. This is used when the caller has the
   ctx->uring_lock held already. This is only an issue side parameter,
   as removal or task_work will always have it held.

2) ctx->cancel_table. This is used when the issuer does NOT have the
   ctx->uring_lock held, and relies on the table spinlocks for access.

However, it's pretty trivial to simply grab the lock in the one spot
where we care about it, for insertion. With that, we can kill the
unlocked table (and get rid of the _locked postfix for the other one).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Min-Hua Chen
17ea56b752 io_uring: fix casts to io_req_flags_t
Apply __force cast to restricted io_req_flags_t type to fix
the following sparse warning:

io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t

No functional changes intended.

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24 13:31:04 -06:00
Guixin Liu
3a87e26429 io_uring: fix memory leak when cache init fail
Exit the percpu ref when cache init fails to free the data memory with
in struct percpu_ref.

Fixes: 206aefde4f ("io_uring: reduce/pack size of io_ring_ctx")
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24 13:31:00 -06:00
Linus Torvalds
3147a0689d for-6.12/io_uring-20240922
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbvv30QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj3+EACs346FzM8PlZe1GxBZ6OnQX80blwoldAxC
 +Abl5xjoJKUgA7rY3lJVBRNR6olA/4I2VD3g8b3RT6lpd/oKzPFg7FOj5Dc/oN+c
 Fo6C7zZdr8caokpL4pfwgyG8ZNssQgRg8e0kRSw8A7AMo1zUazqAXtxjRzeMEOLC
 1kWRYGdHCbVjx+hRIyX6KKP427Z5nXvcqFOC0BOpd5jDNYVh9WjNNyUE7trkGJ7o
 1cjlpaaOURS0yU/4hue6tRnM8LDjaImyTyISvBWzKfKvpc19K1alOQNvHIoIeiBQ
 5MgCNkSpbRmUTrydYVEQXl0Cia2d5+0KQsavUB9nZ8M++NftbRr/i26xT8ReZzXI
 NjaedDF+MyOKeJaft2ZeKH8GgWolysMBa4e89CveRxosa/6gwHCkkB4UK9b3gaBB
 Fij1zh/7fIVG7Tz8yNUDyGe6DzOEol1bn1KnL35/9nuCCRnSAM0vRPwJSkurlQ8B
 PqVUS3BArn+LQZmSZ3HJVKOHv2QAY8etqWizvVmu4DB9Ar+uZ6Ur2uwfMN9JAODP
 Fm2qVvxS73QlrvisdbnVbTzqBnqh3Rs4mb5my/gCWO1s67qtu3abSJCSzcnyxQdd
 yBMDegJxTNv6DErNjPEF4qDODwSTIzswr//kOeLns1EtDGfrK8nxUfIKPQUwLSTO
 Y7h2ru83uA==
 =goTY
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Mostly just a set of fixes in here, or little changes that didn't get
  included in the initial pull request. This contains:

   - Move the SQPOLL napi polling outside the submission lock (Olivier)

   - Rename of the "copy buffers" API that got added in the 6.12 merge
     window. There's really no copying going on, it's just referencing
     the buffers. After a bit of consideration, decided that it was
     better to simply rename this to avoid potential confusion (me)

   - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to
     start + len tracking rather than having start / end in there, and
     by removing the caching of folio_mask when we can just calculate it
     from folio_shift when we need it (me)

   - Fixes for the SQPOLL affinity checking (me, Felix)

   - Fix for how cqring waiting checks for the presence of task_work.
     Just check it directly rather than check for a specific
     notification mechanism (me)

   - Tweak to how request linking is represented in tracing (me)

   - Fix a syzbot report that deliberately sets up a huge list of
     overflow entries, and then hits rcu stalls when flushing this list.
     Just check for the need to preempt, and drop/reacquire locks in the
     loop. There's no state maintained over the loop itself, and each
     entry is yanked from head-of-list (me)"

* tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux:
  io_uring: check if we need to reschedule during overflow flush
  io_uring: improve request linking trace
  io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL
  io_uring/sqpoll: do the napi busy poll outside the submission block
  io_uring: clean up a type in io_uring_register_get_file()
  io_uring/sqpoll: do not put cpumask on stack
  io_uring/sqpoll: retain test for whether the CPU is valid
  io_uring/rsrc: change ubuf->ubuf_end to length tracking
  io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask
  io_uring: rename "copy buffers" to "clone buffers"
2024-09-24 11:11:38 -07:00
Jens Axboe
eac2ca2d68 io_uring: check if we need to reschedule during overflow flush
In terms of normal application usage, this list will always be empty.
And if an application does overflow a bit, it'll have a few entries.
However, nothing obviously prevents syzbot from running a test case
that generates a ton of overflow entries, and then flushing them can
take quite a while.

Check for needing to reschedule while flushing, and drop our locks and
do so if necessary. There's no state to maintain here as overflows
always prune from head-of-list, hence it's fine to drop and reacquire
the locks at the end of the loop.

Link: https://lore.kernel.org/io-uring/66ed061d.050a0220.29194.0053.GAE@google.com/
Reported-by: syzbot+5fca234bd7eb378ff78e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-20 02:51:20 -06:00
Jens Axboe
eed138d67d io_uring: improve request linking trace
Right now any link trace is listed as being linked after the head
request in the chain, but it's more useful to note explicitly which
request a given new request is chained to. Change the link trace to dump
the tail request so that chains are immediately apparent when looking at
traces.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-20 00:17:46 -06:00
Jens Axboe
04beb6e0e0 io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL
If some part of the kernel adds task_work that needs executing, in terms
of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two
directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can
be used for a variety of use case outside of task_work.

However, io_cqring_wait_schedule() only tests explicitly for
TIF_NOTIFY_SIGNAL. This means it can miss if task_work got added for
the task, but used a different kind of signaling mechanism (or none at
all). Normally this doesn't matter as any task_work will be run once
the task exits to userspace, except if:

1) The ring is setup with DEFER_TASKRUN
2) The local work item may generate normal task_work

For condition 2, this can happen when closing a file and it's the final
put of that file, for example. This can cause stalls where a task is
waiting to make progress inside io_cqring_wait(), but there's nothing else
that will wake it up. Hence change the "should we schedule or loop around"
check to check for the presence of task_work explicitly, rather than just
TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change the
ordering of what type of task_work first in terms of ordering, to both
make it consistent with other task_work runs in io_uring, but also to
better handle the case of defer task_work generating normal task_work,
like in the above example.

Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
Link: https://github.com/axboe/liburing/issues/1235
Cc: stable@vger.kernel.org
Fixes: 846072f16e ("io_uring: mimimise io_cqring_wait_schedule")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-19 11:56:55 -06:00
Linus Torvalds
bdf56c7580 slab updates for 6.12
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF
 iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk
 O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A
 FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt
 uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb
 8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz
 is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ==
 =+WAm
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:
 "This time it's mostly refactoring and improving APIs for slab users in
  the kernel, along with some debugging improvements.

   - kmem_cache_create() refactoring (Christian Brauner)

     Over the years have been growing new parameters to
     kmem_cache_create() where most of them are needed only for a small
     number of caches - most recently the rcu_freeptr_offset parameter.

     To avoid adding new parameters to kmem_cache_create() and adjusting
     all its callers, or creating new wrappers such as
     kmem_cache_create_rcu(), we can now pass extra parameters using the
     new struct kmem_cache_args. Not explicitly initialized fields
     default to values interpreted as unused.

     kmem_cache_create() is for now a wrapper that works both with the
     new form: kmem_cache_create(name, object_size, args, flags) and the
     legacy form: kmem_cache_create(name, object_size, align, flags,
     ctor)

   - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
     Babka, Uladislau Rezki)

     Since SLOB removal, kfree() is allowed for freeing objects
     allocated by kmem_cache_create(). By extension kfree_rcu() as
     allowed as well, which can allow converting simple call_rcu()
     callbacks that only do kmem_cache_free(), as there was never a
     kmem_cache_free_rcu() variant. However, for caches that can be
     destroyed e.g. on module removal, the cache owners knew to issue
     rcu_barrier() first to wait for the pending call_rcu()'s, and this
     is not sufficient for pending kfree_rcu()'s due to its internal
     batching optimizations. Ulad has provided a new
     kvfree_rcu_barrier() and to make the usage less error-prone,
     kmem_cache_destroy() calls it. Additionally, destroying
     SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
     synchronously instead of using an async work, because the past
     motivation for async work no longer applies. Users of custom
     call_rcu() callbacks should however keep calling rcu_barrier()
     before cache destruction.

   - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)

     Currently, KASAN cannot catch UAFs in such caches as it is legal to
     access them within a grace period, and we only track the grace
     period when trying to free the underlying slab page. The new
     CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
     object to be RCU-delayed, after which KASAN can poison them.

   - Delayed memcg charging (Shakeel Butt)

     In some cases, the memcg is uknown at allocation time, such as
     receiving network packets in softirq context. With
     kmem_cache_charge() these may be now charged later when the user
     and its memcg is known.

   - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
     Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"

* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
  mm, slab: restore kerneldoc for kmem_cache_create()
  io_uring: port to struct kmem_cache_args
  slab: make __kmem_cache_create() static inline
  slab: make kmem_cache_create_usercopy() static inline
  slab: remove kmem_cache_create_rcu()
  file: port to struct kmem_cache_args
  slab: create kmem_cache_create() compatibility layer
  slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
  slab: port KMEM_CACHE() to struct kmem_cache_args
  slab: remove rcu_freeptr_offset from struct kmem_cache
  slab: pass struct kmem_cache_args to do_kmem_cache_create()
  slab: pull kmem_cache_open() into do_kmem_cache_create()
  slab: pass struct kmem_cache_args to create_cache()
  slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
  slab: port kmem_cache_create_rcu() to struct kmem_cache_args
  slab: port kmem_cache_create() to struct kmem_cache_args
  slab: add struct kmem_cache_args
  slab: s/__kmem_cache_create/do_kmem_cache_create/g
  memcg: add charging of already allocated slab objects
  mm/slab: Optimize the code logic in find_mergeable()
  ...
2024-09-18 08:53:53 +02:00
Pavel Begunkov
6746ee4c3a io_uring/cmd: expose iowq to cmds
When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.

Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 10:44:10 -06:00
Christian Brauner
a6711d1cd4 io_uring: port to struct kmem_cache_args
Port req_cachep to struct kmem_cache_args.

Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-09-10 11:42:59 +02:00
Jens Axboe
6733e678ba io_uring/kbuf: pass in 'len' argument for buffer commit
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 08:44:51 -06:00
Jens Axboe
7ed9e09e2d io_uring: wire up min batch wake timeout
Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member
that is currently in there. The value is in usecs, which is explained in
the name as well.

Note that if min_wait_usec and a normal timeout is used in conjunction,
the normal timeout is still relative to the base time. For example, if
min_wait_usec is set to 100 and the normal timeout is 1000, the max
total time waited is still 1000. This also means that if the normal
timeout is shorter than min_wait_usec, then only the min_wait_usec will
take effect.

See previous commit for an explanation of how this works.

IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as
applications doing submit_and_wait_timeout() style operations will
generally not see the -EINVAL from the wait side as they return the
number of IOs submitted. Only if no IOs are submitted will the -EINVAL
bubble back up to the application.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25 08:27:01 -06:00
Jens Axboe
1100c4a265 io_uring: add support for batch wait timeout
Waiting for events with io_uring has two knobs that can be set:

1) The number of events to wake for
2) The timeout associated with the event

Waiting will abort when either of those conditions are met, as expected.

This adds support for a third event, which is associated with the number
of events to wait for. Applications generally like to handle batches of
completions, and right now they'd set a number of events to wait for and
the timeout for that. If no events have been received but the timeout
triggers, control is returned to the application and it can wait again.
However, if the application doesn't have anything to do until events are
reaped, then it's possible to make this waiting more efficient.

For example, the application may have a latency time of 50 usecs and
wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
as the timeout, then it'll be doing 20K context switches per second even
if nothing is happening.

This introduces the notion of min batch wait time. If the min batch wait
time expires, then we'll return to userspace if we have any events at all.
If none are available, the general wait time is applied. Any request
arriving after the min batch wait time will cause waiting to stop and
return control to the application.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25 08:27:01 -06:00
Jens Axboe
cebf123c63 io_uring: implement our own schedule timeout handling
In preparation for having two distinct timeouts and avoid waking the
task if we don't need to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25 08:27:01 -06:00
Jens Axboe
45a41e74b8 io_uring: move schedule wait logic into helper
In preparation for expanding how we handle waits, move the actual
schedule and schedule_timeout() handling into a helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25 08:27:01 -06:00
Jens Axboe
f42b58e448 io_uring: encapsulate extraneous wait flags into a separate struct
Rather than need to pass in 2 or 3 separate arguments, add a struct
to encapsulate the timeout and sigset_t parts of waiting. In preparation
for adding another argument for waiting.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25 08:27:01 -06:00