linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2025-01-16 02:14:58 +00:00

Author	SHA1	Message	Date
Al Viro	8e15e12d37	io_statx_prep(): use getname_uflags() the only thing in flags getname_flags() ever cares about is LOOKUP_EMPTY; anything else is none of its damn business. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-13 11:44:30 -05:00
Pavel Begunkov	b9d69371e8	io_uring: fix invalid hybrid polling ctx leaks It has already allocated the ctx by the point where it checks the hybrid poll configuration, plain return leaks the memory. Fixes: 01ee194d1aba1 ("io_uring: add support for hybrid IOPOLL") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/b57f2608088020501d352fcdeebdb949e281d65b.1731468230.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 07:38:04 -07:00
Ming Lei	a43e236fb9	io_uring/uring_cmd: fix buffer index retrieval Add back buffer index retrieval for IORING_URING_CMD_FIXED. Reported-by: Guangwu Zhang <guazhang@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: b54a14041ee6 ("io_uring/rsrc: add io_rsrc_node_lookup() helper") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Tested-by: Guangwu Zhang <guazhang@redhat.com> Link: https://lore.kernel.org/r/20241111101318.1387557-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:11:37 -07:00
Pavel Begunkov	df3b8ca604	io_uring/cmd: let cmds to know about dying task When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:21 +01:00
Ming Lei	039c878db7	io_uring/rsrc: add & apply io_req_assign_buf_node() The following pattern becomes more and more: + io_req_assign_rsrc_node(&req->buf_node, node); + req->flags \|= REQ_F_BUF_NODE; so make it a helper, which is less fragile to use than above code, for example, the BUF_NODE flag is even missed in current io_uring_cmd_prep(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-07 15:24:33 -07:00
Ming Lei	4f219fcce5	io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node' Remove '->ctx_ptr' of 'struct io_rsrc_node', and add 'type' field, meantime remove io_rsrc_node_type(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-07 15:24:33 -07:00
Ming Lei	0d98c50908	io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers `io_rsrc_node` instance won't be shared among different io_uring ctxs, and its allocation 'ctx' is always same with the user's 'ctx', so it is safe to pass user 'ctx' reference to rsrc helpers. Even in io_clone_buffers(), `io_rsrc_node` instance is allocated actually for destination io_uring_ctx. Then io_rsrc_node_ctx() can be removed, and the 8 bytes `ctx` pointer will be removed from `io_rsrc_node` in the following patch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-07 15:24:33 -07:00
Nam Cao	fc9f59de26	io_uring: Switch to use hrtimer_setup_on_stack() hrtimer_setup_on_stack() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init_on_stack() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/f0d4ac32ec4050710a656cee8385fa4427be33aa.1730386209.git.namcao@linutronix.de	2024-11-07 02:47:06 +01:00
Nam Cao	c95d36585b	io_uring: Remove redundant hrtimer's callback function setup The IORING_OP_TIMEOUT command uses hrtimer underneath. The timer's callback function is setup in io_timeout(), and then the callback function is setup again when the timer is rearmed. Since the callback function is the same for both cases, the latter setup is redundant, therefore remove it. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk: Link: https://lore.kernel.org/all/07b28dfd5691478a2d250f379c8b90dd37f9bb9a.1730386209.git.namcao@linutronix.de	2024-11-07 02:47:05 +01:00
Pavel Begunkov	af0a2ffef0	io_uring: avoid normal tw intermediate fallback When a DEFER_TASKRUN io_uring is terminating it requeues deferred task work items as normal tw, which can further fallback to kthread execution. Avoid this extra step and always push them to the fallback kthread. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	6bf90bd8c5	io_uring/napi: add static napi tracking strategy Add the static napi tracking strategy. That allows the user to manually manage the napi ids list for busy polling, and eliminate the overhead of dynamically updating the list from the fast path. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/96943de14968c35a5c599352259ad98f3c0770ba.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	71afd926f2	io_uring/napi: clean up __io_napi_do_busy_loop __io_napi_do_busy_loop now requires to have loop_end in its parameters. This makes the code cleaner and also has the benefit of removing a branch since the only caller not passing NULL for loop_end_arg is also setting the value conditionally. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d5b9bb91b1a08fff50525e1c18d7b4709b9ca100.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	db1e1adf6f	io_uring/napi: Use lock guards Convert napi locks to use the shiny new Scope-Based Resource Management machinery. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/2680ca47ee183cfdb89d1a40c84d349edeb620ab.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	a5e26f49fe	io_uring/napi: improve __io_napi_add 1. move the sock->sk pointer validity test outside the function to avoid the function call overhead and to make the function more more reusable 2. change its name to __io_napi_add_id to be more precise about it is doing 3. return an error code to report errors Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d637fa3b437d753c0f4e44ff6a7b5bf2c2611270.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	45b3941d09	io_uring/napi: fix io_napi_entry RCU accesses correct 3 RCU structures modifications that were not using the RCU functions to make their update. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/9f53b5169afa8c7bf3665a0b19dc2f7061173530.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Olivier Langlois	2f3cc8e441	io_uring/napi: protect concurrent io_napi_entry timeout accesses io_napi_entry timeout value can be updated while accessed from the poll functions. Its concurrent accesses are wrapped with READ_ONCE()/WRITE_ONCE() macros to avoid incorrect compiler optimizations. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/3de3087563cf98f75266fd9f85fdba063a8720db.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Pavel Begunkov	483242714f	io_uring: prevent speculating sq_array indexing The SQ index array consists of user provided indexes, which io_uring then uses to index the SQ, and so it's susceptible to speculation. For all other queues io_uring tracks heads and tails in kernel, and they shouldn't need any special care. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	b6f58a3f4a	io_uring: move struct io_kiocb from task_struct to io_uring_task Rather than store the task_struct itself in struct io_kiocb, store the io_uring specific task_struct. The life times are the same in terms of io_uring, and this avoids doing some dereferences through the task_struct. For the hot path of putting local task references, we can deref req->tctx instead, which we'll need anyway in that function regardless of whether it's local or remote references. This is mostly straight forward, except the original task PF_EXITING check needs a bit of tweaking. task_work is _always_ run from the originating task, except in the fallback case, where it's run from a kernel thread. Replace the potentially racy (in case of fallback work) checks for req->task->flags with current->flags. It's either the still the original task, in which case PF_EXITING will be sane, or it has PF_KTHREAD set, in which case it's fallback work. Both cases should prevent moving forward with the given request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	6ed368cc5d	io_uring: remove task ref helpers They are only used right where they are defined, just open-code them inside io_put_task(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	f03baece08	io_uring: move cancelations to be io_uring_task based Right now the task_struct pointer is used as the key to match a task, but in preparation for some io_kiocb changes, move it to using struct io_uring_task instead. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	6f94cbc29a	io_uring/rsrc: split io_kiocb node type assignments Currently the io_rsrc_node assignment in io_kiocb is an array of two pointers, as two nodes may be assigned to a request - one file node, and one buffer node. However, the buffer node can co-exist with the provided buffers, as currently it's not supported to use both provided and registered buffers at the same time. This crucially brings struct io_kiocb down to 4 cache lines again, as before it spilled into the 5th cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:36 -07:00
Jens Axboe	6af82f7614	io_uring/rsrc: encode node type and ctx together Rather than keep the type field separate rom ctx, use the fact that we can encode up to 4 types of nodes in the LSB of the ctx pointer. Doesn't reclaim any space right now on 64-bit archs, but it leaves a full int for future use. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:54:15 -07:00
Al Viro	0158005aaa	replace do_getxattr() with saner helpers. similar to do_setxattr() in the previous commit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-06 12:59:39 -05:00
Al Viro	66d7ac6bdb	replace do_setxattr() with saner helpers. io_uring setxattr logics duplicates stuff from fs/xattr.c; provide saner helpers (filename_setxattr() and file_setxattr() resp.) and use them. NB: putname(ERR_PTR()) is a no-op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-06 12:59:39 -05:00
Al Viro	a10c4c5e01	new helper: import_xattr_name() common logics for marshalling xattr names. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-06 12:59:39 -05:00
Christian Göttsche	537c76629d	fs: rename struct xattr_ctx to kernel_xattr_ctx Rename the struct xattr_ctx to increase distinction with the about to be added user API struct xattr_args. No functional change. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://lore.kernel.org/r/20240426162042.191916-2-cgoettsche@seltendoof.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-06 12:59:21 -05:00
Al Viro	b8cdd2530c	io_[gs]etxattr_prep(): just use getname() getname_flags(pathname, LOOKUP_FOLLOW) is obviously bogus - following trailing symlinks has no impact on how to copy the pathname from userland... Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 13:28:56 -05:00
Al Viro	6348be02ee	fdget(), trivial conversions fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
hexue	01ee194d1a	io_uring: add support for hybrid IOPOLL A new hybrid poll is implemented on the io_uring layer. Once an IO is issued, it will not poll immediately, but rather block first and re-run before IO complete, then poll to reap IO. While this poll method could be a suboptimal solution when running on a single thread, it offers performance lower than regular polling but higher than IRQ, and CPU utilization is also lower than polling. To use hybrid polling, the ring must be setup with both the IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid polling has the same restrictions as IOPOLL, in that commands must explicitly support it. Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	c1329532d5	io_uring/rsrc: allow cloning with node replacements Currently cloning a buffer table will fail if the destination already has a table. But it should be possible to use it to replace existing elements. Add a IORING_REGISTER_DST_REPLACE cloning flag, which if set, will allow the destination to already having a buffer table. If that is the case, then entries designated by offset + nr buffers will be replaced if they already exist. Note that it's allowed to use IORING_REGISTER_DST_REPLACE and not have an existing table, in which case it'll work just like not having the flag set and an empty table - it'll just assign the newly created table for that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	b16e920a19	io_uring/rsrc: allow cloning at an offset Right now buffer cloning is an all-or-nothing kind of thing - either the whole table is cloned from a source to a destination ring, or nothing at all. However, it's not always desired to clone the whole thing. Allow for the application to specify a source and destination offset, and a number of buffers to clone. If the destination offset is non-zero, then allocate sparse nodes upfront. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	d50f94d761	io_uring/rsrc: get rid of the empty node and dummy_ubuf The empty node was used as a placeholder for a sparse entry, but it didn't really solve any issues. The caller still has to check for whether it's the empty node or not, it may as well just check for a NULL return instead. The dummy_ubuf was used for a sparse buffer entry, but NULL will serve the same purpose there of ensuring an -EFAULT on attempted import. Just use NULL for a sparse node, regardless of whether or not it's a file or buffer resource. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	4007c3d8c2	io_uring/rsrc: add io_reset_rsrc_node() helper Puts and reset an existing node in a slot, if one exists. Returns true if a node was there, false if not. This helps cleanup some of the code that does a lookup just to clear an existing node. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	5f3829fdd6	io_uring/filetable: kill io_reset_alloc_hint() helper It's only used internally, and in one spot, just open-code ti. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	cb1717a7cd	io_uring/filetable: remove io_file_from_index() helper It's only used in fdinfo, nothing really gained from having this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	b54a14041e	io_uring/rsrc: add io_rsrc_node_lookup() helper There are lots of spots open-coding this functionality, add a generic helper that does the node lookup in a speculation safe way. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	3597f2786b	io_uring/rsrc: unify file and buffer resource tables For files, there's nr_user_files/file_table/file_data, and buffers have nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and file_data can't be the same thing, and ditto for the buffer side. That gets rid of more io_ring_ctx state that's in two spots rather than just being in one spot, as it should be. Put all the registered file data in one locations, and ditto on the buffer front. This also avoids having both io_rsrc_data->nodes being an allocated array, and ->user_bufs[] or ->file_table.nodes. There's no reason to have this information duplicated. Keep it in one spot, io_rsrc_data, along with how many resources are available. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:23 -06:00
Jens Axboe	f38f284764	io_uring: only initialize io_kiocb rsrc_nodes when needed Add the empty node initializing to the preinit part of the io_kiocb allocation, and reset them if they have been used. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	0701db7439	io_uring/rsrc: add an empty io_rsrc_node for sparse buffer entries Rather than allocate an io_rsrc_node for an empty/sparse buffer entry, add a const entry that can be used for that. This just needs checking for writing the tag, and the put check needs to check for that sparse node rather than NULL for validity. This avoids allocating rsrc nodes for sparse buffer entries. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	fbbb8e991d	io_uring/rsrc: get rid of io_rsrc_node allocation cache It's not going to be needed in the fast path going forward, so kill it off. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	7029acd8a9	io_uring/rsrc: get rid of per-ring io_rsrc_node list Work in progress, but get rid of the per-ring serialization of resource nodes, like registered buffers and files. Main issue here is that one node can otherwise hold up a bunch of other nodes from getting freed, which is especially a problem for file resource nodes and networked workloads where some descriptors may not see activity in a long time. As an example, instantiate an io_uring ring fd and create a sparse registered file table. Even 2 will do. Then create a socket and register it as fixed file 0, F0. The number of open files in the app is now 5, with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4 being the socket. Register this socket (eg "the listener") in slot 0 of the registered file table. Now add an operation on the socket that uses slot 0. Finally, loop N times, where each loop creates a new socket, registers said socket as a file, then unregisters the socket, and finally closes the socket. This is roughly similar to what a basic accept loop would look like. At the end of this loop, it's not unreasonable to expect that there would still be 5 open files. Each socket created and registered in the loop is also unregistered and closed. But since the listener socket registered first still has references to its resource node due to still being active, each subsequent socket unregistration is stuck behind it for reclaim. Hence 5 + N files are still open at that point, where N is awaiting the final put held up by the listener socket. Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct io_kiocb now gets explicit resource nodes assigned, with each holding a reference to the parent node. A parent node is either of type FILE or BUFFER, which are the two types of nodes that exist. A request can have two nodes assigned, if it's using both registered files and buffers. Since request issue and task_work completion is both under the ring private lock, no atomics are needed to handle these references. It's a simple unlocked inc/dec. As before, the registered buffer or file table each hold a reference as well to the registered nodes. Final put of the node will remove the node and free the underlying resource, eg unmap the buffer or put the file. Outside of removing the stall in resource reclaim described above, it has the following advantages: 1) It's a lot simpler than the previous scheme, and easier to follow. No need to specific quiesce handling anymore. 2) There are no resource node allocations in the fast path, all of that happens at resource registration time. 3) The structs related to resource handling can all get simplified quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can go away completely. 4) Handling of resource tags is much simpler, and doesn't require persistent storage as it can simply get assigned up front at registration time. Just copy them in one-by-one at registration time and assign to the resource node. The only real downside is that a request is now explicitly limited to pinning 2 resources, one file and one buffer, where before just assigning a resource node to a request would pin all of them. The upside is that it's easier to follow now, as an individual resource is explicitly referenced and assigned to the request. With this in place, the above mentioned example will be using exactly 5 files at the end of the loop, not N. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:18 -06:00
Jens Axboe	1d60d74e85	io_uring/rw: fix missing NOWAIT check for O_DIRECT start write When io_uring starts a write, it'll call kiocb_start_write() to bump the super block rwsem, preventing any freezes from happening while that write is in-flight. The freeze side will grab that rwsem for writing, excluding any new writers from happening and waiting for existing writes to finish. But io_uring unconditionally uses kiocb_start_write(), which will block if someone is currently attempting to freeze the mount point. This causes a deadlock where freeze is waiting for previous writes to complete, but the previous writes cannot complete, as the task that is supposed to complete them is blocked waiting on starting a new write. This results in the following stuck trace showing that dependency with the write blocked starting a new write: task:fio state:D stack:0 pid:886 tgid:886 ppid:876 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_rwsem_wait+0x1e8/0x3f8 __percpu_down_read+0xe8/0x500 io_write+0xbb8/0xff8 io_issue_sqe+0x10c/0x1020 io_submit_sqes+0x614/0x2110 __arm64_sys_io_uring_enter+0x524/0x1038 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 INFO: task fsfreeze:7364 blocked for more than 15 seconds. Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963 with the attempting freezer stuck trying to grab the rwsem: task:fsfreeze state:D stack:0 pid:7364 tgid:7364 ppid:995 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_down_write+0x2b0/0x680 freeze_super+0x248/0x8a8 do_vfs_ioctl+0x149c/0x1b18 __arm64_sys_ioctl+0xd0/0x1a0 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 Fix this by having the io_uring side honor IOCB_NOWAIT, and only attempt a blocking grab of the super block rwsem if it isn't set. For normal issue where IOCB_NOWAIT would always be set, this returns -EAGAIN which will have io_uring core issue a blocking attempt of the write. That will in turn also get completions run, ensuring forward progress. Since freezing requires CAP_SYS_ADMIN in the first place, this isn't something that can be triggered by a regular user. Cc: stable@vger.kernel.org # 5.10+ Reported-by: Peter Mann <peter.mann@sh.cz> Link: https://lore.kernel.org/io-uring/38c94aec-81c9-4f62-b44e-1d87f5597644@sh.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-31 08:21:02 -06:00
Jens Axboe	e410ffca58	io_uring/rsrc: kill io_charge_rsrc_node() It's only used from __io_req_set_rsrc_node(), and it takes both the ctx and node itself, while never using the ctx. Just open-code the basic refs++ in __io_req_set_rsrc_node() instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	743fb58a35	io_uring/splice: open code 2nd direct file assignment In preparation for not pinning the whole registered file table, open code the second potential direct file assignment. This will be handled by appropriate helpers in the future, for now just do it manually. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	aaa736b186	io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache Doesn't matter right now as there's still some bytes left for it, but let's prepare for the io_kiocb potentially growing and add a specific freeptr offset for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	ff1256b8f3	io_uring/rsrc: move struct io_fixed_file to rsrc.h header There's no need for this internal structure to be visible, move it to the private rsrc.h header instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	a85f31052b	io_uring/nop: add support for testing registered files and buffers Useful for testing performance/efficiency impact of registered files and buffers, vs (particularly) non-registered files. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	aa00f67adc	io_uring: add support for fixed wait regions Generally applications have 1 or a few waits of waiting, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events. At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront: struct io_uring_reg_wait reg; reg = io_uring_setup_reg_wait(ring, nr_regions, &ret); / set timeout and mark as set, sigmask/sigmask_sz as needed / reg->ts.tv_sec = 0; reg->ts.tv_nsec = 100000; reg->flags = IORING_REG_WAIT_TS; where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing: struct __kernel_timespec timeout = { .tv_nsec = 100000, }; io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL); to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do: io_uring_submit_and_wait_reg(ring, &cqe, nr, 0); to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timeout (16b) for each invocation. Struct io_uring_reg_wait looks as follows: struct io_uring_reg_wait { struct __kernel_timespec ts; __u32 min_wait_usec; __u32 flags; __u64 sigmask; __u32 sigmask_sz; __u32 pad[3]; __u64 pad2[2]; }; embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatability reasons, but also because there doesn't seem to be a lot of high frequency waits scenarios that involve setting and resetting the signal mask for each wait. The application is free to modify any region before a wait call, or it can use keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be aligned to io_uring_reg_wait in size. It's valid to register less than 64 entries. In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%. Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	371b47da25	io_uring: change io_get_ext_arg() to use uaccess begin + end In scenarios where a high frequency of wait events are seen, the copy of the struct io_uring_getevents_arg is quite noticeable in the profiles in terms of time spent. It can be seen as up to 3.5-4.5%. Rewrite the copy-in logic, saving about 0.5% of the time. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	0a54a7dd0a	io_uring: switch struct ext_arg from __kernel_timespec to timespec64 This avoids intermediate storage for turning a __kernel_timespec user pointer into an on-stack struct timespec64, only then to turn it into a ktime_t. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00

1 2 3 4 5 ...

1284 Commits