linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2024-12-28 16:53:49 +00:00

Author	SHA1	Message	Date
Christian Brauner	a6711d1cd4	io_uring: port to struct kmem_cache_args Port req_cachep to struct kmem_cache_args. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2024-09-10 11:42:59 +02:00
Felix Moessbauer	f011c9cf04	io_uring/sqpoll: do not allow pinning outside of cpuset The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications. In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set. Fixes: `37d1e2e364` ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-09 09:09:08 -06:00
Jens Axboe	0e0bcf07ec	io_uring/eventfd: move refs to refcount_t atomic_t for the struct io_ev_fd references and there are no issues with it. While the ref getting and putting for the eventfd code is somewhat performance critical for cases where eventfd signaling is used (news flash, you should not...), it probably doesn't warrant using an atomic_t for this. Let's just move to it to refcount_t to get the added protection of over/underflows. Link: https://lore.kernel.org/lkml/202409082039.hnsaIJ3X-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409082039.hnsaIJ3X-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-08 16:43:57 -06:00
Anuj Gupta	c9f9ce65c2	io_uring: remove unused rsrc_put_fn rsrc_put_fn is declared but never used, remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-02 09:39:57 -06:00
Anuj Gupta	6cf52b42c4	io_uring: add new line after variable declaration Fixes checkpatch warning Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-02 09:39:57 -06:00
Jens Axboe	1802656ef8	io_uring: add GCOV_PROFILE_URING Kconfig option If GCOV is enabled and this option is set, it enables code coverage profiling of the io_uring subsystem. Only use this for test purposes, as it will impact the runtime performance. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 10:52:02 -06:00
Jens Axboe	f274495aea	io_uring/kbuf: return correct iovec count from classic buffer peek io_provided_buffers_select() returns 0 to indicate success, but it should be returning 1 to indicate that 1 vec was mapped. This causes peeking to fail with classic provided buffers, and while that's not a use case that anyone should use, it should still work correctly. The end result is that no buffer will be selected, and hence a completion with '0' as the result will be posted, without a buffer attached. Fixes: `35c8711c8f` ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 10:45:54 -06:00
Jens Axboe	1c47c0d601	io_uring/rsrc: ensure compat iovecs are copied correctly For buffer registration (or updates), a userspace iovec is copied in and updated. If the application is within a compat syscall, then the iovec type is compat_iovec rather than iovec. However, the type used in __io_sqe_buffers_update() and io_sqe_buffers_register() is always struct iovec, and hence the source is incremented by the size of a non-compat iovec in the loop. This misses every other iovec in the source, and will run into garbage half way through the copies and return -EFAULT to the application. Maintain the source address separately and assign to our user vec pointer, so that copies always happen from the right source address. While in there, correct a bad placement of __user which triggered the following sparse warning prior to this fix: io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces) io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user uvec io_uring/rsrc.c:981:30: got struct iovec [noderef] __user Fixes: `f4eaf8eda8` ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 07:52:43 -06:00
Jens Axboe	ae98dbf43d	io_uring/kbuf: add support for incremental buffer consumption By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel needs to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed, the only difference is that before the next receive would always be the next buffer in the ring. Now the same buffer ID may return multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096b of data is available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2010 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10001 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID. When a buffer will be reused by future CQE completions, IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. Will only be set by provided buffer rings setup with IOU_PBUF_RING INC, as that's the only type of buffer that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions is done like before, no changes there. The only change is in how an application may see multiple completions for the same buffer ID, hence needing to know where the next receive will happen. Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. It will consume a buffer otherwise regardless of the size of the IO done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:58 -06:00
Jens Axboe	6733e678ba	io_uring/kbuf: pass in 'len' argument for buffer commit In preparation for needing the consumed length, pass in the length being completed. Unused right now, but will be used when it is possible to partially consume a buffer. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:51 -06:00
Jens Axboe	641a681679	Revert "io_uring: Require zeroed sqe->len on provided-buffers send" This reverts commit `79996b45f7`. Revert the change that restricts a send provided buffer to be zero, so it will always consume the whole buffer. This is strictly needed for partial consumption, as the send may very well be a subset of the current buffer. In fact, that's the intended use case. For non-incremental provided buffer rings, an application should set sqe->len carefully to avoid the potential issue described in the reverted commit. It is recommended that '0' still be set for len for that case, if the application is set on maintaining more than 1 send inflight for the same socket. This is somewhat of a nonsensical thing to do. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:46 -06:00
Jens Axboe	2c8fa70bf3	io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h In preparation for using this helper in kbuf.h as well, move it there and turn it into a macro. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:42 -06:00
Jens Axboe	ecd5c9b296	io_uring/kbuf: add io_kbuf_commit() helper Committing the selected ring buffer is currently done in three different spots, combine it into a helper and just call that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:38 -06:00
Jens Axboe	120443321d	io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg nr_iovs is capped at 1024, and mode only has a few low values. We can safely make them u16, in preparation for adding a few more members. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	7ed9e09e2d	io_uring: wire up min batch wake timeout Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member that is currently in there. The value is in usecs, which is explained in the name as well. Note that if min_wait_usec and a normal timeout is used in conjunction, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, then only the min_wait_usec will take effect. See previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side as they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	1100c4a265	io_uring: add support for batch wait timeout Waiting for events with io_uring has two knobs that can be set: 1) The number of events to wake for 2) The timeout associated with the event Waiting will abort when either of those conditions are met, as expected. This adds support for a third event, which is associated with the number of events to wait for. Applications generally like to handle batches of completions, and right now they'd set a number of events to wait for and the timeout for that. If no events have been received but the timeout triggers, control is returned to the application and it can wait again. However, if the application doesn't have anything to do until events are reaped, then it's possible to make this waiting more efficient. For example, the application may have a latency time of 50 usecs and wanting to handle a batch of 8 requests at the time. If it uses 50 usecs as the timeout, then it'll be doing 20K context switches per second even if nothing is happening. This introduces the notion of min batch wait time. If the min batch wait time expires, then we'll return to userspace if we have any events at all. If none are available, the general wait time is applied. Any request arriving after the min batch wait time will cause waiting to stop and return control to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	cebf123c63	io_uring: implement our own schedule timeout handling In preparation for having two distinct timeouts and avoid waking the task if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	45a41e74b8	io_uring: move schedule wait logic into helper In preparation for expanding how we handle waits, move the actual schedule and schedule_timeout() handling into a helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	f42b58e448	io_uring: encapsulate extraneous wait flags into a separate struct Rather than need to pass in 2 or 3 separate arguments, add a struct to encapsulate the timeout and sigset_t parts of waiting. In preparation for adding another argument for waiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Pavel Begunkov	2b8e976b98	io_uring: user registered clockid for wait timeouts Add a new registration opcode IORING_REGISTER_CLOCK, which allows the user to select which clock id it wants to use with CQ waiting timeouts. It only allows a subset of all posix clocks and currently supports CLOCK_MONOTONIC and CLOCK_BOOTTIME. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Pavel Begunkov	d29cb3726f	io_uring: add absolute mode wait timeouts In addition to current relative timeouts for the waiting loop, where the timespec argument specifies the maximum time it can wait for, add support for the absolute mode, with the value carrying a CLOCK_MONOTONIC absolute time until which we should return control back to the user. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Pavel Begunkov	d5cce407e4	io_uring/napi: postpone napi timeout adjustment Remove io_napi_adjust_timeout() and move the adjustments out of the common path into __io_napi_busy_loop(). Now the limit it's calculated based on struct io_wait_queue::timeout, for which we query current time another time. The overhead shouldn't be a problem, it's a polling path, however that can be optimised later by additionally saving the delta time value in io_cqring_wait(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Pavel Begunkov	489b80060c	io_uring/napi: refactor __io_napi_busy_loop() we don't need to set ->napi_prefer_busy_poll if we're not going to poll, do the checks first and all polling preparation after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2ad7ede8cc7905328fc62e8c3805fdb11635ae0b.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	a69307a554	io_uring/kbuf: turn io_buffer_list booleans into flags We could just move these two and save some space, but in preparation for adding another flag, turn them into flags first. This saves 8 bytes in struct io_buffer_list, making it exactly half a cacheline on 64-bit archs now rather than 40 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	566a424212	io_uring/net: use ITER_UBUF for single segment send maps Just like what is being done on the recv side, if we only map a single segment, then use ITER_UBUF for mapping it. That's more efficient than using an ITER_IOVEC. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	03e02e8f95	io_uring/kbuf: use 'bl' directly rather than req->buf_list req->buf_list is assigned higher up and is safe to use as we remain within a locked region, as is the 'bl' variable itself from which it was assigned. To improve readability, use 'bl' directly rather than get it from the io_kiocb, if we need to increment the head directly in the buffer selection path. This makes it readily apparent that it's the same io_buffer_list being used. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Olivier Langlois	7255cd8945	io_uring: micro optimization of __io_sq_thread() condition reverse the order of the element evaluation in an if statement. for many users that are not using iopoll, the iopoll_list will always evaluate to false after having made a memory access whereas to_submit is very likely already loaded in a register. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Chenliang Li	a8edbb424b	io_uring/rsrc: enable multi-hugepage buffer coalescing Add support for checking and coalescing multi-hugepage-backed fixed buffers. The coalescing optimizes both time and space consumption caused by mapping and storing multi-hugepage fixed buffers. A coalescable multi-hugepage buffer should fully cover its folios (except potentially the first and last one), and these folios should have the same size. These requirements are for easier processing later, also we need same size'd chunks in io_import_fixed for fast iov_iter adjust. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Chenliang Li	3d6106aee4	io_uring/rsrc: store folio shift and mask into imu Store the folio shift and folio mask into imu struct and use it in iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a multi-hugepage buffer get coalesced. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Olivier Langlois	d843634a95	io_uring: add napi busy settings to the fdinfo output This info may be useful when attempting to debug a problem involving a ring using the NAPI feature. Here is an example of the output: ip-172-31-39-89 /proc/772/fdinfo # cat 14 pos: 0 flags: 02000002 mnt_id: 16 ino: 10243 SqMask: 0xff SqHead: 633 SqTail: 633 CachedSqHead: 633 CqMask: 0x3fff CqHead: 430250 CqTail: 430250 CachedCqTail: 430250 SQEs: 0 CQEs: 0 SqThread: 885 SqThreadCpu: 0 SqTotalTime: 52793826 SqWorkTime: 3590465 UserFiles: 0 UserBufs: 0 PollList: op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 CqOverflowList: NAPI: enabled napi_busy_poll_to: 1 napi_prefer_busy_poll: true Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb184f8b62703ddd3e6e19eae7ab6c67b97e1e10.1722293317.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:00 -06:00
Jens Axboe	e0ee967630	io_uring/kbuf: sanitize peek buffer setup Harden the buffer peeking a bit, by adding a sanity check for it having a valid size. Outside of that, arg->max_len is a size_t, though it's only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT). Bump our needed check to a size_t so we know it fits. Finally, cap the calculated needed iov value to the PEEK_MAX_IMPORT, which is the maximum number of segments that should be peeked. Fixes: `35c8711c8f` ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-21 07:16:38 -06:00
Jens Axboe	e4956dc7a8	io_uring/sqpoll: annotate debug task == current with data_race() There's a debug check in io_sq_thread_park() checking if it's the SQPOLL thread itself calling park. KCSAN warns about this, as we should not be reading sqd->thread outside of sqd->lock. Just silence this with data_race(). The pointer isn't used for anything but this debug check. Reported-by: syzbot+2b946a3fd80caf971b21@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-13 06:10:59 -06:00
Al Viro	1da91ea87a	introduce fd_file(), convert all accessors to it. For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-12 22:00:43 -04:00
Olivier Langlois	48cc7ecd3a	io_uring/napi: remove duplicate io_napi_entry timeout assignation io_napi_entry() has 2 calling sites. One of them is unlikely to find an entry and if it does, the timeout should arguable not be updated. The other io_napi_entry() calling site is overwriting the update made by io_napi_entry() so the io_napi_entry() timeout value update has no or little value and therefore is removed. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/145b54ff179f87609e20dffaf5563c07cdbcad1a.1723423275.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-12 12:11:42 -06:00
Olivier Langlois	84f2eecf95	io_uring/napi: check napi_enabled in io_napi_add() before proceeding doing so avoids the overhead of adding napi ids to all the rings that do not enable napi. if no id is added to napi_list because napi is disabled, __io_napi_busy_loop() will not be called. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Fixes: `b4ccc4dd13` ("io_uring/napi: enable even with a timeout of 0") Link: https://lore.kernel.org/r/bd989ccef5fda14f5fd9888faf4fefcf66bd0369.1723400131.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-12 12:09:03 -06:00
Jens Axboe	8fe8ac24ad	io_uring/net: don't pick multiple buffers for non-bundle send If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a buffer, unless it's a bundle, it should not select multiple buffers. Cc: stable@vger.kernel.org Fixes: `a05d1f625c` ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-07 15:20:52 -06:00
Jens Axboe	70ed519ed5	io_uring/net: ensure expanded bundle send gets marked for cleanup If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: `a05d1f625c` ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-07 15:08:17 -06:00
Jens Axboe	11893e144e	io_uring/net: ensure expanded bundle recv gets marked for cleanup If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: `2f9c9515bd` ("io_uring/net: support bundles for recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-07 15:06:45 -06:00
Olivier Langlois	c3fca4fb83	io_uring: remove unused local list heads in NAPI functions These lists are unused, remove them. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a0ae3e955aed0f3e3d29882fb3d3cb575e0009b.1722294947.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-30 06:20:20 -06:00
Olivier Langlois	2c762be5b7	io_uring: keep multishot request NAPI timeout current This refresh statement was originally present in the original patch: https://lore.kernel.org/netdev/20221121191437.996297-2-shr@devkernel.io/ It has been removed with no explanation in v6: https://lore.kernel.org/netdev/20230201222254.744422-2-shr@devkernel.io/ It is important to make the refresh for multishot requests, because if no new requests using the same NAPI device are added to the ring, the entry will become stale and be removed silently. The unsuspecting user will not know that their ring had busy polling for only 60 seconds before being pruned. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: `8d0c12a80c` ("io-uring: add napi busy poll support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/0fe61a019ec61e5708cd117cb42ed0dab95e1617.1722294646.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-30 06:18:58 -06:00
Pavel Begunkov	3581696176	io_uring/napi: pass ktime to io_napi_adjust_timeout Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-26 08:31:59 -06:00
Pavel Begunkov	342b2e395d	io_uring/napi: use ktime in busy polling It's more natural to use ktime/ns instead of keeping around usec, especially since we're comparing it against user provided timers, so convert napi busy poll internal handling to ktime. It's also nicer since the type (ktime_t vs unsigned long) now tells the unit of measure. Keep everything as ktime, which we convert to/from micro seconds for IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with usec, however it's not real usec as shift by 10 is used to get it from nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec back and we get back better precision. Note, we can further improve it later by removing the truncation and maybe convincing net/ to use ktime/ns instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-26 08:31:59 -06:00
Jens Axboe	0db4618e8f	io_uring/msg_ring: fix uninitialized use of target_req->flags syzbot reports that KMSAN complains that 'nr_tw' is an uninit-value with the following report: BUG: KMSAN: uninit-value in io_req_local_work_add io_uring/io_uring.c:1192 [inline] BUG: KMSAN: uninit-value in io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_req_local_work_add io_uring/io_uring.c:1192 [inline] io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_msg_remote_post io_uring/msg_ring.c:102 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] io_msg_ring_data io_uring/msg_ring.c:152 [inline] io_msg_ring+0x1c38/0x1ef0 io_uring/msg_ring.c:305 io_issue_sqe+0x383/0x22c0 io_uring/io_uring.c:1710 io_queue_sqe io_uring/io_uring.c:1924 [inline] io_submit_sqe io_uring/io_uring.c:2180 [inline] io_submit_sqes+0x1259/0x2f20 io_uring/io_uring.c:2295 __do_sys_io_uring_enter io_uring/io_uring.c:3205 [inline] __se_sys_io_uring_enter+0x40c/0x3ca0 io_uring/io_uring.c:3142 __x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3142 x64_sys_call+0x2d82/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:427 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f which is the following check: if (nr_tw < nr_wait) return; in io_req_local_work_add(). While nr_tw itself cannot be uninitialized, it does depend on req->flags, which off the msg ring issue path can indeed be uninitialized. Fix this by always clearing the allocated 'req' fully if we can't grab one from the cache itself. Fixes: `50cf5f3842` ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Reported-by: syzbot+82609b8937a4458106ca@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/000000000000fd3d8d061dfc0e4a@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-25 08:41:35 -06:00
Pavel Begunkov	29d63b9403	io_uring: align iowq and task request error handling There is a difference in how io_queue_sqe and io_wq_submit_work treat error codes they get from io_issue_sqe. The first one fails anything unknown but latter only fails when the code is negative. It doesn't make sense to have this discrepancy, align them to the io_queue_sqe behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-24 08:01:49 -06:00
Pavel Begunkov	f1dcdfcadb	io_uring: simplify io_uring_cmd return We don't have to return error code from an op handler back to core io_uring, so once io_uring_cmd() sets the results and handles errors we can juts return IOU_OK and simplify the code. Note, only valid with `e0b23d9953` ("io_uring: optimise ltimeout for inline execution"), there was a problem with iopoll before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8eae2be5b2a49236cd5f1dadbd1aa5730e9e2d4f.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-24 08:01:49 -06:00
Pavel Begunkov	e142e9cd88	io_uring: fix io_match_task must_hold The __must_hold annotation in io_match_task() uses a non existing parameter "req", fix it. Fixes: `6af3f48bf6` ("io_uring: fix link traversal locking") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3e65ee7709e96507cef3d93291746f2c489f2307.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-24 08:01:49 -06:00
Pavel Begunkov	bd44d7e902	io_uring: don't allow netpolling with SETUP_IOPOLL IORING_SETUP_IOPOLL rings don't have any netpoll handling, let's fail attempts to register netpolling in this case, there might be people who will mix up IOPOLL and netpoll. Cc: stable@vger.kernel.org Fixes: `ef1186c1a8` ("io_uring: add register/unregister napi function") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e7553aee0a8ae4edec6742cd6dd0c1e6914fba8.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-24 08:01:49 -06:00
Pavel Begunkov	f8b632e89a	io_uring: tighten task exit cancellations io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: `3f48cf18f8` ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-24 08:01:49 -06:00
Pavel Begunkov	bcc87d978b	io_uring: fix error pbuf checking Syz reports a problem, which boils down to NULL vs IS_ERR inconsistent error handling in io_alloc_pbuf_ring(). KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:__io_remove_buffers+0xac/0x700 io_uring/kbuf.c:341 Call Trace: <TASK> io_put_bl io_uring/kbuf.c:378 [inline] io_destroy_buffers+0x14e/0x490 io_uring/kbuf.c:392 io_ring_ctx_free+0xa00/0x1070 io_uring/io_uring.c:2613 io_ring_exit_work+0x80f/0x8a0 io_uring/io_uring.c:2844 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd40 kernel/workqueue.c:3390 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Cc: stable@vger.kernel.org Reported-by: syzbot+2074b1a3d447915c6f1c@syzkaller.appspotmail.com Fixes: `87585b0575` ("io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5f9df20560bd9830401e8e48abc029e7cfd9f5e.1721329239.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-20 11:04:57 -06:00
Pavel Begunkov	24dce1c538	io_uring: fix lost getsockopt completions There is a report that iowq executed getsockopt never completes. The reason being that io_uring_cmd_sock() can return a positive result, and io_uring_cmd() propagates it back to core io_uring, instead of IOU_OK. In case of io_wq_submit_work(), the request will be dropped without completing it. The offending code was introduced by a hack in `a9c3eda7ea` ("io_uring: fix submission-failure handling for uring-cmd"), however it was fine until getsockopt was introduced and started returning positive results. The right solution is to always return IOU_OK, since `e0b23d9953` ("io_uring: optimise ltimeout for inline execution"), we should be able to do it without problems, however for the sake of backporting and minimising side effects, let's keep returning negative return codes and otherwise do IOU_OK. Link: https://github.com/axboe/liburing/issues/1181 Cc: stable@vger.kernel.org Fixes: `8e9fad0e70` ("io_uring: Add io_uring command support for sockets") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/ff349cf0654018189b6077e85feed935f0f8839e.1721149870.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-20 11:04:56 -06:00
Linus Torvalds	51835949dd	Networking changes for 6.11. Not much excitement - a handful of large patchsets (devmem among them) did not make it in time. Core & protocols ---------------- - Use local_lock in addition to local_bh_disable() to protect per-CPU resources in networking, a step closer for local_bh_disable() not to act as a big lock on PREEMPT_RT. - Use flex array for netdevice priv area, ensure its cache alignment. - Add a sysctl knob to allow user to specify a default rto_min at socket init time. Bit of a big hammer but multiple companies were independently carrying such patch downstream so clearly it's useful. - Support scheduling transmission of packets based on CLOCK_TAI. - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off using cpusets. - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address. - Allow configuration of multipath hash seed, to both allow synchronizing hashing of two routers, and preventing partial accidental sync. - Improve TCP compliance with RFC 9293 for simultaneous connect(). - Support sending NAT keepalives in IPsec ESP in UDP states. Userspace IKE daemon had to do this before, but the kernel can better keep track of it. - Support sending supervision HSR frames with MAC addresses stored in ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled. - Introduce IPPROTO_SMC for selecting SMC when socket is created. - Allow UDP GSO transmit from devices with no checksum offload. - openvswitch: add packet sampling via psample, separating the sampled traffic from "upcall" packets sent to user space for forwarding. - nf_tables: shrink memory consumption for transaction objects. Things we sprinkled into general kernel code -------------------------------------------- - Power Sequencing subsystem (used by Qualcomm Bluetooth driver for QCA6390). - Add IRQ information in sysfs for auxiliary bus. - Introduce guard definition for local_lock. - Add aligned flavor of __cacheline_group_{begin, end}() markings for grouping fields in structures. BPF --- - Notify user space (via epoll) when a struct_ops object is getting detached/unregistered. - Add new kfuncs for a generic, open-coded bits iterator. - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head. - Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible WRT BTF from modules. - Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs. - riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter. - Add the capability to offload the netfilter flowtable in XDP layer through kfuncs. Driver API ---------- - Allow users to configure IRQ tresholds between which automatic IRQ moderation can choose. - Expand Power Sourcing (PoE) status with power, class and failure reason. Support setting power limits. - Track additional RSS contexts in the core, make sure configuration changes don't break them. - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP data paths. - Support updating firmware on SFP modules. Tests and tooling ----------------- - mptcp: use net/lib.sh to manage netns. - TCP-AO and TCP-MD5: replace debug prints used by tests with tracepoints. - openvswitch: make test self-contained (don't depend on OvS CLI tools). Drivers ------- - Ethernet high-speed NICs: - Broadcom (bnxt): - increase the max total outstanding PTP TX packets to 4 - add timestamping statistics support - implement netdev_queue_mgmt_ops - support new RSS context API - Intel (100G, ice, idpf): - implement FEC statistics and dumping signal quality indicators - support E825C products (with 56Gbps PHYs) - nVidia/Mellanox: - support HW-GRO - mlx4/mlx5: support per-queue statistics via netlink - obey the max number of EQs setting in sub-functions - AMD/Solarflare: - support new RSS context API - AMD/Pensando: - ionic: rework fix for doorbell miss to lower overhead and skip it on new HW - Wangxun: - txgbe: support Flow Director perfect filters - Ethernet NICs consumer, embedded and virtual: - Add driver for Tehuti Networks TN40xx chips - Add driver for Meta's internal NIC chips - Add driver for Ethernet MAC on Airoha EN7581 SoCs - Add driver for Renesas Ethernet-TSN devices - Google cloud vNIC: - flow steering support - Microsoft vNIC: - support page sizes other than 4KB on ARM64 - vmware vNIC: - support latency measurement (update to version 9) - VirtIO net: - support for Byte Queue Limits - support configuring thresholds for automatic IRQ moderation - support for AF_XDP Rx zero-copy - Synopsys (stmmac): - support for STM32MP13 SoC - let platforms select the right PCS implementation - TI: - icssg-prueth: add multicast filtering support - icssg-prueth: enable PTP timestamping and PPS - Renesas: - ravb: improve Rx performance 30-400% by using page pool, theaded NAPI and timer-based IRQ coalescing - ravb: add MII support for R-Car V4M - Cadence (macb): - macb: add ARP support to Wake-On-LAN - Cortina: - use phylib for RX and TX pause configuration - Ethernet switches: - nVidia/Mellanox: - support configuration of multipath hash seed - report more accurate max MTU - use page_pool to improve Rx performance - MediaTek: - mt7530: add support for bridge port isolation - Qualcomm: - qca8k: add support for bridge port isolation - Microchip: - lan9371/2: add 100BaseTX PHY support - NXP: - vsc73xx: implement VLAN operations - Ethernet PHYs: - aquantia: enable support for aqr115c - aquantia: add support for PHY LEDs - realtek: add support for rtl8224 2.5Gbps PHY - xpcs: add memory-mapped device support - add BroadR-Reach link mode and support in Broadcom's PHY driver - CAN: - add document for ISO 15765-2 protocol support - mcp251xfd: workaround for erratum DS80000789E, use timestamps to catch when device returns incorrect FIFO status - WiFi: - mac80211/cfg80211: - parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers - improvements for 6 GHz regulatory flexibility - multi-link improvements - support multiple radios per wiphy - remove DEAUTH_NEED_MGD_TX_PREP flag - Intel (iwlwifi): - bump FW API to 91 for BZ/SC devices - report 64-bit radiotap timestamp - enable P2P low latency by default - handle Transmit Power Envelope (TPE) advertised by AP - remove support for older FW for new devices - fast resume (keeping the device configured) - mvm: re-enable Multi-Link Operation (MLO) - aggregation (A-MSDU) optimizations - MediaTek (mt76): - mt7925 Multi-Link Operation (MLO) support - Qualcomm (ath10k): - LED support for various chipsets - Qualcomm (ath12k): - remove unsupported Tx monitor handling - support channel 2 in 6 GHz band - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) - support dynamic VLAN - add panic handler for resetting the firmware state - DebugFS support for datapath statistics - WCN7850: support for Wake on WLAN - Microchip (wilc1000): - read MAC address during probe to make it visible to user space - suspend/resume improvements - TI (wl18xx): - support newer firmware versions - RealTek (rtw89): - preparation for RTL8852BE-VT support - Wake on WLAN support for WiFi 6 chips - 36-bit PCI DMA support - RealTek (rtlwifi): - RTL8192DU support - Broadcom (brcmfmac): - Management Frame Protection support (to enable WPA3) - Bluetooth: - qualcomm: use the power sequencer for QCA6390 - btusb: mediatek: add ISO data transmission functions - hci_bcm4377: add BCM4388 support - btintel: add support for BlazarU core - btintel: add support for Whale Peak2 - btnxpuart: add support for AW693 A1 chipset - btnxpuart: add support for IW615 chipset - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591 Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaWjBwACgkQMUZtbf5S IrvuSRAAkJuEzTRqgURBCe4eNEQde6mJJig7l2CKHwCbFiHZpRkFHf8qKbcGWbL6 uLW33SWnKtJVDhxVKWHLq635XW7BAa80YhqGw21GDi+mIEhWXZglHj3xbXNxsMfE 4eg/kG4BkfYWFmHaXOwVWV/mr7nXf6j7WmXNeXEi32ufE1j0OL+YlQenKnMj8yP2 j9JmYa2Chwppng1SblHmcjmGkdNVwFhStKeCG+2K7v06wdDH/QYBlbgUv9gw/cxp NlW//wgiaeX40U4O3kDwt9C+LDoh+0VrDDeVdQ+IsScLtY3PhAzEoKolFYTq2HSr I1JpoaHNnyNsJq3DZrACQ5WlH4yDn6C2EUB6dxNnFaI9F1ZPsi+7MTl6Sei1AklD TuQTj/lxOACBwW2Q77NU72uoxiIUauesGPHcnrAFuoCIEhZF0mso7k59BvrXhsOP QwcLbQdc1YHNkqv/Vc7NBY+ruMsYB+5Ubbhhj2p27dp/CWFIwxI29fze4dn2uhO6 ejHN3mbqwPdSzg12YJtM6Iq61Cnwo2eVSvhTxl+ZVSZtI4nu2arzR+y7QTYmNrXP 6tkgVN9UsWeLl2xJ8wyyqL5mcvNHP2rPXWZ2X56iTaa26m+UlleeQ7YRaYtQAAr0 Ec/vlDMX64SwHhd+qwE99DXGQf2g+KklHKSLsnajJUVrWFTlRI0= =opz8 -----END PGP SIGNATURE----- Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Not much excitement - a handful of large patchsets (devmem among them) did not make it in time. Core & protocols: - Use local_lock in addition to local_bh_disable() to protect per-CPU resources in networking, a step closer for local_bh_disable() not to act as a big lock on PREEMPT_RT - Use flex array for netdevice priv area, ensure its cache alignment - Add a sysctl knob to allow user to specify a default rto_min at socket init time. Bit of a big hammer but multiple companies were independently carrying such patch downstream so clearly it's useful - Support scheduling transmission of packets based on CLOCK_TAI - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off using cpusets - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address - Allow configuration of multipath hash seed, to both allow synchronizing hashing of two routers, and preventing partial accidental sync - Improve TCP compliance with RFC 9293 for simultaneous connect() - Support sending NAT keepalives in IPsec ESP in UDP states. Userspace IKE daemon had to do this before, but the kernel can better keep track of it - Support sending supervision HSR frames with MAC addresses stored in ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled - Introduce IPPROTO_SMC for selecting SMC when socket is created - Allow UDP GSO transmit from devices with no checksum offload - openvswitch: add packet sampling via psample, separating the sampled traffic from "upcall" packets sent to user space for forwarding - nf_tables: shrink memory consumption for transaction objects Things we sprinkled into general kernel code: - Power Sequencing subsystem (used by Qualcomm Bluetooth driver for QCA6390) [ Already merged separately - Linus ] - Add IRQ information in sysfs for auxiliary bus - Introduce guard definition for local_lock - Add aligned flavor of __cacheline_group_{begin, end}() markings for grouping fields in structures BPF: - Notify user space (via epoll) when a struct_ops object is getting detached/unregistered - Add new kfuncs for a generic, open-coded bits iterator - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head - Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible WRT BTF from modules - Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs - riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter - Add the capability to offload the netfilter flowtable in XDP layer through kfuncs Driver API: - Allow users to configure IRQ tresholds between which automatic IRQ moderation can choose - Expand Power Sourcing (PoE) status with power, class and failure reason. Support setting power limits - Track additional RSS contexts in the core, make sure configuration changes don't break them - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP data paths - Support updating firmware on SFP modules Tests and tooling: - mptcp: use net/lib.sh to manage netns - TCP-AO and TCP-MD5: replace debug prints used by tests with tracepoints - openvswitch: make test self-contained (don't depend on OvS CLI tools) Drivers: - Ethernet high-speed NICs: - Broadcom (bnxt): - increase the max total outstanding PTP TX packets to 4 - add timestamping statistics support - implement netdev_queue_mgmt_ops - support new RSS context API - Intel (100G, ice, idpf): - implement FEC statistics and dumping signal quality indicators - support E825C products (with 56Gbps PHYs) - nVidia/Mellanox: - support HW-GRO - mlx4/mlx5: support per-queue statistics via netlink - obey the max number of EQs setting in sub-functions - AMD/Solarflare: - support new RSS context API - AMD/Pensando: - ionic: rework fix for doorbell miss to lower overhead and skip it on new HW - Wangxun: - txgbe: support Flow Director perfect filters - Ethernet NICs consumer, embedded and virtual: - Add driver for Tehuti Networks TN40xx chips - Add driver for Meta's internal NIC chips - Add driver for Ethernet MAC on Airoha EN7581 SoCs - Add driver for Renesas Ethernet-TSN devices - Google cloud vNIC: - flow steering support - Microsoft vNIC: - support page sizes other than 4KB on ARM64 - vmware vNIC: - support latency measurement (update to version 9) - VirtIO net: - support for Byte Queue Limits - support configuring thresholds for automatic IRQ moderation - support for AF_XDP Rx zero-copy - Synopsys (stmmac): - support for STM32MP13 SoC - let platforms select the right PCS implementation - TI: - icssg-prueth: add multicast filtering support - icssg-prueth: enable PTP timestamping and PPS - Renesas: - ravb: improve Rx performance 30-400% by using page pool, theaded NAPI and timer-based IRQ coalescing - ravb: add MII support for R-Car V4M - Cadence (macb): - macb: add ARP support to Wake-On-LAN - Cortina: - use phylib for RX and TX pause configuration - Ethernet switches: - nVidia/Mellanox: - support configuration of multipath hash seed - report more accurate max MTU - use page_pool to improve Rx performance - MediaTek: - mt7530: add support for bridge port isolation - Qualcomm: - qca8k: add support for bridge port isolation - Microchip: - lan9371/2: add 100BaseTX PHY support - NXP: - vsc73xx: implement VLAN operations - Ethernet PHYs: - aquantia: enable support for aqr115c - aquantia: add support for PHY LEDs - realtek: add support for rtl8224 2.5Gbps PHY - xpcs: add memory-mapped device support - add BroadR-Reach link mode and support in Broadcom's PHY driver - CAN: - add document for ISO 15765-2 protocol support - mcp251xfd: workaround for erratum DS80000789E, use timestamps to catch when device returns incorrect FIFO status - WiFi: - mac80211/cfg80211: - parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers - improvements for 6 GHz regulatory flexibility - multi-link improvements - support multiple radios per wiphy - remove DEAUTH_NEED_MGD_TX_PREP flag - Intel (iwlwifi): - bump FW API to 91 for BZ/SC devices - report 64-bit radiotap timestamp - enable P2P low latency by default - handle Transmit Power Envelope (TPE) advertised by AP - remove support for older FW for new devices - fast resume (keeping the device configured) - mvm: re-enable Multi-Link Operation (MLO) - aggregation (A-MSDU) optimizations - MediaTek (mt76): - mt7925 Multi-Link Operation (MLO) support - Qualcomm (ath10k): - LED support for various chipsets - Qualcomm (ath12k): - remove unsupported Tx monitor handling - support channel 2 in 6 GHz band - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) - support dynamic VLAN - add panic handler for resetting the firmware state - DebugFS support for datapath statistics - WCN7850: support for Wake on WLAN - Microchip (wilc1000): - read MAC address during probe to make it visible to user space - suspend/resume improvements - TI (wl18xx): - support newer firmware versions - RealTek (rtw89): - preparation for RTL8852BE-VT support - Wake on WLAN support for WiFi 6 chips - 36-bit PCI DMA support - RealTek (rtlwifi): - RTL8192DU support - Broadcom (brcmfmac): - Management Frame Protection support (to enable WPA3) - Bluetooth: - qualcomm: use the power sequencer for QCA6390 - btusb: mediatek: add ISO data transmission functions - hci_bcm4377: add BCM4388 support - btintel: add support for BlazarU core - btintel: add support for Whale Peak2 - btnxpuart: add support for AW693 A1 chipset - btnxpuart: add support for IW615 chipset - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591" * tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits) eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering" tcp: Replace strncpy() with strscpy() wifi: ath12k: fix build vs old compiler tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child(). eth: fbnic: Write the TCAM tables used for RSS control and Rx to host eth: fbnic: Add L2 address programming eth: fbnic: Add basic Rx handling eth: fbnic: Add basic Tx handling eth: fbnic: Add link detection eth: fbnic: Add initial messaging to notify FW of our presence eth: fbnic: Implement Rx queue alloc/start/stop/free eth: fbnic: Implement Tx queue alloc/start/stop/free eth: fbnic: Allocate a netdevice and napi vectors with queues eth: fbnic: Add FW communication mechanism eth: fbnic: Add message parsing for FW messages eth: fbnic: Add register init to set PCIe/Ethernet device config eth: fbnic: Allocate core device specific structures and devlink interface eth: fbnic: Add scaffolding for Meta's NIC driver PCI: Add Meta Platforms vendor ID net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK ...	2024-07-16 19:28:34 -07:00
Linus Torvalds	3e78198862	for-6.11/block-20240710 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv 5iv+EdRSNA== =wVi8 -----END PGP SIGNATURE----- Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...	2024-07-15 14:20:22 -07:00
Linus Torvalds	3a56e24173	for-6.11/io_uring-20240714 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaTgusQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr+1EAC4I7pRAM341sfmhe/9QQKMM8VzGwy5Tlr1 AFLO3BujRTl6X8S9fQjIjN1coW6u4F42I19+vVlxqvB7CUnqt9VWpexEjxe4K0FR R+hIZW+fWV9K/eMrcsLcI7oReN5kIihHOzzy3wz0rENoGB5dCl6JAZMHDUCSqP0/ ZJJQ5ut8ah20Y/myHnzP5o4TfdE7nGo73Di2YoE2g3KqeX/dlAKW9+5hqKzzrHhM 2U25k/6KLy0ROzKpy2qW0QRE3pT5udoHLK2ue9+XwXF8JWVTlfVkHBzGY7NstyyT z07SEzW1q4xV1HdCwGDAU7cL2NJMRXSG0p2WZTm8QyaVTdsZQvEx08GLsVdLvFH5 Gg+oOaxVE+INzW+/Lwz7lFHgq6XEjdAlEAOXDtGkZoni6Rt6iCzFCW6RTf/guy8o Cub7tatMyegxai9+FTN/oFVoydRR0tsMf0OHrWnLOperh9CaxAwXvmKFeT/UTwiB KIuIOJop7aThJbiV42a/xwTrEjNMZRv6uVBBEtJX3rxpmIhqTbjcAv9rKMmgtLMk s6yX1MvYdOLhhEDyoUBX0dJdEETBf3KbnYIwi8kb4Sbkw/ZDgnkmSxFysom61wUF byAFEpah3ZFR8aES0uNKUE6UHK6i5qqp0Za/n6gA927E/WGCU9ndaS+01gyknog0 8FqFYwruHQ== =50CO -----END PGP SIGNATURE----- Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...	2024-07-15 13:49:10 -07:00
Linus Torvalds	b051320d6a	vfs-6.11.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEF0AAKCRCRxhvAZXjc oq0TAQDjfTLN75RwKQ34RIFtRun2q+OMfBQtSegtaccqazghyAD/QfmPuZDxB5DL rsI/5k5O4VupIKrEdIaqvNxmkmDsSAc= =bf7E -----END PGP SIGNATURE----- Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support passing NULL along AT_EMPTY_PATH for statx(). NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with -EFAULT to retain compatibility (Rust is abusing calls of the sort to detect availability of statx) This avoids path lookup code, lockref management, memory allocation and in case of NULL path userspace memory access (which can be quite expensive with SMAP on x86_64) - Don't block i_writecount during exec. Remove the deny_write_access() mechanism for executables - Relax open_by_handle_at() permissions in specific cases where we can prove that the caller had sufficient privileges to open a file - Switch timespec64 fields in struct inode to discrete integers freeing up 4 bytes Fixes: - Fix false positive circular locking warning in hfsplus - Initialize hfs_inode_info after hfs_alloc_inode() in hfs - Avoid accidental overflows in vfs_fallocate() - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly restarting shmem_fallocate() - Add missing quote in comment in fs/readdir Cleanups: - Don't assign and test in an if statement in mqueue. Move the assignment out of the if statement - Reflow the logic in may_create_in_sticky() - Remove the usage of the deprecated ida_simple_xx() API from procfs - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new mount api early - Rename variables in copy_tree() to make it easier to understand - Replace WARN(down_read_trylock, ...) abuse with proper asserts in various places in the VFS - Get rid of user_path_at_empty() and drop the empty argument from getname_flags() - Check for error while copying and no path in one branch in getname_flags() - Avoid redundant smp_mb() for THP handling in do_dentry_open() - Rename parent_ino to d_parent_ino and make it use RCU - Remove unused header include in fs/readdir - Export in_group_capable() helper and switch f2fs and fuse over to it instead of open-coding the logic in both places" * tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) ipc: mqueue: remove assignment from IS_ERR argument vfs: rename parent_ino to d_parent_ino and make it use RCU vfs: support statx(..., NULL, AT_EMPTY_PATH, ...) stat: use vfs_empty_path() helper fs: new helper vfs_empty_path() fs: reflow may_create_in_sticky() vfs: remove redundant smp_mb for thp handling in do_dentry_open fuse: Use in_group_or_capable() helper f2fs: Use in_group_or_capable() helper fs: Export in_group_or_capable() vfs: reorder checks in may_create_in_sticky hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode() proc: Remove usage of the deprecated ida_simple_xx() API hfsplus: fix to avoid false alarm of circular locking Improve readability of copy_tree vfs: shave a branch in getname_flags vfs: retire user_path_at_empty and drop empty arg from getname_flags vfs: stop using user_path_at_empty in do_readlinkat tmpfs: don't interrupt fallocate with EINTR fs: don't block i_writecount during exec ...	2024-07-15 10:52:51 -07:00
Tetsuo Handa	ad00e62914	io_uring/net: check socket is valid in io_bind()/io_listen() We need to check that sock_from_file(req->file) != NULL. Reported-by: syzbot <syzbot+1e811482aa2c70afa9a0@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=1e811482aa2c70afa9a0 Fixes: `7481fd93fa` ("io_uring: Introduce IORING_OP_BIND") Fixes: `ff140cc862` ("io_uring: Introduce IORING_OP_LISTEN") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/903da529-eaa3-43ef-ae41-d30f376c60cc@I-love.SAKURA.ne.jp [axboe: move assignment of sock to where the NULL check is] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-13 06:40:15 -06:00
Pavel Begunkov	0453aad676	io_uring/io-wq: limit retrying worker initialisation If io-wq worker creation fails, we retry it by queueing up a task_work. tasK_work is needed because it should be done from the user process context. The problem is that retries are not limited, and if queueing a task_work is the reason for the failure, we might get into an infinite loop. It doesn't seem to happen now but it would with the following patch executing task_work in the freezer's loop. For now, arbitrarily limit the number of attempts to create a worker. Cc: stable@vger.kernel.org Fixes: `3146cba99a` ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <ju.orth@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8280436925db88448c7c85c6656edee1a43029ea.1720634146.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-11 01:51:44 -06:00
Thorsten Blum	f7c696a56c	io_uring/napi: Remove unnecessary s64 cast Since the do_div() macro casts the divisor to u32 anyway, remove the unnecessary s64 cast and fix the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_s64 instead Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240710010520.384009-2-thorsten.blum@toblux.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-10 00:20:52 -06:00
Jakub Kicinski	76ed626479	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h `219343755e` ("net: phy: aquantia: add missing include guards") `61578f6793` ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c `bd07a98178` ("net: txgbe: remove separate irq request for MSI and INTx") `b501d261a5` ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h `048a403648` ("net/mlx5: IFC updates for changing max EQs") `99be56171f` ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c `4130c67cd1` ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") `3f3126515f` ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h `816c6bec09` ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") `5a009b42e0` ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-07-04 14:16:11 -07:00
Jens Axboe	93d8032f41	io_uring/net: cleanup io_recv_finish() bundle handling Combine the two cases that check for whether or not this is a bundle, rather than having them as separate checks. This is easier to reduce, and it reduces the text associated with it as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-02 09:42:25 -06:00
Jens Axboe	6e92c646f5	io_uring/net: don't clear msg_inq before io_recv_buf_select() needs it For bundle receives to function properly, the previous iteration msg_inq value is needed to make a judgement call on how much data there is to receive. A previous fix ended up clearing it earlier as an error case would potentially errantly set IORING_CQE_F_SOCK_NONEMPTY if the request got failed. Move the assignment to post assigning buffers for the receive, but ensure that it's cleared for the buffer selection error case. With that, buffer selection has the right msg_inq value and can correctly bundle receives as designed. Noticed while testing where it was apparent than more than 1 buffer was never received. After fix was in place, multiple buffers are correctly picked for receive. This provides a 10x speedup for the test case, as the buffer size used was 64b. Fixes: `18414a4a2e` ("io_uring/net: assign kmsg inq/flags before buffer selection") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-02 09:42:10 -06:00
Pavel Begunkov	3b7c16be30	io_uring/msg_ring: fix overflow posting The caller of io_cqring_event_overflow() should be holding the completion_lock, which is violated by io_msg_tw_complete. There is only one caller of io_add_aux_cqe(), so just add locking there for now. WARNING: CPU: 0 PID: 5145 at io_uring/io_uring.c:703 io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 RIP: 0010:io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 <TASK> __io_post_aux_cqe io_uring/io_uring.c:816 [inline] io_add_aux_cqe+0x27c/0x320 io_uring/io_uring.c:837 io_msg_tw_complete+0x9d/0x4d0 io_uring/msg_ring.c:78 io_fallback_req_func+0xce/0x1c0 io_uring/io_uring.c:256 process_one_work kernel/workqueue.c:3224 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3305 worker_thread+0x86d/0xd40 kernel/workqueue.c:3383 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:144 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fixes: `f33096a3c9` ("io_uring: add io_add_aux_cqe() helper") Reported-by: syzbot+f7f9c893345c5c615d34@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c7350d07fefe8cce32b50f57665edbb6355ea8c1.1719927398.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-02 08:48:17 -06:00
Pavel Begunkov	060f4ba6e4	io_uring/net: move charging socket out of zc io_uring Currently, io_uring's io_sg_from_iter() duplicates the part of __zerocopy_sg_from_iter() charging pages to the socket. It'd be too easy to miss while changing it in net/, the chunk is not the most straightforward for outside users and full of internal implementation details. io_uring is not a good place to keep it, deduplicate it by moving out of the callback into __zerocopy_sg_from_iter(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-07-02 12:06:50 +02:00
Jens Axboe	be4f5d9c99	io_uring/msg_ring: use kmem_cache_free() to free request The change adding caching around the request allocated and freed for data messages changed a kmem_cache_free() to a kfree(), which isn't correct as the request came from slab in the first place. Fix that up and use the right freeing function if the cache is already at its limit. Note that the current mixing of kmem_cache_alloc and kfree is fine, but consistent alloc/free functions should be used as it's otherwise somewhat confusing. Fixes: `50cf5f3842` ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-01 09:10:59 -06:00
Jens Axboe	b0727b1243	io_uring/msg_ring: check for dead submitter task The change for improving the handling of the target CQE posting inadvertently dropped the NULL check for the submitter task on the target ring, reinstate that. Fixes: `0617bb500b` ("io_uring/msg_ring: improve handling of target CQE posting") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-01 08:45:27 -06:00
Jens Axboe	dbcabac138	io_uring: signal SQPOLL task_work with TWA_SIGNAL_NO_IPI Before SQPOLL was transitioned to managing its own task_work, the core used TWA_SIGNAL_NO_IPI to ensure that task_work was processed. If not, we can't be sure that all task_work is processed at SQPOLL thread exit time. Fixes: `af5d68f889` ("io_uring/sqpoll: manage task_work privately") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 19:46:15 -06:00
Jens Axboe	50cf5f3842	io_uring/msg_ring: add an alloc cache for io_kiocb entries With slab accounting, allocating and freeing memory has considerable overhead. Add a basic alloc cache for the io_kiocb allocations that msg_ring needs to do. Unlike other caches, this one is used by the sender, grabbing it from the remote ring. When the remote ring gets the posted completion, it'll free it locally. Hence it is separately locked, using ctx->msg_lock. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 08:39:55 -06:00
Jens Axboe	0617bb500b	io_uring/msg_ring: improve handling of target CQE posting Use the exported helper for queueing task_work for message passing, rather than rolling our own. Note that this is only done for strict data messages for now, file descriptor passing messages still rely on the kernel task_work. It could get converted at some point if it's performance critical. This improves peak performance of message passing by about 5x in some basic testing, with 2 threads just sending messages to each other. Before this change, it was capped at around 700K/sec, with the change it's at over 4M/sec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 08:39:50 -06:00
Jens Axboe	f33096a3c9	io_uring: add io_add_aux_cqe() helper This helper will post a CQE, and can be called from task_work where we now that the ctx is already properly locked and that deferred completions will get flushed later on. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 08:39:45 -06:00
Jens Axboe	c3ac76f9ca	io_uring: add remote task_work execution helper All our task_work handling is targeted at the state in the io_kiocb itself, which is what it is being used for. However, MSG_RING rolls its own task_work handling, ignoring how that is usually done. In preparation for switching MSG_RING to be able to use the normal task_work handling, add io_req_task_work_add_remote() which allows the caller to pass in the target io_ring_ctx. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 08:39:39 -06:00
Jens Axboe	d57afd8bb7	io_uring/msg_ring: tighten requirement for remote posting Currently this is gated on whether or not the target ring needs a local completion - and if so, whether or not we're running on the right task. The use case for same thread cross posting is probably a lot less relevant than remote posting. And since we're going to improve this situation anyway, just gate it on local posting and ignore what task we're currently running on. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-24 08:39:34 -06:00
Prasad Singamsetty	c34fc6f26a	fs: Initial atomic write support An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-20 15:19:17 -06:00
Chenliang Li	a23800f08a	io_uring/rsrc: fix incorrect assignment of iter->nr_segs in io_import_fixed In io_import_fixed when advancing the iter within the first bvec, the iter->nr_segs is set to bvec->bv_len. nr_segs should be the number of bvecs, plus we don't need to adjust it here, so just remove it. Fixes: `b000ae0ec2` ("io_uring/rsrc: optimise single entry advance") Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240619063819.2445-1-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-20 06:51:56 -06:00
Gabriel Krisman Bertazi	6bc9199d0c	io_uring: Allocate only necessary memory in io_probe We write at most IORING_OP_LAST entries in the probe buffer, so we don't need to allocate temporary space for more than that. As a side effect, we no longer can overflow "size". Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240619020620.5301-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 08:58:00 -06:00
Gabriel Krisman Bertazi	3e05b22238	io_uring: Fix probe of disabled operations io_probe checks io_issue_def->not_supported, but we never really set that field, as we mark non-supported functions through a specific ->prep handler. This means we end up returning IO_URING_OP_SUPPORTED, even for disabled operations. Fix it by just checking the prep handler itself. Fixes: `66f4af93da` ("io_uring: add support for probing opcodes") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240619020620.5301-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 08:58:00 -06:00
Gabriel Krisman Bertazi	ff140cc862	io_uring: Introduce IORING_OP_LISTEN IORING_OP_LISTEN provides the semantic of listen(2) via io_uring. While this is an essentially synchronous system call, the main point is to enable a network path to execute fully with io_uring registered and descriptorless files. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240614163047.31581-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:57:21 -06:00
Gabriel Krisman Bertazi	7481fd93fa	io_uring: Introduce IORING_OP_BIND IORING_OP_BIND provides the semantic of bind(2) via io_uring. While this is an essentially synchronous system call, the main point is to enable a network path to execute fully with io_uring registered and descriptorless files. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240614163047.31581-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:57:21 -06:00
Jens Axboe	3b87184f7e	io_uring/advise: support 64-bit lengths The existing fadvise/madvise support only supports 32-bit lengths. Add support for 64-bit lengths, enabled by the application setting sqe->off rather than sqe->len for the length. If sqe->len is set, then that is used as the 32-bit length. If sqe->len is zero, then sqe->off is read for full 64-bit support. Older kernels will return -EINVAL if 64-bit support isn't available. Fixes: `4840e418c2` ("io_uring: add IORING_OP_FADVISE") Fixes: `c1ca757bd6` ("io_uring: add IORING_OP_MADVISE") Reported-by: Stefan <source@s.muenzel.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Jens Axboe	11d1946692	io_uring/rsrc: remove redundant __set_current_state() post schedule() We're guaranteed to be in a TASK_RUNNING state post schedule, so we never need to set the state after that. While in there, remove the other __set_current_state() as well, and just call finish_wait() when we now we're going to break anyway. This is easier to grok than manual __set_current_state() calls. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Jens Axboe	3474d1b93f	io_uring/io-wq: make io_wq_work flags atomic The work flags can be set/accessed from different tasks, both the originator of the request, and the io-wq workers. While modifications aren't concurrent, it still makes KMSAN unhappy. There's no real downside to just making the flag reading/manipulation use proper atomics here. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Jens Axboe	f2a93294ed	io_uring: use 'state' consistently __io_submit_flush_completions() assigns ctx->submit_state to a local variable and uses it in all but one spot, switch that forgotten statement to using 'state' as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Jens Axboe	200f3abd14	io_uring/eventfd: move eventfd handling to separate file This is pretty nicely abstracted already, but let's move it to a separate file rather than have it in the main io_uring file. With that, we can also move the io_ev_fd struct and enum out of global scope. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Jens Axboe	60b6c075e8	io_uring/eventfd: move to more idiomatic RCU free usage In some ways, it just "happens to work" currently with using the ops field for both the free and signaling bit. But it depends on ordering of operations in terms of freeing and signaling. Clean it up and use the usual refs == 0 under RCU read side lock to determine if the ev_fd is still valid, and use the reference to gate the freeing as well. Fixes: `21a091b970` ("io_uring: signal registered eventfd to process deferred task work") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Gabriel Krisman Bertazi	f4eaf8eda8	io_uring/rsrc: Drop io_copy_iov in favor of iovec API Instead of open coding an io_uring function to copy iovs from userspace, rely on the existing iovec_from_user function. While there, avoid repeatedly zeroing the iov in the !arg case for io_sqe_buffer_register. tested with liburing testsuite. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240523214535.31890-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-16 14:54:55 -06:00
Pavel Begunkov	f4a1254f2a	io_uring: fix cancellation overwriting req->flags Only the current owner of a request is allowed to write into req->flags. Hence, the cancellation path should never touch it. Add a new field instead of the flag, move it into the 3rd cache line because it should always be initialised. poll_refs can move further as polling is an involved process anyway. It's a minimal patch, in the future we can and should find a better place for it and remove now unused REQ_F_CANCEL_SEQ. Fixes: `521223d7c2` ("io_uring/cancel: don't default to setting req->work.cancel_seq") Cc: stable@vger.kernel.org Reported-by: Li Shi <sl1589472800@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6827b129f8f0ad76fa9d1f0a773de938b240ffab.1718323430.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-13 19:25:28 -06:00
Pavel Begunkov	54559642b9	io_uring/rsrc: don't lock while !TASK_RUNNING There is a report of io_rsrc_ref_quiesce() locking a mutex while not TASK_RUNNING, which is due to forgetting restoring the state back after io_run_task_work_sig() and attempts to break out of the waiting loop. do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff815d2494>] prepare_to_wait+0xa4/0x380 kernel/sched/wait.c:237 WARNING: CPU: 2 PID: 397056 at kernel/sched/core.c:10099 __might_sleep+0x114/0x160 kernel/sched/core.c:10099 RIP: 0010:__might_sleep+0x114/0x160 kernel/sched/core.c:10099 Call Trace: <TASK> __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0xb4/0x940 kernel/locking/mutex.c:752 io_rsrc_ref_quiesce+0x590/0x940 io_uring/rsrc.c:253 io_sqe_buffers_unregister+0xa2/0x340 io_uring/rsrc.c:799 __io_uring_register io_uring/register.c:424 [inline] __do_sys_io_uring_register+0x5b9/0x2400 io_uring/register.c:613 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xd8/0x270 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 Reported-by: Li Shi <sl1589472800@gmail.com> Fixes: `4ea15b56f0` ("io_uring/rsrc: use wq for quiescing") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/77966bc104e25b0534995d5dbb152332bc8f31c0.1718196953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-12 13:02:12 -06:00
Mateusz Guzik	dff60734fc	vfs: retire user_path_at_empty and drop empty arg from getname_flags No users after do_readlinkat started doing the job on its own. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240604155257.109500-3-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-05 17:03:57 +02:00
Hagar Hemdan	73254a297c	io_uring: fix possible deadlock in io_register_iowq_max_workers() The io_register_iowq_max_workers() function calls io_put_sq_data(), which acquires the sqd->lock without releasing the uring_lock. Similar to the commit `009ad9f0c6` ("io_uring: drop ctx->uring_lock before acquiring sqd->lock"), this can lead to a potential deadlock situation. To resolve this issue, the uring_lock is released before calling io_put_sq_data(), and then it is re-acquired after the function call. This change ensures that the locks are acquired in the correct order, preventing the possibility of a deadlock. Suggested-by: Maximilian Heyne <mheyne@amazon.de> Signed-off-by: Hagar Hemdan <hagarhem@amazon.com> Link: https://lore.kernel.org/r/20240604130527.3597-1-hagarhem@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-04 07:39:17 -06:00
Su Hui	91215f70ea	io_uring/io-wq: avoid garbage value of 'match' in io_wq_enqueue() Clang static checker (scan-build) warning: o_uring/io-wq.c:line 1051, column 3 The expression is an uninitialized value. The computed value will also be garbage. 'match.nr_pending' is used in io_acct_cancel_pending_work(), but it is not fully initialized. Change the order of assignment for 'match' to fix this problem. Fixes: `42abc95f05` ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Su Hui <suhui@nfschina.com> Link: https://lore.kernel.org/r/20240604121242.2661244-1-suhui@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-04 07:39:00 -06:00
Jens Axboe	415ce0ea55	io_uring/napi: fix timeout calculation Not quite sure what __io_napi_adjust_timeout() was attemping to do, it's adjusting both the NAPI timeout and the general overall timeout, and calculating a value that is never used. The overall timeout is a super set of the NAPI timeout, and doesn't need adjusting. The only thing we really need to care about is that the NAPI timeout doesn't exceed the overall timeout. If a user asked for a timeout of eg 5 usec and NAPI timeout is 10 usec, then we should not spin for 10 usec. While in there, sanitize the time checking a bit. If we have a negative value in the passed in timeout, discard it. Round up the value as well, so we don't end up with a NAPI timeout for the majority of the wait, with only a tiny sleep value at the end. Hence the only case we need to care about is if the NAPI timeout is larger than the overall timeout. If it is, cap the NAPI timeout at what the overall timeout is. Cc: stable@vger.kernel.org Fixes: `8d0c12a80c` ("io-uring: add napi busy poll support") Reported-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-04 07:32:45 -06:00
Jens Axboe	5fc16fa5f1	io_uring: check for non-NULL file pointer in io_file_can_poll() In earlier kernels, it was possible to trigger a NULL pointer dereference off the forced async preparation path, if no file had been assigned. The trace leading to that looks as follows: BUG: kernel NULL pointer dereference, address: 00000000000000b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 67 PID: 1633 Comm: buf-ring-invali Not tainted 6.8.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022 RIP: 0010:io_buffer_select+0xc3/0x210 Code: 00 00 48 39 d1 0f 82 ae 00 00 00 48 81 4b 48 00 00 01 00 48 89 73 70 0f b7 50 0c 66 89 53 42 85 ed 0f 85 d2 00 00 00 48 8b 13 <48> 8b 92 b0 00 00 00 48 83 7a 40 00 0f 84 21 01 00 00 4c 8b 20 5b RSP: 0018:ffffb7bec38c7d88 EFLAGS: 00010246 RAX: ffff97af2be61000 RBX: ffff97af234f1700 RCX: 0000000000000040 RDX: 0000000000000000 RSI: ffff97aecfb04820 RDI: ffff97af234f1700 RBP: 0000000000000000 R08: 0000000000200030 R09: 0000000000000020 R10: ffffb7bec38c7dc8 R11: 000000000000c000 R12: ffffb7bec38c7db8 R13: ffff97aecfb05800 R14: ffff97aecfb05800 R15: ffff97af2be5e000 FS: 00007f852f74b740(0000) GS:ffff97b1eeec0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000b0 CR3: 000000016deab005 CR4: 0000000000370ef0 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x14d/0x420 ? do_user_addr_fault+0x61/0x6a0 ? exc_page_fault+0x6c/0x150 ? asm_exc_page_fault+0x22/0x30 ? io_buffer_select+0xc3/0x210 __io_import_iovec+0xb5/0x120 io_readv_prep_async+0x36/0x70 io_queue_sqe_fallback+0x20/0x260 io_submit_sqes+0x314/0x630 __do_sys_io_uring_enter+0x339/0xbc0 ? __do_sys_io_uring_register+0x11b/0xc50 ? vm_mmap_pgoff+0xce/0x160 do_syscall_64+0x5f/0x180 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0x55e0a110a67e Code: ba cc 00 00 00 45 31 c0 44 0f b6 92 d0 00 00 00 31 d2 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 90 89 30 eb a9 0f 1f 40 00 48 8b 42 20 8b 00 a8 06 75 af 85 f6 because the request is marked forced ASYNC and has a bad file fd, and hence takes the forced async prep path. Current kernels with the request async prep cleaned up can no longer hit this issue, but for ease of backporting, let's add this safety check in here too as it really doesn't hurt. For both cases, this will inevitably end with a CQE posted with -EBADF. Cc: stable@vger.kernel.org Fixes: `a76c0b31ee` ("io_uring: commit non-pollable provided mapped buffers upfront") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-01 12:25:35 -06:00
Jens Axboe	18414a4a2e	io_uring/net: assign kmsg inq/flags before buffer selection syzbot reports that recv is using an uninitialized value: ===================================================== BUG: KMSAN: uninit-value in io_req_cqe_overflow io_uring/io_uring.c:810 [inline] BUG: KMSAN: uninit-value in io_req_complete_post io_uring/io_uring.c:937 [inline] BUG: KMSAN: uninit-value in io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763 io_req_cqe_overflow io_uring/io_uring.c:810 [inline] io_req_complete_post io_uring/io_uring.c:937 [inline] io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763 io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860 io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597 io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651 ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Uninit was stored to memory at: io_req_set_res io_uring/io_uring.h:215 [inline] io_recv_finish+0xf10/0x1560 io_uring/net.c:861 io_recv+0x12ec/0x1ea0 io_uring/net.c:1175 io_issue_sqe+0x429/0x22c0 io_uring/io_uring.c:1751 io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860 io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597 io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651 ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Uninit was created at: slab_post_alloc_hook mm/slub.c:3877 [inline] slab_alloc_node mm/slub.c:3918 [inline] __do_kmalloc_node mm/slub.c:4038 [inline] __kmalloc+0x6e4/0x1060 mm/slub.c:4052 kmalloc include/linux/slab.h:632 [inline] io_alloc_async_data+0xc0/0x220 io_uring/io_uring.c:1662 io_msg_alloc_async io_uring/net.c:166 [inline] io_recvmsg_prep_setup io_uring/net.c:725 [inline] io_recvmsg_prep+0xbe8/0x1a20 io_uring/net.c:806 io_init_req io_uring/io_uring.c:2135 [inline] io_submit_sqe io_uring/io_uring.c:2182 [inline] io_submit_sqes+0x1135/0x2f10 io_uring/io_uring.c:2335 __do_sys_io_uring_enter io_uring/io_uring.c:3246 [inline] __se_sys_io_uring_enter+0x40f/0x3c80 io_uring/io_uring.c:3183 __x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3183 x64_sys_call+0x2c0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:427 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f which appears to be io_recv_finish() reading kmsg->msg.msg_inq to decide if it needs to set IORING_CQE_F_SOCK_NONEMPTY or not. If the recv is entered with buffer selection, but no buffer is available, then we jump error path which calls io_recv_finish() without having assigned kmsg->msg_inq. This might cause an errant setting of the NONEMPTY flag for a request get gets errored with -ENOBUFS. Reported-by: syzbot+b1647099e82b3b349fbf@syzkaller.appspotmail.com Fixes: `4a3223f7bf` ("io_uring/net: switch io_recv() to using io_async_msghdr") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 14:04:37 -06:00
Breno Leitao	e112311615	io_uring/rw: Free iovec before cleaning async data kmemleak shows that there is a memory leak in io_uring read operation, where a buffer is allocated at iovec import, but never de-allocated. The memory is allocated at io_async_rw->free_iovec, but, then io_async_rw is kfreed, taking the allocated memory with it. I saw this happening when the read operation fails with -11 (EAGAIN). This is the kmemleak splat. unreferenced object 0xffff8881da591c00 (size 256): ... backtrace (crc 7a15bdee): [<00000000256f2de4>] __kmalloc+0x2d6/0x410 [<000000007a9f5fc7>] iovec_from_user.part.0+0xc6/0x160 [<00000000cecdf83a>] __import_iovec+0x50/0x220 [<00000000d1d586a2>] __io_import_iovec+0x13d/0x220 [<0000000054ee9bd2>] io_prep_rw+0x186/0x340 [<00000000a9c0372d>] io_prep_rwv+0x31/0x120 [<000000001d1170b9>] io_prep_readv+0xe/0x30 [<0000000070b8eb67>] io_submit_sqes+0x1bd/0x780 [<00000000812496d4>] __do_sys_io_uring_enter+0x3ed/0x5b0 [<0000000081499602>] do_syscall_64+0x5d/0x170 [<00000000de1c5a4d>] entry_SYSCALL_64_after_hwframe+0x76/0x7e This occurs because the async data cleanup functions are not set for read/write operations. As a result, the potentially allocated iovec in the rw async data is not freed before the async data is released, leading to a memory leak. With this following patch, kmemleak does not show the leaked memory anymore, and all liburing tests pass. Fixes: `a9165b83c1` ("io_uring/rw: always setup io_async_rw for read/write requests") Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240530142340.1248216-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 08:33:01 -06:00
Jens Axboe	06fe9b1df1	io_uring: don't attempt to mmap larger than what the user asks for If IORING_FEAT_SINGLE_MMAP is ignored, as can happen if an application uses an ancient liburing or does setup manually, then 3 mmap's are required to map the ring into userspace. The kernel will still have collapsed the mappings, however userspace may ask for mapping them individually. If so, then we should not use the full number of ring pages, as it may exceed the partial mapping. Doing so will yield an -EFAULT from vm_insert_pages(), as we pass in more pages than what the application asked for. Cap the number of pages to match what the application asked for, for the particular mapping operation. Reported-by: Lucas Mülling <lmulling@proton.me> Link: https://github.com/axboe/liburing/issues/1157 Fixes: `3ab1db3c60` ("io_uring: get rid of remap_pfn_range() for mapping rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-29 09:53:14 -06:00
Linus Torvalds	483a351ed4	io_uring-6.10-20240523 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZPahYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpu+CD/0V3y0Nok87IE8B+gKNVFO3yLZai+1iNVe3 wjLjHSOXPleycJaYWSiDo7ujA6kYY6CAvKH1KpjHdTiWvemh6hfClvA4a6kdigTh EB2MOsJcIKhRSS0PyJ+WIK+rIQspP50es9S48HjPdmJ/NtdOJXa4nKOMe6K+tK+N nAkWFjjEvwMO0Sgzx23sjU5lWqw1eJb5TeeA8dYpJtlDeQ3+Py7Msugzvuis176/ ElW8xNyja24OBJjurLLPFr7cAigeT9ra7ciDEzBlL6O5cvf+SrMW++ihgy8TJWbf nbIv8KpNgBNq3h658rLi3cql1hRhRaYpwRiLaek0OYzTb5HO6Xb8WLC1iND5njFT uO1+S7JPLUFJeCi0vqXtopjnzBKadfO7MYqvXWBEAa8B+J3q502WzTJuJ8uoiNLU Ub/12P3zopt19bKE5FMYktNgdHVXYAKC6JxbqXVYtn/aMNypLMnw/XJDdsvHpLjb Y6D3PNTtYya1cil24AvrdA3Kv/lEyBLPurrqmq2aHgxUhuAGbXCJpz7boHkK3AKj ESjz4IeVl1R2EAsYIkfYPlDEOjJN+p6PgmfUEWteREg0tpZsBmSr3VI7JMuKN9FD cisCa30nXWR8Pu4pURocyXZW7INdVODbIPDF1k28mwYAo92l4pAntaREtNOoBtHk FqN2gO/Z9A== =+97D -----END PGP SIGNATURE----- Merge tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linux Pull io_uring fixes from Jens Axboe: "Single fix here for a regression in 6.9, and then a simple cleanup removing some dead code" * tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linux: io_uring: remove checks for NULL 'sq_offset' io_uring/sqpoll: ensure that normal task_work is also run timely	2024-05-23 13:41:49 -07:00
Jens Axboe	547988ad0f	io_uring: remove checks for NULL 'sq_offset' Since the 5.12 kernel release, nobody has been passing NULL as the sq_offset pointer. Remove the checks for it being NULL or not, it will always be valid. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-22 11:13:44 -06:00
Linus Torvalds	b6394d6f71	Assorted commits that had missed the last merge window... -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkzp/gAKCRBZ7Krx/gZQ 63KFAQCsKv3XdcF+2BO+QuwPvR6eAvDxFjrFEcQFyyOXgFVLaAD/UMM0HcEFWxBb PCPvyKVP22wF9PbodkrKJn8DRdtRZwM= =jvWv -----END PGP SIGNATURE----- Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted commits that had missed the last merge window..." * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove call_{read,write}_iter() functions do_dentry_open(): kill inode argument kernel_file_open(): get rid of inode argument get_file_rcu(): no need to check for NULL separately fd_is_open(): move to fs/file.c close_on_exec(): pass files_struct instead of fdtable	2024-05-21 13:11:44 -07:00
Jens Axboe	d13ddd9c89	io_uring/sqpoll: ensure that normal task_work is also run timely With the move to private task_work, SQPOLL neglected to also run the normal task_work, if any is pending. This will eventually get run, but we should run it with the private task_work to ensure that things like a final fput() is processed in a timely fashion. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/313824bc-799d-414f-96b7-e6de57c7e21d@gmail.com/ Reported-by: Andrew Udvare <audvare@gmail.com> Fixes: `af5d68f889` ("io_uring/sqpoll: manage task_work privately") Tested-by: Christian Heusel <christian@heusel.eu> Tested-by: Andrew Udvare <audvare@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-21 13:41:14 -06:00
Linus Torvalds	61307b7be4	The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB nvA4E0DcPrUAFy144FNM0NTCb7u9vAw= =V3R/ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...	2024-05-19 09:21:03 -07:00
Linus Torvalds	89721e3038	net-accept-more-20240515 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+ 4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8 UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq 6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9 8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5 c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe 0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ 4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x xLq2iq+ehw== =S6Xt -----END PGP SIGNATURE----- Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept requests. This is very similar to previous work that enabled the same hint for doing receives on sockets. By far the majority of the work here is refactoring to enable the networking side to pass back whether or not the socket had more pending requests after accepting the current one, the last patch just wires it up for io_uring. Not only does this enable applications to know whether there are more connections to accept right now, it also enables smarter logic for io_uring multishot accept on whether to retry immediately or wait for a poll trigger" * tag 'net-accept-more-20240515' of git://git.kernel.dk/linux: io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept net: pass back whether socket was empty post accept net: have do_accept() take a struct proto_accept_arg argument net: change proto and proto_ops accept type	2024-05-18 10:32:39 -07:00
Jens Axboe	ac287da2e0	io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept If the given protocol supports passing back whether or not we had more pending accept post this one, pass back this information to userspace. This is done by setting IORING_CQE_F_SOCK_NONEMPTY in the CQE flags, just like we do for recv/recvmsg if there's more data available post a receive operation. We can also use this information to be smarter about multishot retry, as we don't need to do a pointless retry if we know for a fact that there aren't any more connections to accept. Suggested-by: Norman Maurer <norman_maurer@apple.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-13 18:19:23 -06:00
Jens Axboe	0645fbe760	net: have do_accept() take a struct proto_accept_arg argument In preparation for passing in more information via this API, change do_accept() to take a proto_accept_arg struct pointer rather than just the file flags separately. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-13 18:19:19 -06:00
Linus Torvalds	9961a78594	for-6.10/io_uring-20240511 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YdYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnmVEADBq8QT9Oa3HTIONHwxjmGMOalr7PSrBP89 S6Inv/l+3xDlyolyLh1HIXUC84iS9Ihi2pNC3dZct4fNcpA99H0CFaHDGwZ5rVri MrFaubZAps1qSzeypqEq3zWGKVUoaYWaOKhuOjye5Ei2tKymbguhDKl1WiKibD21 E9qOYbhSUFdub/xtx9Rv4BS05QW5bHZ2Y/tTFqB8MY4JUsdb9g/deVZkyGUQYRSd 40mDallRldjQQTQ8iU4H6/ORdGIN/90aLPbmzMdFtQcymnmRyid3rOEwhwWYe4NO ljnI8m1SJQilZz1d5oHBXBB5QubVptY1JWxbk8GQCSmOU5wrCq+ARCJXUtBXwniJ K4VFsGm9MkZcc5vsIwIzvsrk8DODla6EVo/jyDy8iFceZcNWfVxdwa5NS67V/6QT macbF785XDsmA5E4UjslbZqU047w+A5N1yazcZWzMk0coJDeB8AtsA1/C2WZOm8p HVoiAzsqt81hvPItnjCyZluL/YW+BKeOTnq04QbpQKcJpZBzszO4ZLtuD+IXkE69 8ZZPGFPnPS4ZMQojKkwsBr+Yo65S18oBDkib36mr2lsdnoWTpGq47C7ScUDBbqGm iI7U8tYMnVVkQQHVVmGI4KOr5/4lxxp8398kqCaxfW3D5BQhbtUOF/OBjBHj1ZSV 9aZx87CyhA== =DwAV -----END PGP SIGNATURE----- Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...	2024-05-13 12:48:06 -07:00
Linus Torvalds	1b0aabcc9a	vfs-6.10.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L 0+DNepg4R8PZOHH371eSSsLNRCUCkAs= =SVsU -----END PGP SIGNATURE----- Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_ bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...	2024-05-13 11:40:06 -07:00
Ming Lei	deb1e496a8	io_uring: support to inject result for NOP Support to inject result for NOP so that we can inject failure from userspace. It is very helpful for covering failure handling code in io_uring core change. With nop flags, it becomes possible to add more test features on NOP in future. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240510035031.78874-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-10 06:09:45 -06:00
Ming Lei	3d8f874bd6	io_uring: fail NOP if non-zero op flags is passed in The NOP op flags should have been checked from beginning like any other opcode, otherwise NOP may not be extended with the op flags. Given both liburing and Rust io-uring crate always zeros SQE op flags, just ignore users which play raw NOP uring interface without zeroing SQE, because NOP is just for test purpose. Then we can save one NOP2 opcode. Suggested-by: Jens Axboe <axboe@kernel.dk> Fixes: `2b188cc1bb` ("Add io_uring IO interface") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240510035031.78874-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-10 06:09:45 -06:00
Jens Axboe	d3da8e9859	io_uring/net: add IORING_ACCEPT_POLL_FIRST flag Similarly to how polling first is supported for receive, it makes sense to provide the same for accept. An accept operation does a lot of expensive setup, like allocating an fd, a socket/inode, etc. If no connection request is already pending, this is wasted and will just be cleaned up and freed, only to retry via the usual poll trigger. Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the accept request if poll says we have something to accept. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 12:22:11 -06:00
Jens Axboe	7dcc758cca	io_uring/net: add IORING_ACCEPT_DONTWAIT flag This allows the caller to perform a non-blocking attempt, similarly to how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection attempt, propagate the result to userspace rather than arm poll and wait for a retry. Suggested-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 12:22:03 -06:00
Jens Axboe	340f634aa4	io_uring/filetable: don't unnecessarily clear/reset bitmap If we're updating an existing slot, we clear the slot bitmap only to set it again right after. Just leave the bit set rather than toggle it off and on, and move the unused slot setting into the branch of not already having a file occupy this slot. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-08 08:27:45 -06:00
Breno Leitao	8a56530492	io_uring/io-wq: Use set_bit() and test_bit() at worker->flags Utilize set_bit() and test_bit() on worker->flags within io_uring/io-wq to address potential data races. The structure io_worker->flags may be accessed through various data paths, leading to concurrency issues. When KCSAN is enabled, it reveals data races occurring in io_worker_handle_work and io_wq_activate_free_worker functions. BUG: KCSAN: data-race in io_worker_handle_work / io_wq_activate_free_worker write to 0xffff8885c4246404 of 4 bytes by task 49071 on cpu 28: io_worker_handle_work (io_uring/io-wq.c:434 io_uring/io-wq.c:569) io_wq_worker (io_uring/io-wq.c:?) <snip> read to 0xffff8885c4246404 of 4 bytes by task 49024 on cpu 5: io_wq_activate_free_worker (io_uring/io-wq.c:? io_uring/io-wq.c:285) io_wq_enqueue (io_uring/io-wq.c:947) io_queue_iowq (io_uring/io_uring.c:524) io_req_task_submit (io_uring/io_uring.c:1511) io_handle_tw_list (io_uring/io_uring.c:1198) <snip> Line numbers against commit `18daea77cc` ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm"). These races involve writes and reads to the same memory location by different tasks running on different CPUs. To mitigate this, refactor the code to use atomic operations such as set_bit(), test_bit(), and clear_bit() instead of basic "and" and "or" operations. This ensures thread-safe manipulation of worker flags. Also, move `create_index` to avoid holes in the structure. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240507170002.2269003-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 13:17:19 -06:00
Jens Axboe	59b28a6e37	io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring Move the posting outside the checking and locking, it's cleaner that way. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 17:13:51 -06:00
Gabriel Krisman Bertazi	79996b45f7	io_uring: Require zeroed sqe->len on provided-buffers send When sending from a provided buffer, we set sr->len to be the smallest between the actual buffer size and sqe->len. But, now that we disconnect the buffer from the submission request, we can get in a situation where the buffers and requests mismatch, and only part of a buffer gets sent. Assume: * buf[1]->len = 128; buf[2]->len = 256 * sqe[1]->len = 128; sqe[2]->len = 256 If sqe1 runs first, it picks buff[1] and it's all good. But, if sqe[2] runs first, sqe[1] picks buff[2], and the last half of buff[2] is never sent. While arguably the use-case of different-length sends is questionable, it has already raised confusion with potential users of this feature. Let's make the interface less tricky by forcing the length to only come from the buffer ring entry itself. Fixes: `ac5f71a3d9` ("io_uring/net: add provided buffer support for IORING_OP_SEND") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 14:56:36 -06:00
Pavel Begunkov	19352a1d39	io_uring/notif: disable LAZY_WAKE for linked notifs Notifications may now be linked and thus a single tw can post multiple CQEs, it's not safe to use LAZY_WAKE with them. Disable LAZY_WAKE for now, if that'd prove to be a problem we can count them and pass the expected number of CQEs into __io_req_task_work_add(). Fixes: `6fe4220912` ("io_uring/notif: implement notification stacking") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a5accdb7d2d0d27ebec14f8106e14e0192fae17.1714488419.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-30 13:06:27 -06:00
Pavel Begunkov	ef42b85a56	io_uring/net: fix sendzc lazy wake polling SEND[MSG]_ZC produces multiple CQEs via notifications, LAZY_WAKE doesn't handle it and so disable LAZY_WAKE for sendzc polling. It should be fine, sends are not likely to be polled in the first place. Fixes: `6ce4a93dbb` ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5b360fb352d91e3aec751d75c87dfb4753a084ee.1714488419.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-30 13:06:27 -06:00
linke li	a4d416dc60	io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it In io_msg_exec_remote(), ctx->submitter_task is read using READ_ONCE at the beginning of the function, checked, and then re-read from ctx->submitter_task, voiding all guarantees of the checks. Reuse the value that was read by READ_ONCE to ensure the consistency of the task struct throughout the function. Signed-off-by: linke li <lilinke99@qq.com> Link: https://lore.kernel.org/r/tencent_F9B2296C93928D6F68FF0C95C33475C68209@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-26 07:40:12 -06:00
Rick Edgecombe	529ce23a76	mm: switch mm->get_unmapped_area() to a flag The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:25 -07:00
Jens Axboe	039a2e800b	io_uring/rw: reinstate thread check for retries Allowing retries for everything is arguably the right thing to do, now that every command type is async read from the start. But it's exposed a few issues around missing check for a retry (which `cca6571381` exposed), and the fixup commit for that isn't necessarily 100% sound in terms of iov_iter state. For now, just revert these two commits. This unfortunately then re-opens the fact that -EAGAIN can get bubbled to userspace for some cases where the kernel very well could just sanely retry them. But until we have all the conditions covered around that, we cannot safely enable that. This reverts commit `df604d2ad4`. This reverts commit `cca6571381`. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-25 09:04:32 -06:00
Pavel Begunkov	6fe4220912	io_uring/notif: implement notification stacking The network stack allows only one ubuf_info per skb, and unlike MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate ubuf_info. That means that send requests can't reuse a previosly allocated skb and need to get one more or more of new ones. That's fine for large sends, but otherwise it would spam the stack with lots of skbs carrying just a little data each. To help with that implement linking notification (i.e. an io_uring wrapper around ubuf_info) into a list. Each is refcounted by skbs and the stack as usual. additionally all non head entries keep a reference to the head, which they put down when their refcount hits 0. When the head have no more users, it'll efficiently put all notifications in a batch. As mentioned previously about ->io_link_skb, the callback implementation always allows to bind to an skb without a ubuf_info. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bf1e7f9b72f9ecc99999fdc0d2cded5eea87fd0b.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 19:31:18 -06:00
Pavel Begunkov	5a569469b9	io_uring/notif: simplify io_notif_flush() io_notif_flush() is partially duplicating io_tx_ubuf_complete(), so instead of duplicating it, make the flush call io_tx_ubuf_complete. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19e41652c16718b946a5c80d2ad409df7682e47e.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 19:31:18 -06:00
Jens Axboe	3830fff399	Merge branch 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-6.10/io_uring Merge net changes required for the upcoming send zerocopy improvements. * 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux: net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 19:30:05 -06:00
Pavel Begunkov	7ab4f16f9e	net: extend ubuf_info callback to ops structure We'll need to associate additional callbacks with ubuf_info, introduce a structure holding ubuf_info callbacks. Apart from a more smarter io_uring notification management introduced in next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-22 16:21:35 -07:00
Jens Axboe	2f9c9515bd	io_uring/net: support bundles for recv If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs buffers available and receives into them, posting a single completion for all of it. This can be used with multishot receive as well, or without it. Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 11:26:11 -06:00
Jens Axboe	a05d1f625c	io_uring/net: support bundles for send If IORING_OP_SEND is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea is that an application can fill outgoing buffers in a provided buffer group, and then arm a single send that will service them all. Once there are no more buffers to send, or if the requested length has been sent, the request posts a single completion for all the buffers. This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming in a separate patch. However, this patch does do a lot of the prep work that makes wiring up the sendmsg variant pretty trivial. They share the prep side. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 11:26:05 -06:00
Jens Axboe	35c8711c8f	io_uring/kbuf: add helpers for getting/peeking multiple buffers Our provided buffer interface only allows selection of a single buffer. Add an API that allows getting/peeking multiple buffers at the same time. This is only implemented for the ring provided buffers. It could be added for the legacy provided buffers as well, but since it's strongly encouraged to use the new interface, let's keep it simpler and just provide it for the new API. The legacy interface will always just select a single buffer. There are two new main functions: io_buffers_select(), which selects up as many buffers as it can. The caller supplies the iovec array, and io_buffers_select() may allocate a bigger array if the 'out_len' being passed in is non-zero and bigger than what fits in the provided iovec. Buffers grabbed with this helper are permanently assigned. io_buffers_peek(), which works like io_buffers_select(), except they can be recycled, if needed. Callers using either of these functions should call io_put_kbufs() rather than io_put_kbuf() at completion time. The peek interface must be called with the ctx locked from peek to completion. This add a bit state for the request: - REQ_F_BUFFERS_COMMIT, which means that the the buffers have been peeked and should be committed to the buffer ring head when they are put as part of completion. Prior to this, req->buf_list was cleared to NULL when committed. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 11:26:01 -06:00
Jens Axboe	ac5f71a3d9	io_uring/net: add provided buffer support for IORING_OP_SEND It's pretty trivial to wire up provided buffer support for the send side, just like how it's done the receive side. This enables setting up a buffer ring that an application can use to push pending sends to, and then have a send pick a buffer from that ring. One of the challenges with async IO and networking sends is that you can get into reordering conditions if you have more than one inflight at the same time. Consider the following scenario where everything is fine: 1) App queues sendA for socket1 2) App queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, completes successfully, posts CQE 5) sendB is issued, completes successfully, posts CQE All is fine. Requests are always issued in-order, and both complete inline as most sends do. However, if we're flooding socket1 with sends, the following could also result from the same sequence: 1) App queues sendA for socket1 2) App queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, socket1 is full, poll is armed for retry 5) Space frees up in socket1, this triggers sendA retry via task_work 6) sendB is issued, completes successfully, posts CQE 7) sendA is retried, completes successfully, posts CQE Now we've sent sendB before sendA, which can make things unhappy. If both sendA and sendB had been using provided buffers, then it would look as follows instead: 1) App queues dataA for sendA, queues sendA for socket1 2) App queues dataB for sendB queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, socket1 is full, poll is armed for retry 5) Space frees up in socket1, this triggers sendA retry via task_work 6) sendB is issued, picks first buffer (dataA), completes successfully, posts CQE (which says "I sent dataA") 7) sendA is retried, picks first buffer (dataB), completes successfully, posts CQE (which says "I sent dataB") Now we've sent the data in order, and everybody is happy. It's worth noting that this also opens the door for supporting multishot sends, as provided buffers would be a prerequisite for that. Those can trigger either when new buffers are added to the outgoing ring, or (if stalled due to lack of space) when space frees up in the socket. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 11:25:56 -06:00
Jens Axboe	3e747dedd4	io_uring/net: add generic multishot retry helper This is just moving io_recv_prep_retry() higher up so it can get used for sends as well, and rename it to be generically useful for both sends and receives. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-22 11:25:49 -06:00
Jens Axboe	df604d2ad4	io_uring/rw: ensure retry condition isn't lost A previous commit removed the checking on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Cleanup the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: `cca6571381` ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 09:23:55 -06:00
Gabriel Krisman Bertazi	24c3fc5c75	io-wq: Drop intermediate step between pending list and active work next_work is only used to make the work visible for cancellation. Instead, we can just directly write to cur_work before dropping the acct_lock and avoid the extra hop. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:20:32 -06:00
Gabriel Krisman Bertazi	068c27e32e	io-wq: write next_work before dropping acct_lock Commit `361aee450c` ("io-wq: add intermediate work step between pending list and active work") closed a race between a cancellation and the work being removed from the wq for execution. To ensure the request is always reachable by the cancellation, we need to move it within the wq lock, which also synchronizes the cancellation. But commit `42abc95f05` ("io-wq: decouple work_list protection from the big wqe->lock") replaced the wq lock here and accidentally reintroduced the race by releasing the acct_lock too early. In other words: worker \| cancellation work = io_get_next_work() \| raw_spin_unlock(&acct->lock); \| \| \| io_acct_cancel_pending_work \| io_wq_worker_cancel() worker->next_work = work Using acct_lock is still enough since we synchronize on it on io_acct_cancel_pending_work. Fixes: `42abc95f05` ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:20:32 -06:00
Miklos Szeredi	7c98f7cb8f	remove call_{read,write}_iter() functions These have no clear purpose. This is effectively a revert of commit `bb7462b6fd` ("vfs: use helpers for calling f_op->{read,write}_iter()"). The patch was created with the help of a coccinelle script. Fixes: `bb7462b6fd` ("vfs: use helpers for calling f_op->{read,write}_iter()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-04-15 16:03:25 -04:00
Jens Axboe	c4ce0ab276	io_uring/sqpoll: work around a potential audit memory leak kmemleak complains that there's a memory leak related to connect handling: unreferenced object 0xffff0001093bdf00 (size 128): comm "iou-sqp-455", pid 457, jiffies 4294894164 hex dump (first 32 bytes): 02 00 fa ea 7f 00 00 01 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 2e481b1a): [<00000000c0a26af4>] kmemleak_alloc+0x30/0x38 [<000000009c30bb45>] kmalloc_trace+0x228/0x358 [<000000009da9d39f>] __audit_sockaddr+0xd0/0x138 [<0000000089a93e34>] move_addr_to_kernel+0x1a0/0x1f8 [<000000000b4e80e6>] io_connect_prep+0x1ec/0x2d4 [<00000000abfbcd99>] io_submit_sqes+0x588/0x1e48 [<00000000e7c25e07>] io_sq_thread+0x8a4/0x10e4 [<00000000d999b491>] ret_from_fork+0x10/0x20 which can can happen if: 1) The command type does something on the prep side that triggers an audit call. 2) The thread hasn't done any operations before this that triggered an audit call inside ->issue(), where we have audit_uring_entry() and audit_uring_exit(). Work around this by issuing a blanket NOP operation before the SQPOLL does anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 13:06:19 -06:00
Pavel Begunkov	d6e295061f	io_uring/notif: shrink account_pages to u32 ->account_pages is the number of pages we account against the user derived from unsigned len, it definitely fits into unsigned, which saves some space in struct io_notif_data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19f2687fcb36daa74d86f4a27bfb3d35cffec318.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:49 -06:00
Pavel Begunkov	2e730d8de4	io_uring/notif: remove ctx var from io_notif_tw_complete We don't need ctx in the hottest path, i.e. registered buffers, let's get it only when we need it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e7345e268ffaeaf79b4c8f3a5d019d6a87a3d1f1.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:49 -06:00
Pavel Begunkov	7e58d0af5a	io_uring/notif: refactor io_tx_ubuf_complete() Flip the dec_and_test "if", that makes the function extension easier in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/43939e2b04dff03bff5d7227c98afedf951227b3.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:49 -06:00
Jens Axboe	686b56cbee	io_uring: ensure overflow entries are dropped when ring is exiting A previous consolidation cleanup missed handling the case where the ring is dying, and __io_cqring_overflow_flush() doesn't flush entries if the CQ ring is already full. This is fine for the normal CQE overflow flushing, but if the ring is going away, we need to flush everything, even if it means simply freeing the overflown entries. Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:27 -06:00
Ruyi Zhang	4d0f4a5413	io_uring/timeout: remove duplicate initialization of the io_timeout list. In the __io_timeout_prep function, the io_timeout list is initialized twice, removing the meaningless second initialization. Signed-off-by: Ruyi Zhang <ruyi.zhang@samsung.com> Link: https://lore.kernel.org/r/20240411055953.2029218-1-ruyi.zhang@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:27 -06:00
Pavel Begunkov	6b231248e9	io_uring: consolidate overflow flushing Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill() into a single function as it once was, it's easier to work with it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:27 -06:00
Pavel Begunkov	8d09a88ef9	io_uring: always lock __io_cqring_overflow_flush Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	408024b959	io_uring: open code io_cqring_overflow_flush() There is only one caller of io_cqring_overflow_flush(), open code it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	e45ec969d1	io_uring: remove extra SQPOLL overflow flush `c1edbf5f08` ("io_uring: flag SQPOLL busy condition to userspace") added an extra overflowed CQE flush in the SQPOLL submission path due to backpressure, which was later removed. Remove the flush and let io_cqring_wait() / iopoll handle it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	a5bff51850	io_uring: unexport io_req_cqe_overflow() There are no users of io_req_cqe_overflow() apart from io_uring.c, make it static. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	8c9a6f549e	io_uring: separate header for exported net bits We're exporting some io_uring bits to networking, e.g. for implementing a net callback for io_uring cmds, but we don't want to expose more than needed. Add a separate header for networking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20240409210554.1878789-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	d285da7dbd	io_uring/net: set MSG_ZEROCOPY for sendzc in advance We can set MSG_ZEROCOPY at the preparation step, do it so we don't have to care about it later in the issue callback. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	6b7f864bb7	io_uring/net: get rid of io_notif_complete_tw_ext io_notif_complete_tw_ext() can be removed and combined with io_notif_complete_tw to make it simpler without sacrificing anything. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	998632921d	io_uring/net: merge ubuf sendzc callbacks Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a pre mature optimisation that doesn't give us much. Merge the functions into one and reclaim some simplicity back. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d44d68f6f7add33a0dcf0b7fd7b73c2dc543604f.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Ming Lei	bbbef3e9d2	io_uring: return void from io_put_kbuf_comp() The only caller doesn't handle the return value of io_put_kbuf_comp(), so change its return type into void. Also follow Jens's suggestion to rename it as io_put_kbuf_drop(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	c29006a245	io_uring: remove io_req_put_rsrc_locked() io_req_put_rsrc_locked() is a weird shim function around io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding ->uring_lock, so we can just use it directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	d9713ad3fa	io_uring: remove async request cache io_req_complete_post() was a sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynhronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	de96e9ae69	io_uring: turn implicit assumptions into a warning io_req_complete_post() is now io-wq only and shouldn't be used outside of it, i.e. it relies that io-wq holds a ref for the request as explained in a comment below. Let's add a warning to enforce the assumption and make sure nobody would try to do anything weird. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Ming Lei	f39130004d	io_uring: kill dead code in io_req_complete_post Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"), io_req_complete_post() is only called from io-wq submit work, where the request reference is guaranteed to be grabbed and won't drop to zero in io_req_complete_post(). Kill the dead code, meantime add req_ref_put() to put the reference. Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	285207f67c	io_uring/kbuf: remove dead define We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	1da2f311ba	io_uring: fix warnings on shadow variables There are a few of those: io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow] 170 \| struct file f = io_file_from_index(&ctx->file_table, i); \| ^ io_uring/fdinfo.c:53:67: note: previous declaration is here 53 \| __cold void io_uring_show_fdinfo(struct seq_file m, struct file f) \| ^ io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow] 187 \| struct io_uring_task tctx = node->task->io_uring; \| ^ io_uring/cancel.c:166:31: note: previous declaration is here 166 \| struct io_uring_task tctx, \| ^ io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow] 371 \| struct io_uring_task tctx = node->task->io_uring; \| ^ io_uring/register.c:312:24: note: previous declaration is here 312 \| struct io_uring_task *tctx = NULL; \| ^ and a simple cleanup gets rid of them. For the fdinfo case, make a distinction between the file being passed in (for the ring), and the registered files we iterate. For the other two cases, just get rid of shadowed variable, there's no reason to have a new one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	f15ed8b4d0	io_uring: move mapping/allocation helpers to a separate file Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	18595c0a58	io_uring: use unpin_user_pages() where appropriate There are a few cases of open-rolled loops around unpin_user_page(), use the generic helper instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	87585b0575	io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_page() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, lookup the buffer_list under RCU, as the the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	e270bfd22a	io_uring/kbuf: vmap pinned buffer ring This avoids needing to care about HIGHMEM, and it makes the buffer indexing easier as both ring provided buffer methods are now virtually mapped in a contigious fashion. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	1943f96b38	io_uring: unify io_pin_pages() Move it into io_uring.c where it belongs, and use it in there as well rather than have two implementations of this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	09fc75e0c0	io_uring: use vmap() for ring mapping This is the last holdout which does odd page checking, convert it to vmap just like what is done for the non-mmap path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	3ab1db3c60	io_uring: get rid of remap_pfn_range() for mapping rings/sqes Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. If possible, allocate a single compound page that covers the range that is needed. If that works, then we can just use page_address() on that page. If we fail to get a compound page, allocate single pages and use vmap() to map them into the kernel virtual address space. This just covers the rings/sqes, the other remaining user of the mmap remap_pfn_range() user will be converted separately. Once that is done, we can kill the old alloc/free code. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	22537c9f79	io_uring: use the right type for work_llist empty check io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's not an io_wq_work_list, it's a struct llist_head. They both have ->first as head-of-list, and it turns out the checks are identical. But be proper and use the right helper. Fixes: `dac6a0eae7` ("io_uring: ensure iopoll runs local task work as well") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Joel Granados	a80929d1cd	io_uring: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from kernel_io_uring_disabled_table Signed-off-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jiapeng Chong	4e9706c6c8	io_uring: Remove unused function The function are defined in the io_uring.c file, but not called elsewhere, so delete the unused function. io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	77a1cd5e79	io_uring: re-arrange Makefile order The object list is a bit of a mess, with core and opcode files mixed in. Re-arrange it so that we have the core bits first, and then opcode specific files after that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	05eb5fe226	io_uring: refill request cache in memory order The allocator will generally return memory in order, but __io_alloc_req_refill() then adds them to a stack and we'll extract them in the opposite order. This obviously isn't a huge deal, but: 1) it makes debugging easier when they are in order 2) keeping them in-order is the right thing to do 3) reduces the code for adding them to the stack Just add them in reverse to the stack. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	da22bdf38b	io_uring/poll: shrink alloc cache size to 32 This should be plenty, rather than the default of 128, and matches what we have on the rsrc and futex side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	414d0f45c3	io_uring/alloc_cache: switch to array based caching Currently lists are being used to manage this, but best practice is usually to have these in an array instead as that it cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	e10677a8f6	io_uring: drop ->prep_async() It's now unused, drop the code related to it. This includes the io_issue_defs->manual alloc field. While in there, and since ->async_size is now being used a bit more frequently and in the issue path, move it to io_issue_defs[]. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	5eff57fa9f	io_uring/uring_cmd: defer SQE copying until it's needed The previous commit turned on async data for uring_cmd, and did the basic conversion of setting everything up on the prep side. However, for a lot of use cases, -EIOCBQUEUED will get returned on issue, as the operation got successfully queued. For that case, a persistent SQE isn't needed, as it's just used for issue. Unless execution goes async immediately, defer copying the double SQE until it's necessary. This greatly reduces the overhead of such commands, as evidenced by a perf diff from before and after this change: 10.60% -8.58% [kernel.vmlinux] [k] io_uring_cmd_prep where the prep side drops from 10.60% to ~2%, which is more expected. Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back to where it was before the async command prep. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d10f19dff5	io_uring/uring_cmd: switch to always allocating async data Basic conversion ensuring async_data is allocated off the prep path. Adds a basic alloc cache as well, as passthrough IO can be quite high in rate. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	e2ea5a7069	io_uring/net: move connect to always using async data While doing that, get rid of io_async_connect and just use the generic io_async_msghdr. Both of them have a struct sockaddr_storage in there, and while io_async_msghdr is bigger, if the same type can be used then the netmsg_cache can get reused for connect as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d6f911a6b2	io_uring/rw: add iovec recycling Let the io_async_rw hold on to the iovec and reuse it, rather than always allocate and free them. Also enables KASAN for the iovec entries, so that reuse can be detected even while they are in the cache. While doing so, shrink io_async_rw by getting rid of the bigger embedded fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1. This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	cca6571381	io_uring/rw: cleanup retry path We no longer need to gate a potential retry on whether or not the context matches our original task, as all read/write operations have been fully prepared upfront. This means there's never any re-import needed, and hence we can always retry requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	0d10bd77a1	io_uring: get rid of struct io_rw_state A separate state struct is not needed anymore, just fold it in with io_async_rw. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	a9165b83c1	io_uring/rw: always setup io_async_rw for read/write requests read/write requests try to put everything on the stack, and then alloc and copy if a retry is needed. This necessitates a bunch of nasty code that deals with intermediate state. Get rid of this, and have the prep side setup everything that is needed upfront, which greatly simplifies the opcode handlers. This includes adding an alloc cache for io_async_rw, to make it cheap to handle. In terms of cost, this should be basically free and transparent. For the worst case of {READ,WRITE}_FIXED which didn't need it before, performance is unaffected in the normal peak workload that is being used to test that. Still runs at 122M IOPS. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d80f940701	io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup() Now that iovec recycling is being done, the iovec is no longer being freed in there. Hence the kmsg parameter is now useless. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	7519134178	io_uring/net: add iovec recycling Right now the io_async_msghdr is recycled to avoid the overhead of allocating+freeing it for every request. But the iovec is not included, hence that will be allocated and freed for each transfer regardless. This commit enables recyling of the iovec between io_async_msghdr recycles. This avoids alloc+free for each one if an iovec is used, and on top of that, it extends the cache hot nature of msg to the iovec as well. Also enables KASAN for the iovec entries, so that reuse can be detected even while they are in the cache. The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte saving (or ~23% smaller), as the fast_iovec entry is dropped from 8 entries to a single entry. There's no point keeping a big fast iovec entry, if iovecs aren't being allocated and freed continually. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	9f8539fe29	io_uring/net: remove (now) dead code in io_netmsg_recycle() All net commands have async data at this point, there's no reason to check if this is the case or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	6498c5c97c	io_uring: kill io_msg_alloc_async_prep() We now ONLY call io_msg_alloc_async() from inside prep handling, which is always locked. No need for this helper anymore, or the check in io_msg_alloc_async() on whether the ring is locked or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	50220d6ac8	io_uring/net: get rid of ->prep_async() for send side Move the io_async_msghdr out of the issue path and into prep handling, e it's now done unconditionally and hence does not need to be part of the issue path. This means any usage of io_sendrecv_prep_async() and io_sendmsg_prep_async(), and hence the forced async setup path is now unified with the normal prep setup. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	c6f32c7d9e	io_uring/net: get rid of ->prep_async() for receive side Move the io_async_msghdr out of the issue path and into prep handling, since it's now done unconditionally and hence does not need to be part of the issue path. This reduces the footprint of the multishot fast path of multiple invocations of ->issue() per prep, and also means that using ->prep_async() can be dropped for recvmsg asthis is now done via setup on the prep side. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	3ba8345aec	io_uring/net: always set kmsg->msg.msg_control_user before issue We currently set this separately for async/sync entry, but let's just move it to a generic pre-issue spot and eliminate the difference between the two. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	790b68b32a	io_uring/net: always setup an io_async_msghdr Rather than use an on-stack one and then need to allocate and copy if async execution is required, always grab one upfront. This should be very cheap, and potentially even have cache hotness benefits for back-to-back send/recv requests. For any recv type of request, this is probably a good choice in general, as it's expected that no data is available initially. For send this is not necessarily the case, as space in the socket buffer is expected to be available. However, getting a cached io_async_msghdr is very cheap, and as it should be cache hot, probably the difference here is neglible, if any. A nice side benefit is that io_setup_async_msg can get killed completely, which has some nasty iovec manipulation code. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	f5b00ab222	io_uring/net: unify cleanup handling Now that recv/recvmsg both do the same cleanup, put it in the retry and finish handlers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	4a3223f7bf	io_uring/net: switch io_recv() to using io_async_msghdr No functional changes in this patch, just in preparation for carrying more state than what is available now, if necessary. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	54cdcca05a	io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr No functional changes in this patch, just in preparation for carrying more state then what is being done now, if necessary. While unifying some of this code, add a generic send setup prep handler that they can both use. This gets rid of some manual msghdr and sockaddr on the stack, and makes it look a bit more like the sendmsg/recvmsg variants. Going forward, more can get unified on top. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	0ae9b9a14d	io_uring/alloc_cache: shrink default max entries from 512 to 128 In practice, we just need to recycle a few elements for (by far) most use cases. Shrink the total size down from 512 to 128, which should be more than plenty. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	29f858a7c6	io_uring: remove timeout/poll specific cancelations For historical reasons these were special cased, as they were the only ones that needed cancelation. But now we handle cancelations generally, and hence there's no need to check for these in io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles both these and the rest as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	2541762342	io_uring: flush delayed fallback task_work in cancelation Just like we run the inline task_work, ensure we also factor in and run the fallback task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	c133b3b06b	io_uring: clean up io_lockdep_assert_cq_locked Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked() and kill the else branch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	0667db14e1	io_uring: refactor io_req_complete_post() Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests to task_work, it's much cleaner and should normally happen. We couldn't do it before because there was a possibility of looping in complete_post() -> tw -> complete_post() -> ... Also, unexport the function and inline __io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	23fbdde620	io_uring: remove current check from complete_post task_work execution is now always locked, and we shouldn't get into io_req_complete_post() from them. That means that complete_post() is always called out of the original task context and we don't even need to check current. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	902ce82c2a	io_uring: get rid of intermediate aux cqe caches io_post_aux_cqe(), which is used for multishot requests, delays completions by putting CQEs into a temporary array for the purpose completion lock/flush batching. DEFER_TASKRUN doesn't need any locking, so for it we can put completions directly into the CQ and defer post completion handling with a flag. That leaves !DEFER_TASKRUN, which is not that interesting / hot for multishot requests, so have conditional locking with deferred flush for them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	e5c12945be	io_uring: refactor io_fill_cqe_req_aux The restriction on multishot execution context disallowing io-wq is driven by rules of io_fill_cqe_req_aux(), it should only be called in the master task context, either from the syscall path or in task_work. Since task_work now always takes the ctx lock implying IO_URING_F_COMPLETE_DEFER, we can just assume that the function is always called with its defer argument set to true. Kill the argument. Also rename the function for more consistency as "fill" in CQE related functions was usually meant for raw interfaces only copying data into the CQ without any locking, waking the user and other accounting "post" functions take care of. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	8e5b3b89ec	io_uring: remove struct io_tw_state::locked ctx is always locked for task_work now, so get rid of struct io_tw_state::locked. Note I'm stopping one step before removing io_tw_state altogether, which is not empty, because it still serves the purpose of indicating which function is a tw callback and forcing users not to invoke them carelessly out of a wrong context. The removal can always be done later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	92219afb98	io_uring: force tw ctx locking We can run normal task_work without locking the ctx, however we try to lock anyway and most handlers prefer or require it locked. It might have been interesting to multi-submitter ring with high contention completing async read/write requests via task_work, however that will still need to go through io_req_complete_post() and potentially take the lock for rsrc node putting or some other case. In other words, it's hard to care about it, so alawys force the locking. The case described would also because of various io_uring caches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	6e6b8c6212	io_uring/rw: avoid punting to io-wq directly kiocb_done() should care to specifically redirecting requests to io-wq. Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let the core code io_uring handle offloading. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	e1eef2e56c	io_uring/cmd: fix tw <-> issue_flags conversion !IO_URING_F_UNLOCKED does not translate to availability of the deferred completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should pass and look for to use io_req_complete_defer() and other variants. Luckily, it's not a real problem as two wrongs actually made it right, at least as far as io_uring_cmd_work() goes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/aef76d34fe9410df8ecc42a14544fd76cd9d8b9e.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	6edd953b6e	io_uring/cmd: kill one issue_flags to tw conversion io_uring cmd converts struct io_tw_state to issue_flags and later back to io_tw_state, it's awfully ill-fated, not to mention that intermediate issue_flags state is not correct. Get rid of the last conversion, drag through tw everything that came with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a direct call to io_req_complete_defer(), at least for the time being. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/c53fa3df749752bd058cf6f824a90704822d6bcc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	da12d9ab58	io_uring/cmd: move io_uring_try_cancel_uring_cmd() io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's move it closer to all cmd bits into uring_cmd.c Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	4fe82aedeb	io_uring/net: restore msg_control on sendzc retry `cac9e4418f` ("io_uring/net: save msghdr->msg_control for retries") reinstatiates msg_control before every __sys_sendmsg_sock(), since the function can overwrite the value in msghdr. We need to do same for zerocopy sendmsg. Cc: stable@vger.kernel.org Fixes: `493108d95f` ("io_uring/net: zerocopy sendmsg") Link: https://github.com/axboe/liburing/issues/1067 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-08 21:48:41 -06:00
Christian Brauner	210a03c9d5	fs: claw back a few FMODE_* bits There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-07 13:49:02 +02:00

... 2 3 4 5 6 ...

1267 Commits