linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2024-12-28 16:52:18 +00:00

Author	SHA1	Message	Date
Gabriel Krisman Bertazi	cd7f9fee71	io_uring/msg_ring: Drop custom destructor kfree can handle slab objects nowadays. Drop the extra callback and just use kfree. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-10-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 13:47:48 -07:00
Gabriel Krisman Bertazi	6cd2993dcd	io_uring: Move old async data allocation helper to header There are two remaining uses of the old async data allocator that do not rely on the alloc cache. I don't want to make them use the new allocator helper because that would require a if(cache) check, which will result in dead code for the cached case (for callers passing a cache, gcc can't prove the cache isn't NULL, and will therefore preserve the check. Since this is an inline function and just a few lines long, keep a second helper to deal with cases where we don't have an async data cache. No functional change intended here. This is just moving the helper around and making it inline. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-9-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 13:47:48 -07:00
Pavel Begunkov	47136e853a	io_uring: prevent reg-wait speculations With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments for the waiting loop the user can specify an offset into a pre-mapped region of memory, in which case the [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the argument. As we address a kernel array using a user given index, it'd be a subject to speculation type of exploits. Use array_index_nospec() to prevent that. Make sure to pass not the full region size but truncate by the maximum offset allowed considering the structure size. Fixes: `d617b3147d` ("io_uring: restore back registered wait arguments") Fixes: `aa00f67adc` ("io_uring: add support for fixed wait regions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e3d9da7c43d619de7bcf41d1cd277ab2688c443.1733694126.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:15 -07:00
Anuj Gupta	4dde0cc445	io_uring: introduce attributes for read/write and PI support Add the ability to pass additional attributes along with read/write. Application can prepare attibute specific information and pass its address using the SQE field: __u64 attr_ptr; Along with setting a mask indicating attributes being passed: __u64 attr_type_mask; Overall 64 attributes are allowed and currently one attribute 'IORING_RW_ATTR_FLAG_PI' is supported. With PI attribute, userspace can pass following information: - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG} - len: length of PI/metadata buffer - addr: address of metadata buffer - seed: seed value for reftag remapping - app_tag: application defined 16b value Process this information to prepare uio_meta_descriptor and pass it down using kiocb->private. PI attribute is supported only for direct IO. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20241128112240.8867-7-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:14 -07:00
Pavel Begunkov	6be74cf5d7	io_uring: use region api for CQ Convert internal parts of the CQ/SQ array managment to the region API. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46fc3c801290d6b1ac16023d78f6b8e685c87fd6.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:14 -07:00
Pavel Begunkov	55ea1ea1c8	io_uring: use region api for SQ Convert internal parts of the SQ managment to the region API. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1fb73ced6b835cb319ab0fe1dc0b2e982a9a5650.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:14 -07:00
Pavel Begunkov	bc4062d81c	io_uring: rename ->resize_lock ->resize_lock is used for resizing rings, but it's a good idea to reuse it in other cases as well. Rename it into mmap_lock as it's protects from races with mmap. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/68f705306f3ac4d2fb999eb80ea1615015ce9f7f.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:13 -07:00
Linus Torvalds	dd54fcced8	io_uring-6.13-20242901 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6igQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjj3D/44ltUzbKLiGRE8wvtyWSFdAeGUT8DA0MTW ot+Tr43PY6+J+v5ClUmgzJYqLRjNUxJAGUWM8Tmr7tZ2UtKwhHX/CEUtbqOEm2Sg e6aofpzR+sXX+ZqZRrLMPj6gLvuklWra+1STyzA6EkcvLiMqsLCY/U8nIm03VW26 ua0kj+5477pEo9Hei4mfLtHCad94IX6UAv5xuh+90Xo9zxdWYA5sCv6SpXlG/5vy VYF8yChIiQC3SBgs1ewALblkm2RsCU59p0/9mOHOeBYzaFnoOV66fHEawWwKF2qM FLp6ZKpFEgxiRW9JpxhUw8Pv0hQx5FWN15FLLTPb/ss4Xo5uFRq8+0fDP8S5U9OT T37sj1nej7adaSjRWkmrgclNggFyhMmoCO9jMWxO1dmWNtHB153xGWNUcd0v/P2+ FdjibQd79Wpq7aWbKPOQORU8rqshNusUVlge/KlvyufEne9EuOQVjGk/i2AEjU5y f1DomdUbEBeGB2FE7w0YYquI0oBOLQvBBk/hQl5pW7rfMgFoU0WAXiZLaJhM0i81 RgbI5FH1rFZtsnJ3kG6HpNPcibK2seip6weNfgZZnDZCSOHiCZbuxi+WBLtupKng 8J+ZXoDjucBVRgrUQRz6Km62oTLJQ/6CcazqrKvLxERa0eB6SNOxZRd1XYNFKacn xIyyyzQj1g== =b84h -----END PGP SIGNATURE----- Merge tag 'io_uring-6.13-20242901' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: - Remove a leftover struct from when the cqwait registered waiting was transitioned to regions. - Fix for an issue introduced in this merge window, where nop->fd might be used uninitialized. Ensure it's always set. - Add capping of the task_work run in local task_work mode, to prevent bursty and long chains from adding too much latency. - Work around xa_store() leaving ->head non-NULL if it encounters an allocation error during storing. Just a debug trigger, and can go away once xa_store() behaves in a more expected way for this condition. Not a major thing as it basically requires fault injection to trigger it. - Fix a few mapping corner cases - Fix KCSAN complaint on reading the table size post unlock. Again not a "real" issue, but it's easy to silence by just keeping the reading inside the lock that protects it. * tag 'io_uring-6.13-20242901' of git://git.kernel.dk/linux: io_uring/tctx: work around xa_store() allocation error issue io_uring: fix corner case forgetting to vunmap io_uring: fix task_work cap overshooting io_uring: check for overflows in io_pin_pages io_uring/nop: ensure nop->fd is always initialized io_uring: limit local tw done io_uring: add io_local_work_pending() io_uring/region: return negative -E2BIG in io_create_region() io_uring: protect register tracing io_uring: remove io_uring_cqwait_reg_arg	2024-11-30 15:43:02 -08:00
Jens Axboe	49c5c63d48	io_uring: fix task_work cap overshooting A previous commit fixed task_work overrunning by a lot more than what the user asked for, by adding a retry list. However, it didn't cap the overall count, hence for multiple task_work runs inside the same wait loop, it'd still overshoot the target by potentially a large amount. Cap it generally inside the wait path. Note that this will still overshoot the default limit of 20, but should overshoot by no more than limit-1 in addition to the limit. That still provides a ceiling over how much task_work will be run, rather than still having gaps where it was uncapped essentially. Fixes: `f46b9cdb22` ("io_uring: limit local tw done") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-26 13:42:27 -07:00
David Wei	f46b9cdb22	io_uring: limit local tw done Instead of eagerly running all available local tw, limit the amount of local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The value of 20 is chosen as a reasonable heuristic to allow enough work batching but also keep latency down. Add a retry_llist that maintains a list of local tw that couldn't be done in time. No synchronisation is needed since it is only modified within the task context. Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20241120221452.3762588-3-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-21 07:11:00 -07:00
David Wei	40cfe55324	io_uring: add io_local_work_pending() In preparation for adding a new llist of tw to retry due to hitting the tw limit, add a helper io_local_work_pending(). This function returns true if there is any local tw pending. For now it only checks ctx->work_llist. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20241120221452.3762588-2-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-21 07:11:00 -07:00
Linus Torvalds	bf9aa14fc5	A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: * Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. * Queuing ignored timer signals onto a seperate ignored list. * Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. * Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure * Move all sleep and timeout functions into one file * Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. * Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. * Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place * Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. * Mostly boring cleanups, fixes, improvements and code movement -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw hZZ4hdM9+3T7cg== =2VC6 -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. - Queuing ignored timer signals onto a seperate ignored list. - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure - Move all sleep and timeout functions into one file - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place - Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. - Mostly boring cleanups, fixes, improvements and code movement" tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits) posix-timers: Fix spurious warning on double enqueue versus do_exit() clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties clocksource/drivers/gpx: Remove redundant casts clocksource/drivers/timer-ti-dm: Fix child node refcount handling dt-bindings: timer: actions,owl-timer: convert to YAML clocksource/drivers/ralink: Add Ralink System Tick Counter driver clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource clocksource/drivers/timer-ti-dm: Don't fail probe if int not found clocksource/drivers:sp804: Make user selectable clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions hrtimers: Delete hrtimer_init_on_stack() alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack() io_uring: Switch to use hrtimer_setup_on_stack() sched/idle: Switch to use hrtimer_setup_on_stack() hrtimers: Delete hrtimer_init_sleeper_on_stack() wait: Switch to use hrtimer_setup_sleeper_on_stack() timers: Switch to use hrtimer_setup_sleeper_on_stack() net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack() futex: Switch to use hrtimer_setup_sleeper_on_stack() fs/aio: Switch to use hrtimer_setup_sleeper_on_stack() ...	2024-11-19 16:35:06 -08:00
Linus Torvalds	8350142a4b	for-6.13/io_uring-20241118 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S3kQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjHVEAC+CITBEcGy+S0IK0BpIAhuA+A621LtqBwy 0z/4MZKXMqvWxcFGQJ9Zr8MvxUnY4KFcssiaR5zk+I9TczNu7mLMuPYD1Gb0Klgz mwuFOylo1CAAC41IABYZZ/0qWbTaW0p8tpaGsTbTNk3tBxuMLB550+APAqC1OE9U bb7rP+FHc5+YGI9/7JNWt7NNTSHvVSO6oxjltCxHr1dRg93Jtr2jaY6letY3epFz TCFyfJlDtK8fPwtYRyG51M4g2Vdp9/4qsfPqvnXwUr9MdWaVh5/TFkyvqDi5sCKM zdK/sjRiimYzvqqKg6bzgYscITUPNk2TG6ZJq5U1L7lrglzVY69c7GIUnNzPrL/y AxQsR5Guxz3bRNYWZ4BKJDH+NNB+cgIFEXDsv72qoUy3HTzA6wOPZYxfjhZhKuG/ DjRwM7NGx5oPiKtpK99IulZttXdmtkH0csuLwKmOzrQskQdTuWyrEtU7UQql7oQ5 Rt3DhMXouzYZMicB8U5Q9gO2I3WN+2VVxXl4sa00LG8KsT6PzLnz4Q2k/1c83S6J rRivRbZAbZ1+BqKvF8T7GgzLCeaLgzbeoxmxj6xr87pf3SYEs2KhQeQ+n/C0HTOt GOcG1+bvh7t2aSvlBPKVCExWI4erwG6wXFhfGKsLW9CmwIMqRNxdePpRWe3Cueyp M3QRJuvTxQ== =bDvp -----END PGP SIGNATURE----- Merge tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Cleanups of the eventfd handling code, making it fully private. - Support for sending a sync message to another ring, without having a ring available to send a normal async message. - Get rid of the separate unlocked hash table, unify everything around the single locked one. - Add support for ring resizing. It can be hard to appropriately size the CQ ring upfront, if the application doesn't know how busy it will be. This results in applications sizing rings for the most busy case, which can be wasteful. With ring resizing, they can start small and grow the ring, if needed. - Add support for fixed wait regions, rather than needing to copy the same wait data tons of times for each wait operation. - Rewrite the resource node handling, which before was serialized per ring. This caused issues with particularly fixed files, where one file waiting on IO could hold up putting and freeing of other unrelated files. Now each node is handled separately. New code is much simpler too, and was a net 250 line reduction in code. - Add support for just doing partial buffer clones, rather than always cloning the entire buffer table. - Series adding static NAPI support, where a specific NAPI instance is used rather than having a list of them available that need lookup. - Add support for mapped regions, and also convert the fixed wait support mentioned above to that concept. This avoids doing special mappings for various planned features, and folds the existing registered wait into that too. - Add support for hybrid IO polling, which is a variant of strict IOPOLL but with an initial sleep delay to avoid spinning too early and wasting resources on devices that aren't necessarily in the < 5 usec category wrt latencies. - Various cleanups and little fixes. * tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux: (79 commits) io_uring/region: fix error codes after failed vmap io_uring: restore back registered wait arguments io_uring: add memory region registration io_uring: introduce concept of memory regions io_uring: temporarily disable registered waits io_uring: disable ENTER_EXT_ARG_REG for IOPOLL io_uring: fortify io_pin_pages with a warning switch io_msg_ring() to CLASS(fd) io_uring: fix invalid hybrid polling ctx leaks io_uring/uring_cmd: fix buffer index retrieval io_uring/rsrc: add & apply io_req_assign_buf_node() io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node' io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers io_uring: avoid normal tw intermediate fallback io_uring/napi: add static napi tracking strategy io_uring/napi: clean up __io_napi_do_busy_loop io_uring/napi: Use lock guards io_uring/napi: improve __io_napi_add io_uring/napi: fix io_napi_entry RCU accesses io_uring/napi: protect concurrent io_napi_entry timeout accesses ...	2024-11-18 17:02:57 -08:00
Pavel Begunkov	d617b3147d	io_uring: restore back registered wait arguments Now we've got a more generic region registration API, place IORING_ENTER_EXT_ARG_REG and re-enable it. First, the user has to register a region with the IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races with already running waiters. With that we should have stable constant values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no READ_ONCE required. The other API difference is that we're now passing byte offsets instead of indexes. The user _must_ align all offsets / pointers to the native word size, failing to do so might but not necessarily has to lead to a failure usually returned as -EFAULT. liburing will be hiding this details from users. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-15 12:28:38 -07:00
Pavel Begunkov	93238e6618	io_uring: add memory region registration Regions will serve multiple purposes. First, with it we can decouple ring/etc. object creation from registration / mapping of the memory they will be placed in. We already have hacks that allow to put both SQ and CQ into the same huge page, in the future we should be able to: region = create_region(io_ring); create_pbuf_ring(io_uring, region, offset=0); create_pbuf_ring(io_uring, region, offset=N); The second use case is efficiently passing parameters. The following patch enables back on top of regions IORING_ENTER_EXT_ARG_REG, which optimises wait arguments. It'll also be useful for request arguments replacing iovecs, msghdr, etc. pointers. Eventually it would also be handy for BPF as well if it comes to fruition. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-15 09:58:34 -07:00
Pavel Begunkov	83e041522e	io_uring: temporarily disable registered waits Disable wait argument registration as it'll be replaced with a more generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing in a few commits so leave it be. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-15 09:58:34 -07:00
Pavel Begunkov	3730aebbda	io_uring: disable ENTER_EXT_ARG_REG for IOPOLL IOPOLL doesn't use the extended arguments, no need for it to support IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL, if anything it leaves more space for future extensions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a35ecd919dbdc17bd5b7932273e317832c531b45.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-15 09:58:34 -07:00
Pavel Begunkov	b9d69371e8	io_uring: fix invalid hybrid polling ctx leaks It has already allocated the ctx by the point where it checks the hybrid poll configuration, plain return leaks the memory. Fixes: `01ee194d1a` ("io_uring: add support for hybrid IOPOLL") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/b57f2608088020501d352fcdeebdb949e281d65b.1731468230.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-13 07:38:04 -07:00
Nam Cao	fc9f59de26	io_uring: Switch to use hrtimer_setup_on_stack() hrtimer_setup_on_stack() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init_on_stack() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/f0d4ac32ec4050710a656cee8385fa4427be33aa.1730386209.git.namcao@linutronix.de	2024-11-07 02:47:06 +01:00
Pavel Begunkov	af0a2ffef0	io_uring: avoid normal tw intermediate fallback When a DEFER_TASKRUN io_uring is terminating it requeues deferred task work items as normal tw, which can further fallback to kthread execution. Avoid this extra step and always push them to the fallback kthread. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Pavel Begunkov	483242714f	io_uring: prevent speculating sq_array indexing The SQ index array consists of user provided indexes, which io_uring then uses to index the SQ, and so it's susceptible to speculation. For all other queues io_uring tracks heads and tails in kernel, and they shouldn't need any special care. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	b6f58a3f4a	io_uring: move struct io_kiocb from task_struct to io_uring_task Rather than store the task_struct itself in struct io_kiocb, store the io_uring specific task_struct. The life times are the same in terms of io_uring, and this avoids doing some dereferences through the task_struct. For the hot path of putting local task references, we can deref req->tctx instead, which we'll need anyway in that function regardless of whether it's local or remote references. This is mostly straight forward, except the original task PF_EXITING check needs a bit of tweaking. task_work is _always_ run from the originating task, except in the fallback case, where it's run from a kernel thread. Replace the potentially racy (in case of fallback work) checks for req->task->flags with current->flags. It's either the still the original task, in which case PF_EXITING will be sane, or it has PF_KTHREAD set, in which case it's fallback work. Both cases should prevent moving forward with the given request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	6ed368cc5d	io_uring: remove task ref helpers They are only used right where they are defined, just open-code them inside io_put_task(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	f03baece08	io_uring: move cancelations to be io_uring_task based Right now the task_struct pointer is used as the key to match a task, but in preparation for some io_kiocb changes, move it to using struct io_uring_task instead. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	6f94cbc29a	io_uring/rsrc: split io_kiocb node type assignments Currently the io_rsrc_node assignment in io_kiocb is an array of two pointers, as two nodes may be assigned to a request - one file node, and one buffer node. However, the buffer node can co-exist with the provided buffers, as currently it's not supported to use both provided and registered buffers at the same time. This crucially brings struct io_kiocb down to 4 cache lines again, as before it spilled into the 5th cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:36 -07:00
hexue	01ee194d1a	io_uring: add support for hybrid IOPOLL A new hybrid poll is implemented on the io_uring layer. Once an IO is issued, it will not poll immediately, but rather block first and re-run before IO complete, then poll to reap IO. While this poll method could be a suboptimal solution when running on a single thread, it offers performance lower than regular polling but higher than IRQ, and CPU utilization is also lower than polling. To use hybrid polling, the ring must be setup with both the IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid polling has the same restrictions as IOPOLL, in that commands must explicitly support it. Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	d50f94d761	io_uring/rsrc: get rid of the empty node and dummy_ubuf The empty node was used as a placeholder for a sparse entry, but it didn't really solve any issues. The caller still has to check for whether it's the empty node or not, it may as well just check for a NULL return instead. The dummy_ubuf was used for a sparse buffer entry, but NULL will serve the same purpose there of ensuring an -EFAULT on attempted import. Just use NULL for a sparse node, regardless of whether or not it's a file or buffer resource. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	b54a14041e	io_uring/rsrc: add io_rsrc_node_lookup() helper There are lots of spots open-coding this functionality, add a generic helper that does the node lookup in a speculation safe way. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	3597f2786b	io_uring/rsrc: unify file and buffer resource tables For files, there's nr_user_files/file_table/file_data, and buffers have nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and file_data can't be the same thing, and ditto for the buffer side. That gets rid of more io_ring_ctx state that's in two spots rather than just being in one spot, as it should be. Put all the registered file data in one locations, and ditto on the buffer front. This also avoids having both io_rsrc_data->nodes being an allocated array, and ->user_bufs[] or ->file_table.nodes. There's no reason to have this information duplicated. Keep it in one spot, io_rsrc_data, along with how many resources are available. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:23 -06:00
Jens Axboe	f38f284764	io_uring: only initialize io_kiocb rsrc_nodes when needed Add the empty node initializing to the preinit part of the io_kiocb allocation, and reset them if they have been used. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	0701db7439	io_uring/rsrc: add an empty io_rsrc_node for sparse buffer entries Rather than allocate an io_rsrc_node for an empty/sparse buffer entry, add a const entry that can be used for that. This just needs checking for writing the tag, and the put check needs to check for that sparse node rather than NULL for validity. This avoids allocating rsrc nodes for sparse buffer entries. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	fbbb8e991d	io_uring/rsrc: get rid of io_rsrc_node allocation cache It's not going to be needed in the fast path going forward, so kill it off. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	7029acd8a9	io_uring/rsrc: get rid of per-ring io_rsrc_node list Work in progress, but get rid of the per-ring serialization of resource nodes, like registered buffers and files. Main issue here is that one node can otherwise hold up a bunch of other nodes from getting freed, which is especially a problem for file resource nodes and networked workloads where some descriptors may not see activity in a long time. As an example, instantiate an io_uring ring fd and create a sparse registered file table. Even 2 will do. Then create a socket and register it as fixed file 0, F0. The number of open files in the app is now 5, with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4 being the socket. Register this socket (eg "the listener") in slot 0 of the registered file table. Now add an operation on the socket that uses slot 0. Finally, loop N times, where each loop creates a new socket, registers said socket as a file, then unregisters the socket, and finally closes the socket. This is roughly similar to what a basic accept loop would look like. At the end of this loop, it's not unreasonable to expect that there would still be 5 open files. Each socket created and registered in the loop is also unregistered and closed. But since the listener socket registered first still has references to its resource node due to still being active, each subsequent socket unregistration is stuck behind it for reclaim. Hence 5 + N files are still open at that point, where N is awaiting the final put held up by the listener socket. Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct io_kiocb now gets explicit resource nodes assigned, with each holding a reference to the parent node. A parent node is either of type FILE or BUFFER, which are the two types of nodes that exist. A request can have two nodes assigned, if it's using both registered files and buffers. Since request issue and task_work completion is both under the ring private lock, no atomics are needed to handle these references. It's a simple unlocked inc/dec. As before, the registered buffer or file table each hold a reference as well to the registered nodes. Final put of the node will remove the node and free the underlying resource, eg unmap the buffer or put the file. Outside of removing the stall in resource reclaim described above, it has the following advantages: 1) It's a lot simpler than the previous scheme, and easier to follow. No need to specific quiesce handling anymore. 2) There are no resource node allocations in the fast path, all of that happens at resource registration time. 3) The structs related to resource handling can all get simplified quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can go away completely. 4) Handling of resource tags is much simpler, and doesn't require persistent storage as it can simply get assigned up front at registration time. Just copy them in one-by-one at registration time and assign to the resource node. The only real downside is that a request is now explicitly limited to pinning 2 resources, one file and one buffer, where before just assigning a resource node to a request would pin all of them. The upside is that it's easier to follow now, as an individual resource is explicitly referenced and assigned to the request. With this in place, the above mentioned example will be using exactly 5 files at the end of the loop, not N. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:18 -06:00
Jens Axboe	aaa736b186	io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache Doesn't matter right now as there's still some bytes left for it, but let's prepare for the io_kiocb potentially growing and add a specific freeptr offset for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	aa00f67adc	io_uring: add support for fixed wait regions Generally applications have 1 or a few waits of waiting, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events. At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront: struct io_uring_reg_wait reg; reg = io_uring_setup_reg_wait(ring, nr_regions, &ret); / set timeout and mark as set, sigmask/sigmask_sz as needed / reg->ts.tv_sec = 0; reg->ts.tv_nsec = 100000; reg->flags = IORING_REG_WAIT_TS; where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing: struct __kernel_timespec timeout = { .tv_nsec = 100000, }; io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL); to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do: io_uring_submit_and_wait_reg(ring, &cqe, nr, 0); to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timeout (16b) for each invocation. Struct io_uring_reg_wait looks as follows: struct io_uring_reg_wait { struct __kernel_timespec ts; __u32 min_wait_usec; __u32 flags; __u64 sigmask; __u32 sigmask_sz; __u32 pad[3]; __u64 pad2[2]; }; embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatability reasons, but also because there doesn't seem to be a lot of high frequency waits scenarios that involve setting and resetting the signal mask for each wait. The application is free to modify any region before a wait call, or it can use keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be aligned to io_uring_reg_wait in size. It's valid to register less than 64 entries. In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%. Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	371b47da25	io_uring: change io_get_ext_arg() to use uaccess begin + end In scenarios where a high frequency of wait events are seen, the copy of the struct io_uring_getevents_arg is quite noticeable in the profiles in terms of time spent. It can be seen as up to 3.5-4.5%. Rewrite the copy-in logic, saving about 0.5% of the time. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	0a54a7dd0a	io_uring: switch struct ext_arg from __kernel_timespec to timespec64 This avoids intermediate storage for turning a __kernel_timespec user pointer into an on-stack struct timespec64, only then to turn it into a ktime_t. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	79cfe9e59c	io_uring/register: add IORING_REGISTER_RESIZE_RINGS Once a ring has been created, the size of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that. For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful. Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes rings according to the sizes given. Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flag associated with how the CQ ring is sized and clamped. Existing unconsumed SQE and CQE entries are copied as part of the process. If either the SQ or CQ resized destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state. To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as as fork'ed process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	81d8191eb9	io_uring: abstract out a bit of the ring filling logic Abstract out a io_uring_fill_params() helper, which fills out the necessary bits of struct io_uring_params. Add it to io_uring.h as well, in preparation for having another internal user of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	09d0a8ea7f	io_uring: move max entry definition and ring sizing into header In preparation for needing this somewhere else, move the definitions for the maximum CQ and SQ ring size into io_uring.h. Make the rings_size() helper available as well, and have it take just the setup flags argument rather than the fill ring pointer. That's all that is needed. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	2946f08ae9	io_uring: clean up cqe trace points We have too many helpers posting CQEs, instead of tracing completion events before filling in a CQE and thus having to pass all the data, set the CQE first, pass it to the tracing helper and let it extract everything it needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	9b296c625a	io_uring: static_key for !IORING_SETUP_NO_SQARRAY IORING_SETUP_NO_SQARRAY should be preferred and used by default by liburing, optimise flag checking in io_get_sqe() with a static key. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c164a48542fbb080115e2377ecf160c758562742.1729264988.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	1e6e7602cc	io_uring: kill io_llist_xchg io_llist_xchg is only used to set the list to NULL, which can also be done with llist_del_all(). Use the latter and kill io_llist_xchg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6765112680d2e86a58b76166b7513391ff4e5d7.1729264960.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	b6b3eb19dd	io_uring: move cancel hash tables to kvmalloc/kvfree Convert to using kvmalloc/kfree() for the hash tables, and while at it, make it handle low memory situations better. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	8abf47a8d6	io_uring/cancel: get rid of init_hash_table() helper All it does is initialize the lists, just move the INIT_HLIST_HEAD() into the one caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	085268829b	io_uring/poll: get rid of unlocked cancel hash io_uring maintains two hash lists of inflight requests: 1) ctx->cancel_table_locked. This is used when the caller has the ctx->uring_lock held already. This is only an issue side parameter, as removal or task_work will always have it held. 2) ctx->cancel_table. This is used when the issuer does NOT have the ctx->uring_lock held, and relies on the table spinlocks for access. However, it's pretty trivial to simply grab the lock in the one spot where we care about it, for insertion. With that, we can kill the unlocked table (and get rid of the _locked postfix for the other one). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Al Viro	be5498cac2	remove pointless includes of <linux/fdtable.h> some of those used to be needed, some had been cargo-culted for no reason... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-07 13:34:41 -04:00
Min-Hua Chen	17ea56b752	io_uring: fix casts to io_req_flags_t Apply __force cast to restricted io_req_flags_t type to fix the following sparse warning: io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-24 13:31:04 -06:00
Guixin Liu	3a87e26429	io_uring: fix memory leak when cache init fail Exit the percpu ref when cache init fails to free the data memory with in struct percpu_ref. Fixes: `206aefde4f` ("io_uring: reduce/pack size of io_ring_ctx") Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-24 13:31:00 -06:00
Linus Torvalds	3147a0689d	for-6.12/io_uring-20240922 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbvv30QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj3+EACs346FzM8PlZe1GxBZ6OnQX80blwoldAxC +Abl5xjoJKUgA7rY3lJVBRNR6olA/4I2VD3g8b3RT6lpd/oKzPFg7FOj5Dc/oN+c Fo6C7zZdr8caokpL4pfwgyG8ZNssQgRg8e0kRSw8A7AMo1zUazqAXtxjRzeMEOLC 1kWRYGdHCbVjx+hRIyX6KKP427Z5nXvcqFOC0BOpd5jDNYVh9WjNNyUE7trkGJ7o 1cjlpaaOURS0yU/4hue6tRnM8LDjaImyTyISvBWzKfKvpc19K1alOQNvHIoIeiBQ 5MgCNkSpbRmUTrydYVEQXl0Cia2d5+0KQsavUB9nZ8M++NftbRr/i26xT8ReZzXI NjaedDF+MyOKeJaft2ZeKH8GgWolysMBa4e89CveRxosa/6gwHCkkB4UK9b3gaBB Fij1zh/7fIVG7Tz8yNUDyGe6DzOEol1bn1KnL35/9nuCCRnSAM0vRPwJSkurlQ8B PqVUS3BArn+LQZmSZ3HJVKOHv2QAY8etqWizvVmu4DB9Ar+uZ6Ur2uwfMN9JAODP Fm2qVvxS73QlrvisdbnVbTzqBnqh3Rs4mb5my/gCWO1s67qtu3abSJCSzcnyxQdd yBMDegJxTNv6DErNjPEF4qDODwSTIzswr//kOeLns1EtDGfrK8nxUfIKPQUwLSTO Y7h2ru83uA== =goTY -----END PGP SIGNATURE----- Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "Mostly just a set of fixes in here, or little changes that didn't get included in the initial pull request. This contains: - Move the SQPOLL napi polling outside the submission lock (Olivier) - Rename of the "copy buffers" API that got added in the 6.12 merge window. There's really no copying going on, it's just referencing the buffers. After a bit of consideration, decided that it was better to simply rename this to avoid potential confusion (me) - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to start + len tracking rather than having start / end in there, and by removing the caching of folio_mask when we can just calculate it from folio_shift when we need it (me) - Fixes for the SQPOLL affinity checking (me, Felix) - Fix for how cqring waiting checks for the presence of task_work. Just check it directly rather than check for a specific notification mechanism (me) - Tweak to how request linking is represented in tracing (me) - Fix a syzbot report that deliberately sets up a huge list of overflow entries, and then hits rcu stalls when flushing this list. Just check for the need to preempt, and drop/reacquire locks in the loop. There's no state maintained over the loop itself, and each entry is yanked from head-of-list (me)" * tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux: io_uring: check if we need to reschedule during overflow flush io_uring: improve request linking trace io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL io_uring/sqpoll: do the napi busy poll outside the submission block io_uring: clean up a type in io_uring_register_get_file() io_uring/sqpoll: do not put cpumask on stack io_uring/sqpoll: retain test for whether the CPU is valid io_uring/rsrc: change ubuf->ubuf_end to length tracking io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask io_uring: rename "copy buffers" to "clone buffers"	2024-09-24 11:11:38 -07:00

1 2 3 4 5 ...

529 Commits