Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring futex support from Jens Axboe:
"This adds support for using futexes through io_uring - first futex
wake and wait, and then the vectored variant of waiting, futex waitv.
For both wait/wake/waitv, we support the bitset variant, as the
'normal' variants can be easily implemented on top of that.
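The claim that the 'normal' variants sit on top of the bitset ones can be sketched in userspace with the plain futex(2) syscall. This is an illustration only, not the io_uring code path: the helper names below are hypothetical, and only the opcodes and FUTEX_BITSET_MATCH_ANY come from the kernel UAPI. One difference worth noting is that FUTEX_WAIT_BITSET takes an absolute timeout, so the mapping is exact only when no timeout is passed, as here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrappers around futex(2); names are for illustration. */
static long futex_wait_bitset(uint32_t *uaddr, uint32_t expected,
                              const struct timespec *abs_timeout,
                              uint32_t mask)
{
    /* Sleeps only if *uaddr still equals 'expected'. */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG,
                   expected, abs_timeout, NULL, mask);
}

static long futex_wake_bitset(uint32_t *uaddr, int nr_wake, uint32_t mask)
{
    /* Wakes up to nr_wake waiters whose wait mask overlaps 'mask'. */
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET | FUTEX_PRIVATE_FLAG,
                   nr_wake, NULL, NULL, mask);
}

/* 'Normal' wait: a bitset wait that matches any waker, no timeout. */
static long futex_wait_plain(uint32_t *uaddr, uint32_t expected)
{
    return futex_wait_bitset(uaddr, expected, NULL, FUTEX_BITSET_MATCH_ANY);
}

/* 'Normal' wake: a bitset wake that matches any waiter. */
static long futex_wake_plain(uint32_t *uaddr, int nr_wake)
{
    return futex_wake_bitset(uaddr, nr_wake, FUTEX_BITSET_MATCH_ANY);
}
```

Calling futex_wait_plain() on a word whose value differs from 'expected' returns immediately with EAGAIN, and futex_wake_plain() returns the number of waiters woken.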
PI and requeue are not supported through io_uring, just the above
mentioned parts. This may change in the future, but in the spirit of
keeping this small (and based on what people have been asking for),
this is what we currently have.
Wake support is pretty straightforward; most of the thought has gone
into the wait side to avoid needing to offload wait operations to a
blocking context. Instead, we rely on the usual callbacks to retry and
post a completion event, when appropriate.
As far as I can recall, the first request for futex support with
io_uring came from Andres Freund, working on postgres. His aio rework
of postgres was one of the early adopters of io_uring, and futex
support was a natural extension for that. This is relevant both from a
usability point of view and for efficiency and performance. In
Andres's words, for the former:
Futex wait support in io_uring makes it a lot easier to avoid
deadlocks in concurrent programs that have their own buffer pool:
Obviously pages in the application buffer pool have to be locked
during IO. If the initiator of IO A needs to wait for a held lock
B, the holder of lock B might wait for the IO A to complete. The
ability to wait for a lock and IO completions at the same time
provides an efficient way to avoid such deadlocks.
and in terms of efficiency, even without unlocking the full potential
yet, Andres says:
Futex wake support in io_uring is useful because it allows for more
efficient directed wakeups. For some "locks" postgres has queues
implemented in userspace, with wakeup logic that cannot easily be
implemented with FUTEX_WAKE_BITSET on a single "futex word"
(imagine waiting for journal flushes to have completed up to a
certain point).
Thus a "lock release" sometimes needs to wake up many processes in a
row. A quick-and-dirty conversion to doing these wakeups via
io_uring led to a 3% throughput increase, with 12% fewer context
switches, albeit in a fairly extreme workload"
* tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux:
io_uring: add support for vectored futex waits
futex: make the vectored futex operations available
futex: make futex_parse_waitv() available as a helper
futex: add wake_data to struct futex_q
io_uring: add support for futex wake and wait
futex: abstract out a __futex_wake_mark() helper
futex: factor out the futex wake handling
futex: move FUTEX2_VALID_MASK to futex.h
'top_waiter' is assigned unconditionally before first use,
so it does not need an initialization.
[ mingo: Created legible changelog. ]
Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230725195047.3106-1-zeming@nfschina.com
In preparation for having another waker that isn't futex_wake_mark(),
add a wake handler in futex_q. No extra data is associated with the
handler outside of struct futex_q itself. futex_wake_mark() is defined as
the standard wakeup helper, now set through futex_q_init like other
defaults.
Normal sync futex waiting relies on wake_q holding tasks that should
be woken up. This is what futex_wake_mark() does: it unqueues the
futex and adds the associated task to the wake queue. For async usage of
futex waiting, rather than having tasks sleeping on the futex, we'll
need to deal with a futex wake differently. For the planned io_uring
case, that means posting a completion event for the task in question.
Having a definable wake handler can help support that use case.
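The idea of a per-waiter wake handler can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: every type and function name below is a hypothetical stand-in (the real struct futex_q and wake_q carry far more state and locking), and the "completion" is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's wake queue: tasks to be woken later. */
struct wake_queue {
    int tasks[8];
    size_t count;
};

struct futex_q_model;

/* Handler type: invoked for each queued waiter that a wake selects. */
typedef void (futex_wake_fn_model)(struct wake_queue *wq,
                                   struct futex_q_model *q);

struct futex_q_model {
    int task_id;
    bool queued;
    futex_wake_fn_model *wake;   /* the definable wake handler */
    int completions;             /* async case: completion events posted */
};

/* Default (sync) handler: unqueue and add the task to the wake queue. */
static void model_wake_mark(struct wake_queue *wq, struct futex_q_model *q)
{
    q->queued = false;
    if (wq->count < sizeof(wq->tasks) / sizeof(wq->tasks[0]))
        wq->tasks[wq->count++] = q->task_id;
}

/* Async handler: no sleeping task to wake, so post a completion instead. */
static void model_async_wake(struct wake_queue *wq, struct futex_q_model *q)
{
    (void)wq;
    q->queued = false;
    q->completions++;
}

/* Initializer sets the standard handler, like the other defaults. */
static void model_q_init(struct futex_q_model *q, int task_id)
{
    q->task_id = task_id;
    q->queued = true;
    q->wake = model_wake_mark;
    q->completions = 0;
}

/* The waker no longer hardcodes one behaviour; it calls the handler. */
static void model_wake_one(struct wake_queue *wq, struct futex_q_model *q)
{
    if (q->queued)
        q->wake(wq, q);
}
```

A caller that wants async behaviour overrides q->wake after model_q_init(); all other waiters keep the default without any change at the call site.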
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to support mixed size requeue, add a second flags argument to
the internal futex_requeue() function.
No functional change intended.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230921105248.396780136@noisy.programming.kicks-ass.net
Some new assertions pointed out that the existing code has nested rt_mutex wait
state in the futex code.
Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
still is a rt_waiter enqueued for this task, resulting in a state where there
are two waiters for the same task (and task_struct::pi_blocked_on gets
scrambled).
The reason to take hb->lock at this point is to avoid the wake_futex_pi()
EAGAIN case.
This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
inconsistent. The current rules are such that this inconsistency will not be
observed.
Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.
*However*, the case at hand is where a waiter is leaving; in this case the race
means a waiter that is going away is not observed -- which is harmless,
provided this race is explicitly handled.
This is a somewhat dangerous proposition because the converse race is not
observing a new waiter, which must absolutely not happen. But since the race is
valid this cannot be asserted.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230915151943.GD6743@noisy.programming.kicks-ass.net
Move all the requeue bits into their own file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-14-andrealmeid@collabora.com