req_get_ioprio looks at req->bio to find the I/O priority, which is not
set when completing bios that the driver fully iterated through.
Stash away the dd_per_prio in the elevator private data instead of looking
it up again to optimize the code a bit while fixing the regression from
removing the per-request ioprio value.
Fixes: 6975c1a486 ("block: remove the ioprio field from struct request")
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Tested-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241126102136.619067-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function blk_revalidate_disk_zones() calls the function
disk_update_zone_resources() after freezing the device queue. In turn,
disk_update_zone_resources() calls queue_limits_start_update() which
takes a queue limits mutex lock, resulting in the ordering:
q->q_usage_counter check -> q->limits_lock. However, the usual ordering
is to always take a queue limit lock before freezing the queue to commit
the limits updates, e.g., the code pattern:
lim = queue_limits_start_update(q);
...
blk_mq_freeze_queue(q);
ret = queue_limits_commit_update(q, &lim);
blk_mq_unfreeze_queue(q);
Thus, blk_revalidate_disk_zones() introduces a potential circular
locking dependency deadlock that lockdep sometimes catches with the
splat:
[ 51.934109] ======================================================
[ 51.935916] WARNING: possible circular locking dependency detected
[ 51.937561] 6.12.0+ #2107 Not tainted
[ 51.938648] ------------------------------------------------------
[ 51.940351] kworker/u16:4/157 is trying to acquire lock:
[ 51.941805] ffff9fff0aa0bea8 (&q->limits_lock){+.+.}-{4:4}, at: disk_update_zone_resources+0x86/0x170
[ 51.944314]
but task is already holding lock:
[ 51.945688] ffff9fff0aa0b890 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: blk_revalidate_disk_zones+0x15f/0x340
[ 51.948527]
which lock already depends on the new lock.
[ 51.951296]
the existing dependency chain (in reverse order) is:
[ 51.953708]
-> #1 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
[ 51.956131] blk_queue_enter+0x1c9/0x1e0
[ 51.957290] blk_mq_alloc_request+0x187/0x2a0
[ 51.958365] scsi_execute_cmd+0x78/0x490 [scsi_mod]
[ 51.959514] read_capacity_16+0x111/0x410 [sd_mod]
[ 51.960693] sd_revalidate_disk.isra.0+0x872/0x3240 [sd_mod]
[ 51.962004] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.962993] really_probe+0xd5/0x330
[ 51.963898] __driver_probe_device+0x78/0x110
[ 51.964925] driver_probe_device+0x1f/0xa0
[ 51.965916] __driver_attach_async_helper+0x60/0xe0
[ 51.967017] async_run_entry_fn+0x2e/0x140
[ 51.968004] process_one_work+0x21f/0x5a0
[ 51.968987] worker_thread+0x1dc/0x3c0
[ 51.969868] kthread+0xe0/0x110
[ 51.970377] ret_from_fork+0x31/0x50
[ 51.970983] ret_from_fork_asm+0x11/0x20
[ 51.971587]
-> #0 (&q->limits_lock){+.+.}-{4:4}:
[ 51.972479] __lock_acquire+0x1337/0x2130
[ 51.973133] lock_acquire+0xc5/0x2d0
[ 51.973691] __mutex_lock+0xda/0xcf0
[ 51.974300] disk_update_zone_resources+0x86/0x170
[ 51.975032] blk_revalidate_disk_zones+0x16c/0x340
[ 51.975740] sd_zbc_revalidate_zones+0x73/0x160 [sd_mod]
[ 51.976524] sd_revalidate_disk.isra.0+0x465/0x3240 [sd_mod]
[ 51.977824] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.978917] really_probe+0xd5/0x330
[ 51.979915] __driver_probe_device+0x78/0x110
[ 51.981047] driver_probe_device+0x1f/0xa0
[ 51.982143] __driver_attach_async_helper+0x60/0xe0
[ 51.983282] async_run_entry_fn+0x2e/0x140
[ 51.984319] process_one_work+0x21f/0x5a0
[ 51.985873] worker_thread+0x1dc/0x3c0
[ 51.987289] kthread+0xe0/0x110
[ 51.988546] ret_from_fork+0x31/0x50
[ 51.989926] ret_from_fork_asm+0x11/0x20
[ 51.991376]
other info that might help us debug this:
[ 51.994127] Possible unsafe locking scenario:
[ 51.995651] CPU0 CPU1
[ 51.996694] ---- ----
[ 51.997716] lock(&q->q_usage_counter(queue)#3);
[ 51.998817] lock(&q->limits_lock);
[ 52.000043] lock(&q->q_usage_counter(queue)#3);
[ 52.001638] lock(&q->limits_lock);
[ 52.002485]
*** DEADLOCK ***
Prevent this issue by moving the calls to blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() around the call to queue_limits_commit_update()
in disk_update_zone_resources(). In case of revalidation failure, the
call to disk_free_zone_resources() in blk_revalidate_disk_zones()
is still done with the queue frozen as before.
Fixes: 843283e96e ("block: Fake max open zones limit when there is no limit")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241126104705.183996-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As smatch, which is a lot smarter than me noticed. So remove the checks
for it, and condense these checks a bit including the comments stating
the obvious.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241119161157.1328171-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because it already is encoded in the opcode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241119161157.1328171-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix an issue detected by the `smatch` tool:
block/blk-mq.c:3314 blk_rq_prep_clone() error: uninitialized
symbol 'bio'.
This patch refactors `blk_rq_prep_clone()` to improve code
readability and ensure safety by addressing potential misuse of
the `bio` variable:
- Move the bio_put(bio); call to the bio_ctr error handling block,
which is the only place where it can be triggered.
- Move the bio variable into the __rq_for_each_bio loop scope.
This change removes the need to set bio to NULL at the loop's
end.
discussion on why bio remains uninitialized:
https://lore.kernel.org/lkml/20241004141037.43277-1-surajsonawane0215@gmail.com
Summary of above discussion:
- I pointed out that `bio` can remain uninitialized if the
allocation with `bio_alloc_clone` fails.
- Keith Busch explained that `bio` is initialized to `NULL` when
`bio_alloc_clone()` fails, preventing uninitialized usage.
- John Garry questioned whether `rq_src->bio` being `NULL` could
leave `bio` uninitialized. Keith clarified that in such cases,
`bio` is not referenced, so it does not need initialization.
- Christoph Hellwig recommended code improvements:
- move the bio_put to the bio_ctr error handling, which is the only
case where it can happen
- move the bio variable into the __rq_for_each_bio scope, which
also removed the need to zero it at the end of the loop
These changes enhance code clarity, address static analysis tool
warnings, and make the function more maintainable.
thread of previous version patch discussion:
https://lore.kernel.org/lkml/20241004100842.9052-1-surajsonawane0215@gmail.com
Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241119164412.37609-1-surajsonawane0215@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow stacked devices to support atomic writes by aggregating the minimum
capability of all bottom devices.
Flag BLK_FEAT_ATOMIC_WRITES_STACKED is set for stacked devices which
have been enabled to support atomic writes.
Some things to note on the implementation:
- For simplicity, all bottom devices must have same atomic write boundary
value (if any)
- The atomic write boundary must be a power-of-2 already, but this
restriction could be relaxed. Furthermore, it is now required that the
chunk sectors for a top device must be aligned with this boundary.
- If a bottom device atomic write unit min/max are not aligned with the
top device chunk sectors, the top device atomic write unit min/max are
reduced to a value which works for the chunk sectors.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241118105018.1870052-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is so far expected that the limits passed are valid.
In future atomic writes will be supported for stacked block devices, and
calculating the limits there will be complicated, so add extra sanity
checks to ensure that the values are always valid.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241118105018.1870052-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
lim->discard_granularity is always at least SECTOR_SIZE, so drop the
pointless check for granularity less than SECTOR_SIZE.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112092144.4059847-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There was a bug report [1] where the user got a warning alignment
inconsistency. The user has optimal I/O 16776704 (0xFFFE00) and physical
block size 4096. Note that the optimal I/O size may be set by the DMA
engines or SCSI controllers and they have no knowledge about the disks
attached to them, so the situation with optimal I/O not aligned to
physical block size may happen.
This commit makes blk_validate_limits round down optimal I/O size to the
physical block size of the block device.
Closes: https://lore.kernel.org/dm-devel/1426ad71-79b4-4062-b2bf-84278be66a5d@redhat.com/T/ [1]
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: a23634644a ("block: take io_opt and io_min into account for max_sectors")
Cc: stable@vger.kernel.org # v6.11+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/3dc0014b-9690-dc38-81c9-4a316a2d4fb2@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
seq_printf is costly. For each block device, 19 decimal values are
yielded in /proc/diskstats via seq_printf. On a system with 16 logical
block devices, profiling for open/read/close sequences shows seq_printf
took ~75% samples of diskstats_show:
diskstats_show(92.626% 2269372/2450040)
seq_printf(76.026% 1725313/2269372)
vsnprintf(99.163% 1710866/1725313)
format_decode(26.597% 455040/1710866)
number(19.554% 334542/1710866)
memcpy_orig(4.183% 71570/1710866)
...
srso_return_thunk(0.009% 148/1725313)
part_stat_read_all(8.030% 182236/2269372)
One million rounds of open/read/close /proc/diskstats takes:
real 0m37.687s
user 0m0.264s
sys 0m32.911s
On average, each sequence tooks ~0.032ms
With this patch, most decimal values are yield via seq_put_decimal_ull,
performance is significantly improved:
real 0m20.792s
user 0m0.316s
sys 0m20.463s
On average, each sequence tooks ~0.020ms, a ~37.5% improvement.
Signed-off-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.
Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While block drivers do the validation as part of committing them to the
queue, users that use the limit outside of a block device context have
to validate the limits and fill in the calculated values as well.
So far btrfs is the only user of queue limits without a block device,
and it has gotten away with that more or less by accident. But with
commit 559218d43e ("block: pre-calculate max_zone_append_sectors")
this became fatal for setups that have small max zone append size,
as it won't be limited now.
Export blk_validate_limits so that it can be called directly from btrfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it. Directly use the
bio field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The write_hint is only used for read/write requests, which must have a
bio attached to them. Just use the bio field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make bio_is_zone_append globally available, because file systems need
to use to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Otherwise it can create unaligned writes on zoned devices.
Fixes: a805a4fa4f ("block: introduce zone_write_granularity limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors. For other uses like the
internally RAIDed NVMe devices it is probably at least useful.
Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes. Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.
Fixes: 885fa13f65 ("block: implement splitting of REQ_OP_WRITE_ZEROES bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_split() may error, so check this.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is disallowed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return a ERR_PTR() always.
Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since
almost all callers don't check the return code from bio_split(), we'll
crash anyway (for those failures).
Fix up the only user which checks bio_split() return code today (directly
or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning a ERR_PTR()).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_init_mq() is only called at the entry of add_disk_fwnode() when
disk IO isn't allowed yet.
So not verify io lock(q->io_lockdep_map) for freeze & unfreeze in
elevator_init_mq().
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Lai Yi <yi1.lai@linux.intel.com>
Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit f1be1788a3 ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done the outmost freeze and
unfreeze. This way is actually not correct because q->mq_freeze_depth
still may drop to zero on other task instead of the freeze owner task.
Fix this issue by always verifying the last unfreeze lock on the owner
task context, and make sure both the outmost freeze & unfreeze are
verified in the current task.
Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No one use blk_freeze_queue(), so remove it and the obsolete comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.
disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.
disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
No functional change intended.
Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107104258.29742-1-zhangguopeng@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the lock layer zone append emulation, we are now always setting a
max_zone_append_sectors value for zoned devices and this check can't
ever trigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Listing every single features that needs to be pre-set by stacking
drivers does not scale.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104054218.45596-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.
All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This code is unused, and all future zoned file systems should follow
the btrfs lead of splitting the bios themselves to the zoned limits
in the I/O submission handler, because if they didn't they would be
hit by commit ed9832bc08 ("block: introduce folio awareness and add
a bigger size from folio") breaking this code when the zone append
limit (that is usually the max_hw_sectors limit) is smaller than the
largest possible folio size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().
Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:
1) model q->q_usage_counter as two locks(io and queue lock)
- queue lock covers sync with blk_enter_queue()
- io lock covers sync with bio_enter_queue()
2) make the lockdep class/key as per-queue:
- different subsystem has very different lock use pattern, shared lock
class causes false positive easily
- freeze_queue degrades to no lock in case that disk state becomes DEAD
because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock in case that request queue becomes dying
because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered
- it is trylock because blk_mq_freeze_queue() are allowed to run
concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed
- dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts(such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(), this way still covered cases as many as possible
With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.
For example, lockdep report can be triggered in the report[3] with this
patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add non_owner variant of start_freeze/unfreeze queue APIs, so that the
caller knows that what they are doing, and we can skip lockdep support
for non_owner variant in per-call level.
Prepare for supporting lockdep for freezing/unfreezing queue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit a6088845c2 ("block: kyber: make kyber more friendly with merging")
removed the only blk_mq_flush_busy_ctxs() call from outside the block layer
core. Hence unexport blk_mq_flush_busy_ctxs().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241023202850.3469279-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.
Reported-by: Peter Wang <peter.wang@mediatek.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 414dd48e88 ("blk-mq: add tagset quiesce interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The memory barriers in list_del_init_careful() and list_empty_careful()
in pairs already handle the proper ordering between data.got_token
and data.wq.entry. So remove the redundant explicit barriers. And also
change a "break" statement to "return" to avoid redundant calling of
finish_wait().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/20241021085251.73353-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in block/fs prep patches for the atomic write support.
* for-6.13/block-atomic:
block: Add bdev atomic write limits helpers
fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
block/fs: Pass an iocb to generic_atomic_write_valid()
When a process migrates to another cgroup and the original cgroup is deleted,
the restrictions of throttled bios cannot be removed. If the restrictions
are set too low, it will take a long time to complete these bios.
Refer to the process of deleting a disk to remove the restrictions and
issue bios when deleting the cgroup.
This makes difference on the behavior of throttled bios:
Before: the limit of the throttled bios can't be changed and the bios will
complete under this limit;
Now: the limit will be canceled and the throttled bios will be flushed
immediately.
References:
[1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
[2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud.com/
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Supposing first scenario with a virtio_blk driver.
CPU0 CPU1
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_request_bypass_insert() 1) store
blk_mq_start_stopped_hw_queue()
clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending())
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
return
__blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0 CPU1
blk_mq_requeue_work()
blk_mq_insert_request() 1) store
virtblk_done()
blk_mq_start_stopped_hw_queue()
blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
continue
blk_mq_run_hw_queue()
Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.
The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Supposing the following scenario.
CPU0 CPU1
blk_mq_insert_request() 1) store
blk_mq_unquiesce_queue()
blk_queue_flag_clear() 3) store
blk_mq_run_hw_queues()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_run_hw_queue()
if (blk_queue_quiesced()) 2) load
return
blk_mq_sched_dispatch_requests()
The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED
is cleared or CPU1 sees dispatch list or setting of bitmap of software
queue. Otherwise, either CPU will not rerun the hardware queue causing
starvation.
So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to
synchronize QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since
memory barrier is not easy to be maintained.
Fixes: f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Supposing the following scenario with a virtio_blk driver.
CPU0 CPU1 CPU2
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_try_issue_directly()
if (blk_mq_hctx_stopped())
blk_mq_request_bypass_insert() blk_mq_run_hw_queue()
blk_mq_run_hw_queue() blk_mq_run_hw_queue()
blk_mq_insert_request()
return
After CPU0 has marked the queue as stopped, CPU1 will see the queue is
stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the interrupt of completion of request, so it will run the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request to the same hardware
queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly()
should handle a similar situation. Fix it as well.
Fixes: d964f04a8f ("blk-mq: fix direct issue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>