Commit Graph

1310079 Commits

Author SHA1 Message Date
zhangguopeng
8e71afb94d block: Replace sprintf() with sysfs_emit()
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.

No functional change intended.

Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107104258.29742-1-zhangguopeng@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 15:30:30 -07:00
Damien Le Moal
4122fef16b block: Switch to using refcount_t for zone write plugs
Replace the raw atomic_t reference counting of zone write plugs with a
refcount_t.  No functional changes.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411050650.ilIZa8S7-lkp@intel.com/
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107065438.236348-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 11:21:52 -07:00
Jens Axboe
ab9bc81c1c Revert "block: pre-calculate max_zone_append_sectors"
This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS:  0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 ? __warn+0xca/0x1a0
 ? blk_validate_limits+0x356/0x380
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? blk_validate_limits+0x356/0x380
 blk_alloc_queue+0x7a/0x250
 __blk_alloc_disk+0x39/0x80
 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
 nvme_scan_ns+0xcc7/0x1010 [nvme_core]
 async_run_entry_fn+0x27/0x120
 process_scheduled_works+0x1a0/0x360
 worker_thread+0x2bc/0x350
 ? pr_cont_work+0x1b0/0x1b0
 kthread+0x111/0x120
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork+0x30/0x40
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork_asm+0x11/0x20
 </TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 05:45:34 -07:00
Jens Axboe
0b66deb16c Merge tag 'md-6.13-20241105' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.13/block
Pull MD changes from Song:

"1. Enhance handling of faulty and blocked devices, by Yu Kuai.
 2. raid5-ppl atomic improvement, by Uros Bizjak.
 3. md-bitmap fix, by Yuan Can."

* tag 'md-6.13-20241105' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/md-bitmap: Add missing destroy_work_on_stack()
  md/raid5: don't set Faulty rdev for blocked_rdev
  md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()
  md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()
  md/raid1: factor out helper to handle blocked rdev from raid1_write_request()
  md: don't record new badblocks for faulty rdev
  md: don't wait faulty rdev in md_wait_for_blocked_rdev()
  md: add a new helper rdev_blocked()
  md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()
2024-11-06 07:55:19 -07:00
Philipp Stanner
91ff97a722 mtip32xx: Replace deprecated PCI functions
pcim_iomap_table() and pcim_request_regions() have been deprecated in
commit e354bb84a4 ("PCI: Deprecate pcim_iomap_table(),
pcim_iomap_regions_request_all()") and commit d140f80f60 ("PCI:
Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"),
respectively.

Replace these functions with pcim_iomap_region().

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20241106145249.108996-2-pstanner@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 07:54:50 -07:00
Yuan Can
6012169e8a md/md-bitmap: Add missing destroy_work_on_stack()
This commit add missed destroy_work_on_stack() operations for
unplug_work.work in bitmap_unplug_async().

Fixes: a022325ab9 ("md/md-bitmap: add a new helper to unplug bitmap asynchrously")
Cc: stable@vger.kernel.org
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241105130105.127336-1-yuancan@huawei.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 21:06:51 -08:00
Yu Kuai
649bfec690 md/raid5: don't set Faulty rdev for blocked_rdev
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-8-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:39 -08:00
Yu Kuai
d419284c95 md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-7-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:39 -08:00
Yu Kuai
ff31a7ef2b md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:39 -08:00
Yu Kuai
88ed59c4cc md/raid1: factor out helper to handle blocked rdev from raid1_write_request()
Currently raid1 is preparing IO for underlying disks while checking if
any disk is blocked, if so allocated resources must be released, then
waiting for rdev to be unblocked and try to prepare IO again.

Make code cleaner by checking blocked rdev first, it doesn't matter if
rdev is blocked while issuing IO, the IO will wait for rdev to be
unblocked or not.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:39 -08:00
Yu Kuai
29967332ce md: don't record new badblocks for faulty rdev
Faulty will be checked before issuing IO to the rdev, however, rdev can
be faulty at any time, hence it's possible that rdev_set_badblocks()
will be called for faulty rdev. In this case, mddev->sb_flags will be
set and some other path can be blocked by updating super block.

Since faulty rdev will not be accesed anymore, there is no need to
record new babblocks for faulty rdev and forcing updating super block.

Noted this is not a bugfix, just prevent updating superblock in some
corner cases, and will help to slice a bug related to external
metadata[1], testing also shows that devices are removed faster in the
case IO error.

[1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:38 -08:00
Yu Kuai
50e8274855 md: don't wait faulty rdev in md_wait_for_blocked_rdev()
md_wait_for_blocked_rdev() is called for write IO while rdev is
blocked, howerver, rdev can be faulty after choosing this rdev to write,
and faulty rdev should never be accessed anymore, hence there is no point
to wait for faulty rdev to be unblocked.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:38 -08:00
Yu Kuai
4abfce19c7 md: add a new helper rdev_blocked()
The helper will be used in later patches for raid1/raid10/raid5, the
difference is that Faulty rdev with unacknowledged bad block will not
be considered blocked.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:38 -08:00
Uros Bizjak
1e79892e76 md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation and ease register pressure around
the primitive for targets that implement optimized variant.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20241007084831.48067-1-ubizjak@gmail.com
Signed-off-by: Song Liu <song@kernel.org>
2024-11-05 16:08:38 -08:00
Christoph Hellwig
2a8f6153e1 block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:07 -07:00
Christoph Hellwig
e494c3dce6 block: remove the max_zone_append_sectors check in blk_revalidate_disk_zones
With the lock layer zone append emulation, we are now always setting a
max_zone_append_sectors value for zoned devices and this check can't
ever trigger.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:06 -07:00
Christoph Hellwig
05df016684 block: update blk_stack_limits documentation
Listing every single features that needs to be pre-set by stacking
drivers does not scale.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104054218.45596-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:33:20 -07:00
Ming Lei
341468e0ab lib/iov_iter: fix bvec iterator setup
.bi_size of bvec iterator should be initialized as real max size for
walking, and .bi_bvec_done just counts how many bytes need to be
skipped in the 1st bvec, so .bi_size isn't related with .bi_bvec_done.

This patch fixes bvec iterator initialization, and the inner `size`
check isn't needed any more, so revert Eric Dumazet's commit
7bc802acf193 ("iov-iter: do not return more bytes than requested in
iov_iter_extract_bvec_pages()").

Cc: Eric Dumazet <edumazet@google.com>
Fixes: e4e535bff2 ("iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages")
Reported-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com
Tested-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-01 20:18:21 -06:00
John Garry
d47de6ac88 loop: Simplify discard granularity calc
A bdev discard granularity is always at least SECTOR_SIZE, so don't check
for a zero value.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241101092215.422428-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-01 20:17:52 -06:00
Christoph Hellwig
f187b9bf1a block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.

All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Christoph Hellwig
cafd00d0e9 block: remove zone append special casing from the direct I/O path
This code is unused, and all future zoned file systems should follow
the btrfs lead of splitting the bios themselves to the zoned limits
in the I/O submission handler, because if they didn't they would be
hit by commit ed9832bc08 ("block: introduce folio awareness and add
a bigger size from folio") breaking this code when the zone append
limit (that is usually the max_hw_sectors limit) is smaller than the
largest possible folio size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Ming Lei
496a51b371 lib/iov_iter.c: initialize bi.bi_idx before iterating over bvec
Initialize bi.bi_idx as 0 before iterating over bvec, otherwise
garbage data can be used as ->bi_idx.

Cc: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Klara Modin <klarasmodin@gmail.com>
Fixes: e4e535bff2 ("iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 07:40:34 -06:00
Keith Busch
133008e84b blk-integrity: remove seed for user mapped buffers
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:49:32 -06:00
Yang Erkun
826cc42adf brd: defer automatic disk creation until module initialization succeeds
My colleague Wupeng found the following problems during fault injection:

BUG: unable to handle page fault for address: fffffbfff809d073
PGD 6e648067 P4D 123ec8067 PUD 123ec4067 PMD 100e38067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 5 UID: 0 PID: 755 Comm: modprobe Not tainted 6.12.0-rc3+ #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:__asan_load8+0x4c/0xa0
...
Call Trace:
 <TASK>
 blkdev_put_whole+0x41/0x70
 bdev_release+0x1a3/0x250
 blkdev_release+0x11/0x20
 __fput+0x1d7/0x4a0
 task_work_run+0xfc/0x180
 syscall_exit_to_user_mode+0x1de/0x1f0
 do_syscall_64+0x6b/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

loop_init() is calling loop_add() after __register_blkdev() succeeds and
is ignoring disk_add() failure from loop_add(), for loop_add() failure
is not fatal and successfully created disks are already visible to
bdev_open().

brd_init() is currently calling brd_alloc() before __register_blkdev()
succeeds and is releasing successfully created disks when brd_init()
returns an error. This can cause UAF for the latter two case:

case 1:
    T1:
modprobe brd
  brd_init
    brd_alloc(0) // success
      add_disk
        disk_scan_partitions
          bdev_file_open_by_dev // alloc file
          fput // won't free until back to userspace
    brd_alloc(1) // failed since mem alloc error inject
  // error path for modprobe will release code segment
  // back to userspace
  __fput
    blkdev_release
      bdev_release
        blkdev_put_whole
          bdev->bd_disk->fops->release // fops is freed now, UAF!

case 2:
    T1:                            T2:
modprobe brd
  brd_init
    brd_alloc(0) // success
                                   open(/dev/ram0)
    brd_alloc(1) // fail
  // error path for modprobe

                                   close(/dev/ram0)
                                   ...
                                   /* UAF! */
                                   bdev->bd_disk->fops->release

Fix this problem by following what loop_init() does. Besides,
reintroduce brd_devices_mutex to help serialize modifications to
brd_list.

Fixes: 7f9b348cb5 ("brd: convert to blk_alloc_disk/blk_cleanup_disk")
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241030034914.907829-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:31:07 -06:00
John Garry
8d3fd059dd loop: Use bdev limit helpers for configuring discard
Instead of directly looking at the request_queue limits, use the bdev
limits helpers, which is preferable.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241030111900.3981223-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:23:57 -06:00
Christoph Hellwig
2f5a65ef30 block: add a bdev_limits helper
Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 09:15:00 -06:00
Ming Lei
e4e535bff2 iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages
The iov_iter_extract_pages interface allows to return physically
discontiguous pages, as long as all but the first and last page
in the array are page aligned and page size.  Rewrite
iov_iter_extract_bvec_pages to take advantage of that instead of only
returning ranges of physically contiguous pages.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: minor cleanups, new commit log]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241024050021.627350-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 09:14:28 -06:00
Ming Lei
f1be1788a3 block: model freeze & enter queue as lock for supporting lockdep
Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().

Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:

1) model q->q_usage_counter as two locks(io and queue lock)

- queue lock covers sync with blk_enter_queue()

- io lock covers sync with bio_enter_queue()

2) make the lockdep class/key as per-queue:

- different subsystem has very different lock use pattern, shared lock
 class causes false positive easily

- freeze_queue degrades to no lock in case that disk state becomes DEAD
  because bio_enter_queue() won't be blocked any more

- freeze_queue degrades to no lock in case that request queue becomes dying
  because blk_enter_queue() won't be blocked any more

3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered

- it is trylock because blk_mq_freeze_queue() are allowed to run
  concurrently

4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed

- dependency with blk_mq_freeze_queue() is covered

- blk_queue_exit() is often called from other contexts(such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(), this way still covered cases as many as possible

With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.

For example, lockdep report can be triggered in the report[3] with this
patch applied.

[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166

[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/

[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Ming Lei
6b6f6c41c8 nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
nvme_start_freeze() and nvme_unfreeze() may be called from same context,
so switch them to call non_owner variant of start_freeze/unfreeze queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Ming Lei
8acdd0e7bf blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
Add non_owner variant of start_freeze/unfreeze queue APIs, so that the
caller knows that what they are doing, and we can skip lockdep support
for non_owner variant in per-call level.

Prepare for supporting lockdep for freezing/unfreezing queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Bart Van Assche
e203e20a8b blk-mq: Unexport blk_mq_flush_busy_ctxs()
Commit a6088845c2 ("block: kyber: make kyber more friendly with merging")
removed the only blk_mq_flush_busy_ctxs() call from outside the block layer
core. Hence unexport blk_mq_flush_busy_ctxs().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241023202850.3469279-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-23 16:30:40 -06:00
Bart Van Assche
ccd9e252c5 blk-mq: Make blk_mq_quiesce_tagset() hold the tag list mutex less long
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.

Reported-by: Peter Wang <peter.wang@mediatek.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 414dd48e88 ("blk-mq: add tagset quiesce interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 19:11:25 -06:00
Muchun Song
904ebd2527 block: remove redundant explicit memory barrier from rq_qos waiter and waker
The memory barriers in list_del_init_careful() and list_empty_careful()
in pairs already handle the proper ordering between data.got_token
and data.wq.entry. So remove the redundant explicit barriers. And also
change a "break" statement to "return" to avoid redundant calling of
finish_wait().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/20241021085251.73353-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 14:05:09 -06:00
Jens Axboe
fdad1a20cd Merge branch 'for-6.13/block-atomic' into for-6.13/block
Merge in block/fs prep patches for the atomic write support.

* for-6.13/block-atomic:
  block: Add bdev atomic write limits helpers
  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
  block/fs: Pass an iocb to generic_atomic_write_valid()
2024-10-22 08:21:51 -06:00
Li Lingfeng
919b5139bd block: flush all throttled bios when deleting the cgroup
When a process migrates to another cgroup and the original cgroup is deleted,
the restrictions of throttled bios cannot be removed. If the restrictions
are set too low, it will take a long time to complete these bios.

Refer to the process of deleting a disk to remove the restrictions and
issue bios when deleting the cgroup.

This makes difference on the behavior of throttled bios:
Before: the limit of the throttled bios can't be changed and the bios will
complete under this limit;
Now: the limit will be canceled and the throttled bios will be flushed
immediately.

References:
[1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
[2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud.com/

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:43 -06:00
Muchun Song
96a9fe64bf block: fix ordering between checking BLK_MQ_S_STOPPED request adding
Supposing first scenario with a virtio_blk driver.

CPU0                        CPU1

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                            virtblk_done()
  blk_mq_request_bypass_insert()  1) store
                              blk_mq_start_stopped_hw_queue()
                                clear_bit(BLK_MQ_S_STOPPED)       3) store
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
                                  blk_mq_sched_dispatch_requests()
  blk_mq_run_hw_queue()
    if (!blk_mq_hctx_has_pending())
      return
    blk_mq_sched_dispatch_requests()
      if (blk_mq_hctx_stopped())  2) load
        return
      __blk_mq_sched_dispatch_requests()

Supposing another scenario.

CPU0                        CPU1

blk_mq_requeue_work()
  blk_mq_insert_request() 1) store
                            virtblk_done()
                              blk_mq_start_stopped_hw_queue()
  blk_mq_run_hw_queues()        clear_bit(BLK_MQ_S_STOPPED)       3) store
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
                                  blk_mq_sched_dispatch_requests()
    if (blk_mq_hctx_stopped())  2) load
      continue
    blk_mq_run_hw_queue()

Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.

The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.

Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Muchun Song
6bda857bcb block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding
Supposing the following scenario.

CPU0                        CPU1

blk_mq_insert_request()     1) store
                            blk_mq_unquiesce_queue()
                            blk_queue_flag_clear()                3) store
                              blk_mq_run_hw_queues()
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
blk_mq_run_hw_queue()
  if (blk_queue_quiesced()) 2) load
    return
  blk_mq_sched_dispatch_requests()

The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED
is cleared or CPU1 sees dispatch list or setting of bitmap of software
queue. Otherwise, either CPU will not rerun the hardware queue causing
starvation.

So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to
synchronize QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since
memory barrier is not easy to be maintained.

Fixes: f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Muchun Song
2003ee8a9a block: fix missing dispatching request when queue is started or unquiesced
Supposing the following scenario with a virtio_blk driver.

CPU0                    CPU1                    CPU2

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                                virtblk_done()
                        blk_mq_try_issue_directly()
                          if (blk_mq_hctx_stopped())
  blk_mq_request_bypass_insert()                  blk_mq_run_hw_queue()
  blk_mq_run_hw_queue()     blk_mq_run_hw_queue()
                            blk_mq_insert_request()
                            return

After CPU0 has marked the queue as stopped, CPU1 will see the queue is
stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the interrupt of completion of request, so it will run the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request to the same hardware
queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly()
should handle a similar situation. Fix it as well.

Fixes: d964f04a8f ("blk-mq: fix direct issue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Xiuhong Wang
732312e183 Revert "blk-throttle: Fix IO hang for a corner case"
This reverts commit 5b7048b897.

The main purpose of this patch is cleanup.
The throtl_adjusted_limit function was removed after
commit bf20ab538c ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW"), so the problem of not being
able to scale after setting bps or iops to 1 will not occur.
So revert this commit that bps/iops can be set to 1.

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241016024508.3340330-1-xiuhong.wang@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Julia Lawall
28878733ca block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20241013201704.49576-10-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Greg Joyce
b21d948f4c block: sed-opal: add ioctl IOC_OPAL_SET_SID_PW
After a SED drive is provisioned, there is no way to change the SID
password via the ioctl() interface. A new ioctl IOC_OPAL_SET_SID_PW
will allow the password to be changed. The valid current password is
required.

Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lore.kernel.org/r/20240829175639.6478-2-gjoyce@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Uday Shankar
69f407ee8d Documentation: ublk: document UBLK_F_USER_RECOVERY_FAIL_IO
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-6-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Uday Shankar
59eaa01ce7 ublk: support device recovery without I/O queueing
ublk currently supports the following behaviors on ublk server exit:

A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue

and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:

1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands

The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:

default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2

The behavior A + 2 is currently unsupported. Add support for this
behavior under the new flag combination
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_FAIL_IO.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-5-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
27b5d4170c ublk: merge stop_work and quiesce_work
Save some lines by merging stop_work and quiesce_work into nosrv_work,
which looks at the recovery flags and does the right thing when the "no
ublk server" condition is detected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-4-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
3b939b8f71 ublk: refactor recovery configuration flag helpers
ublk currently supports the following behaviors on ublk server exit:

A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue

and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:

1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands

The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:

default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2

We can't easily change the userspace interface to allow independent
selection of one of {A, B, C} and one of {1, 2}, but we can refactor the
internal helpers which test for the flags. Replace the existing helpers
with the following set:

ublk_nosrv_should_reissue_outstanding: tests for behavior C
ublk_nosrv_[dev_]should_queue_io: tests for behavior B
ublk_nosrv_should_stop_dev: tests for behavior 1

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-3-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
d00c0ea179 ublk: check recovery flags for validity
Setting UBLK_F_USER_RECOVERY_REISSUE without also setting
UBLK_F_USER_RECOVERY is currently silently equivalent to not setting any
recovery flags at all, even though that's obviously not intended. Check
for this case and fail add_dev (with a paranoid warning to aid debugging
any program which might rely on the old behavior) with EINVAL if it is
detected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-2-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Keith Busch
110234da18 block: enable passthrough command statistics
Applications using the passthrough interfaces for IO want to continue
seeing the disk stats. These requests had been fenced off from this
block layer feature. While the block layer doesn't necessarily know what
a passthrough command does, we do know the data size and direction,
which is enough to account for the command's stats.

Since tracking these has the potential to produce unexpected results,
the passthrough stats are locked behind a new queue flag that needs to
be enabled with the /sys/block/<dev>/queue/iostats_passthrough
attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241007153236.2818562-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:32 -06:00
Christoph Hellwig
d51c9cdfc2 block: return void from the queue_sysfs_entry load_module method
Requesting a module either succeeds or does nothing, return an error from
this method does not make sense.

Also move the load_module after the store method in the struct
declaration to keep the important show and store methods together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241008050841.104602-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:22 -06:00
Konstantin Khlebnikov
758737d86f block: add partition uuid into uevent as "PARTUUID"
Both most common formats have uuid in addition to partition name:
GPT: standard uuid xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
DOS: 4 byte disk signature and 1 byte partition xxxxxxxx-xx

Tools from util-linux use the same notation for them.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
[dianders: rebased to modern kernels]
Signed-off-by: Douglas Anderson <dianders@google.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241004171340.v2.1.I938c91d10e454e841fdf5d64499a8ae8514dc004@changeid
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:15:17 -06:00
Jens Axboe
746fc7e9d4 block: move issue side time stamping to blk_account_io_start()
It's known needed at that point, and it's cleaner to just assign it
there rather than rely on it being reliably set before hitting the
IO accounting. Hence, move it out of blk_mq_rq_time_init(), which is
now only doing the allocation side timing.

While at it, get rid of the '0' time passing to blk_mq_rq_time_init(),
just pass in blk_time_get_ns() for the two cases where 0 is being
explicitly passed in. The rest pass in the previously cached allocation
time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:15:17 -06:00