linux/block
Coly Li 9b15d109a6 block: improve discard bio alignment in __blkdev_issue_discard()
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 07:15:10 -06:00
..
partitions block: always remove partitions from blk_drop_partitions() 2020-07-15 09:23:42 -06:00
badblocks.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00
bfq-cgroup.c block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline 2020-03-21 14:31:03 -06:00
bfq-iosched.c blk-mq: remove the bio argument to ->prepare_request 2020-05-29 10:23:24 -06:00
bfq-iosched.h block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup 2020-03-21 14:31:00 -06:00
bfq-wf2q.c block, bfq: get a ref to a group when adding it to a service tree 2020-02-03 06:58:15 -07:00
bio-integrity.c block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed 2020-06-02 17:21:50 -06:00
bio.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
blk-cgroup-rwstat.c blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT 2019-11-07 12:28:13 -07:00
blk-cgroup-rwstat.h blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT 2019-11-07 12:28:13 -07:00
blk-cgroup.c writeback: remove struct bdi_writeback_congested 2020-07-08 17:05:53 -06:00
blk-core.c block: remove a bogus warning in __submit_bio_noacct_mq 2020-07-07 11:45:59 -06:00
blk-crypto-fallback.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
blk-crypto-internal.h block: blk-crypto-fallback for Inline Encryption 2020-05-14 09:48:03 -06:00
blk-crypto.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
blk-exec.c block: add a blk_account_io_merge_bio helper 2020-05-27 05:21:23 -06:00
blk-flush.c block: defer flush request no matter whether we have elevator 2020-07-17 07:14:28 -06:00
blk-integrity.c block: Make blk-integrity preclude hardware inline encryption 2020-05-14 09:48:03 -06:00
blk-ioc.c block: remove retry loop in ioc_release_fn() 2020-07-16 10:22:15 -06:00
blk-iocost.c blk-iocost: Use struct_size() in kzalloc_node() 2020-06-24 09:15:58 -06:00
blk-iolatency.c blk-iolatency: only call ktime_get() if needed 2020-07-01 08:02:38 -06:00
blk-lib.c block: improve discard bio alignment in __blkdev_issue_discard() 2020-07-17 07:15:10 -06:00
blk-map.c block: Inline encryption support for blk-mq 2020-05-14 09:47:53 -06:00
blk-merge.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
blk-mq-cpumap.c blk-mq: balance mapping between present CPUs and queues 2019-08-04 21:43:12 -06:00
blk-mq-debugfs-zoned.c block: Cleanup license notice 2019-01-17 21:21:40 -07:00
blk-mq-debugfs.c blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq() 2020-07-01 07:25:36 -06:00
blk-mq-debugfs.h blk-mq: no need to check return value of debugfs_create functions 2019-06-13 03:00:30 -06:00
blk-mq-pci.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-rdma.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-sched.c blk-mq: Remove unnecessary local variable 2020-07-10 07:58:09 -06:00
blk-mq-sched.h block: blk-mq: Remove blk_mq_sched_started_request and started_request 2019-07-23 07:25:09 -06:00
blk-mq-sysfs.c blk-mq: make sure that line break can be printed 2019-11-04 07:14:10 -07:00
blk-mq-tag.c blk-mq: move blk_mq_get_driver_tag into blk-mq.c 2020-06-30 12:57:59 -06:00
blk-mq-tag.h blk-mq: centralise related handling into blk_mq_get_driver_tag 2020-07-08 16:06:42 -06:00
blk-mq-virtio.c blk-mq: Fix typo in comment 2020-03-17 20:55:21 +01:00
blk-mq.c blk-mq: remove redundant validation in __blk_mq_end_request() 2020-07-10 07:58:33 -06:00
blk-mq.h Revert "blk-mq: put driver tag when this request is completed" 2020-07-01 22:58:32 -06:00
blk-pm.c block: bypass blk_set_runtime_active for uninitialized q->dev 2019-09-12 07:11:56 -06:00
blk-pm.h block: remove the queue_lock indirection 2018-11-15 12:17:28 -07:00
blk-rq-qos.c Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." 2020-07-15 09:33:37 -06:00
blk-rq-qos.h blk-rq-qos: fix first node deletion of rq_qos_del() 2019-10-15 10:13:13 -06:00
blk-settings.c block: Introduce REQ_OP_ZONE_APPEND 2020-05-12 20:36:28 -06:00
blk-stat.c blk-stat: Optimise blk_stat_add() 2019-10-07 21:19:10 -06:00
blk-stat.h block: deactivate blk_stat timer in wbt_disable_default() 2018-12-12 06:47:51 -07:00
blk-sysfs.c block: create the request_queue debugfs_dir on registration 2020-06-24 09:15:58 -06:00
blk-throttle.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
blk-timeout.c block: make blk_timeout_init() static 2020-07-17 07:13:42 -06:00
blk-wbt.c blk-wbt: rename __wbt_update_limits to wbt_update_limits 2020-05-29 16:30:39 -06:00
blk-wbt.h blk-wbt: remove wbt_update_limits 2020-05-29 16:30:39 -06:00
blk-zoned.c block: Modify revalidate zones 2020-05-12 20:36:28 -06:00
blk.h block: improve discard bio alignment in __blkdev_issue_discard() 2020-07-17 07:15:10 -06:00
bounce.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
bsg-lib.c blk-mq: move failure injection out of blk_mq_complete_request 2020-06-24 09:15:57 -06:00
bsg.c compat_ioctl: bsg: add handler 2020-01-03 09:33:21 +01:00
cmdline-parser.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
elevator.c Merge branch 'for-linus' into for-5.5/block 2019-11-07 12:27:19 -07:00
genhd.c md: switch to ->check_events for media change notifications 2020-07-08 16:19:47 -06:00
ioctl.c block: Fix type of first compat_put_{,u}long() argument 2020-05-19 09:40:29 -06:00
ioprio.c docs: block: convert to ReST 2019-07-15 09:20:27 -03:00
Kconfig treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
Kconfig.iosched treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
keyslot-manager.c block: Make blk-integrity preclude hardware inline encryption 2020-05-14 09:48:03 -06:00
kyber-iosched.c blk-mq: remove the bio argument to ->prepare_request 2020-05-29 10:23:24 -06:00
Makefile blk-mq: merge blk-softirq.c into blk-mq.c 2020-06-24 09:15:56 -06:00
mq-deadline.c blk-mq: remove the bio argument to ->prepare_request 2020-05-29 10:23:24 -06:00
opal_proto.h block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
scsi_ioctl.c scsi: core: Allow non-root users to perform ZBC commands 2020-03-16 18:26:31 -04:00
sed-opal.c block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
t10-pi.c block: Allow t10-pi to be modular 2020-01-06 20:59:04 -07:00