The explanatory comment used `set_task_state` instead of
`set_current_state` which is the function actually used in the code.
Signed-off-by: Alvaro Parker <alparkerdf@gmail.com>
Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody is maintaining this code, and it just falls under the umbrella
of block layer code. But at least mark it as such, in case anyone wants
to care more deeply about it and assume the responsibility of doing so.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider the following scenario:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\-------------\ \-------------\ \--------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.
Expected result: caller will allocate a new bfqq for BIC1
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
| | |
\-------------\ \--------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 1 3
Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider the following merge chain:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.
Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If request timetout is handled by nbd_requeue_cmd(), normal completion
has to be stopped for avoiding to complete this requeued request, other
use-after-free can be triggered.
Fix the race by clearing NBD_CMD_INFLIGHT in nbd_requeue_cmd(), meantime
make sure that cmd->lock is grabbed for clearing the flag and the
requeue.
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 2895f1831e91 ("nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240830034145.1827742-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD updates from Song:
"Major changes in this set are:
1. md-bitmap refactoring, by Yu Kuai;
2. raid5 performance optimization, by Artur Paszkiewicz;
3. Other small fixes, by Yu Kuai and Chen Ni."
* tag 'md-6.12-20240829' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (49 commits)
md/raid5: rename wait_for_overlap to wait_for_reshape
md/raid5: only add to wq if reshape is in progress
md/raid5: use wait_on_bit() for R5_Overlap
md: Remove flush handling
md/md-bitmap: make in memory structure internal
md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
md/md-bitmap: merge md_bitmap_free() into bitmap_operations
md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
md/md-bitmap: pass in mddev directly for md_bitmap_resize()
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
...
From Artur:
The wait_for_overlap wait queue is currently used in two cases, which
are not really related:
- waiting for actual overlapping bios, which uses R5_Overlap bit,
- waiting for events related to reshape.
Handling every write request in raid5_make_request() involves adding to
and removing from this wait queue, which uses a spinlock. With fast
storage and multiple submitting threads the contention on this lock is
noticeable.
This patch series aims to resolve this by separating the two cases
mentioned above and using this wait queue only when reshape is in
progress.
The results when testing 4k random writes on raid5 with null_blk
(8 jobs, qd=64, group_thread_cnt=8):
before: 463k IOPS
after: 523k IOPS
The improvement is not huge with this series alone but it is just one of
the bottlenecks. When applied onto some other changes I'm working on, it
allowed to go from 845k IOPS to 975k IOPS on the same test.
* md-6.12-raid5-opt:
md/raid5: rename wait_for_overlap to wait_for_reshape
md/raid5: only add to wq if reshape is in progress
md/raid5: use wait_on_bit() for R5_Overlap
Signed-off-by: Song Liu <song@kernel.org>
Now that actual overlaps are not handled on the wait_for_overlap wq
anymore, the remaining cases when we wait on this wq are limited to
reshape. If reshape is not in progress, don't add to the wq in
raid5_make_request() because add_wait_queue() / remove_wait_queue()
operations take a spinlock and cause noticeable contention when multiple
threads are submitting requests to the mddev.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-3-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>
bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly list, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload and
bio_for_each_bvec loop will never execute it's body if bi_size is 0.
But all this is hard to understand, fragile and wasted pointless cycles.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE command and don't attempt any kind split for operation that do not
require splitting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.
We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_max_zone_append_sectors doesn't change the lim argument,
so mark it as const.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.
Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.
To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From Yu Kuai (with minor changes by Song Liu):
The background is that currently bitmap is using a global spin_lock,
causing lock contention and huge IO performance degradation for all raid
levels.
However, it's impossible to implement a new lock free bitmap with
current situation that md-bitmap exposes the internal implementation
with lots of exported apis. Hence bitmap_operations is invented, to
describe bitmap core implementation, and a new bitmap can be introduced
with a new bitmap_operations, we only need to switch to the new one
during initialization.
And with this we can build bitmap as kernel module, but that's not
our concern for now.
This version was tested with mdadm tests and lvm2 tests. This set does
not introduce new errors in these tests.
* md-6.12-bitmap: (42 commits)
md/md-bitmap: make in memory structure internal
md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
md/md-bitmap: merge md_bitmap_free() into bitmap_operations
md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
md/md-bitmap: pass in mddev directly for md_bitmap_resize()
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()
md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations
md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations
...
Signed-off-by: Song Liu <song@kernel.org>
The bio->bi_iter.bi_size is updated when bio_add_page() is called. So we
do not need to assign msg->bi_size again to it, since its redudant and
can also be harmful. Instead we can use it to add a sanity check, which
checks the locally calculated bi_size, with the one sent in msg.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For flush request, md has a special flush handling to merge concurrent
flush request into single one, however, the whole mechanism is based on
a disk level spin_lock 'mddev->lock'. And fsync can be called quite
often in some user cases, for consequence, spin lock from IO fast path can
cause performance degradation.
Fortunately, the block layer already has flush handling to merge
concurrent flush request, and it only acquires hctx level spin lock. (see
details in blk-flush.c)
This patch removes the flush handling in md, and converts to use general
block layer flush handling in underlying disks.
Flush test for 4 nvme raid10:
start 128 threads to do fsync 100000 times, on arm64, see how long it
takes.
Test script:
void* thread_func(void* arg) {
int fd = *(int*)arg;
for (int i = 0; i < FSYNC_COUNT; i++) {
fsync(fd);
}
return NULL;
}
int main() {
int fd = open("/dev/md0", O_RDWR);
if (fd < 0) {
perror("open");
exit(1);
}
pthread_t threads[THREADS];
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < THREADS; i++) {
pthread_create(&threads[i], NULL, thread_func, &fd);
}
for (int i = 0; i < THREADS; i++) {
pthread_join(threads[i], NULL);
}
gettimeofday(&end, NULL);
close(fd);
long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
printf("Elapsed time: %lld microseconds\n", elapsed);
return 0;
}
Test result: about 10 times faster:
Before this patch: 50943374 microseconds
After this patch: 5096347 microseconds
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Now that struct bitmap_page and bitmap is not used externally anymore,
move them from md-bitmap.h to md-bitmap.c (expect that dm-raid is still
using define marco 'COUNTER_MAX').
Also fix some checkpatch warnings.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as
well, on the one hand make code cleaner, on the other hand try not to
access bitmap directly.
Since we are here, also change the parameter 'init' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Add a parameter 'bool sync' to distinguish them, and
md_bitmap_unplug_async() won't be exported anymore, hence
bitmap_operations only need one op to cover them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-30-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-29-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-28-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
For internal callers, aborted are always set to false, while for
external callers, aborted are always set to true.
Hence there is no need to always pass in true for exported api.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-27-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Also fix lots of code style.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-26-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'success' and 'behind' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-25-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'behind' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-24-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
And while we're here, also fix coding style for bitmap_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>