// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
 * Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 * Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
 * kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
 *	- July2000
 * bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
 */

/*
 * This handles all read/write requests to block devices
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
#include <linux/highmem.h>
#include <linux/mm.h>
mm: move readahead prototypes from mm.h
Patch series "Change readahead API", v11.
This series adds a readahead address_space operation to replace the
readpages operation. The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache. It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
The only unconverted filesystems are those which use fscache. Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier. This should be completed by the end of
the year.
I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.
These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).
This patch (of 25):
The readahead code is part of the page cache so should be found in the
pagemap.h file. force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead. Remove the parameter names where they
add no value, and rename the ones which were actively misleading.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-01 21:46:07 -07:00
#include <linux/pagemap.h>
#include <linux/kernel_stat.h>
#include <linux/string.h>
#include <linux/init.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fault-inject.h>
#include <linux/list_sort.h>
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
#include <linux/t10-pi.h>
#include <linux/debugfs.h>
#include <linux/bpf.h>
#include <linux/part_stat.h>
#include <linux/sched/sysctl.h>
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
#include <linux/blk-crypto.h>
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print.
blktrace, in contrast, does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
     dd                  dd + ioctl blktrace   dd + TRACE_EVENT (splice)
1    7.36s, 42.7 MB/s    7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
2    7.43s, 42.3 MB/s    7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
3    7.38s, 42.6 MB/s    7.45s, 42.2 MB/s      7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 13:43:05 +08:00

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>

#include "blk.h"
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-cgroup.h"
#include "blk-throttle.h"
#include "blk-ioprio.h"

struct dentry *blk_debugfs_root;

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_insert);

static DEFINE_IDA(blk_queue_ida);

/*
 * For queue allocation
 */
static struct kmem_cache *blk_requestq_cachep;

/*
 * Controlling structure to kblockd
 */
static struct workqueue_struct *kblockd_workqueue;

/**
 * blk_queue_flag_set - atomically set a queue flag
 * @flag: flag to be set
 * @q: request queue
 */
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
{
	set_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_set);

/**
 * blk_queue_flag_clear - atomically clear a queue flag
 * @flag: flag to be cleared
 * @q: request queue
 */
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
{
	clear_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_clear);
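
/*
 * Illustrative sketch (not part of the kernel source): the two helpers above
 * are thin wrappers around atomic bit operations on q->queue_flags. A rough
 * userspace analogue, assuming C11 <stdatomic.h> instead of the kernel's
 * set_bit()/clear_bit(), might look like the block below. The demo_* names
 * and struct demo_queue are hypothetical and exist only for this example.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request_queue: only the flag word. */
struct demo_queue {
	atomic_ulong queue_flags;
};

/* Atomically set bit @flag (userspace analogue of set_bit()). */
void demo_queue_flag_set(unsigned int flag, struct demo_queue *q)
{
	atomic_fetch_or(&q->queue_flags, 1UL << flag);
}

/* Atomically clear bit @flag (userspace analogue of clear_bit()). */
void demo_queue_flag_clear(unsigned int flag, struct demo_queue *q)
{
	atomic_fetch_and(&q->queue_flags, ~(1UL << flag));
}

/* Read back a single bit of the flag word. */
bool demo_queue_flag_test(unsigned int flag, struct demo_queue *q)
{
	return (atomic_load(&q->queue_flags) & (1UL << flag)) != 0;
}
```

/*
 * Because each update is a single atomic read-modify-write, concurrent
 * callers changing different bits cannot lose each other's updates, which
 * is the property the kernel helpers rely on.
 */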

#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
static const char *const blk_op_name[] = {
	REQ_OP_NAME(READ),
	REQ_OP_NAME(WRITE),
	REQ_OP_NAME(FLUSH),
	REQ_OP_NAME(DISCARD),
	REQ_OP_NAME(SECURE_ERASE),
	REQ_OP_NAME(ZONE_RESET),
	REQ_OP_NAME(ZONE_RESET_ALL),
	REQ_OP_NAME(ZONE_OPEN),
	REQ_OP_NAME(ZONE_CLOSE),
	REQ_OP_NAME(ZONE_FINISH),
	REQ_OP_NAME(ZONE_APPEND),
	REQ_OP_NAME(WRITE_ZEROES),
	REQ_OP_NAME(DRV_IN),
	REQ_OP_NAME(DRV_OUT),
};
#undef REQ_OP_NAME

/**
 * blk_op_str - Return string XXX in the REQ_OP_XXX.
 * @op: REQ_OP_XXX.
 *
 * Description: Centralize block layer function to convert REQ_OP_XXX into
 * string format. Useful in the debugging and tracing bio or request. For
 * invalid REQ_OP_XXX it returns string "UNKNOWN".
 */
inline const char *blk_op_str(enum req_op op)
{
	const char *op_str = "UNKNOWN";

	if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
		op_str = blk_op_name[op];

	return op_str;
}
EXPORT_SYMBOL_GPL(blk_op_str);
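
/*
 * Illustrative sketch (not part of the kernel source): the REQ_OP_NAME table
 * above combines designated initializers with preprocessor stringification,
 * so holes in the opcode space are left as NULL and fall back to "UNKNOWN".
 * The standalone block below reproduces that pattern in userspace. The
 * demo_* names and the opcode values are hypothetical; the gap at value 2 is
 * deliberate, to show the NULL-entry fallback.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes; value 2 is intentionally unassigned. */
enum demo_op {
	DEMO_OP_READ = 0,
	DEMO_OP_WRITE = 1,
	DEMO_OP_FLUSH = 3,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Expands to [DEMO_OP_x] = "x": index by opcode, stringify the suffix. */
#define DEMO_OP_NAME(name) [DEMO_OP_##name] = #name
static const char *const demo_op_name[] = {
	DEMO_OP_NAME(READ),
	DEMO_OP_NAME(WRITE),
	DEMO_OP_NAME(FLUSH),
};
#undef DEMO_OP_NAME

/* Out-of-range values and holes in the table both map to "UNKNOWN". */
const char *demo_op_str(unsigned int op)
{
	if (op < DEMO_ARRAY_SIZE(demo_op_name) && demo_op_name[op])
		return demo_op_name[op];
	return "UNKNOWN";
}
```

/*
 * Both checks are needed: the bounds check guards against values past the
 * end of the array, and the NULL check guards against unassigned slots
 * inside it.
 */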

static const struct {
	int		errno;
	const char	*name;
} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		"" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
	[BLK_STS_RESV_CONFLICT]	= { -EBADE,	"reservation conflict" },
	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
	[BLK_STS_OFFLINE]	= { -ENODEV,	"device offline" },

	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },

	/* zone device specific errors */
	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },

	/* Command duration limit device-side timeout */
	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },

block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary: an atomic write that straddles it is no
longer executed atomically by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are set to 0 by default to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 12:53:54 +00:00
	[BLK_STS_INVAL]		= { -EINVAL,	"invalid" },

	/* everything else not covered above: */
	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
};

blk_status_t errno_to_blk_status(int errno)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
		if (blk_errors[i].errno == errno)
			return (__force blk_status_t)i;
	}

	return BLK_STS_IOERR;
}
EXPORT_SYMBOL_GPL(errno_to_blk_status);

int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);

const char *blk_status_to_str(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return "<null>";
	return blk_errors[idx].name;
}
EXPORT_SYMBOL_GPL(blk_status_to_str);
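
/*
 * Illustrative sketch (not part of the kernel source): blk_errors[] is one
 * table serving three lookups: status to errno and status to string by
 * direct index, and errno to status by linear search with an I/O-error
 * fallback. The userspace block below mirrors that shape under stated
 * assumptions: the demo_* names and the reduced set of status codes are
 * hypothetical, and the struct field is called err because errno is a macro
 * in userspace C and cannot be used as a member name there.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes; DEMO_STS_IOERR doubles as the catch-all. */
enum demo_status {
	DEMO_STS_OK = 0,
	DEMO_STS_NOSPC,
	DEMO_STS_TIMEOUT,
	DEMO_STS_IOERR,
};

static const struct {
	int err;		/* negative errno value */
	const char *name;	/* short human-readable description */
} demo_errors[] = {
	[DEMO_STS_OK]      = { 0,          "" },
	[DEMO_STS_NOSPC]   = { -ENOSPC,    "critical space allocation" },
	[DEMO_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
	[DEMO_STS_IOERR]   = { -EIO,       "I/O" },
};

#define DEMO_NR_ERRORS (sizeof(demo_errors) / sizeof(demo_errors[0]))

/* Linear search; any errno not in the table collapses to the I/O error. */
enum demo_status demo_errno_to_status(int err)
{
	for (size_t i = 0; i < DEMO_NR_ERRORS; i++) {
		if (demo_errors[i].err == err)
			return (enum demo_status)i;
	}
	return DEMO_STS_IOERR;
}

/* Direct index, with a bounds check in place of WARN_ON_ONCE(). */
int demo_status_to_errno(enum demo_status status)
{
	if ((size_t)status >= DEMO_NR_ERRORS)
		return -EIO;
	return demo_errors[status].err;
}

const char *demo_status_to_str(enum demo_status status)
{
	if ((size_t)status >= DEMO_NR_ERRORS)
		return "<null>";
	return demo_errors[status].name;
}
```

/*
 * Note that the mapping is lossy in one direction: every errno without a
 * table entry converts to the catch-all status, so a round trip through
 * the table does not always return the original errno.
 */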

/**
 * blk_sync_queue - cancel any pending callbacks on a queue
 * @q: the queue
 *
 * Description:
 *     The block layer may perform asynchronous callback activity
 *     on a queue, such as calling the unplug function after a timeout.
 *     A block device may call blk_sync_queue to ensure that any
 *     such activity is cancelled, thus allowing it to release resources
 *     that the callbacks might use. The caller must already have made sure
 *     that its ->submit_bio will not re-add plugging prior to calling
 *     this function.
 *
 *     This function does not cancel any asynchronous activity arising
 *     out of elevator or throttling code. That would require elevator_exit()
 *     and blkcg_exit_queue() to be called with queue lock initialized.
 *
 */
void blk_sync_queue(struct request_queue *q)
{
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
}
EXPORT_SYMBOL(blk_sync_queue);

/**
 * blk_set_pm_only - increment pm_only counter
 * @q: request queue pointer
 */
void blk_set_pm_only(struct request_queue *q)
{
	atomic_inc(&q->pm_only);
}
EXPORT_SYMBOL_GPL(blk_set_pm_only);

void blk_clear_pm_only(struct request_queue *q)
{
	int pm_only;

	pm_only = atomic_dec_return(&q->pm_only);
	WARN_ON_ONCE(pm_only < 0);
	if (pm_only == 0)
		wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_clear_pm_only);

static void blk_free_queue_rcu(struct rcu_head *rcu_head)
{
	struct request_queue *q = container_of(rcu_head,
			struct request_queue, rcu_head);

	percpu_ref_exit(&q->q_usage_counter);
	kmem_cache_free(blk_requestq_cachep, q);
}

static void blk_free_queue(struct request_queue *q)
{
	blk_free_queue_stats(q->stats);
	if (queue_is_mq(q))
		blk_mq_release(q);

	ida_free(&blk_queue_ida, q->id);
	lockdep_unregister_key(&q->io_lock_cls_key);
	lockdep_unregister_key(&q->q_lock_cls_key);
	call_rcu(&q->rcu_head, blk_free_queue_rcu);
}

/**
 * blk_put_queue - decrement the request_queue refcount
 * @q: the request_queue structure to decrement the refcount for
 *
 * Decrements the refcount of the request_queue and free it when the refcount
 * reaches 0.
 */
void blk_put_queue(struct request_queue *q)
{
	if (refcount_dec_and_test(&q->refs))
		blk_free_queue(q);
}
EXPORT_SYMBOL(blk_put_queue);
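
/*
 * Illustrative sketch (not part of the kernel source): blk_put_queue()
 * follows the usual "decrement and test" pattern, where whichever caller
 * drops the last reference runs the release path. The userspace block below
 * shows that pattern with C11 atomics. It is a simplification under stated
 * assumptions: the demo_* names are hypothetical, and the object is freed
 * immediately, whereas the kernel code above defers the final free through
 * call_rcu().
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for struct request_queue. */
struct demo_obj {
	atomic_int refs;
	int *freed;	/* set to 1 by the release path, for observation only */
};

struct demo_obj *demo_obj_alloc(int *freed)
{
	struct demo_obj *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	atomic_init(&obj->refs, 1);	/* the creator holds the first reference */
	obj->freed = freed;
	return obj;
}

void demo_obj_get(struct demo_obj *obj)
{
	atomic_fetch_add(&obj->refs, 1);
}

/* Returns true if this call dropped the last reference and freed @obj. */
bool demo_obj_put(struct demo_obj *obj)
{
	/* atomic_fetch_sub() returns the old value, so 1 means "now zero". */
	if (atomic_fetch_sub(&obj->refs, 1) != 1)
		return false;
	*obj->freed = 1;
	free(obj);
	return true;
}
```

/*
 * Only the caller that observes the transition to zero may touch the
 * object afterwards; every other caller must treat its pointer as dead
 * once its own put has returned.
 */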

bool blk_queue_start_drain(struct request_queue *q)
{
	/*
	 * When queue DYING flag is set, we need to block new req
	 * entering queue, so we call blk_freeze_queue_start() to
	 * prevent I/O from crossing blk_queue_enter().
	 */
	bool freeze = __blk_freeze_queue_start(q, current);
	if (queue_is_mq(q))
		blk_mq_wake_waiters(q);
	/* Make blk_queue_enter() reexamine the DYING flag. */
	wake_up_all(&q->mq_freeze_wq);

	return freeze;
}

/**
 * blk_queue_enter() - try to increase q->q_usage_counter
 * @q: request queue pointer
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
 */
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	const bool pm = flags & BLK_MQ_REQ_PM;

	while (!blk_try_enter_queue(q, pm)) {
		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EAGAIN;

		/*
		 * read pair of barrier in blk_freeze_queue_start(), we need to
		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
		 * reading .mq_freeze_depth or queue dying flag, otherwise the
		 * following wait may never return if the two reads are
		 * reordered.
		 */
		smp_rmb();
		wait_event(q->mq_freeze_wq,
			   (!q->mq_freeze_depth &&
			    blk_pm_resume_queue(pm, q)) ||
			   blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}

	rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->q_lockdep_map, _RET_IP_);
	return 0;
}

int __bio_queue_enter(struct request_queue *q, struct bio *bio)
{
	while (!blk_try_enter_queue(q, false)) {
		struct gendisk *disk = bio->bi_bdev->bd_disk;

		if (bio->bi_opf & REQ_NOWAIT) {
			if (test_bit(GD_DEAD, &disk->state))
				goto dead;
			bio_wouldblock_error(bio);
			return -EAGAIN;
		}

		/*
		 * read pair of barrier in blk_freeze_queue_start(), we need to
		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
		 * reading .mq_freeze_depth or queue dying flag, otherwise the
		 * following wait may never return if the two reads are
		 * reordered.
		 */
		smp_rmb();
		wait_event(q->mq_freeze_wq,
			   (!q->mq_freeze_depth &&
			    blk_pm_resume_queue(false, q)) ||
			   test_bit(GD_DEAD, &disk->state));
		if (test_bit(GD_DEAD, &disk->state))
			goto dead;
	}

	rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->io_lockdep_map, _RET_IP_);
	return 0;
dead:
	bio_io_error(bio);
	return -ENODEV;
}

void blk_queue_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);
}

static void blk_queue_usage_counter_release(struct percpu_ref *ref)
{
	struct request_queue *q =
		container_of(ref, struct request_queue, q_usage_counter);

	wake_up_all(&q->mq_freeze_wq);
}

static void blk_rq_timed_out_timer(struct timer_list *t)
{
	struct request_queue *q = from_timer(q, t, timeout);

	kblockd_schedule_work(&q->timeout_work);
}

static void blk_timeout_work(struct work_struct *work)
{
}

struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
{
	struct request_queue *q;
	int error;

	q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
				  node_id);
	if (!q)
		return ERR_PTR(-ENOMEM);

	q->last_merge = NULL;

	q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
	if (q->id < 0) {
		error = q->id;
		goto fail_q;
	}

	q->stats = blk_alloc_queue_stats();
	if (!q->stats) {
		error = -ENOMEM;
		goto fail_id;
	}

	error = blk_set_default_limits(lim);
	if (error)
		goto fail_stats;
	q->limits = *lim;

	q->node = node_id;

	atomic_set(&q->nr_active_requests_shared_tags, 0);

	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
	INIT_WORK(&q->timeout_work, blk_timeout_work);
	INIT_LIST_HEAD(&q->icq_list);

	refcount_set(&q->refs, 1);
	mutex_init(&q->debugfs_mutex);
	mutex_init(&q->sysfs_lock);
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too, see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also sysfs write(store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.
sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 19:01:48 +08:00
|
|
|
mutex_init(&q->sysfs_dir_lock);
|
2024-02-13 08:34:14 +01:00
|
|
|
mutex_init(&q->limits_lock);
|
block/rq_qos: protect rq_qos apis with a new lock
commit 50e34d78815e ("block: disable the elevator int del_gendisk")
moved rq_qos_exit() from disk_release() to del_gendisk(), which
introduces some problems:
1) If rq_qos_add() is triggered by enabling iocost/iolatency through
cgroupfs, it can run concurrently with del_gendisk(), and it is not
safe to write 'q->rq_qos' concurrently.
2) Activating a cgroup policy that relies on rq_qos calls
rq_qos_add() and blkcg_activate_policy(), and if rq_qos_exit() is
called in the middle, a null-ptr-dereference will be triggered in
blkcg_activate_policy().
3) blkg_conf_open_bdev() can call blkdev_get_no_open() first to find the
disk; if rq_qos_exit() from del_gendisk() then completes before
rq_qos_add(), memory will be leaked.
This patch adds a new disk level mutex 'rq_qos_mutex':
1) The lock protects rq_qos_exit() directly.
2) For wbt, which doesn't rely on blk-cgroup, rq_qos_add() can only be
called from disk initialization for now because wbt can't be
destroyed until rq_qos_exit(), so it's safe not to protect wbt for
now. However, in case dynamic destruction of rq_qos is supported
in the future, this patch also protects rq_qos_add() from wbt_init()
directly. This is enough because blk-sysfs already synchronizes
writers with disk removal.
3) For iocost and iolatency, in order to synchronize disk removal and
cgroup configuration, the lock is taken after blkdev_get_no_open()
in blkg_conf_open_bdev(), and is released in blkg_conf_exit().
In order to fix the above memory leak, disk_live() is checked after
taking the new lock.
Fixes: 50e34d78815e ("block: disable the elevator int del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230414084008.2085155-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-14 16:40:08 +08:00
|
|
|
mutex_init(&q->rq_qos_mutex);
|
2018-11-15 12:17:28 -07:00
|
|
|
spin_lock_init(&q->queue_lock);
|
2011-03-02 19:04:42 -05:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 09:20:05 +01:00
|
|
|
init_waitqueue_head(&q->mq_freeze_wq);
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_init(&q->mq_freeze_lock);
|
2013-10-24 09:20:05 +01:00
|
|
|
|
2024-04-07 20:59:10 +08:00
|
|
|
blkg_init_queue(q);
|
|
|
|
|
2015-10-21 13:20:12 -04:00
|
|
|
/*
|
|
|
|
* Init percpu_ref in atomic mode so that it's faster to shut down.
|
|
|
|
* See blk_register_queue() for details.
|
|
|
|
*/
|
2024-02-13 08:34:18 +01:00
|
|
|
error = percpu_ref_init(&q->q_usage_counter,
|
2015-10-21 13:20:12 -04:00
|
|
|
blk_queue_usage_counter_release,
|
2024-02-13 08:34:18 +01:00
|
|
|
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
|
|
|
|
if (error)
|
2021-08-09 16:17:43 +02:00
|
|
|
goto fail_stats;
|
2024-10-25 08:37:20 +08:00
|
|
|
lockdep_register_key(&q->io_lock_cls_key);
|
|
|
|
lockdep_register_key(&q->q_lock_cls_key);
|
|
|
|
lockdep_init_map(&q->io_lockdep_map, "&q->q_usage_counter(io)",
|
|
|
|
&q->io_lock_cls_key, 0);
|
|
|
|
lockdep_init_map(&q->q_lockdep_map, "&q->q_usage_counter(queue)",
|
|
|
|
&q->q_lock_cls_key, 0);
|
2012-03-05 13:15:05 -08:00
|
|
|
|
2021-10-05 18:23:27 +08:00
|
|
|
q->nr_requests = BLKDEV_DEFAULT_RQ;
|
2020-03-27 09:30:11 +01:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
return q;
|
2011-12-14 00:33:37 +01:00
|
|
|
|
2017-03-21 17:20:01 -06:00
|
|
|
fail_stats:
|
2021-08-09 16:17:43 +02:00
|
|
|
blk_free_queue_stats(q->stats);
|
2011-12-14 00:33:37 +01:00
|
|
|
fail_id:
|
2022-06-15 04:18:16 -04:00
|
|
|
ida_free(&blk_queue_ida, q->id);
|
2011-12-14 00:33:37 +01:00
|
|
|
fail_q:
|
2022-11-01 16:00:47 +01:00
|
|
|
kmem_cache_free(blk_requestq_cachep, q);
|
2024-02-13 08:34:18 +01:00
|
|
|
return ERR_PTR(error);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
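The error handling in blk_alloc_queue() above follows the usual kernel pattern of stacked cleanup labels: each failure jumps to the label that releases only what has already been acquired, in reverse order. The following is a minimal user-space sketch of that pattern, not the kernel code. `struct demo_queue`, `demo_queue_init()` and the `fail_step` parameter are hypothetical names introduced purely for illustration, and `malloc()`/`free()` stand in for the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_queue {		/* hypothetical, not struct request_queue */
	int *id;
	int *stats;
};

/*
 * fail_step selects which step fails (0 = none), so that every
 * unwind path can be exercised from a test.
 */
static int demo_queue_init(struct demo_queue *q, int fail_step)
{
	int error;

	q->id = fail_step == 1 ? NULL : malloc(sizeof(*q->id));
	if (!q->id) {
		error = -ENOMEM;
		goto fail_q;		/* nothing acquired yet */
	}

	q->stats = fail_step == 2 ? NULL : malloc(sizeof(*q->stats));
	if (!q->stats) {
		error = -ENOMEM;
		goto fail_id;		/* only the id must be released */
	}

	if (fail_step == 3) {		/* a later validation step fails */
		error = -EINVAL;
		goto fail_stats;	/* release stats, then fall through */
	}
	return 0;

fail_stats:
	free(q->stats);
	q->stats = NULL;
fail_id:
	free(q->id);
	q->id = NULL;
fail_q:
	return error;
}
```

Because the labels fall through into one another, adding a new acquisition step only requires one new label placed above the existing ones.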
|
|
|
|
|
2020-06-19 20:47:23 +00:00
|
|
|
/**
|
|
|
|
* blk_get_queue - increment the request_queue refcount
|
|
|
|
* @q: the request_queue structure to increment the refcount for
|
|
|
|
*
|
|
|
|
* Increment the refcount of the request_queue kobject.
|
2020-06-19 20:47:24 +00:00
|
|
|
*
|
|
|
|
* Context: Any context.
|
2020-06-19 20:47:23 +00:00
|
|
|
*/
|
2011-12-14 00:33:38 +01:00
|
|
|
bool blk_get_queue(struct request_queue *q)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2022-07-21 08:34:32 +02:00
|
|
|
if (unlikely(blk_queue_dying(q)))
|
|
|
|
return false;
|
2022-11-14 05:26:36 +01:00
|
|
|
refcount_inc(&q->refs);
|
2022-07-21 08:34:32 +02:00
|
|
|
return true;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2011-05-27 07:44:43 +02:00
|
|
|
EXPORT_SYMBOL(blk_get_queue);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2006-12-08 02:39:46 -08:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
|
|
|
|
|
|
|
static DECLARE_FAULT_ATTR(fail_make_request);
|
|
|
|
|
|
|
|
static int __init setup_fail_make_request(char *str)
|
|
|
|
{
|
|
|
|
return setup_fault_attr(&fail_make_request, str);
|
|
|
|
}
|
|
|
|
__setup("fail_make_request=", setup_fail_make_request);
|
|
|
|
|
2021-11-17 07:13:58 +01:00
|
|
|
bool should_fail_request(struct block_device *part, unsigned int bytes)
|
2006-12-08 02:39:46 -08:00
|
|
|
{
|
2024-04-28 00:15:07 -04:00
|
|
|
return bdev_test_flag(part, BD_MAKE_IT_FAIL) &&
|
|
|
|
should_fail(&fail_make_request, bytes);
|
2006-12-08 02:39:46 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __init fail_make_request_debugfs(void)
|
|
|
|
{
|
2011-08-03 16:21:01 -07:00
|
|
|
struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
|
|
|
|
NULL, &fail_make_request);
|
|
|
|
|
2014-04-11 15:58:56 +08:00
|
|
|
return PTR_ERR_OR_ZERO(dir);
|
2006-12-08 02:39:46 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
late_initcall(fail_make_request_debugfs);
|
|
|
|
#endif /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2022-09-05 18:27:54 +08:00
|
|
|
static inline void bio_check_ro(struct bio *bio)
|
2018-01-11 14:09:11 +01:00
|
|
|
{
|
2021-01-24 11:02:35 +01:00
|
|
|
if (op_is_write(bio_op(bio)) && bdev_read_only(bio->bi_bdev)) {
|
2018-09-05 16:14:36 -06:00
|
|
|
if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
|
2022-09-05 18:27:54 +08:00
|
|
|
return;
|
2023-11-28 20:30:27 +08:00
|
|
|
|
2024-04-12 01:24:27 -04:00
|
|
|
if (bdev_test_flag(bio->bi_bdev, BD_RO_WARNED))
|
2023-11-28 20:30:27 +08:00
|
|
|
return;
|
|
|
|
|
2024-04-12 01:24:27 -04:00
|
|
|
bdev_set_flag(bio->bi_bdev, BD_RO_WARNED);
|
|
|
|
|
2023-11-28 20:30:27 +08:00
|
|
|
/*
|
|
|
|
* Using an ioctl to set the underlying disk of raid/dm to read-only
|
|
|
|
* will trigger this.
|
|
|
|
*/
|
|
|
|
pr_warn("Trying to write to read-only block-device %pg\n",
|
|
|
|
bio->bi_bdev);
|
2018-01-11 14:09:11 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-02-06 14:05:39 -08:00
|
|
|
static noinline int should_fail_bio(struct bio *bio)
|
|
|
|
{
|
2021-01-24 11:02:34 +01:00
|
|
|
if (should_fail_request(bdev_whole(bio->bi_bdev), bio->bi_iter.bi_size))
|
2018-02-06 14:05:39 -08:00
|
|
|
return -EIO;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
|
|
|
|
|
2018-03-14 16:56:53 +01:00
|
|
|
/*
|
|
|
|
* Check whether this bio extends beyond the end of the device or partition.
|
|
|
|
* This may well happen - the kernel calls bread() without checking the size of
|
|
|
|
* the device, e.g., when mounting a file system.
|
|
|
|
*/
|
2021-01-24 11:02:35 +01:00
|
|
|
static inline int bio_check_eod(struct bio *bio)
|
2018-03-14 16:56:53 +01:00
|
|
|
{
|
2021-01-24 11:02:35 +01:00
|
|
|
sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
|
2018-03-14 16:56:53 +01:00
|
|
|
unsigned int nr_sectors = bio_sectors(bio);
|
|
|
|
|
2023-05-24 08:05:38 +02:00
|
|
|
if (nr_sectors &&
|
2018-03-14 16:56:53 +01:00
|
|
|
(nr_sectors > maxsector ||
|
|
|
|
bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
|
2022-03-04 19:00:57 +01:00
|
|
|
pr_info_ratelimited("%s: attempt to access beyond end of device\n"
|
2022-05-04 07:33:55 -07:00
|
|
|
"%pg: rw=%d, sector=%llu, nr_sectors = %u limit=%llu\n",
|
|
|
|
current->comm, bio->bi_bdev, bio->bi_opf,
|
|
|
|
bio->bi_iter.bi_sector, nr_sectors, maxsector);
|
2018-03-14 16:56:53 +01:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
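The condition in bio_check_eod() above is written as `sector > maxsector - nr_sectors` rather than `sector + nr_sectors > maxsector` so that the addition cannot wrap around. A small user-space sketch of the same comparison follows; `range_beyond_end()` is a hypothetical helper name, and plain `uint64_t` values stand in for `sector_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the bounds test in bio_check_eod().
 * Returns true when the range [start, start + nr) extends past a
 * device that is max sectors long.
 */
static bool range_beyond_end(uint64_t start, uint64_t nr, uint64_t max)
{
	if (nr == 0)
		return false;	/* zero-length I/O is never out of range */
	if (nr > max)
		return true;	/* also guarantees max - nr cannot underflow */
	return start > max - nr;	/* same as start + nr > max, no overflow */
}
```

For an 8-sector device, an 8-sector access at sector 0 is accepted while the same access at sector 1 is rejected. A start sector near `UINT64_MAX` is also rejected, whereas a naive `start + nr` would wrap around and pass.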
|
|
|
|
|
2017-08-23 19:10:32 +02:00
|
|
|
/*
|
|
|
|
* Remap block n of partition p to block n+start(p) of the disk.
|
|
|
|
*/
|
2021-01-24 11:02:35 +01:00
|
|
|
static int blk_partition_remap(struct bio *bio)
|
2017-08-23 19:10:32 +02:00
|
|
|
{
|
2021-01-24 11:02:34 +01:00
|
|
|
struct block_device *p = bio->bi_bdev;
|
2017-08-23 19:10:32 +02:00
|
|
|
|
2018-03-14 16:56:53 +01:00
|
|
|
if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
|
2021-01-24 11:02:35 +01:00
|
|
|
return -EIO;
|
2019-11-11 11:39:25 +09:00
|
|
|
if (bio_sectors(bio)) {
|
2020-11-24 09:36:54 +01:00
|
|
|
bio->bi_iter.bi_sector += p->bd_start_sect;
|
2020-12-03 17:21:38 +01:00
|
|
|
trace_block_bio_remap(bio, p->bd_dev,
|
2020-11-24 09:34:24 +01:00
|
|
|
bio->bi_iter.bi_sector -
|
2020-11-24 09:36:54 +01:00
|
|
|
p->bd_start_sect);
|
2018-03-14 16:56:53 +01:00
|
|
|
}
|
2021-01-24 11:02:36 +01:00
|
|
|
bio_set_flag(bio, BIO_REMAPPED);
|
2021-01-24 11:02:35 +01:00
|
|
|
return 0;
|
2017-08-23 19:10:32 +02:00
|
|
|
}
|
|
|
|
|
2020-05-12 17:55:47 +09:00
|
|
|
/*
|
|
|
|
* Check write append to a zoned block device.
|
|
|
|
*/
|
|
|
|
static inline blk_status_t blk_check_zone_append(struct request_queue *q,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
int nr_sectors = bio_sectors(bio);
|
|
|
|
|
|
|
|
/* Only applicable to zoned block devices */
|
2022-07-06 09:03:37 +02:00
|
|
|
if (!bdev_is_zoned(bio->bi_bdev))
|
2020-05-12 17:55:47 +09:00
|
|
|
return BLK_STS_NOTSUPP;
|
|
|
|
|
|
|
|
/* The bio sector must point to the start of a sequential zone */
|
2024-04-08 10:41:23 +09:00
|
|
|
if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector))
|
2020-05-12 17:55:47 +09:00
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not allowed to cross zone boundaries. Otherwise, the BIO will be
|
|
|
|
* split and could result in non-contiguous sectors being written in
|
|
|
|
* different zones.
|
|
|
|
*/
|
|
|
|
if (nr_sectors > q->limits.chunk_sectors)
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
/* Make sure the BIO is small enough and will not get split */
|
2024-11-08 16:46:51 +01:00
|
|
|
if (nr_sectors > q->limits.max_zone_append_sectors)
|
2020-05-12 17:55:47 +09:00
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
bio->bi_opf |= REQ_NOMERGE;
|
|
|
|
|
|
|
|
return BLK_STS_OK;
|
|
|
|
}
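The three size and alignment tests in blk_check_zone_append() above can be restated as a standalone predicate. This is a simplified sketch under stated assumptions: `struct demo_zone_limits` and `demo_zone_append_ok()` are hypothetical names, the zone size is taken as a plain nonzero sector count (the kernel code reads it from `q->limits.chunk_sectors`), and the zoned-device check and the REQ_NOMERGE flag are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_zone_limits {		/* hypothetical stand-in for queue limits */
	uint64_t zone_sectors;		/* zone size in sectors, must be nonzero */
	uint32_t max_append_sectors;	/* largest append the device accepts */
};

static bool demo_zone_append_ok(const struct demo_zone_limits *lim,
				uint64_t sector, uint32_t nr_sectors)
{
	if (sector % lim->zone_sectors)		/* must target the start of a zone */
		return false;
	if (nr_sectors > lim->zone_sectors)	/* must not cross a zone boundary */
		return false;
	if (nr_sectors > lim->max_append_sectors)	/* must not need splitting */
		return false;
	return true;
}
```

With 256-sector zones and a 64-sector append limit, an append of 64 sectors at sector 512 passes, while one at sector 520 fails the alignment test.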
|
|
|
|
|
2021-11-03 05:47:09 -06:00
|
|
|
static void __submit_bio(struct bio *bio)
|
|
|
|
{
|
2024-05-22 04:03:08 +08:00
|
|
|
/* If plug is not used, add new plug here to cache nsecs time. */
|
|
|
|
struct blk_plug plug;
|
|
|
|
|
2022-02-16 12:45:08 +08:00
|
|
|
if (unlikely(!blk_crypto_bio_prep(&bio)))
|
|
|
|
return;
|
|
|
|
|
2024-05-22 04:03:08 +08:00
|
|
|
blk_start_plug(&plug);
|
|
|
|
|
2024-04-12 01:21:45 -04:00
|
|
|
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
|
2021-10-12 13:12:24 +02:00
|
|
|
blk_mq_submit_bio(bio);
|
2022-02-16 12:45:08 +08:00
|
|
|
} else if (likely(bio_queue_enter(bio) == 0)) {
|
2023-04-14 07:32:02 -06:00
|
|
|
struct gendisk *disk = bio->bi_bdev->bd_disk;
|
|
|
|
|
2022-02-16 12:45:08 +08:00
|
|
|
disk->fops->submit_bio(bio);
|
|
|
|
blk_queue_exit(disk->queue);
|
|
|
|
}
|
2024-05-22 04:03:08 +08:00
|
|
|
|
|
|
|
blk_finish_plug(&plug);
|
2020-05-16 20:28:01 +02:00
|
|
|
}
|
|
|
|
|
2020-07-01 10:59:45 +02:00
|
|
|
/*
|
|
|
|
* The loop in this function may be a bit non-obvious, and so deserves some
|
|
|
|
* explanation:
|
|
|
|
*
|
|
|
|
* - Before entering the loop, bio->bi_next is NULL (as all callers ensure
|
|
|
|
* that), so we have a list with a single bio.
|
|
|
|
* - We pretend that we have just taken it off a longer list, so we assign
|
|
|
|
* bio_list to a pointer to the bio_list_on_stack, thus initialising the
|
|
|
|
* bio_list of new bios to be added. ->submit_bio() may indeed add some more
|
|
|
|
* bios through a recursive call to submit_bio_noacct. If it did, we find a
|
|
|
|
* non-NULL value in bio_list and re-enter the loop from the top.
|
|
|
|
* - In this case we really did just take the bio off the top of the list (no
|
|
|
|
* pretending) and so remove it from bio_list, and call into ->submit_bio()
|
|
|
|
* again.
|
|
|
|
*
|
|
|
|
* bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
|
|
|
|
* bio_list_on_stack[1] contains bios that were submitted before the current
|
2022-03-04 21:08:03 -05:00
|
|
|
* ->submit_bio, but that haven't been processed yet.
|
2020-07-01 10:59:45 +02:00
|
|
|
*/
|
2021-10-12 13:12:24 +02:00
|
|
|
static void __submit_bio_noacct(struct bio *bio)
|
2020-07-01 10:59:45 +02:00
|
|
|
{
|
|
|
|
struct bio_list bio_list_on_stack[2];
|
|
|
|
|
|
|
|
BUG_ON(bio->bi_next);
|
|
|
|
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
current->bio_list = bio_list_on_stack;
|
|
|
|
|
|
|
|
do {
|
2021-10-14 15:03:29 +01:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
2020-07-01 10:59:45 +02:00
|
|
|
struct bio_list lower, same;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a fresh bio_list for all subordinate requests.
|
|
|
|
*/
|
|
|
|
bio_list_on_stack[1] = bio_list_on_stack[0];
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
|
2021-10-12 13:12:24 +02:00
|
|
|
__submit_bio(bio);
|
2020-07-01 10:59:45 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Sort new bios into those for a lower level and those for the
|
|
|
|
* same level.
|
|
|
|
*/
|
|
|
|
bio_list_init(&lower);
|
|
|
|
bio_list_init(&same);
|
|
|
|
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
|
2021-10-14 15:03:29 +01:00
|
|
|
if (q == bdev_get_queue(bio->bi_bdev))
|
2020-07-01 10:59:45 +02:00
|
|
|
bio_list_add(&same, bio);
|
|
|
|
else
|
|
|
|
bio_list_add(&lower, bio);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now assemble so we handle the lowest level first.
|
|
|
|
*/
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &lower);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &same);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
|
|
|
|
} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
|
|
|
|
|
|
|
|
current->bio_list = NULL;
|
|
|
|
}
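The core idea of the loop above, turning recursive submission from stacked drivers into iteration over a per-thread list, can be shown in a small user-space model. This is only a sketch: all `demo_*` names, the fixed-size `pool`, and the "each bio splits into two for the level below" driver are assumptions made for illustration, and the sorting into `lower` and `same` lists is deliberately omitted, so the model is closer to the simple FIFO loop than to the full function.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio {			/* hypothetical, not struct bio */
	struct demo_bio *next;
	int level;			/* 0 = bottom device */
};

struct demo_list { struct demo_bio *head, *tail; };

static struct demo_list *pending;	/* plays the role of current->bio_list */
static int depth, max_depth, completed;
static struct demo_bio pool[64];
static int pool_used;

static void list_add_tail(struct demo_list *l, struct demo_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct demo_bio *list_pop(struct demo_list *l)
{
	struct demo_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

static void demo_submit(struct demo_bio *bio);

/* A stacked driver: splits each bio into two bios for the level below. */
static void demo_driver(struct demo_bio *bio)
{
	if (++depth > max_depth)
		max_depth = depth;	/* track how deep driver calls nest */
	if (bio->level == 0) {
		completed++;		/* bottom device: the I/O is done */
	} else {
		for (int i = 0; i < 2; i++) {
			assert(pool_used < 64);
			struct demo_bio *child = &pool[pool_used++];

			child->level = bio->level - 1;
			demo_submit(child);	/* looks recursive, only queues */
		}
	}
	depth--;
}

static void demo_submit(struct demo_bio *bio)
{
	struct demo_list on_stack = { NULL, NULL };

	if (pending) {			/* a driver is already active */
		list_add_tail(pending, bio);
		return;
	}
	pending = &on_stack;
	do {
		demo_driver(bio);
	} while ((bio = list_pop(&on_stack)));
	pending = NULL;
}
```

Submitting one bio at level 3 completes eight bios at level 0, yet `max_depth` stays at 1: the driver is never re-entered while it is running, which is what bounds stack usage for stacked devices.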
|
|
|
|
|
2021-10-12 13:12:24 +02:00
|
|
|
static void __submit_bio_noacct_mq(struct bio *bio)
|
2020-07-01 10:59:46 +02:00
|
|
|
{
|
2020-07-02 21:21:25 +02:00
|
|
|
struct bio_list bio_list[2] = { };
|
2020-07-01 10:59:46 +02:00
|
|
|
|
2020-07-02 21:21:25 +02:00
|
|
|
current->bio_list = bio_list;
|
2020-07-01 10:59:46 +02:00
|
|
|
|
|
|
|
do {
|
2021-10-12 13:12:24 +02:00
|
|
|
__submit_bio(bio);
|
2020-07-02 21:21:25 +02:00
|
|
|
} while ((bio = bio_list_pop(&bio_list[0])));
|
2020-07-01 10:59:46 +02:00
|
|
|
|
|
|
|
current->bio_list = NULL;
|
|
|
|
}
|
|
|
|
|
2022-02-16 12:45:10 +08:00
|
|
|
void submit_bio_noacct_nocheck(struct bio *bio)
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 09:53:42 +02:00
|
|
|
{
|
2023-02-16 11:22:50 +08:00
|
|
|
blk_cgroup_bio_start(bio);
|
|
|
|
blkcg_bio_issue_init(bio);
|
|
|
|
|
|
|
|
if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
|
|
|
|
trace_block_bio_queue(bio);
|
|
|
|
/*
|
|
|
|
* Now that enqueuing has been traced, we need to trace
|
|
|
|
* completion as well.
|
|
|
|
*/
|
|
|
|
bio_set_flag(bio, BIO_TRACE_COMPLETION);
|
|
|
|
}
|
|
|
|
|
2011-09-15 14:01:40 +02:00
|
|
|
/*
|
2020-07-01 10:59:45 +02:00
|
|
|
* We only want one ->submit_bio to be active at a time, else stack
|
|
|
|
* usage with stacked devices could be a problem. Use current->bio_list
|
|
|
|
* to collect a list of requests submitted by a ->submit_bio method while
|
|
|
|
* it is active, and then process them after it returned.
|
2011-09-15 14:01:40 +02:00
|
|
|
*/
|
2021-10-12 13:12:24 +02:00
|
|
|
if (current->bio_list)
|
2017-03-10 17:00:47 +11:00
|
|
|
bio_list_add(&current->bio_list[0], bio);
|
2024-04-12 01:21:45 -04:00
|
|
|
else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
|
2021-10-12 13:12:24 +02:00
|
|
|
__submit_bio_noacct_mq(bio);
|
|
|
|
else
|
|
|
|
__submit_bio_noacct(bio);
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 09:53:42 +02:00
|
|
|
}
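The deferral scheme described in the commit message above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel implementation: `toy_bio`, `toy_list`, `toy_driver` and `toy_submit` are hypothetical names, a plain static pointer stands in for `current->bio_list`, and the sketch is single-threaded. It shows that a nested submission is queued on the active caller's list and handled after the driver returns, so the call depth stays at one however many layers are stacked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio; 'next' plays the role of bi_next. */
struct toy_bio {
	struct toy_bio *next;
	int layers_left;	/* stacked layers still to pass through */
};

struct toy_list {
	struct toy_bio *head, *tail;
};

/*
 * Models current->bio_list: non-NULL only while a submission is active.
 * A plain static is an assumption of this sketch (single-threaded only).
 */
static struct toy_list *active_list;
static int nesting, max_nesting, handled;

static void toy_list_add(struct toy_list *l, struct toy_bio *bio)
{
	bio->next = NULL;
	if (l->tail)
		l->tail->next = bio;
	else
		l->head = bio;
	l->tail = bio;
}

static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *bio = l->head;

	if (bio) {
		l->head = bio->next;
		if (!l->head)
			l->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

static void toy_submit(struct toy_bio *bio);

/* Models a stacking driver: handle the bio, then resubmit it one layer down. */
static void toy_driver(struct toy_bio *bio)
{
	if (++nesting > max_nesting)
		max_nesting = nesting;
	handled++;
	if (bio->layers_left > 0) {
		bio->layers_left--;
		toy_submit(bio);	/* the "recursive" call */
	}
	nesting--;
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_list pending = { NULL, NULL };

	assert(bio->next == NULL);	/* callers must pass an unlinked bio */
	if (active_list) {
		/* Already inside a submission: defer instead of recursing. */
		toy_list_add(active_list, bio);
		return;
	}
	active_list = &pending;
	do {
		toy_driver(bio);
	} while ((bio = toy_list_pop(&pending)) != NULL);
	active_list = NULL;
}
```

Submitting a `toy_bio` with `layers_left = 5` runs `toy_driver()` six times while `max_nesting` never exceeds 1. Deferred bios are appended at the tail and popped from the head, which preserves the submission order that the device-mapper comment above asks for.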
|
2022-02-16 12:45:10 +08:00
|
|
|
|
block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary at which an atomic write straddles no
longer is atomically executed by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic writes limits are by default set 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 12:53:54 +00:00
|
|
|
static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q))
|
|
|
|
return BLK_STS_INVAL;
|
|
|
|
|
|
|
|
if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q))
|
|
|
|
return BLK_STS_INVAL;
|
|
|
|
|
|
|
|
return BLK_STS_OK;
|
|
|
|
}
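The two size checks in blk_validate_atomic_write_op_size() can be tried in userspace. This is a hedged sketch under stated assumptions: `toy_atomic_limits` and `toy_atomic_write_size_ok()` are hypothetical names, the limits are plain byte counts rather than values read from a request_queue, and the guard against a zero minimum is an addition of this sketch, not something the kernel function above contains.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the two queue limits, in bytes. */
struct toy_atomic_limits {
	uint32_t unit_min_bytes;
	uint32_t unit_max_bytes;
};

/* Returns true where the kernel check would return BLK_STS_OK. */
static bool toy_atomic_write_size_ok(const struct toy_atomic_limits *lim,
				     uint32_t size)
{
	/*
	 * Sketch-only guard: limits of 0 mean "no atomic write support",
	 * and this also avoids a modulo by zero below.
	 */
	if (lim->unit_min_bytes == 0)
		return false;
	/* Larger than the biggest atomic write unit. */
	if (size > lim->unit_max_bytes)
		return false;
	/* Not a whole multiple of the smallest atomic write unit. */
	if (size % lim->unit_min_bytes)
		return false;
	return true;
}
```

With limits of 4096 and 65536 bytes, sizes 4096 and 8192 pass, 6144 fails the multiple-of-min check, and 131072 fails the max check. The sketch mirrors only these two tests, so a size such as 12288 also passes here even though it is not a power of two.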
|
|
|
|
|
2022-02-16 12:45:10 +08:00
|
|
|
/**
|
|
|
|
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
|
|
|
|
* @bio: The bio describing the location in memory and on the device.
|
|
|
|
*
|
|
|
|
* This is a version of submit_bio() that shall only be used for I/O that is
|
|
|
|
* resubmitted to lower level drivers by stacking block drivers. All file
|
|
|
|
* systems and other upper level users of the block layer should use
|
|
|
|
* submit_bio() instead.
|
|
|
|
*/
|
|
|
|
void submit_bio_noacct(struct bio *bio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-01-24 11:02:34 +01:00
|
|
|
struct block_device *bdev = bio->bi_bdev;
|
2021-10-14 15:03:29 +01:00
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
2017-06-03 09:38:06 +02:00
|
|
|
blk_status_t status = BLK_STS_IOERR;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
2017-06-20 07:05:46 -05:00
|
|
|
/*
|
2020-05-28 13:19:29 -06:00
|
|
|
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
|
2020-09-23 16:06:51 -04:00
|
|
|
* if queue does not support NOWAIT.
|
2017-06-20 07:05:46 -05:00
|
|
|
*/
|
2022-09-27 09:58:15 +02:00
|
|
|
if ((bio->bi_opf & REQ_NOWAIT) && !bdev_nowait(bdev))
|
2020-05-28 13:19:29 -06:00
|
|
|
goto not_supported;
|
2017-06-20 07:05:46 -05:00
|
|
|
|
2018-02-06 14:05:39 -08:00
|
|
|
if (should_fail_bio(bio))
|
2011-09-12 12:12:01 +02:00
|
|
|
goto end_io;
|
2022-09-05 18:27:54 +08:00
|
|
|
bio_check_ro(bio);
|
2021-01-25 19:39:57 +01:00
|
|
|
if (!bio_flagged(bio, BIO_REMAPPED)) {
|
|
|
|
if (unlikely(bio_check_eod(bio)))
|
|
|
|
goto end_io;
|
2024-04-12 00:54:19 -04:00
|
|
|
if (bdev_is_partition(bdev) &&
|
|
|
|
unlikely(blk_partition_remap(bio)))
|
2021-01-25 19:39:57 +01:00
|
|
|
goto end_io;
|
|
|
|
}
|
2006-03-23 20:00:26 +01:00
|
|
|
|
2011-09-12 12:12:01 +02:00
|
|
|
/*
|
2020-07-01 10:59:44 +02:00
|
|
|
* Filter flush bio's early so that bio based drivers without flush
|
|
|
|
* support don't have to worry about them.
|
2011-09-12 12:12:01 +02:00
|
|
|
*/
|
2022-11-02 00:09:03 -07:00
|
|
|
if (op_is_flush(bio->bi_opf)) {
|
|
|
|
if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
|
|
|
|
bio_op(bio) != REQ_OP_ZONE_APPEND))
|
2007-11-02 08:49:08 +01:00
|
|
|
goto end_io;
|
2024-06-17 08:04:40 +02:00
|
|
|
if (!bdev_write_cache(bdev)) {
|
2022-11-02 00:09:03 -07:00
|
|
|
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
|
|
|
|
if (!bio_sectors(bio)) {
|
|
|
|
status = BLK_STS_OK;
|
|
|
|
goto end_io;
|
|
|
|
}
|
2007-11-02 08:49:08 +01:00
|
|
|
}
|
2011-09-12 12:12:01 +02:00
|
|
|
}
|
2006-10-30 22:07:21 -08:00
|
|
|
|
2024-07-18 15:08:17 +08:00
|
|
|
if (!(q->limits.features & BLK_FEAT_POLL) &&
|
|
|
|
(bio->bi_opf & REQ_POLLED)) {
|
2021-10-12 13:12:21 +02:00
|
|
|
bio_clear_polled(bio);
|
2024-07-18 15:08:17 +08:00
|
|
|
goto not_supported;
|
|
|
|
}
|
2018-12-14 17:21:22 +01:00
|
|
|
|
2016-06-09 16:00:36 +02:00
|
|
|
switch (bio_op(bio)) {
|
2023-12-21 08:05:38 +01:00
|
|
|
case REQ_OP_READ:
|
2024-08-05 11:33:15 +00:00
|
|
|
break;
|
2023-12-21 08:05:38 +01:00
|
|
|
case REQ_OP_WRITE:
|
2024-06-20 12:53:54 +00:00
|
|
|
if (bio->bi_opf & REQ_ATOMIC) {
|
|
|
|
status = blk_validate_atomic_write_op_size(q, bio);
|
|
|
|
if (status != BLK_STS_OK)
|
|
|
|
goto end_io;
|
|
|
|
}
|
2023-12-21 08:05:38 +01:00
|
|
|
break;
|
|
|
|
case REQ_OP_FLUSH:
|
|
|
|
/*
|
|
|
|
* REQ_OP_FLUSH can't be submitted through bios, it is only
|
|
|
|
* synthesized in struct request by the flush state machine.
|
|
|
|
*/
|
|
|
|
goto not_supported;
|
2016-06-09 16:00:36 +02:00
|
|
|
case REQ_OP_DISCARD:
|
2022-04-15 06:52:55 +02:00
|
|
|
if (!bdev_max_discard_sectors(bdev))
|
2016-06-09 16:00:36 +02:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
2022-04-15 06:52:57 +02:00
|
|
|
if (!bdev_max_secure_erase_sectors(bdev))
|
2016-06-09 16:00:36 +02:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
2020-05-12 17:55:47 +09:00
|
|
|
case REQ_OP_ZONE_APPEND:
|
|
|
|
status = blk_check_zone_append(q, bio);
|
|
|
|
if (status != BLK_STS_OK)
|
|
|
|
goto end_io;
|
|
|
|
break;
|
2023-12-21 08:05:38 +01:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
|
|
|
if (!q->limits.max_write_zeroes_sectors)
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-10-18 15:40:32 +09:00
|
|
|
case REQ_OP_ZONE_RESET:
|
2019-10-27 23:05:45 +09:00
|
|
|
case REQ_OP_ZONE_OPEN:
|
|
|
|
case REQ_OP_ZONE_CLOSE:
|
|
|
|
case REQ_OP_ZONE_FINISH:
|
2019-08-01 10:26:36 -07:00
|
|
|
case REQ_OP_ZONE_RESET_ALL:
|
2024-07-04 14:28:15 +09:00
|
|
|
if (!bdev_is_zoned(bio->bi_bdev))
|
2019-08-01 10:26:36 -07:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
2023-12-21 08:05:38 +01:00
|
|
|
case REQ_OP_DRV_IN:
|
|
|
|
case REQ_OP_DRV_OUT:
|
|
|
|
/*
|
|
|
|
* Driver private operations are only used with passthrough
|
|
|
|
* requests.
|
|
|
|
*/
|
|
|
|
fallthrough;
|
2016-06-09 16:00:36 +02:00
|
|
|
default:
|
2023-12-21 08:05:38 +01:00
|
|
|
goto not_supported;
|
2011-09-12 12:12:01 +02:00
|
|
|
}
|
2009-09-08 21:56:38 +02:00
|
|
|
|
2021-11-12 17:33:54 +08:00
|
|
|
if (blk_throtl_bio(bio))
|
2022-02-16 12:45:10 +08:00
|
|
|
return;
|
|
|
|
submit_bio_noacct_nocheck(bio);
|
2022-02-16 12:45:11 +08:00
|
|
|
return;
|
2008-11-28 13:32:03 +09:00
|
|
|
|
2016-06-09 16:00:36 +02:00
|
|
|
not_supported:
|
2017-06-03 09:38:06 +02:00
|
|
|
status = BLK_STS_NOTSUPP;
|
2008-11-28 13:32:03 +09:00
|
|
|
end_io:
|
2017-06-03 09:38:06 +02:00
|
|
|
bio->bi_status = status;
|
2015-07-20 15:29:37 +02:00
|
|
|
bio_endio(bio);
|
2007-05-01 09:53:42 +02:00
|
|
|
}
|
2020-07-01 10:59:44 +02:00
|
|
|
EXPORT_SYMBOL(submit_bio_noacct);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2024-01-30 15:26:34 -05:00
|
|
|
static void bio_set_ioprio(struct bio *bio)
|
|
|
|
{
|
|
|
|
/* Nobody set ioprio so far? Initialize it based on task's nice value */
|
|
|
|
if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
|
|
|
|
bio->bi_ioprio = get_current_ioprio();
|
|
|
|
blkcg_set_ioprio(bio);
|
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
2008-08-19 20:13:11 +02:00
|
|
|
* submit_bio - submit a bio to the block device layer for I/O
|
2005-04-16 15:20:36 -07:00
|
|
|
* @bio: The &struct bio which describes the I/O
|
|
|
|
*
|
2020-04-28 13:27:53 +02:00
|
|
|
* submit_bio() is used to submit I/O requests to block devices. It is passed a
|
|
|
|
* fully set up &struct bio that describes the I/O that needs to be done. The
|
2021-01-24 11:02:34 +01:00
|
|
|
* bio will be sent to the device described by the bi_bdev field.
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
2020-04-28 13:27:53 +02:00
|
|
|
* The success/failure status of the request, along with notification of
|
|
|
|
* completion, is delivered asynchronously through the ->bi_end_io() callback
|
2022-09-14 00:42:37 -07:00
|
|
|
* in @bio. The bio must NOT be touched by the caller until ->bi_end_io() has
|
2020-04-28 13:27:53 +02:00
|
|
|
* been called.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-10-12 13:12:24 +02:00
|
|
|
void submit_bio(struct bio *bio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2022-05-16 08:36:54 +02:00
|
|
|
if (bio_op(bio) == REQ_OP_READ) {
|
|
|
|
task_io_account_read(bio->bi_iter.bi_size);
|
|
|
|
count_vm_events(PGPGIN, bio_sectors(bio));
|
|
|
|
} else if (bio_op(bio) == REQ_OP_WRITE) {
|
|
|
|
count_vm_events(PGPGOUT, bio_sectors(bio));
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2024-01-30 15:26:34 -05:00
|
|
|
bio_set_ioprio(bio);
|
2021-10-12 13:12:24 +02:00
|
|
|
submit_bio_noacct(bio);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(submit_bio);
|
|
|
|
|
2021-10-12 13:12:24 +02:00
|
|
|
/**
|
|
|
|
* bio_poll - poll for BIO completions
|
|
|
|
* @bio: bio to poll for
|
2021-11-26 00:20:55 +08:00
|
|
|
* @iob: batches of IO
|
2021-10-12 13:12:24 +02:00
|
|
|
* @flags: BLK_POLL_* flags that control the behavior
|
|
|
|
*
|
|
|
|
* Poll for completions on queue associated with the bio. Returns number of
|
|
|
|
* completed entries found.
|
|
|
|
*
|
|
|
|
* Note: the caller must either be the context that submitted @bio, or
|
|
|
|
* be in a RCU critical section to prevent freeing of @bio.
|
|
|
|
*/
|
2021-10-12 09:24:29 -06:00
|
|
|
int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
|
2021-10-12 13:12:24 +02:00
|
|
|
{
|
|
|
|
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
|
2023-02-24 10:01:19 -07:00
|
|
|
struct block_device *bdev;
|
|
|
|
struct request_queue *q;
|
2022-03-04 21:08:03 -05:00
|
|
|
int ret = 0;
|
2021-10-12 13:12:24 +02:00
|
|
|
|
2023-02-24 10:01:19 -07:00
|
|
|
bdev = READ_ONCE(bio->bi_bdev);
|
|
|
|
if (!bdev)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
q = bdev_get_queue(bdev);
|
2024-06-17 08:04:48 +02:00
|
|
|
if (cookie == BLK_QC_T_NONE || !(q->limits.features & BLK_FEAT_POLL))
|
2021-10-12 13:12:24 +02:00
|
|
|
return 0;
|
|
|
|
|
2022-01-27 08:05:49 +01:00
|
|
|
blk_flush_plug(current->plug, false);
|
2021-10-12 13:12:24 +02:00
|
|
|
|
block: treat poll queue enter similarly to timeouts
We ran into an issue where a production workload would randomly grind to
a halt and not continue until the pending IO had timed out. This turned
out to be a complicated interaction between queue freezing and polled
IO:
1) You have an application that does polled IO. At any point in time,
there may be polled IO pending.
2) You have a monitoring application that issues a passthrough command,
which is marked with side effects such that it needs to freeze the
queue.
3) Passthrough command is started, which calls blk_freeze_queue_start()
on the device. At this point the queue is marked frozen, and any
attempt to enter the queue will fail (for non-blocking) or block.
4) Now the driver calls blk_mq_freeze_queue_wait(), which will return
when the queue is quiesced and pending IO has completed.
5) The pending IO is polled IO, but any attempt to poll IO through the
normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to
bio_queue_enter() as the queue is frozen. Rather than poll and
complete IO, the polling threads will sit in a tight loop attempting
to poll, but failing to enter the queue to do so.
The end result is that progress for either application will be stalled
until all pending polled IO has timed out. This causes obvious huge
latency issues for the application doing polled IO, but also long delays
for passthrough command.
Fix this by treating queue enter for polled IO just like we do for
timeouts. This allows quick quiesce of the queue as we still poll and
complete this IO, while still disallowing queueing up new IO.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-20 07:51:07 -07:00
|
|
|
/*
|
|
|
|
* We need to be able to enter a frozen queue, similar to how
|
|
|
|
* timeouts also need to do that. If that is blocked, then we can
|
|
|
|
* have pending IO when a queue freeze is started, and then the
|
|
|
|
* wait for the freeze to finish will wait for polled requests to
|
|
|
|
* time out as the poller is prevented from entering the queue and
|
|
|
|
* completing them. As long as we prevent new IO from being queued,
|
|
|
|
* that should be all that matters.
|
|
|
|
*/
|
|
|
|
if (!percpu_ref_tryget(&q->q_usage_counter))
|
2021-10-12 13:12:24 +02:00
|
|
|
return 0;
|
2022-03-04 21:08:03 -05:00
|
|
|
if (queue_is_mq(q)) {
|
2021-10-12 09:24:29 -06:00
|
|
|
ret = blk_mq_poll(q, cookie, iob, flags);
|
2022-03-04 21:08:03 -05:00
|
|
|
} else {
|
|
|
|
struct gendisk *disk = q->disk;
|
|
|
|
|
|
|
|
if (disk && disk->fops->poll_bio)
|
|
|
|
ret = disk->fops->poll_bio(bio, iob, flags);
|
|
|
|
}
|
2021-10-12 13:12:24 +02:00
|
|
|
blk_queue_exit(q);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(bio_poll);
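The difference between entering the queue for new IO and entering it for polling, as described in the commit message and comment above, can be modelled with a simple counter. This is a rough single-threaded sketch, not how the percpu reference counter is implemented: `toy_queue` and all `toy_*` functions are hypothetical, one base reference is assumed to be held until a freeze starts, and plain fields replace the kernel's per-cpu and atomic machinery.

```c
#include <stdbool.h>

/*
 * Hypothetical model of q_usage_counter: one base reference plus one
 * per holder. Plain fields, so this is single-threaded only.
 */
struct toy_queue {
	unsigned long usage;
	bool frozen;
};

static void toy_queue_init(struct toy_queue *q)
{
	q->usage = 1;		/* base reference, dropped when freezing */
	q->frozen = false;
}

/* New IO: refused as soon as a freeze has started. */
static bool toy_enter_submit(struct toy_queue *q)
{
	if (q->frozen)
		return false;
	q->usage++;
	return true;
}

/* Poll or timeout: allowed while any reference is still held. */
static bool toy_enter_poll(struct toy_queue *q)
{
	if (q->usage == 0)
		return false;
	q->usage++;
	return true;
}

static void toy_exit(struct toy_queue *q)
{
	q->usage--;
}

/* Start a freeze: stop new entries and drop the base reference. */
static void toy_freeze_start(struct toy_queue *q)
{
	q->frozen = true;
	q->usage--;
}

/* The freeze has finished once every holder has left. */
static bool toy_freeze_done(const struct toy_queue *q)
{
	return q->frozen && q->usage == 0;
}
```

In this model, one polled IO that entered before toy_freeze_start() keeps the count above zero. A poller can still get in through toy_enter_poll(), complete that IO and drop both references, after which toy_freeze_done() reports true. New submissions are refused for the whole duration of the freeze.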
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper to implement file_operations.iopoll. Requires the bio to be stored
|
|
|
|
* in iocb->private, and cleared before freeing the bio.
|
|
|
|
*/
|
2021-10-12 09:24:29 -06:00
|
|
|
int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
|
|
|
|
unsigned int flags)
|
2021-10-12 13:12:24 +02:00
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: the bio cache only uses SLAB_TYPESAFE_BY_RCU, so bio can
|
|
|
|
* point to a freshly allocated bio at this point. If that happens
|
|
|
|
* we have a few cases to consider:
|
|
|
|
*
|
|
|
|
* 1) the bio is being initialized and bi_bdev is NULL. We can just
|
|
|
|
* do nothing in this case
|
|
|
|
* 2) the bio points to a not poll enabled device. bio_poll will catch
|
|
|
|
* this and return 0
|
|
|
|
* 3) the bio points to a poll capable device, including but not
|
|
|
|
* limited to the one that the original bio pointed to. In this
|
|
|
|
* case we will call into the actual poll method and poll for I/O,
|
|
|
|
* even if we don't need to, but it won't cause harm either.
|
|
|
|
*
|
|
|
|
* For cases 2) and 3) above the RCU grace period ensures that bi_bdev
|
|
|
|
* is still allocated. Because partitions hold a reference to the whole
|
|
|
|
* device bdev and thus disk, the disk is also still valid. Grabbing
|
|
|
|
* a reference to the queue in bio_poll() ensures the hctxs and requests
|
|
|
|
* are still valid as well.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
bio = READ_ONCE(kiocb->private);
|
2023-02-24 10:01:19 -07:00
|
|
|
if (bio)
|
2021-10-12 09:24:29 -06:00
|
|
|
ret = bio_poll(bio, iob, flags);
|
2021-10-12 13:12:24 +02:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
|
|
|
|
|
2021-11-17 07:14:01 +01:00
|
|
|
void update_io_ticks(struct block_device *part, unsigned long now, bool end)
|
2020-05-27 07:24:13 +02:00
|
|
|
{
|
|
|
|
unsigned long stamp;
|
|
|
|
again:
|
2020-11-24 09:36:54 +01:00
|
|
|
stamp = READ_ONCE(part->bd_stamp);
|
block: support to account io_ticks precisely
Currently, io_ticks is accounted based on sampling, specifically
update_io_ticks() will always account io_ticks by 1 jiffies from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example(HZ is 250):
Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
Test result: util is about 90%, while the disk is really idle.
This behaviour was introduced by commit 5b18b5a73760 ("block: delete
part_round_stats and switch to less precise counting"). However, a key
point was missed: that commit also improved performance a lot:
Before the commit:
part_round_stats:
if (part->stamp != now)
stats |= 1;
part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;
After the commit:
update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()
-> only one task can reach here in 1 jiffies.
Hence in order to account io_ticks precisely, we only need to know if
there are IO inflight at most once in one jiffies. Noted that for
rq-based device, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence
part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight.
The additional overhead is quite little:
- per cpu add/dec for each IO for rq-based device;
- per cpu sum for each jiffies;
And it's verified by null-blk that there is no performance degradation
under heavy IO pressure.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:16 +08:00
|
|
|
if (unlikely(time_after(now, stamp)) &&
|
|
|
|
likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
|
|
|
|
(end || part_in_flight(part)))
|
|
|
|
__part_stat_add(part, io_ticks, now - stamp);
|
|
|
|
|
2024-04-12 00:54:19 -04:00
|
|
|
if (bdev_is_partition(part)) {
|
2020-11-24 09:36:54 +01:00
|
|
|
part = bdev_whole(part);
|
2020-05-27 07:24:13 +02:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
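The once-per-jiffy accounting in update_io_ticks() can be sketched in userspace with C11 atomics. The names below (`toy_part`, `toy_update_io_ticks()`, `toy_start_io()`, `toy_end_io()`) are hypothetical. The sketch assumes a plain counter instead of per-cpu statistics, a simple `>` comparison instead of time_after() (so jiffy wraparound is ignored), and it leaves out the walk from a partition to the whole device.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device accounting state. */
struct toy_part {
	_Atomic unsigned long stamp;	/* jiffy of the last update */
	unsigned long io_ticks;		/* accumulated busy time */
	unsigned int in_flight;		/* IOs currently outstanding */
};

static void toy_part_init(struct toy_part *p)
{
	atomic_init(&p->stamp, 0);
	p->io_ticks = 0;
	p->in_flight = 0;
}

static void toy_update_io_ticks(struct toy_part *p, unsigned long now,
				bool end)
{
	unsigned long stamp = atomic_load(&p->stamp);

	/*
	 * Only the caller that wins the compare-exchange for this jiffy
	 * accounts time, and only if the device was busy: either an IO is
	 * ending now or another one is still in flight.
	 */
	if (now > stamp &&
	    atomic_compare_exchange_strong(&p->stamp, &stamp, now) &&
	    (end || p->in_flight))
		p->io_ticks += now - stamp;
}

static void toy_start_io(struct toy_part *p, unsigned long now)
{
	toy_update_io_ticks(p, now, false);
	p->in_flight++;
}

static void toy_end_io(struct toy_part *p, unsigned long now)
{
	toy_update_io_ticks(p, now, true);
	p->in_flight--;
}
```

An IO that starts at jiffy 10 and ends at 12 adds 2 ticks. A following IO from 100 to 101 adds 1 more, and the idle gap between 12 and 100 is not counted, which is the inaccuracy the commit message above describes fixing.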
|
|
|
|
|
2023-02-23 17:12:26 +08:00
|
|
|
unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
|
2022-04-17 22:27:13 -04:00
|
|
|
unsigned long start_time)
|
2020-05-27 07:24:04 +02:00
|
|
|
{
|
|
|
|
part_stat_lock();
|
2022-04-17 22:27:13 -04:00
|
|
|
update_io_ticks(bdev, start_time, false);
|
|
|
|
part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
|
2020-05-27 07:24:04 +02:00
|
|
|
part_stat_unlock();
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 09:20:05 +01:00
|
|
|
|
2022-01-28 10:58:39 -05:00
|
|
|
return start_time;
|
|
|
|
}
|
2022-04-17 22:27:13 -04:00
|
|
|
EXPORT_SYMBOL(bdev_start_io_acct);
|
2022-01-28 10:58:39 -05:00
|
|
|
|
2021-01-24 11:02:37 +01:00
|
|
|
/**
|
|
|
|
* bio_start_io_acct - start I/O accounting for bio based drivers
|
|
|
|
* @bio: bio to start accounting for
|
|
|
|
*
|
|
|
|
* Returns the start time that should be passed back to bio_end_io_acct().
|
|
|
|
*/
|
|
|
|
unsigned long bio_start_io_acct(struct bio *bio)
|
2020-08-31 15:27:23 -07:00
|
|
|
{
|
2023-02-23 17:12:26 +08:00
|
|
|
return bdev_start_io_acct(bio->bi_bdev, bio_op(bio), jiffies);
|
2020-08-31 15:27:23 -07:00
|
|
|
}
|
2021-01-24 11:02:37 +01:00
|
|
|
EXPORT_SYMBOL_GPL(bio_start_io_acct);
|
2020-08-31 15:27:23 -07:00
|
|
|
|
2022-07-14 11:06:28 -07:00
|
|
|
void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
|
2023-02-23 17:12:26 +08:00
|
|
|
unsigned int sectors, unsigned long start_time)
|
2020-05-27 07:24:04 +02:00
|
|
|
{
|
|
|
|
const int sgrp = op_stat_group(op);
|
|
|
|
unsigned long now = READ_ONCE(jiffies);
|
|
|
|
unsigned long duration = now - start_time;
|
2018-12-06 11:41:19 -05:00
|
|
|
|
2020-05-27 07:24:04 +02:00
|
|
|
part_stat_lock();
|
2022-04-17 22:27:13 -04:00
|
|
|
update_io_ticks(bdev, now, true);
|
2023-02-23 17:12:26 +08:00
|
|
|
part_stat_inc(bdev, ios[sgrp]);
|
|
|
|
part_stat_add(bdev, sectors[sgrp], sectors);
|
2022-04-17 22:27:13 -04:00
|
|
|
part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
|
|
|
|
part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
|
2013-10-24 09:20:05 +01:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
2022-04-17 22:27:13 -04:00
|
|
|
EXPORT_SYMBOL(bdev_end_io_acct);
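The start/end pairing above can be illustrated with a small userspace model. This is a sketch under stated assumptions: `acct_sketch`, `start_io_acct_sketch`, `end_io_acct_sketch` and `HZ_SKETCH` are hypothetical, the statistics are indexed only by read (0) or write (1) rather than by the kernel's stat groups, the current time is passed in explicitly instead of being read from jiffies, and locking and the io_ticks update are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ_SKETCH 1000UL	/* hypothetical tick rate: 1000 ticks per second */

/* Hypothetical per-device counters, indexed by 0 = read, 1 = write. */
struct acct_sketch {
	unsigned long ios[2];
	unsigned long sectors[2];
	unsigned long long nsecs[2];
	int in_flight[2];
};

/* Mark one I/O as outstanding and hand back the start time for the caller. */
static unsigned long start_io_acct_sketch(struct acct_sketch *a, bool write,
					  unsigned long start_time)
{
	a->in_flight[write]++;
	return start_time;
}

/* Account a completed I/O using the start time returned earlier. */
static void end_io_acct_sketch(struct acct_sketch *a, bool write,
			       unsigned int sectors, unsigned long start_time,
			       unsigned long now)
{
	unsigned long duration = now - start_time;	/* in ticks */

	a->ios[write]++;
	a->sectors[write] += sectors;
	a->nsecs[write] += (unsigned long long)duration *
			   (1000000000ULL / HZ_SKETCH);
	a->in_flight[write]--;
}
```

A caller keeps the value returned by the start function and passes it back on completion, which is the same contract the kernel-doc for bio_start_io_acct() describes.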
|
2020-08-31 15:27:23 -07:00
|
|
|
|
2021-01-24 11:02:37 +01:00
|
|
|
void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time,
|
2022-04-17 22:27:13 -04:00
|
|
|
struct block_device *orig_bdev)
|
2020-08-31 15:27:23 -07:00
|
|
|
{
|
2023-02-23 17:12:26 +08:00
|
|
|
bdev_end_io_acct(orig_bdev, bio_op(bio), bio_sectors(bio), start_time);
|
2020-08-31 15:27:23 -07:00
|
|
|
}
|
2021-01-24 11:02:37 +01:00
|
|
|
EXPORT_SYMBOL_GPL(bio_end_io_acct_remapped);
|
2020-08-31 15:27:23 -07:00
|
|
|
|
2008-10-01 16:12:15 +02:00
|
|
|
/**
|
|
|
|
* blk_lld_busy - Check if underlying low-level drivers of a device are busy
|
|
|
|
* @q : the queue of the device being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Check if underlying low-level drivers of a device are busy.
|
|
|
|
* If the drivers want to export their busy state, they must set their own
|
|
|
|
* exporting function using blk_queue_lld_busy() first.
|
|
|
|
*
|
|
|
|
* Basically, this function is used only by request stacking drivers
|
|
|
|
* to stop dispatching requests to underlying devices when underlying
|
|
|
|
* devices are busy. This behavior allows more I/O merging on the queue
|
|
|
|
* of the request stacking driver and prevents I/O throughput regression
|
|
|
|
* on burst I/O load.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - Not busy (The request stacking driver should dispatch request)
|
|
|
|
* 1 - Busy (The request stacking driver should stop dispatching request)
|
|
|
|
*/
|
|
|
|
int blk_lld_busy(struct request_queue *q)
|
|
|
|
{
|
2018-11-15 12:22:51 -07:00
|
|
|
if (queue_is_mq(q) && q->mq_ops->busy)
|
2018-10-29 10:15:10 -06:00
|
|
|
return q->mq_ops->busy(q);
|
2008-10-01 16:12:15 +02:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_lld_busy);
|
|
|
|
|
2014-04-08 09:15:35 -06:00
|
|
|
int kblockd_schedule_work(struct work_struct *work)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
return queue_work(kblockd_workqueue, work);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_work);
|
|
|
|
|
2017-04-10 09:54:55 -06:00
|
|
|
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
|
|
|
|
|
2021-10-06 06:34:11 -06:00
|
|
|
void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this is a nested plug, don't actually assign it.
|
|
|
|
*/
|
|
|
|
if (tsk->plug)
|
|
|
|
return;
|
|
|
|
|
block: cache current nsec time in struct blk_plug
Querying the current time is the most costly thing we do in the block
layer per IO, and depending on kernel config settings, we may do it
many times per IO.
None of the callers actually need nsec granularity. Take advantage of
that by caching the current time in the plug, with the assumption here
being that any time checking will be temporally close enough that the
slight loss of precision doesn't matter.
If the block plug gets flushed, eg on preempt or schedule out, then
we invalidate the cached clock.
On a basic peak IOPS test case with iostats enabled, this changes
the performance from:
IOPS=108.41M, BW=52.93GiB/s, IOS/call=31/31
IOPS=108.43M, BW=52.94GiB/s, IOS/call=32/32
IOPS=108.29M, BW=52.88GiB/s, IOS/call=31/32
IOPS=108.35M, BW=52.91GiB/s, IOS/call=32/32
IOPS=108.42M, BW=52.94GiB/s, IOS/call=31/31
IOPS=108.40M, BW=52.93GiB/s, IOS/call=32/32
IOPS=108.31M, BW=52.89GiB/s, IOS/call=32/31
to
IOPS=118.79M, BW=58.00GiB/s, IOS/call=31/32
IOPS=118.62M, BW=57.92GiB/s, IOS/call=31/31
IOPS=118.80M, BW=58.01GiB/s, IOS/call=32/31
IOPS=118.78M, BW=58.00GiB/s, IOS/call=32/32
IOPS=118.69M, BW=57.95GiB/s, IOS/call=32/31
IOPS=118.62M, BW=57.92GiB/s, IOS/call=32/31
IOPS=118.63M, BW=57.92GiB/s, IOS/call=31/32
which is more than a 9% improvement in performance. Looking at perf diff,
we can see a huge reduction in time overhead:
10.55% -9.88% [kernel.vmlinux] [k] read_tsc
1.31% -1.22% [kernel.vmlinux] [k] ktime_get
Note that since this relies on blk_plug for the caching, it's only
applicable to the issue side. But this is where most of the time calls
happen anyway. On the completion side, cached time stamping is done with
struct io_comp_batch, as long as the driver supports it.
It's also worth noting that the above testing doesn't enable any of the
higher cost CPU items on the block layer side, like wbt, cgroups,
iocost, etc, which all would add additional time querying and hence
overhead. IOW, results would likely look even better in comparison with
those enabled, as distros would do.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-15 14:46:03 -07:00
|
|
|
plug->cur_ktime = 0;
|
2024-11-13 16:20:44 +01:00
|
|
|
rq_list_init(&plug->mq_list);
|
|
|
|
rq_list_init(&plug->cached_rqs);
|
2021-10-06 06:34:11 -06:00
|
|
|
plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
|
|
|
|
plug->rq_count = 0;
|
|
|
|
plug->multiple_queues = false;
|
2021-10-19 06:02:30 -06:00
|
|
|
plug->has_elevator = false;
|
2021-10-06 06:34:11 -06:00
|
|
|
INIT_LIST_HEAD(&plug->cb_list);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Store ordering should not be needed here, since a potential
|
|
|
|
* preempt will imply a full memory barrier
|
|
|
|
*/
|
|
|
|
tsk->plug = plug;
|
|
|
|
}
|
|
|
|
|
2011-09-21 10:00:16 +02:00
|
|
|
/**
|
|
|
|
* blk_start_plug - initialize blk_plug and track it inside the task_struct
|
|
|
|
* @plug: The &struct blk_plug that needs to be initialized
|
|
|
|
*
|
|
|
|
* Description:
|
2019-01-08 16:57:34 -05:00
|
|
|
* blk_start_plug() indicates to the block layer an intent by the caller
|
|
|
|
* to submit multiple I/O requests in a batch. The block layer may use
|
|
|
|
* this hint to defer submitting I/Os from the caller until blk_finish_plug()
|
|
|
|
* is called. However, the block layer may choose to submit requests
|
|
|
|
* before a call to blk_finish_plug() if the number of queued I/Os
|
|
|
|
* exceeds %BLK_MAX_REQUEST_COUNT, or if the size of the I/O is larger than
|
|
|
|
* %BLK_PLUG_FLUSH_SIZE. The queued I/Os may also be submitted early if
|
|
|
|
* the task schedules (see below).
|
|
|
|
*
|
2011-09-21 10:00:16 +02:00
|
|
|
* Tracking blk_plug inside the task_struct will help with auto-flushing the
|
|
|
|
* pending I/O should the task end up blocking between blk_start_plug() and
|
|
|
|
* blk_finish_plug(). This is important from a performance perspective, but
|
|
|
|
* also ensures that we don't deadlock. For instance, if the task is blocking
|
|
|
|
* for a memory allocation, memory reclaim could end up wanting to free a
|
|
|
|
* page belonging to that request that is currently residing in our private
|
|
|
|
* plug. By flushing the pending I/O when the process goes to sleep, we avoid
|
|
|
|
* this kind of deadlock.
|
|
|
|
*/
|
2011-03-08 13:19:51 +01:00
|
|
|
void blk_start_plug(struct blk_plug *plug)
|
|
|
|
{
|
2021-10-06 06:34:11 -06:00
|
|
|
blk_start_plug_nr_ios(plug, 1);
|
2011-03-08 13:19:51 +01:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_plug);
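The batching behaviour described in the comment above can be sketched in userspace as follows. Everything here is hypothetical and heavily simplified: `plug_sketch`, `current_plug_sketch`, `MAX_BATCH_SKETCH` and the helper names are illustration only, a global pointer stands in for the per-task plug, and the size-based and schedule-based early flushes are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BATCH_SKETCH 4	/* hypothetical stand-in for the request limit */

struct plug_sketch {
	int pending[MAX_BATCH_SKETCH];	/* I/Os held back by the plug */
	size_t count;			/* number of pending I/Os */
	size_t flushes;			/* how many times the batch was flushed */
	size_t submitted;		/* I/Os handed on so far */
};

/* Stand-in for the plug pointer tracked in the task structure. */
static struct plug_sketch *current_plug_sketch;
static size_t direct_issued_sketch;

static void start_plug_sketch(struct plug_sketch *plug)
{
	/* A nested plug is not installed; the outer one stays active. */
	if (current_plug_sketch)
		return;
	plug->count = 0;
	plug->flushes = 0;
	plug->submitted = 0;
	current_plug_sketch = plug;
}

static void flush_plug_sketch(struct plug_sketch *plug)
{
	if (!plug->count)
		return;
	plug->submitted += plug->count;
	plug->count = 0;
	plug->flushes++;
}

static void submit_io_sketch(int io)
{
	struct plug_sketch *plug = current_plug_sketch;

	if (!plug) {
		direct_issued_sketch++;	/* no plug: issue immediately */
		return;
	}
	/* A full batch is flushed early, before blk_finish_plug() time. */
	if (plug->count == MAX_BATCH_SKETCH)
		flush_plug_sketch(plug);
	plug->pending[plug->count++] = io;
}

static void finish_plug_sketch(struct plug_sketch *plug)
{
	/* Finishing a nested (never installed) plug is a no-op. */
	if (plug != current_plug_sketch)
		return;
	flush_plug_sketch(plug);
	current_plug_sketch = NULL;
}
```

In this model, six submissions under one plug produce two flushes: one early flush when the batch fills up and one at finish time.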
|
|
|
|
|
2012-07-31 09:08:15 +02:00
|
|
|
static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
|
2011-04-18 09:52:22 +02:00
|
|
|
{
|
|
|
|
LIST_HEAD(callbacks);
|
|
|
|
|
2012-07-31 09:08:15 +02:00
|
|
|
while (!list_empty(&plug->cb_list)) {
|
|
|
|
list_splice_init(&plug->cb_list, &callbacks);
|
2011-04-18 09:52:22 +02:00
|
|
|
|
2012-07-31 09:08:15 +02:00
|
|
|
while (!list_empty(&callbacks)) {
|
|
|
|
struct blk_plug_cb *cb = list_first_entry(&callbacks,
|
2011-04-18 09:52:22 +02:00
|
|
|
struct blk_plug_cb,
|
|
|
|
list);
|
2012-07-31 09:08:15 +02:00
|
|
|
list_del(&cb->list);
|
2012-07-31 09:08:15 +02:00
|
|
|
cb->callback(cb, from_schedule);
|
2012-07-31 09:08:15 +02:00
|
|
|
}
|
2011-04-18 09:52:22 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-07-31 09:08:14 +02:00
|
|
|
struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
|
|
|
|
int size)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = current->plug;
|
|
|
|
struct blk_plug_cb *cb;
|
|
|
|
|
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
list_for_each_entry(cb, &plug->cb_list, list)
|
|
|
|
if (cb->callback == unplug && cb->data == data)
|
|
|
|
return cb;
|
|
|
|
|
|
|
|
/* Not currently on the callback list */
|
|
|
|
BUG_ON(size < sizeof(*cb));
|
|
|
|
cb = kzalloc(size, GFP_ATOMIC);
|
|
|
|
if (cb) {
|
|
|
|
cb->data = data;
|
|
|
|
cb->callback = unplug;
|
|
|
|
list_add(&cb->list, &plug->cb_list);
|
|
|
|
}
|
|
|
|
return cb;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_check_plugged);
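The find-or-allocate lookup and the two-level flush loop above can be modelled in userspace like this. It is a sketch only: `cb_sketch`, `plug_cbs_sketch` and the function names are hypothetical, a singly linked list replaces the kernel list, `calloc` replaces the atomic allocation, and the example callback frees its own entry, which is an assumption about how callers typically use the interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct cb_sketch;
typedef void (*cb_fn_sketch)(struct cb_sketch *cb, bool from_schedule);

/* Hypothetical callback entry; callers may embed it in a larger struct. */
struct cb_sketch {
	struct cb_sketch *next;
	cb_fn_sketch callback;
	void *data;
};

struct plug_cbs_sketch {
	struct cb_sketch *head;
};

/* Return the entry for (unplug, data), allocating it on first use. */
static struct cb_sketch *check_plugged_sketch(struct plug_cbs_sketch *plug,
					      cb_fn_sketch unplug, void *data,
					      size_t size)
{
	struct cb_sketch *cb;

	if (!plug)
		return NULL;

	for (cb = plug->head; cb; cb = cb->next)
		if (cb->callback == unplug && cb->data == data)
			return cb;

	/* Not on the list yet: 'size' must cover at least the base entry. */
	assert(size >= sizeof(*cb));
	cb = calloc(1, size);
	if (cb) {
		cb->data = data;
		cb->callback = unplug;
		cb->next = plug->head;
		plug->head = cb;
	}
	return cb;
}

/* Run every callback; repeat if callbacks registered new entries. */
static void flush_callbacks_sketch(struct plug_cbs_sketch *plug,
				   bool from_schedule)
{
	while (plug->head) {
		struct cb_sketch *list = plug->head;

		plug->head = NULL;	/* detach the current batch */
		while (list) {
			struct cb_sketch *cb = list;

			list = cb->next;
			cb->callback(cb, from_schedule);
		}
	}
}

/* Example callback: bump the counter behind 'data' and free the entry. */
static void count_and_free_sketch(struct cb_sketch *cb, bool from_schedule)
{
	(void)from_schedule;
	++*(int *)cb->data;
	free(cb);
}
```

Detaching the list before walking it means a callback can register further entries on the plug without disturbing the walk; the outer loop then picks them up.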
|
|
|
|
|
2022-01-27 08:05:49 +01:00
|
|
|
void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
|
2011-03-08 13:19:51 +01:00
|
|
|
{
|
2021-10-20 16:41:18 +02:00
|
|
|
if (!list_empty(&plug->cb_list))
|
|
|
|
flush_plug_callbacks(plug, from_schedule);
|
2023-07-14 11:11:06 +01:00
|
|
|
blk_mq_flush_plug_list(plug, from_schedule);
|
2021-11-03 05:49:07 -06:00
|
|
|
/*
|
|
|
|
* Unconditionally flush out cached requests, even if the unplug
|
|
|
|
* event came from schedule. Since we now hold references to the
|
|
|
|
* queue for cached requests, we don't want a blocked task holding
|
|
|
|
* up a queue freeze/quiesce event.
|
|
|
|
*/
|
2024-11-13 16:20:44 +01:00
|
|
|
if (unlikely(!rq_list_empty(&plug->cached_rqs)))
|
2021-10-06 06:34:11 -06:00
|
|
|
blk_mq_free_plug_rqs(plug);
|
2024-01-16 09:18:39 -07:00
|
|
|
|
block: fix that blk_time_get_ns() doesn't update time after schedule
While monitoring the throttle time of IO from iocost, it's found that
such time is always zero after the io_schedule() from ioc_rqos_throttle,
for example, with the following debug patch:
+ printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
while (true) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (wait.committed)
break;
io_schedule();
}
+ printk("%s-%d: %s exit %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
It can be observed that blk_time_get_ns() always returns the same time:
[ 1068.096579] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.272587] fio-1268: ioc_rqos_throttle exit 1067901962288
[ 1068.274389] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.472690] fio-1268: ioc_rqos_throttle exit 1067901962288
[ 1068.474485] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.672656] fio-1268: ioc_rqos_throttle exit 1067901962288
[ 1068.674451] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.872655] fio-1268: ioc_rqos_throttle exit 1067901962288
And I think the root cause is that 'PF_BLOCK_TS' is always cleared
by blk_flush_plug() before schedule(), hence blk_plug_invalidate_ts()
will never be called:
blk_time_get_ns
plug->cur_ktime = ktime_get_ns();
current->flags |= PF_BLOCK_TS;
io_schedule:
io_schedule_prepare
blk_flush_plug
__blk_flush_plug
/* the flag is cleared, while time is not */
current->flags &= ~PF_BLOCK_TS;
schedule
sched_update_worker
/* the flag is not set, hence plug->cur_ktime is not cleared */
if (tsk->flags & PF_BLOCK_TS)
blk_plug_invalidate_ts()
blk_time_get_ns
/* got the time stashed before schedule */
return plug->cur_ktime;
Fix the problem by clearing cached time in __blk_flush_plug().
Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240411032349.3051233-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-11 11:23:48 +08:00
|
|
|
plug->cur_ktime = 0;
|
2024-01-16 09:18:39 -07:00
|
|
|
current->flags &= ~PF_BLOCK_TS;
|
2011-03-08 13:19:51 +01:00
|
|
|
}
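The cached-timestamp behaviour that the commit messages above describe, and the reason the flush path resets the cached value, can be shown with a userspace model. This is a sketch with hypothetical names (`plug_time_sketch`, `read_clock_sketch`, `time_get_ns_sketch`, `flush_plug_time_sketch`); a settable fake clock replaces the real time source, and the task flag handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plug holding only the cached timestamp; 0 means "not cached". */
struct plug_time_sketch {
	uint64_t cur_ktime;
};

static uint64_t fake_clock_ns_sketch;	/* settable stand-in for the clock */
static unsigned int clock_reads_sketch;	/* counts the "expensive" reads */

static uint64_t read_clock_sketch(void)
{
	clock_reads_sketch++;
	return fake_clock_ns_sketch;
}

/* Return the cached time if a plug is active, reading the clock at most once. */
static uint64_t time_get_ns_sketch(struct plug_time_sketch *plug)
{
	if (!plug)
		return read_clock_sketch();
	if (!plug->cur_ktime)
		plug->cur_ktime = read_clock_sketch();
	return plug->cur_ktime;
}

/* Flushing invalidates the cache so the next query sees fresh time. */
static void flush_plug_time_sketch(struct plug_time_sketch *plug)
{
	plug->cur_ktime = 0;
}
```

Without the reset in the flush step, every query after a sleep would keep returning the value cached before the sleep, which is the symptom shown in the log excerpt of the fix.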
|
|
|
|
|
2019-01-08 16:57:34 -05:00
|
|
|
/**
|
|
|
|
* blk_finish_plug - mark the end of a batch of submitted I/O
|
|
|
|
* @plug: The &struct blk_plug passed to blk_start_plug()
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Indicate that a batch of I/O submissions is complete. This function
|
|
|
|
* must be paired with an initial call to blk_start_plug(). The intent
|
|
|
|
* is to allow the block layer to optimize I/O submission. See the
|
|
|
|
* documentation for blk_start_plug() for more information.
|
|
|
|
*/
|
2011-03-08 13:19:51 +01:00
|
|
|
void blk_finish_plug(struct blk_plug *plug)
|
|
|
|
{
|
2021-10-20 16:41:19 +02:00
|
|
|
if (plug == current->plug) {
|
2022-01-27 08:05:49 +01:00
|
|
|
__blk_flush_plug(plug, false);
|
2021-10-20 16:41:19 +02:00
|
|
|
current->plug = NULL;
|
|
|
|
}
|
2011-03-08 13:19:51 +01:00
|
|
|
}
|
2011-04-15 15:20:10 +02:00
|
|
|
EXPORT_SYMBOL(blk_finish_plug);
|
2011-03-08 13:19:51 +01:00
|
|
|
|
2020-05-14 16:45:09 +08:00
|
|
|
void blk_io_schedule(void)
|
|
|
|
{
|
|
|
|
/* Prevent hang_check timer from firing at us during very long I/O */
|
|
|
|
unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
|
|
|
|
|
|
|
|
if (timeout)
|
|
|
|
io_schedule_timeout(timeout);
|
|
|
|
else
|
|
|
|
io_schedule();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_io_schedule);
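The timeout arithmetic above is small enough to check by hand. The helper below is a hypothetical userspace restatement (`io_wait_timeout_sketch` is not a kernel function): it returns half of the hung-task timeout in ticks, and a result of 0 corresponds to the branch that sleeps without a timeout.

```c
#include <assert.h>

/*
 * Half of the hung-task timeout, converted from seconds to ticks.
 * A return value of 0 means the hung-task check is disabled, so the
 * caller would sleep without a timeout.
 */
static unsigned long io_wait_timeout_sketch(unsigned long hung_task_timeout_secs,
					    unsigned long hz)
{
	return hung_task_timeout_secs * hz / 2;
}
```

For example, with a 120 second hung-task timeout and 250 ticks per second, the sleep is capped at 15000 ticks, so the task wakes up and goes back to sleep well before the detector would report it.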
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
int __init blk_dev_init(void)
|
|
|
|
{
|
2022-07-14 11:06:32 -07:00
|
|
|
BUILD_BUG_ON((__force u32)REQ_OP_LAST >= (1 << REQ_OP_BITS));
|
2016-10-28 08:48:16 -06:00
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-09 10:31:43 -08:00
|
|
|
sizeof_field(struct request, cmd_flags));
|
2016-10-28 08:48:16 -06:00
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-09 10:31:43 -08:00
|
|
|
sizeof_field(struct bio, bi_opf));
|
2009-04-27 14:53:54 +02:00
|
|
|
|
2011-01-03 15:01:47 +01:00
|
|
|
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
|
|
|
|
kblockd_workqueue = alloc_workqueue("kblockd",
|
2014-06-11 23:43:54 +02:00
|
|
|
WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
|
2005-04-16 15:20:36 -07:00
|
|
|
if (!kblockd_workqueue)
|
|
|
|
panic("Failed to create kblockd\n");
|
|
|
|
|
2024-01-31 17:43:23 +08:00
|
|
|
blk_requestq_cachep = KMEM_CACHE(request_queue, SLAB_PANIC);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2017-01-31 14:53:20 -08:00
|
|
|
blk_debugfs_root = debugfs_create_dir("block", NULL);
|
|
|
|
|
2008-01-24 08:53:35 +01:00
|
|
|
return 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
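The build-time checks at the top of blk_dev_init() guard the packing of an operation code and flag bits into a single field. A userspace sketch of the same idea using C11 `_Static_assert` follows. The widths, the enum values and every `_sketch` name are illustrative assumptions, and the shift-based packing is a simplification chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define OP_BITS_SKETCH		8	/* hypothetical width of the op code */
#define FLAG_BITS_SKETCH	24	/* hypothetical width of the flags */
#define OP_MASK_SKETCH		((1u << OP_BITS_SKETCH) - 1)

enum op_sketch {
	OP_READ_SKETCH = 0,
	OP_WRITE_SKETCH = 1,
	OP_LAST_SKETCH = 36,		/* hypothetical highest op value */
};

struct bio_sketch {
	uint32_t opf;			/* op code in the low bits, flags above */
};

/* Every op code must fit in the bits reserved for it. */
_Static_assert(OP_LAST_SKETCH < (1 << OP_BITS_SKETCH),
	       "op codes do not fit in OP_BITS_SKETCH");

/* Op code plus flags must fit in the field that stores them. */
_Static_assert(OP_BITS_SKETCH + FLAG_BITS_SKETCH <=
	       8 * sizeof(((struct bio_sketch *)0)->opf),
	       "opf field too small for op and flag bits");

static uint32_t pack_opf_sketch(enum op_sketch op, uint32_t flags)
{
	return (flags << OP_BITS_SKETCH) | ((uint32_t)op & OP_MASK_SKETCH);
}

static enum op_sketch opf_op_sketch(uint32_t opf)
{
	return (enum op_sketch)(opf & OP_MASK_SKETCH);
}

static uint32_t opf_flags_sketch(uint32_t opf)
{
	return opf >> OP_BITS_SKETCH;
}
```

If someone later widens the op range or shrinks the field, the build fails instead of silently truncating bits at run time.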
|