Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the
"s" plural) to more correctly reflect the fact that this function tests
a single request, not multiple requests.
Fixes: 015d02f48537 ("block: mq-deadline: Do not break sequential write streams to zoned HDDs")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20221126025550.967914-2-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOBLEoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgponRD/90Nx2W6P8iN21w3ZtzhtYiG9WVBhUtQqhK
IFCUuaq7Meh8Iz6dpoOhEBHZ3alxZRptYLVVKMGUxkOUjzJPKljermvcb/If288G
5yaRrBtwkL+opZuDTwTiDFrUFGAY8+Uk/cmse8iqLkmQ+eJxmpr28tTrkOEToinS
D8JEnXG0YTRNe7o52wBC0rD2zJURHidAqyHuKdXP9DnPntSKfWQNZHK6kWwiJ20W
UQDgk3sycbmv8WXQ2nsDvrGf1s9FeQzS+gu5gWiA1sbQ5yhBvnGpT/U9E8WOeCZR
wszfWlsjOfv6N095o6plNeVo3Ti/QgliGiJvuBhkEhU6M9kOCEKzTh0W5DNyDpyo
DVB4/FmSyBKf1Aif9eo3gUqBdaaKaNn5b2vmgwuY/P5ALjanrL8izsnzdMfxOyRf
wNFgiYlD3VOksWxHUnPLx9nMtM9uDjkdE8IeRr/4bfP46qxSOpC1dZWvu6Ot1vPr
bfYo0QM+wUis4tfdxW9MIIi8oDAV0jbPN3zC2/c1end0KfZzlBiRh/1aernWweAj
NgVJC+9GhzR0RV0T74vH1JY5Xa5PF3VREbNeCYhzLPH/QtI/dCNIVhAv13p06+6x
zkeyMKUo8oLNl7WJRDb5WU/k2gr1msbwxvS/IdE1PuopqTefU9zdDlP/bYab0xla
E6DHJ2aHlw==
=41zr
-----END PGP SIGNATURE-----
Merge tag 'block-6.1-2022-11-25' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- A few fixes for s390 sads (Stefan, Colin)
- Ensure that ublk doesn't reorder requests, as that can be problematic
on devices that need specific ordering (Ming)
- Fix a queue reference leak in disk allocation handling (Christoph)
* tag 'block-6.1-2022-11-25' of git://git.kernel.dk/linux:
ublk_drv: don't forward io commands in reserve order
s390/dasd: fix possible buffer overflow in copy_pair_show
s390/dasd: fix no record found for raw_track_access
s390/dasd: increase printing of debug data payload
s390/dasd: Fix spelling mistake "Ivalid" -> "Invalid"
blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failure
The devnode() in struct class should not be modifying the device that is
passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Cc: Frank Haverkamp <haver@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Xie Yongji <xieyongji@bytedance.com>
Cc: Gautam Dawar <gautam.dawar@xilinx.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Eli Cohen <elic@nvidia.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: alsa-devel@alsa-project.org
Cc: dri-devel@lists.freedesktop.org
Cc: kvm@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-block@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The dev_uevent() in struct class should not be modifying the device that
is passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Raed Salem <raeds@nvidia.com>
Cc: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Avihai Horon <avihaih@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Wang Yufen <wangyufen@huawei.com>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mq-deadline ensures an in order dispatching of write requests to zoned
block devices using a per zone lock (a bit). This implies that for any
purely sequential write workload, the drive is exercised most of the
time at a maximum queue depth of one.
However, when such sequential write workload crosses a zone boundary
(when sequentially writing multiple contiguous zones), zone write
locking may prevent the last write to one zone to be issued (as the
previous write is still being executed) but allow the first write to the
following zone to be issued (as that zone is not yet being writen and
not locked). This result in an out of order delivery of the sequential
write commands to the device every time a zone boundary is crossed.
While such behavior does not break the sequential write constraint of
zoned block devices (and does not generate any write error), some zoned
hard-disks react badly to seeing these out of order writes, resulting in
lower write throughput.
This problem can be addressed by always dispatching the first request
of a stream of sequential write requests, regardless of the zones
targeted by these sequential writes. To do so, the function
deadline_skip_seq_writes() is introduced and used in
deadline_next_request() to select the next write command to issue if the
target device is an HDD (blk_queue_nonrot() being false).
deadline_fifo_request() is modified using the new
deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
requests in the fifo list that have a preceding request in lba order
that is sequential.
With this fix, a sequential write workload executed with the following
fio command:
fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
--size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \
--bs=65536
results in an increase from 225 MB/s to 250 MB/s of the write throughput
of an SMR HDD (11% increase).
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dd_finish_request() tests if the per prio fifo_list is not empty to
determine if request dispatching must be restarted for handling blocked
write requests to zoned devices with a call to
blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
2 problems:
1) Only the priority level of the completed request is considered.
However, writes to a zone may be blocked due to other writes to the
same zone using a different priority level. While this is unlikely to
happen in practice, as writing a zone with different IO priorirites
does not make sense, nothing in the code prevents this from
happening.
2) The use of list_empty() is dangerous as dd_finish_request() does not
take dd->lock and may run concurrently with the insert and dispatch
code.
Fix these 2 problems by testing the write fifo list of all priority
levels using the new helper dd_has_write_work(), and by testing each
fifo list using list_empty_careful().
Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the compiler to verify consistency of function declarations and
function definitions. This patch fixes the following sparse errors:
block/blk-crypto-profile.c:241:14: error: no previous prototype for ‘blk_crypto_get_keyslot’ [-Werror=missing-prototypes]
241 | blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile,
| ^~~~~~~~~~~~~~~~~~~~~~
block/blk-crypto-profile.c:318:6: error: no previous prototype for ‘blk_crypto_put_keyslot’ [-Werror=missing-prototypes]
318 | void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot)
| ^~~~~~~~~~~~~~~~~~~~~~
block/blk-crypto-profile.c:344:6: error: no previous prototype for ‘__blk_crypto_cfg_supported’ [-Werror=missing-prototypes]
344 | bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
block/blk-crypto-profile.c:373:5: error: no previous prototype for ‘__blk_crypto_evict_key’ [-Werror=missing-prototypes]
373 | int __blk_crypto_evict_key(struct blk_crypto_profile *profile,
| ^~~~~~~~~~~~~~~~~~~~~~
Cc: Eric Biggers <ebiggers@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221123172923.434339-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_match does not care about elevator_features any more. Remove
related descriptions from its document.
Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match")
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/a58424555202c07a9ccf7f60c3ad7e247da09e25.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We no longer support falling back to the old io scheduler if switching to
the new one fails. Update the document to indicate that.
Fixes: a1ce35fa4985 ("block: remove dead elevator code")
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/94250961689ba7d2e67a7d9e7995a11166fedb31.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the request queue reference just acquired when __alloc_disk_node
failed.
Fixes: 6f8191fdf41d ("block: simplify disk shutdown")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20221122072753.426077-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
cleaned up the function blk_mq_realloc_tag_set_tags. After this change,
the function does not update nr_hw_queues of struct blk_mq_tag_set when
new nr_hw_queues value is smaller than original. This results in failure
of queue number change of block devices. To avoid the failure, add the
missing nr_hw_queues update.
Fixes: ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221122084917.2034220-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_crypto_get_keyslot, blk_crypto_put_keyslot, __blk_crypto_evict_key
and __blk_crypto_cfg_supported are only used internally by the
blk-crypto code, so move the out of blk-crypto-profile.h, which is
included by drivers that supply blk-crypto functionality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a blk_crypto_config_supported_natively helper that wraps
__blk_crypto_cfg_supported to retrieve the crypto_profile from the
request queue. With this fscrypt can stop including
blk-crypto-profile.h and rely on the public consumer interface in
blk-crypto.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch all public blk-crypto interfaces to use struct block_device
arguments to specify the device they operate on instead of th
request_queue, which is a block layer implementation detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmN38ZUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgXxD/9tUSFUKIVGIn4pmNILfY3XV45HOi1w44yR
zCxCELupcBeT+YixmaJcT8sunrrg2fLPOXMrDJk1cG/izXHzkjAQsHZvERfqC7hC
f5onH+2MyGm3qBwxV0iGqITJgTwQGInVJijT4f9UZd/8ultymyZR2nOdIdIydHCF
qzlOjq6hgIuGKHhFgOqRUg/OAkx510ZEEilUDcZ6XVV+zL7ccN6J9+eNTI3c58wT
7jvxZC4u6QGKteGvVniE3WXgk3QdFiQRORvV09g+PkbG/vPjAIZ5tJFb9PdIOebD
3guDiNUasgz2vnDetMK+yk4LcedcRfWnqgn+Vm8C26j5Fxs13eDx5kMDteVy7CYh
3bokOATHohoZZ9qTApgQUswTfGJfBdoy0nUTPuffxPdKDyUPteIxFCADcnyDHnDG
d/+PjU3FKF31o2HcUfvYp7OMO0VZP0hJSWps8znoVXKxb+LH9qKkYzHVlfni5kkS
k9XqqD1Ki98Erb346YqgvQjCkz+CUd5DxtGyh9Oh2+oS2qHP6WjdKo1QPFmWD5dp
EyXGSqGoZrIPtnKohLUN9EiVXanRQWJr3L0gw2CYXpmwfSKfMC3CQraEC1jOc01l
TfsLJGbl3L5XpLzxoBwDu44cqp+VvbalergdcmsDTLDFHhONY2g5LJh6C9/EDdnQ
Cde1uHikGw==
=sOGG
-----END PGP SIGNATURE-----
Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira)
- Memory leak fix in nvmet (Sagi Grimberg)
- Regression fix for block cgroups pinning the wrong blkcg, causing
leaks of cgroups and blkcgs (Chris)
- UAF fix for drbd setup error handling (Dan)
- Fix DMA alignment propagation in DM (Keith)
* tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux:
dm-log-writes: set dma_alignment limit in io_hints
dm-integrity: set dma_alignment limit in io_hints
block: make blk_set_default_limits() private
dm-crypt: provide dma_alignment limit in io_hints
block: make dma_alignment a stacking queue_limit
nvmet: fix a memory leak in nvmet_auth_set_key
nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000
drbd: use after free in drbd_create_device()
nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro
blk-cgroup: properly pin the parent in blkcg_css_online
As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupt is disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.
As it is likely that not all the percpu blkg_iostat_set's has been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.
The blkg_iostat_set is added to a lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blk_iostat_set's in the lockless list may have no new IO stats to be
flushed, but that is OK.
To protect against destruction of blkg, a percpu reference is gotten
when putting into the lockless list and put back when removed.
When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls were exited
immediately because of empty lockless list. After an all-cpu kernel
build, the ratio became 6295424/6340513. That was more than 99%.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221105005902.407297-3-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221105005902.407297-2-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Device mappers had always been getting the default 511 dma mask, but
the underlying device might have a larger alignment requirement. Since
this value is used to determine alloweable direct-io alignment, this
needs to be a stackable limit.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221110184501.2451620-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hold a reference to the holder kobject for each bd_holder_disk,
so to make the code a bit more robust, use a reference to it instead
of the block_device. As long as no one clears ->bd_holder_dir in
before freeing the disk, this isn't strictly required, but it does
make the code more clear and more robust.
Orignally-From: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221115141054.1051801-10-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the caller of bd_link_disk_holer() get 'bdev' by
blkdev_get_by_dev(), which will look up 'bdev' by inode number 'dev'.
Howerver, it's possible that del_gendisk() can be called currently, and
'bd_holder_dir' can be freed before bd_link_disk_holer() access it, thus
use after free is triggered.
t1: t2:
bdev = blkdev_get_by_dev
del_gendisk
kobject_put(bd_holder_dir)
kobject_free()
bd_link_disk_holder
Fix the problem by checking disk is still live and grabbing a reference
to 'bd_holder_dir' first in bd_link_disk_holder().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221115141054.1051801-9-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that dm has been fixed to track of holder registrations before
add_disk, the somewhat buggy block layer code can be safely removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221115141054.1051801-8-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zero out the pointer to ->slave_dir so that the holder code doesn't
incorrectly treat the object as alive when add_disk failed or after
del_gendisk was called.
Fixes: 89f871af1b26 ("dm: delay registering the gendisk")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221115141054.1051801-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While the block device code should switch to implementing
->writepages instead of ->writepage eventually, the current
implementation is entirely pointless as it does the same looping over
->writepage as the generic code if no ->writepages is present.
Remove blkdev_writepages so that we can eventually unexport
generic_writepages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221116132035.2192924-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch extends REQ_ALLOC_CACHE to IRQ completions, whenever
currently it's only limited to iopoll. Instead of guarding the list with
irq toggling on alloc, which is expensive, it keeps an additional
irq-safe list from which bios are spliced in batches to ammortise
overhead. On the put side it toggles irqs, but in many cases they're
already disabled and so cheap.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c2306de96b900ab9264f4428ec37768ddcf0da36.1667384020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Biosets keep a mempool, so as long as requests complete we can always
can allocate and have forward progress. Percpu bio caches break that
assumptions as we may complete into the cache of one CPU and after try
and fail to allocate with another CPU. We also can't grab from another
CPU's cache without tricky sync.
If we're allocating with a bio while the mempool is undersaturated,
remove REQ_ALLOC_CACHE flag, so on put it will go straight to mempool.
It might try to free into mempool more requests than required, but
assuming than there is no memory starvation in the system it'll
stabilise and never hit that path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/aa150caf9c263fa92269e86d7826cc8fa65f38de.1667384020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg_css_online is supposed to pin the blkcg of the parent, but
397c9f46ee4d refactored things and along the way, changed it to pin the
css instead. This results in extra pins, and we end up leaking blkcgs
and cgroups.
Fixes: 397c9f46ee4d ("blk-cgroup: move blkcg_{pin,unpin}_online out of line")
Signed-off-by: Chris Mason <clm@fb.com>
Spotted-by: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org> # v5.19+
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20221114181930.2093706-1-clm@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set->nr_hw_queues for the current number of tags, and remove the
duplicate set->nr_hw_queues update in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in trying to share any code with the realloc case when
all that is needed by the initial tagset allocation is a simple
kcalloc_node.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker
detection.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
but woken_list_node still being hashed. This would happen when
bfq_init_rq() expects a brand new allocated queue to be returned from
bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
without resetting woken_list_node. Since we can always return oom_bfqq
when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful.
Crashes would have a stacktrace like:
[160595.656560] bfq_add_bfqq_busy+0x110/0x1ec
[160595.661142] bfq_add_request+0x6bc/0x980
[160595.666602] bfq_insert_request+0x8ec/0x1240
[160595.671762] bfq_insert_requests+0x58/0x9c
[160595.676420] blk_mq_sched_insert_request+0x11c/0x198
[160595.682107] blk_mq_submit_bio+0x270/0x62c
[160595.686759] __submit_bio_noacct_mq+0xec/0x178
[160595.691926] submit_bio+0x120/0x184
[160595.695990] ext4_mpage_readpages+0x77c/0x7c8
[160595.701026] ext4_readpage+0x60/0xb0
[160595.705158] filemap_read_page+0x54/0x114
[160595.711961] filemap_fault+0x228/0x5f4
[160595.716272] do_read_fault+0xe0/0x1f0
[160595.720487] do_fault+0x40/0x1c8
Tested by injecting random failures into bfq_get_queue, crashes go away
completely.
Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables the NVMe passthru requests to
use P2PDMA pages.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221021174116.7200-8-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221021174116.7200-7-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true either if
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.
Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221021174116.7200-5-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In accordance with [1] the DMA-able memory buffers must be
cacheline-aligned otherwise the cache writing-back and invalidation
performed during the mapping may cause the adjacent data being lost. It's
specifically required for the DMA-noncoherent platforms [2]. Seeing the
opal_dev.{cmd,resp} buffers are implicitly used for DMAs in the NVME and
SCSI/SD drivers in framework of the nvme_sec_submit() and sd_sec_submit()
methods respectively they must be cacheline-aligned to prevent the denoted
problem. One of the option to guarantee that is to kmalloc the buffers
[2]. Let's explicitly allocate them then instead of embedding into the
opal_dev structure instance.
Note this fix was inspired by the commit c94b7f9bab22 ("nvme-hwmon:
kmalloc the NVME SMART log buffer").
[1] Documentation/core-api/dma-api.rst
[2] Documentation/core-api/dma-api-howto.rst
Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221107203944.31686-1-Sergey.Semin@baikalelectronics.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the description of @required_features in elevator_match()
to clear the below warning:
block/elevator.c:103: warning: Excess function parameter 'required_features' description in 'elevator_match'
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2734
Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107062255.2685-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNmdGEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvLpD/9pL9SLpoUAnSvYAzaJC0dJhFHzQhmQgA55
qxwgC4NDyxYgvsLuHVVoR5qRSNQO37nKkgoeHsqSX56UTloQnggurg0Cr94VJcjC
seT1++Dl6BPz1M9h/UFS2cwm26GC+cQmsLIQoACSi0lNEOLytPP/emq6Vuqz0udx
ah1ACXebiHe07A8Kvpt7orHlpM/dKH0/4g5/7h0E5RWrC9yg1WEOHPjd/MQ5amy0
9YkhtqM5OQfNsVY0DcRbgRPr115xSi/L6No3Q6pMAVqzM7ZRk3iD039be7Sooqn8
sl54gZB3AWGzgrFnJLjKCcQg4qg/wyYhZXEuV2JdzYeXCBK6RMcV0I2hP6vWP7Au
dqlw5khvQOwx32qYNlXHU7g/ve5qY7hblIHbyqtKjQicIQ8LP18Ek1QWQcywiK4E
hyYJ/3gYRjVqigyw32++cMSRLbLktiY38+J7NxujIj6J1aOYosCA5kIxTSa11tLG
VGeXny5CS5l0zrl3irGBRI1Qi33T0hnbmf99v+MndFhRfsYAF8tKwuJyI+d+rJvj
S8grDzsmlzwe1INXEbnMEg+SsHOPe5On0bzYIYX9Oi0BSsZf1i4u7SdDb9tu2Tiw
WSJyYBNGCsl7wFSoLmY75j1OvWY/iXYPKqZ8bt9STQbO9vL+VHksFzcnnkvzrBG6
Zs1uD17jwQ==
=JQlF
-----END PGP SIGNATURE-----
Merge tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fixes for the ublk driver (Ming)
- Fixes for error handling memory leaks (Chen Jun, Chen Zhongjin)
- Explicitly clear the last request in a chain when the plug is
flushed, as it may have already been issued (Al)
* tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linux:
block: blk_add_rq_to_plug(): clear stale 'last' after flush
blk-mq: Fix kmemleak in blk_mq_init_allocated_queue
block: Fix possible memory leak for rq_wb on add_disk failure
ublk_drv: add ublk_queue_cmd() for cleanup
ublk_drv: avoid to touch io_uring cmd in blk_mq io path
ublk_drv: comment on ublk_driver entry of Kconfig
ublk_drv: return flag of UBLK_F_URING_CMD_COMP_IN_TASK in case of module