Fix all kernel-doc warnings in drbd_actlog.c:
drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io'
drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io'
drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all'
Fix one spelling error (s/ore/or/).
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <linux-block@vger.kernel.org>
Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When zones were first added the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):
- host managed where non-conventional zones there is strict requirement
to write at the write pointer, or else an error is returned
- host aware where a write point is maintained if writes always happen
at it, otherwise it is left in an under-defined state and the
sequential write preferred zones behave like conventional zones
(probably very badly performing ones, though)
Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it). Due to to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.
Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production. Drop the support before it is too
late. Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
virtblk_revalidate_zones is called unconditionally from
virtblk_config_changed_work from the virtio config_changed callback.
virtblk_revalidate_zones is a bit odd in that it re-clears the zoned
state for host aware or non-zoned devices, which isn't needed unless the
zoned mode changed - but a zone mode change to a host managed model isn't
handled at all, and virtio_blk also doesn't handle any other config
change except for a capacity change is handled (and even if it was
the upper layers above virtio_blk wouldn't handle it very well).
But even the useful case of a size change that would add or remove
zones isn't handled properly as blk_revalidate_disk_zones expects the
device capacity to cover all zones, but the capacity is only updated
after virtblk_revalidate_zones.
As this code appears to be entirely untested and is getting in the way
remove it for now, but it can be readded in a fixed version with
proper test coverage if needed.
Fixes: 95bfec41bd ("virtio-blk: add support for zoned block devices")
Fixes: f1ba4e674f ("virtio-blk: fix to match virtio spec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move reading and checking the zoned model from virtblk_probe_zoned_device
into the caller, leaving only the code to perform the actual setup for
host managed zoned devices in virtblk_probe_zoned_device.
This allows to share the model reading and sharing between builds with
and without CONFIG_BLK_DEV_ZONED, and improve it for the
!CONFIG_BLK_DEV_ZONED case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since "dev_search_path" can technically be as large as PATH_MAX,
there was a risk of truncation when copying it and a second string
into "full_path" since it was also PATH_MAX sized. The W=1 builds were
reporting this warning:
drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra':
drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
616 | snprintf(full_path, PATH_MAX, "%s/%s",
| ^~
In function 'rnbd_srv_get_full_path',
inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14: drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096
616 | snprintf(full_path, PATH_MAX, "%s/%s",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
617 | dev_search_path, dev_name);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
To fix this, unconditionally check for truncation (as was already done
for the case where "%SESSNAME%" was present).
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/
Cc: Md. Haris Iqbal <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <linux-block@vger.kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While printing error, replace %ld by %pe. %pe prints a string
whereas %ld would print an error code.
Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231124213422.113449-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a socket is processing ioctl 'NBD_SET_SOCK', config->socks might be
krealloc in nbd_add_socket(), and a garbage request is received now, a UAF
may occurs.
T1
nbd_ioctl
__nbd_ioctl
nbd_add_socket
blk_mq_freeze_queue
T2
recv_work
nbd_read_reply
sock_xmit
krealloc config->socks
def config->socks
Pass nbd_sock to nbd_read_reply(). And introduce a new function
sock_xmit_recv(), which differs from sock_xmit only in the way it get
socket.
==================================================================
BUG: KASAN: use-after-free in sock_xmit+0x525/0x550
Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779
Workqueue: knbd4-recv recv_work
Call Trace:
__dump_stack
dump_stack+0xbe/0xfd
print_address_description.constprop.0+0x19/0x170
__kasan_report.cold+0x6c/0x84
kasan_report+0x3a/0x50
sock_xmit+0x525/0x550
nbd_read_reply+0xfe/0x2c0
recv_work+0x1c2/0x750
process_one_work+0x6b6/0xf10
worker_thread+0xdd/0xd80
kthread+0x30a/0x410
ret_from_fork+0x22/0x30
Allocated by task 18784:
kasan_save_stack+0x1b/0x40
kasan_set_track
set_alloc_info
__kasan_kmalloc
__kasan_kmalloc.constprop.0+0xf0/0x130
slab_post_alloc_hook
slab_alloc_node
slab_alloc
__kmalloc_track_caller+0x157/0x550
__do_krealloc
krealloc+0x37/0xb0
nbd_add_socket
+0x2d3/0x880
__nbd_ioctl
nbd_ioctl+0x584/0x8e0
__blkdev_driver_ioctl
blkdev_ioctl+0x2a0/0x6e0
block_ioctl+0xee/0x130
vfs_ioctl
__do_sys_ioctl
__se_sys_ioctl+0x138/0x190
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Freed by task 18784:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x40
__kasan_slab_free.part.0+0x13f/0x1b0
slab_free_hook
slab_free_freelist_hook
slab_free
kfree+0xcb/0x6c0
krealloc+0x56/0xb0
nbd_add_socket+0x2d3/0x880
__nbd_ioctl
nbd_ioctl+0x584/0x8e0
__blkdev_driver_ioctl
blkdev_ioctl+0x2a0/0x6e0
block_ioctl+0xee/0x130
vfs_ioctl
__do_sys_ioctl
__se_sys_ioctl+0x138/0x190
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq()
would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request,
which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request().
Then null_queue_rqs() put these requests in the rqlist, return back to
the block layer core, which would try to queue them individually again,
so the warning in blk_mq_start_request() triggered.
Fix it by splitting the null_queue_rq() into two parts: the first is the
preparation of request, the second is the handling of request. We put
the blk_mq_start_request() after the preparation part, which may fail
and return back to the block layer core.
The throttling also belongs to the preparation part, so move it before
blk_mq_start_request(). And change the return type of null_handle_cmd()
to void, since it always return BLK_STS_OK now.
Reported-by: <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com>
Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/
Fixes: d78bfa1346 ("block/null_blk: add queue_rqs() support")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Memory reordering may occur in nbd_genl_connect(), causing config_refs
to be set to 1 while nbd->config is still empty. Opening nbd at this
time will cause null-ptr-dereference.
T1 T2
nbd_open
nbd_get_config_unlocked
nbd_genl_connect
nbd_alloc_and_init_config
//memory reordered
refcount_set(&nbd->config_refs, 1) // 2
nbd->config
->null point
nbd->config = config // 1
Fix it by adding smp barrier to guarantee the execution sequence.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no functional changes, just to make code cleaner and prepare
to fix null-ptr-dereference while accessing 'nbd->config'.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no functional changes, make the code cleaner and prepare to
fix null-ptr-dereference while accessing 'nbd->config'.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 4af5f2e030 ("nbd: use blk_mq_alloc_disk and
blk_cleanup_disk") cleans up disk by blk_cleanup_disk() and it won't set
disk->private_data as NULL as before. UAF may be triggered in nbd_open()
if someone tries to open nbd device right after nbd_put() since nbd has
been free in nbd_dev_remove().
Fix this by implementing ->free_disk and free private data in it.
Fixes: 4af5f2e030 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231107103435.2074904-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
vdpa/mlx5:
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
new maintainer
vdpa:
support for vq descriptor mappings
decouple reset of iotlb mapping from device reset
fixes, cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVCUMYPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp4L0H/RKcnNXPRqzhhBI1XVQ11Z8CO8WjovcmJalu
ADHNEGmvuWnY79fp9eLiZ4iVaTx1qbzqIB5Q500DJ65jh71W7UQ8ww6CGjNUoRGs
Zoe4G09WoOf4bvDZZzVV7ml/AzMdsHWSZK8pxY3QI9CsC9Zfp9hg20QYxPylCqYx
SIJx7w2MkoojfmtOHRx1WUxaQz99yfU4Z0C5PxtRE1HGN6/a1aY0P0CAl5jq8uCK
U5sCRsfCmP7VKlspeEddMiPA35ADbCiysSobCbwGVQEs5cHpMUX7KWa+oV0tF/PY
9uyJb2rJy6zG3tXmL4XNib665ZR86HX6qiWRfm2nBQQStuHaJyg=
=mXgo
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"vhost,virtio,vdpa: features, fixes, cleanups.
vdpa/mlx5:
- VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
- new maintainer
vdpa:
- support for vq descriptor mappings
- decouple reset of iotlb mapping from device reset
and fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
vdpa_sim: implement .reset_map support
vdpa/mlx5: implement .reset_map driver op
vhost-vdpa: clean iotlb map during reset for older userspace
vdpa: introduce .compat_reset operation callback
vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
vhost-vdpa: reset vendor specific mapping to initial state in .release
vdpa: introduce .reset_map operation callback
virtio_pci: add check for common cfg size
virtio-blk: fix implicit overflow on virtio_max_dma_size
virtio_pci: add build offset check for the new common cfg items
virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit
vduse: make vduse_class constant
vhost-scsi: Spelling s/preceeding/preceding/g
virtio: kdoc for struct virtio_pci_modern_device
vdpa: Update sysfs ABI documentation
MAINTAINERS: Add myself as mlx5_vdpa driver
virtio-balloon: correct the comment of virtballoon_migratepage()
mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
vdpa/mlx5: Update cvq iotlb mapping on ASID change
vdpa/mlx5: Make iotlb helper functions more generic
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV
vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v
4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL
B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo
LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2
ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA
n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR
C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7
nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G
vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt
p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy
WgH1lrtMzg==
=sace
-----END PGP SIGNATURE-----
Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- Improvements to the queue_rqs() support, and adding null_blk support
for that as well (Chengming)
- Series improving badblocks support (Coly)
- Key store support for sed-opal (Greg)
- IBM partition string handling improvements (Jan)
- Make number of ublk devices supported configurable (Mike)
- Cancelation improvements for ublk (Ming)
- MD pull requests via Song:
- Handle timeout in md-cluster, by Denis Plotnikov
- Cleanup pers->prepare_suspend, by Yu Kuai
- Rewrite mddev_suspend(), by Yu Kuai
- Simplify md_seq_ops, by Yu Kuai
- Reduce unnecessary locking array_state_store(), by Mariusz
Tkaczyk
- Make rdev add/remove independent from daemon thread, by Yu Kuai
- Refactor code around quiesce() and mddev_suspend(), by Yu Kuai
- NVMe pull request via Keith:
- nvme-auth updates (Mark)
- nvme-tcp tls (Hannes)
- nvme-fc annotaions (Kees)
- Misc cleanups and improvements (Jiapeng, Joel)
* tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
block: ublk_drv: Remove unused function
md: cleanup pers->prepare_suspend()
nvme-auth: allow mixing of secret and hash lengths
nvme-auth: use transformed key size to create resp
nvme-auth: alloc nvme_dhchap_key as single buffer
nvmet-tcp: use 'spin_lock_bh' for state_lock()
powerpc/pseries: PLPKS SED Opal keystore support
block: sed-opal: keystore access for SED Opal keys
block:sed-opal: SED Opal keystore
ublk: simplify aborting request
ublk: replace monitor with cancelable uring_cmd
ublk: quiesce request queue when aborting queue
ublk: rename mm_lock as lock
ublk: move ublk_cancel_dev() out of ub->mutex
ublk: make sure io cmd handled in submitter task context
ublk: don't get ublk device reference in ublk_abort_queue()
ublk: Make ublks_max configurable
ublk: Limit dev_id/ub_number values
md-cluster: check for timeout while a new disk adding
nvme: rework NVME_AUTH Kconfig selection
...
The following codes have an implicit conversion from size_t to u32:
(u32)max_size = (size_t)virtio_max_dma_size(vdev);
This may lead overflow, Ex (size_t)4G -> (u32)0. Once
virtio_max_dma_size() has a larger size than U32_MAX, use U32_MAX
instead.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20230904061045.510460-1-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
/SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
m2VXyBl3121BVtKi4w==
=0iVF
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
disk_check_media_change is mostly called from ->open where it makes
little sense to mark the file system on the device as dead, as we
are just opening it. So instead of calling bdev_mark_dead from
disk_check_media_change move it into the few callers that are not
in an open instance. This avoid calling into bdev_mark_dead and
thus taking s_umount with open_mutex held.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert zram to use bdev_open_by_dev() and pass the handle around.
CC: Minchan Kim <minchan@kernel.org>
CC: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-8-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert rnbd-srv to use bdev_open_by_path() and pass the handle
around.
CC: Jack Wang <jinpu.wang@ionos.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Acked-by: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-6-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert pktcdvd to use bdev_open_by_dev().
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-5-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
The function are defined in the ublk_drv.c file, but not called
elsewhere, so delete the unused function.
drivers/block/ublk_drv.c:1211:20: warning: unused function 'ublk_abort_io_cmds'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6938
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Fixes: b4e1353f46 ("ublk: simplify aborting request")
Reviewed-by: Ming Lei <ming.lei@rehdat.com>
Link: https://lore.kernel.org/r/20231019030444.53680-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now ublk_abort_queue() is run exclusively with ublk_queue_rq() and the
ubq_daemon task, so simplify aborting request:
- set UBLK_IO_FLAG_ABORTED in ublk_abort_queue() just for aborting
this request
- abort request in ublk_queue_rq() if ubq->canceling is set
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Monitor work actually introduces one extra context for handling abort, this
way is easy to cause race, and also introduce extra delay when handling
aborting.
Now we start to support cancelable uring_cmd, so use it instead:
1) this cancel callback is either run from the uring cmd submission task
context or called after the io_uring context is exit, so the callback is
run exclusively with ublk_ch_uring_cmd() and __ublk_rq_task_work().
2) the previous patch freezes request queue when calling ublk_abort_queue(),
which is now completely exclusive with ublk_queue_rq() and
ublk_ch_uring_cmd()/__ublk_rq_task_work().
3) in timeout handler, if all IOs are in-flight, then all uring commands
are completed, uring command canceling can't help us to provide forward
progress any more, so call ublk_abort_requests() in timeout handler.
This way simplifies aborting queue, and is helpful for adding new feature,
such as, relax the limit of using single task for handling one queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So far aborting queue ends request when the ubq daemon is exiting, and
it can be run concurrently with ublk_queue_rq(), this way is fragile and
we depend on the tricky usage of UBLK_IO_FLAG_ABORTED for avoiding such
race.
Quiesce queue when aborting queue, and the two code paths can be run
completely exclusively, then it becomes easier to add new ublk feature,
such as relaxing single same task limit for each queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename mm_lock field of ublk_device as lock, so that this lock can be reused
for protecting access of ub->ub_disk, which will be used for simplifying
ublk_abort_queue() by quiesce queue in next patch.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_cancel_dev() just calls ublk_cancel_queue() to cancel all pending
io commands after ublk request queue is idle. The only protection is just
the read & write of ubq->nr_io_ready and avoid duplicated command cancel,
so add one per-queue lock with cancel flag for providing this protection,
meantime move ublk_cancel_dev() out of ub->mutex.
Then we needn't to call io_uring_cmd_complete_in_task() to cancel
pending command. And the same cancel logic will be re-used for
cancelable uring command.
This patch basically reverts commit ac5902f84b ("ublk: fix AB-BA lockdep warning").
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In well-done ublk server implementation, ublk io command won't be
linked into any link chain. Meantime they are always handled in no-wait
style, so basically io cmd is always handled in submitter task context.
However, the server may set IOSQE_ASYNC, or io command is linked to one
chain mistakenly, then we may still run into io-wq context and
ctx->uring_lock isn't held.
So in case of IO_URING_F_UNLOCKED, schedule this command by
io_uring_cmd_complete_in_task to force running it in submitter task. Then
ublk_ch_uring_cmd_local() is guaranteed to run with context uring_lock held,
and we needn't to worry about sync among submission code path any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_abort_queue() is called in ublk_daemon_monitor_work(), in which
it is guaranteed that the device is live because monitor work is
canceled when removing device, so no need to get the device reference.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We are converting tcmu applications to ublk, but have systems with up
to 1k devices. This patch allows us to configure the ublks_max from
userspace with the ublks_max modparam.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231012150600.6198-3-michael.christie@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The dev_id/ub_number is used for the ublk dev's char device's minor
number so it has to fit into MINORMASK. This patch adds checks to prevent
userspace from passing a number that's too large and limits what can be
allocated by the ublk_index_idr for the case where userspace has the
kernel allocate the dev_id/ub_number.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231012150600.6198-2-michael.christie@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmUgNUIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppleEADNnAobhJCLJWw2uSwkfIqap4pL+LFvA0B5
qdhOJb1U7t3mmu6/VhSEpvRFQiSvMHzhXBWHepQfii0Iu7ubkOwlEhPEirZJLBdo
2Qt6RPxijfhKRHbrujNzjrCUAIti8zchaBrYmYQtB7paKCtTStPLOezBgfOutFl7
TRLF+GBpsISuQUASSIFssxf7RpZTzS9pdlrS09BeViROdwwqYHXNUIl0kxEFu9BM
SjElcby5ZJj5drg0SRl5ePG/XCp4Sr/rFDQJI2S/d++qwJS8egsYPQAe7Tbpw4y5
9FBhs5hKypRkdTjbOVa23lqdfHcSIEyRM2lUL12vaVMA5m+mZ7SePt9qX4Vbg074
2QJ1YVm4c/UFe6uieUcmDgJNgUMEyeEyo4SUj9gl/gn8wQE1dz5qB5PF1YWEcHa7
KJ1zBbxBKw4ETLBdt24wCHH8HyA7SO8RCk8EMxFv7ENvQSzXzo+mdvrUYpp04Mu8
eNV8VvqpMvBuWuh9uAO4CS1GIFQW7DijrProbyh9aOthu7850/ahnwS3b6+jkZYP
3A7vaHKpU3VyGLYULAE9lZhmeCxmFNHzClgzX2Wve4Z94Xoq7Sa8xksV3JBx5NzY
5Rt5rfdTZArt+LPL4CrdwlxEnT3F5yTSKXp82lWo6P1ds/x7WEYXKf0B4OgAGANk
6tk9rK3ZBA==
=NWf+
-----END PGP SIGNATURE-----
Merge tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Just two minor fixes, for nbd and md"
* tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux:
nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl
md/raid5: release batch_last before waiting for another stripe_head
blk_mark_disk_dead is the proper interface to shut down a block
device, but it also makes the disk unusable forever.
nbd_clear_sock_ioctl on the other hand wants to shut down the file
system, but allow the block device to be used again when when connected
to another socket. Switch nbd to use disk_force_media_change and
nbd_bdev_reset to go back to a behavior of the old __invalidate_device
call, with the added benefit of incrementing the device generation
as there is no guarantee the old content comes back when the device
is reconnected.
Reported-by: Samuel Holland <samuel.holland@sifive.com>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 0c1c9a27ce ("nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20231003153106.1331363-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
We should favor a more robust and less ambiguous interface.
We expect that both `nullb->disk_name` and `disk->disk_name` be
NUL-terminated:
| snprintf(nullb->disk_name, sizeof(nullb->disk_name),
| "%s", config_item_name(&dev->group.cg_item));
...
| pr_info("disk %s created\n", nullb->disk_name);
It seems like NUL-padding may be required due to __assign_disk_name()
utilizing a memcpy as opposed to a `str*cpy` api.
| static inline void __assign_disk_name(char *name, struct gendisk *disk)
| {
| if (disk)
| memcpy(name, disk->disk_name, DISK_NAME_LEN);
| else
| memset(name, 0, DISK_NAME_LEN);
| }
Then we go and print it with `__print_disk_name` which wraps `nullb_trace_disk_name()`.
| #define __print_disk_name(name) nullb_trace_disk_name(p, name)
This function obviously expects a NUL-terminated string.
| const char *nullb_trace_disk_name(struct trace_seq *p, char *name)
| {
| const char *ret = trace_seq_buffer_ptr(p);
|
| if (name && *name)
| trace_seq_printf(p, "disk=%s, ", name);
| trace_seq_putc(p, 0);
|
| return ret;
| }
>From the above, we need both 1) a NUL-terminated string and 2) a
NUL-padded string. So, let's use strscpy_pad() as per Kees' suggestion
from v1.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-null_blk-main-c-v3-1-10cf0a87a2c3@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct fifo_buffer.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200316.never.707-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages. With commit 870611e487
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.
For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:
1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
rbd_dev_refresh() which takes header_rwsem for write and submits
a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
__rbd_img_handle_request(), it blocks trying to take header_rwsem
for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
processed, because the messenger is stranded
Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header. Header and parent info read-in don't need
any locking.
Cc: stable@vger.kernel.org # 0b035401c5: rbd: move rbd_dev_refresh() definition
Cc: stable@vger.kernel.org # 510a7330c8: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable@vger.kernel.org # c10311776f: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable@vger.kernel.org
Fixes: 870611e487 ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.
Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did). Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Make rbd_dev_header_info() populate a passed struct rbd_image_header
instead of rbd_dev->header and introduce rbd_dev_update_header() for
updating mutable fields in rbd_dev->header upon refresh. The initial
read-in of both mutable and immutable fields in rbd_dev_image_probe()
passes in rbd_dev->header so no update step is required there.
rbd_init_layout() is now called directly from rbd_dev_image_probe()
instead of individually in format 1 and format 2 implementations.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Move rbd_dev_refresh() definition further down to avoid having to
move struct parent_image_info definition in the next commit. This
spares some forward declarations too.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Add batched mq_ops.queue_rqs() support in null_blk for testing. The
implementation is much easy since null_blk doesn't have commit_rqs().
We simply handle each request one by one, if errors are encountered,
leave them in the passed in list and return back.
There is about 3.6% improvement in IOPS of fio/t/io_uring on null_blk
with hw_queue_depth=256 on my test VM, from 1.09M to 1.13M.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-6-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.
Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmT7NSUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuIuEACxQj0JjIVwgp8VlUcMAEFap7GH940qs8is
tnRrGS48G/z3WZ0zizGQ1hF2m9oOJ1NwZ1bBrGJlVBP8KSgyR9IxI6WVt0JBZgg8
DNQt+v5FoX4v5ZbFr/NTqz/6YnPWVwyTkJs4JIi9YLmpKkd58N+51n3wl56xyrp7
8OcjHPH1ZI0qGar2poPXI7EKaweSzJW7nr3/cHujYoCfAGn7eZbHNgY0wdAtWdhM
oGFABgMKTFKA0rCSaP8uo7GIqov+KXvFUNOuC6eNjtaUvXMecPLEJxvDhhi52fml
LEH1a9bmtazu6bufrbI2D4SmXLNBcqnQw5d/1pn9oRO/UU6AAi6E7I/L1iodKWav
bkr2i+YMnhUqVDFRwy+j78bcGRlLPWupvj68RdpvAvzhdW/85p9nH66CJfTnjQ3q
xJJKoUREd/8C6ktm5xHYyETxcL5q6Vdz1w5GmsCyOWWpsXsEbh0GHn6pRDkLWJf4
OQhO4CbxIARGE5NRK8MYl/J/581yS1cIdGZAg7hsVHzTGDSt3h/Hlh6ONpWCcvXD
SJI39W83oBSOnsbRG4FU6Bnt/WHTD0yi07cr867l9HEq63JOK/OduMjXA/tdRfwP
fZU3RfFYcS8+h9UHgrSgvCk03fCjq1HsWCDfpc6sUuRG1fH6jSNFdkCKqhCoU9Py
K2IKi1IE0Q==
=4SLW
-----END PGP SIGNATURE-----
Merge tag 'block-6.6-2023-09-08' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix null_blk polled IO timeout handling (Chengming)
- Regression fix for swapped arguments in drbd bvec_set_page()
(Christoph)
- String length handling fix for s390 dasd (Heiko)
- Fixes for blk-throttle accounting (Yu)
- Fix page pinning issue for same page segments (Christoph)
- Remove redundant file_remove_privs() call (Christoph)
- Fix a regression in partition handling for devices not supporting
partitions (Li)
* tag 'block-6.6-2023-09-08' of git://git.kernel.dk/linux:
drbd: swap bvec_set_page len and offset
block: fix pin count management when merging same-page segments
null_blk: fix poll request timeout handling
s390/dasd: fix string length handling
block: don't add or resize partition on the disk with GENHD_FL_NO_PART
block: remove the call to file_remove_privs in blkdev_write_iter
blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()
blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()
blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negative
blk-throttle: print signed value 'carryover_bytes/ios' for user
fscrypt support to CephFS! The list of things which don't work with
encryption should be fairly short, mostly around the edges: fallocate
(not supported well in CephFS to begin with), copy_file_range (requires
re-encryption), non-default striping patterns.
This was a multi-year effort principally by Jeff Layton with assistance
from Xiubo Li, Luís Henriques and others, including several dependant
changes in the MDS, netfs helper library and fscrypt framework itself.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmT4pl4THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5kzB/4sMgzZyUa3T1vA/G2pPvEkyy1qDxsW
y+o4dDMWA9twcrBVpNuGd54wbXpmO/LAekHEdorjayH+f0zf10MsnP1ePz9WB3NG
jr7RRujb+Gpd2OFYJXGSEbd3faTg8M2kpGCCrVe7SFNoyu8z9NwFItwWMog5aBjX
ODGQrq+kA4ARA6xIqwzF5gP0zr+baT9rWhQdm7Xo9itWdosnbyDLJx1dpEfLuqBX
te3SmifDzedn3Gw73hdNo/+ybw0kHARoK+RmXCTsoDDQw+JsoO9KxZF5Q8QcDELq
2woPNp0Hl+Dm4MkzGnPxv56Qj8ZDViS59syXC0CfGRmu4nzF1Rw+0qn5
=/WlE
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Mixed with some fixes and cleanups, this brings in reasonably complete
fscrypt support to CephFS! The list of things which don't work with
encryption should be fairly short, mostly around the edges: fallocate
(not supported well in CephFS to begin with), copy_file_range
(requires re-encryption), non-default striping patterns.
This was a multi-year effort principally by Jeff Layton with
assistance from Xiubo Li, Luís Henriques and others, including several
dependant changes in the MDS, netfs helper library and fscrypt
framework itself"
* tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client: (53 commits)
ceph: make num_fwd and num_retry to __u32
ceph: make members in struct ceph_mds_request_args_ext a union
rbd: use list_for_each_entry() helper
libceph: do not include crypto/algapi.h
ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper
ceph: fix updating i_truncate_pagecache_size for fscrypt
ceph: wait for OSD requests' callbacks to finish when unmounting
ceph: drop messages from MDS when unmounting
ceph: update documentation regarding snapshot naming limitations
ceph: prevent snapshot creation in encrypted locked directories
ceph: add support for encrypted snapshot names
ceph: invalidate pages when doing direct/sync writes
ceph: plumb in decryption during reads
ceph: add encryption support to writepage and writepages
ceph: add read/modify/write to ceph_sync_write
ceph: align data in pages in ceph_sync_write
ceph: don't use special DIO path for encrypted inodes
ceph: add truncate size handling support for fscrypt
ceph: add object version support for sync read
libceph: allow ceph_osdc_new_request to accept a multi-op read
...
bvec_set_page has the following signature:
static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
unsigned int len, unsigned int offset)
However, the usage in DRBD swaps the len and offset parameters. This
leads to a bvec with length=0 instead of the intended length=4096, which
causes sock_sendmsg to return -EIO.
This leaves DRBD unable to transmit any pages and thus completely
broken.
Swapping the parameters fixes the regression.
Fixes: eeac7405c7 ("drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()")
Reported-by: Serguei Ivantsov <manowar@gsc-game.com>
Link: https://lore.kernel.org/regressions/CAKH+VT3YLmAn0Y8=q37UTDShqxDLsqPcQ4hBMzY7HPn7zNx+RQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230906133034.948817-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When doing io_uring benchmark on /dev/nullb0, it's easy to crash the
kernel if poll requests timeout triggered, as reported by David. [1]
BUG: kernel NULL pointer dereference, address: 0000000000000008
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:null_timeout_rq+0x4e/0x91
Call Trace:
? null_timeout_rq+0x4e/0x91
blk_mq_handle_expired+0x31/0x4b
bt_iter+0x68/0x84
? bt_tags_iter+0x81/0x81
__sbitmap_for_each_set.constprop.0+0xb0/0xf2
? __blk_mq_complete_request_remote+0xf/0xf
bt_for_each+0x46/0x64
? __blk_mq_complete_request_remote+0xf/0xf
? percpu_ref_get_many+0xc/0x2a
blk_mq_queue_tag_busy_iter+0x14d/0x18e
blk_mq_timeout_work+0x95/0x127
process_one_work+0x185/0x263
worker_thread+0x1b5/0x227
This is indeed a race problem between null_timeout_rq() and null_poll().
null_poll() null_timeout_rq()
spin_lock(&nq->poll_lock)
list_splice_init(&nq->poll_list, &list)
spin_unlock(&nq->poll_lock)
while (!list_empty(&list))
req = list_first_entry()
list_del_init()
...
blk_mq_add_to_batch()
// req->rq_next = NULL
spin_lock(&nq->poll_lock)
// rq->queuelist->next == NULL
list_del_init(&rq->queuelist)
spin_unlock(&nq->poll_lock)
Fix these problems by setting requests state to MQ_RQ_COMPLETE under
nq->poll_lock protection, in which null_timeout_rq() can safely detect
this race and early return.
Note this patch just fix the kernel panic when request timeout happen.
[1] https://lore.kernel.org/all/3893581.1691785261@warthog.procyon.org.uk/
Fixes: 0a593fbbc2 ("null_blk: poll queue support")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230901120306.170520-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>