linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2025-01-08 14:13:53 +00:00

Author	SHA1	Message	Date
Mike Snitzer	6c43cf2488	dm vdo int-map: return VDO_SUCCESS on success Update all callers to check for VDO_SUCCESS (most already did). Also fix whitespace for update_mapping() parameters. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	2de70388b3	dm vdo: check for VDO_SUCCESS return value from memory-alloc functions VDO_SUCCESS and UDS_SUCCESS were used interchangably, update all callers of VDO's memory-alloc functions to consistently check for VDO_SUCCESS. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	97d3380396	dm vdo memory-alloc: return VDO_SUCCESS on success Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Matthew Sakai	ee8f6ec1b1	dm vdo errors: remove unused error codes Also define VDO_SUCCESS in a more central location, and rename error block constants for clarity. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Mike Snitzer	8f89115efc	dm vdo memory-alloc: rename vdo_do_allocation to __vdo_do_allocation __vdo_do_allocation shouldn't be used outside of memory-alloc.h, so add hidden prefix. Also, tabify the vdo_allocate_extended macro. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	0eea6b6e78	dm vdo memory-alloc: change from uds_ to vdo_ namespace Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Bruce Johnston	6008d526b0	dm-vdo: change unnamed enums to defines Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Matthew Sakai	04530b487b	dm vdo: remove outdated pointer_map reference Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Matthew Sakai	e1e510fcad	dm vdo: update module comments Update outdated comments referring to separate VDO and UDS modules. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Matthew Sakai	bbe434d94e	dm vdo indexer delta-index: fix typos in comments Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Jiapeng Chong	eebd4e1630	dm vdo: fix various function names referenced in comment blocks No functional modification involved. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	17b1a73fea	dm vdo: move indexer files into sub-directory The goal is to assist high-level understanding of which code is conceptually specific to VDO's indexer. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Matthew Sakai	61234f0bda	dm vdo: remove unnecessary indexer.h includes Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Chung Chung	81c751ad1b	dm vdo: clean up scnprintf usage Ignore scnprintf return status since it is not necessary. Change write_* functions type from int to void since we no longer return any result. Also, clean up any code that checks or uses any scnprintf return results. Check uds_allocate return code which was previous ignored, return and log error when uds_allocate failed. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	20be466c7a	dm vdo: include <asm/current.h> to resolve current being undeclared Reported when building on loongarch. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Mike Snitzer	444d3f0bfd	dm vdo indexer-volume: fix missing mutex_lock in process_entry Must mutex_lock after dm_bufio_read, before dm_bufio_read error handling, otherwise process_entry error path will return without volume->read_threads_mutex held. This fixes potential double mutex_unlock. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Mike Snitzer	b259c1a60c	dm vdo flush: initialize return to NULL in allocate_flush Otherwise, error path could result in allocate_flush's subsequent check for flush being non-NULL leading to false positive. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Dan Carpenter	672fc9b8c0	dm vdo slab-depot: delete unnecessary check in allocate_components This is a duplicate check so it can't be true. Delete it. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	924553644a	dm vdo memory-alloc: simplify allocations_allowed() Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Susan LeGendre-McGhee	dcd1332bb5	dm vdo: remove internal ticket references Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Tejun Heo	c375b22333	dm-verity: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This commit converts dm-verity from tasklet to BH workqueue. It backfills tasklet code that was removed with commit `0a9bab391e` ("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH workqueue (and does some renaming). This is a minimal conversion which doesn't rename the related names including the "try_verify_in_tasklet" option. If this patch is applied, a follow-up patch would be necessary. I couldn't decide whether the option name would need to be updated too. Signed-off-by: Tejun Heo <tj@kernel.org> [snitzer: rename 'use_tasklet' to 'use_bh_wq' and 'in_tasklet' to 'in_bh'] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-02 10:30:36 -05:00
Tejun Heo	fb6ad4aec1	dm-crypt: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This commit converts dm-crypt from tasklet to BH workqueue. It backfills tasklet code that was removed with commit `0a9bab391e` ("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH workqueue. Like a regular workqueue, a BH workqueue allows freeing the currently executing work item. Converting from tasklet to BH workqueue removes the need for deferring bio_endio() again to a work item, which was buggy anyway. I tested this lightly with "--perf-no_read_workqueue --perf-no_write_workqueue" + some code modifications, but would really -appreciate if someone who knows the code base better could take a look. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/82b964f0-c2c8-a2c6-5b1f-f3145dc2c8e5@redhat.com [snitzer: rebase ontop of commit `0a9bab391e` reduced this commit's changes] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-02 10:30:36 -05:00
Christoph Hellwig	8e0ef41286	dm: use queue_limits_set Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-01 08:54:42 -07:00
Mike Snitzer	6a87a8a258	dm vdo thread-device: rename all methods to reflect vdo-only use Also moved vdo_init()'s call to vdo_initialize_thread_device_registry next to other registry initialization. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:24 -05:00
Mike Snitzer	82b354ffe2	dm vdo thread-registry: rename all methods to reflect vdo-only use Otherwise, uds_ prefix is misleading (vdo_ is the new catch-all for code that is used by vdo-only or _both_ vdo and the indexer code). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:20 -05:00
Mike Snitzer	cb6f8b7500	dm vdo thread-utils: cleanup included headers Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:11 -05:00
Mike Snitzer	650e3107bc	dm vdo thread-utils: further cleanup of thread functions Change thread function prefix from "uds_" to "vdo_" and fix vdo_join_threads() to return void. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:07 -05:00
Mike Snitzer	fe6e4ccbe8	dm vdo thread-utils: remove all uds_*_mutex wrappers Just use mutex_init, mutex_lock and mutex_unlock. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:03 -05:00
Mike Snitzer	7f2e494ddd	dm vdo thread-utils: push uds_*_cond interface down to indexer Only used by indexer components. Also return void from uds_init_cond(), remove uds_destroy_cond(), and fix up all callers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:58 -05:00
Mike Snitzer	877f36b764	dm vdo: fold thread-cond-var.c into thread-utils Further cleanup is needed for thread-utils interfaces given many functions should return void or be removed entirely because they amount to obfuscation via wrappers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:54 -05:00
Mike Snitzer	8e6333af19	dm vdo indexer: rename uds.h to indexer.h Also remove unnecessary include from funnel-queue.c. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:49 -05:00
Mike Snitzer	c2f54aa2b2	dm vdo: rename uds-threads.[ch] to thread-utils.[ch] Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:45 -05:00
Mike Snitzer	eef7cf5e22	dm vdo indexer sparse-cache: cleanup threads_barrier code Rename 'barrier' to 'threads_barrier', remove useless uds_destroy_barrier(), return void from remaining methods and clean up uds_make_sparse_cache() accordingly. Also remove uds_ prefix from the 2 remaining threads_barrier functions. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:41 -05:00
Mike Snitzer	0593855a83	dm vdo uds-threads: push 'barrier' down to sparse-cache The sparse-cache is the only user of the 'barrier' data structure, so just move it private to it. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:36 -05:00
Mike Snitzer	2d98aa1780	dm vdo uds-threads: eliminate uds_*_semaphore interfaces The implementation of thread 'barrier' data structure does not require overdone private semaphore wrappers. Also rename the barrier structure's 'mutex' member (a semaphore) to 'lock'. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:32 -05:00
Mike Snitzer	9d87418945	dm vdo: make uds_*_semaphore interface private to uds-threads.c Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:23 -05:00
Mike Snitzer	50944062f7	dm vdo block-map: rename page state name from "UDS_FREE" to "FREE" Only used for log message, but no need for "UDS_" prefix. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:16 -05:00
Harshit Mogalapalli	f304f6b443	dm vdo volume-index: fix an assert statement in start_restoring_volume_sub_index() Use "==" instead of "=" in ASSERT() statement. Fixes: ef074a31e88e ("dm vdo: implement the volume index") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-01 09:25:09 -05:00
Yu Kuai	0091c5a269	md/raid1: factor out helpers to choose the best rdev from read_balance() The way that best rdev is chosen: 1) If the read is sequential from one rdev: - if rdev is rotational, use this rdev; - if rdev is non-rotational, use this rdev until total read length exceed disk opt io size; 2) If the read is not sequential: - if there is idle disk, use it, otherwise: - if the array has non-rotational disk, choose the rdev with minimal inflight IO; - if all the underlaying disks are rotational disk, choose the rdev with closest IO; There are no functional changes, just to make code cleaner and prepare for following refactor. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	ba58f57fdf	md/raid1: factor out the code to manage sequential IO There is no functional change for now, make read_balance() cleaner and prepare to fix problems and refactor the handler of sequential IO. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	9f3ced7922	md/raid1: factor out choose_bb_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the rdev with bad blocks from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	dfa8ecd167	md/raid1: factor out choose_slow_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the slow rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	31a7333175	md/raid1: factor out read_first_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the first rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	f109207629	md/raid1-10: factor out a new helper raid1_should_read_first() If resync is in progress, read_balance() should find the first usable disk, otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same checking, hence factor out the checking to make code cleaner. Noted that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 update it before submitting resync IO. Fortunately, raid10 read IO can't concurrent with resync IO, hence there is no problem. And this patch also switch raid10 to use 'mddev->recovery_cp'. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	f29841ff3b	md/raid1-10: add a helper raid1_check_read_range() The checking and handler of bad blocks appear many timers during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	257ac239ff	md/raid1: fix choose next idle in read_balance() Commit `12cee5a8a2` ("md/raid1: prevent merging too large request") add the case choose next idle in read_balance(): read_balance: for_each_rdev if(next_seq_sect == this_sector \|\| dist == 0) -> sequential reads best_disk = disk; if (...) choose_next_idle = 1 continue; for_each_rdev -> iterate next rdev if (pending == 0) best_disk = disk; -> choose the next idle disk break; if (choose_next_idle) -> keep using this rdev if there are no other idle disk contine However, commit `2e52d449bc` ("md/raid1: add failfast handling for reads.") remove the code: - /* If device is idle, use it */ - if (pending == 0) { - best_disk = disk; - break; - } Hence choose next idle will never work now, fix this problem by following: 1) don't set best_disk in this case, read_balance() will choose the best disk after iterating all the disks; 2) add 'pending' so that other idle disk will be chosen; 3) add a new local variable 'sequential_disk' to record the disk, and if there is no other idle disk, 'sequential_disk' will be chosen; Fixes: `2e52d449bc` ("md/raid1: add failfast handling for reads.") Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	2c27d09d3a	md/raid1: record nonrot rdevs while adding/removing rdevs to conf For raid1, each read will iterate all the rdevs from conf and check if any rdev is non-rotational, then choose rdev with minimal IO inflight if so, or rdev with closest distance otherwise. Disk nonrot info can be changed through sysfs entry: /sys/block/[disk_name]/queue/rotational However, consider that this should only be used for testing, and user really shouldn't do this in real life. Record the number of non-rotational disks in conf, to avoid checking each rdev in IO fast path and simplify read_balance() a little bit. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	969d6589ab	md/raid1: factor out helpers to add rdev to conf There are no functional changes, just make code cleaner and prepare to record disk non-rotational information while adding and removing rdev to conf Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	3a0f007b69	md: add a new helper rdev_has_badblock() The current api is_badblock() must pass in 'first_bad' and 'bad_sectors', however, many caller just want to know if there are badblocks or not, and these caller must define two local variable that will never be used. Add a new helper rdev_has_badblock() that will only return if there are badblocks or not, remove unnecessary local variables and replace is_badblock() with the new helper in many places. There are no functional changes, and the new helper will also be used later to refactor read_balance(). Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Gui-Dong Han	dfd2bf4367	md/raid5: fix atomicity violation in raid5_cache_count In raid5_cache_count(): if (conf->max_nr_stripes < conf->min_nr_stripes) return 0; return conf->max_nr_stripes - conf->min_nr_stripes; The current check is ineffective, as the values could change immediately after being checked. In raid5_set_cache_size(): ... conf->min_nr_stripes = size; ... while (size > conf->max_nr_stripes) conf->min_nr_stripes = conf->max_nr_stripes; ... Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The current checks are ineffective as values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow. This possible bug is found by an experimental static analysis tool developed by our team. This tool analyzes the locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs including data races and atomicity violations. The above possible bug is reported when our tool analyzes the source code of Linux 6.2. To resolve this issue, it is suggested to introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() fails to resolve atomicity violations, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked. With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we cannot test the patch in runtime testing, and just verify it according to the code logic. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han <2045gemini@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com Signed-off-by: Song Liu <song@kernel.org>	2024-02-27 14:25:34 -08:00
Heming Zhao	ecbd8ebb51	md/md-bitmap: fix incorrect usage for sb_index Commit `d7038f9518` ("md-bitmap: don't use ->index for pages backing the bitmap file") removed page->index from bitmap code, but left wrong code logic for clustered-md. current code never set slot offset for cluster nodes, will sometimes cause crash in clustered env. Call trace (partly): md_bitmap_file_set_bit+0x110/0x1d8 [md_mod] md_bitmap_startwrite+0x13c/0x240 [md_mod] raid1_make_request+0x6b0/0x1c08 [raid1] md_handle_request+0x1dc/0x368 [md_mod] md_submit_bio+0x80/0xf8 [md_mod] __submit_bio+0x178/0x300 submit_bio_noacct_nocheck+0x11c/0x338 submit_bio_noacct+0x134/0x614 submit_bio+0x28/0xdc submit_bh_wbc+0x130/0x1cc submit_bh+0x1c/0x28 Fixes: `d7038f9518` ("md-bitmap: don't use ->index for pages backing the bitmap file") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240223121128.28985-1-heming.zhao@suse.com	2024-02-26 13:34:44 -08:00
Li Nan	e9b0a1556c	md: check mddev->pers before calling md_set_readonly() If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly(). Except for md_ioctl(), the other two callers of md_set_readonly() have already checked 'mddev->pers'. To simplify the code, move the check of 'mddev->pers' to the caller. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	650b2e69ff	md: clean up openers check in do_md_stop() and md_set_readonly() Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev() is always called to check the openers. So no longer need to check it again in do_md_stop() and md_set_readonly(). Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	99b902ac17	md: sync blockdev before stopping raid or setting readonly Commit `a05b7ea03d` ("md: avoid crash when stopping md array races with closing other open fds.") added sync_block before stopping raid and setting readonly. Later in commit `260fa034ef` ("md: avoid deadlock when dirty buffers during md_stop.") it is moved to ioctl. array_state_store() was ignored. Add sync blockdev to array_state_store() now. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	f74aaf614e	md: factor out a helper to sync mddev There are no functional changes, prepare to sync mddev in array_state_store(). Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	9674f54e41	md: Don't clear MD_CLOSING when the raid is about to stop The raid should not be opened anymore when it is about to be stopped. However, other processes can open it again if the flag MD_CLOSING is cleared before exiting. From now on, this flag will not be cleared when the raid will be stopped. Fixes: `065e519e71` ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-6-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	91b26a39fb	md: return directly before setting did_set_md_closing There is nothing to do at 'out' before setting 'did_set_md_closing' in md_ioctl(). Return directly, and it will help us to remove 'did_set_md_closing' later. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	9dd8702e7c	md: clean up invalid BUG_ON in md_ioctl 'disk->private_data' is set to mddev in md_alloc() and never set to NULL, and users need to open mddev before submitting ioctl. So mddev must not have been freed during ioctl, and there is no need to check mddev here. Clean up it. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-4-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	4e26593944	md: changed the switch of RAID_VERSION to if There is only one case of this 'switch'. Change it to 'if'. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	2fe4ffc3ec	md: merge the check of capabilities into md_ioctl_valid() There is no functional change. Just to make code cleaner. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Christian Brauner	3789fb8746	bcache: port block device access to files Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-13-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:24 +01:00
Christian Brauner	a28d893eb3	md: port block device access to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:22 +01:00
Linus Torvalds	f2e367d6ad	- Fix DM integrity and verity targets to not use excessive stack when they recheck in the error path. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXaEswACgkQxSPxCi2d A1oT2Qf/c1opgjRUe+yY/v7nWf4paufSj2O4LYAy/qQBU7IS9CcXQPzi/pKlfEo8 60OZfa5gfrCAla79se7hHI/mxReq7CI5nFvYDyqQ1JZQ/djG/4cN/oWf5fQ12pon /ET1IzaZ+Mom+5wDBeQBLoQwXTA1ru5Bi1OiUe9Ed3wzadZQQks5s65fPnc0emGJ ClyaXiiCt4Dy36E5GmuPpmPB4ZJ57SwcnFWDFIeCHEbIQk36APkZ22z7lqGObjw2 ANO1l59k6ojzmaXLi9pw/J/o/qyfNR0MpeI7SpmtJzhSZKeGKsUX2GlJ9QBhViJp XL/+7MbSRJ43IY1lomoHZm1vxe0aPg== =sQPX -----END PGP SIGNATURE----- Merge tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: - Fix DM integrity and verity targets to not use excessive stack when they recheck in the error path. * tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-integrity, dm-verity: reduce stack usage for recheck	2024-02-24 09:55:29 -08:00
Arnd Bergmann	66ad2fbcdb	dm-integrity, dm-verity: reduce stack usage for recheck The newly added integrity_recheck() function has another larger stack allocation, just like its caller integrity_metadata(). When it gets inlined, the combination of the two exceeds the warning limit for 32-bit architectures and possibly risks an overflow when this is called from a deep call chain through a file system: drivers/md/dm-integrity.c:1767:13: error: stack frame size (1048) exceeds limit (1024) in 'integrity_metadata' [-Werror,-Wframe-larger-than] 1767 \| static void integrity_metadata(struct work_struct *w) Since the caller at this point is done using its checksum buffer, just reuse the same buffer in the new function to avoid the double allocation. [Mikulas: add "noinline" to integrity_recheck and verity_recheck. These functions are only called on error, so they shouldn't bloat the stack frame or code size of the caller.] Fixes: `c88f5e553f` ("dm-integrity: recheck the integrity tag after a failure") Fixes: `9177f3c0de` ("dm-verity: recheck the hash after a failure") Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-24 10:53:57 -05:00
Linus Torvalds	e7768e65cd	- Stable fixes for 3 DM targets (integrity, verity and crypt) to address systemic failure that can occur if user provided pages map to the same block. - Fix DM crypt to not allow modifying data that being encrypted for authenticated encryption. - Fix DM crypt and verity targets to align their respective bvec_iter struct members to avoid the need for byte level access (due to __packed attribute) that is costly on some arches (like RISC). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXY0iwACgkQxSPxCi2d A1oG3Qf/WE0T5qyBnDZ7irhvJmSLVx4oAwzB0PmMtELZ3Tkyn7BBAxq1Q2I2UT3x r90d1uy/pz6Y+kZkAPZjYuYLctukEa1swpfFe0Sn01dBrbgGU/p2vi3fkF+ZK6/t n5EN8S5dkf6rIDmp8R56iP8mP4OEultYjLugxc6ROohFgHZicoqv+Pye9kHp0Y19 HSW2eueag/s2nMa9HKjIEd3+NBgmGb0qMMf3M6CXpRLNi/f/cyHbPzq83+eW3gcg jl480w5YHk2nOUSqrO8UfIaP4BpD3SEXQxVqIzdkVX4cEBO4yRcBNrQpsT89GsXj sg5zinkq3g7SThEpQWdpkeZMR/6q/A== =n0nQ -----END PGP SIGNATURE----- Merge tag 'for-6.8/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Stable fixes for 3 DM targets (integrity, verity and crypt) to address systemic failure that can occur if user provided pages map to the same block. - Fix DM crypt to not allow modifying data that being encrypted for authenticated encryption. - Fix DM crypt and verity targets to align their respective bvec_iter struct members to avoid the need for byte level access (due to __packed attribute) that is costly on some arches (like RISC). * tag 'for-6.8/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-crypt, dm-integrity, dm-verity: bump target version dm-verity, dm-crypt: align "struct bvec_iter" correctly dm-crypt: recheck the integrity tag after a failure dm-crypt: don't modify the data when using authenticated encryption dm-verity: recheck the hash after a failure dm-integrity: recheck the integrity tag after a failure	2024-02-23 09:23:54 -08:00
Pierre Gondois	b20a229c28	bcache: use of hlist_count_nodes() Make use of the newly added hlist_count_nodes(). Link: https://lkml.kernel.org/r/20240104164937.424320-4-pierre.gondois@arm.com Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Acked-by: Coly Li <colyli@suse.de> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Martijn Coenen <maco@android.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-22 15:38:51 -08:00
Mathieu Desnoyers	c29290728d	dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240215144633.96437-5-mathieu.desnoyers@efficios.com Fixes: `d92576f116` ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-22 15:27:19 -08:00
Linus Torvalds	ffd2cb6b71	block-6.8-2024-02-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXXiBEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprR0D/9zwzw1JcCcaMlYPL8yJcUjxNOQF7qrldXQ 86u4Jmqq8QtAzOZWTuXZiFBaq9/+h7FsnPppPXsTXPxz6wrlOHhc+38NR0Zs3kHq vng6glfRRBkX8NuMGID754IOpwS79ZP3z07Yk6ruZKcmVVx40WVBLtFwENA7Ub+Q /ktbu0PUe+7xBIsEBkgDGBfpyagJaMP+vgaQzl36sDXVY5lSiyHRhez27WrovNGU kXOTzuEY2RezWF6oI7yth7zllTAw/tJEpbjhFZCOm6DaZffHF7AHpoTOLYdK989Y ZA2d9tWltfgTvjohNUjtQmlL/SHKHFKE+JrlUgkv8KpGN9Y+ySKJsoSG37ntL3+W fX5NAe5MDy5xO6jm/Kj8668oYdlCHODm3faj3ezzhBTQYFEssc9bX06uGhiQugaI fosI4oAHJ9jYFNzZzeAMx1oFvorCzinseGbDzN/938Q6nRAZdpLxWHhQ6V1+81Ny lv/HFV4DoDW+4sMp69UP8yK92x9UDutaxwbl7tgdnHfPmp9s8VeLgv6xbPRB5hJp XrCH1WVgM7cYGz26pVhUrFDIdPBVPPNfTz0hAo2O1zpGbM+2JiENgK71MrLu5P9i m+QRa8FIeV80wRH0wdT4H/Oy8r8fOrUD8JG6WKiR98SSS81raOWdF8TzFWGEuFvO ZH5FBgowjg== =0LBw -----END PGP SIGNATURE----- Merge tag 'block-6.8-2024-02-22' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Mostly just fixlets for md, but also a sed-opal parsing fix" * tag 'block-6.8-2024-02-22' of git://git.kernel.dk/linux: block: sed-opal: handle empty atoms when parsing response md: Don't suspend the array for interrupted reshape md: Don't register sync_thread for reshape directly md: Make sure md_do_sync() will set MD_RECOVERY_DONE md: Don't ignore read-only array in md_check_recovery() md: Don't ignore suspended array in md_check_recovery() md: Fix missing release of 'active_io' for flush	2024-02-22 11:57:30 -08:00
Mike Snitzer	fa34e5893f	dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Mike Snitzer	86ab1b84b2	dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list Fixes: `3da5d2de92` ("MAINTAINERS: update the dm-devel mailing list") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Fan Wu	9356fcfe0a	dm verity: set DM_TARGET_SINGLETON feature flag The device-mapper has a flag to mark targets as singleton, which is a required flag for immutable targets. Without this flag, multiple dm-verity targets can be added to a mapped device, which has no practical use cases and will let dm_table_get_immutable_target return NULL. This patch adds the missing flag, restricting only one dm-verity target per mapped device. Signed-off-by: Fan Wu <wufan@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	5d8d408153	dm crypt: Fix IO priority lost when queuing write bios Since dm-crypt queues writes to a different kernel thread (workqueue), the bios will dispatch from tasks with different io_context->ioprio settings and blkcg than the submitting task, thus giving incorrect ioprio to the io scheduler. Get the original IO priority setting via struct dm_crypt_io::base_bio and set this priority in the bio for write. Link: https://lore.kernel.org/dm-devel/alpine.LRH.2.11.1612141049250.13402@mail.ewheeler.net Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	d95e2c34a3	dm verity: Fix IO priority lost when reading FEC and hash After obtaining the data, verification or error correction process may trigger a new IO that loses the priority of the original IO, that is, the verification of the higher priority IO may be blocked by the lower priority IO. Make the IO used for verification and error correction follow the priority of the original IO. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	e9b2238e47	dm bufio: Support IO priority Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add dm_bufio_read_with_ioprio() and dm_bufio_prefetch_with_ioprio() for use by bufio users to pass an ioprio other than IOPRIO_DEFAULT. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [snitzer: introduced _with_ioprio() wrappers to reduce churn] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	6e5f0f6383	dm io: Support IO priority Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add IO priority parameter to dm_io() and update all callers. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:51 -05:00
Mike Snitzer	1e00d57694	dm vdo logger: update logging to start with "device-mapper: vdo" Stops short of actually using DM's various logging macros (e.g. DMERR, DMINFO, etc) because VDO's logger isn't quite compatible with them. Also switch emit_log_message_to_kernel() from open-coding printk with log-level to using corresponding pr_ macro. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	318a9ce59b	dm vdo logger: switch UDS_LOG_NOTICE to be alias for UDS_LOG_INFO Prepare to bring VDO's logging closer to DM's logging by eliminating support for KERN_NOTICE log level (DM hasn't ever had a need for it). Only one message in index-session.c used UDS_LOG_NOTICE, convert it to log with uds_log_info(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	cae3816d99	dm vdo: tweak wait_for_completion_interruptible callers Update uds_join_threads to delay in wait_for_completion_interruptible loop. And cleanup style nits in perform_admin_operation(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	5581a43d30	dm vdo delta-index: fix various small nits Fix some needless line wrapping (given surrounding context), missing braces and some stale or incorrect references to data structure or function name. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	dea93aab18	dm vdo chapter_index: fix a few small nits Add missing braces and raise one function arg up a line to eliminate line wrap. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	571eff3969	dm vdo: cleanup style for comments in structs Use /* ... / rather than /* ... */ if for no other reason than syntax highlighting is improved (at least for me, in emacs: comments are now red, code is yellow. Previously comments were also yellow). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	d008f6eeab	dm vdo dedupe: fix various small nits Add a __must_hold sparse annotation to launch_dedupe_state_change that reflects its ASSERTION code comments about locking requirements, add some extra braces and fix a couple typos. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	181547bbb8	dm vdo string-utils: remove unnecessary includes Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Ken Raeburn	5f770bd1f2	dm vdo message-stats: reformat to remove excessive newlines Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Mike Snitzer	fbbd7a25e8	dm vdo: use #define for NO_CHAPTER and NO_CHAPTER_INDEX_ENTRY Avoids unconventional use of 'static const' and enum in headers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Susan LeGendre-McGhee	b196d6bd30	dm vdo: move encoding constants to encodings.c Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Matthew Sakai	ea9ca07aff	dm vdo: add documentation details on zones and locking Add details describing the vdo zone and thread model to the documentation comments for major vdo components. Also added some high-level description of the block map structure. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Mike Snitzer	b863d7f750	dm vdo recovery-journal: fix sparse 'mixed bitwiseness' warning Only one user of WRITE_FLAGS so no need to factor it out in an enum (which causes sparse's 'mixed bitwiseness' warning). Just use the flags in the only consumer. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	f46b1ab7e7	dm vdo dedupe: silence sparse warnings about locking context imbalances Annotate both open_index() and close_index() with __must_hold(&zones->lock) to silence these sparse warnings: warning: context imbalance in 'close_index' - unexpected unlock warning: context imbalance in 'open_index' - unexpected unlock Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	872564c501	dm vdo data-vio: silence sparse warnings about locking context imbalances Factor wait_permit() out from acquire_permit() so that the latter always holds the spinlock and the former always releases it. Otherwise sparse complains about locking context imbalances due to conditional spin_unlock in acquire_permit: warning: context imbalance in 'acquire_permit' - different lock contexts for basic block warning: context imbalance in 'vdo_launch_bio' - unexpected unlock Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	a6c05c981e	dm vdo: fix various blk_opf_t sparse warnings Use proper blk_opf_t type rather than unsigned int. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	ff91994648	dm vdo: fix sparse 'warning: Using plain integer as NULL pointer' Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	3fa8e6ec07	dm vdo: fix sparse warnings about missing statics Addresses various sparse warnings like: warning: symbol 'SYMBOL' was not declared. Should it be static? Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	952b57a58d	dm vdo: rename struct configuration to uds_configuration Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	7f67d0f1c8	dm vdo: rename struct geometry to index_geometry Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	5c45cd10c0	dm vdo index: fix various small nits Add braces around multi-line while loops and if statements. Also remove excess newlines. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	ac9ae5769d	dm vdo dedupe: fix various small nits Remove extra blank line, mark function inline, add missing braces, and fix a typo in a comment. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	1ccef45aa8	dm vdo slab-depot: fix various small nits Comment typo, whitespace issues, mark function inline. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	97b6f0e752	dm vdo data-vio: rename is_trim flag to is_discard Eliminate use of "trim" in favor of "discard" since it reflects the top-level Linux discard primative rather than the ATA specific ditto. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	f7c1c2e085	dm vdo: rename vdo_map_to_system_error to vdo_status_to_errno Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	c10497b3b1	dm vdo: rename uds_map_to_system_error to uds_status_to_errno Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	86492a3f69	dm vdo: slight cleanup of UDS error codes No need to increment each UDS_ error code manually (relative to UDS_ERROR_CODE_BASE). Also, remove unused PRP_BLOCK_START and PRP_BLOCK_END. Lastly, UDS_SUCCESS and VDO_SUCCESS are used interchangeably; so best to explicitly set VDO_SUCCESS equal to UDS_SUCCESS. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	b06d5c37b8	dm vdo block-map: rename struct cursors member to 'completion' 'completion' is more informative name for a 'struct vdo_completion' than 'parent'. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	3ccf136a49	dm vdo block-map: avoid extra dereferences to access vdo object The vdo_page_cache's 'vdo' is the same as the block_map's vdo instance, so use that to save 2 extra dereferences. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	36778716a2	dm vdo block-map: remove extra vdo arg from initialize_block_map_zone The block_map is passed to initialize_block_map_zone, but the block_map's vdo member is already initialized with the same vdo instance, so just use it. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	8810d3d594	dm vdo block-map: use uds_log_ratelimit() rather than open code it Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	6bda10727d	dm vdo block-map: fix a few small nits Rename 'pages' to 'num_pages' in distribute_page_over_waitq(). Update assert message in validate_completed_page() to model others. Tweak line-wrapping on a comment that was needlessly long. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	f36b1d3ba5	dm vdo: use a proper Makefile for dm-vdo Requires moving dm-vdo-target.c into drivers/md/dm-vdo/ This change adds a proper drivers/md/dm-vdo/Makefile and eliminates the abnormal use of patsubst in drivers/md/Makefile -- which was the cause of at least one build failure that was reported by the upstream build bot. Also, split out VDO's drivers/md/dm-vdo/Kconfig and include it from drivers/md/Kconfig Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	4c79d55678	dm vdo: fix how dm_kcopyd_client_create() failure is checked dm_kcopyd_client_create() returns an ERR_PTR so its return must be checked with IS_ERR(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Bruce Johnston	9165dac822	dm vdo int-map: remove unused parameter from vdo_int_map_create Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Bruce Johnston	ffb8d96541	dm vdo int-map: rename functions to use a common vdo_int_map preamble Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Bruce Johnston	db6b0a7ffe	dm vdo dedupe: switch to using int-map instead of pointer-map Use get_unaligned_le64() on the hash lock's record name to serve as the key to use with the int hash-map. Switching to using int hash-map removes the only consumer of pointer hash-map, as such it is removed. Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Mike Snitzer	a4bba246ec	dm vdo wait-queue: rename to vdo_waitq_dequeue_waiter Rename vdo_waitq_dequeue_next_waiter to vdo_waitq_dequeue_waiter. The "next" aspect of returned waiter is implied. "next" also isn't informative ("oldest" would be). Removing "next_" adds symmetry to vdo_waitq_enqueue_waiter(). Also fix whitespace and comments from previous waitq commit. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	29f0ef873c	dm vdo block-map: optimize enter_zone_read_only_mode Rather than incrementally dequeue from the zone->flush_waiters vdo_wait_queue, simply re-initialize it. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	e752e5c33b	dm vdo wait-queue: optimize vdo_waitq_dequeue_matching_waiters Remove temporary 'matched_waiters' waitq and just enqueue matched waiters directly to the caller provided 'matched_waitq'. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	cd1227dd83	dm vdo wait-queue: remove unused debug function vdo_waitq_get_next_waiter Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	d6e260cc42	dm vdo wait-queue: add proper namespace to interface Rename various interfaces and structs associated with vdo's wait-queue, e.g.: s/wait_queue/vdo_wait_queue/, s/waiter/vdo_waiter/, etc. Now all function names start with "vdo_waitq_" or "vdo_waiter_". Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	46a707cce0	dm vdo io-submitter: rename to vdo_submit_vio and submit_data_vio Rename process_vio_io() to vdo_submit_vio(), and process_data_vio_io() to submit_data_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	d58d3c86c3	dm vdo io-submitter: rename to vdo_submit_data_vio Rename submit_data_vio_io() to vdo_submit_data_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	ebe16015c3	dm vdo io-submitter: rename to vdo_submit_flush_vio Rename submit_flush_vio() to vdo_submit_flush_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	f7f46761cc	dm vdo io-submitter: rename to vdo_submit_metadata_vio Rename submit_metadata_vio() to vdo_submit_metadata_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	0dc2009d97	dm vdo io-submitter: remove get_bio_sector Just open-code access to bio's sector. Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Matthew Sakai	f11aca85b0	dm vdo: enable configuration and building of dm-vdo dm-vdo targets are not supported for 32-bit configurations. A vdo target typically requires 1 to 1.5 GB of memory at any given time, which is likely a large fraction of the addressable memory of a 32-bit system. At the same time, the amount of addressable storage attached to a 32-bit system may not be large enough for deduplication to provide much benefit. Because of these concerns, 32-bit platforms are deemed unlikely to benefit from using a vdo target, so dm-vdo is targeted only at 64-bit platforms. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Matthew Sakai	03d1e20fa1	dm vdo: add the top-level DM target This adds the dm-vdo target. The dm-vdo target provides inline deduplication, compression, and zero-block elimination, allowing applications to consume less actual storage than a normal target. By layering it with other device mapper targets, it can add these features to any storage stack. It can also provide a common deduplication pool for groups of targets. The vdo target does not protect against data corruption, relying instead on integrity protection of the storage below it. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	29a811959c	dm vdo: add debugging support Add support for dumping detailed vdo state to the kernel log via a dmsetup message. The dump code is not thread-safe and is generally intended for use only when the vdo is hung. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	92f8d7a94f	dm vdo: add sysfs support for setting parameters and fetching stats Add data and methods setting run time parameters via sysfs, and to make state and statistics information available through sysfs. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	a9457ab9d0	dm vdo: add statistics reporting Add data and methods to report statisics. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	827c6389c6	dm vdo: add the on-disk formats and marshalling of vdo structures Add data and methods for marshalling and unmarshalling the persistent metadata. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	06e932fea1	dm vdo: add the primary vdo structure Add the data and methods that manage the dm-vdo target itself. This includes the overall state of the target and its threads, the state of the logical volumes, startup, shutdown, and statistics. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	4fa98386be	dm vdo: add repair of damaged vdo volumes When a vdo is restarted after a crash, it will automatically attempt to recover from its journals. If a vdo encounters an unrecoverable error, it will enter read-only mode. This mode indicates that some previously acknowledged data may have been lost. The vdo may be instructed to rebuild as best it can in order to return to a writable state. Although some data may be lost, this process will ensure that the vdo's own metadata is self-consistent. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	95a7235768	dm vdo: add the recovery journal The recovery journal is used to amortize updates across the block map and slab depot. Each write request causes an entry to be made in the journal. Entries are either "data remappings" or "block map remappings." For a data remapping, the journal records the logical address affected and its old and new physical mappings. For a block map remapping, the journal records the block map page number and the physical block allocated for it (block map pages are never reclaimed, so the old mapping is always 0). Each journal entry and the data write it represents must be stable on disk before the other metadata structures may be updated to reflect the operation. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	14d531d7b7	dm vdo: implement the block map page cache The set of leaf pages of the block map tree is too large to fit in memory, so each block map zone maintains a cache of leaf pages. This patch adds the implementation of that cache. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	ddb12d6714	dm vdo: add the block map The block map contains the logical to physical mapping. It can be thought of as an array with one entry per logical address. Each entry is 5 bytes: 36 bits contain the physical block number which holds the data for the given logical address, and the remaining 4 bits are used to indicate the nature of the mapping. Of the 16 possible states, one represents a logical address which is unmapped (i.e. it has never been written, or has been discarded), one represents an uncompressed block, and the other 14 states are used to indicate that the mapped data is compressed, and which of the compression slots in the compressed block this logical address maps to. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	7ce49449ff	dm vdo: add the slab depot Add the data and methods that implement the slab_depot that manages the allocation of slabs of blocks added by the preceding patches. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	c9ba9fd33c	dm vdo: add the block allocators and physical zones Each slab is independent of every other. They are assigned to "physical zones" in round-robin fashion. If there are P physical zones, then slab n is assigned to zone n mod P. The set of slabs in each physical zone is managed by a block allocator. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	883069e30e	dm vdo: add the slab summary The slab depot maintains an additional small data structure, the "slab summary," which is used to reduce the amount of work needed to come back online after a crash. The slab summary maintains an entry for each slab indicating whether or not the slab has ever been used, whether it is clean (i.e. all of its reference count updates have been persisted to storage), and approximately how full it is. During recovery, each physical zone will attempt to recover at least one slab, stopping whenever it has recovered a slab which has some free blocks. Once each zone has some space (or has determined that none is available), the target can resume normal operation in a degraded mode. Read and write requests can be serviced, perhaps with degraded performance, while the remainder of the dirty slabs are recovered. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	09eff388df	dm vdo: add slab structure, slab journal and reference counters Most of the vdo volume belongs to the slab depot. The depot contains a collection of slabs. The slabs can be up to 32GB, and are divided into three sections. Most of a slab consists of a linear sequence of 4K blocks. These blocks are used either to store data, or to hold portions of the block map (see subsequent patches). In addition to the data blocks, each slab has a set of reference counters, using 1 byte for each data block. Finally each slab has a journal. Reference updates are written to the slab journal, which is written out one block at a time as each block fills. A copy of the reference counters is kept in memory, and are written out a block at a time, in oldest-dirtied-order whenever there is a need to reclaim slab journal space. The journal is used both to ensure that the main recovery journal (see subsequent patches) can regularly free up space, and also to amortize the cost of updating individual reference blocks. This patch adds the slab structure as well as the slab journal and reference counters. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	58a55a5916	dm vdo: add the compressed block bin packer When blocks do not deduplicate, vdo will attempt to compress them. Up to 14 compressed blocks may be packed into a single data block (this limitation is imposed by the block map). The packer implements a simple best-fit packing algorithm and also manages the formatting and writing of compressed blocks when bins fill. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	b053056133	dm vdo: add use of deduplication index in hash zones Add the data and methods that manage queries to the deduplication index and the responses from the index. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	cfaf07fae7	dm vdo: add hash locks and hash zones In order to deduplicate concurrent writes of the same data (to different locations), data_vios which are writing the same data are grouped together in a "hash lock," named for and keyed by the hash of the data being written. Each hash lock is assigned to a hash zone based on a portion of its hash. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	c65bfacedc	dm vdo: add the vdo io_submitter The io_submitter handles bio submission from vdo data store to the storage below. It will merge bios when possible. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	062401f0de	dm vdo: add flush support This patch adds support for handling incoming flush and/or FUA bios. Each such bio is assigned to a struct vdo_flush. These are allocated as needed, but there is always one kept in reserve in case allocations fail. In the event of an allocation failure, bios may need to wait for an outstanding flush to complete. The logical address space is partitioned into logical zones, each handled by its own thread. Each zone keeps a list of all data_vios handling write requests for logical addresses in that zone. When a flush bio is processed, each logical zone is informed of the flush. When all of the writes which are in progress at the time of the notification have completed in all zones, the flush bio is then allowed to complete. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	79535a7881	dm vdo: add data_vio, the request object which services incoming bios Add the data and methods that implement the data_vio object that handles user data bios as they are processed. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	9533d804ae	dm vdo: add vio, the request object for vdo metadata Add the data and methods that implement the vio object that is basic unit of I/O in vdo. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	3f493fcea0	dm vdo: add administrative state and action manager This patch adds the admin_state structures which are used to track the states of individual vdo components for handling of operations like suspend and resume. It also adds the action manager which is used to schedule and manage cross-thread administrative and internal operations. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	8ce89dde12	dm vdo: implement external deduplication index interface The deduplication index interface for index clients includes the deduplication request and index session structures. This is the interface that the rest of the vdo target uses to make requests, receive responses, and collect statistics. This patch also adds sysfs nodes for inspecting various index properties at runtime. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	766130e8a9	dm vdo: implement top-level deduplication index The top-level deduplication index brings all the earlier components together. The top-level index creates the separate zone structures that enable the index to handle several requests in parallel, handles dispatching requests to the right zones and components, and coordinates metadata to ensure that it remain consistent. It also coordinates recovery in the event of an unexpected index failure. If sparse caching is enabled, the top-level index also handles the coordination required by the sparse chapter index cache, which (unlike most index structures) is shared among all zones. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	4e7ff03922	dm vdo: implement the chapter volume store The volume store structures manage the reading and writing of chapter pages. When a chapter is closed, it is packed into a read-only structure, split across several pages, and written to storage. The volume store also contains a cache and specialized queues that sort and batch requests by the page they need, in order to minimize latency and I/O requests when records have to be read from storage. The cache and queues also coordinate with the volume index to ensure that the volume does not waste resources reading pages that are no longer valid. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	6afc7bcaa4	dm vdo: implement the open chapter and chapter indexes Deduplication records are stored in groups called chapters. New records are collected in a structure called the open chapter, which is optimized for adding, removing, and sorting records. When a chapter fills, it is packed into a read-only structure called a closed chapter, which is optimized for searching and reading. The closed chapter includes a delta index, called the chapter index, which maps each record name to the record page containing the record and allows the index to read at most one record page when looking up a record. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	a4eb7e2555	dm vdo: implement the volume index The volume index is a large delta index that maps each record name to the chapter which contains the newest record for that name. The volume index can contain several million records and is stored entirely in memory while the index is operating, accounting for the majority of the deduplication index's memory budget. The volume index is composed of two subindexes in order to handle sparse hook names separately from regular names. If sparse indexing is not enabled, the sparse hook portion of the volume index is not used or instantiated. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	024512e732	dm vdo: implement the delta index The delta index is a space and memory efficient alternative to a hashtable. Instead of storing the entire key for each entry, the entries are sorted by key and only the difference between adjacent keys (the delta) is stored. If the keys are evenly distributed, the size of the deltas follows an exponential distribution, and the deltas can use a Huffman code to take up even less space. This structure allows the index to use many fewer bytes per entry than a traditional hash table, but it is slightly more expensive to look up entries, because a request must read and sum every entry in a list of deltas in order to find a given record. The delta index reduces this lookup cost by splitting its key space into many sub-lists, each starting at a fixed key value, so that each individual list is short. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	b46d79bdb8	dm vdo: add deduplication index storage interface This patch adds infrastructure for managing reads and writes to the underlying storage layer for the deduplication index. The deduplication index uses dm-bufio for all of its reads and writes, so part of this infrastructure is managing the various dm-bufio clients required. It also adds the buffered reader and buffered writer abstractions, which simplify reading and writing metadata structures that span several blocks. This patch also includes structures and utilities for encoding and decoding all of the deduplication index metadata, collectively called the index layout. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	4390aa138b	dm vdo: add deduplication configuration structures Add structures which record the configuration of various deduplication index parameters. This also includes facilities for saving and loading the configuration and validating its integrity. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:14 -05:00
Matthew Sakai	cc46b9554b	dm vdo: add basic hash map data structures This patch adds two hash maps, one keyed by integers, the other by pointers, and also a priority heap. The integer map is used for locking of logical and physical addresses. The pointer map is used for managing concurrent writes of the same data, ensuring that those writes are deduplicated. The priority heap is used to minimize the search time for free blocks. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	d9e894d9b2	dm vdo: add specialized request queueing functionality This patch adds funnel_queue, a mostly lock-free multi-producer, single-consumer queue. It also adds the request queue used by the dm-vdo deduplication index, and the work_queue used by the dm-vdo data store. Both of these are built on top of funnel queue and are intended to support the dispatching of many short-running tasks. The work_queue also supports priorities. Finally, this patch adds vdo_completion, the structure which is enqueued on work_queues. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	89f9b701f5	dm vdo: add thread and synchronization utilities This patch adds utilities for managing and using named threads, as well as several locking and synchronization utilities. These utilities help dm-vdo minimize thread transitions and manage interactions between threads. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	4fcb4290df	dm vdo: add vdo type declarations, constants, and simple data structures Add definitions of constants defining the fixed parameters of a VDO volume, and the default and maximum values of configurable or dynamic parameters. Add definitions of internal status codes used for internal communication within the module and for logging. Add definitions of types and structs used to manage the processing of an I/O operation. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	03d1089e1d	dm vdo: add basic logging and support utilities Add various support utilities for the vdo target and deduplication index, including logging utilities, string and time management, and index-specific error codes. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	46766d4888	dm vdo: add memory allocation utilities This patch adds standardized allocation macros and memory tracking tools to track and report any allocated memory that is not freed. This makes it easier to ensure that the vdo target does not leak memory. This patch also adds utilities for controlling whether certain threads are allowed to allocate memory, since memory allocation during certain critical code sections can cause the vdo target to deadlock. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Matthew Sakai	a3957b1f3e	dm vdo: add the MurmurHash3 fast hashing algorithm MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally written by Austin Appleby and placed in the public domain. This version has been modified to produce the same result on both big endian and little endian processors, making it suitable for use in portable persistent data. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:13 -05:00
Lizhe	b9d3ac48a8	dm crypt: remove redundant state settings after waking up The task status has been set to TASK_RUNNING in schedule(). No need to set again here. Signed-off-by: Lizhe <sensor1010@163.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:37:31 -05:00
Mike Snitzer	47c00dcdd5	dm thin: add braces around conditional code that spans lines Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:37:31 -05:00
Mike Snitzer	0e0c50e85a	dm-crypt, dm-integrity, dm-verity: bump target version Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:35:47 -05:00
Mikulas Patocka	787f1b2800	dm-verity, dm-crypt: align "struct bvec_iter" correctly "struct bvec_iter" is defined with the __packed attribute, so it is aligned on a single byte. On X86 (and on other architectures that support unaligned addresses in hardware), "struct bvec_iter" is accessed using the 8-byte and 4-byte memory instructions, however these instructions are less efficient if they operate on unaligned addresses. (on RISC machines that don't have unaligned access in hardware, GCC generates byte-by-byte accesses that are very inefficient - see [1]) This commit reorders the entries in "struct dm_verity_io" and "struct convert_context", so that "struct bvec_iter" is aligned on 8 bytes. [1] https://lore.kernel.org/all/ZcLuWUNRZadJr0tQ@fedora/T/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:34:32 -05:00
Mikulas Patocka	42e15d1207	dm-crypt: recheck the integrity tag after a failure If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-crypt reports an authentication error [1]. The error is reported in a log and it may cause RAID leg being kicked out of the array. This commit fixes dm-crypt, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:34:32 -05:00
Mikulas Patocka	50c7024009	dm-crypt: don't modify the data when using authenticated encryption It was said that authenticated encryption could produce invalid tag when the data that is being encrypted is modified [1]. So, fix this problem by copying the data into the clone bio first and then encrypt them inside the clone bio. This may reduce performance, but it is needed to prevent the user from corrupting the device by writing data with O_DIRECT and modifying them at the same time. [1] https://lore.kernel.org/all/20240207004723.GA35324@sol.localdomain/T/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:34:31 -05:00
Mikulas Patocka	9177f3c0de	dm-verity: recheck the hash after a failure If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-verity reports an error [1]. This commit fixes dm-verity, so that if hash verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the hash is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:34:31 -05:00
Mikulas Patocka	c88f5e553f	dm-integrity: recheck the integrity tag after a failure If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-integrity reports an error [1]. The error is reported in a log and it may cause RAID leg being kicked out of the array. This commit fixes dm-integrity, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:34:31 -05:00
Masahiro Yamada	cd14b01846	treewide: replace or remove redundant def_bool in Kconfig files 'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2024-02-20 20:47:45 +09:00
Christoph Hellwig	b3f0846e72	bcache: pass queue_limits to blk_mq_alloc_disk Pass the queue limits directly to blk_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-19 16:58:24 -07:00
Christoph Hellwig	74fa8f9c55	block: pass a queue_limits argument to blk_alloc_disk Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-19 16:58:23 -07:00
Jens Axboe	d69591caec	Merge tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD changes from Song: "1. Cleanup redundant checks, by Yu Kuai. 2. Remove deprecated headers, by Marc Zyngier and Song Liu. 3. Concurrency fixes, by Li Lingfeng. 4. Memory leak fix, by Li Nan." * tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix kmemleak of rdev->serial md/multipath: Remove md-multipath.h md/linear: Get rid of md-linear.h md: use RCU lock to protect traversal in md_spares_need_change() md: get rdev->mddev with READ_ONCE() md: remove redundant md_wakeup_thread() md: remove redundant check of 'mddev->sync_thread'	2024-02-16 15:43:58 -07:00
Yu Kuai	9e46c70e82	md: Don't suspend the array for interrupted reshape md_start_sync() will suspend the array if there are spares that can be added or removed from conf, however, if reshape is still in progress, this won't happen at all or data will be corrupted(remove_and_add_spares won't be called from md_choose_sync_action for reshape), hence there is no need to suspend the array if reshape is not done yet. Meanwhile, there is a potential deadlock for raid456: 1) reshape is interrupted; 2) set one of the disk WantReplacement, and add a new disk to the array, however, recovery won't start until the reshape is finished; 3) then issue an IO across reshpae position, this IO will wait for reshape to make progress; 4) continue to reshape, then md_start_sync() found there is a spare disk that can be added to conf, mddev_suspend() is called; Step 4 and step 3 is waiting for each other, deadlock triggered. Noted this problem is found by code review, and it's not reporduced yet. Fix this porblem by don't suspend the array for interrupted reshape, this is safe because conf won't be changed until reshape is done. Fixes: `bc08041b32` ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-6-yukuai1@huaweicloud.com	2024-02-15 14:17:27 -08:00
Yu Kuai	ad39c08186	md: Don't register sync_thread for reshape directly Currently, if reshape is interrupted, then reassemble the array will register sync_thread directly from pers->run(), in this case 'MD_RECOVERY_RUNNING' is set directly, however, there is no guarantee that md_do_sync() will be executed, hence stop_sync_thread() will hang because 'MD_RECOVERY_RUNNING' can't be cleared. Last patch make sure that md_do_sync() will set MD_RECOVERY_DONE, however, following hang can still be triggered by dm-raid test shell/lvconvert-raid-reshape.sh occasionally: [root@fedora ~]# cat /proc/1982/stack [<0>] stop_sync_thread+0x1ab/0x270 [md_mod] [<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod] [<0>] raid_presuspend+0x1e/0x70 [dm_raid] [<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod] [<0>] __dm_destroy+0x2a5/0x310 [dm_mod] [<0>] dm_destroy+0x16/0x30 [dm_mod] [<0>] dev_remove+0x165/0x290 [dm_mod] [<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod] [<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod] [<0>] vfs_ioctl+0x21/0x60 [<0>] __x64_sys_ioctl+0xb9/0xe0 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Meanwhile mddev->recovery is: MD_RECOVERY_RUNNING \| MD_RECOVERY_INTR \| MD_RECOVERY_RESHAPE \| MD_RECOVERY_FROZEN Fix this problem by remove the code to register sync_thread directly from raid10 and raid5. And let md_check_recovery() to register sync_thread. Fixes: `f67055780c` ("[PATCH] md: Checkpoint and allow restart of raid5 reshape") Fixes: `f52f5c71f3` ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com	2024-02-15 14:17:22 -08:00
Yu Kuai	82ec0ae59d	md: Make sure md_do_sync() will set MD_RECOVERY_DONE stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must set MD_RECOVERY_DONE, so that follow up md_check_recovery() will unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up stop_sync_thread(). If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will return without setting MD_RECOVERY_DONE, and after commit `f52f5c71f3` ("md: fix stopping sync thread"), dm-raid switch from md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread from md_stop() and md_stop_writes(), causing the test shell/lvconvert-raid-reshape.sh hang. We shouldn't switch back to md_reap_sync_thread() because it's problematic in the first place. Fix the problem by making sure md_do_sync() will set MD_RECOVERY_DONE. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/ece2b06f-d647-6613-a534-ff4c9bec1142@redhat.com/ Fixes: `d5d885fd51` ("md: introduce new personality funciton start()") Fixes: `5fd6c1dce0` ("[PATCH] md: allow checkpoint of recovery with version-1 superblock") Fixes: `f52f5c71f3` ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-4-yukuai1@huaweicloud.com	2024-02-15 14:17:18 -08:00
Yu Kuai	55a48ad2db	md: Don't ignore read-only array in md_check_recovery() Usually if the array is not read-write, md_check_recovery() won't register new sync_thread in the first place. And if the array is read-write and sync_thread is registered, md_set_readonly() will unregister sync_thread before setting the array read-only. md/raid follow this behavior hence there is no problem. After commit `f52f5c71f3` ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) array is read-only. dm-raid update super block: rs_update_sbs ro = mddev->ro mddev->ro = 0 -> set array read-write md_update_sb 2) register new sync thread concurrently. 3) dm-raid set array back to read-only: rs_update_sbs mddev->ro = ro 4) stop the array: raid_dtr md_stop stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 5) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 6) daemon thread can't unregister sync thread: md_check_recovery if (!md_is_rdwr(mddev) && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) return; -> -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hang; The root cause is that dm-raid manipulate 'mddev->ro' by itself, however, dm-raid really should stop sync thread before setting the array read-only. Unfortunately, I need to read more code before I can refacter the handler of 'mddev->ro' in dm-raid, hence let's fix the problem the easy way for now to prevent dm-raid regression. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/9801e40-8ac7-e225-6a71-309dcf9dc9aa@redhat.com/ Fixes: `ecbfb9f118` ("dm raid: add raid level takeover support") Fixes: `f52f5c71f3` ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-3-yukuai1@huaweicloud.com	2024-02-15 14:17:13 -08:00
Yu Kuai	1baae052cc	md: Don't ignore suspended array in md_check_recovery() mddev_suspend() never stop sync_thread, hence it doesn't make sense to ignore suspended array in md_check_recovery(), which might cause sync_thread can't be unregistered. After commit `f52f5c71f3` ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) suspend the array: raid_postsuspend mddev_suspend 2) stop the array: raid_dtr md_stop __md_stop_writes stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 3) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 4) daemon thread can't unregister sync thread: md_check_recovery if (mddev->suspended) return; -> return directly md_read_sync_thread clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hang; This problem is not just related to dm-raid, fix it by ignoring suspended array in md_check_recovery(). And follow up patches will improve dm-raid better to frozen sync thread during suspend. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/ Fixes: `68866e425b` ("MD: no sync IO while suspended") Fixes: `f52f5c71f3` ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-2-yukuai1@huaweicloud.com	2024-02-15 14:17:09 -08:00
Li Nan	6cf3506587	md: fix kmemleak of rdev->serial If kobject_add() is fail in bind_rdev_to_array(), 'rdev->serial' will be alloc not be freed, and kmemleak occurs. unreferenced object 0xffff88815a350000 (size 49152): comm "mdadm", pid 789, jiffies 4294716910 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc f773277a): [<0000000058b0a453>] kmemleak_alloc+0x61/0xe0 [<00000000366adf14>] __kmalloc_large_node+0x15e/0x270 [<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f [<00000000f206d60a>] kvmalloc_node+0x74/0x150 [<0000000034bf3363>] rdev_init_serial+0x67/0x170 [<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220 [<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630 [<0000000073c28560>] md_add_new_disk+0x400/0x9f0 [<00000000770e30ff>] md_ioctl+0x15bf/0x1c10 [<000000006cfab718>] blkdev_ioctl+0x191/0x3f0 [<0000000085086a11>] vfs_ioctl+0x22/0x60 [<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0 [<00000000e54e675e>] do_syscall_64+0x71/0x150 [<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fixes: `963c555e75` ("md: introduce mddev_create/destroy_wb_pool for the change of member device") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240208085556.2412922-1-linan666@huaweicloud.com	2024-02-12 09:20:08 -08:00
Johannes Thumshirn	71f4ecdbb4	block: remove gfp_flags from blkdev_zone_mgmt Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-12 08:41:16 -07:00
Johannes Thumshirn	218082010a	dm: dm-zoned: guard blkdev_zone_mgmt with noio scope Guard the calls to blkdev_zone_mgmt() with a memalloc_noio scope. This helps us getting rid of the GFP_NOIO argument to blkdev_zone_mgmt(); Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-2-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-12 08:41:16 -07:00
Yu Kuai	855678ed85	md: Fix missing release of 'active_io' for flush submit_flushes atomic_set(&mddev->flush_pending, 1); rdev_for_each_rcu(rdev, mddev) atomic_inc(&mddev->flush_pending); bi->bi_end_io = md_end_flush submit_bio(bi); /* flush io is done first */ md_end_flush if (atomic_dec_and_test(&mddev->flush_pending)) percpu_ref_put(&mddev->active_io) -> active_io is not released if (atomic_dec_and_test(&mddev->flush_pending)) -> missing release of active_io For consequence, mddev_suspend() will wait for 'active_io' to be zero forever. Fix this problem by releasing 'active_io' in submit_flushes() if 'flush_pending' is decreased to zero. Fixes: `fa2bbff7b0` ("md: synchronize flush io with array reconfiguration") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Blazej Kucman <blazej.kucman@linux.intel.com> Closes: https://lore.kernel.org/lkml/20240130172524.0000417b@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-7-yukuai1@huaweicloud.com	2024-02-07 23:44:19 -08:00
Song Liu	83cbdaf61b	md/multipath: Remove md-multipath.h md-multipath is already deprecated. Remove the header file. Signed-off-by: Song Liu <song@kernel.org>	2024-02-05 15:34:39 -08:00
Marc Zyngier	95b77082b7	md/linear: Get rid of md-linear.h Given that `849d18e27b` ("md: Remove deprecated CONFIG_MD_LINEAR") killed the linear flavour of MD, it seems only logical to drop the leftover include file that used to come with it. I also feel that it should be my own privilege to remove my 30 year old attempt at writing kernel code ;-). RIP! Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201224549.750644-1-maz@kernel.org	2024-02-05 15:32:30 -08:00
Li Lingfeng	570b9147de	md: use RCU lock to protect traversal in md_spares_need_change() Since md_start_sync() will be called without the protect of mddev_lock, and it can run concurrently with array reconfiguration, traversal of rdev in it should be protected by RCU lock. Commit `bc08041b32` ("md: suspend array in md_start_sync() if array need reconfiguration") added md_spares_need_change() to md_start_sync(), casusing use of rdev without any protection. Fix this by adding RCU lock in md_spares_need_change(). Fixes: `bc08041b32` ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240104133629.1277517-1-lilingfeng@huaweicloud.com	2024-02-05 15:23:59 -08:00
Li Lingfeng	9cfcf99e7e	md: get rdev->mddev with READ_ONCE() Users may get rdev->mddev by sysfs while rdev is releasing. So use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing and to read/write mddev atomically. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231229070500.3602712-1-lilingfeng@huaweicloud.com	2024-02-05 15:23:58 -08:00
Yu Kuai	faeaf210a5	md: remove redundant md_wakeup_thread() On the one hand, mddev_unlock() will call md_wakeup_thread() unconditionally; on the other hand, md_check_recovery() can't make progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't make sense to wake up daemon thread while 'reconfig_mutex' is still grabbed. Remove all the md_wakup_thread() for 'mddev->thread' while 'reconfig_mtuex' is still grabbed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231228125553.2697765-3-yukuai1@huaweicloud.com	2024-02-05 15:23:58 -08:00
Yu Kuai	61c90765e1	md: remove redundant check of 'mddev->sync_thread' The lifetime of sync_thread: 1) Set MD_RECOVERY_NEEDED and wake up daemon thread (by ioctl/sysfs or other events); 2) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_NEEDED is set: a) try to grab reconfig_mutex; b) set MD_RECOVERY_RUNNING; c) clear MD_RECOVERY_NEEDED, and then queue sync_work; 3) md_start_sync() choose sync_action, then register sync_thread; 4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up daemon thread; 5) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_DONE is set: a) try to grab reconfig_mutex; b) unregister sync_thread; c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE; Hence there is no such case that MD_RECOVERY_RUNNING is not set, while sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231228125553.2697765-2-yukuai1@huaweicloud.com	2024-02-05 15:23:58 -08:00
Mikulas Patocka	0a9bab391e	dm-crypt, dm-verity: disable tasklets Tasklets have an inherent problem with memory corruption. The function tasklet_action_common calls tasklet_trylock, then it calls the tasklet callback and then it calls tasklet_unlock. If the tasklet callback frees the structure that contains the tasklet or if it calls some code that may free it, tasklet_unlock will write into free memory. The commits `8e14f61015` and `d9a02e016a` try to fix it for dm-crypt, but it is not a sufficient fix and the data corruption can still happen [1]. There is no fix for dm-verity and dm-verity will write into free memory with every tasklet-processed bio. There will be atomic workqueues implemented in the kernel 6.9 [2]. They will have better interface and they will not suffer from the memory corruption problem. But we need something that stops the memory corruption now and that can be backported to the stable kernels. So, I'm proposing this commit that disables tasklets in both dm-crypt and dm-verity. This commit doesn't remove the tasklet support, because the tasklet code will be reused when atomic workqueues will be implemented. [1] https://lore.kernel.org/all/d390d7ee-f142-44d3-822a-87949e14608b@suse.de/T/ [2] https://lore.kernel.org/lkml/20240130091300.2968534-1-tj@kernel.org/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `39d42fa96b` ("dm crypt: add flags to optionally bypass kcryptd workqueues") Fixes: `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-02 12:33:50 -05:00
Mikulas Patocka	40ef8756fb	dm writecache: allow allocations larger than 2GiB The function kvmalloc_node limits the allocation size to INT_MAX. This limit will be overflowed if dm-writecache attempts to map a device with 1TiB or larger length. This commit changes kvmalloc_array to vmalloc_array to avoid the limit. The commit also changes vmalloc(array_size()) to vmalloc_array(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-01-30 14:07:30 -05:00
Mikulas Patocka	9cf11ce06e	dm stats: limit the number of entries The kvmalloc function fails with a warning if the size is larger than INT_MAX. Linus said that there should be limits that prevent this warning from being hit. This commit adds the limits to the dm-stats subsystem in DM core. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-01-30 14:06:44 -05:00
Mikulas Patocka	bd504bcfec	dm: limit the number of targets and parameter size area The kvmalloc function fails with a warning if the size is larger than INT_MAX. The warning was triggered by a syscall testing robot. In order to avoid the warning, this commit limits the number of targets to 1048576 and the size of the parameter area to 1073741824. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-01-30 14:05:02 -05:00
Mikulas Patocka	9f3fe29d77	md: fix a suspicious RCU usage warning RCU protection was removed in the commit `2d32777d60` ("raid1: remove rcu protection to access rdev from conf"). However, the code in fix_read_error does rcu_dereference outside rcu_read_lock - this triggers the following warning. The warning is triggered by a LVM2 test shell/integrity-caching.sh. This commit removes rcu_dereference. ============================= WARNING: suspicious RCU usage 6.7.0 #2 Not tainted ----------------------------- drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mdX_raid1/1859. stack backtrace: CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x60/0x70 lockdep_rcu_suspicious+0x153/0x1b0 raid1d+0x1732/0x1750 [raid1] ? lock_acquire+0x9f/0x270 ? finish_wait+0x3d/0x80 ? md_thread+0xf7/0x130 [md_mod] ? lock_release+0xaa/0x230 ? md_register_thread+0xd0/0xd0 [md_mod] md_thread+0xa0/0x130 [md_mod] ? housekeeping_test_cpu+0x30/0x30 kthread+0xdc/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x28/0x40 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `ca294b34aa` ("md/raid1: support read error check") Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/51539879-e1ca-fde3-b8b4-8934ddedcbc@redhat.com	2024-01-24 22:58:00 -08:00
Linus Torvalds	9d1694dc91	for-6.8/block-2024-01-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWpoCgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqIUEADFvJdC2izkPzYzsOrMK5Rt1H7vaHGKhbA+ zWCuQaa1xQd8bazq+NVnQpbzgclkE/WodTCNfNXcTTjzeQEmcZC888llP3Y9vwyP XfEKH7fSaeKvGigJLro1oPe3YV7/t89F5ol3BoZayfzJF8GEU9BXRWzgOkZzijnk xdm5wUyn/GknksMuQQraZ+U6bQRFLBOulzoaQeMD6Dosx+uRlM4WvAJawC+uOV6R qPT2BVSfYGzmgEKvoaphw0FMkUhFBMDHfXTpQBi5tIzTKOaof8tynYEGz0FHZWeh V0JEEp+3jLWFxFXeEcXgBVPJPE8J0DzGm9g17/uwC2Yhmlbw4FKZVRvGG+PpeUso D5aqhqm3w0x7HgZ7JKwy/aUctADYvjVcSVzPHTaFK0aCSYCIAXxqv4p7fOoxPqyx T32IUHTzGtkCdqzv/xFdtTYhTNM2vyzzbbWj5lXgCBqHsXOVbCh8UM2p+9ec2Umq Fo1XF9eoCDe6Sn4s15hJ5G4DEhKGOKkHluvRUdM+0selA5b0sNOeUqlAf2v+0ve3 Pv3e3X4NPssNIEcsDHf5pc3zGC+LXRS0oFvfIvDESBjwXc3iHIMl+SkjyS57P4Fd RKrHEUUiACuCKO/IWqFYLiNBNHnP3RmV5gSxIZr9QJhFSwOzP+/+4++TCdF5vdAV amhv+0PdCw== =DLW9 -----END PGP SIGNATURE----- Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...	2024-01-18 18:22:40 -08:00
Linus Torvalds	01d550f0fc	for-6.8/block-2024-01-08 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcIOIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpn6hD/9oO7U75PuxUwYYHZ9Uzxpw6gQ0LEmeyJmE NQYCkfYHVq3IsgOdF7elI9v3qtr6v8V8CdB7cByrnn3DgwsMuiTKZZ0dK7vH37PO DX+/xn349e8oH7RdRo7f3m95g1YbHfpfnj0Rc4mjTDV72Jr/HlLTVgGTQg8DEnCR wBIFmeuBHHgeeLh87gsWLAP7ReReiy9V1uqpDFsko2/4BxRAM/8eedkwcAxD8aEy rd+dT/SBQj2cOdQMUeExT3gWjwzHh6ZHx3f1WCLK5fdck6BogH2hBUeri6F/H98L HoaXjBZYBTH68hB/mnO5I4g1ZlrVM74Vp7JPa3e1SFFtyEi6lsyrk2J3GoNh0E7r pXqH5kAcaJwBsBrbRGuvEyGbn9RLTaN5Gvseud0VE4oMruyodTniQaHXuIGackgz sMavMho4486EUWPaF7gIBdLNK1hO13w+IDZ4+3oBxhudMqdgZbk4iYpOCqQ7QY5G 2vkzAE/sZ+aVNXeaIQOI8dE5clBy8gJ+6+t8dm3DY1r1xdbcnU40iZ8/fri3h69r vHs9bpQnVWZF0gEyEflY1pkcAPpIkvMmWCR7Ehy5YCkIfa+qfSL05o3dicpWovLP N+gCtpkhTK2AvmUWsUMypMLRvoSOImyCIiobrr3qNBaUdgRP8xKfUa72RuRp8cGl Vrj5oAiE3w== =YAfp -----END PGP SIGNATURE----- Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...	2024-01-11 13:58:04 -08:00
Bart Van Assche	7dab24554d	md/raid1: Use blk_opf_t for read and write operations Use the type blk_opf_t for read and write operations instead of int. This patch does not affect the generated code but fixes the following sparse warning: drivers/md/raid1.c:1993:60: sparse: sparse: incorrect type in argument 5 (different base types) expected restricted blk_opf_t [usertype] opf got int rw Cc: Song Liu <song@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Fixes: `3c5e514db5` ("md/raid1: Use the new blk_opf_t type") Cc: stable@vger.kernel.org # v6.0+ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401080657.UjFnvQgX-lkp@intel.com/ Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240108001223.23835-1-bvanassche@acm.org	2024-01-09 15:14:01 -08:00
Linus Torvalds	fb46e22a9e	Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series "maple_tree: add mt_free_one() and mt_attr() helpers" "Some cleanups of maple tree" - In the series "mm: use memmap_on_memory semantics for dax/kmem" Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series "Add folio_zero_tail() and folio_fill_tail()" "Make folio_start_writeback return void" "Fix fault handler's handling of poisoned tail pages" "Convert aops->error_remove_page to ->error_remove_folio" "Finish two folio conversions" "More swap folio conversions" - Kefeng Wang has also contributed folio-related work in the series "mm: cleanup and use more folio in page fault" - Jim Cromie has improved the kmemleak reporting output in the series "tweak kmemleak report format". - In the series "stackdepot: allow evicting stack traces" Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series "mm: page_alloc: fixes for high atomic reserve caluculations". - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series "samples: introduce cgroup events listeners". - Some mapletree maintanance work from Liam Howlett in the series "maple_tree: iterator state changes". - Nhat Pham has improved zswap's approach to writeback in the series "workload-specific and memory pressure-driven zswap writeback". - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series "mm/damon: let users feed and tame/auto-tune DAMOS" "selftests/damon: add Python-written DAMON functionality tests" "mm/damon: misc updates for 6.8" - Yosry Ahmed has improved memcg's stats flushing in the series "mm: memcg: subtree stats flushing and thresholds". - In the series "Multi-size THP for anonymous memory" Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series "More buffer_head cleanups". - Suren Baghdasaryan has done work on Andrea Arcangeli's series "userfaultfd move option". UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a "KSM Advisor", in the series "mm/ksm: Add ksm advisor". This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series "mm/zswap: dstmem reuse optimizations and cleanups". - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is "Clean up the writeback paths". - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series "kasan: save mempool stack traces". - Andrey also performed some KASAN maintenance work in the series "kasan: assorted clean-ups". - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series "mm/rmap: interface overhaul". - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series "mm/mglru: Kconfig cleanup". - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series "Remove some lruvec page accounting functions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU= =0NHs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...	2024-01-09 11:18:47 -08:00
Yu Kuai	f9cfe7e7f9	md: Fix md_seq_ops() regressions Commit `cf1b6d4441` ("md: simplify md_seq_ops") introduce following regressions: 1) If list all_mddevs is emptly, personalities and unused devices won't be showed to user anymore. 2) If seq_file buffer overflowed from md_seq_show(), then md_seq_start() will be called again, hence personalities will be showed to user again. 3) If seq_file buffer overflowed from md_seq_stop(), seq_read_iter() doesn't handle this, hence unused devices won't be showed to user. Fix above problems by printing personalities and unused devices in md_seq_show(). Fixes: `cf1b6d4441` ("md: simplify md_seq_ops") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240109133957.2975272-1-yukuai1@huaweicloud.com	2024-01-09 10:08:18 -08:00
Kirill A. Shutemov	5e0a760b44	mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER commit `23baf831a3` ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-01-08 15:27:15 -08:00
Linus Torvalds	3f6984e730	vfs-6.8.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1 /wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc= =iFAQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs super updates from Christian Brauner: "This contains the super work for this cycle including the long-awaited series by Jan to make it possible to prevent writing to mounted block devices: - Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. We expect that this will be interesting to quite a few workloads. Btrfs is currently opted out of this because they still haven't merged patches we require for this to work from three kernel releases ago. - Reimplement block device freezing and thawing as holder operations on the block device. This allows us to extend block device freezing to all devices associated with a superblock and not just the main device. It also allows us to remove get_active_super() and thus another function that scans the global list of superblocks. Freezing via additional block devices only works if the filesystem chooses to use @fs_holder_ops for these additional devices as well. That currently only includes ext4 and xfs. Earlier releases switched get_tree_bdev() and mount_bdev() to use @fs_holder_ops. The remaining nilfs2 open-coded version of mount_bdev() has been converted to rely on @fs_holder_ops as well. So block device freezing for the main block device will continue to work as before. There should be no regressions in functionality. The only special case is btrfs where block device freezing for the main block device never worked because sb->s_bdev isn't set. Block device freezing for btrfs can be fixed once they can switch to @fs_holder_ops but that can happen whenever they're ready" * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) block: Fix a memory leak in bdev_open_by_dev() super: don't bother with WARN_ON_ONCE() super: massage wait event mechanism ext4: Block writes to journal device xfs: Block writes to log device fs: Block writes to mounted block devices btrfs: Do not restrict writes to btrfs devices block: Add config option to not allow writing to mounted devices block: Remove blkdev_get_by_*() functions bcachefs: Convert to bdev_open_by_path() fs: handle freezing from multiple devices fs: remove dead check nilfs2: simplify device handling fs: streamline thaw_super_locked ext4: simplify device handling xfs: simplify device handling fs: simplify setup_bdev_super() calls blkdev: comment fs_holder_ops porting: document block device freeze and thaw changes fs: remove unused helper ...	2024-01-08 10:43:51 -08:00
Christoph Hellwig	105c1a5f6c	bcache: use the default discard granularity The discard granularity now defaults to a single sector, so don't set that value explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-29 08:44:12 -07:00
Christoph Hellwig	5e7169e7f7	bcache: discard_granularity should not be smaller than a sector Just like all block I/O, discards are in units of sectors. Thus setting a smaller than sector size discard limit in case of > 512 byte sectors in bcache doesn't make sense. Always set the discard granularity to 512 bytes instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-29 08:44:12 -07:00
Christoph Hellwig	7437bb73f0	block: remove support for the host aware zone model When zones were first added the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed where non-conventional zones there is strict requirement to write at the write pointer, or else an error is returned - host aware where a write point is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to to the easily disappearing write pointer host software could never rely on the write pointer to actually be useful for say recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Song Liu	415c745187	md: Remove deprecated CONFIG_MD_FAULTY md-faulty has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-4-song@kernel.org	2023-12-19 10:37:50 -08:00
Song Liu	d8730f0cf4	md: Remove deprecated CONFIG_MD_MULTIPATH md-multipath has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org	2023-12-19 10:37:41 -08:00
Song Liu	849d18e27b	md: Remove deprecated CONFIG_MD_LINEAR md-linear has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org	2023-12-19 10:16:51 -08:00
Mike Snitzer	6849302fdf	dm audit: fix Kconfig so DM_AUDIT depends on BLK_DEV_DM Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-12-18 13:11:05 -05:00
Mikulas Patocka	b86f4b790c	dm-integrity: don't modify bio's immutable bio_vec in integrity_metadata() __bio_for_each_segment assumes that the first struct bio_vec argument doesn't change - it calls "bio_advance_iter_single((bio), &(iter), (bvl).bv_len)" to advance the iterator. Unfortunately, the dm-integrity code changes the bio_vec with "bv.bv_len -= pos". When this code path is taken, the iterator would be out of sync and dm-integrity would report errors. This happens if the machine is out of memory and "kmalloc" fails. Fix this bug by making a copy of "bv" and changing the copy instead. Fixes: `7eada909bf` ("dm: add integrity target") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-12-18 13:11:05 -05:00
Yu Kuai	db29d79b34	dm-raid: delay flushing event_work() after reconfig_mutex is released After commit `db5e653d7c` ("md: delay choosing sync action to md_start_sync()"), md_start_sync() will hold 'reconfig_mutex', however, in order to make sure event_work is done, __md_stop() will flush workqueue with reconfig_mutex grabbed, hence if sync_work is still pending, deadlock will be triggered. Fortunately, former pacthes to fix stopping sync_thread already make sure all sync_work is done already, hence such deadlock is not possible anymore. However, in order not to cause confusions for people by this implicit dependency, delay flushing event_work to dm-raid where 'reconfig_mutex' is not held, and add some comments to emphasize that the workqueue can't be flushed with 'reconfig_mutex'. Fixes: `db5e653d7c` ("md: delay choosing sync action to md_start_sync()") Depends-on: `f52f5c71f3` ("md: fix stopping sync thread") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-12-18 13:05:21 -05:00
Li Nan	ca294b34aa	md/raid1: support read error check After commit `1e50915fe0` ("raid: improve MD/raid10 handling of correctable read errors."), rdev will be set to faulty if it reads data error to many times in raid10. Add this mechanism to raid1 now. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com	2023-12-15 15:22:15 -08:00
Li Nan	1979dbbe32	md: factor out a helper exceed_read_errors() to check read_errors Move check_decay_read_errors() to raid1-10.c and factor out a helper exceed_read_errors() to check if read_errors exceeds the limit, so that raid1 can also use it. There are no functional changes. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com	2023-12-15 15:22:15 -08:00
Alex Lyakas	dc1cc22ed5	md: Whenassemble the array, consult the superblock of the freshest device Upon assembling the array, both kernel and mdadm allow the devices to have event counter difference of 1, and still consider them as up-to-date. However, a device whose event count is behind by 1, may in fact not be up-to-date, and array resync with such a device may cause data corruption. To avoid this, consult the superblock of the freshest device about the status of a device, whose event counter is behind by 1. Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com	2023-12-15 15:21:24 -08:00
Gou Hao	af140f806a	md/raid1: remove unnecessary null checking If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always be able to allocate a bio. See comment of bio_alloc_bioset. Signed-off-by: Gou Hao <gouhao@uniontech.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com	2023-12-15 00:45:37 -08:00
Linus Torvalds	d71369dbe0	block-6.7-2023-12-08 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVzOGgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprPED/sFJUY31XzGNWlqigwneZYppNYLyfJwTZ5z FJyMfN/i9IPWZdBnY/Sed4lp/rWhlEIDNN69bHG7ErK4t8weaGWAV9+ygwRHNmm/ bAcezY0rJwh23pl/kCYxidhVVFpyhSjFebDUQ6nY4XeTm9OOeeVsTsKVNmN7hF/M kGkJU+xZfu63RHcc0NATJPaaZGv0t5tZDnLCOiBy71tckxQlvPqvGnXKoTK1XeEv WQu3WgYFhDmRfjnaWPKW8HewcEjZHrRNFiAgKOf1fVCUdEbUrMU9qtGRr0Wv5Fwv f5cVqc+K44AE+spz/3Kb07q2/yI9cY3gbB1Ogt1ML1ryOMX9VgmymTxAhHyMJdJF +SnDgkUGnkn4mAr93lOad1DL76Ep5fvK3NX799TwkL1RC/78GiGYJWEOA3rC7sD2 Nfrs5RHxAnjwlT4jKSJ8pbGZa3SQ5g+zn7sePxTRSEARq4Z8YY7WhQpKuW70NVvp qkVCZgUEXO7DQcpjPLksOGt4tviXHahuIbq7RQD6OFZ1eNFrljHljymo+4rRRT/B QlouVtrdkGA+u2xj0NJ24rtlyVvSg/150UWdtxgkq9zLQudFD9HoIDySjdnwHgIE JTjIwH6Wq2y4TqGb4BkOlYMSqJaL0jcac2mT9oLkVoJ4OoQ6P9rtjHnl0yQGsb2R a6vKAkWF7Q== =Ds8S -----END PGP SIGNATURE----- Merge tag 'block-6.7-2023-12-08' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Nothing major in here, just miscellanous fixes for MD and NVMe: - NVMe pull request via Keith: - Proper nvme ctrl state setting (Keith) - Passthrough command optimization (Keith) - Spectre fix (Nitesh) - Kconfig clarifications (Shin'ichiro) - Frozen state deadlock fix (Bitao) - Power setting quirk (Georg) - MD pull requests via Song: - 6.7 regresisons with recovery/sync (Yu) - Reshape fix (David)" * tag 'block-6.7-2023-12-08' of git://git.kernel.dk/linux: md: split MD_RECOVERY_NEEDED out of mddev_resume nvme-pci: Add sleep quirk for Kingston drives md: fix stopping sync thread md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly() md: fix missing flush of sync_work nvme: fix deadlock between reset and scan nvme: prevent potential spectre v1 gadget nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions nvme-ioctl: move capable() admin check to the end nvme: ensure reset state check ordering nvme: introduce helper function to get ctrl state md/raid6: use valid sector values to determine if an I/O should wait on the reshape	2023-12-08 12:36:45 -08:00
Jens Axboe	c6d3ab9e76	Merge tag 'md-fixes-20231207-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7 Pull MD fix from Song: "This change from Yu Kuai fixes a bug reported in https://bugzilla.kernel.org/show_bug.cgi?id=218200" * tag 'md-fixes-20231207-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: split MD_RECOVERY_NEEDED out of mddev_resume	2023-12-07 12:15:18 -07:00
Yu Kuai	b39113349d	md: split MD_RECOVERY_NEEDED out of mddev_resume New mddev_resume() calls are added to synchronize IO with array reconfiguration, however, this introduces a performance regression while adding it in md_start_sync(): 1) someone sets MD_RECOVERY_NEEDED first; 2) daemon thread grabs reconfig_mutex, then clears MD_RECOVERY_NEEDED and queues a new sync work; 3) daemon thread releases reconfig_mutex; 4) in md_start_sync a) check that there are spares that can be added/removed, then suspend the array; b) remove_and_add_spares may not be called, or called without really add/remove spares; c) resume the array, then set MD_RECOVERY_NEEDED again! Loop between 2 - 4, then mddev_suspend() will be called quite often, for consequence, normal IO will be quite slow. Fix this problem by don't set MD_RECOVERY_NEEDED again in md_start_sync(), hence the loop will be broken. Fixes: `bc08041b32` ("md: suspend array in md_start_sync() if array need reconfiguration") Suggested-by: Song Liu <song@kernel.org> Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200 Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231207020724.2797445-1-yukuai1@huaweicloud.com	2023-12-07 10:19:47 -08:00
Jens Axboe	7d2affce33	Merge tag 'md-fixes-20231206' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7 Pull MD fixes from Song: "This set from Yu Kuai fixes issues around sync_work, which was introduced in 6.7 kernels." * tag 'md-fixes-20231206' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix stopping sync thread md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly() md: fix missing flush of sync_work	2023-12-06 15:31:58 -07:00
Yu Kuai	f52f5c71f3	md: fix stopping sync thread Currently sync thread is stopped from multiple contex: - idle_sync_thread - frozen_sync_thread - __md_stop_writes - md_set_readonly - do_md_stop And there are some problems: 1) sync_work is flushed while reconfig_mutex is grabbed, this can deadlock because the work function will grab reconfig_mutex as well. 2) md_reap_sync_thread() can't be called directly while md_do_sync() is not finished yet, for example, commit `130443d60b` ("md: refactor idle/frozen_sync_thread() to fix deadlock"). 3) If MD_RECOVERY_RUNNING is not set, there is no need to stop sync_thread at all because sync_thread must not be registered. Factor out a helper stop_sync_thread(), so that above contex will behave the same. Fix 1) by flushing sync_work after reconfig_mutex is released, before waiting for sync_thread to be done; Fix 2) bt letting daemon thread to unregister sync_thread; Fix 3) by always checking MD_RECOVERY_RUNNING first. Fixes: `db5e653d7c` ("md: delay choosing sync action to md_start_sync()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231205094215.1824240-4-yukuai1@huaweicloud.com	2023-12-06 12:44:00 -08:00
Yu Kuai	c9f7cb5b2b	md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly() If md_set_readonly() failed, the array could still be read-write, however 'MD_RECOVERY_FROZEN' could still be set, which leave the array in an abnormal state that sync or recovery can't continue anymore. Hence make sure the flag is cleared after md_set_readonly() returns. Fixes: `88724bfa68` ("md: wait for pending superblock updates before switching to read-only") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231205094215.1824240-3-yukuai1@huaweicloud.com	2023-12-06 12:44:00 -08:00
Yu Kuai	f2d87a759f	md: fix missing flush of sync_work Commit `ac61978196` ("md: use separate work_struct for md_start_sync()") use a new sync_work to replace del_work, however, stop_sync_thread() and __md_stop_writes() was trying to wait for sync_thread to be done, hence they should switch to use sync_work as well. Noted that md_start_sync() from sync_work will grab 'reconfig_mutex', hence other contex can't held the same lock to flush work, and this will be fixed in later patches. Fixes: `ac61978196` ("md: use separate work_struct for md_start_sync()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231205094215.1824240-2-yukuai1@huaweicloud.com	2023-12-06 12:44:00 -08:00
Jens Axboe	a134cd8dfb	Merge tag 'md-fixes-20231201-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7 Pull MD fix from Song: "This change fixes issue with raid456 reshape." * tag 'md-fixes-20231201-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid6: use valid sector values to determine if an I/O should wait on the reshape	2023-12-01 18:37:24 -07:00
Yu Kuai	fa2bbff7b0	md: synchronize flush io with array reconfiguration Currently rcu is used to protect iterating rdev from submit_flushes(): submit_flushes remove_and_add_spares synchronize_rcu pers->hot_remove_disk() rcu_read_lock() rdev_for_each_rcu if (rdev->raid_disk >= 0) rdev->radi_disk = -1; atomic_inc(&rdev->nr_pending) rcu_read_unlock() bi = bio_alloc_bioset() bi->bi_end_io = md_end_flush bi->private = rdev submit_bio // issue io for removed rdev Fix this problem by grabbing 'acive_io' before iterating rdev, make sure that remove_and_add_spares() won't concurrent with submit_flushes(). Fixes: `a2826aa92e` ("md: support barrier requests on all personalities.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com	2023-12-01 15:49:42 -08:00
David Jeffery	c467e97f07	md/raid6: use valid sector values to determine if an I/O should wait on the reshape During a reshape or a RAID6 array such as expanding by adding an additional disk, I/Os to the region of the array which have not yet been reshaped can stall indefinitely. This is from errors in the stripe_ahead_of_reshape function causing md to think the I/O is to a region in the actively undergoing the reshape. stripe_ahead_of_reshape fails to account for the q disk having a sector value of 0. By not excluding the q disk from the for loop, raid6 will always generate a min_sector value of 0, causing a return value which stalls. The function's max_sector calculation also uses min() when it should use max(), causing the max_sector value to always be 0. During a backwards rebuild this can cause the opposite problem where it allows I/O to advance when it should wait. Fixing these errors will allow safe I/O to advance in a timely manner and delay only I/O which is unsafe due to stripes in the middle of undergoing the reshape. Fixes: `486f605586` ("md/raid5: Check all disks in a stripe_head for reshape progress") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231128181233.6187-1-djeffery@redhat.com	2023-12-01 14:43:29 -08:00
Linus Torvalds	ee0c8a9b34	block-6.7-2023-12-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVqKo4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqL9D/9bPvuA+Oogx+C/kNConjxnuyPBiXcZjb/4 5gO/6N0FC8yu+HQqgscGTyEjJO2FKfLx+YxxBs1UVIt4Tm+jZwC3nPqw9X4W3RCz pK9fxCNlzxey0SZU3ZJQIOtqP3df5Yuas9V/h35GS4m1XaoDE6cPpsIVUrAnoNwg W990L8sOy6y4XzMPzyHJCyoDCay1Qp2ly0Vdlz4/ESRmEp564i42nFN+8zpZ/w7h V+Ekn6JwP1ssqUeY/k43QcfRzYwSvvnTQJ1y9t3erf6HcHtpbCgnL1jTaGEmr4IS 1sw3ffqo23xBSsGP+D2OF4+9pwGI9+xwNpYnRdrpDPxKhCn5EEh+g6+f+m7YEnFV q1swlMTqHtRLFdYbKe8Tl8hPRwEeSpKy8sXph56hwGZY0T/IyB+Pe3aXrh1DYPA5 4+GASZHFQPH82P1ibVNdpMRZe4rPPblw38GZauZ1JbI0m0zXqEveB2AgZeCcw1ky l7KBdMdGBqSWYVmfKcJd3f30vKPyhMSp4eE9/LFp24vmyIIw+dSp6vup0yrM6jk9 taUU6PCHzaxmI1YGz1BzNVa8cfYKB6aiWeQ2OGa4Z7ba4TuksMLkbfVvu21jdi+z PsL/KlqPSPwFL/3XAZagIb3BXUhoQyfwIU8GnAuw2wTU5RJzWnbwF3wXpNaBIJxI 8y5OWsFqIg== =5kb6 -----END PGP SIGNATURE----- Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Invalid namespace identification error handling (Marizio Ewan, Keith) - Fabrics keep-alive tuning (Mark) - Fix for a bad error check regression in bcache (Markus) - Fix for a performance regression with O_DIRECT (Ming) - Fix for a flush related deadlock (Ming) - Make the read-only warn on per-partition (Yu) * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux: nvme-core: check for too small lba shift blk-mq: don't count completed flush data request as inflight in case of quiesce block: Document the role of the two attribute groups block: warn once for each partition in bio_check_ro() block: move .bd_inode into 1st cacheline of block_device nvme: check for valid nvme_identify_ns() before using it nvme-core: fix a memory leak in nvme_ns_info_from_identify() nvme: fine-tune sending of first keep-alive bcache: revert replacing IS_ERR_OR_NULL with IS_ERR	2023-12-02 06:39:30 +09:00
Linus Torvalds	abd792f330	- Fix DM verity target's FEC support to always initialize IO before it frees it. Also fix alignment of struct dm_verity_fec_io within the per-bio-data. - Fix DM verity target to not FEC failed readahead IO. - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmVqIjwACgkQxSPxCi2d A1pcwAgA00/Fln0p84cD3wFKauC61RALx5awoS0S2obAN+JY9yLs3xl1XDm92HyI 9giOXofHVKIlOQW6qASfZoCNGvtKPCVoKZF9KXKCqpK8wyKpuuG+yTPVeSsOK/fw pKcPp3FyXsu+9FXH3oO9xauLPOiGDC7BfIcHFQITHzT7qwMxQcPQ1HwfVwjrWIjG lgIQToiSZokBKBWXKyo63SMVkwWhlTdrfG1CJrc0UC9/f6DBMS0RTYJqmNJ3V8ak i0QyQdGZxc9TFuZe/G+Oq381z+X42iRDlluVU3ClMQTyoemQRcySi98CjRLruu7x 1H79s8ZIaJc/4mkxlJUQingL+dmuGA== =Av5r -----END PGP SIGNATURE----- Merge tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM verity target's FEC support to always initialize IO before it frees it. Also fix alignment of struct dm_verity_fec_io within the per-bio-data - Fix DM verity target to not FEC failed readahead IO - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1 * tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-flakey: start allocating with MAX_ORDER dm-verity: align struct dm_verity_fec_io properly dm verity: don't perform FEC for failed readahead IO dm verity: initialize fec io before freeing it	2023-12-02 06:32:29 +09:00
Linus Torvalds	e6861be452	More bcachefs bugfixes for 6.7 Bigger/user visible fixes: - bcache & bcachefs were broken with CFI enabled; patch for closures to fix type punning - mark erasure coding as extra-experimental; there are incompatible disk space accounting changes coming for erasure coding, and I'm still seeing checksum errors in some tests - several fixes for durability-related issues (durability is a device specific setting where we can tell bcachefs that data on a given device should be counted as replicated x times ) - a fix for a rare livelock when a btree node merge then updates a parent node that is almost full - fix a race in the device removal path, where dropping a pointer in a btree node to a device would be clobbered by an in flight btree write updating the btree node key on completion - fix one SRCU lock hold time warning in the btree gc code - ther's still a bunch more of these to fix - fix a rare race where we'd start copygc before initializing the "are we rw" percpu refcount; copygc would think we were already ro and die immediately https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmVnoHoACgkQE6szbY3K bnbzLBAApVEg3kB3XDCHYw+8AxLbzkuKbuV8FR/w+ULYAmRKbnM5e4pM4UJzwVJ9 vzBS9KUT4mVNpA5zl7FWmqh5AiJkhbPgb/BijtQiS+gz1ofZ8uCW/DjzWZpaTaT9 0zz9auiKwzJbBmLXC2lWC28MUPjFNXxlP2pfQPqhpKqlGKBC893hKeJ0Veb6dM1R DqkctoWtSQzsNpEaXiQpKBNoNUIlYcFX1XXHn+XpPpWNe80SpMfVNCs2qPkMByu/ V/QULE9cHI7RTu7oyFY80+9xQDeXDDYZgvtpD7hqNPcyyoix+r/DVz1mZe41XF2B bvaJhfcdWePctmiuEXJVXT4HSkwwzC6EKHwi7fejGY56hOvsrEAxNzTEIPRNw5st ZkZlxASwFqkiJ3ehy+KRngLX2GZSbJsU4aM5ViQJKtz4rBzGyyf0LmMucdxAoDH5 zLzsAYaA6FkIZ5e5ZNdTDj7/TMnKWXlU9vTttqIpb8s7qSy+3ejk5NuGitJihZ4R LAaCTs1JIsItLP47Ko0ZvmKV6CHlmt+Ht8OBqu73BWJ8vsBTQ8JMK4mGt60bwHvm LdEMtp3C3FmXFc06zhKoGgjrletZYO6G4mFBPnQqh1brfFXM1prVg3ftDTqBWkMI iAz2chiVc8k0qxoSAqylCYFaGzgiBKzw6YMtqPRmZgfLcq/sJ34= =vN+y -----END PGP SIGNATURE----- Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs Pull more bcachefs bugfixes from Kent Overstreet: - bcache & bcachefs were broken with CFI enabled; patch for closures to fix type punning - mark erasure coding as extra-experimental; there are incompatible disk space accounting changes coming for erasure coding, and I'm still seeing checksum errors in some tests - several fixes for durability-related issues (durability is a device specific setting where we can tell bcachefs that data on a given device should be counted as replicated x times) - a fix for a rare livelock when a btree node merge then updates a parent node that is almost full - fix a race in the device removal path, where dropping a pointer in a btree node to a device would be clobbered by an in flight btree write updating the btree node key on completion - fix one SRCU lock hold time warning in the btree gc code - ther's still a bunch more of these to fix - fix a rare race where we'd start copygc before initializing the "are we rw" percpu refcount; copygc would think we were already ro and die immediately * tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits) bcachefs: Extra kthread_should_stop() calls for copygc bcachefs: Convert gc_alloc_start() to for_each_btree_key2() bcachefs: Fix race between btree writes and metadata drop bcachefs: move journal seq assertion bcachefs: -EROFS doesn't count as move_extent_start_fail bcachefs: trace_move_extent_start_fail() now includes errcode bcachefs: Fix split_race livelock bcachefs: Fix bucket data type for stripe buckets bcachefs: Add missing validation for jset_entry_data_usage bcachefs: Fix zstd compress workspace size bcachefs: bpos is misaligned on big endian bcachefs: Fix ec + durability calculation bcachefs: Data update path won't accidentaly grow replicas bcachefs: deallocate_extra_replicas() bcachefs: Proper refcounting for journal_keys bcachefs: preserve device path as device name bcachefs: Fix an endianness conversion bcachefs: Start gc, copygc, rebalance threads after initing writes ref bcachefs: Don't stop copygc thread on device resize bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops ...	2023-12-02 06:02:16 +09:00
Mikulas Patocka	41e05548fa	dm-flakey: start allocating with MAX_ORDER Commit `23baf831a3` ("mm, treewide: redefine MAX_ORDER sanely") changed the meaning of MAX_ORDER from exclusive to inclusive. So, we can allocate compound pages with up to 1 << MAX_ORDER pages. Reflect this change in dm-flakey and start trying to allocate compound pages with MAX_ORDER. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-29 15:47:55 -05:00
Mikulas Patocka	38bc1ab135	dm-verity: align struct dm_verity_fec_io properly dm_verity_fec_io is placed after the end of two hash digests. If the hash digest has unaligned length, struct dm_verity_fec_io could be unaligned. This commit fixes the placement of struct dm_verity_fec_io, so that it's aligned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `a739ff3f54` ("dm verity: add support for forward error correction") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-29 12:58:06 -05:00
Wu Bo	0193e3966c	dm verity: don't perform FEC for failed readahead IO We found an issue under Android OTA scenario that many BIOs have to do FEC where the data under dm-verity is 100% complete and no corruption. Android OTA has many dm-block layers, from upper to lower: dm-verity dm-snapshot dm-origin & dm-cow dm-linear ufs DM tables have to change 2 times during Android OTA merging process. When doing table change, the dm-snapshot will be suspended for a while. During this interval, many readahead IOs are submitted to dm_verity from filesystem. Then the kverity works are busy doing FEC process which cost too much time to finish dm-verity IO. This causes needless delay which feels like system is hung. After adding debugging it was found that each readahead IO needed around 10s to finish when this situation occurred. This is due to IO amplification: dm-snapshot suspend erofs_readahead // 300+ io is submitted dm_submit_bio (dm_verity) dm_submit_bio (dm_snapshot) bio return EIO bio got nothing, it's empty verity_end_io verity_verify_io forloop range(0, io->n_blocks) // each io->nblocks ~= 20 verity_fec_decode fec_decode_rsb fec_read_bufs forloop range(0, v->fec->rsn) // v->fec->rsn = 253 new_read submit_bio (dm_snapshot) end loop end loop dm-snapshot resume Readahead BIOs get nothing while dm-snapshot is suspended, so all of them will cause verity's FEC. Each readahead BIO needs to verify ~20 (io->nblocks) blocks. Each block needs to do FEC, and every block needs to do 253 (v->fec->rsn) reads. So during the suspend interval(~200ms), 300 readahead BIOs trigger ~1518000 (30020253) IOs to dm-snapshot. As readahead IO is not required by userspace, and to fix this issue, it is best to pass readahead errors to upper layer to handle it. Cc: stable@vger.kernel.org Fixes: `a739ff3f54` ("dm verity: add support for forward error correction") Signed-off-by: Wu Bo <bo.wu@vivo.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-29 12:55:31 -05:00
Wu Bo	7be05bdfb4	dm verity: initialize fec io before freeing it If BIO error, verity_end_io() can call verity_finish_io() before verity_fec_init_io(). Therefore, fec_io->rs is not initialized and may crash when doing memory freeing in verity_fec_finish_io(). Crash call stack: die+0x90/0x2b8 __do_kernel_fault+0x260/0x298 do_bad_area+0x2c/0xdc do_translation_fault+0x3c/0x54 do_mem_abort+0x54/0x118 el1_abort+0x38/0x5c el1h_64_sync_handler+0x50/0x90 el1h_64_sync+0x64/0x6c free_rs+0x18/0xac fec_rs_free+0x10/0x24 mempool_free+0x58/0x148 verity_fec_finish_io+0x4c/0xb0 verity_end_io+0xb8/0x150 Cc: stable@vger.kernel.org # v6.0+ Fixes: `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Wu Bo <bo.wu@vivo.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-29 12:23:17 -05:00
Yu Kuai	7ecab28c3b	md/md-multipath: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; And these will cover all the scenarios in md-multipath. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-6-yukuai1@huaweicloud.com	2023-11-27 15:49:05 -08:00
Yu Kuai	ad8606702f	md/raid5: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid456. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com	2023-11-27 15:49:05 -08:00
Yu Kuai	2d32777d60	md/raid1: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid1. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com	2023-11-27 15:49:05 -08:00
Yu Kuai	a448af25be	md/raid10: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid10. This patch also cleanup the code to handle the case that replacement replace rdev while IO is still inflight. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-3-yukuai1@huaweicloud.com	2023-11-27 15:49:04 -08:00
Yu Kuai	c891f1fd90	md: remove flag RemoveSynchronized rcu is not used correctly here, because synchronize_rcu() is called before replacing old value, for example: remove_and_add_spares // other path synchronize_rcu // called before replacing old value set_bit(RemoveSynchronized) rcu_read_lock() rdev = conf->mirros[].rdev pers->hot_remove_disk conf->mirros[].rdev = NULL; if (!test_bit(RemoveSynchronized)) synchronize_rcu /* * won't be called, and won't wait * for concurrent readers to be done. / // access rdev after remove_and_add_spares() rcu_read_unlock() Fortunately, there is a separate rcu protection to prevent such rdev to be freed: md_kick_rdev_from_array //other path rcu_read_lock() rdev = conf->mirros[].rdev list_del_rcu(&rdev->same_set) rcu_read_unlock() / * rdev can be removed from conf, but * rdev won't be freed. */ synchronize_rcu() free rdev Hence remove this useless flag and prepare to remove rcu protection to access rdev from 'conf'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com	2023-11-27 15:49:04 -08:00
Junxiao Bi	bed9e27baf	Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" This reverts commit `5e2cf333b7`. That commit introduced the following race and can cause system hung. md_write_start: raid5d: // mddev->in_sync == 1 set "MD_SB_CHANGE_PENDING" // running before md_write_start wakeup it waiting "MD_SB_CHANGE_PENDING" cleared >>>>>>>>> hung wakeup mddev->thread ... waiting "MD_SB_CHANGE_PENDING" cleared >>>> hung, raid5d should clear this flag but get hung by same flag. The issue reverted commit fixing is fixed by last patch in a new way. Fixes: `5e2cf333b7` ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231108182216.73611-2-junxiao.bi@oracle.com	2023-11-27 15:46:51 -08:00
Junxiao Bi	d6e035aad6	md: bypass block throttle for superblock update commit `5e2cf333b7` ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") introduced a hung bug and will be reverted in next patch, since the issue that commit is fixing is due to md superblock write is throttled by wbt, to fix it, we can have superblock write bypass block layer throttle. Fixes: `5e2cf333b7` ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Suggested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231108182216.73611-1-junxiao.bi@oracle.com	2023-11-27 15:46:16 -08:00
Markus Weippert	bb6cc25386	bcache: revert replacing IS_ERR_OR_NULL with IS_ERR Commit `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") replaced IS_ERR_OR_NULL by IS_ERR. This leads to a NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000080 Call Trace: ? __die_body.cold+0x1a/0x1f ? page_fault_oops+0xd2/0x2b0 ? exc_page_fault+0x70/0x170 ? asm_exc_page_fault+0x22/0x30 ? btree_node_free+0xf/0x160 [bcache] ? up_write+0x32/0x60 btree_gc_coalesce+0x2aa/0x890 [bcache] ? bch_extent_bad+0x70/0x170 [bcache] btree_gc_recurse+0x130/0x390 [bcache] ? btree_gc_mark_node+0x72/0x230 [bcache] bch_btree_gc+0x5da/0x600 [bcache] ? cpuusage_read+0x10/0x10 ? bch_btree_gc+0x600/0x600 [bcache] bch_gc_thread+0x135/0x180 [bcache] The relevant code starts with: new_nodes[0] = NULL; for (i = 0; i < nodes; i++) { if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key))) goto out_nocoalesce; // ... out_nocoalesce: // ... for (i = 0; i < nodes; i++) if (!IS_ERR(new_nodes[i])) { // IS_ERR_OR_NULL before `028ddcac47` btree_node_free(new_nodes[i]); // new_nodes[0] is NULL rw_unlock(true, new_nodes[i]); } This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this. Fixes: `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") Link: https://lore.kernel.org/all/3DF4A87A-2AC1-4893-AE5F-E921478419A9@suse.de/ Cc: stable@vger.kernel.org Cc: Zheng Wang <zyytlz.wz@163.com> Cc: Coly Li <colyli@suse.de> Signed-off-by: Markus Weippert <markus@gekmihesg.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-24 09:39:03 -07:00
Kent Overstreet	d4e3b928ab	closures: CLOSURE_CALLBACK() to fix type punning Control flow integrity is now checking that type signatures match on indirect function calls. That breaks closures, which embed a work_struct in a closure in such a way that a closure_fn may also be used as a workqueue fn by the underlying closure code. So we have to change closure fns to take a work_struct as their argument - but that results in a loss of clarity, as closure fns have different semantics from normal workqueue functions (they run owning a ref on the closure, which must be released with continue_at() or closure_return()). Thus, this patc introduces CLOSURE_CALLBACK() and closure_type() macros as suggested by Kees, to smooth things over a bit. Suggested-by: Kees Cook <keescook@chromium.org> Cc: Coly Li <colyli@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-24 00:29:58 -05:00
Linus Torvalds	bc893f744e	block-6.7-2023-11-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVfrJIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqKID/oCNxih4ErRdCYJOZA+tR0YkuWJBKmF9KSe w94GnsxpCdPFMo+Lr2J3WoHsxH/xTFiT5BcLMTKb5Rju7zSIvf40xZxmizwhnU3p aA1WhWjo2Qu20p35wbKy23frSOZIZAj3+PXUxOG9hso6fY96JVpDToBQ4I9k5OG1 1KHXorlktQgDWk1XryI0z9F73Q3Ja3Bt2G4aXRAGwnTyvvIHJd+jm8ksjaEKKkmQ YXv1+ZJXhHkaeSfcwYcOFDffx+O1dooWnHjcKbgyXu0UQq0kxxIwBkF5C9ltZVVT TW2WusplldV4EU/Z3ck7E03Kbmk81iTN9yyh7wAr0TmkEEOdx5Mv/PtqUHme0A4Z PJEBR7g6AVnERttGk4H2MOiPBsMOKBFT2UIVSqUTMsij2a5uaKjGvBXXDELdf4g9 45QKuP1QflEImtKm6HtELA9gWuAF1rGOrH7PMLLPhmQck643cgVIcw6Tz3eyaJMe cBuvTsejkbQVeLSiX1mdvZk0gxuowCP/A8+CXyODdhl+H/mrDnMgR6HCV07VOy6F lVXeXjQaqwd9fdQ2kxfjbstNkjE3Z/NRDGJ+y4oNFNj6YmLoVHivBAXDHhHIVpXb u/F+IUm7I2M8G3DfqS0VOABauqzbDxe7c8j10OzJeDskd06t1prt6IY/qV4uhImM G6XNMzeHjw== =EShg -----END PGP SIGNATURE----- Merge tag 'block-6.7-2023-11-23' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "A bit bigger than usual at this time, but nothing really earth shattering: - NVMe pull request via Keith: - TCP TLS fixes (Hannes) - Authentifaction fixes (Mark, Hannes) - Properly terminate target names (Christoph) - MD pull request via Song, fixing a raid5 corruption issue - Disentanglement of the dependency mess in nvme introduced with the tls additions. Now it should actually build on all configs (Arnd) - Series of bcache fixes (Coly) - Removal of a dead helper (Damien) - s390 dasd fix (Muhammad, Jan) - lockdep blk-cgroup fixes (Ming)" * tag 'block-6.7-2023-11-23' of git://git.kernel.dk/linux: (33 commits) nvme: tcp: fix compile-time checks for TLS mode nvme: target: fix Kconfig select statements nvme: target: fix nvme_keyring_id() references nvme: move nvme_stop_keep_alive() back to original position nbd: pass nbd_sock to nbd_read_reply() instead of index s390/dasd: protect device queue against concurrent access s390/dasd: resolve spelling mistake block/null_blk: Fix double blk_mq_start_request() warning nvmet-tcp: always initialize tls_handshake_tmo_work nvmet: nul-terminate the NQNs passed in the connect command nvme: blank out authentication fabrics options if not configured nvme: catch errors from nvme_configure_metadata() nvme-tcp: only evaluate 'tls' option if TLS is selected nvme-auth: set explanation code for failure2 msgs nvme-auth: unlock mutex in one place only block: Remove blk_set_runtime_active() nbd: fix null-ptr-dereference while accessing 'nbd->config' nbd: factor out a helper to get nbd_config without holding 'config_lock' nbd: fold nbd config initialization into nbd_alloc_config() bcache: avoid NULL checking to c->root in run_cache_set() ...	2023-11-23 17:40:15 -08:00
Jens Axboe	8a554c6234	Merge tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7 Pull MD fix from Song. * tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix bi_status reporting in md_end_clone_io	2023-11-20 09:45:31 -07:00
Coly Li	3eba5e0b24	bcache: avoid NULL checking to c->root in run_cache_set() In run_cache_set() after c->root returned from bch_btree_node_get(), it is checked by IS_ERR_OR_NULL(). Indeed it is unncessary to check NULL because bch_btree_node_get() will not return NULL pointer to caller. This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Coly Li	31f5b956a1	bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc() This patch adds code comments to bch_btree_node_get() and __bch_btree_node_alloc() that NULL pointer will not be returned and it is unnecessary to check NULL pointer by the callers of these routines. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Coly Li	f72f4312d4	bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce() Commit `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") do the following change inside btree_gc_coalesce(), 31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce( 32 memset(new_nodes, 0, sizeof(new_nodes)); 33 closure_init_stack(&cl); 34 35 - while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b)) 36 + while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b)) 37 keys += r[nodes++].keys; 38 39 blocks = btree_default_blocks(b->c) * 2 / 3; At line 35 the original r[nodes].b is not always allocatored from __bch_btree_node_alloc(), and possibly initialized as NULL pointer by caller of btree_gc_coalesce(). Therefore the change at line 36 is not correct. This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid potential issue. Fixes: `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") Cc: <stable@vger.kernel.org> # 6.5+ Cc: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Mingzhe Zou	2faac25d79	bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race We get a kernel crash about "unable to handle kernel paging request": ```dmesg [368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5 [368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1 [368033.032012] Oops: 0003 [#1] SMP PTI [368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.es8_24.x86_64 #1 [368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017 [368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0 [368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93 ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02 [368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082 [368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3 [368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68 [368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff [368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70 [368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [368033.032038] FS: 0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000 [368033.032039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0 [368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b [368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [368033.032046] Call Trace: [368033.032054] _raw_spin_lock_irqsave+0x32/0x40 [368033.032061] __wake_up_common_lock+0x63/0xc0 [368033.032073] ? bch_ptr_invalid+0x10/0x10 [bcache] [368033.033502] bch_dirty_init_thread+0x14c/0x160 [bcache] [368033.033511] ? read_dirty_submit+0x60/0x60 [bcache] [368033.033516] kthread+0x112/0x130 [368033.033520] ? kthread_flush_work_fn+0x10/0x10 [368033.034505] ret_from_fork+0x35/0x40 ``` The crash occurred when call wake_up(&state->wait), and then we want to look at the value in the state. However, bch_sectors_dirty_init() is not found in the stack of any task. Since state is allocated on the stack, we guess that bch_sectors_dirty_init() has exited, causing bch_dirty_init_thread() to be unable to handle kernel paging request. In order to verify this idea, we added some printing information during wake_up(&state->wait). We find that "wake up" is printed twice, however we only expect the last thread to wake up once. ```dmesg [ 994.641004] alcache: bch_dirty_init_thread() wake up [ 994.641018] alcache: bch_dirty_init_thread() wake up [ 994.641523] alcache: bch_sectors_dirty_init() init exit ``` There is a race. If bch_sectors_dirty_init() exits after the first wake up, the second wake up will trigger this bug("unable to handle kernel paging request"). Proceed as follows: bch_sectors_dirty_init kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0]) ... ... atomic_inc(&state.started) ... ... ... atomic_read(&state.enough) ... ... atomic_set(&state->enough, 1) kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1]) ... atomic_dec_and_test(&state->started) ... atomic_inc(&state.started) ... ... ... wake_up(&state->wait) ... atomic_read(&state.enough) atomic_dec_and_test(&state->started) ... ... wait_event(state.wait, atomic_read(&state.started) == 0) ... return ... wake_up(&state->wait) We believe it is very common to wake up twice if there is no dirty, but crash is an extremely low probability event. It's hard for us to reproduce this issue. We attached and detached continuously for a week, with a total of more than one million attaches and only one crash. Putting atomic_inc(&state.started) before kthread_run() can avoid waking up twice. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Mingzhe Zou	e34820f984	bcache: fixup lock c->root error We had a problem with io hung because it was waiting for c->root to release the lock. crash> cache_set.root -l cache_set.list ffffa03fde4c0050 root = 0xffff802ef454c800 crash> btree -o 0xffff802ef454c800 \| grep rw_semaphore [ffff802ef454c858] struct rw_semaphore lock; crash> struct rw_semaphore ffff802ef454c858 struct rw_semaphore { count = { counter = -4294967297 }, wait_list = { next = 0xffff00006786fc28, prev = 0xffff00005d0efac8 }, wait_lock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } }, osq = { tail = { counter = 0 } }, owner = 0xffffa03fdc586603 } The "counter = -4294967297" means that lock count is -1 and a write lock is being attempted. Then, we found that there is a btree with a counter of 1 in btree_cache_freeable. crash> cache_set -l cache_set.list ffffa03fde4c0050 -o\|grep btree_cache [ffffa03fde4c1140] struct list_head btree_cache; [ffffa03fde4c1150] struct list_head btree_cache_freeable; [ffffa03fde4c1160] struct list_head btree_cache_freed; [ffffa03fde4c1170] unsigned int btree_cache_used; [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait; [ffffa03fde4c1190] struct task_struct btree_cache_alloc_lock; crash> list -H ffffa03fde4c1140\|wc -l 973 crash> list -H ffffa03fde4c1150\|wc -l 1123 crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050 btree_cache_used = 2097 crash> list -s btree -l btree.list -H ffffa03fde4c1140\|grep -E -A2 "^ lock = {" > btree_cache.txt crash> list -s btree -l btree.list -H ffffa03fde4c1150\|grep -E -A2 "^ lock = {" > btree_cache_freeable.txt [root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd /var/crash/127.0.0.1-2023-08-04-16:40:28 [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt\|grep counter\|grep -v "counter = 0" [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt\|grep counter\|grep -v "counter = 0" counter = 1 We found that this is a bug in bch_sectors_dirty_init() when locking c->root: (1). Thread X has locked c->root(A) write. (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A). (3). Thread X bch_btree_set_root() changes c->root from A to B. (4). Thread X releases the lock(c->root A). (5). Thread Y successfully locks c->root(A). (6). Thread Y releases the lock(c->root B). down_write locked ---(1)----------------------┐ \| \| \| down_read waiting ---(2)----┐ \| \| \| ┌-------------┐ ┌-------------┐ bch_btree_set_root ===(3)========>> \| c->root A \| \| c->root B \| \| \| └-------------┘ └-------------┘ up_write ---(4)---------------------┘ \| \| \| \| \| down_read locked ---(5)-----------┘ \| \| \| up_read ---(6)-----------------------------┘ Since c->root may change, the correct steps to lock c->root should be the same as bch_root_usage(), compare after locking. static unsigned int bch_root_usage(struct cache_set c) { unsigned int bytes = 0; struct bkey k; struct btree b; struct btree_iter iter; goto lock_root; do { rw_unlock(false, b); lock_root: b = c->root; rw_lock(false, b, b->level); } while (b != c->root); for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad) bytes += bkey_bytes(k); rw_unlock(false, b); return (bytes * 100) / btree_bytes(c); } Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Mingzhe Zou	7cc47e64d3	bcache: fixup init dirty data errors We found that after long run, the dirty_data of the bcache device will have errors. This error cannot be eliminated unless re-register. We also found that reattach after detach, this error can accumulate. In bch_sectors_dirty_init(), all inode <= d->id keys will be recounted again. This is wrong, we only need to count the keys of the current device. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Rand Deeb	2c7f497ac2	bcache: prevent potential division by zero error In SHOW(), the variable 'n' is of type 'size_t.' While there is a conditional check to verify that 'n' is not equal to zero before executing the 'do_div' macro, concerns arise regarding potential division by zero error in 64-bit environments. The concern arises when 'n' is 64 bits in size, greater than zero, and the lower 32 bits of it are zeros. In such cases, the conditional check passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to 'uint32_t,' effectively truncating it to its lower 32 bits. Consequently, the 'n' value becomes zero. To fix this potential division by zero error and ensure precise division handling, this commit replaces the 'do_div' macro with div64_u64(). div64_u64() is designed to work with 64-bit operands, guaranteeing that division is performed correctly. This change enhances the robustness of the code, ensuring that division operations yield accurate results in all scenarios, eliminating the possibility of division by zero, and improving compatibility across different 64-bit environments. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Rand Deeb <rand.sec96@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Colin Ian King	be93825f0e	bcache: remove redundant assignment to variable cur_idx Variable cur_idx is being initialized with a value that is never read, it is being re-assigned later in a while-loop. Remove the redundant assignment. Cleans up clang scan build warning: drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Coly Li	777967e7e9	bcache: check return value from btree_node_alloc_replacement() In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns from btree_gc_rewrite_node(). There is potential possibility that 'n' is a non NULL ERR_PTR(), referencing such error code is not permitted in following code. Therefore a return value checking is necessary after 'n' is back from btree_node_alloc_replacement(). Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Coly Li	baf8fb7e0e	bcache: avoid oversize memory allocation by small stripe_size Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are used for dirty data writeback, their sizes are decided by backing device capacity and stripe size. Larger backing device capacity or smaller stripe size make these two arraies occupies more dynamic memory space. Currently bcache->stripe_size is directly inherited from queue->limits.io_opt of underlying storage device. For normal hard drives, its limits.io_opt is 0, and bcache sets the corresponding stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for devices do declare value for queue->limits.io_opt, small stripe_size (comparing to 1TB) becomes an issue for oversize memory allocations of bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the capacity of hard drives gets much larger in recent decade. For example a raid5 array assembled by three 20TB hardrives, the raid device capacity is 40TB with typical 512KB limits.io_opt. After the math calculation in bcache code, these two arraies will occupy 400MB dynamic memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is declared on a new 2TB hard drive, then these two arraies request 2GB and 512MB dynamic memory from kzalloc(). The result is that bcache device always fails to initialize on his system. To avoid the oversize memory allocation, bcache->stripe_size should not directly inherited by queue->limits.io_opt from the underlying device. This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size and set bcache device's stripe size against the declared limits.io_opt value from the underlying storage device, - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will set its stripe size directly by this limits.io_opt value. - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will set its stripe size by a value multiplying limits.io_opt and euqal or large than BCH_MIN_STRIPE_SZ. Then the minimal stripe size of a bcache device will always be >= 4MB. For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied by these two arraies will be 2.5MB in total. Such mount of memory allocated for bcache->stripe_sectors_dirty and bcache->full_dirty_stripes is reasonable for most of storage devices. Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net> Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 09:17:51 -07:00
Song Liu	45b478951b	md: fix bi_status reporting in md_end_clone_io md_end_clone_io() may overwrite error status in orig_bio->bi_status with BLK_STS_OK. This could happen when orig_bio has BIO_CHAIN (split by md_submit_bio => bio_split_to_limits, for example). As a result, upper layer may miss error reported from md (or the device) and consider the failed IO was successful. Fix this by only update orig_bio->bi_status when current bio reports error and orig_bio is BLK_STS_OK. This is the same behavior as __bio_chain_endio(). Fixes: `10764815ff` ("md: add io accounting for raid0 and raid5") Cc: stable@vger.kernel.org # v5.14+ Reported-by: Bhanu Victor DiCara <00bvd0+linux@gmail.com> Closes: https://lore.kernel.org/regressions/5727380.DvuYhMxLoT@bvd0/ Signed-off-by: Song Liu <song@kernel.org> Tested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>	2023-11-19 20:51:25 -08:00
Christian Brauner	982c3b3058	bdev: rename freeze and thaw helpers We have bdev_mark_dead() etc and we're going to move block device freezing to holder ops in the next patch. Make the naming consistent: * freeze_bdev() -> bdev_freeze() * thaw_bdev() -> bdev_thaw() Also document the return code. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:23 +01:00
Mikulas Patocka	13648e04a9	dm-crypt: start allocating with MAX_ORDER Commit `23baf831a3` ("mm, treewide: redefine MAX_ORDER sanely") changed the meaning of MAX_ORDER from exclusive to inclusive. So, we can allocate compound pages with up to 1 << MAX_ORDER pages. Reflect this change in dm-crypt and start trying to allocate compound pages with MAX_ORDER. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:41:15 -05:00
Mikulas Patocka	28f07f2ab4	dm-verity: don't use blocking calls from tasklets The commit `5721d4e5a9` enhanced dm-verity, so that it can verify blocks from tasklets rather than from workqueues. This reportedly improves performance significantly. However, dm-verity was using the flag CRYPTO_TFM_REQ_MAY_SLEEP from tasklets which resulted in warnings about sleeping function being called from non-sleeping context. BUG: sleeping function called from invalid context at crypto/internal.h:206 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0 preempt_count: 100, expected: 0 RCU nest depth: 0, expected: 0 CPU: 0 PID: 14 Comm: ksoftirqd/0 Tainted: G W 6.7.0-rc1 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x32/0x50 __might_resched+0x110/0x160 crypto_hash_walk_done+0x54/0xb0 shash_ahash_update+0x51/0x60 verity_hash_update.isra.0+0x4a/0x130 [dm_verity] verity_verify_io+0x165/0x550 [dm_verity] ? free_unref_page+0xdf/0x170 ? psi_group_change+0x113/0x390 verity_tasklet+0xd/0x70 [dm_verity] tasklet_action_common.isra.0+0xb3/0xc0 __do_softirq+0xaf/0x1ec ? smpboot_thread_fn+0x1d/0x200 ? sort_range+0x20/0x20 run_ksoftirqd+0x15/0x30 smpboot_thread_fn+0xed/0x200 kthread+0xdc/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x28/0x40 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> This commit fixes dm-verity so that it doesn't use the flags CRYPTO_TFM_REQ_MAY_SLEEP and CRYPTO_TFM_REQ_MAY_BACKLOG from tasklets. The crypto API would do GFP_ATOMIC allocation instead, it could return -ENOMEM and we catch -ENOMEM in verity_tasklet and requeue the request to the workqueue. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v6.0+ Fixes: `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:41:14 -05:00
Mikulas Patocka	2a695062a5	dm-bufio: fix no-sleep mode dm-bufio has a no-sleep mode. When activated (with the DM_BUFIO_CLIENT_NO_SLEEP flag), the bufio client is read-only and we could call dm_bufio_get from tasklets. This is used by dm-verity. Unfortunately, commit `450e8dee51` ("dm bufio: improve concurrent IO performance") broke this and the kernel would warn that cache_get() was calling down_read() from no-sleeping context. The bug can be reproduced by using "veritysetup open" with the "--use-tasklets" flag. This commit fixes dm-bufio, so that the tasklet mode works again, by expanding use of the 'no_sleep_enabled' static_key to conditionally use either a rw_semaphore or rwlock_t (which are colocated in the buffer_tree structure using a union). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v6.4 Fixes: `450e8dee51` ("dm bufio: improve concurrent IO performance") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:41:14 -05:00
Mikulas Patocka	ccadc8a21e	dm-delay: avoid duplicate logic This is small refactoring of dm-delay - we avoid duplicate logic in flush_delayed_bios and flush_delayed_bios_fast and join these two functions into one. We also add cond_resched() to flush_delayed_bios because the list may have unbounded number of entries. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:41:14 -05:00
Mikulas Patocka	38cfff5681	dm-delay: fix bugs introduced by kthread mode This commit fixes the following bugs introduced by commit `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq"): * the function flush_worker_fn has no exit path - on unload, this function will just loop and consume 100% CPU without any progress * the wake-up mechanism in flush_worker_fn is racy - a wake up will be missed if the process adds entries to the delayed_bios list just before set_current_state(TASK_INTERRUPTIBLE) * flush_delayed_bios_fast submits a bio while holding a global mutex; this may deadlock if we have multiple stacked dm-delay devices and the underlying device attempts to acquire the mutex too * if the target constructor fails, it will call delay_dtr. delay_dtr would attempt to free dc->timer_lock without it being initialized by the constructor. * if the target constructor's kthread allocation fails, delay_dtr would crash trying to dereference dc->worker because it is non-NULL due to ERR_PTR. Fixes: `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:41:14 -05:00
Mikulas Patocka	6fc45b6ed9	dm-delay: fix a race between delay_presuspend and delay_bio In delay_presuspend, we set the atomic variable may_delay and then stop the timer and flush pending bios. The intention here is to prevent the delay target from re-arming the timer again. However, this test is racy. Suppose that one thread goes to delay_bio, sees that dc->may_delay is one and proceeds; now, another thread executes delay_presuspend, it sets dc->may_delay to zero, deletes the timer and flushes pending bios. Then, the first thread continues and adds the bio to delayed->list despite the fact that dc->may_delay is false. Fix this bug by changing may_delay's type from atomic_t to bool and only access it while holding the delayed_bios_lock mutex. Note that we don't have to grab the mutex in delay_resume because there are no bios in flight at this point. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-11-17 14:38:46 -05:00
Linus Torvalds	ecae0bd517	Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series "Fixes and cleanups to compaction". - Joel Fernandes has a patchset ("Optimize mremap during mutual alignment within PMD") which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested. - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series "Do not try to access unaccepted memory" Adrian Hunter provides some fixups for the recently-added "unaccepted memory' feature. To increase the feature's checking coverage. "Plug a few gaps where RAM is exposed without checking if it is unaccepted memory". - In the series "cleanups for lockless slab shrink" Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code. - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series "use refcount+RCU method to implement lockless slab shrink". - David Hildenbrand contributes some maintenance work for the rmap code in the series "Anon rmap cleanups". - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series "mm: migrate: more folio conversion and unification". - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series "Add and use bdev_getblk()". - In the series "Use nth_page() in place of direct struct page manipulation" Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames. - In the series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO" has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use. - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code rationalization and folio conversions in the hugetlb code. - Yin Fengwei has improved mlock()'s handling of large folios in the series "support large folio for mlock" - In the series "Expose swapcache stat for memcg v1" Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2. - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named "MDWE without inheritance". - Kefeng Wang has provided the series "mm: convert numa balancing functions to use a folio" which does what it says. - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch makes is possible for a process to propagate KSM treatment across exec(). - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use "high bandwidth memory" in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named "memory tiering: calculate abstract distance based on ACPI HMAT" - In the series "Smart scanning mode for KSM" Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans. - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series "mm: memcg: fix tracking of pending stats updates values". - In the series "Implement IOCTL to get and optionally clear info about PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU. - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance" - a bunch of relatively minor maintenance tweaks to this code. - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series "Handle more faults under the VMA lock". Some rationalizations of the fault path became possible as a result. - In the series "mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups and folio conversions. - In the series "various improvements to the GUP interface" Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements. - Andrey Konovalov has sent along the series "kasan: assorted fixes and improvements" which does those things. - Some page allocator maintenance work from Kemeng Shi in the series "Two minor cleanups to break_down_buddy_pages". - In thes series "New selftest for mm" Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults. - In the series "Add folio_end_read" Matthew Wilcox provides cleanups and an optimization to the core pagecache code. - Nhat Pham has added memcg accounting for hugetlb memory in the series "hugetlb memcg accounting". - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series "Abstract vma_merge() and split_vma()". - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series "Fix page_owner's use of free timestamps". - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series "permit write-sealed memfd read-only shared mappings". - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series "Batch hugetlb vmemmap modification operations". - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series "Finish the create_empty_buffers() transition". - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series "mm: PCP high auto-tuning". - Roman Gushchin has contributed the patchset "mm: improve performance of accounted kernel memory allocations" which improves their performance by ~30% as measured by a micro-benchmark. - folio conversions from Kefeng Wang in the series "mm: convert page cpupid functions to folios". - Some kmemleak fixups in Liu Shixin's series "Some bugfix about kmemleak". - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series "handle memoryless nodes more appropriately". - khugepaged conversions from Vishal Moola in the series "Some khugepaged folio conversions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y FgeUPAD1oasg6CP+INZvCj34waNxwAc= =E+Y4 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...	2023-11-02 19:38:47 -10:00
Linus Torvalds	426ee5196d	sysctl-6.7-rc1 To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last 2 patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmVCqKsSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinEgYQAIpkqRL85DBwems19Uk9A27lkctwZ6Fc HdslQCObQTsbuKVimZFP4IL2beUfUE0cfLZCXlzp+4nRDOf6vyhyf3w19jPQtI0Q YdqwTk9y6G5VjDsb35QK0+UBloY/kZ1H3/LW4uCwjXTuksUGmWW2Qvey35696Scv hDMLADqKQmdpYxLUaNi9QyYbEAjYtOai2ezg3+i7hTG168t1k/Ab2BxIFrPVsCR2 FAiq05L4ugWjNskdsWBjck05JZsx9SK/qcAxpIPoUm4nGiFNHApXE0E0hs3vsnmn WIHIbxCQw8ZlUDlmw4S+0YH3NFFzFbWfmW8k2b0f2qZTJm/rU4KiJfcJVknkAUVF raFox6XDW0AUQ9L/NOUJ9ip5rup57GcFrMYocdJ3PPAvvmHKOb1D1O741p75RRcc 9j7zwfIRrzjPUqzhsQS/GFjdJu3lJNmEBK1AcgrVry6WoItrAzJHKPPDC7TwaNmD eXpjxMl1sYzzHqtVh4hn+xkUYphj/6gTGMV8zdo+/FopFswgeJW9G8kHtlEWKDPk MRIKwACmfetP6f3ngHunBg+BOipbjCANL7JI0nOhVOQoaULxCCPx+IPJ6GfSyiuH AbcjH8DGI7fJbUkBFoF0dsRFZ2gH8ds1PYMbWUJ6x3FtuCuv5iIuvQYoaWU6itm7 6f0KvCogg0fU =Qf50 -----END PGP SIGNATURE----- Merge tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...	2023-11-01 20:51:41 -10:00
Linus Torvalds	0364249d20	- Update DM core to directly call the map function for both the linear and stripe targets; which are provided by DM core. - Various updates to use new safer string functions. - Update DM core to respect REQ_NOWAIT flag in normal bios so that memory allocations are always attempted with GFP_NOWAIT. - Add Mikulas Patocka to MAINTAINERS as a DM maintainer! - Improve DM delay target's handling of short delays (< 50ms) by using a kthread to check expiration of IOs rather than timers and a wq. - Update the DM error target so that it works with zoned storage. This helps xfstests to provide proper IO error handling coverage when testing a filesystem with native zoned storage support. - Update both DM crypt and integrity targets to improve performance by using crypto_shash_digest() rather than init+update+final sequence. - Fix DM crypt target by backfilling missing memory allocation accounting for compound pages. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmVCeG8ACgkQxSPxCi2d A1oXAAgAnT0mb1psEwDhKiJG26bUeJDJHIaNTPPw4UpCvKEWU7lTxzJUmthnwLZI D+JnkrenAfGppdsFkHh1ogTBz8r60aeBTY31SUoCUVS/clwHxRstLCkSE2655an3 WNgrGItta5BRtcgTMxU7bmqVOD4Xyw704L7BK5CM4zKBQuXJtkegylALhdXvYSjs mvHWoPczFW6Rv79CkF0uTkon7Ji041BvKVjbp+fYQf9Pul5d6S1v4Jrvlo2EqblU 6OM75OosNgziQhdw7faVOgFCVkQfLkEeYsJ47dBOjpC3+sbyKENbC4r+ZmkCZGyI uJERswpmAkW7pp+Ab6sqG582DXvPvA== =PPwQ -----END PGP SIGNATURE----- Merge tag 'for-6.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Update DM core to directly call the map function for both the linear and stripe targets; which are provided by DM core - Various updates to use new safer string functions - Update DM core to respect REQ_NOWAIT flag in normal bios so that memory allocations are always attempted with GFP_NOWAIT - Add Mikulas Patocka to MAINTAINERS as a DM maintainer! - Improve DM delay target's handling of short delays (< 50ms) by using a kthread to check expiration of IOs rather than timers and a wq - Update the DM error target so that it works with zoned storage. This helps xfstests to provide proper IO error handling coverage when testing a filesystem with native zoned storage support - Update both DM crypt and integrity targets to improve performance by using crypto_shash_digest() rather than init+update+final sequence - Fix DM crypt target by backfilling missing memory allocation accounting for compound pages * tag 'for-6.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm crypt: account large pages in cc->n_allocated_pages dm integrity: use crypto_shash_digest() in sb_mac() dm crypt: use crypto_shash_digest() in crypt_iv_tcw_whitening() dm error: Add support for zoned block devices dm delay: for short delays, use kthread instead of timers and wq MAINTAINERS: add Mikulas Patocka as a DM maintainer dm: respect REQ_NOWAIT flag in normal bios issued to DM dm: enhance alloc_multiple_bios() to be more versatile dm: make __send_duplicate_bios return unsigned int dm log userspace: replace deprecated strncpy with strscpy dm ioctl: replace deprecated strncpy with strscpy_pad dm crypt: replace open-coded kmemdup_nul dm cache metadata: replace deprecated strncpy with strscpy dm: shortcut the calls to linear_map and stripe_map	2023-11-01 12:55:54 -10:00
Linus Torvalds	90d624af2e	for-6.7/block-2023-10-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v 4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2 ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7 nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy WgH1lrtMzg== =sace -----END PGP SIGNATURE----- Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Improvements to the queue_rqs() support, and adding null_blk support for that as well (Chengming) - Series improving badblocks support (Coly) - Key store support for sed-opal (Greg) - IBM partition string handling improvements (Jan) - Make number of ublk devices supported configurable (Mike) - Cancelation improvements for ublk (Ming) - MD pull requests via Song: - Handle timeout in md-cluster, by Denis Plotnikov - Cleanup pers->prepare_suspend, by Yu Kuai - Rewrite mddev_suspend(), by Yu Kuai - Simplify md_seq_ops, by Yu Kuai - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk - Make rdev add/remove independent from daemon thread, by Yu Kuai - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai - NVMe pull request via Keith: - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotaions (Kees) - Misc cleanups and improvements (Jiapeng, Joel) * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits) block: ublk_drv: Remove unused function md: cleanup pers->prepare_suspend() nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() powerpc/pseries: PLPKS SED Opal keystore support block: sed-opal: keystore access for SED Opal keys block:sed-opal: SED Opal keystore ublk: simplify aborting request ublk: replace monitor with cancelable uring_cmd ublk: quiesce request queue when aborting queue ublk: rename mm_lock as lock ublk: move ublk_cancel_dev() out of ub->mutex ublk: make sure io cmd handled in submitter task context ublk: don't get ublk device reference in ublk_abort_queue() ublk: Make ublks_max configurable ublk: Limit dev_id/ub_number values md-cluster: check for timeout while a new disk adding nvme: rework NVME_AUTH Kconfig selection ...	2023-11-01 12:30:07 -10:00
Mikulas Patocka	9793c269da	dm crypt: account large pages in cc->n_allocated_pages The commit `5054e778fc` ("dm crypt: allocate compound pages if possible") changed dm-crypt to use compound pages to improve performance. Unfortunately, there was an oversight: the allocation of compound pages was not accounted at all. Normal pages are accounted in a percpu counter cc->n_allocated_pages and dm-crypt is limited to allocate at most 2% of memory. Because compound pages were not accounted at all, dm-crypt could allocate memory over the 2% limit. Fix this by adding the accounting of compound pages, so that memory consumption of dm-crypt is properly limited. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `5054e778fc` ("dm crypt: allocate compound pages if possible") Cc: stable@vger.kernel.org # v6.5+ Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-31 14:25:06 -04:00
Eric Biggers	070bb43ab0	dm integrity: use crypto_shash_digest() in sb_mac() Simplify sb_mac() by using crypto_shash_digest() instead of an init+update+final sequence. This should also improve performance. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-31 11:06:21 -04:00
Eric Biggers	6d0ee3b680	dm crypt: use crypto_shash_digest() in crypt_iv_tcw_whitening() Simplify crypt_iv_tcw_whitening() by using crypto_shash_digest() instead of an init+update+final sequence. This should also improve performance. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-31 11:06:21 -04:00
Damien Le Moal	a951104333	dm error: Add support for zoned block devices dm-error is used in several test cases in the xfstests test suite to check the handling of IO errors in file systems. However, with several file systems getting native support for zoned block devices (e.g. btrfs and f2fs), dm-error's lack of zoned block device support creates problems as the file system attempts executing zone commands (e.g. a zone append operation) against a dm-error non-zoned block device, which causes various issues in the block layer (e.g. WARN_ON triggers). This commit adds supports for zoned block devices to dm-error, allowing a DM device table containing an error target to be exposed as a zoned block device (if all targets have a compatible zoned model support and mapping). This is done as follows: 1) Allow passing 2 arguments to an error target, similar to dm-linear: a backing device and a start sector. These arguments are optional and dm-error retains its characteristics if the arguments are not specified. 2) Implement the iterate_devices method so that dm-core can normally check the zone support and restrictions (e.g. zone alignment of the targets). When the backing device arguments are not specified, the iterate_devices method never calls the fn() argument. When no backing device is specified, as before, we assume that the DM device is not zoned. When the backing device arguments are specified, the zoned model of the DM device will depend on the backing device type: - If the backing device is zoned and its model and mapping is compatible with other targets of the device, the resulting device will be zoned, with the dm-error mapped portion always returning errors (similar to the default non-zoned case). - If the backing device is not zoned, then the DM device will not be either. This zone support for dm-error requires the definition of a functional report_zones operation so that dm_revalidate_zones() can operate correctly and resources for emulating zone append operations initialized. This is necessary for cases where dm-error is used to partially map a device and have an overall correct handling of zone append. This means that dm-error does not fail report zones operations. Two changes that are not obvious are included to avoid issues: 1) dm_table_supports_zoned_model() is changed to directly check if the backing device of a wildcard target (= dm-error target) is zoned. Otherwise, we wouldn't be able to catch the invalid setup of dm-error without a backing device (non zoned case) being combined with zoned targets. 2) dm_table_supports_dax() is modified to return false if the wildcard target is found. Otherwise, when dm-error is set without a backing device, we end up with a NULL pointer dereference in set_dax_synchronous (dax_dev is NULL). This is consistent with the current behavior because dm_table_supports_dax() always returned false for targets that do not define the iterate_devices method. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-31 11:06:21 -04:00
Christian Loehle	70bbeb29fa	dm delay: for short delays, use kthread instead of timers and wq DM delay's current design of using timers and wq to realize the delays is insufficient for delays below ~50ms. This commit enhances the design to use a kthread to flush the expired delays, trading some CPU time (in some cases) for better delay accuracy and delays closer to what the user requested for smaller delays. The new design is chosen as long as all the delays are below 50ms. Since bios can't be completed in interrupt context using a kthread is probably the most reasonable way to approach this. Testing with echo "0 2097152 zero" \| dmsetup create dm-zeros for i in $(seq 0 20); do echo "0 2097152 delay /dev/mapper/dm-zeros 0 $i" \| dmsetup create dm-delay-${i}ms; done Some performance numbers for comparison, on beaglebone black (single core) CONFIG_HZ_1000=y: fio --name=1msread --rw=randread --bs=4k --runtime=60 --time_based \ --filename=/dev/mapper/dm-delay-1ms Theoretical maximum: 1000 IOPS Previous: 250 IOPS Kthread: 500 IOPS fio --name=10msread --rw=randread --bs=4k --runtime=60 --time_based \ --filename=/dev/mapper/dm-delay-10ms Theoretical maximum: 100 IOPS Previous: 45 IOPS Kthread: 50 IOPS fio --name=1mswrite --rw=randwrite --direct=1 --bs=4k --runtime=60 \ --time_based --filename=/dev/mapper/dm-delay-1ms Theoretical maximum: 1000 IOPS Previous: 498 IOPS Kthread: 1000 IOPS fio --name=10mswrite --rw=randwrite --direct=1 --bs=4k --runtime=60 \ --time_based --filename=/dev/mapper/dm-delay-10ms Theoretical maximum: 100 IOPS Previous: 90 IOPS Kthread: 100 IOPS (This one is just to prove the new design isn't impacting throughput, not really about delays): fio --name=10mswriteasync --rw=randwrite --direct=1 --bs=4k \ --runtime=60 --time_based --filename=/dev/mapper/dm-delay-10ms \ --numjobs=32 --iodepth=64 --ioengine=libaio --group_reporting Previous: 13.3k IOPS Kthread: 13.3k IOPS Signed-off-by: Christian Loehle <christian.loehle@arm.com> [Harshit: kthread_create error handling fix in delay_ctr] Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-31 11:06:21 -04:00
Linus Torvalds	befaa609f4	hardening updates for v6.7-rc1 - Add LKDTM test for stuck CPUs (Mark Rutland) - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo) - Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva) - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh) - Convert group_info.usage to refcount_t (Elena Reshetova) - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva) - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn) - Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook) - Fix strtomem() compile-time check for small sources (Kees Cook) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24 jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99 Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/ ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N /SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9 BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M m2VXyBl3121BVtKi4w== =0iVF -----END PGP SIGNATURE----- Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "One of the more voluminous set of changes is for adding the new __counted_by annotation[1] to gain run-time bounds checking of dynamically sized arrays with UBSan. - Add LKDTM test for stuck CPUs (Mark Rutland) - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo) - Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva) - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh) - Convert group_info.usage to refcount_t (Elena Reshetova) - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva) - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn) - Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook) - Fix strtomem() compile-time check for small sources (Kees Cook)" * tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits) hwmon: (acpi_power_meter) replace open-coded kmemdup_nul reset: Annotate struct reset_control_array with __counted_by kexec: Annotate struct crash_mem with __counted_by virtio_console: Annotate struct port_buffer with __counted_by ima: Add __counted_by for struct modsig and use struct_size() MAINTAINERS: Include stackleak paths in hardening entry string: Adjust strtomem() logic to allow for smaller sources hardening: x86: drop reference to removed config AMD_IOMMU_V2 randstruct: Fix gcc-plugin performance mode to stay in group mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by irqchip/imx-intmux: Annotate struct intmux_data with __counted_by KVM: Annotate struct kvm_irq_routing_table with __counted_by virt: acrn: Annotate struct vm_memory_region_batch with __counted_by hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by sparc: Annotate struct cpuinfo_tree with __counted_by isdn: kcapi: replace deprecated strncpy with strscpy_pad isdn: replace deprecated strncpy with strscpy NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by ...	2023-10-30 19:09:55 -10:00
Linus Torvalds	9e87705289	Initial bcachefs pull request for 6.7-rc1 Here's the bcachefs filesystem pull request. One new patch since last week: the exportfs constants ended up conflicting with other filesystems that are also getting added to the global enum, so switched to new constants picked by Amir. I'll also be sending another pull request later on in the cycle bringing things up to date my master branch that people are currently running; that will be restricted to fs/bcachefs/, naturally. Testing - fstests as well as the bcachefs specific tests in ktest: https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream It's also been soaking in linux-next, which resulted in a whole bunch of smatch complaints and fixes and a patch or two from Kees. The only new non fs/bcachefs/ patch is the objtool patch that adds bcachefs functions to the list of noreturns. The patch that exports osq_lock() has been dropped for now, per Ingo. Prereq patch list: `faf1dce852` objtool: Add bcachefs noreturns `73badee428` lib/generic-radix-tree.c: Add peek_prev() `9492261ff2` lib/generic-radix-tree.c: Don't overflow in peek() `0fb5d567f5` MAINTAINERS: Add entry for generic-radix-tree `b414e8ecd4` closures: Add a missing include `48b7935722` closures: closure_nr_remaining() `ced58fc7ab` closures: closure_wait_event() `bd0d22e41e` MAINTAINERS: Add entry for closures `8c8d2d9670` bcache: move closures to lib/ `957e48087d` locking: export contention tracepoints for bcachefs six locks `21db931445` lib: Export errname `83feeb1955` lib/string_helpers: string_get_size() now returns characters wrote `7d672f4094` stacktrace: Export stack_trace_save_tsk `771eb4fe8b` fs: factor out d_mark_tmpfile() `2b69987be5` sched: Add task_struct->faults_disabled_mapping -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmU/wyIACgkQE6szbY3K bnZc1xAAqjQBGXdtgtKQvk0/ru0WaMZguMsOHd3BUXIbm30F6eJqnoXQ/ahALofc Ju6NrOgcy9wmdPKWpbeF+aK3WnkAW9jShDd0QieVH6PkhyYyh5r11iR/EVtjjLu5 6Teodn8fyTqn9WSDtKG15QreTCJrEasAoGFQKQDA8oiXC7zc+RSpLUkkTWD/pxyW zVqkGGiAUG4x6FON+X2a3QBa9WCahIgV6XzHstGLsmOECxKO/LopGR5jThuIhv9t Yo0wodQTKAgb9QviG6V3f2dJLQKKUVDmVEGTXv+8Hl3d8CiYBJeIh+icp+VESBo1 m8ev0y2xbTPLwgm5v0Uj4o/G8ISZ+qmcexV2zQ9xUWUAd2AjEBzhCh9BrNXM5qSg o7mphH+Pt6bJXgzxb2RkYJixU11yG3yuHPOCrRGGFpVHiNYhdHuJeDZOqChWZB8x 6kY0uvU0X0tqVfWKxMwTwuqG8mJ5BkJNvnEvYi05QEZG0dDcUhgOqYlNNaL8vGkl qVixOwE4aH4kscdmW2gXY1c76VSebheyN8n6Wj1zrmTw4hTJH7ZWXPtmbRqQzpB6 U6w3NjVyopbIjuF+syWeGqitTT/8fpvgZU4E9MpKGmHX4ADgecp6YSZQzzxTJn7D cbVX7YQxhmsM50C1PW7A8yLCspD/uRNiKLvzb/g9gFSInk4rV+U= =g+ia -----END PGP SIGNATURE----- Merge tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs Pull initial bcachefs updates from Kent Overstreet: "Here's the bcachefs filesystem pull request. One new patch since last week: the exportfs constants ended up conflicting with other filesystems that are also getting added to the global enum, so switched to new constants picked by Amir. The only new non fs/bcachefs/ patch is the objtool patch that adds bcachefs functions to the list of noreturns. The patch that exports osq_lock() has been dropped for now, per Ingo" * tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs: (2781 commits) exportfs: Change bcachefs fid_type enum to avoid conflicts bcachefs: Refactor memcpy into direct assignment bcachefs: Fix drop_alloc_keys() bcachefs: snapshot_create_lock bcachefs: Fix snapshot skiplists during snapshot deletion bcachefs: bch2_sb_field_get() refactoring bcachefs: KEY_TYPE_error now counts towards i_sectors bcachefs: Fix handling of unknown bkey types bcachefs: Switch to unsafe_memcpy() in a few places bcachefs: Use struct_size() bcachefs: Correctly initialize new buckets on device resize bcachefs: Fix another smatch complaint bcachefs: Use strsep() in split_devs() bcachefs: Add iops fields to bch_member bcachefs: Rename bch_sb_field_members -> bch_sb_field_members_v1 bcachefs: New superblock section members_v2 bcachefs: Add new helper to retrieve bch_member from sb bcachefs: bucket_lock() is now a sleepable lock bcachefs: fix crc32c checksum merge byte order problem bcachefs: Fix bch2_inode_delete_keys() ...	2023-10-30 11:09:38 -10:00
Jan Kara	b3856da790	bcache: Fixup error handling in register_cache() Coverity has noticed that the printing of error message in register_cache() uses already freed bdev_handle to get to bdev. In fact the problem has been there even before commit "bcache: Convert to bdev_open_by_path()" just a bit more subtle one - cache object itself could have been freed by the time we looked at ca->bdev and we don't hold any reference to bdev either so even that could in principle go away (due to device unplug or similar). Fix all these problems by printing the error message before closing the bdev. Fixes: dc893f51d24a ("bcache: Convert to bdev_open_by_path()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231004093757.11560-1-jack@suse.cz Asked-by: Coly Li <colyli@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:22 +02:00
Jan Kara	9f0f5a30d3	md: Convert to bdev_open_by_dev() Convert md to use bdev_open_by_dev() and pass the handle around. We also don't need the 'Holder' flag anymore so remove it. CC: linux-raid@vger.kernel.org CC: Song Liu <song@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-11-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:18 +02:00
Jan Kara	c2fce61fb2	dm: Convert to bdev_open_by_dev() Convert device mapper to use bdev_open_by_dev() and pass the handle around. CC: Alasdair Kergon <agk@redhat.com> CC: Mike Snitzer <snitzer@kernel.org> CC: dm-devel@redhat.com Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-10-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:18 +02:00
Jan Kara	631b001fd6	bcache: Convert to bdev_open_by_path() Convert bcache to use bdev_open_by_path() and pass the handle around. CC: linux-bcache@vger.kernel.org CC: Coly Li <colyli@suse.de> CC: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-9-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:17 +02:00
Mike Snitzer	6f25dd1c57	dm: respect REQ_NOWAIT flag in normal bios issued to DM Update DM core's normal IO submission to allocate required memory using GFP_NOWAIT if REQ_NOWAIT is set. Tested with simple test provided in commit `a9ce385344` ("dm: don't attempt to queue IO under RCU protection") that was enhanced to check error codes. Also tested using fio's pvsync2 with nowait=1. But testing with induced GFP_NOWAIT allocation failures wasn't performed (yet). Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-27 13:22:51 -04:00
Mike Snitzer	4a2fe29608	dm: enhance alloc_multiple_bios() to be more versatile alloc_multiple_bios() has the useful ability to try allocating bios with GFP_NOWAIT but will fallback to using GFP_NOIO. The callers service both empty flush bios and abnormal bios (e.g. discard). alloc_multiple_bios() enhancements offered in this commit: - don't require table_devices_lock if num_bios = 1 - allow caller to pass GFP_NOWAIT to do usual GFP_NOWAIT with GFP_NOIO fallback - allow caller to pass GFP_NOIO to _only_ allocate using GFP_NOIO Flush bios with data may be issued to DM with REQ_NOWAIT, as such it makes sense to attempt servicing them with GFP_NOWAIT allocations. But abnormal IO should never be issued using REQ_NOWAIT (if that changes in the future that's fine, but no sense supporting it now). While at it, rename __send_changing_extent_only() to __send_abnormal_io(). [Thanks to both Ming and Mikulas for help with translating known possible IO scenarios to requirements.] Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-27 12:44:18 -04:00
Mikulas Patocka	34dbaa88ca	dm: make __send_duplicate_bios return unsigned int All the callers cast the value returned by __send_duplicate_bios to unsigned int type, so we can return unsigned int as well. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-23 13:07:36 -04:00
Justin Stitt	18ac523329	dm log userspace: replace deprecated strncpy with strscpy `strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. `lc` is already zero-allocated: \| lc = kzalloc(sizeof(*lc), GFP_KERNEL); ... as such, any future NUL-padding is superfluous. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. Let's also go with the more idiomatic `dest, src, sizeof(dest)` pattern for destination buffers that the compiler can calculate the size for. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-23 13:02:48 -04:00
Justin Stitt	0ffb645ea8	dm ioctl: replace deprecated strncpy with strscpy_pad `strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. We expect `spec->target_type` to be NUL-terminated based on its use with a format string after `dm_table_add_target()` is called \| r = dm_table_add_target(table, spec->target_type, \| (sector_t) spec->sector_start, \| (sector_t) spec->length, \| target_params); ... wherein `spec->target_type` is passed as parameter `type` and later printed with DMERR: \| DMERR("%s: %s: unknown target type", dm_device_name(t->md), type); It appears that `spec` is not zero-allocated and thus NUL-padding may be required in this ioctl context. Considering the above, a suitable replacement is `strscpy_pad` due to the fact that it guarantees NUL-termination whilst maintaining the NUL-padding behavior that strncpy provides. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-23 13:02:37 -04:00
Justin Stitt	e9d7bd2c86	dm crypt: replace open-coded kmemdup_nul kzalloc() followed by strncpy() on an expected NUL-terminated string is just kmemdup_nul(). Let's simplify this code (while also dropping a deprecated strncpy() call [1]). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-23 13:02:25 -04:00
Justin Stitt	ac4149ba7e	dm cache metadata: replace deprecated strncpy with strscpy `strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. It seems `cmd->policy_name` is intended to be NUL-terminated based on a now changed line of code from Commit (`c6b4fcbad0` "dm: add cache target"): \| if (strcmp(cmd->policy_name, policy_name)) { // ... However, now a length-bounded strncmp is used: \| if (strncmp(cmd->policy_name, policy_name, sizeof(cmd->policy_name))) ... which means NUL-terminated may not strictly be required. However, I believe the intent of the code is clear and we should maintain NUL-termination of policy_names. Moreover, __begin_transaction_flags() zero-allocates `cmd` before calling read_superblock_fields(): \| cmd = kzalloc(sizeof(*cmd), GFP_KERNEL); Also, `disk_super->policy_name` is zero-initialized \| memset(disk_super->policy_name, 0, sizeof(disk_super->policy_name)); ... therefore any NUL-padding is redundant. Considering the above, a suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-23 13:02:05 -04:00
Kent Overstreet	8c8d2d9670	bcache: move closures to lib/ Prep work for bcachefs - being a fork of bcache it also uses closures Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>	2023-10-19 14:47:33 -04:00
Yu Kuai	78b7b13f07	md: cleanup pers->prepare_suspend() pers->prepare_suspend() is not used anymore and can be removed. Reverts following three commit: - commit `431e61257d` ("md: export md_is_rdwr() and is_md_suspended()") - commit `3e00777d51` ("md: add a new api prepare_suspend() in md_personality") - commit `868bba54a3` ("md/raid5: fix a deadlock in the case that reshape is interrupted") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231016100240.540474-1-yukuai1@huaweicloud.com	2023-10-18 09:22:28 -07:00
Denis Plotnikov	1bbe254e43	md-cluster: check for timeout while a new disk adding A new disk adding may end up with timeout and a new disk won't be added. Add returning the error in that case. Found by Linux Verification Center (linuxtesting.org) with SVACE Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru	2023-10-12 09:16:19 -07:00
Joel Granados	dd6291c506	raid: Remove now superfluous sentinel element from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from raid_table Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-10-11 12:16:13 -07:00
Yu Kuai	2b16a52549	md: rename __mddev_suspend/resume() back to mddev_suspend/resume() Now that the old apis are removed, __mddev_suspend/resume() can be renamed to their original names. This is done by: sed -i "s/__mddev_suspend/mddev_suspend/g" .[ch] sed -i "s/__mddev_resume/mddev_resume/g" .[ch] Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	4717c02875	md: remove old apis to suspend the array Now that mddev_suspend() and mddev_resume() is not used anywhere, remove them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	bc08041b32	md: suspend array in md_start_sync() if array need reconfiguration So that io won't concurrent with array reconfiguration, and it's safe to suspend the array directly because normal io won't rely on md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-18-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	b42cd7b3a2	md/raid5: replace suspend with quiesce() callback raid5 is the only personality to suspend array in check_reshape() and start_reshape() callback, suspend and quiesce() callback can both wait for all normal io to be done, and prevent new io to be dispatched, the difference is that suspend is implemented in common layer, and quiesce() callback is implemented in raid5. In order to cleanup all the usage of mddev_suspend(), the new apis __mddev_suspend() need to be called before 'reconfig_mutex' is held, and it's not good to affect all the personalities in common layer just for raid5. Hence replace suspend with quiesce() callaback, prepare to reomove all the users of mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-17-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	1978c742f3	md/md-linear: cleanup linear_add() Now that caller already suspend the array, there is no need to suspend array in liner_add(). Note that mddev_suspend/resume() is not used anymore. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-16-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	b4128c00a6	md: cleanup mddev_create/destroy_serial_pool() Now that except for stopping the array, all the callers already suspend the array, there is no need to suspend anymore, hence remove the second parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com	2023-10-10 18:49:51 -07:00
Yu Kuai	58226942ad	md: use new apis to suspend array before mddev_create/destroy_serial_pool mddev_create/destroy_serial_pool() will be called from several places where mddev_suspend() will be called later. Prepare to remove the mddev_suspend() from mddev_create/destroy_serial_pool(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	1b0a2d950e	md: use new apis to suspend array for ioctls involed array reconfiguration 'reconfig_mutex' will be grabbed before these ioctls, suspend array before holding the lock, so that io won't concurrent with array reconfiguration through ioctls. This is not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-13-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	cfa078c8b8	md: use new apis to suspend array for adding/removing rdev from state_store() User can write 'remove' and 're-add' to trigger array reconfiguration through sysfs, suspend array in this case so that io won't concurrent with array reconfiguration. And now that all the caller of add_bound_rdev() alread suspend the array, remove mddev_suspend/resume() from add_bound_rdev() as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	205669f377	md: use new apis to suspend array for sysfs apis Convert to use new apis in following sysfs apis: - level_store - suspend_lo_store - suspend_hi_store - serialize_policy_store These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-11-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	e28ca92fbb	md/raid5: use new apis to suspend array Convert to use new apis, the old apis will be removed eventually. These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-10-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	1b172e0b11	md/raid5-cache: use new apis to suspend array Convert to use new apis, the old apis will be removed eventually. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-9-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	3cddf86a57	md/md-bitmap: use new apis to suspend array for location_store() Convert to use new apis, the old apis will be removed eventually. This is not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-8-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	4eb3327aa2	md/dm-raid: use new apis to suspend array Convert to use new apis, the old apis will be removed eventually. These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-7-yukuai1@huaweicloud.com	2023-10-10 18:49:50 -07:00
Yu Kuai	f45461e24f	md: add new helpers to suspend/resume and lock/unlock array The new helpers suspend the array first and then lock the array, Prepare to refactor from: mddev_lock/lock_nointr mddev_suspend ... mddev_resuem mddev_lock With: mddev_suspend_and_lock/lock_nointr ... mddev_unlock_and_resume After all the use cases is refactored, mddev_suspend/resume() will be removed. And mddev_suspend_and_lock() will also replace mddev_lock() for the case that the array will be reconfigured, in order to synchronize with io to prevent problems in many corner cases. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com	2023-10-10 18:49:49 -07:00
Yu Kuai	714d20150e	md: add new helpers to suspend/resume array Advantages for new apis: - reconfig_mutex is not required; - the weird logical that suspend array hold 'reconfig_mutex' for mddev_check_recovery() to update superblock is not needed; - the specail handling, 'pers->prepare_suspend', for raid456 is not needed; - It's safe to be called at any time once mddev is allocated, and it's designed to be used from slow path where array configuration is changed; - the new helpers is designed to be called before mddev_lock(), hence it support to be interrupted by user as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com	2023-10-10 18:49:49 -07:00
Yu Kuai	2e82248b70	md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery() Prepare to cleanup pers->prepare_suspend(), which is used to fix a deadlock in raid456 by returning error for io that is waiting for reshape to make progress in mddev_suspend(). This change will allow reshape to make progress while waiting for io to be done in mddev_suspend() in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com	2023-10-10 18:49:49 -07:00
Yu Kuai	06a4d0d8c6	md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log' 'conf->log' is set with 'reconfig_mutex' grabbed, however, readers are not procted, hence protect it with READ_ONCE/WRITE_ONCE to prevent reading abnormal values. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-3-yukuai1@huaweicloud.com	2023-10-10 18:49:49 -07:00
Yu Kuai	617787f138	md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi' Protect 'suspend_lo' and 'suspend_hi' with READ_ONCE/WRITE_ONCE to prevent reading abnormal values. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-2-yukuai1@huaweicloud.com	2023-10-10 18:49:49 -07:00
Linus Torvalds	832b5d0bf9	This push fixes a 6.5 regression in dm-crypt. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmUfc6MACgkQxycdCkmx i6cCahAAwuexZ3VzTvbHKE1kGxyWS+RXsq2zAJ87VAclG0UeFyICQghKzGJ1SRzo wc6Z5752r/F2rXhvSKlQw6Gd51HPoDxK4s2aC2KYLKEwWoNCvBWyFQveiAF8qtDC 4xBZmRqYmYVZ7yD2lfHiaPC9TbUdMNMCrgXfjqOMxoKCFJ8BKGmUUu4ncr9leOqv Z6haTQ3mGBRI+IIN99FbYAJ+YpM0uhnjxd/8+GsDeyihhFA4l5Q57Rb+YMj6EtaR QTHXqxU3fGEXy9ffW0mefrG+GpSs3dLgkkZpPMsvrh+njLfwHCLLzyI1k/y+4YFh I414CEn/rZ/TMotx4JQufRGR3DYkZQZZXyGOIRCKn1WDBxnxj4LeBylN4f5UgxnO DXRPvZAO/P5kMNyRIst8KAaUllV1VoWniut3q1W1lQkpd2LPEmNVcp6K0EXQyJhL OU0Ay+T36ELLgaEqjJ4Vvee6f+zsTl17j5uj4chisuoihWNqPh5f9ovy9IS33Sbo 089FT9jXKciruPaprT0CcBOBMfYwIqvEdL/iBiVrkJgUhv+OfbnanIgJv4xXpHjP v6xWe3o97IOI2h3akDMen86uzlWyx+AvQ4qu64SzEwB9YmH3jallfGfCdXQqqdzs z6jOv7TicSuupm07sjOCyScmugGm2vOWmLup+fcZGUFANdtJPxw= =ctRf -----END PGP SIGNATURE----- Merge tag 'v6.6-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a regression in dm-crypt" * tag 'v6.6-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: dm crypt: Fix reqsize in crypt_iv_eboiv_gen	2023-10-10 10:47:35 -07:00
Yu Kuai	9e55a22fce	md/raid1: don't split discard io for write behind Currently, discad io is treated the same as normal write io, and for write behind case, io size is limited to: BIO_MAX_VECS * (PAGE_SIZE >> 9) For 0.5KB sector size and 4KB PAGE_SIZE, this is just 1MB. For consequence, if 'WriteMostly' is set to one of the underlying disks, then diskcard io will be splited into 1MB and it will take a long time for the diskcard to finish. Fix this problem by disable write behind for discard io. Reported-by: Roman Mamedov <rm@romanrm.net> Closes: https://lore.kernel.org/all/6a1165f7-c792-c054-b8f0-1ad4f7b8ae01@ultracoder.org/ Reported-and-tested-by: Kirill Kirilenko <kirill@ultracoder.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231007112105.407449-1-yukuai1@huaweicloud.com	2023-10-09 16:32:02 -07:00
Linus Torvalds	4aef108a4d	- Fix memory leak when freeing dm zoned target device - Update dm-devel mailing list address in MAINTAINERS -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmUglFMACgkQxSPxCi2d A1rOswf/T5nqpzFrbrzYj80DpRIQZj+nvTvPTId8X2kWQZ6uJqOpXSA1QP8DHeQj ycn88TEK8sbXn+wvXXWoPuMrh6FVJoZASvdlhq5QhnwpCT0riwRrnh7XQIrl3Ktn EQNmAoiKPMF/pZe653BalW/SnRJEa6603RhjHFBuGLKVS7i38oWrmbfcVEdQ+/nf Pq7gIWoFHtL9LgvhfCLHDg/EYO+brNTJtCnqt5t67iILkK6A//BO9y3O6f2QIj8C 9Y5wGduZyaZ0uOj+LuU3sEqMfr8S6q+0vbJn4yQPDhSM0msLfB20hnYN1Nl2TGIx B0k3YBScNq6u+pmPjKZMcnRXJnKPaA== =p6BS -----END PGP SIGNATURE----- Merge tag 'for-6.6/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix memory leak when freeing dm zoned target device - Update dm-devel mailing list address in MAINTAINERS * tag 'for-6.6/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: MAINTAINERS: update the dm-devel mailing list dm zoned: free dmz->ddev array in dmz_put_zoned_devices	2023-10-07 10:17:48 -07:00
Mikulas Patocka	f144503217	dm: shortcut the calls to linear_map and stripe_map Shortcut the calls to linear_map and stripe_map, so that they don't suffer the overhead of retpolines used for indirect calls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-10-06 19:09:25 -04:00
Linus Torvalds	fc5b94f1cb	block-6.6-2023-10-06 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmUgNUIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppleEADNnAobhJCLJWw2uSwkfIqap4pL+LFvA0B5 qdhOJb1U7t3mmu6/VhSEpvRFQiSvMHzhXBWHepQfii0Iu7ubkOwlEhPEirZJLBdo 2Qt6RPxijfhKRHbrujNzjrCUAIti8zchaBrYmYQtB7paKCtTStPLOezBgfOutFl7 TRLF+GBpsISuQUASSIFssxf7RpZTzS9pdlrS09BeViROdwwqYHXNUIl0kxEFu9BM SjElcby5ZJj5drg0SRl5ePG/XCp4Sr/rFDQJI2S/d++qwJS8egsYPQAe7Tbpw4y5 9FBhs5hKypRkdTjbOVa23lqdfHcSIEyRM2lUL12vaVMA5m+mZ7SePt9qX4Vbg074 2QJ1YVm4c/UFe6uieUcmDgJNgUMEyeEyo4SUj9gl/gn8wQE1dz5qB5PF1YWEcHa7 KJ1zBbxBKw4ETLBdt24wCHH8HyA7SO8RCk8EMxFv7ENvQSzXzo+mdvrUYpp04Mu8 eNV8VvqpMvBuWuh9uAO4CS1GIFQW7DijrProbyh9aOthu7850/ahnwS3b6+jkZYP 3A7vaHKpU3VyGLYULAE9lZhmeCxmFNHzClgzX2Wve4Z94Xoq7Sa8xksV3JBx5NzY 5Rt5rfdTZArt+LPL4CrdwlxEnT3F5yTSKXp82lWo6P1ds/x7WEYXKf0B4OgAGANk 6tk9rK3ZBA== =NWf+ -----END PGP SIGNATURE----- Merge tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Just two minor fixes, for nbd and md" * tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux: nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl md/raid5: release batch_last before waiting for another stripe_head	2023-10-06 15:43:19 -07:00
Herbert Xu	152d0bcdf1	dm crypt: Fix reqsize in crypt_iv_eboiv_gen A skcipher_request object is made up of struct skcipher_request followed by a variable-sized trailer. The allocation of the skcipher_request and IV in crypt_iv_eboiv_gen is missing the memory for struct skcipher_request. Fix it by adding it to reqsize. Fixes: `e3023094df` ("dm crypt: Avoid using MAX_CIPHER_BLOCKSIZE") Cc: <stable@vger.kernel.org> #6.5+ Reported-by: Tatu Heikkilä <tatu.heikkila@gmail.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2023-10-06 10:39:18 +08:00
Qi Zheng	8a0e8bb112	mm: shrinker: convert shrinker_rwsem to mutex Now there are no readers of shrinker_rwsem, so we can simply replace it with mutex lock. [akpm@linux-foundation.org: update the fix to alloc_shrinker_info()] Link: https://lkml.kernel.org/r/20230911094444.68966-46-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:26 -07:00
Qi Zheng	a6a1eb6214	bcache: dynamically allocate the md-bcache shrinker In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the md-bcache shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct cache_set. Link: https://lkml.kernel.org/r/20230911094444.68966-27-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Coly Li <colyli@suse.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:25 -07:00
Qi Zheng	86298d8b8c	md/raid5: dynamically allocate the md-raid5 shrinker In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the md-raid5 shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct r5conf. Link: https://lkml.kernel.org/r/20230911094444.68966-26-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Song Liu <song@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:25 -07:00
Qi Zheng	ba3d6acafd	dm zoned: dynamically allocate the dm-zoned-meta shrinker In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the dm-zoned-meta shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct dmz_metadata. Link: https://lkml.kernel.org/r/20230911094444.68966-25-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:25 -07:00
Qi Zheng	1f1d459c9a	dm: dynamically allocate the dm-bufio shrinker In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the dm-bufio shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct dm_bufio_client. Link: https://lkml.kernel.org/r/20230911094444.68966-24-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:25 -07:00
David Jeffery	2fd7b0f6d5	md/raid5: release batch_last before waiting for another stripe_head When raid5_get_active_stripe is called with a ctx containing a stripe_head in its batch_last pointer, it can cause a deadlock if the task sleeps waiting on another stripe_head to become available. The stripe_head held by batch_last can be blocking the advancement of other stripe_heads, leading to no stripe_heads being released so raid5_get_active_stripe waits forever. Like with the quiesce state handling earlier in the function, batch_last needs to be released by raid5_get_active_stripe before it waits for another stripe_head. Fixes: `3312e6c887` ("md/raid5: Keep a reference to last stripe_head for batch") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231002183422.13047-1-djeffery@redhat.com	2023-10-03 08:53:09 -07:00
Kees Cook	96d7c65939	dm: Annotate struct dm_bio_prison with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct dm_bio_prison. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230915200407.never.611-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-10-02 09:48:53 -07:00
Kees Cook	37d27cf1f5	dm: Annotate struct dm_stat with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct dm_stat. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230915200400.never.585-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-10-02 09:48:52 -07:00
Kees Cook	694b3b9d7a	dm: Annotate struct stripe_c with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct stripe_c. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230915200352.never.118-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-10-02 09:48:52 -07:00
Kees Cook	6521ba56ca	dm crypt: Annotate struct crypt_config with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct crypt_config. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230915200344.never.272-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-10-02 09:48:52 -07:00
Kees Cook	e3260d90c8	dm raid: Annotate struct raid_set with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct raid_set. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230915200335.never.098-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-10-02 09:48:52 -07:00
Mariusz Tkaczyk	09f894affc	md: do not require mddev_lock() for all options in array_state_store() We don't need to lock device to reject not supported request in array_state_store(). No functional changes intended. There are differences between ioctl and sysfs handling during stopping. With this change, it will be easier to add additional steps which needs to be completed before mddev_lock() is taken. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230928125517.12356-1-mariusz.tkaczyk@linux.intel.com	2023-09-28 13:34:57 -07:00
Yu Kuai	cf1b6d4441	md: simplify md_seq_ops Before this patch, the implementation is hacky and hard to understand: 1) md_seq_start set pos to 1; 2) md_seq_show found pos is 1, then print Personalities; 3) md_seq_next found pos is 1, then it update pos to the first mddev; 4) md_seq_show found pos is not 1 or 2, show mddev; 5) md_seq_next found pos is not 1 or 2, update pos to next mddev; 6) loop 4-5 until the last mddev, then md_seq_next update pos to 2; 7) md_seq_show found pos is 2, then print unused devices; 8) md_seq_next found pos is 2, stop; This patch remove the magic value and use seq_list_start/next/stop() directly, and move printing "Personalities" to md_seq_start(), "unsed devices" to md_seq_stop(): 1) md_seq_start print Personalities, and then set pos to first mddev; 2) md_seq_show show mddev; 3) md_seq_next update pos to next mddev; 4) loop 2-3 until the last mddev; 5) md_seq_stop print unsed devices; Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com	2023-09-27 13:54:26 -07:00
Yu Kuai	3d8d32873c	md: factor out a helper from mddev_put() There are no functional changes, prepare to simplify md_seq_ops in next patch. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com	2023-09-27 13:54:26 -07:00
Justin Stitt	ceb0416383	md: replace deprecated strncpy with memcpy `strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. There are three such strncpy uses that this patch addresses: The respective destination buffers are: 1) mddev->clevel 2) clevel 3) mddev->metadata_type We expect mddev->clevel to be NUL-terminated due to its use with format strings: \| ret = sprintf(page, "%s\n", mddev->clevel); Furthermore, we can see that mddev->clevel is not expected to be NUL-padded as `md_clean()` merely set's its first byte to NULL -- not the entire buffer: \| static void md_clean(struct mddev mddev) \| { \| mddev->array_sectors = 0; \| mddev->external_size = 0; \| ... \| mddev->level = LEVEL_NONE; \| mddev->clevel[0] = 0; \| ... A suitable replacement for this instance is `memcpy` as we know the number of bytes to copy and perform manual NUL-termination at a specified offset. This really decays to just a byte copy from one buffer to another. `strscpy` is also a considerable replacement but using `slen` as the length argument would result in truncation of the last byte unless something like `slen + 1` was provided which isn't the most idiomatic strscpy usage. For the next case, the destination buffer `clevel` is expected to be NUL-terminated based on its usage within kstrtol() which expects NUL-terminated strings. Note that, in context, this code removes a trailing newline which is seemingly not required as kstrtol() can handle trailing newlines implicitly. However, there exists further usage of clevel (or buf) that would also like to have the newline removed. All in all, with similar reasoning to the first case, let's just use memcpy as this is just a byte copy and NUL-termination is handled manually. The third and final case concerning `mddev->metadata_type` is more or less the same as the other two. We expect that it be NUL-terminated based on its usage with seq_printf: \| seq_printf(seq, " super external:%s", \| mddev->metadata_type); ... and we can surmise that NUL-padding isn't required either due to how it is handled in md_clean(): \| static void md_clean(struct mddev mddev) \| { \| ... \| mddev->metadata_type[0] = 0; \| ... So really, all these instances have precisely calculated lengths and purposeful NUL-termination so we can just use memcpy to remove ambiguity surrounding strncpy. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b@google.com	2023-09-25 14:36:41 -07:00
Kees Cook	e887544d76	md/md-linear: Annotate struct linear_conf with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct linear_conf. Additionally, since the element count member must be set before accessing the annotated flexible array member, move its initialization earlier. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230915200328.never.064-kees@kernel.org	2023-09-22 11:23:01 -07:00
Yu Kuai	54d21eb6ad	md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store() Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking. This will also allow the array to be suspended even before the array is ran. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com	2023-09-22 10:28:27 -07:00
Yu Kuai	a2a9f16838	md: don't check 'mddev->pers' from suspend_hi_store() Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking. This will also allow the array to be suspended even before the array is ran. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com	2023-09-22 10:28:27 -07:00
Yu Kuai	158d32af87	md-bitmap: suspend array earlier in location_store() Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to call mddev_suspend() earlier. This will also be helper to refactor mddev_suspend() later. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com	2023-09-22 10:28:27 -07:00
Yu Kuai	b71fe4ac75	md-bitmap: remove the checking of 'pers->quiesce' from location_store() After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"), there is no need to check 'pers->quiesce' before calling mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com	2023-09-22 10:28:27 -07:00
Yu Kuai	b721e7885e	md: don't rely on 'mddev->pers' to be set in mddev_suspend() 'active_io' used to be initialized while the array is running, and 'mddev->pers' is set while the array is running as well. Hence caller must hold 'reconfig_mutex' and guarantee 'mddev->pers' is set before calling mddev_suspend(). Now that 'active_io' is initialized when mddev is allocated, such restriction doesn't exist anymore. In the meantime, follow up patches will refactor mddev_suspend(), hence add checking for 'mddev->pers' to prevent null-ptr-deref. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-4-yukuai1@huaweicloud.com	2023-09-22 10:28:26 -07:00
Yu Kuai	b8494823e2	md: initialize 'writes_pending' while allocating mddev Currently 'writes_pending' is initialized in pers->run for raid1/5/10, and it's freed while deleing mddev, instead of pers->free. pers->run can be called multiple times before mddev is deleted, and a helper mddev_init_writes_pending() is used to prevent 'writes_pending' to be initialized multiple times, this usage is safe but a litter weird. On the other hand, 'writes_pending' is only initialized for raid1/5/10, however, it's used in common layer, for example: array_state_store set_in_sync if (!mddev->in_sync) -> in_sync is used for all levels // access writes_pending There might be some implicit dependency that I don't recognized to make sure 'writes_pending' can only be accessed for raid1/5/10, but there are no comments about that. By the way, it make sense to initialize 'writes_pending' in common layer because there are already three levels use it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com	2023-09-22 10:28:26 -07:00
Yu Kuai	d58eff83bd	md: initialize 'active_io' while allocating mddev 'active_io' is used for mddev_suspend() and it's initialized in md_run(), this restrict that 'reconfig_mutex' must be held and "mddev->pers" must be set before calling mddev_suspend(). Initialize 'active_io' early so that mddev_suspend() is safe to call once mddev is allocated, this will be helpful to refactor mddev_suspend() in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com	2023-09-22 10:28:26 -07:00
Yu Kuai	81e2ce1b3d	md: delay remove_and_add_spares() for read only array to md_start_sync() Before this patch, for read-only array: md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, then it will call remove_and_add_spares() directly to try to remove and add rdevs from array. After this patch: 1) md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, and the worker 'sync_work' is not pending, and there are rdevs can be added or removed, then it will queue new work md_start_sync(); 2) md_start_sync() will call remove_and_add_spares() and exist; This change make sure that array reconfiguration is independent from daemon, and it'll be much easier to synchronize it with io, consier that io may rely on daemon thread to be done. Also fix a problem that 'pers->spars_active' is called after remove_and_add_spares(), which order is wrong, because spares must active first, and then remove_and_add_spares() can add spares to the array, like what read-write case does: 1) daemon set 'MD_RECOVERY_RUNNING', register new sync thread to do recovery; 2) recovery is done, md_do_sync() set 'MD_RECOVERY_DONE' before return; 3) daemon call 'pers->spars_active', and clear 'MD_RECOVERY_RUNNING'; 4) in the next round of daemon, call remove_and_add_spares() to add spares to the array. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com	2023-09-22 10:28:26 -07:00
Yu Kuai	a0ae7e4e0b	md: factor out a helper rdev_addable() from remove_and_add_spares() There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-7-yukuai1@huaweicloud.com	2023-09-22 10:28:25 -07:00
Yu Kuai	b172a0704d	md: factor out a helper rdev_is_spare() from remove_and_add_spares() There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com	2023-09-22 10:28:25 -07:00
Yu Kuai	3389d57f97	md: factor out a helper rdev_removeable() from remove_and_add_spares() There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-5-yukuai1@huaweicloud.com	2023-09-22 10:28:25 -07:00
Yu Kuai	db5e653d7c	md: delay choosing sync action to md_start_sync() Before this patch, for read-write array: 1) md_check_recover() found that something need to be done, and it'll try to grab 'reconfig_mutex'. The case that md_check_recover() need to do something: - array is not suspend; - super_block need to be updated; - 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set; - unusual case related to safemode; 2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, md_check_recover() will try to choose a sync action, and then queue a work md_start_sync(). 3) md_start_sync() register sync_thread; After this patch, 1) is the same; 2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, queue a work md_start_sync() directly; 3) md_start_sync() will try to choose a sync action, and then register sync_thread(); Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2) and 3) and md_do_sync() is always ran in serial and they can never concurrent, this change should not introduce any behavior change for now. Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING' without protection in error path, which might affect the logical in md_check_recovery(). The advantage to change this is that array reconfiguration is independent from daemon now, and it'll be much easier to synchronize it with io, consider that io may rely on daemon thread to be done. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com	2023-09-22 10:28:25 -07:00
Yu Kuai	897c62a1ca	md: factor out a helper to choose sync action from md_check_recovery() There are no functional changes, on the one hand make the code cleaner, on the other hand prevent following checkpatch error in the next patch to delay choosing sync action to md_start_sync(). ERROR: do not use assignment in if condition + } else if ((spares = remove_and_add_spares(mddev, NULL))) { Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com	2023-09-22 10:28:25 -07:00
Yu Kuai	ac61978196	md: use separate work_struct for md_start_sync() It's a little weird to borrow 'del_work' for md_start_sync(), declare a new work_struct 'sync_work' for md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com	2023-09-22 10:28:24 -07:00
Fedor Pchelkin	9850ccd5dd	dm zoned: free dmz->ddev array in dmz_put_zoned_devices Commit `4dba12881f` ("dm zoned: support arbitrary number of devices") made the pointers to additional zoned devices to be stored in a dynamically allocated dmz->ddev array. However, this array is not freed. Rename dmz_put_zoned_device to dmz_put_zoned_devices and fix it to free the dmz->ddev array when cleaning up zoned device information. Remove NULL assignment for all dmz->ddev elements and just free the dmz->ddev array instead. Found by Linux Verification Center (linuxtesting.org). Fixes: `4dba12881f` ("dm zoned: support arbitrary number of devices") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-09-20 13:48:06 -04:00
Linus Torvalds	e39bfb5925	- Fix DM core retrieve_deps() UAF race due to missing locking of a DM table's list of devices that is managed using dm_{get,put}_device. - Revert DM core's half-baked RCU optimization if IO submitter has set REQ_NOWAIT. Can be revisited, and properly justified, after comprehensively auditing all of DM to also pass GFP_NOWAIT for any allocations if REQ_NOWAIT used. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmUEt7wACgkQxSPxCi2d A1prNwf/RB4EyKiSx7XS3ysM6mh/BPGO5FNjWwHebkrSFzAkEowo4i0cY9lRD0N4 x9Wbd5bcV8HarH/fiyffQxgdfXspAIrMt8z5hRnfElkBLzg6hHixxg/3sFCwg+U3 LG6AZFNLil7VmDeca9Pd8MCyXoy1u4ErWjkz3fU8pzzT+NDwRZPZhUMd/MFCWag6 q22S8KMXkYKiAHqKauF52CeDH77XsO66G70t/AElemZ66PpyKpasg2p99RCuHgTg 7jNuMTM6qXYWSWw8OswVXCPZEVfCp4zTFv1ebu9bagfDKR4ppNxwzyz7/CMkir14 4uKKzQ/cy8QND6OR/05zKh4U3ctqyA== =rVpu -----END PGP SIGNATURE----- Merge tag 'for-6.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core retrieve_deps() UAF race due to missing locking of a DM table's list of devices that is managed using dm_{get,put}_device. - Revert DM core's half-baked RCU optimization if IO submitter has set REQ_NOWAIT. Can be revisited, and properly justified, after comprehensively auditing all of DM to also pass GFP_NOWAIT for any allocations if REQ_NOWAIT used. * tag 'for-6.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: don't attempt to queue IO under RCU protection dm: fix a race condition in retrieve_deps	2023-09-15 14:30:54 -07:00
Jens Axboe	a9ce385344	dm: don't attempt to queue IO under RCU protection dm looks up the table for IO based on the request type, with an assumption that if the request is marked REQ_NOWAIT, it's fine to attempt to submit that IO while under RCU read lock protection. This is not OK, as REQ_NOWAIT just means that we should not be sleeping waiting on other IO, it does not mean that we can't potentially schedule. A simple test case demonstrates this quite nicely: int main(int argc, char *argv[]) { struct iovec iov; int fd; fd = open("/dev/dm-0", O_RDONLY \| O_DIRECT); posix_memalign(&iov.iov_base, 4096, 4096); iov.iov_len = 4096; preadv2(fd, &iov, 1, 0, RWF_NOWAIT); return 0; } which will instantly spew: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5580, name: dm-nowait preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 INFO: lockdep is turned off. CPU: 7 PID: 5580 Comm: dm-nowait Not tainted 6.6.0-rc1-g39956d2dcd81 #132 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x11d/0x1b0 __might_resched+0x3c3/0x5e0 ? preempt_count_sub+0x150/0x150 mempool_alloc+0x1e2/0x390 ? mempool_resize+0x7d0/0x7d0 ? lock_sync+0x190/0x190 ? lock_release+0x4b7/0x670 ? internal_get_user_pages_fast+0x868/0x2d40 bio_alloc_bioset+0x417/0x8c0 ? bvec_alloc+0x200/0x200 ? internal_get_user_pages_fast+0xb8c/0x2d40 bio_alloc_clone+0x53/0x100 dm_submit_bio+0x27f/0x1a20 ? lock_release+0x4b7/0x670 ? blk_try_enter_queue+0x1a0/0x4d0 ? dm_dax_direct_access+0x260/0x260 ? rcu_is_watching+0x12/0xb0 ? blk_try_enter_queue+0x1cc/0x4d0 __submit_bio+0x239/0x310 ? __bio_queue_enter+0x700/0x700 ? kvm_clock_get_cycles+0x40/0x60 ? ktime_get+0x285/0x470 submit_bio_noacct_nocheck+0x4d9/0xb80 ? should_fail_request+0x80/0x80 ? preempt_count_sub+0x150/0x150 ? lock_release+0x4b7/0x670 ? __bio_add_page+0x143/0x2d0 ? iov_iter_revert+0x27/0x360 submit_bio_noacct+0x53e/0x1b30 submit_bio_wait+0x10a/0x230 ? submit_bio_wait_endio+0x40/0x40 __blkdev_direct_IO_simple+0x4f8/0x780 ? blkdev_bio_end_io+0x4c0/0x4c0 ? stack_trace_save+0x90/0xc0 ? __bio_clone+0x3c0/0x3c0 ? lock_release+0x4b7/0x670 ? lock_sync+0x190/0x190 ? atime_needs_update+0x3bf/0x7e0 ? timestamp_truncate+0x21b/0x2d0 ? inode_owner_or_capable+0x240/0x240 blkdev_direct_IO.part.0+0x84a/0x1810 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 ? blkdev_read_iter+0x40d/0x530 ? reacquire_held_locks+0x4e0/0x4e0 ? __blkdev_direct_IO_simple+0x780/0x780 ? rcu_is_watching+0x12/0xb0 ? __mark_inode_dirty+0x297/0xd50 ? preempt_count_add+0x72/0x140 blkdev_read_iter+0x2a4/0x530 do_iter_readv_writev+0x2f2/0x3c0 ? generic_copy_file_range+0x1d0/0x1d0 ? fsnotify_perm.part.0+0x25d/0x630 ? security_file_permission+0xd8/0x100 do_iter_read+0x31b/0x880 ? import_iovec+0x10b/0x140 vfs_readv+0x12d/0x1a0 ? vfs_iter_read+0xb0/0xb0 ? rcu_is_watching+0x12/0xb0 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 do_preadv+0x1b3/0x260 ? do_readv+0x370/0x370 __x64_sys_preadv2+0xef/0x150 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5af41ad806 Code: 41 54 41 89 fc 55 44 89 c5 53 48 89 cb 48 83 ec 18 80 3d e4 dd 0d 00 00 74 7a 45 89 c1 49 89 ca 45 31 c0 b8 47 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 be 00 00 00 48 85 c0 79 4a 48 8b 0d da 55 RSP: 002b:00007ffd3145c7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000147 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5af41ad806 RDX: 0000000000000001 RSI: 00007ffd3145c850 RDI: 0000000000000003 RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 00007ffd3145c850 R14: 000055f5f0431dd8 R15: 0000000000000001 </TASK> where in fact it is dm itself that attempts to allocate a bio clone with GFP_NOIO under the rcu read lock, regardless of the request type. Fix this by getting rid of the special casing for REQ_NOWAIT, and just use the normal SRCU protected table lookup. Get rid of the bio based table locking helpers at the same time, as they are now unused. Cc: stable@vger.kernel.org Fixes: `563a225c9f` ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-09-15 15:39:59 -04:00
Mariusz Tkaczyk	c8870379a2	md: Put the right device in md_seq_next If there are multiple arrays in system and one mddevice is marked with MD_DELETED and md_seq_next() is called in the middle of removal then it _get()s proper device but it may _put() deleted one. As a result, active counter may never be zeroed for mddevice and it cannot be removed. Put the device which has been _get with previous md_seq_next() call. Cc: stable@vger.kernel.org Fixes: `12a6caf273` ("md: only delete entries from all_mddevs when the disk is freed") Reported-by: AceLan Kao <acelan@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798 Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com	2023-09-14 10:13:11 -07:00
Mikulas Patocka	f6007dce0c	dm: fix a race condition in retrieve_deps There's a race condition in the multipath target when retrieve_deps races with multipath_message calling dm_get_device and dm_put_device. retrieve_deps walks the list of open devices without holding any lock but multipath may add or remove devices to the list while it is running. The end result may be memory corruption or use-after-free memory access. See this description of a UAF with multipath_message(): https://listman.redhat.com/archives/dm-devel/2022-October/052373.html Fix this bug by introducing a new rw semaphore "devices_lock". We grab devices_lock for read in retrieve_deps and we grab it for write in dm_get_device and dm_put_device. Reported-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Tested-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-09-14 11:18:29 -04:00
Nigel Croxon	df203da47f	md/raid1: fix error: ISO C90 forbids mixed declarations There is a compile error when this commit is added: md: raid1: fix potential OOB in raid1_remove_disk() drivers/md/raid1.c: In function 'raid1_remove_disk': drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] 1844 \| struct raid1_info *p = conf->mirrors + number; \| ^~~~~~ That's because the new code was inserted before the struct. The change is move the struct command above this commit. Fixes: `8b0472b50b` ("md: raid1: fix potential OOB in raid1_remove_disk()") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com	2023-09-11 14:41:58 -07:00
Yu Kuai	99892147f0	md: fix warning for holder mismatch from export_rdev() Commit `a1d7671910` ("md: use mddev->external to select holder in export_rdev()") fix the problem that 'claim_rdev' is used for blkdev_get_by_dev() while 'rdev' is used for blkdev_put(). However, if mddev->external is changed from 0 to 1, then 'rdev' is used for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And this problem can be reporduced reliably by following: New file: mdadm/tests/23rdev-lifetime devname=${dev0##*/} devt=`cat /sys/block/$devname/dev` pid="" runtime=2 clean_up_test() { pill -9 $pid echo clear > /sys/block/md0/md/array_state } trap 'clean_up_test' EXIT add_by_sysfs() { while true; do echo $devt > /sys/block/md0/md/new_dev done } remove_by_sysfs(){ while true; do echo remove > /sys/block/md0/md/dev-${devname}/state done } echo md0 > /sys/module/md_mod/parameters/new_array \|\| die "create md0 failed" add_by_sysfs & pid="$pid $!" remove_by_sysfs & pid="$pid $!" sleep $runtime exit 0 Test cmd: ./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime Test result: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330 Modules linked in: multipath md_mod loop CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50 RIP: 0010:blkdev_put+0x27c/0x330 Call Trace: <TASK> export_rdev.isra.23+0x50/0xa0 [md_mod] mddev_unlock+0x19d/0x300 [md_mod] rdev_attr_store+0xec/0x190 [md_mod] sysfs_kf_write+0x52/0x70 kernfs_fop_write_iter+0x19a/0x2a0 vfs_write+0x3b5/0x770 ksys_write+0x74/0x150 __x64_sys_write+0x22/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix the problem by recording if 'rdev' is used as holder. Fixes: `a1d7671910` ("md: use mddev->external to select holder in export_rdev()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com	2023-09-08 13:16:40 -07:00
Yu Kuai	7deac114be	md: don't dereference mddev after export_rdev() Except for initial reference, mddev->kobject is referenced by rdev->kobject, and if the last rdev is freed, there is no guarantee that mddev is still valid. Hence mddev should not be used anymore after export_rdev(). This problem can be triggered by following test for mdadm at very low rate: New file: mdadm/tests/23rdev-lifetime devname=${dev0##*/} devt=`cat /sys/block/$devname/dev` pid="" runtime=2 clean_up_test() { pill -9 $pid echo clear > /sys/block/md0/md/array_state } trap 'clean_up_test' EXIT add_by_sysfs() { while true; do echo $devt > /sys/block/md0/md/new_dev done } remove_by_sysfs(){ while true; do echo remove > /sys/block/md0/md/dev-${devname}/state done } echo md0 > /sys/module/md_mod/parameters/new_array \|\| die "create md0 failed" add_by_sysfs & pid="$pid $!" remove_by_sysfs & pid="$pid $!" sleep $runtime exit 0 Test cmd: ./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime Test result: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bcb: 0000 [#4] PREEMPT SMP CPU: 0 PID: 1292 Comm: test Tainted: G D W 6.5.0-rc2-00121-g01e55c376936 #562 RIP: 0010:md_wakeup_thread+0x9e/0x320 [md_mod] Call Trace: <TASK> mddev_unlock+0x1b6/0x310 [md_mod] rdev_attr_store+0xec/0x190 [md_mod] sysfs_kf_write+0x52/0x70 kernfs_fop_write_iter+0x19a/0x2a0 vfs_write+0x3b5/0x770 ksys_write+0x74/0x150 __x64_sys_write+0x22/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix this problem by don't dereference mddev after export_rdev(). Fixes: `3ce94ce5d0` ("md: fix duplicate filename for rdev") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-2-yukuai1@huaweicloud.com	2023-09-08 13:16:10 -07:00
Linus Torvalds	3d3dfeb3ae	for-6.6/block-2023-08-28 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs d0Y/zCZf1A== =p1bd -----END PGP SIGNATURE----- Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...	2023-08-29 20:21:42 -07:00
David Jeffery	cc22b5407e	md: raid0: account for split bio in iostat accounting When a bio is split by md raid0, the newly created bio will not be tracked by md for I/O accounting. Only the portion of I/O still assigned to the original bio which was reduced by the split will be accounted for. This results in md iostat data sometimes showing I/O values far below the actual amount of data being sent through md. md_account_bio() needs to be called for all bio generated by the bio split. A simple example of the issue was generated using a raid0 device on partitions to the same device. Since all raid0 I/O then goes to one device, it makes it easy to see a gap between the md device and its sd storage. Reading an lvm device on top of the md device, the iostat output (some 0 columns and extra devices removed to make the data more compact) was: Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 md2 1364.00 411496.00 0.00 0.00 411496 sde 1734.00 646144.00 0.00 0.00 646144 md2 1699.00 510680.00 0.00 0.00 510680 sde 2155.00 802784.00 0.00 0.00 802784 md2 803.00 241480.00 0.00 0.00 241480 sde 1016.00 377888.00 0.00 0.00 377888 md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 I/O was generated doing large direct I/O reads (12M) with dd to a linear lvm volume on top of the 4 leg raid0 device. The md2 reads were showing as roughly 2/3 of the reads to the sde device containing all of md2's raid partitions. The sum of reads to sde was `1826816` kB, which was the expected amount as it was the amount read by dd. With the patch, the total reads from md will match the reads from sde and be consistent with the amount of I/O generated. Fixes: `10764815ff` ("md: add io accounting for raid0 and raid5") Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com	2023-08-17 21:11:31 -07:00
Jan Kara	319ff40a54	md/raid0: Fix performance regression for large sequential writes Commit `f00d7c85be` ("md/raid0: fix up bio splitting.") among other things changed how bio that needs to be split is submitted. Before this commit, we have split the bio, mapped and submitted each part. After this commit, we map only the first part of the split bio and submit the second part unmapped. Due to bio sorting in __submit_bio_noacct() this results in the following request ordering: 9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392 Split off chunk-sized (1024 sectors) request: 9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488 Request is unaligned to the chunk so it's split in raid0_make_request(). This is the first part mapped and punted to bio_list: 8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464 Now raid0_make_request() returns, second part is postponed on bio_list. __submit_bio_noacct() resorts the bio_list, mapped request is submitted to the underlying device: 8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016 Now we take another request from the bio_list which is the remainder of the original huge request. Split off another chunk-sized bit from it and the situation repeats: 9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512 8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488 8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016 ... 9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write] 8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512 8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016 ... This repeats until we consume the whole original huge request. Now we finally get to processing the second parts of the split off requests (in reverse order): 8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920 8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896 8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872 8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848 ... I guess it is obvious that this IO pattern is extremely inefficient way to perform sequential IO. It also makes bio_list to grow to rather long lengths. Change raid0_make_request() to map both parts of the split bio. Since we know we are provided with at most chunk-sized bios, we will always need to split the incoming bio at most once. Fixes: `f00d7c85be` ("md/raid0: fix up bio splitting.") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz Signed-off-by: Song Liu <song@kernel.org>	2023-08-17 21:11:31 -07:00

... 5 6 7 8 9 ...

8305 Commits