2019-05-20 17:08:12 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
md.c : Multiple Devices driver for Linux
|
2014-09-30 04:23:59 +00:00
|
|
|
Copyright (C) 1998, 1999, 2000 Ingo Molnar
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
completely rewritten, based on the MD driver code from Marc Zyngier
|
|
|
|
|
|
|
|
Changes:
|
|
|
|
|
|
|
|
- RAID-1/RAID-5 extensions by Miguel de Icaza, Gadi Oxman, Ingo Molnar
|
|
|
|
- RAID-6 extensions by H. Peter Anvin <hpa@zytor.com>
|
|
|
|
- boot support for linear and striped mode by Harald Hoyer <HarryH@Royal.Net>
|
|
|
|
- kerneld support by Boris Tobotras <boris@xtalk.msk.su>
|
|
|
|
- kmod support by: Cyrus Durgin
|
|
|
|
- RAID0 bugfixes: Mark Anthony Lisher <markal@iname.com>
|
|
|
|
- Devfs support by Richard Gooch <rgooch@atnf.csiro.au>
|
|
|
|
|
|
|
|
- lots of fixes and improvements to the RAID1/RAID5 and generic
|
|
|
|
RAID code (such as request based resynchronization):
|
|
|
|
|
|
|
|
Neil Brown <neilb@cse.unsw.edu.au>.
|
|
|
|
|
2005-06-22 00:17:14 +00:00
|
|
|
- persistent bitmap code
|
|
|
|
Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, Inc.
|
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
|
|
|
|
Errors, Warnings, etc.
|
|
|
|
Please use:
|
|
|
|
pr_crit() for error conditions that risk data loss
|
|
|
|
pr_err() for error conditions that are unexpected, like an IO error
|
|
|
|
or internal inconsistency
|
|
|
|
pr_warn() for error conditions that could have been predicted, like
|
|
|
|
adding a device to an array when it has incompatible metadata
|
|
|
|
pr_info() for interesting, very rare events, like an array starting
|
|
|
|
or stopping, or resync starting or stopping
|
|
|
|
pr_debug() for everything else.
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
2019-06-14 09:10:36 +00:00
|
|
|
#include <linux/sched/mm.h>
|
2017-02-08 17:51:30 +00:00
|
|
|
#include <linux/sched/signal.h>
|
2005-09-09 23:23:56 +00:00
|
|
|
#include <linux/kthread.h>
|
2009-03-31 03:33:13 +00:00
|
|
|
#include <linux/blkdev.h>
|
2021-09-20 12:33:27 +00:00
|
|
|
#include <linux/blk-integrity.h>
|
2015-12-25 02:20:34 +00:00
|
|
|
#include <linux/badblocks.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/sysctl.h>
|
2009-03-31 03:33:13 +00:00
|
|
|
#include <linux/seq_file.h>
|
2011-09-16 06:31:11 +00:00
|
|
|
#include <linux/fs.h>
|
[PATCH] md: make /proc/mdstat pollable
With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.
It is similar to the poll-ability of /proc/mounts, though different in that:
We always report that the file is readable (because face it, it is, even if
only for EOF).
We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).
Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.
md_new_event takes an 'mddev' which isn't currently used, but it will be soon.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:20:30 +00:00
|
|
|
#include <linux/poll.h>
|
2006-06-26 07:27:37 +00:00
|
|
|
#include <linux/ctype.h>
|
2009-12-15 02:01:06 +00:00
|
|
|
#include <linux/string.h>
|
2008-10-13 00:55:12 +00:00
|
|
|
#include <linux/hdreg.h>
|
|
|
|
#include <linux/proc_fs.h>
|
|
|
|
#include <linux/random.h>
|
2021-09-20 12:33:25 +00:00
|
|
|
#include <linux/major.h>
|
2011-07-03 17:58:33 +00:00
|
|
|
#include <linux/module.h>
|
2008-10-13 00:55:12 +00:00
|
|
|
#include <linux/reboot.h>
|
2005-06-22 00:17:14 +00:00
|
|
|
#include <linux/file.h>
|
2009-12-14 01:50:05 +00:00
|
|
|
#include <linux/compat.h>
|
2008-10-14 22:09:21 +00:00
|
|
|
#include <linux/delay.h>
|
2009-03-31 03:33:13 +00:00
|
|
|
#include <linux/raid/md_p.h>
|
|
|
|
#include <linux/raid/md_u.h>
|
2020-03-24 07:25:19 +00:00
|
|
|
#include <linux/raid/detect.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
|
|
|
#include <linux/slab.h>
|
MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.
We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.
It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.
mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform an atomic operation here when absolutely
needed.
md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:14 +00:00
|
|
|
#include <linux/percpu-refcount.h>
|
2020-03-25 15:48:42 +00:00
|
|
|
#include <linux/part_stat.h>
|
2017-03-15 03:05:14 +00:00
|
|
|
|
2016-11-18 17:44:08 +00:00
|
|
|
#include <trace/events/block.h>
|
2009-03-31 03:33:13 +00:00
|
|
|
#include "md.h"
|
2017-10-10 21:02:41 +00:00
|
|
|
#include "md-bitmap.h"
|
2014-03-29 15:01:53 +00:00
|
|
|
#include "md-cluster.h"
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-05-23 02:10:17 +00:00
|
|
|
/* pers_list is a list of registered personalities protected by pers_lock. */
|
2006-01-06 08:20:36 +00:00
|
|
|
static LIST_HEAD(pers_list);
|
2005-04-16 22:20:36 +00:00
|
|
|
static DEFINE_SPINLOCK(pers_lock);
|
|
|
|
|
2023-02-14 03:19:22 +00:00
|
|
|
static const struct kobj_type md_ktype;
|
2018-06-08 00:52:54 +00:00
|
|
|
|
2014-03-29 15:01:53 +00:00
|
|
|
struct md_cluster_operations *md_cluster_ops;
|
2014-06-07 07:39:37 +00:00
|
|
|
EXPORT_SYMBOL(md_cluster_ops);
|
2019-04-04 16:56:14 +00:00
|
|
|
static struct module *md_cluster_mod;
|
2014-03-29 15:01:53 +00:00
|
|
|
|
2008-05-23 20:04:38 +00:00
|
|
|
static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
|
2010-10-15 13:36:08 +00:00
|
|
|
static struct workqueue_struct *md_wq;
|
|
|
|
static struct workqueue_struct *md_misc_wq;
|
2023-05-29 13:11:04 +00:00
|
|
|
struct workqueue_struct *md_bitmap_wq;
|
2008-05-23 20:04:38 +00:00
|
|
|
|
2013-04-24 01:42:41 +00:00
|
|
|
static int remove_and_add_spares(struct mddev *mddev,
|
|
|
|
struct md_rdev *this);
|
2014-12-15 01:56:57 +00:00
|
|
|
static void mddev_detach(struct mddev *mddev);
|
md: fix duplicate filename for rdev
Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.
Follow up patches try to fix this problem by flushing workqueue, however,
flush_rdev_wq() is just dead code, the progress in
md_kick_rdev_from_array():
1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);
So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.
flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
a small window between flush_workqueue() and mddev_lock() that other
contexts can queue new work, hence the problem is not solved completely.
sysfs already has APIs that let an attribute delete itself through its writer, and
these APIs, specifically sysfs_break/unbreak_active_protection(), are used
to support deleting rdev synchronously. Therefore, the above commit can be
reverted, and sysfs duplicate filename can be avoided.
A new mdadm regression test is proposed as well([1]).
[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
2023-05-23 01:27:27 +00:00
|
|
|
static void export_rdev(struct md_rdev *rdev, struct mddev *mddev);
|
2023-05-23 02:10:17 +00:00
|
|
|
static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
|
2013-04-24 01:42:41 +00:00
|
|
|
|
2009-12-14 01:49:58 +00:00
|
|
|
/*
|
|
|
|
* Default number of read corrections we'll attempt on an rdev
|
|
|
|
* before ejecting it from the array. We divide the read error
|
|
|
|
* count by 2 for every hour elapsed between read errors.
|
|
|
|
*/
|
|
|
|
#define MD_DEFAULT_MAX_CORRECTED_READ_ERRORS 20
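The halving rule described in the comment above can be sketched in user space as follows. This is only an illustration under stated assumptions: `decay_read_errors` is a hypothetical helper name, not a function from md.c, and the real driver may track elapsed time differently.

```c
#include <limits.h>

/*
 * Hypothetical helper, not taken from md.c: halve a read-error count once
 * for every full hour elapsed since the previous read error.
 */
static unsigned int decay_read_errors(unsigned int count, unsigned long hours)
{
	/* Shifting by the type width or more is undefined in C; the result is 0 anyway. */
	if (hours >= sizeof(count) * CHAR_BIT)
		return 0;
	return count >> hours;
}
```

Starting from the default limit of 20, a device that stays quiet for three hours would have its count reduced to 2 under this sketch, so occasional errors spread over a long period do not add up to an ejection.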
|
2020-07-20 18:08:52 +00:00
|
|
|
/* Default safemode delay: 200 msec */
|
|
|
|
#define DEFAULT_SAFEMODE_DELAY ((200 * HZ)/1000 +1)
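As a worked example of the arithmetic in DEFAULT_SAFEMODE_DELAY, the sketch below evaluates the same expression for a tick rate passed in as a parameter. The function name `safemode_delay_jiffies` is hypothetical and exists only for illustration; in the kernel, HZ is a compile-time constant.

```c
#include <assert.h>

/*
 * Hypothetical user-space mirror of DEFAULT_SAFEMODE_DELAY, with the tick
 * rate passed in as a parameter instead of the kernel's HZ constant.
 */
static long safemode_delay_jiffies(long hz)
{
	assert(hz > 0);
	/* 200 ms expressed in ticks (truncated), plus one tick so it is never zero. */
	return (200 * hz) / 1000 + 1;
}
```

With hz = 100 this gives 21 ticks, with hz = 250 it gives 51, and with hz = 1000 it gives 201. The extra tick keeps the delay nonzero even when the integer division truncates to 0.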
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
|
|
|
|
* is 1000 KB/sec, so the extra system load does not show up that much.
|
|
|
|
* Increase it if you want to have more _guaranteed_ speed. Note that
|
2005-09-10 07:26:54 +00:00
|
|
|
* the RAID driver will use the maximum available bandwidth if the IO
|
2005-04-16 22:20:36 +00:00
|
|
|
* subsystem is idle. There is also an 'absolute maximum' reconstruction
|
|
|
|
* speed limit - in case reconstruction slows down your system despite
|
|
|
|
* idle IO detection.
|
|
|
|
*
|
|
|
|
* you can change it via /proc/sys/dev/raid/speed_limit_min and _max.
|
2006-01-06 08:21:36 +00:00
|
|
|
* or /sys/block/mdX/md/sync_speed_{min,max}
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
static int sysctl_speed_limit_min = 1000;
|
|
|
|
static int sysctl_speed_limit_max = 200000;
|
2011-10-11 05:47:53 +00:00
|
|
|
static inline int speed_min(struct mddev *mddev)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
|
|
|
return mddev->sync_speed_min ?
|
|
|
|
mddev->sync_speed_min : sysctl_speed_limit_min;
|
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static inline int speed_max(struct mddev *mddev)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
|
|
|
return mddev->sync_speed_max ?
|
|
|
|
mddev->sync_speed_max : sysctl_speed_limit_max;
|
|
|
|
}
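The fallback pattern used by speed_min() and speed_max() can be tried outside the kernel with the sketch below. The struct `array_limits` and the `effective_*` functions are hypothetical stand-ins, and the global defaults simply copy the initial sysctl values shown above.

```c
/* Hypothetical stand-in for the two struct mddev fields that speed_min()/speed_max() read. */
struct array_limits {
	int sync_speed_min;	/* 0 means "no per-array override" */
	int sync_speed_max;	/* 0 means "no per-array override" */
};

/* Copies of the initial sysctl values, in KB/sec. */
static int global_speed_limit_min = 1000;
static int global_speed_limit_max = 200000;

/* A nonzero per-array value wins; otherwise fall back to the system-wide default. */
static int effective_speed_min(const struct array_limits *a)
{
	return a->sync_speed_min ? a->sync_speed_min : global_speed_limit_min;
}

static int effective_speed_max(const struct array_limits *a)
{
	return a->sync_speed_max ? a->sync_speed_max : global_speed_limit_max;
}
```

One consequence of using zero as the "unset" marker is that a per-array limit of exactly 0 cannot be expressed; it always resolves to the global value.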
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2019-12-23 09:49:00 +00:00
|
|
|
static void rdev_uninit_serial(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
|
|
|
|
return;
|
|
|
|
|
2019-12-23 09:49:01 +00:00
|
|
|
kvfree(rdev->serial);
|
2019-12-23 09:49:00 +00:00
|
|
|
rdev->serial = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rdevs_uninit_serial(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
rdev_uninit_serial(rdev);
|
|
|
|
}
|
|
|
|
|
2019-12-23 09:48:53 +00:00
|
|
|
static int rdev_init_serial(struct md_rdev *rdev)
|
2019-06-19 09:30:46 +00:00
|
|
|
{
|
2019-12-23 09:49:01 +00:00
|
|
|
/* serial_nums equals BARRIER_BUCKETS_NR */
|
|
|
|
int i, serial_nums = 1 << ((PAGE_SHIFT - ilog2(sizeof(atomic_t))));
|
2019-12-23 09:49:00 +00:00
|
|
|
struct serial_in_rdev *serial = NULL;
|
|
|
|
|
|
|
|
if (test_bit(CollisionCheck, &rdev->flags))
|
|
|
|
return 0;
|
|
|
|
|
2019-12-23 09:49:01 +00:00
|
|
|
serial = kvmalloc(sizeof(struct serial_in_rdev) * serial_nums,
|
|
|
|
GFP_KERNEL);
|
2019-12-23 09:49:00 +00:00
|
|
|
if (!serial)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2019-12-23 09:49:01 +00:00
|
|
|
for (i = 0; i < serial_nums; i++) {
|
|
|
|
struct serial_in_rdev *serial_tmp = &serial[i];
|
|
|
|
|
|
|
|
spin_lock_init(&serial_tmp->serial_lock);
|
|
|
|
serial_tmp->serial_rb = RB_ROOT_CACHED;
|
|
|
|
init_waitqueue_head(&serial_tmp->serial_io_wait);
|
|
|
|
}
|
|
|
|
|
2019-12-23 09:49:00 +00:00
|
|
|
rdev->serial = serial;
|
2019-12-23 09:48:53 +00:00
|
|
|
set_bit(CollisionCheck, &rdev->flags);
|
2019-06-19 09:30:46 +00:00
|
|
|
|
2019-12-23 09:49:00 +00:00
|
|
|
return 0;
|
2019-06-19 09:30:46 +00:00
|
|
|
}
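The bucket count computed at the top of rdev_init_serial() is the number of atomic_t-sized slots that fit in one page. The sketch below reproduces that arithmetic in user space. `floor_log2` is a hypothetical stand-in for the kernel's ilog2(), and the 4-byte element size in the note after the code is an assumption about sizeof(atomic_t).

```c
#include <stddef.h>

/* Floor of log2(n) for n > 0; a stand-in for the kernel's ilog2(). */
static unsigned int floor_log2(size_t n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * For a power-of-two elem_size, the number of elements that fit in one
 * page of 2^page_shift bytes.
 */
static unsigned int serial_bucket_count(unsigned int page_shift, size_t elem_size)
{
	return 1u << (page_shift - floor_log2(elem_size));
}
```

With 4 KiB pages (page_shift = 12) and a 4-byte element, this yields 1 << (12 - 2) = 1024 buckets, each of which gets its own lock, rb-tree root and wait queue in the loop above.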
|
|
|
|
|
2019-12-23 09:49:00 +00:00
|
|
|
static int rdevs_init_serial(struct mddev *mddev)
|
2019-12-23 09:48:55 +00:00
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
2019-12-23 09:49:00 +00:00
|
|
|
int ret = 0;
|
2019-12-23 09:48:55 +00:00
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
2019-12-23 09:49:00 +00:00
|
|
|
ret = rdev_init_serial(rdev);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2019-12-23 09:48:55 +00:00
|
|
|
}
|
2019-12-23 09:49:00 +00:00
|
|
|
|
|
|
|
/* Free all resources if the pool does not exist */
|
|
|
|
if (ret && !mddev->serial_info_pool)
|
|
|
|
rdevs_uninit_serial(mddev);
|
|
|
|
|
|
|
|
return ret;
|
2019-12-23 09:48:55 +00:00
|
|
|
}
|
|
|
|
|
2019-06-14 09:10:36 +00:00
|
|
|
/*
|
2019-12-23 09:48:57 +00:00
|
|
|
* rdev needs serialization enabled if it meets these conditions:
|
|
|
|
* 1. it is a multi-queue device flagged with WriteMostly.
|
|
|
|
* 2. the write-behind mode is enabled.
|
|
|
|
*/
|
|
|
|
static int rdev_need_serial(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
return (rdev && rdev->mddev->bitmap_info.max_write_behind > 0 &&
|
2020-06-26 08:01:56 +00:00
|
|
|
rdev->bdev->bd_disk->queue->nr_hw_queues != 1 &&
|
2019-12-23 09:48:57 +00:00
|
|
|
test_bit(WriteMostly, &rdev->flags));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Init resource for rdev(s), then create serial_info_pool if:
|
|
|
|
* 1. rdev is the first device for which rdev_need_serial() returns true.
|
|
|
|
* 2. rdev is NULL, which means we want to enable serialization for all rdevs.
|
2019-06-14 09:10:36 +00:00
|
|
|
*/
|
2019-12-23 09:48:53 +00:00
|
|
|
void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
|
2019-12-23 09:48:55 +00:00
|
|
|
bool is_suspend)
|
2019-06-14 09:10:36 +00:00
|
|
|
{
|
2019-12-23 09:49:00 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
2019-12-23 09:48:57 +00:00
|
|
|
if (rdev && !rdev_need_serial(rdev) &&
|
|
|
|
!test_bit(CollisionCheck, &rdev->flags))
|
2019-06-14 09:10:36 +00:00
|
|
|
return;
|
|
|
|
|
2019-12-23 09:48:57 +00:00
|
|
|
if (!is_suspend)
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
|
|
|
|
if (!rdev)
|
2019-12-23 09:49:00 +00:00
|
|
|
ret = rdevs_init_serial(mddev);
|
2019-12-23 09:48:57 +00:00
|
|
|
else
|
2019-12-23 09:49:00 +00:00
|
|
|
ret = rdev_init_serial(rdev);
|
|
|
|
if (ret)
|
|
|
|
goto abort;
|
2019-12-23 09:48:57 +00:00
|
|
|
|
2019-12-23 09:48:53 +00:00
|
|
|
if (mddev->serial_info_pool == NULL) {
|
2020-04-09 14:17:23 +00:00
|
|
|
/*
|
|
|
|
* already in memalloc noio context by
|
|
|
|
* mddev_suspend()
|
|
|
|
*/
|
2019-12-23 09:48:53 +00:00
|
|
|
mddev->serial_info_pool =
|
|
|
|
mempool_create_kmalloc_pool(NR_SERIAL_INFOS,
|
|
|
|
sizeof(struct serial_info));
|
2019-12-23 09:49:00 +00:00
|
|
|
if (!mddev->serial_info_pool) {
|
|
|
|
rdevs_uninit_serial(mddev);
|
2019-12-23 09:48:53 +00:00
|
|
|
pr_err("can't alloc memory pool for serialization\n");
|
2019-12-23 09:49:00 +00:00
|
|
|
}
|
2019-06-14 09:10:36 +00:00
|
|
|
}
|
2019-12-23 09:49:00 +00:00
|
|
|
|
|
|
|
abort:
|
2019-12-23 09:48:57 +00:00
|
|
|
if (!is_suspend)
|
|
|
|
mddev_resume(mddev);
|
2019-06-14 09:10:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2019-12-23 09:48:57 +00:00
|
|
|
* Free resource from rdev(s), and destroy serial_info_pool under conditions:
|
|
|
|
* 1. rdev is the last device flagged with CollisionCheck.
|
|
|
|
* 2. when bitmap is destroyed while policy is not enabled.
|
|
|
|
* 3. when the policy is disabled, the pool is destroyed only when no rdev needs it.
|
2019-06-14 09:10:36 +00:00
|
|
|
*/
|
2019-12-23 09:49:00 +00:00
|
|
|
void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
|
|
|
|
bool is_suspend)
|
2019-06-14 09:10:36 +00:00
|
|
|
{
|
2019-12-23 09:48:55 +00:00
|
|
|
if (rdev && !test_bit(CollisionCheck, &rdev->flags))
|
2019-06-14 09:10:36 +00:00
|
|
|
return;
|
|
|
|
|
2019-12-23 09:48:53 +00:00
|
|
|
if (mddev->serial_info_pool) {
|
2019-06-14 09:10:36 +00:00
|
|
|
struct md_rdev *temp;
|
2019-12-23 09:48:57 +00:00
|
|
|
int num = 0; /* used to track if other rdevs need the pool */
|
2019-06-14 09:10:36 +00:00
|
|
|
|
2019-12-23 09:48:55 +00:00
|
|
|
if (!is_suspend)
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
rdev_for_each(temp, mddev) {
|
|
|
|
if (!rdev) {
|
2019-12-23 09:49:00 +00:00
|
|
|
if (!mddev->serialize_policy ||
|
|
|
|
!rdev_need_serial(temp))
|
|
|
|
rdev_uninit_serial(temp);
|
2019-12-23 09:48:57 +00:00
|
|
|
else
|
|
|
|
num++;
|
|
|
|
} else if (temp != rdev &&
|
|
|
|
test_bit(CollisionCheck, &temp->flags))
|
2019-06-14 09:10:36 +00:00
|
|
|
num++;
|
2019-12-23 09:48:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (rdev)
|
2019-12-23 09:49:00 +00:00
|
|
|
rdev_uninit_serial(rdev);
|
2019-12-23 09:48:57 +00:00
|
|
|
|
|
|
|
if (num)
|
|
|
|
pr_info("The mempool could be used by other devices\n");
|
|
|
|
else {
|
2019-12-23 09:48:53 +00:00
|
|
|
mempool_destroy(mddev->serial_info_pool);
|
|
|
|
mddev->serial_info_pool = NULL;
|
2019-06-14 09:10:36 +00:00
|
|
|
}
|
2019-12-23 09:48:55 +00:00
|
|
|
if (!is_suspend)
|
|
|
|
mddev_resume(mddev);
|
2019-06-14 09:10:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
static struct ctl_table_header *raid_table_header;
|
|
|
|
|
2013-11-14 04:16:18 +00:00
|
|
|
static struct ctl_table raid_table[] = {
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
.procname = "speed_limit_min",
|
|
|
|
.data = &sysctl_speed_limit_min,
|
|
|
|
.maxlen = sizeof(int),
|
2006-07-10 11:44:18 +00:00
|
|
|
.mode = S_IRUGO|S_IWUSR,
|
2009-11-16 11:11:48 +00:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 22:20:36 +00:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "speed_limit_max",
|
|
|
|
.data = &sysctl_speed_limit_max,
|
|
|
|
.maxlen = sizeof(int),
|
2006-07-10 11:44:18 +00:00
|
|
|
.mode = S_IRUGO|S_IWUSR,
|
2009-11-16 11:11:48 +00:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 22:20:36 +00:00
|
|
|
},
|
2009-11-05 22:34:02 +00:00
|
|
|
{ }
|
2005-04-16 22:20:36 +00:00
|
|
|
};
|
|
|
|
|
[PATCH] md: allow md arrays to be started read-only (module parameter).
When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:36 +00:00
|
|
|
static int start_readonly;
|
|
|
|
|
2017-04-12 06:26:13 +00:00
|
|
|
/*
|
|
|
|
* The original mechanism for creating an md device is to create
|
|
|
|
* a device node in /dev and to open it. This causes races with device-close.
|
|
|
|
* The preferred method is to write to the "new_array" module parameter.
|
|
|
|
* This can avoid races.
|
|
|
|
* Setting create_on_open to false disables the original mechanism
|
|
|
|
* so all the races disappear.
|
|
|
|
*/
|
|
|
|
static bool create_on_open = true;
|
|
|
|
|
2006-01-06 08:20:30 +00:00
|
|
|
/*
|
|
|
|
* We have a system wide 'event count' that is incremented
|
|
|
|
* on any 'interesting' event, and readers of /proc/mdstat
|
|
|
|
* can use 'poll' or 'select' to find out when the event
|
|
|
|
* count increases.
|
|
|
|
*
|
|
|
|
* Events are:
|
|
|
|
* start array, stop array, error, add device, remove device,
|
|
|
|
* start build, activate spare
|
|
|
|
*/
|
2006-01-06 08:20:43 +00:00
|
|
|
static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
|
2006-01-06 08:20:30 +00:00
|
|
|
static atomic_t md_event_count;
|
2021-10-04 15:34:53 +00:00
|
|
|
void md_new_event(void)
|
2006-01-06 08:20:30 +00:00
|
|
|
{
|
|
|
|
atomic_inc(&md_event_count);
|
|
|
|
wake_up(&md_event_waiters);
|
|
|
|
}
|
2006-03-27 09:18:10 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_new_event);
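From user space, the event counter bumped by md_new_event() is observed by polling /proc/mdstat for POLLPRI. The sketch below shows one way a watcher might do that. `wait_for_md_event` is a hypothetical helper name, and reading the file before polling is an assumption based on the commit description, which says the per-file counter is reset when the start of the file is read.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if POLLPRI was reported on the file at
 * path, 0 on timeout (or if only other events were reported), -1 on error.
 */
int wait_for_md_event(const char *path, int timeout_ms)
{
	char buf[4096];
	struct pollfd pfd;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Read from the start of the file so later changes count as new events. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	pfd.fd = fd;
	pfd.events = POLLPRI;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);
	close(fd);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A monitoring loop would typically call this with "/proc/mdstat" and a timeout, then re-read the file whenever it returns 1. On a system without md loaded the open() fails and the helper returns -1.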
|
2006-01-06 08:20:30 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
 * Enables iteration over the list of all existing md arrays.
|
|
|
|
* all_mddevs_lock protects this list.
|
|
|
|
*/
|
|
|
|
static LIST_HEAD(all_mddevs);
|
|
|
|
static DEFINE_SPINLOCK(all_mddevs_lock);
|
|
|
|
|
2009-03-31 03:39:39 +00:00
|
|
|
/* Rather than calling directly into the personality make_request function,
|
|
|
|
* IO requests come here first so that we can check if the device is
|
|
|
|
* being suspended pending a reconfiguration.
|
|
|
|
* We hold a refcount over the call to ->make_request. By the time that
|
|
|
|
* call has finished, the bio has been linked into some internal structure
|
|
|
|
* and so is visible to ->quiesce(), so we don't need the refcount any more.
|
|
|
|
*/
|
2017-10-17 02:46:43 +00:00
|
|
|
static bool is_suspended(struct mddev *mddev, struct bio *bio)
|
|
|
|
{
|
2023-01-31 05:17:09 +00:00
|
|
|
if (is_md_suspended(mddev))
|
2017-10-17 02:46:43 +00:00
|
|
|
return true;
|
|
|
|
if (bio_data_dir(bio) != WRITE)
|
|
|
|
return false;
|
|
|
|
if (mddev->suspend_lo >= mddev->suspend_hi)
|
|
|
|
return false;
|
|
|
|
if (bio->bi_iter.bi_sector >= mddev->suspend_hi)
|
|
|
|
return false;
|
|
|
|
if (bio_end_sector(bio) < mddev->suspend_lo)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
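The decision made by is_suspended() can be restated as a small user-space model. This is a sketch under stated assumptions: `struct suspend_state` and `request_must_wait` are hypothetical names, the `whole_array` field stands in for is_md_suspended(), and `end` is taken to be one past the last sector, as with bio_end_sector().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical stand-in for the mddev fields that is_suspended() reads. */
struct suspend_state {
	bool     whole_array;  /* models is_md_suspended(mddev) */
	sector_t suspend_lo;   /* start of the suspended window */
	sector_t suspend_hi;   /* end of the suspended window */
};

/*
 * start is the first sector of the request, end is one past its last
 * sector. The checks are in the same order as in is_suspended().
 */
bool request_must_wait(const struct suspend_state *s, bool is_write,
		       sector_t start, sector_t end)
{
	if (s->whole_array)
		return true;   /* everything waits, reads included */
	if (!is_write)
		return false;  /* the window only holds back writes */
	if (s->suspend_lo >= s->suspend_hi)
		return false;  /* empty window */
	if (start >= s->suspend_hi)
		return false;  /* entirely above the window */
	if (end < s->suspend_lo)
		return false;  /* entirely below the window */
	return true;
}
```

Because the last comparison uses `<` rather than `<=`, a write whose end equals suspend_lo is still held back in this model, which matches the source as written.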
|
|
|
|
|
2017-09-21 17:23:35 +00:00
|
|
|
void md_handle_request(struct mddev *mddev, struct bio *bio)
|
|
|
|
{
|
|
|
|
check_suspended:
|
2017-10-17 02:46:43 +00:00
|
|
|
if (is_suspended(mddev, bio)) {
|
2017-09-21 17:23:35 +00:00
|
|
|
DEFINE_WAIT(__wait);
|
2021-12-21 20:06:19 +00:00
|
|
|
/* Bail out if REQ_NOWAIT is set for the bio */
|
|
|
|
if (bio->bi_opf & REQ_NOWAIT) {
|
|
|
|
bio_wouldblock_error(bio);
|
|
|
|
return;
|
|
|
|
}
|
2017-09-21 17:23:35 +00:00
|
|
|
for (;;) {
|
|
|
|
prepare_to_wait(&mddev->sb_wait, &__wait,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
2017-10-17 02:46:43 +00:00
|
|
|
if (!is_suspended(mddev, bio))
|
2017-09-21 17:23:35 +00:00
|
|
|
break;
|
|
|
|
schedule();
|
|
|
|
}
|
|
|
|
finish_wait(&mddev->sb_wait, &__wait);
|
|
|
|
}
|
2023-01-31 05:17:10 +00:00
|
|
|
if (!percpu_ref_tryget_live(&mddev->active_io))
|
|
|
|
goto check_suspended;
|
2017-09-21 17:23:35 +00:00
|
|
|
|
|
|
|
if (!mddev->pers->make_request(mddev, bio)) {
|
2023-01-31 05:17:10 +00:00
|
|
|
percpu_ref_put(&mddev->active_io);
|
2017-09-21 17:23:35 +00:00
|
|
|
goto check_suspended;
|
|
|
|
}
|
|
|
|
|
2023-01-31 05:17:10 +00:00
|
|
|
percpu_ref_put(&mddev->active_io);
|
2017-09-21 17:23:35 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_handle_request);
|
|
|
|
|
2021-10-12 11:12:24 +00:00
|
|
|
static void md_submit_bio(struct bio *bio)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2010-03-25 05:20:56 +00:00
|
|
|
const int rw = bio_data_dir(bio);
|
2021-01-24 10:02:34 +00:00
|
|
|
struct mddev *mddev = bio->bi_bdev->bd_disk->private_data;
|
2010-03-25 05:20:56 +00:00
|
|
|
|
2020-07-02 11:35:02 +00:00
|
|
|
if (mddev == NULL || mddev->pers == NULL) {
|
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. The user might think the written data is correctly stored in
the array, but instead garbage was written, given that raid0 does striping
(and so requires all its members to be working in order not to corrupt
data). For md/linear, writes to the available members will work fine, but
writes that go to the missing member(s) cause file corruption, because
the portion of the writes destined for the missing devices is never
actually written.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.
With this patch, the filesystem reacts much faster to a missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
to the disk, and after a machine reboot e2fsck shows some severe fs errors
that demand fixing. This patch was tested on ext4 and xfs filesystems, and
requires an 'mdadm' counterpart to handle the 'broken' state.
Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 19:49:00 +00:00
|
|
|
bio_io_error(bio);
|
2021-10-12 11:12:24 +00:00
|
|
|
return;
|
2019-09-03 19:49:00 +00:00
|
|
|
}
|
|
|
|
|
2020-07-02 11:35:02 +00:00
|
|
|
if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) {
|
2009-03-31 03:39:39 +00:00
|
|
|
bio_io_error(bio);
|
2021-10-12 11:12:24 +00:00
|
|
|
return;
|
2009-03-31 03:39:39 +00:00
|
|
|
}
|
2020-07-02 11:35:02 +00:00
|
|
|
|
2022-07-27 16:22:55 +00:00
|
|
|
bio = bio_split_to_limits(bio);
|
2023-01-04 15:51:19 +00:00
|
|
|
if (!bio)
|
|
|
|
return;
|
2020-07-02 11:35:02 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro == MD_RDONLY && unlikely(rw == WRITE)) {
|
2015-07-20 13:29:37 +00:00
|
|
|
if (bio_sectors(bio) != 0)
|
2017-06-03 07:38:06 +00:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
2015-07-20 13:29:37 +00:00
|
|
|
bio_endio(bio);
|
2021-10-12 11:12:24 +00:00
|
|
|
return;
|
2013-02-21 02:28:09 +00:00
|
|
|
}
|
2010-03-25 05:20:56 +00:00
|
|
|
|
2016-04-25 23:52:38 +00:00
|
|
|
/* bio could be mergeable after being passed to the underlying layer */
|
2016-08-05 21:35:16 +00:00
|
|
|
bio->bi_opf &= ~REQ_NOMERGE;
|
2017-09-21 17:23:35 +00:00
|
|
|
|
|
|
|
md_handle_request(mddev, bio);
|
2009-03-31 03:39:39 +00:00
|
|
|
}
|
|
|
|
|
2010-04-06 04:23:02 +00:00
|
|
|
/* mddev_suspend makes sure no new requests are submitted
|
|
|
|
* to the device, and that any requests that have been submitted
|
|
|
|
* are completely handled.
|
2014-12-15 01:56:58 +00:00
|
|
|
* Once mddev_detach() is called and completes, the module will be
|
|
|
|
* completely unused.
|
2010-04-06 04:23:02 +00:00
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
void mddev_suspend(struct mddev *mddev)
|
2009-03-31 03:39:39 +00:00
|
|
|
{
|
2023-05-23 02:10:17 +00:00
|
|
|
struct md_thread *thread = rcu_dereference_protected(mddev->thread,
|
|
|
|
lockdep_is_held(&mddev->reconfig_mutex));
|
|
|
|
|
|
|
|
WARN_ON_ONCE(thread && current == thread->tsk);
|
2015-12-18 04:19:16 +00:00
|
|
|
if (mddev->suspended++)
|
|
|
|
return;
|
2017-06-05 06:49:39 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
2017-10-17 02:46:43 +00:00
|
|
|
set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
|
2023-01-31 05:17:10 +00:00
|
|
|
percpu_ref_kill(&mddev->active_io);
|
2023-05-12 01:56:09 +00:00
|
|
|
|
2023-08-25 03:09:52 +00:00
|
|
|
if (mddev->pers && mddev->pers->prepare_suspend)
|
2023-05-12 01:56:09 +00:00
|
|
|
mddev->pers->prepare_suspend(mddev);
|
|
|
|
|
2023-01-31 05:17:10 +00:00
|
|
|
wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io));
|
2017-10-17 02:46:43 +00:00
|
|
|
clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
|
|
|
|
wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
|
2012-05-16 09:06:14 +00:00
|
|
|
|
|
|
|
del_timer_sync(&mddev->safemode_timer);
|
md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
In raid5.c:resize_chunk(), scribble_alloc() is called with the GFP_NOIO
flag, which is then passed to kvmalloc_array() inside scribble_alloc().
The problem is that kvmalloc_array() eventually calls kvmalloc_node(), which
does not accept flags that are not GFP_KERNEL compatible, such as GFP_NOIO,
so kmalloc_node() is called instead to allocate physically contiguous
pages. When system memory is under heavy pressure and the requested
size is large, there is a high probability that allocating contiguous
pages will fail.
But simply using the GFP_KERNEL flag to call kvmalloc_array() is also
problematic. In the code path where scribble_alloc() is called, the
raid array is suspended; if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspended raid array, deadlock will happen.
What is desired here is to allocate non-physically (a.k.a. virtually)
contiguous pages and avoid memory reclaim I/Os. Michal Hocko suggests
using the memalloc scope APIs to restrict memory reclaim I/O in the
allocating context, specifically calling memalloc_noio_save() when
suspending the raid array and memalloc_noio_restore() when
resuming the raid array.
This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os while the raid array
is suspended. The benefit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is that, no matter which
md raid array type (personality) is in use, we are sure the deadlock by
recursive memory reclaim I/O won't happen in the suspending context.
Please notice that the memalloc scope APIs only take effect in the raid
array suspending context; if the memory allocation comes from another newly
created kthread after the raid array is suspended, the recursive memory
reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries
are used for the critical section where the raid metadata is being modified;
creating a kthread to allocate memory inside the critical section is
odd and very probably buggy.
Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-04-09 14:17:20 +00:00
|
|
|
/* restrict memory reclaim I/O while the raid array is suspended */
|
|
|
|
mddev->noio_flag = memalloc_noio_save();
|
2009-03-31 03:39:39 +00:00
|
|
|
}
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mddev_suspend);
|
2009-03-31 03:39:39 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void mddev_resume(struct mddev *mddev)
|
2009-03-31 03:39:39 +00:00
|
|
|
{
|
2017-10-19 01:17:16 +00:00
|
|
|
lockdep_assert_held(&mddev->reconfig_mutex);
|
2015-12-18 04:19:16 +00:00
|
|
|
if (--mddev->suspended)
|
|
|
|
return;
|
2023-06-28 01:29:31 +00:00
|
|
|
|
|
|
|
/* entered the memalloc scope from mddev_suspend() */
|
|
|
|
memalloc_noio_restore(mddev->noio_flag);
|
|
|
|
|
2023-01-31 05:17:10 +00:00
|
|
|
percpu_ref_resurrect(&mddev->active_io);
|
2009-03-31 03:39:39 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
2011-06-07 22:49:36 +00:00
|
|
|
|
2012-05-22 03:55:29 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2011-06-07 22:49:36 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mddev_resume);
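mddev_suspend() and mddev_resume() nest through the `suspended` counter: only the outermost suspend and the matching final resume do any real work. The following user-space model sketches just that nesting rule. `struct suspend_gate`, `gate_suspend` and `gate_resume` are hypothetical names, and the `io_blocked` flag is a simplification standing in for killing and resurrecting the `active_io` percpu reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the mddev->suspended nesting counter. */
struct suspend_gate {
	int  depth;       /* number of outstanding suspend calls */
	bool io_blocked;  /* stands in for the killed active_io reference */
};

/* Returns true only for the outermost call, which does the real work. */
bool gate_suspend(struct suspend_gate *g)
{
	if (g->depth++)
		return false;  /* already suspended, just count the nesting */
	g->io_blocked = true;
	return true;
}

/* Returns true only when the last outstanding suspend is undone. */
bool gate_resume(struct suspend_gate *g)
{
	assert(g->depth > 0);
	if (--g->depth)
		return false;  /* an outer suspend is still in effect */
	g->io_blocked = false;
	return true;
}
```

The model leaves out everything that makes the real functions hard: waiting for in-flight I/O to drain, the superblock update handshake, and the memalloc_noio scope.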
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-12-14 01:49:49 +00:00
|
|
|
/*
|
2010-09-03 09:56:18 +00:00
|
|
|
* Generic flush handling for md
|
2009-12-14 01:49:49 +00:00
|
|
|
*/
|
|
|
|
|
2019-03-29 17:46:16 +00:00
|
|
|
static void md_end_flush(struct bio *bio)
|
2009-12-14 01:49:49 +00:00
|
|
|
{
|
2019-03-29 17:46:16 +00:00
|
|
|
struct md_rdev *rdev = bio->bi_private;
|
|
|
|
struct mddev *mddev = rdev->mddev;
|
2009-12-14 01:49:49 +00:00
|
|
|
|
2022-11-04 13:53:38 +00:00
|
|
|
bio_put(bio);
|
|
|
|
|
2009-12-14 01:49:49 +00:00
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
|
2019-03-29 17:46:16 +00:00
|
|
|
if (atomic_dec_and_test(&mddev->flush_pending)) {
|
|
|
|
/* The pre-request flush has finished */
|
|
|
|
queue_work(md_wq, &mddev->flush_work);
|
2009-12-14 01:49:49 +00:00
|
|
|
}
|
2018-05-21 03:49:54 +00:00
|
|
|
}
|
2010-12-09 05:04:25 +00:00
|
|
|
|
2019-03-29 17:46:16 +00:00
|
|
|
static void md_submit_flush_data(struct work_struct *ws);
|
|
|
|
|
|
|
|
static void submit_flushes(struct work_struct *ws)
|
2009-12-14 01:49:49 +00:00
|
|
|
{
|
2019-03-29 17:46:16 +00:00
|
|
|
struct mddev *mddev = container_of(ws, struct mddev, flush_work);
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2009-12-14 01:49:49 +00:00
|
|
|
|
2019-03-29 17:46:17 +00:00
|
|
|
mddev->start_flush = ktime_get_boottime();
|
2019-03-29 17:46:16 +00:00
|
|
|
INIT_WORK(&mddev->flush_work, md_submit_flush_data);
|
|
|
|
atomic_set(&mddev->flush_pending, 1);
|
2009-12-14 01:49:49 +00:00
|
|
|
rcu_read_lock();
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each_rcu(rdev, mddev)
|
2009-12-14 01:49:49 +00:00
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Faulty, &rdev->flags)) {
|
|
|
|
/* Take two references, one is dropped
|
|
|
|
* when request finishes, one after
|
|
|
|
* we reclaim rcu_read_lock
|
|
|
|
*/
|
|
|
|
struct bio *bi;
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2022-01-24 09:11:03 +00:00
|
|
|
bi = bio_alloc_bioset(rdev->bdev, 0,
|
|
|
|
REQ_OP_WRITE | REQ_PREFLUSH,
|
|
|
|
GFP_NOIO, &mddev->bio_set);
|
2018-05-21 03:49:54 +00:00
|
|
|
bi->bi_end_io = md_end_flush;
|
2019-03-29 17:46:16 +00:00
|
|
|
bi->bi_private = rdev;
|
|
|
|
atomic_inc(&mddev->flush_pending);
|
2016-06-05 19:31:41 +00:00
|
|
|
submit_bio(bi);
|
2009-12-14 01:49:49 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2019-03-29 17:46:16 +00:00
|
|
|
if (atomic_dec_and_test(&mddev->flush_pending))
|
|
|
|
queue_work(md_wq, &mddev->flush_work);
|
|
|
|
}
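submit_flushes() starts `flush_pending` at 1 rather than 0, so the counter cannot reach zero while flushes are still being issued, even if early ones complete immediately. A single-threaded sketch of that bias-count pattern using C11 atomics follows. The names `flush_tracker`, `tracker_put` and `tracker_start` are hypothetical, and incrementing `next_stage_runs` stands in for queue_work().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct flush_tracker {
	atomic_int pending;   /* models mddev->flush_pending */
	int next_stage_runs;  /* counts how often the follow-up work was queued */
};

/* Drop one count; whoever drops the last one queues the next stage. */
void tracker_put(struct flush_tracker *t)
{
	if (atomic_fetch_sub(&t->pending, 1) == 1)
		t->next_stage_runs++;
}

/*
 * Issue one flush per device. complete_inline simulates flushes that
 * finish before the loop moves on to the next device.
 */
void tracker_start(struct flush_tracker *t, int ndevs, bool complete_inline)
{
	int i;

	assert(ndevs >= 0);
	atomic_store(&t->pending, 1);  /* bias held by the submitter */
	for (i = 0; i < ndevs; i++) {
		atomic_fetch_add(&t->pending, 1);
		if (complete_inline)
			tracker_put(t);
	}
	tracker_put(t);  /* drop the bias once every flush is issued */
}
```

Without the initial bias, the first inline completion would take the counter to zero and queue the next stage before the remaining devices had been flushed.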
|
2009-12-14 01:49:49 +00:00
|
|
|
|
2019-03-29 17:46:16 +00:00
|
|
|
static void md_submit_flush_data(struct work_struct *ws)
|
|
|
|
{
|
|
|
|
struct mddev *mddev = container_of(ws, struct mddev, flush_work);
|
|
|
|
struct bio *bio = mddev->flush_bio;
|
|
|
|
|
|
|
|
/*
|
|
|
|
 * Must reset flush_bio before calling into md_handle_request to avoid a
|
|
|
|
 * deadlock: other bios that passed the md_handle_request suspend check
|
|
|
|
 * could wait for this, while the md_handle_request call below could
|
|
|
|
 * wait for those bios because of the suspend check.
|
|
|
|
*/
|
2020-12-10 06:33:32 +00:00
|
|
|
spin_lock_irq(&mddev->lock);
|
2020-11-11 05:16:56 +00:00
|
|
|
mddev->prev_flush_start = mddev->start_flush;
|
2019-03-29 17:46:16 +00:00
|
|
|
mddev->flush_bio = NULL;
|
2020-12-10 06:33:32 +00:00
|
|
|
spin_unlock_irq(&mddev->lock);
|
2019-03-29 17:46:16 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
|
|
|
|
if (bio->bi_iter.bi_size == 0) {
|
|
|
|
/* an empty barrier - all done */
|
|
|
|
bio_endio(bio);
|
|
|
|
} else {
|
|
|
|
bio->bi_opf &= ~REQ_PREFLUSH;
|
|
|
|
md_handle_request(mddev, bio);
|
2009-12-14 01:49:49 +00:00
|
|
|
}
|
|
|
|
}
|
2019-03-29 17:46:16 +00:00
|
|
|
|
2019-09-16 17:15:14 +00:00
|
|
|
/*
|
|
|
|
* Manages consolidation of flushes and submitting any flushes needed for
|
|
|
|
* a bio with REQ_PREFLUSH. Returns true if the bio is finished or is
|
|
|
|
* being finished in another context. Returns false if the flushing is
|
|
|
|
* complete but still needs the I/O portion of the bio to be processed.
|
|
|
|
*/
|
|
|
|
bool md_flush_request(struct mddev *mddev, struct bio *bio)
|
2019-03-29 17:46:16 +00:00
|
|
|
{
|
2020-11-11 05:16:56 +00:00
|
|
|
ktime_t req_start = ktime_get_boottime();
|
2019-03-29 17:46:16 +00:00
|
|
|
spin_lock_irq(&mddev->lock);
|
2020-11-11 05:16:57 +00:00
|
|
|
/* flush requests wait until ongoing flush completes,
|
|
|
|
* hence coalescing all the pending requests.
|
|
|
|
*/
|
2019-03-29 17:46:16 +00:00
|
|
|
wait_event_lock_irq(mddev->sb_wait,
|
2019-03-29 17:46:17 +00:00
|
|
|
!mddev->flush_bio ||
|
2020-11-11 05:16:58 +00:00
|
|
|
ktime_before(req_start, mddev->prev_flush_start),
|
2019-03-29 17:46:16 +00:00
|
|
|
mddev->lock);
|
2020-11-11 05:16:57 +00:00
|
|
|
/* new request after previous flush is completed */
|
2020-11-11 05:16:58 +00:00
|
|
|
if (ktime_after(req_start, mddev->prev_flush_start)) {
|
2019-03-29 17:46:17 +00:00
|
|
|
WARN_ON(mddev->flush_bio);
|
|
|
|
mddev->flush_bio = bio;
|
|
|
|
bio = NULL;
|
|
|
|
}
|
2019-03-29 17:46:16 +00:00
|
|
|
spin_unlock_irq(&mddev->lock);
|
|
|
|
|
2019-03-29 17:46:17 +00:00
|
|
|
if (!bio) {
|
|
|
|
INIT_WORK(&mddev->flush_work, submit_flushes);
|
|
|
|
queue_work(md_wq, &mddev->flush_work);
|
|
|
|
} else {
|
|
|
|
/* flush was performed for some other bio while we waited. */
|
|
|
|
if (bio->bi_iter.bi_size == 0)
|
|
|
|
/* an empty barrier - all done */
|
|
|
|
bio_endio(bio);
|
|
|
|
else {
|
|
|
|
bio->bi_opf &= ~REQ_PREFLUSH;
|
2019-09-16 17:15:14 +00:00
|
|
|
return false;
|
2019-03-29 17:46:17 +00:00
|
|
|
}
|
|
|
|
}
|
2019-09-16 17:15:14 +00:00
|
|
|
return true;
|
2019-03-29 17:46:16 +00:00
|
|
|
}
|
2010-09-03 09:56:18 +00:00
|
|
|
EXPORT_SYMBOL(md_flush_request);
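The coalescing rule in md_flush_request() comes down to comparing when a request arrived with when the most recently completed flush started. The sketch below models only that decision, taken after the wait has finished. `flush_decide`, `flush_request_returns` and the enum are hypothetical names, and timestamps are plain integers rather than ktime_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum flush_outcome {
	FLUSH_QUEUED,       /* this bio becomes flush_bio and work is queued */
	FLUSH_DONE_EMPTY,   /* covered by an earlier flush, no data: bio ended */
	FLUSH_DONE_NEED_IO  /* covered by an earlier flush, data still to write */
};

/*
 * req_start: when this request entered md_flush_request().
 * prev_flush_start: start time of the last flush that completed.
 * data_bytes: payload size of the bio (0 for an empty barrier).
 */
enum flush_outcome flush_decide(int64_t req_start, int64_t prev_flush_start,
				uint32_t data_bytes)
{
	if (req_start > prev_flush_start)
		return FLUSH_QUEUED;  /* no completed flush covers us */
	if (data_bytes == 0)
		return FLUSH_DONE_EMPTY;
	return FLUSH_DONE_NEED_IO;
}

/* Mirrors the return value documented above md_flush_request(). */
bool flush_request_returns(enum flush_outcome o)
{
	return o != FLUSH_DONE_NEED_IO;
}
```

A flush that started after the request arrived necessarily covers every write the requester had already completed, which is why such a request can skip issuing its own flush. In this model, as in the source's use of ktime_after(), equal timestamps count as already covered.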
|
2009-03-31 03:39:39 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static inline struct mddev *mddev_get(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2022-07-19 09:18:23 +00:00
|
|
|
lockdep_assert_held(&all_mddevs_lock);
|
|
|
|
|
|
|
|
if (test_bit(MD_DELETED, &mddev->flags))
|
|
|
|
return NULL;
|
2005-04-16 22:20:36 +00:00
|
|
|
atomic_inc(&mddev->active);
|
|
|
|
return mddev;
|
|
|
|
}
|
|
|
|
|
2009-03-04 07:57:25 +00:00
|
|
|
static void mddev_delayed_delete(struct work_struct *ws);
|
md: make devices disappear when they are no longer needed.
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk: this is a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop of the device being opened
and closed, then re-opened due to the 'ADD' event from the first open,
and then closed, and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do is
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-08 21:31:10 +00:00
|
|
|
|
2022-07-23 06:24:29 +00:00
|
|
|
void mddev_put(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
|
|
|
|
return;
|
2009-01-08 21:31:10 +00:00
|
|
|
if (!mddev->raid_disks && list_empty(&mddev->disks) &&
|
2009-12-30 01:08:49 +00:00
|
|
|
mddev->ctime == 0 && !mddev->hold_active) {
|
|
|
|
/* Array is not configured at all, and not held active,
|
|
|
|
* so destroy it */
|
2022-07-19 09:18:23 +00:00
|
|
|
set_bit(MD_DELETED, &mddev->flags);
|
2018-06-08 00:52:54 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Call queue_work inside the spinlock so that
|
|
|
|
* flush_workqueue() after mddev_find will succeed in waiting
|
|
|
|
* for the work to be done.
|
|
|
|
*/
|
|
|
|
queue_work(md_misc_wq, &mddev->del_work);
|
2009-01-08 21:31:10 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
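mddev_get() and mddev_put() together form a reference count with a one-way "deleted" latch: once the last reference to an unconfigured array is dropped, no new reference can be taken. The sketch below is a single-threaded model of that life cycle. `struct array_obj`, `array_get` and `array_put` are hypothetical names, the `configured` field collapses the raid_disks, disks-list and ctime checks into one flag, and the all_mddevs_lock locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct array_obj {
	int  active;         /* models mddev->active */
	bool deleted;        /* models the MD_DELETED flag */
	bool configured;     /* collapses raid_disks, disks list and ctime */
	bool hold_active;    /* models mddev->hold_active */
	int  delete_queued;  /* counts simulated queue_work() calls */
};

/* Take a reference unless the object is already marked for deletion. */
struct array_obj *array_get(struct array_obj *a)
{
	if (a->deleted)
		return NULL;
	a->active++;
	return a;
}

/* Drop a reference; the last put of an unused object schedules deletion. */
void array_put(struct array_obj *a)
{
	assert(a->active > 0);
	if (--a->active)
		return;
	if (!a->configured && !a->hold_active) {
		a->deleted = true;
		a->delete_queued++;
	}
}
```

In the real code both halves run under all_mddevs_lock (mddev_put() acquires it through atomic_dec_and_lock()), which is what makes the deleted check and the increment safe against a concurrent final put.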
|
|
|
|
|
2017-10-17 00:01:48 +00:00
|
|
|
static void md_safemode_timeout(struct timer_list *t);
|
2023-08-25 03:16:16 +00:00
|
|
|
static void md_start_sync(struct work_struct *ws);
|
2015-07-24 22:19:58 +00:00
|
|
|
|
2023-08-25 03:09:50 +00:00
|
|
|
static void active_io_release(struct percpu_ref *ref)
|
2010-04-01 04:55:30 +00:00
|
|
|
{
|
2023-08-25 03:09:50 +00:00
|
|
|
struct mddev *mddev = container_of(ref, struct mddev, active_io);
|
|
|
|
|
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
}
|
|
|
|
|
md: initialize 'writes_pending' while allocating mddev
Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleting mddev, instead of in pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' from being
initialized multiple times; this usage is safe but a little weird.
On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in the common layer, for example:
  array_state_store
    set_in_sync
      if (!mddev->in_sync) -> in_sync is used for all levels
        // access writes_pending
There might be some implicit dependency that I don't recognize to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.
By the way, it makes sense to initialize 'writes_pending' in the common layer
because there are already three levels that use it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
2023-08-25 03:09:51 +00:00
|
|
|
static void no_op(struct percpu_ref *r) {}
|
|
|
|
|
2023-08-25 03:09:50 +00:00
|
|
|
int mddev_init(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (percpu_ref_init(&mddev->active_io, active_io_release,
|
|
|
|
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2023-08-25 03:09:51 +00:00
|
|
|
if (percpu_ref_init(&mddev->writes_pending, no_op,
|
|
|
|
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
|
|
|
|
percpu_ref_exit(&mddev->active_io);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We want to start with the refcount at zero */
|
|
|
|
percpu_ref_put(&mddev->writes_pending);
|
|
|
|
|
2010-04-01 04:55:30 +00:00
|
|
|
mutex_init(&mddev->open_mutex);
|
|
|
|
mutex_init(&mddev->reconfig_mutex);
|
md: add a mutex to synchronize idle and frozen in action_store()
Currently, for idle and frozen, action_store() will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop the sync thread. However, this
will cause a deadlock (explained in the next patch). In order to fix the
problem, the following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() do.
Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems. For example,
'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which
might starve the in-progress 'frozen'. A mutex is added to synchronize
idle and frozen from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
2023-05-29 13:20:34 +00:00
|
|
|
mutex_init(&mddev->sync_mutex);
|
2010-04-01 04:55:30 +00:00
|
|
|
mutex_init(&mddev->bitmap_info.mutex);
|
|
|
|
INIT_LIST_HEAD(&mddev->disks);
|
|
|
|
INIT_LIST_HEAD(&mddev->all_mddevs);
|
md: fix duplicate filename for rdev
Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev. However, this
introduces a window in which rdev can be added again while the deletion
is not done yet, and sysfs will complain about a duplicate filename.
Follow-up patches tried to fix this problem by flushing the workqueue.
However, flush_rdev_wq() is just dead code. The process in
md_kick_rdev_from_array() is:
1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);
So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass; in the meantime, if work is queued, then rdev can never be
found in the list.
flush_rdev_wq() can be replaced by flush_workqueue() directly. However,
this approach is not good:
- the workqueue is global, so this synchronization for all raid disks is
  not necessary.
- flush_workqueue() can't be called under 'reconfig_mutex', so there is
  still a small window between flush_workqueue() and mddev_lock() in which
  other contexts can queue new work, hence the problem is not solved
  completely.
sysfs already has APIs that let an attribute delete itself from its own
writer, and these APIs, specifically sysfs_break/unbreak_active_protection(),
are used to support deleting rdev synchronously. Therefore, the above
commit can be reverted, and the sysfs duplicate filename can be avoided.
A new mdadm regression test is proposed as well([1]).
[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
2023-05-23 01:27:27 +00:00
|
|
|
INIT_LIST_HEAD(&mddev->deleting);
|
2017-10-17 00:01:48 +00:00
|
|
|
timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
|
2010-04-01 04:55:30 +00:00
|
|
|
atomic_set(&mddev->active, 1);
|
|
|
|
atomic_set(&mddev->openers, 0);
|
md: refactor idle/frozen_sync_thread() to fix deadlock
Our test found the following deadlock in raid10:
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
raid10_sync_request
raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
action_store
mddev_lock
md_unregister_thread
kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence a cyclic dependency exists. In order to fix the problem, we must
break one of them. Dependencies 1 and 2 can't be broken because they are
fundamental to the design. Breaking dependency 4 may be possible if it can
be guaranteed that no io can be inflight, however, this requires a new
mechanism which seems complex. Dependency 3 is a good choice, because
idle/frozen only requires the sync thread to finish, which can be done
asynchronously and is already implemented, and 'reconfig_mutex' is not
needed anymore.
This patch switches 'idle' and 'frozen' to wait for the sync thread to
finish asynchronously. It also adds a sequence counter to record how many
times the sync thread has finished, so that 'idle' won't keep waiting on a
newly started sync thread.
Note that raid456 has a similar deadlock([1]), and it's verified[2] that
this deadlock can be fixed by this patch as well.
[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
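The sequence-counter idea in the message above can be illustrated with a small user-space sketch using C11 atomics. This is not the kernel code: the struct, the `sync_running` flag, and all four helper functions are hypothetical names chosen for illustration, and only the field name `sync_seq` is taken from the source. The commit message describes waiting on a wait queue; the sketch only models the condition such a waiter would check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of struct mddev. */
struct sketch_array {
	atomic_int sync_seq;      /* bumped every time a sync thread finishes */
	atomic_bool sync_running; /* true while a sync thread is active */
};

/* A new sync thread starts running. */
static void sketch_sync_start(struct sketch_array *a)
{
	assert(a != NULL);
	atomic_store(&a->sync_running, true);
}

/* The current sync thread finishes: clear the flag, then bump the counter. */
static void sketch_sync_done(struct sketch_array *a)
{
	assert(a != NULL);
	atomic_store(&a->sync_running, false);
	atomic_fetch_add(&a->sync_seq, 1);
}

/* 'idle' records the counter value before it starts waiting. */
static int sketch_idle_snapshot(struct sketch_array *a)
{
	assert(a != NULL);
	return atomic_load(&a->sync_seq);
}

/*
 * 'idle' may stop waiting when no sync thread is running, or when the
 * counter has moved on, meaning the thread it was waiting for has already
 * finished even if a new one has been started since.
 */
static bool sketch_idle_may_return(struct sketch_array *a, int snapshot)
{
	assert(a != NULL);
	return !atomic_load(&a->sync_running) ||
	       atomic_load(&a->sync_seq) != snapshot;
}
```

Without the counter, a waiter that only checked `sync_running` could be kept waiting indefinitely by sync threads that restart back to back, which is the situation the message says the counter avoids.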
2023-05-29 13:20:35 +00:00
|
|
|
atomic_set(&mddev->sync_seq, 0);
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock_init(&mddev->lock);
|
2019-03-29 17:46:16 +00:00
|
|
|
atomic_set(&mddev->flush_pending, 0);
|
2010-04-01 04:55:30 +00:00
|
|
|
init_waitqueue_head(&mddev->sb_wait);
|
|
|
|
init_waitqueue_head(&mddev->recovery_wait);
|
|
|
|
mddev->reshape_position = MaxSector;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
2013-06-25 06:23:59 +00:00
|
|
|
mddev->last_sync_action = "none";
|
2010-04-01 04:55:30 +00:00
|
|
|
mddev->resync_min = 0;
|
|
|
|
mddev->resync_max = MaxSector;
|
|
|
|
mddev->level = LEVEL_NONE;
|
2023-08-25 03:16:16 +00:00
|
|
|
|
|
|
|
INIT_WORK(&mddev->sync_work, md_start_sync);
|
|
|
|
INIT_WORK(&mddev->del_work, mddev_delayed_delete);
|
2023-08-25 03:09:50 +00:00
|
|
|
|
|
|
|
return 0;
|
2010-04-01 04:55:30 +00:00
|
|
|
}
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mddev_init);
|
2010-04-01 04:55:30 +00:00
|
|
|
|
2023-08-25 03:09:50 +00:00
|
|
|
void mddev_destroy(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
percpu_ref_exit(&mddev->active_io);
|
2023-08-25 03:09:51 +00:00
|
|
|
percpu_ref_exit(&mddev->writes_pending);
|
2023-08-25 03:09:50 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(mddev_destroy);
|
|
|
|
|
2021-04-03 16:15:28 +00:00
|
|
|
static struct mddev *mddev_find_locked(dev_t unit)
|
|
|
|
{
|
|
|
|
struct mddev *mddev;
|
|
|
|
|
|
|
|
list_for_each_entry(mddev, &all_mddevs, all_mddevs)
|
|
|
|
if (mddev->unit == unit)
|
|
|
|
return mddev;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2021-04-12 08:05:28 +00:00
|
|
|
/* find an unused unit number */
|
|
|
|
static dev_t mddev_alloc_unit(void)
|
|
|
|
{
|
|
|
|
static int next_minor = 512;
|
|
|
|
int start = next_minor;
|
|
|
|
bool is_free = false;
|
|
|
|
dev_t dev = 0;
|
|
|
|
|
|
|
|
while (!is_free) {
|
|
|
|
dev = MKDEV(MD_MAJOR, next_minor);
|
|
|
|
next_minor++;
|
|
|
|
if (next_minor > MINORMASK)
|
|
|
|
next_minor = 0;
|
|
|
|
if (next_minor == start)
|
|
|
|
return 0; /* Oh dear, all in use. */
|
|
|
|
is_free = !mddev_find_locked(dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
return dev;
|
|
|
|
}
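The wrap-around search in mddev_alloc_unit() can be tried outside the kernel with a sketch like the one below. It is an illustration under stated assumptions, not the driver's code: the in-use table stands in for the mddev_find_locked() lookup, the range is shrunk to 16 minors, -1 replaces the kernel's return value of 0 for "all in use", and every `sketch_` name is hypothetical. Unlike the kernel function, the sketch also marks the returned minor as used, since there is no caller here to add it to a list.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_MINOR 15 /* hypothetical stand-in for MINORMASK */

/* Hypothetical in-use table replacing the all_mddevs list lookup. */
static bool sketch_in_use[SKETCH_MAX_MINOR + 1];

/* Next candidate to try; persists across calls like the kernel's static. */
static int sketch_next_minor = 8;

/* Returns a free minor and marks it used, or -1 if the search wrapped. */
static int sketch_alloc_minor(void)
{
	int start = sketch_next_minor;

	assert(start >= 0 && start <= SKETCH_MAX_MINOR);
	for (;;) {
		int candidate = sketch_next_minor;

		/* Advance first, wrapping to 0 past the top of the range. */
		sketch_next_minor++;
		if (sketch_next_minor > SKETCH_MAX_MINOR)
			sketch_next_minor = 0;
		/* Back where we started: give up, as the kernel loop does. */
		if (sketch_next_minor == start)
			return -1;
		if (!sketch_in_use[candidate]) {
			sketch_in_use[candidate] = true;
			return candidate;
		}
	}
}
```

Because the wrap check runs before the candidate is tested, the last candidate before the starting point is never probed. The sketch keeps that ordering to mirror the loop shown above.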
|
|
|
|
|
2021-04-12 08:05:30 +00:00
|
|
|
static struct mddev *mddev_alloc(dev_t unit)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2021-04-12 08:05:30 +00:00
|
|
|
struct mddev *new;
|
|
|
|
int error;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-02-16 02:58:51 +00:00
|
|
|
if (unit && MAJOR(unit) != MD_MAJOR)
|
2021-04-12 08:05:29 +00:00
|
|
|
unit &= ~((1 << MdpMinorShift) - 1);
|
2011-02-16 02:58:51 +00:00
|
|
|
|
2021-04-12 08:05:29 +00:00
|
|
|
new = kzalloc(sizeof(*new), GFP_KERNEL);
|
|
|
|
if (!new)
|
2021-04-12 08:05:30 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2023-08-25 03:09:50 +00:00
|
|
|
|
|
|
|
error = mddev_init(new);
|
|
|
|
if (error)
|
|
|
|
goto out_free_new;
|
2009-01-08 21:31:10 +00:00
|
|
|
|
2021-04-12 08:05:29 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
2009-01-08 21:31:10 +00:00
|
|
|
if (unit) {
|
2021-04-12 08:05:30 +00:00
|
|
|
error = -EEXIST;
|
|
|
|
if (mddev_find_locked(unit))
|
2023-08-25 03:09:50 +00:00
|
|
|
goto out_destroy_new;
|
2021-04-12 08:05:29 +00:00
|
|
|
new->unit = unit;
|
|
|
|
if (MAJOR(unit) == MD_MAJOR)
|
|
|
|
new->md_minor = MINOR(unit);
|
|
|
|
else
|
|
|
|
new->md_minor = MINOR(unit) >> MdpMinorShift;
|
|
|
|
new->hold_active = UNTIL_IOCTL;
|
|
|
|
} else {
|
2021-04-12 08:05:30 +00:00
|
|
|
error = -ENODEV;
|
2021-04-12 08:05:28 +00:00
|
|
|
new->unit = mddev_alloc_unit();
|
2021-04-12 08:05:29 +00:00
|
|
|
if (!new->unit)
|
2023-08-25 03:09:50 +00:00
|
|
|
goto out_destroy_new;
|
2021-04-12 08:05:28 +00:00
|
|
|
new->md_minor = MINOR(new->unit);
|
2009-01-08 21:31:10 +00:00
|
|
|
new->hold_active = UNTIL_STOP;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2021-04-12 08:05:29 +00:00
|
|
|
list_add(&new->all_mddevs, &all_mddevs);
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return new;
|
2023-08-25 03:09:50 +00:00
|
|
|
|
|
|
|
out_destroy_new:
|
2021-04-12 08:05:29 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
2023-08-25 03:09:50 +00:00
|
|
|
mddev_destroy(new);
|
|
|
|
out_free_new:
|
2021-04-12 08:05:29 +00:00
|
|
|
kfree(new);
|
2021-04-12 08:05:30 +00:00
|
|
|
return ERR_PTR(error);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
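For a unit outside MD_MAJOR, mddev_alloc() masks off the low minor bits and later shifts the minor right to get md_minor. The sketch below shows that arithmetic in isolation. It assumes a shift of 6 (the value MdpMinorShift is believed to have in md.h, not confirmed by this file), and the `SKETCH_`/`sketch_` names are hypothetical.

```c
#include <assert.h>

#define SKETCH_MINOR_SHIFT 6 /* assumption: mirrors MdpMinorShift */

/* Round a partition's minor down to the minor of its whole-array device. */
static unsigned int sketch_base_minor(unsigned int minor)
{
	return minor & ~((1u << SKETCH_MINOR_SHIFT) - 1);
}

/* Array number of a partitionable device, as stored in md_minor. */
static unsigned int sketch_md_minor(unsigned int minor)
{
	return minor >> SKETCH_MINOR_SHIFT;
}
```

With a shift of 6, each partitionable array owns a block of 64 minors, so minor 70 rounds down to 64 and belongs to array number 1.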
|
|
|
|
|
2022-07-19 09:18:16 +00:00
|
|
|
static void mddev_free(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_del(&mddev->all_mddevs);
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2023-08-25 03:09:50 +00:00
|
|
|
mddev_destroy(mddev);
|
2022-07-19 09:18:16 +00:00
|
|
|
kfree(mddev);
|
|
|
|
}
|
|
|
|
|
2021-05-29 10:30:49 +00:00
|
|
|
static const struct attribute_group md_redundancy_group;
|
2010-04-15 00:13:47 +00:00
|
|
|
|
2023-06-21 14:29:33 +00:00
|
|
|
void mddev_unlock(struct mddev *mddev)
|
2023-05-23 01:27:27 +00:00
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
struct md_rdev *tmp;
|
2023-06-21 14:29:33 +00:00
|
|
|
LIST_HEAD(delete);
|
2023-05-23 01:27:27 +00:00
|
|
|
|
2023-06-21 14:29:33 +00:00
|
|
|
if (!list_empty(&mddev->deleting))
|
|
|
|
list_splice_init(&mddev->deleting, &delete);
|
2023-05-23 01:27:27 +00:00
|
|
|
|
2010-04-14 07:15:37 +00:00
|
|
|
if (mddev->to_remove) {
|
2010-04-15 00:13:47 +00:00
|
|
|
/* These cannot be removed under reconfig_mutex as
|
|
|
|
* an access to the files will try to take reconfig_mutex
|
|
|
|
* while holding the file unremovable, which leads to
|
|
|
|
* a deadlock.
|
2010-08-08 11:18:03 +00:00
|
|
|
* So set sysfs_active while the removal is happening,
|
|
|
|
* and anything else which might set ->to_remove or may
|
|
|
|
* otherwise change the sysfs namespace will fail with
|
|
|
|
* -EBUSY if sysfs_active is still set.
|
|
|
|
* We set sysfs_active under reconfig_mutex and elsewhere
|
|
|
|
* test it under the same mutex to ensure its correct value
|
|
|
|
* is seen.
|
2010-04-15 00:13:47 +00:00
|
|
|
*/
|
2021-05-29 10:30:49 +00:00
|
|
|
const struct attribute_group *to_remove = mddev->to_remove;
|
2010-04-14 07:15:37 +00:00
|
|
|
mddev->to_remove = NULL;
|
2010-08-08 11:18:03 +00:00
|
|
|
mddev->sysfs_active = 1;
|
2010-04-15 00:13:47 +00:00
|
|
|
mutex_unlock(&mddev->reconfig_mutex);
|
|
|
|
|
2010-06-01 09:37:23 +00:00
|
|
|
if (mddev->kobj.sd) {
|
|
|
|
if (to_remove != &md_redundancy_group)
|
|
|
|
sysfs_remove_group(&mddev->kobj, to_remove);
|
|
|
|
if (mddev->pers == NULL ||
|
|
|
|
mddev->pers->sync_request == NULL) {
|
|
|
|
sysfs_remove_group(&mddev->kobj, &md_redundancy_group);
|
|
|
|
if (mddev->sysfs_action)
|
|
|
|
sysfs_put(mddev->sysfs_action);
|
2020-08-05 00:27:18 +00:00
|
|
|
if (mddev->sysfs_completed)
|
|
|
|
sysfs_put(mddev->sysfs_completed);
|
|
|
|
if (mddev->sysfs_degraded)
|
|
|
|
sysfs_put(mddev->sysfs_degraded);
|
2010-06-01 09:37:23 +00:00
|
|
|
mddev->sysfs_action = NULL;
|
2020-08-05 00:27:18 +00:00
|
|
|
mddev->sysfs_completed = NULL;
|
|
|
|
mddev->sysfs_degraded = NULL;
|
2010-06-01 09:37:23 +00:00
|
|
|
}
|
2010-04-14 07:15:37 +00:00
|
|
|
}
|
2010-08-08 11:18:03 +00:00
|
|
|
mddev->sysfs_active = 0;
|
2010-04-15 00:13:47 +00:00
|
|
|
} else
|
|
|
|
mutex_unlock(&mddev->reconfig_mutex);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-08-25 02:55:31 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
|
2023-06-21 14:29:33 +00:00
|
|
|
list_for_each_entry_safe(rdev, tmp, &delete, same_set) {
|
|
|
|
list_del_init(&rdev->same_set);
|
|
|
|
kobject_del(&rdev->kobj);
|
|
|
|
export_rdev(rdev, mddev);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
EXPORT_SYMBOL_GPL(mddev_unlock);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-04-14 15:43:55 +00:00
|
|
|
struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr)
|
2012-10-11 02:37:33 +00:00
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each_rcu(rdev, mddev)
|
|
|
|
if (rdev->desc_nr == nr)
|
|
|
|
return rdev;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
2015-04-14 15:43:55 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);
|
2012-10-11 02:37:33 +00:00
|
|
|
|
|
|
|
static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (rdev->bdev->bd_dev == dev)
|
|
|
|
return rdev;
|
2009-01-08 21:31:08 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-12-27 09:31:40 +00:00
|
|
|
struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev)
|
2012-10-11 02:37:33 +00:00
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each_rcu(rdev, mddev)
|
|
|
|
if (rdev->bdev->bd_dev == dev)
|
|
|
|
return rdev;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
2017-12-27 09:31:40 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_find_rdev_rcu);
|
2012-10-11 02:37:33 +00:00
|
|
|
|
2011-10-11 05:49:58 +00:00
|
|
|
static struct md_personality *find_pers(int level, char *clevel)
|
2006-01-06 08:20:36 +00:00
|
|
|
{
|
2011-10-11 05:49:58 +00:00
|
|
|
struct md_personality *pers;
|
2006-01-06 08:20:51 +00:00
|
|
|
list_for_each_entry(pers, &pers_list, list) {
|
|
|
|
if (level != LEVEL_NONE && pers->level == level)
|
2006-01-06 08:20:36 +00:00
|
|
|
return pers;
|
2006-01-06 08:20:51 +00:00
|
|
|
if (strcmp(pers->name, clevel) == 0)
|
|
|
|
return pers;
|
|
|
|
}
|
2006-01-06 08:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2008-07-11 12:02:23 +00:00
|
|
|
/* return the offset of the super block in 512-byte sectors */
|
2011-10-11 05:45:26 +00:00
|
|
|
static inline sector_t calc_dev_sboffset(struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2021-10-18 10:11:06 +00:00
|
|
|
return MD_NEW_SIZE_SECTORS(bdev_nr_sectors(rdev->bdev));
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int alloc_disk_sb(struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
rdev->sb_page = alloc_page(GFP_KERNEL);
|
2016-11-02 03:16:49 +00:00
|
|
|
if (!rdev->sb_page)
|
2008-07-11 12:02:20 +00:00
|
|
|
return -ENOMEM;
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-05-22 03:54:30 +00:00
|
|
|
void md_rdev_clear(struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
if (rdev->sb_page) {
|
2006-01-06 08:20:31 +00:00
|
|
|
put_page(rdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->sb_loaded = 0;
|
|
|
|
rdev->sb_page = NULL;
|
2008-07-11 12:02:23 +00:00
|
|
|
rdev->sb_start = 0;
|
2009-03-31 03:33:13 +00:00
|
|
|
rdev->sectors = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2011-07-28 01:31:47 +00:00
|
|
|
if (rdev->bb_page) {
|
|
|
|
put_page(rdev->bb_page);
|
|
|
|
rdev->bb_page = NULL;
|
|
|
|
}
|
2016-01-06 20:19:22 +00:00
|
|
|
badblocks_exit(&rdev->badblocks);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2012-05-22 03:54:30 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_rdev_clear);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-07-20 13:29:37 +00:00
|
|
|
static void super_written(struct bio *bio)
|
2005-06-22 00:17:28 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev = bio->bi_private;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = rdev->mddev;
|
2005-06-22 00:17:28 +00:00
|
|
|
|
2017-06-03 07:38:06 +00:00
|
|
|
if (bio->bi_status) {
|
2020-07-28 10:01:41 +00:00
|
|
|
pr_err("md: %s gets error=%d\n", __func__,
|
|
|
|
blk_status_to_errno(bio->bi_status));
|
[PATCH] md: support BIO_RW_BARRIER for md/raid1
We can only accept BARRIER requests if all slaves handle
barriers, and that can, of course, change with time....
So we keep track of whether the whole array seems safe for barriers,
and also whether each individual rdev handles barriers.
We initially assume barriers are OK.
When writing the superblock we try a barrier, and if that fails, we flag
things for no-barriers. This will usually clear the flags fairly quickly.
If writing the superblock finds that BIO_RW_BARRIER is -ENOTSUPP, we need to
resubmit, so introduce function "md_super_wait" which waits for requests to
finish, and retries ENOTSUPP requests without the barrier flag.
When writing the real raid1, write requests which were BIO_RW_BARRIER but
which aren't supported need to be retried. So raid1d is enhanced to do this,
and when any bio write completes (i.e. no retry needed) we remove it from the
r1bio, so that devices needing retry are easy to find.
We should hardly ever get -ENOTSUPP errors when writing data to the raid.
It should only happen if:
1/ the device used to support BARRIER, but now doesn't. Few devices
change like this, though raid1 can!
or
2/ the array has no persistent superblock, so there was no opportunity to
pre-test for barriers when writing the superblock.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:34 +00:00
|
|
|
md_error(mddev, rdev);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (!test_bit(Faulty, &rdev->flags)
|
|
|
|
&& (bio->bi_opf & MD_FAILFAST)) {
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
|
2016-11-18 05:16:11 +00:00
|
|
|
set_bit(LastDev, &rdev->flags);
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
clear_bit(LastDev, &rdev->flags);
|
2005-06-22 00:17:28 +00:00
|
|
|
|
2022-11-04 13:53:38 +00:00
|
|
|
bio_put(bio);
|
|
|
|
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
|
2005-11-09 05:39:34 +00:00
|
|
|
if (atomic_dec_and_test(&mddev->pending_writes))
|
|
|
|
wake_up(&mddev->sb_wait);
|
2005-06-22 00:17:28 +00:00
|
|
|
}
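The pending_writes accounting used by super_written() follows a common pattern: count a write before it is submitted, drop the count on completion, and wake the waiter when the count reaches zero. A minimal user-space sketch with C11 atomics is given below. The struct and function names are hypothetical, and a plain boolean stands in for wake_up(&mddev->sb_wait).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock-write state in struct mddev. */
struct sketch_sb_state {
	atomic_int pending_writes; /* writes submitted but not yet completed */
	bool woken;                /* stands in for waking sb_wait */
};

/* Called before submitting a write: account for it first. */
static void sketch_submit(struct sketch_sb_state *s)
{
	assert(s != NULL);
	atomic_fetch_add(&s->pending_writes, 1);
}

/* Called from the completion path of each write. */
static void sketch_complete(struct sketch_sb_state *s)
{
	assert(s != NULL);
	/*
	 * atomic_fetch_sub() returns the value before the subtraction, so a
	 * result of 1 means this completion brought the counter to zero.
	 */
	if (atomic_fetch_sub(&s->pending_writes, 1) == 1)
		s->woken = true;
}
```

Incrementing before submission matters: if the count were raised afterwards, a fast completion could see the counter reach zero while other writes were still about to be issued.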
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
|
2005-06-22 00:17:28 +00:00
|
|
|
sector_t sector, int size, struct page *page)
|
|
|
|
{
|
|
|
|
/* write first size bytes of page to sector of rdev
|
|
|
|
* Increment mddev->pending_writes before returning
|
|
|
|
* and decrement it on completion, waking up sb_wait
|
|
|
|
* if zero is reached.
|
|
|
|
* If an error occurred, call md_error
|
|
|
|
*/
|
2016-11-18 05:16:11 +00:00
|
|
|
struct bio *bio;
|
|
|
|
|
2018-02-02 22:13:19 +00:00
|
|
|
if (!page)
|
|
|
|
return;
|
|
|
|
|
2016-11-18 05:16:11 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
|
|
|
return;
|
|
|
|
|
2022-01-24 09:11:03 +00:00
|
|
|
bio = bio_alloc_bioset(rdev->meta_bdev ? rdev->meta_bdev : rdev->bdev,
|
|
|
|
1,
|
|
|
|
REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA,
|
|
|
|
GFP_NOIO, &mddev->sync_set);
|
2005-06-22 00:17:28 +00:00
|
|
|
|
2016-03-29 21:00:19 +00:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
|
2013-10-11 22:44:27 +00:00
|
|
|
bio->bi_iter.bi_sector = sector;
|
2023-05-31 11:50:28 +00:00
|
|
|
__bio_add_page(bio, page, size, 0);
|
2005-06-22 00:17:28 +00:00
|
|
|
bio->bi_private = rdev;
|
|
|
|
bio->bi_end_io = super_written;
|
2016-11-18 05:16:11 +00:00
|
|
|
|
|
|
|
if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
|
|
|
|
test_bit(FailFast, &rdev->flags) &&
|
|
|
|
!test_bit(LastDev, &rdev->flags))
|
2022-01-24 09:11:03 +00:00
|
|
|
bio->bi_opf |= MD_FAILFAST;
|
2005-11-09 05:39:34 +00:00
|
|
|
|
2005-06-22 00:17:28 +00:00
|
|
|
atomic_inc(&mddev->pending_writes);
|
2016-06-05 19:31:41 +00:00
|
|
|
submit_bio(bio);
|
2005-11-09 05:39:34 +00:00
|
|
|
}
|
|
|
|
|
2016-11-18 05:16:11 +00:00
|
|
|
int md_super_wait(struct mddev *mddev)
|
2005-11-09 05:39:34 +00:00
|
|
|
{
|
2010-09-03 09:56:18 +00:00
|
|
|
/* wait for all superblock writes that were scheduled to complete */
|
2014-09-09 04:20:28 +00:00
|
|
|
wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
|
2016-11-18 05:16:11 +00:00
|
|
|
return -EAGAIN;
|
|
|
|
return 0;
|
2005-06-22 00:17:28 +00:00
|
|
|
}
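The counting scheme shared by md_super_write() and md_super_wait() (increment before submitting, decrement on completion, wake the waiter when zero is reached) can be sketched in user space. This is only an illustration under stated assumptions: it uses C11 atomics instead of the kernel's atomic_t and wait queues, and the names `sb_tracker`, `sb_write_start` and `sb_write_done` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the pending_writes counter. */
struct sb_tracker {
	atomic_int pending_writes;
};

/* Called before a superblock write is submitted. */
static void sb_write_start(struct sb_tracker *t)
{
	atomic_fetch_add(&t->pending_writes, 1);
}

/*
 * Called from the completion path. Returns true when this completion
 * dropped the counter to zero, i.e. when a waiter should be woken.
 */
static bool sb_write_done(struct sb_tracker *t)
{
	/* atomic_fetch_sub returns the value held before the subtraction */
	return atomic_fetch_sub(&t->pending_writes, 1) == 1;
}
```

A waiter in this sketch would block until `sb_write_done()` has returned true, which corresponds to the `wait_event()` condition `pending_writes == 0` in the function above.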
|
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
|
2022-07-14 18:06:57 +00:00
|
|
|
struct page *page, blk_opf_t opf, bool metadata_op)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2021-01-26 14:52:42 +00:00
|
|
|
struct bio bio;
|
|
|
|
struct bio_vec bvec;
|
|
|
|
|
2017-08-23 17:10:32 +00:00
|
|
|
if (metadata_op && rdev->meta_bdev)
|
2022-07-14 18:06:57 +00:00
|
|
|
bio_init(&bio, rdev->meta_bdev, &bvec, 1, opf);
|
2017-08-23 17:10:32 +00:00
|
|
|
else
|
2022-07-14 18:06:57 +00:00
|
|
|
bio_init(&bio, rdev->bdev, &bvec, 1, opf);
|
2022-01-24 09:11:06 +00:00
|
|
|
|
2011-01-13 22:14:33 +00:00
|
|
|
if (metadata_op)
|
2021-01-26 14:52:42 +00:00
|
|
|
bio.bi_iter.bi_sector = sector + rdev->sb_start;
|
2012-05-20 23:28:32 +00:00
|
|
|
else if (rdev->mddev->reshape_position != MaxSector &&
|
|
|
|
(rdev->mddev->reshape_backwards ==
|
|
|
|
(sector >= rdev->mddev->reshape_position)))
|
2021-01-26 14:52:42 +00:00
|
|
|
bio.bi_iter.bi_sector = sector + rdev->new_data_offset;
|
2011-01-13 22:14:33 +00:00
|
|
|
else
|
2021-01-26 14:52:42 +00:00
|
|
|
bio.bi_iter.bi_sector = sector + rdev->data_offset;
|
2023-05-31 11:50:28 +00:00
|
|
|
__bio_add_page(&bio, page, size, 0);
|
2016-06-05 19:31:41 +00:00
|
|
|
|
2021-01-26 14:52:42 +00:00
|
|
|
submit_bio_wait(&bio);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-01-26 14:52:42 +00:00
|
|
|
return !bio.bi_status;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2006-01-06 08:20:34 +00:00
|
|
|
EXPORT_SYMBOL_GPL(sync_page_io);
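The way sync_page_io() picks the starting sector is easier to follow in isolation. The sketch below restates that three-way choice in user space. The struct `io_layout`, the function `target_sector` and the `MAX_SECTOR` macro are hypothetical stand-ins for the md_rdev/mddev fields and MaxSector, and 64-bit sector numbers are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;          /* assumption: 64-bit sector numbers */
#define MAX_SECTOR (~(sector_t)0)   /* stand-in for MaxSector */

/* Hypothetical flattened view of the fields the function consults. */
struct io_layout {
	sector_t sb_start;
	sector_t data_offset;
	sector_t new_data_offset;
	sector_t reshape_position;  /* MAX_SECTOR when no reshape is active */
	bool reshape_backwards;
};

static sector_t target_sector(const struct io_layout *l, sector_t sector,
			      bool metadata_op)
{
	/* Metadata I/O is addressed relative to the superblock. */
	if (metadata_op)
		return sector + l->sb_start;
	/*
	 * During a reshape, the already-reshaped side of reshape_position
	 * uses the new data offset. Which side that is depends on the
	 * direction of the reshape.
	 */
	if (l->reshape_position != MAX_SECTOR &&
	    l->reshape_backwards == (sector >= l->reshape_position))
		return sector + l->new_data_offset;
	return sector + l->data_offset;
}
```

With a forward reshape at position 1000, sector 100 maps through `new_data_offset` and sector 2000 through `data_offset`; a backward reshape reverses the two.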
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int read_disk_sb(struct md_rdev *rdev, int size)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
if (rdev->sb_loaded)
|
|
|
|
return 0;
|
|
|
|
|
2022-07-14 18:06:57 +00:00
|
|
|
if (!sync_page_io(rdev, 0, size, rdev->sb_page, REQ_OP_READ, true))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto fail;
|
|
|
|
rdev->sb_loaded = 1;
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fail:
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_err("md: disabled device %pg, could not read superblock.\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2017-05-04 13:26:20 +00:00
|
|
|
static int md_uuid_equal(mdp_super_t *sb1, mdp_super_t *sb2)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-09-30 04:23:59 +00:00
|
|
|
return sb1->set_uuid0 == sb2->set_uuid0 &&
|
2008-07-11 12:02:20 +00:00
|
|
|
sb1->set_uuid1 == sb2->set_uuid1 &&
|
|
|
|
sb1->set_uuid2 == sb2->set_uuid2 &&
|
|
|
|
sb1->set_uuid3 == sb2->set_uuid3;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2017-05-04 13:26:20 +00:00
|
|
|
static int md_sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
mdp_super_t *tmp1, *tmp2;
|
|
|
|
|
|
|
|
tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL);
|
|
|
|
tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL);
|
|
|
|
|
|
|
|
if (!tmp1 || !tmp2) {
|
|
|
|
ret = 0;
|
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
|
|
|
*tmp1 = *sb1;
|
|
|
|
*tmp2 = *sb2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* nr_disks is not constant
|
|
|
|
*/
|
|
|
|
tmp1->nr_disks = 0;
|
|
|
|
tmp2->nr_disks = 0;
|
|
|
|
|
2008-07-11 12:02:20 +00:00
|
|
|
ret = (memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * 4) == 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
abort:
|
2005-06-22 00:17:30 +00:00
|
|
|
kfree(tmp1);
|
|
|
|
kfree(tmp2);
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
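The comparison in md_sb_equal() works on copies so that a field which legitimately differs between members (nr_disks) can be zeroed before a memcmp() over the constant prefix. A reduced user-space sketch of the same idea follows. The struct `mini_sb`, its field layout and `MINI_SB_CONSTANT_WORDS` are invented for illustration and do not reflect the real mdp_super_t layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much-reduced superblock: "constant" words come first. */
struct mini_sb {
	uint32_t magic;
	uint32_t level;
	uint32_t nr_disks;    /* varies between members, so it is masked out */
	uint32_t raid_disks;
	uint32_t utime;       /* lies outside the compared prefix */
};

#define MINI_SB_CONSTANT_WORDS 4   /* assumption: first four words compared */

static bool mini_sb_equal(const struct mini_sb *a, const struct mini_sb *b)
{
	/* Work on copies so the callers' superblocks stay untouched. */
	struct mini_sb ta = *a, tb = *b;

	ta.nr_disks = 0;
	tb.nr_disks = 0;
	return memcmp(&ta, &tb, MINI_SB_CONSTANT_WORDS * 4) == 0;
}
```

The sketch copies onto the stack because the toy struct is tiny; the function above allocates its temporaries with kmalloc() instead.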
|
|
|
|
|
2007-05-09 09:35:37 +00:00
|
|
|
static u32 md_csum_fold(u32 csum)
|
|
|
|
{
|
|
|
|
csum = (csum & 0xffff) + (csum >> 16);
|
|
|
|
return (csum & 0xffff) + (csum >> 16);
|
|
|
|
}
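A standalone restatement of the fold may help when checking values by hand. It adds the upper 16 bits into the lower 16 bits twice, so that a carry produced by the first addition is absorbed by the second. The name `csum_fold16` is hypothetical; the arithmetic mirrors md_csum_fold() above.

```c
#include <stdint.h>

/* Fold a 32-bit sum down to 16 bits, adding carries back in. */
static uint32_t csum_fold16(uint32_t csum)
{
	/* First pass: result may be up to 17 bits wide. */
	csum = (csum & 0xffff) + (csum >> 16);
	/* Second pass: absorbs the possible carry from the first pass. */
	return (csum & 0xffff) + (csum >> 16);
}
```

For example, 0x12345678 folds to 0x1234 + 0x5678 = 0x68ac, and 0x0001ffff needs the second pass: the first addition yields 0x10000, which the second reduces to 1.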
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static unsigned int calc_sb_csum(mdp_super_t *sb)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2007-05-09 09:35:37 +00:00
|
|
|
u64 newcsum = 0;
|
|
|
|
u32 *sb32 = (u32*)sb;
|
|
|
|
int i;
|
2005-04-16 22:20:36 +00:00
|
|
|
unsigned int disk_csum, csum;
|
|
|
|
|
|
|
|
disk_csum = sb->sb_csum;
|
|
|
|
sb->sb_csum = 0;
|
2007-05-09 09:35:37 +00:00
|
|
|
|
|
|
|
for (i = 0; i < MD_SB_BYTES/4 ; i++)
|
|
|
|
newcsum += sb32[i];
|
|
|
|
csum = (newcsum & 0xffffffff) + (newcsum>>32);
|
|
|
|
|
|
|
|
#ifdef CONFIG_ALPHA
|
|
|
|
/* This used to use csum_partial, which was wrong for several
|
|
|
|
* reasons including that different results are returned on
|
|
|
|
* different architectures. It isn't critical that we get exactly
|
|
|
|
* the same return value as before (we always csum_fold before
|
|
|
|
* testing, and that removes any differences). However as we
|
|
|
|
* know that csum_partial always returned a 16bit value on
|
|
|
|
* alphas, do a fold to maximise conformity to previous behaviour.
|
|
|
|
*/
|
|
|
|
sb->sb_csum = md_csum_fold(disk_csum);
|
|
|
|
#else
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->sb_csum = disk_csum;
|
2007-05-09 09:35:37 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
return csum;
|
|
|
|
}
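The core of calc_sb_csum() is a sum of 32-bit words in a 64-bit accumulator, with the overflow above bit 31 added back into the low word once. The following user-space sketch isolates that step; `sum_words` is a hypothetical name, and the saving and restoring of sb_csum done by the real function is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n 32-bit words and add the bits above bit 31 back in once. */
static uint32_t sum_words(const uint32_t *words, size_t n)
{
	uint64_t acc = 0;
	size_t i;

	for (i = 0; i < n; i++)
		acc += words[i];
	/* Truncation to 32 bits on return is intended. */
	return (uint32_t)((acc & 0xffffffff) + (acc >> 32));
}
```

Summing 0xffffffff and 2 gives 0x100000001 in the accumulator, so the result is 1 + 1 = 2.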
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle superblock details.
|
|
|
|
* We want to be able to handle multiple superblock formats
|
|
|
|
* so we have a common interface to them all, and an array of
|
|
|
|
* different handlers.
|
|
|
|
* We rely on user-space to write the initial superblock, and support
|
|
|
|
* reading and updating of superblocks.
|
|
|
|
* Interface methods are:
|
2011-10-11 05:45:26 +00:00
|
|
|
* int load_super(struct md_rdev *dev, struct md_rdev *refdev, int minor_version)
|
2005-04-16 22:20:36 +00:00
|
|
|
* loads and validates a superblock on dev.
|
|
|
|
* if refdev != NULL, compare superblocks on both devices
|
|
|
|
* Return:
|
|
|
|
* 0 - dev has a superblock that is compatible with refdev
|
|
|
|
* 1 - dev has a superblock that is compatible and newer than refdev
|
|
|
|
* so dev should be used as the refdev in future
|
|
|
|
* -EINVAL superblock incompatible or invalid
|
|
|
|
* -othererror e.g. -EIO
|
|
|
|
*
|
2011-10-11 05:47:53 +00:00
|
|
|
* int validate_super(struct mddev *mddev, struct md_rdev *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
* Verify that dev is acceptable into mddev.
|
|
|
|
* The first time, mddev->raid_disks will be 0, and data from
|
|
|
|
* dev should be merged in. Subsequent calls check that dev
|
|
|
|
* is new enough. Return 0 or -EINVAL
|
|
|
|
*
|
2011-10-11 05:47:53 +00:00
|
|
|
* void sync_super(struct mddev *mddev, struct md_rdev *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
* Update the superblock for rdev with data in mddev
|
|
|
|
* This does not write to disk.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct super_type {
|
2008-06-27 22:31:46 +00:00
|
|
|
char *name;
|
|
|
|
struct module *owner;
|
2012-05-20 23:27:00 +00:00
|
|
|
int (*load_super)(struct md_rdev *rdev,
|
|
|
|
struct md_rdev *refdev,
|
2008-06-27 22:31:46 +00:00
|
|
|
int minor_version);
|
2012-05-20 23:27:00 +00:00
|
|
|
int (*validate_super)(struct mddev *mddev,
|
|
|
|
struct md_rdev *rdev);
|
|
|
|
void (*sync_super)(struct mddev *mddev,
|
|
|
|
struct md_rdev *rdev);
|
2011-10-11 05:45:26 +00:00
|
|
|
unsigned long long (*rdev_size_change)(struct md_rdev *rdev,
|
2008-07-21 04:42:12 +00:00
|
|
|
sector_t num_sectors);
|
2012-05-20 23:27:00 +00:00
|
|
|
int (*allow_new_offset)(struct md_rdev *rdev,
|
|
|
|
unsigned long long new_offset);
|
2005-04-16 22:20:36 +00:00
|
|
|
};
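The "common interface plus an array of handlers" design described in the comment above can be illustrated with a small table of function pointers. Everything in this sketch is hypothetical (the `dev` and `handler` types, the two loader functions, the format names and the -1 error value); it only shows the dispatch pattern, not the real handler table.

```c
#include <stddef.h>

struct dev { int version; };   /* hypothetical stand-in for struct md_rdev */

/* One entry per supported metadata format. */
struct handler {
	const char *name;
	int (*load)(struct dev *d);
};

static int load_v0(struct dev *d) { d->version = 0; return 0; }
static int load_v1(struct dev *d) { d->version = 1; return 0; }

static const struct handler handlers[] = {
	{ .name = "format-0", .load = load_v0 },
	{ .name = "format-1", .load = load_v1 },
};

/* Returns the handler's result, or -1 for an unknown format index. */
static int load_with(size_t major, struct dev *d)
{
	if (major >= sizeof(handlers) / sizeof(handlers[0]))
		return -1;
	return handlers[major].load(d);
}
```

Callers select an entry by format number and never need to know which concrete loader runs, which is what lets new superblock formats be added without touching the call sites.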
|
|
|
|
|
2009-06-17 22:49:23 +00:00
|
|
|
/*
|
|
|
|
* Check that the given mddev has no bitmap.
|
|
|
|
*
|
|
|
|
* This function is called from the run method of all personalities that do not
|
|
|
|
* support bitmaps. It prints an error message and returns non-zero if mddev
|
|
|
|
* has a bitmap. Otherwise, it returns 0.
|
|
|
|
*
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
int md_check_no_bitmap(struct mddev *mddev)
|
2009-06-17 22:49:23 +00:00
|
|
|
{
|
2009-12-14 01:49:52 +00:00
|
|
|
if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
|
2009-06-17 22:49:23 +00:00
|
|
|
return 0;
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: bitmaps are not supported for %s\n",
|
2009-06-17 22:49:23 +00:00
|
|
|
mdname(mddev), mddev->pers->name);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_check_no_bitmap);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2014-09-30 04:23:59 +00:00
|
|
|
* load_super for 0.90.0
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2011-10-11 05:45:26 +00:00
|
|
|
static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdp_super_t *sb;
|
|
|
|
int ret;
|
2019-10-30 10:47:02 +00:00
|
|
|
bool spare_disk = true;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
2008-07-11 12:02:23 +00:00
|
|
|
* Calculate the position of the superblock (512byte sectors),
|
2005-04-16 22:20:36 +00:00
|
|
|
* it's at the end of the disk.
|
|
|
|
*
|
|
|
|
* It also happens to be a multiple of 4Kb.
|
|
|
|
*/
|
2011-01-13 22:14:33 +00:00
|
|
|
rdev->sb_start = calc_dev_sboffset(rdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-09-09 23:23:53 +00:00
|
|
|
ret = read_disk_sb(rdev, MD_SB_BYTES);
|
2016-11-02 03:16:49 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
|
2011-07-27 01:00:36 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (sb->md_magic != MD_SB_MAGIC) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: invalid raid superblock magic on %pg\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sb->major_version != 0 ||
|
2006-03-27 09:18:11 +00:00
|
|
|
sb->minor_version < 90 ||
|
|
|
|
sb->minor_version > 91) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("Bad version number %d.%d on %pg\n",
|
|
|
|
sb->major_version, sb->minor_version, rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sb->raid_disks <= 0)
|
|
|
|
goto abort;
|
|
|
|
|
2007-05-09 09:35:37 +00:00
|
|
|
if (md_csum_fold(calc_sb_csum(sb)) != md_csum_fold(sb->sb_csum)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: invalid superblock checksum on %pg\n", rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
|
|
|
rdev->preferred_minor = sb->md_minor;
|
|
|
|
rdev->data_offset = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
rdev->new_data_offset = 0;
|
2005-09-09 23:23:53 +00:00
|
|
|
rdev->sb_size = MD_SB_BYTES;
|
2011-07-28 01:31:47 +00:00
|
|
|
rdev->badblocks.shift = -1;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (sb->level == LEVEL_MULTIPATH)
|
|
|
|
rdev->desc_nr = -1;
|
|
|
|
else
|
|
|
|
rdev->desc_nr = sb->this_disk.number;
|
|
|
|
|
2019-10-30 10:47:02 +00:00
|
|
|
/* not spare disk, or LEVEL_MULTIPATH */
|
|
|
|
if (sb->level == LEVEL_MULTIPATH ||
|
|
|
|
(rdev->desc_nr >= 0 &&
|
2019-12-10 07:01:29 +00:00
|
|
|
rdev->desc_nr < MD_SB_DISKS &&
|
2019-10-30 10:47:02 +00:00
|
|
|
sb->disks[rdev->desc_nr].state &
|
|
|
|
((1<<MD_DISK_SYNC) | (1 << MD_DISK_ACTIVE))))
|
|
|
|
spare_disk = false;
|
|
|
|
|
2008-04-28 09:15:49 +00:00
|
|
|
if (!refdev) {
|
2019-10-30 10:47:02 +00:00
|
|
|
if (!spare_disk)
|
md: no longer compare spare disk superblock events in super_load
We have a test case as follow:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.
After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 08:00:03 +00:00
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
2008-04-28 09:15:49 +00:00
|
|
|
} else {
|
2005-04-16 22:20:36 +00:00
|
|
|
__u64 ev1, ev2;
|
2011-07-27 01:00:36 +00:00
|
|
|
mdp_super_t *refsb = page_address(refdev->sb_page);
|
2017-05-04 13:26:20 +00:00
|
|
|
if (!md_uuid_equal(refsb, sb)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg has different UUID to %pg\n",
|
|
|
|
rdev->bdev, refdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
2017-05-04 13:26:20 +00:00
|
|
|
if (!md_sb_equal(refsb, sb)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg has same UUID but different superblock to %pg\n",
|
|
|
|
rdev->bdev, refdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
ev1 = md_event(sb);
|
|
|
|
ev2 = md_event(refsb);
|
2019-10-16 08:00:03 +00:00
|
|
|
|
2019-10-30 10:47:02 +00:00
|
|
|
if (!spare_disk && ev1 > ev2)
|
2005-04-16 22:20:36 +00:00
|
|
|
ret = 1;
|
2014-09-30 04:23:59 +00:00
|
|
|
else
|
2005-04-16 22:20:36 +00:00
|
|
|
ret = 0;
|
|
|
|
}
|
2009-06-17 22:48:58 +00:00
|
|
|
rdev->sectors = rdev->sb_start;
|
2012-08-16 06:46:12 +00:00
|
|
|
/* Limit to 4TB as metadata cannot record more than that.
|
|
|
|
* (not needed for Linear and RAID0 as metadata doesn't
|
|
|
|
* record this size)
|
|
|
|
*/
|
2019-04-05 16:08:59 +00:00
|
|
|
if ((u64)rdev->sectors >= (2ULL << 32) && sb->level >= 1)
|
2015-12-20 23:51:01 +00:00
|
|
|
rdev->sectors = (sector_t)(2ULL << 32) - 2;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-09-10 07:21:28 +00:00
|
|
|
if (rdev->sectors < ((sector_t)sb->size) * 2 && sb->level >= 1)
|
2006-01-06 08:20:55 +00:00
|
|
|
/* "this cannot possibly happen" ... */
|
|
|
|
ret = -EINVAL;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
abort:
|
|
|
|
return ret;
|
|
|
|
}
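The 4TB limit applied near the end of super_90_load() follows from simple arithmetic: (2ULL << 32) is 2^33 sectors, and at 512 bytes per sector that is 2^42 bytes. The sketch below restates the clamp in user space; `LIMIT_SECTORS` and `clamp_dev_sectors` are hypothetical names.

```c
#include <stdint.h>

/* 2^33 sectors of 512 bytes each is 2^42 bytes. */
#define LIMIT_SECTORS (2ULL << 32)

/*
 * Levels below 1 (linear, RAID0) are left alone because, per the
 * comment in the function above, their metadata does not record
 * the per-device size.
 */
static uint64_t clamp_dev_sectors(uint64_t sectors, int level)
{
	if (sectors >= LIMIT_SECTORS && level >= 1)
		return LIMIT_SECTORS - 2;
	return sectors;
}
```

A device of exactly `LIMIT_SECTORS` sectors is therefore reported as two sectors smaller for RAID levels 1 and above, while smaller devices pass through unchanged.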
|
|
|
|
|
|
|
|
/*
|
|
|
|
* validate_super for 0.90.0
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdp_disk_t *desc;
|
2011-07-27 01:00:36 +00:00
|
|
|
mdp_super_t *sb = page_address(rdev->sb_page);
|
2006-06-26 07:27:56 +00:00
|
|
|
__u64 ev1 = md_event(sb);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-06-22 00:17:25 +00:00
|
|
|
rdev->raid_disk = -1;
|
2008-02-06 09:39:54 +00:00
|
|
|
clear_bit(Faulty, &rdev->flags);
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2013-12-11 23:13:33 +00:00
|
|
|
clear_bit(Bitmap_sync, &rdev->flags);
|
2008-02-06 09:39:54 +00:00
|
|
|
clear_bit(WriteMostly, &rdev->flags);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->raid_disks == 0) {
|
|
|
|
mddev->major_version = 0;
|
|
|
|
mddev->minor_version = sb->minor_version;
|
|
|
|
mddev->patch_version = sb->patch_version;
|
2008-02-06 09:39:51 +00:00
|
|
|
mddev->external = 0;
|
2009-06-17 22:45:01 +00:00
|
|
|
mddev->chunk_sectors = sb->chunk_size >> 9;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->ctime = sb->ctime;
|
|
|
|
mddev->utime = sb->utime;
|
|
|
|
mddev->level = sb->level;
|
2006-01-06 08:20:51 +00:00
|
|
|
mddev->clevel[0] = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->layout = sb->layout;
|
|
|
|
mddev->raid_disks = sb->raid_disks;
|
2011-09-10 07:21:28 +00:00
|
|
|
mddev->dev_sectors = ((sector_t)sb->size) * 2;
|
2006-06-26 07:27:56 +00:00
|
|
|
mddev->events = ev1;
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.space = 0;
|
|
|
|
/* bitmap can use 60 K after the 4K superblocks */
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.default_space = 64*2 - (MD_SB_BYTES >> 9);
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
if (mddev->minor_version >= 91) {
|
|
|
|
mddev->reshape_position = sb->reshape_position;
|
|
|
|
mddev->delta_disks = sb->delta_disks;
|
|
|
|
mddev->new_level = sb->new_level;
|
|
|
|
mddev->new_layout = sb->new_layout;
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = sb->new_chunk >> 9;
|
2012-05-20 23:27:00 +00:00
|
|
|
if (mddev->delta_disks < 0)
|
|
|
|
mddev->reshape_backwards = 1;
|
2006-03-27 09:18:11 +00:00
|
|
|
} else {
|
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
mddev->delta_disks = 0;
|
|
|
|
mddev->new_level = mddev->level;
|
|
|
|
mddev->new_layout = mddev->layout;
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
2006-03-27 09:18:11 +00:00
|
|
|
}
|
2019-09-09 06:52:29 +00:00
|
|
|
if (mddev->level == 0)
|
|
|
|
mddev->layout = -1;
|
2006-03-27 09:18:11 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (sb->state & (1<<MD_SB_CLEAN))
|
|
|
|
mddev->recovery_cp = MaxSector;
|
|
|
|
else {
|
2014-09-30 04:23:59 +00:00
|
|
|
if (sb->events_hi == sb->cp_events_hi &&
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->events_lo == sb->cp_events_lo) {
|
|
|
|
mddev->recovery_cp = sb->recovery_cp;
|
|
|
|
} else
|
|
|
|
mddev->recovery_cp = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(mddev->uuid+0, &sb->set_uuid0, 4);
|
|
|
|
memcpy(mddev->uuid+4, &sb->set_uuid1, 4);
|
|
|
|
memcpy(mddev->uuid+8, &sb->set_uuid2, 4);
|
|
|
|
memcpy(mddev->uuid+12,&sb->set_uuid3, 4);
|
|
|
|
|
|
|
|
mddev->max_disks = MD_SB_DISKS;
|
2005-06-22 00:17:27 +00:00
|
|
|
|
|
|
|
if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.file == NULL) {
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset =
|
|
|
|
mddev->bitmap_info.default_offset;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.space =
|
2013-08-20 02:26:32 +00:00
|
|
|
mddev->bitmap_info.default_space;
|
2012-05-22 03:55:07 +00:00
|
|
|
}
|
2005-06-22 00:17:27 +00:00
|
|
|
|
2005-06-22 00:17:25 +00:00
|
|
|
} else if (mddev->pers == NULL) {
|
2010-05-18 00:17:09 +00:00
|
|
|
/* Insist on good event counter while assembling, except
|
|
|
|
* for spares (which don't need an event count) */
|
2005-04-16 22:20:36 +00:00
|
|
|
++ev1;
|
2010-05-18 00:17:09 +00:00
|
|
|
if (sb->disks[rdev->desc_nr].state & (
|
|
|
|
(1<<MD_DISK_SYNC) | (1 << MD_DISK_ACTIVE)))
|
2014-09-30 04:23:59 +00:00
|
|
|
if (ev1 < mddev->events)
|
2010-05-18 00:17:09 +00:00
|
|
|
return -EINVAL;
|
2005-06-22 00:17:25 +00:00
|
|
|
} else if (mddev->bitmap) {
|
|
|
|
/* if adding to array with a bitmap, then we can accept an
|
|
|
|
* older device ... but not too old.
|
|
|
|
*/
|
|
|
|
if (ev1 < mddev->bitmap->events_cleared)
|
|
|
|
return 0;
|
2013-12-11 23:13:33 +00:00
|
|
|
if (ev1 < mddev->events)
|
|
|
|
set_bit(Bitmap_sync, &rdev->flags);
|
2006-06-26 07:27:56 +00:00
|
|
|
} else {
|
|
|
|
if (ev1 < mddev->events)
|
|
|
|
/* just a hot-add of a new device, leave raid_disk at -1 */
|
|
|
|
return 0;
|
|
|
|
}
|
2005-06-22 00:17:25 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->level != LEVEL_MULTIPATH) {
|
|
|
|
desc = sb->disks + rdev->desc_nr;
|
|
|
|
|
|
|
|
if (desc->state & (1<<MD_DISK_FAULTY))
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(Faulty, &rdev->flags);
|
2006-06-26 07:27:41 +00:00
|
|
|
else if (desc->state & (1<<MD_DISK_SYNC) /* &&
|
|
|
|
desc->raid_disk < mddev->raid_disks */) {
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->raid_disk = desc->raid_disk;
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However updates some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
record that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously it doesn't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Do that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
rdev->saved_raid_disk = desc->raid_disk;
|
2009-11-13 06:40:48 +00:00
|
|
|
} else if (desc->state & (1<<MD_DISK_ACTIVE)) {
|
|
|
|
/* active but not in sync implies recovery up to
|
|
|
|
* reshape position. We don't know exactly where
|
|
|
|
* that is, so set to zero for now */
|
|
|
|
if (mddev->minor_version >= 91) {
|
|
|
|
rdev->recovery_offset = 0;
|
|
|
|
rdev->raid_disk = desc->raid_disk;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2005-09-09 23:23:45 +00:00
|
|
|
if (desc->state & (1<<MD_DISK_WRITEMOSTLY))
|
|
|
|
set_bit(WriteMostly, &rdev->flags);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (desc->state & (1<<MD_DISK_FAILFAST))
|
|
|
|
set_bit(FailFast, &rdev->flags);
|
2005-06-22 00:17:25 +00:00
|
|
|
} else /* MULTIPATH are always insync */
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
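During assembly (the branch where mddev->pers is NULL), super_90_validate() increments the device's event count before comparing it with the array's, so a non-spare device may lag the array by one event and still be accepted. A user-space sketch of that rule, with the hypothetical name `events_fresh_enough` and a boolean in place of the disk state bits:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Spares are accepted regardless of their event count; other devices
 * may be at most one event behind the array.
 */
static bool events_fresh_enough(uint64_t dev_events, uint64_t array_events,
				bool is_spare)
{
	if (is_spare)
		return true;
	return dev_events + 1 >= array_events;
}
```

With the array at 10 events, a device at 9 is accepted, a device at 8 is rejected (the function above returns -EINVAL in that case), and a spare is accepted regardless.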
|
|
|
|
|
|
|
|
/*
|
|
|
|
* sync_super for 0.90.0
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdp_super_t *sb;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev2;
|
2005-04-16 22:20:36 +00:00
|
|
|
int next_spare = mddev->raid_disks;
|
2005-11-09 05:39:35 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* make rdev->sb match mddev data..
|
|
|
|
*
|
|
|
|
* 1/ zero out disks
|
|
|
|
* 2/ Add info for each disk, keeping track of highest desc_nr (next_spare);
|
|
|
|
* 3/ any empty disks < next_spare become removed
|
|
|
|
*
|
|
|
|
* disks[0] gets initialised to REMOVED because
|
|
|
|
* we cannot be sure from other fields if it has
|
|
|
|
* been initialised or not.
|
|
|
|
*/
|
|
|
|
int i;
|
|
|
|
int active=0, working=0,failed=0,spare=0,nr_disks=0;
|
|
|
|
|
2005-09-09 23:24:02 +00:00
|
|
|
rdev->sb_size = MD_SB_BYTES;
|
|
|
|
|
2011-07-27 01:00:36 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
memset(sb, 0, sizeof(*sb));
|
|
|
|
|
|
|
|
sb->md_magic = MD_SB_MAGIC;
|
|
|
|
sb->major_version = mddev->major_version;
|
|
|
|
sb->patch_version = mddev->patch_version;
|
|
|
|
sb->gvalid_words = 0; /* ignored */
|
|
|
|
memcpy(&sb->set_uuid0, mddev->uuid+0, 4);
|
|
|
|
memcpy(&sb->set_uuid1, mddev->uuid+4, 4);
|
|
|
|
memcpy(&sb->set_uuid2, mddev->uuid+8, 4);
|
|
|
|
memcpy(&sb->set_uuid3, mddev->uuid+12,4);
|
|
|
|
|
2015-12-20 23:51:01 +00:00
|
|
|
sb->ctime = clamp_t(time64_t, mddev->ctime, 0, U32_MAX);
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->level = mddev->level;
|
2009-03-31 03:33:13 +00:00
|
|
|
sb->size = mddev->dev_sectors / 2;
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->raid_disks = mddev->raid_disks;
|
|
|
|
sb->md_minor = mddev->md_minor;
|
2008-02-06 09:39:51 +00:00
|
|
|
sb->not_persistent = 0;
|
2015-12-20 23:51:01 +00:00
|
|
|
sb->utime = clamp_t(time64_t, mddev->utime, 0, U32_MAX);
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->state = 0;
|
|
|
|
sb->events_hi = (mddev->events>>32);
|
|
|
|
sb->events_lo = (u32)mddev->events;
|
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
if (mddev->reshape_position == MaxSector)
|
|
|
|
sb->minor_version = 90;
|
|
|
|
else {
|
|
|
|
sb->minor_version = 91;
|
|
|
|
sb->reshape_position = mddev->reshape_position;
|
|
|
|
sb->new_level = mddev->new_level;
|
|
|
|
sb->delta_disks = mddev->delta_disks;
|
|
|
|
sb->new_layout = mddev->new_layout;
|
2009-06-17 22:45:27 +00:00
|
|
|
sb->new_chunk = mddev->new_chunk_sectors << 9;
|
2006-03-27 09:18:11 +00:00
|
|
|
}
|
|
|
|
mddev->minor_version = sb->minor_version;
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->in_sync)
|
|
|
|
{
|
|
|
|
sb->recovery_cp = mddev->recovery_cp;
|
|
|
|
sb->cp_events_hi = (mddev->events>>32);
|
|
|
|
sb->cp_events_lo = (u32)mddev->events;
|
|
|
|
if (mddev->recovery_cp == MaxSector)
|
|
|
|
sb->state = (1<< MD_SB_CLEAN);
|
|
|
|
} else
|
|
|
|
sb->recovery_cp = 0;
|
|
|
|
|
|
|
|
sb->layout = mddev->layout;
|
2009-06-17 22:45:01 +00:00
|
|
|
sb->chunk_size = mddev->chunk_sectors << 9;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-12-14 01:49:52 +00:00
|
|
|
if (mddev->bitmap && mddev->bitmap_info.file == NULL)
|
2005-06-22 00:17:27 +00:00
|
|
|
sb->state |= (1<<MD_SB_BITMAP_PRESENT);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->disks[0].state = (1<<MD_DISK_REMOVED);
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev2, mddev) {
|
2005-04-16 22:20:36 +00:00
|
|
|
mdp_disk_t *d;
|
2005-11-09 05:39:24 +00:00
|
|
|
int desc_nr;
|
2009-11-13 06:40:48 +00:00
|
|
|
int is_active = test_bit(In_sync, &rdev2->flags);
|
|
|
|
|
|
|
|
if (rdev2->raid_disk >= 0 &&
|
|
|
|
sb->minor_version >= 91)
|
|
|
|
/* we have nowhere to store the recovery_offset,
|
|
|
|
* but if it is not below the reshape_position,
|
|
|
|
* we can piggy-back on that.
|
|
|
|
*/
|
|
|
|
is_active = 1;
|
|
|
|
if (rdev2->raid_disk < 0 ||
|
|
|
|
test_bit(Faulty, &rdev2->flags))
|
|
|
|
is_active = 0;
|
|
|
|
if (is_active)
|
2005-11-09 05:39:24 +00:00
|
|
|
desc_nr = rdev2->raid_disk;
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
2005-11-09 05:39:24 +00:00
|
|
|
desc_nr = next_spare++;
|
2005-11-09 05:39:35 +00:00
|
|
|
rdev2->desc_nr = desc_nr;
|
2005-04-16 22:20:36 +00:00
|
|
|
d = &sb->disks[rdev2->desc_nr];
|
|
|
|
nr_disks++;
|
|
|
|
d->number = rdev2->desc_nr;
|
|
|
|
d->major = MAJOR(rdev2->bdev->bd_dev);
|
|
|
|
d->minor = MINOR(rdev2->bdev->bd_dev);
|
2009-11-13 06:40:48 +00:00
|
|
|
if (is_active)
|
2005-04-16 22:20:36 +00:00
|
|
|
d->raid_disk = rdev2->raid_disk;
|
|
|
|
else
|
|
|
|
d->raid_disk = rdev2->desc_nr; /* compatibility */
|
2006-03-27 09:18:03 +00:00
|
|
|
if (test_bit(Faulty, &rdev2->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
d->state = (1<<MD_DISK_FAULTY);
|
2009-11-13 06:40:48 +00:00
|
|
|
else if (is_active) {
|
2005-04-16 22:20:36 +00:00
|
|
|
d->state = (1<<MD_DISK_ACTIVE);
|
2009-11-13 06:40:48 +00:00
|
|
|
if (test_bit(In_sync, &rdev2->flags))
|
|
|
|
d->state |= (1<<MD_DISK_SYNC);
|
2005-04-16 22:20:36 +00:00
|
|
|
active++;
|
|
|
|
working++;
|
|
|
|
} else {
|
|
|
|
d->state = 0;
|
|
|
|
spare++;
|
|
|
|
working++;
|
|
|
|
}
|
2005-09-09 23:23:45 +00:00
|
|
|
if (test_bit(WriteMostly, &rdev2->flags))
|
|
|
|
d->state |= (1<<MD_DISK_WRITEMOSTLY);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (test_bit(FailFast, &rdev2->flags))
|
|
|
|
d->state |= (1<<MD_DISK_FAILFAST);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
/* now set the "removed" and "faulty" bits on any missing devices */
|
|
|
|
for (i=0 ; i < mddev->raid_disks ; i++) {
|
|
|
|
mdp_disk_t *d = &sb->disks[i];
|
|
|
|
if (d->state == 0 && d->number == 0) {
|
|
|
|
d->number = i;
|
|
|
|
d->raid_disk = i;
|
|
|
|
d->state = (1<<MD_DISK_REMOVED);
|
|
|
|
d->state |= (1<<MD_DISK_FAULTY);
|
|
|
|
failed++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sb->nr_disks = nr_disks;
|
|
|
|
sb->active_disks = active;
|
|
|
|
sb->working_disks = working;
|
|
|
|
sb->failed_disks = failed;
|
|
|
|
sb->spare_disks = spare;
|
|
|
|
|
|
|
|
sb->this_disk = sb->disks[rdev->desc_nr];
|
|
|
|
sb->sb_csum = calc_sb_csum(sb);
|
|
|
|
}
|
|
|
|
|
2008-06-27 22:31:46 +00:00
|
|
|
/*
|
|
|
|
* rdev_size_change for 0.90.0
|
|
|
|
*/
|
|
|
|
static unsigned long long
|
2011-10-11 05:45:26 +00:00
|
|
|
super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
|
2008-06-27 22:31:46 +00:00
|
|
|
{
|
2009-03-31 03:33:13 +00:00
|
|
|
if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
|
2008-06-27 22:31:46 +00:00
|
|
|
return 0; /* component must fit device */
|
2009-12-14 01:49:52 +00:00
|
|
|
if (rdev->mddev->bitmap_info.offset)
|
2008-06-27 22:31:46 +00:00
|
|
|
return 0; /* can't move bitmap */
|
2011-01-13 22:14:33 +00:00
|
|
|
rdev->sb_start = calc_dev_sboffset(rdev);
|
2008-07-21 04:42:12 +00:00
|
|
|
if (!num_sectors || num_sectors > rdev->sb_start)
|
|
|
|
num_sectors = rdev->sb_start;
|
2011-09-10 07:21:28 +00:00
|
|
|
/* Limit to 4TB as metadata cannot record more than that.
|
|
|
|
* 4TB == 2^32 KB, or 2*2^32 sectors.
|
|
|
|
*/
|
2019-04-05 16:08:59 +00:00
|
|
|
if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1)
|
2015-12-20 23:51:01 +00:00
|
|
|
num_sectors = (sector_t)(2ULL << 32) - 2;
|
2016-11-18 05:16:11 +00:00
|
|
|
do {
|
|
|
|
md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
|
2008-06-27 22:31:46 +00:00
|
|
|
rdev->sb_page);
|
2016-11-18 05:16:11 +00:00
|
|
|
} while (md_super_wait(rdev->mddev) < 0);
|
2010-11-24 05:36:17 +00:00
|
|
|
return num_sectors;
|
2008-06-27 22:31:46 +00:00
|
|
|
}
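
The 4TB cap applied in super_90_rdev_size_change() can be checked with a little arithmetic. The following is a minimal userspace sketch, assuming 512-byte sectors; `SKETCH_MAX_V090_SECTORS` and `sketch_clamp_v090_sectors` are hypothetical names used only for illustration and are not part of md.c.

```c
#include <assert.h>
#include <stdint.h>

/* 2 * 2^32 sectors of 512 bytes each, i.e. 2^32 KB (hypothetical name). */
#define SKETCH_MAX_V090_SECTORS (2ULL << 32)

/* 2^33 sectors * 2^9 bytes per sector = 2^42 bytes = 4 TiB. */
static_assert(SKETCH_MAX_V090_SECTORS * 512ULL == (1ULL << 42),
	      "v0.90 limit should equal 4 TiB");

/*
 * Hypothetical helper mirroring the clamp: for level >= 1, a component of
 * 2^33 sectors or more is reduced to just under the limit; smaller sizes
 * and levels below 1 pass through unchanged.
 */
uint64_t sketch_clamp_v090_sectors(uint64_t num_sectors, int level)
{
	if (num_sectors >= SKETCH_MAX_V090_SECTORS && level >= 1)
		return SKETCH_MAX_V090_SECTORS - 2;
	return num_sectors;
}
```

The sketch leaves out the earlier checks in the real function (component size versus dev_sectors, bitmap offset, and the superblock position), which run before the clamp.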
|
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
static int
|
|
|
|
super_90_allow_new_offset(struct md_rdev *rdev, unsigned long long new_offset)
|
|
|
|
{
|
|
|
|
/* non-zero offset changes not possible with v0.90 */
|
|
|
|
return new_offset == 0;
|
|
|
|
}
|
2008-06-27 22:31:46 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* version 1 superblock
|
|
|
|
*/
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-10-21 17:24:08 +00:00
|
|
|
__le32 disk_csum;
|
|
|
|
u32 csum;
|
2005-04-16 22:20:36 +00:00
|
|
|
unsigned long long newcsum;
|
|
|
|
int size = 256 + le32_to_cpu(sb->max_dev)*2;
|
2006-10-21 17:24:08 +00:00
|
|
|
__le32 *isuper = (__le32*)sb;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
disk_csum = sb->sb_csum;
|
|
|
|
sb->sb_csum = 0;
|
|
|
|
newcsum = 0;
|
2012-12-11 02:09:00 +00:00
|
|
|
for (; size >= 4; size -= 4)
|
2005-04-16 22:20:36 +00:00
|
|
|
newcsum += le32_to_cpu(*isuper++);
|
|
|
|
|
|
|
|
if (size == 2)
|
2006-10-21 17:24:08 +00:00
|
|
|
newcsum += le16_to_cpu(*(__le16*) isuper);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
csum = (newcsum & 0xffffffff) + (newcsum >> 32);
|
|
|
|
sb->sb_csum = disk_csum;
|
|
|
|
return cpu_to_le32(csum);
|
|
|
|
}
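
The folding step in calc_sb_1_csum() can be tried outside the kernel. The sketch below is a hedged userspace approximation: it sums a byte buffer as little-endian 32-bit words, adds a trailing 16-bit word when two bytes remain, and folds the upper 32 bits of the 64-bit sum into the lower 32. The names `sketch_get_le32`, `sketch_get_le16` and `sketch_fold_csum` are hypothetical, and the caller is assumed to have zeroed the checksum field beforehand, as the kernel function does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word regardless of host byte order. */
static uint32_t sketch_get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 16-bit word regardless of host byte order. */
static uint32_t sketch_get_le16(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

/*
 * Hypothetical helper: sum `size` bytes as 32-bit words into a 64-bit
 * accumulator, then add the carry (upper 32 bits) back into the low word.
 * The result is returned in host byte order.
 */
uint32_t sketch_fold_csum(const uint8_t *buf, size_t size)
{
	uint64_t sum = 0;

	assert(buf != NULL);
	for (; size >= 4; size -= 4, buf += 4)
		sum += sketch_get_le32(buf);
	if (size == 2)
		sum += sketch_get_le16(buf);

	return (uint32_t)((sum & 0xffffffffULL) + (sum >> 32));
}
```

For a v1 superblock the byte count would be 256 + 2 * max_dev, matching the `size` computed at the top of calc_sb_1_csum().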
|
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct mdp_superblock_1 *sb;
|
|
|
|
int ret;
|
2008-07-11 12:02:23 +00:00
|
|
|
sector_t sb_start;
|
2012-05-20 23:27:00 +00:00
|
|
|
sector_t sectors;
|
2005-09-09 23:23:53 +00:00
|
|
|
int bmask;
|
2019-10-30 10:47:02 +00:00
|
|
|
bool spare_disk = true;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
2008-07-11 12:02:23 +00:00
|
|
|
* Calculate the position of the superblock in 512byte sectors.
|
2005-04-16 22:20:36 +00:00
|
|
|
* It is always aligned to a 4K boundary and
|
|
|
|
* depending on minor_version, it can be:
|
|
|
|
* 0: At least 8K, but less than 12K, from end of device
|
|
|
|
* 1: At start of device
|
|
|
|
* 2: 4K from start of device.
|
|
|
|
*/
|
|
|
|
switch(minor_version) {
|
|
|
|
case 0:
|
2021-10-18 10:11:06 +00:00
|
|
|
sb_start = bdev_nr_sectors(rdev->bdev) - 8 * 2;
|
2008-07-11 12:02:23 +00:00
|
|
|
sb_start &= ~(sector_t)(4*2-1);
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
|
|
|
case 1:
|
2008-07-11 12:02:23 +00:00
|
|
|
sb_start = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
|
|
|
case 2:
|
2008-07-11 12:02:23 +00:00
|
|
|
sb_start = 8;
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2008-07-11 12:02:23 +00:00
|
|
|
rdev->sb_start = sb_start;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-09-09 23:23:53 +00:00
|
|
|
/* superblock is rarely larger than 1K, but it can be larger,
|
|
|
|
* and it is safe to read 4k, so we do that
|
|
|
|
*/
|
|
|
|
ret = read_disk_sb(rdev, 4096);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (ret) return ret;
|
|
|
|
|
2011-07-27 01:00:36 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (sb->magic != cpu_to_le32(MD_SB_MAGIC) ||
|
|
|
|
sb->major_version != cpu_to_le32(1) ||
|
|
|
|
le32_to_cpu(sb->max_dev) > (4096-256)/2 ||
|
2008-07-11 12:02:23 +00:00
|
|
|
le64_to_cpu(sb->super_offset) != rdev->sb_start ||
|
2005-09-09 23:23:51 +00:00
|
|
|
(le32_to_cpu(sb->feature_map) & ~MD_FEATURE_ALL) != 0)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (calc_sb_1_csum(sb) != sb->sb_csum) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: invalid superblock checksum on %pg\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
if (le64_to_cpu(sb->data_size) < 10) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: data_size too small on %pg\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2012-05-20 23:27:00 +00:00
|
|
|
if (sb->pad0 ||
|
|
|
|
sb->pad3[0] ||
|
|
|
|
memcmp(sb->pad3, sb->pad3+1, sizeof(sb->pad3) - sizeof(sb->pad3[1])))
|
|
|
|
/* Some padding is non-zero, might be a new feature */
|
|
|
|
return -EINVAL;
|
2007-05-09 09:35:36 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->preferred_minor = 0xffff;
|
|
|
|
rdev->data_offset = le64_to_cpu(sb->data_offset);
|
2012-05-20 23:27:00 +00:00
|
|
|
rdev->new_data_offset = rdev->data_offset;
|
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE) &&
|
|
|
|
(le32_to_cpu(sb->feature_map) & MD_FEATURE_NEW_OFFSET))
|
|
|
|
rdev->new_data_offset += (s32)le32_to_cpu(sb->new_offset);
|
2006-01-06 08:20:52 +00:00
|
|
|
atomic_set(&rdev->corrected_errors, le32_to_cpu(sb->cnt_corrected_read));
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-09-09 23:23:53 +00:00
|
|
|
rdev->sb_size = le32_to_cpu(sb->max_dev) * 2 + 256;
|
2009-05-22 21:17:49 +00:00
|
|
|
bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
|
2005-09-09 23:23:53 +00:00
|
|
|
if (rdev->sb_size & bmask)
|
2008-03-04 22:29:31 +00:00
|
|
|
rdev->sb_size = (rdev->sb_size | bmask) + 1;
|
|
|
|
|
|
|
|
if (minor_version
|
2008-07-11 12:02:23 +00:00
|
|
|
&& rdev->data_offset < sb_start + (rdev->sb_size/512))
|
2008-03-04 22:29:31 +00:00
|
|
|
return -EINVAL;
|
2012-05-20 23:27:00 +00:00
|
|
|
if (minor_version
|
|
|
|
&& rdev->new_data_offset < sb_start + (rdev->sb_size/512))
|
|
|
|
return -EINVAL;
|
2005-09-09 23:23:53 +00:00
|
|
|
|
2006-07-10 11:44:14 +00:00
|
|
|
if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
|
|
|
|
rdev->desc_nr = -1;
|
|
|
|
else
|
|
|
|
rdev->desc_nr = le32_to_cpu(sb->dev_number);
|
|
|
|
|
2011-07-28 01:31:47 +00:00
|
|
|
if (!rdev->bb_page) {
|
|
|
|
rdev->bb_page = alloc_page(GFP_KERNEL);
|
|
|
|
if (!rdev->bb_page)
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
|
|
|
|
rdev->badblocks.count == 0) {
|
|
|
|
/* need to load the bad block list.
|
|
|
|
* Currently we limit it to one page.
|
|
|
|
*/
|
|
|
|
s32 offset;
|
|
|
|
sector_t bb_sector;
|
2019-04-04 16:56:12 +00:00
|
|
|
__le64 *bbp;
|
2011-07-28 01:31:47 +00:00
|
|
|
int i;
|
|
|
|
int sectors = le16_to_cpu(sb->bblog_size);
|
|
|
|
if (sectors > (PAGE_SIZE / 512))
|
|
|
|
return -EINVAL;
|
|
|
|
offset = le32_to_cpu(sb->bblog_offset);
|
|
|
|
if (offset == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
bb_sector = (long long)offset;
|
|
|
|
if (!sync_page_io(rdev, bb_sector, sectors << 9,
|
2022-07-14 18:06:57 +00:00
|
|
|
rdev->bb_page, REQ_OP_READ, true))
|
2011-07-28 01:31:47 +00:00
|
|
|
return -EIO;
|
2019-04-04 16:56:12 +00:00
|
|
|
bbp = (__le64 *)page_address(rdev->bb_page);
|
2011-07-28 01:31:47 +00:00
|
|
|
rdev->badblocks.shift = sb->bblog_shift;
|
|
|
|
for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) {
|
|
|
|
u64 bb = le64_to_cpu(*bbp);
|
|
|
|
int count = bb & (0x3ff);
|
|
|
|
u64 sector = bb >> 10;
|
|
|
|
sector <<= sb->bblog_shift;
|
|
|
|
count <<= sb->bblog_shift;
|
|
|
|
if (bb + 1 == 0)
|
|
|
|
break;
|
2015-12-25 02:20:34 +00:00
|
|
|
if (badblocks_set(&rdev->badblocks, sector, count, 1))
|
2011-07-28 01:31:47 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2013-04-24 01:42:44 +00:00
|
|
|
} else if (sb->bblog_offset != 0)
|
|
|
|
rdev->badblocks.shift = 0;
|
2011-07-28 01:31:47 +00:00
|
|
|
|
2017-08-16 15:13:45 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) &
|
|
|
|
(MD_FEATURE_PPL | MD_FEATURE_MULTIPLE_PPLS))) {
|
2017-03-09 08:59:57 +00:00
|
|
|
rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
|
|
|
|
rdev->ppl.size = le16_to_cpu(sb->ppl.size);
|
|
|
|
rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
|
|
|
|
}
|
|
|
|
|
2019-09-09 06:52:29 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
|
|
|
|
sb->level != 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2019-10-30 10:47:02 +00:00
|
|
|
/* not spare disk, or LEVEL_MULTIPATH */
|
|
|
|
if (sb->level == cpu_to_le32(LEVEL_MULTIPATH) ||
|
|
|
|
(rdev->desc_nr >= 0 &&
|
|
|
|
rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
|
|
|
|
(le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
|
|
|
|
le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL)))
|
|
|
|
spare_disk = false;
|
md: no longer compare spare disk superblock events in super_load
We have a test case as follows:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When we re-add /dev/sda to the array, it starts recovery.
After the other two disks in md1 are taken offline, the recovery is
interrupted and superblock updates cannot be written
to the offline disks, while the spare disk (/dev/sda) can continue
to update its superblock.
After stopping the array and assembling it again, the array
fails to run, with the following kernel messages:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have a smaller 'sb->events' value than
sda, they are kicked from the array. However, the only
remaining disk, sda, was in the 'spare' state before the stop and cannot be
added to the conf->mirrors[] array. In the end, the raid array fails to
assemble and run.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, we do not compare superblock events when the device is
a spare disk, the same as in validate_super.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 08:00:03 +00:00
|
|
|
|
2008-04-28 09:15:49 +00:00
|
|
|
if (!refdev) {
|
2019-10-30 10:47:02 +00:00
|
|
|
if (!spare_disk)
|
2019-10-16 08:00:03 +00:00
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
2008-04-28 09:15:49 +00:00
|
|
|
} else {
|
2005-04-16 22:20:36 +00:00
|
|
|
__u64 ev1, ev2;
|
2011-07-27 01:00:36 +00:00
|
|
|
struct mdp_superblock_1 *refsb = page_address(refdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (memcmp(sb->set_uuid, refsb->set_uuid, 16) != 0 ||
|
|
|
|
sb->level != refsb->level ||
|
|
|
|
sb->layout != refsb->layout ||
|
|
|
|
sb->chunksize != refsb->chunksize) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg has strangely different superblock to %pg\n",
|
|
|
|
rdev->bdev,
|
|
|
|
refdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
ev1 = le64_to_cpu(sb->events);
|
|
|
|
ev2 = le64_to_cpu(refsb->events);
|
|
|
|
|
2019-10-30 10:47:02 +00:00
|
|
|
if (!spare_disk && ev1 > ev2)
|
2006-02-03 11:03:41 +00:00
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2021-10-18 10:11:06 +00:00
|
|
|
if (minor_version)
|
|
|
|
sectors = bdev_nr_sectors(rdev->bdev) - rdev->data_offset;
|
|
|
|
else
|
2012-05-20 23:27:00 +00:00
|
|
|
sectors = rdev->sb_start;
|
|
|
|
if (sectors < le64_to_cpu(sb->data_size))
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
2009-03-31 03:33:13 +00:00
|
|
|
rdev->sectors = le64_to_cpu(sb->data_size);
|
2006-02-03 11:03:41 +00:00
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-07-27 01:00:36 +00:00
|
|
|
struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
|
2006-06-26 07:27:56 +00:00
|
|
|
__u64 ev1 = le64_to_cpu(sb->events);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-06-22 00:17:25 +00:00
|
|
|
rdev->raid_disk = -1;
|
2008-02-06 09:39:54 +00:00
|
|
|
clear_bit(Faulty, &rdev->flags);
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2013-12-11 23:13:33 +00:00
|
|
|
clear_bit(Bitmap_sync, &rdev->flags);
|
2008-02-06 09:39:54 +00:00
|
|
|
clear_bit(WriteMostly, &rdev->flags);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->raid_disks == 0) {
|
|
|
|
mddev->major_version = 1;
|
|
|
|
mddev->patch_version = 0;
|
2008-02-06 09:39:51 +00:00
|
|
|
mddev->external = 0;
|
2009-06-17 22:45:01 +00:00
|
|
|
mddev->chunk_sectors = le32_to_cpu(sb->chunksize);
|
2015-12-20 23:51:01 +00:00
|
|
|
mddev->ctime = le64_to_cpu(sb->ctime);
|
|
|
|
mddev->utime = le64_to_cpu(sb->utime);
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->level = le32_to_cpu(sb->level);
|
2006-01-06 08:20:51 +00:00
|
|
|
mddev->clevel[0] = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->layout = le32_to_cpu(sb->layout);
|
|
|
|
mddev->raid_disks = le32_to_cpu(sb->raid_disks);
|
2009-03-31 03:33:13 +00:00
|
|
|
mddev->dev_sectors = le64_to_cpu(sb->size);
|
2006-06-26 07:27:56 +00:00
|
|
|
mddev->events = ev1;
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.space = 0;
|
|
|
|
/* Default location for bitmap is 1K after superblock
|
|
|
|
* using 3K - total of 4K
|
|
|
|
*/
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.default_offset = 1024 >> 9;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.default_space = (4096-1024) >> 9;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
|
|
|
|
memcpy(mddev->uuid, sb->set_uuid, 16);
|
|
|
|
|
|
|
|
mddev->max_disks = (4096-256)/2;
|
2005-06-22 00:17:27 +00:00
|
|
|
|
2005-09-09 23:23:51 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.file == NULL) {
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset =
|
|
|
|
(__s32)le32_to_cpu(sb->bitmap_offset);
|
2012-05-22 03:55:07 +00:00
|
|
|
/* Metadata doesn't record how much space is available.
|
|
|
|
* For 1.0, we assume we can use up to the superblock
|
|
|
|
* if before, else to 4K beyond superblock.
|
|
|
|
* For others, assume no change is possible.
|
|
|
|
*/
|
|
|
|
if (mddev->minor_version > 0)
|
|
|
|
mddev->bitmap_info.space = 0;
|
|
|
|
else if (mddev->bitmap_info.offset > 0)
|
|
|
|
mddev->bitmap_info.space =
|
|
|
|
8 - mddev->bitmap_info.offset;
|
|
|
|
else
|
|
|
|
mddev->bitmap_info.space =
|
|
|
|
-mddev->bitmap_info.offset;
|
|
|
|
}
|
2007-05-09 09:35:36 +00:00
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
|
|
|
|
mddev->reshape_position = le64_to_cpu(sb->reshape_position);
|
|
|
|
mddev->delta_disks = le32_to_cpu(sb->delta_disks);
|
|
|
|
mddev->new_level = le32_to_cpu(sb->new_level);
|
|
|
|
mddev->new_layout = le32_to_cpu(sb->new_layout);
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = le32_to_cpu(sb->new_chunk);
|
2012-05-20 23:27:00 +00:00
|
|
|
if (mddev->delta_disks < 0 ||
|
|
|
|
(mddev->delta_disks == 0 &&
|
|
|
|
(le32_to_cpu(sb->feature_map)
|
|
|
|
& MD_FEATURE_RESHAPE_BACKWARDS)))
|
|
|
|
mddev->reshape_backwards = 1;
|
2006-03-27 09:18:11 +00:00
|
|
|
} else {
|
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
mddev->delta_disks = 0;
|
|
|
|
mddev->new_level = mddev->level;
|
|
|
|
mddev->new_layout = mddev->layout;
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
2006-03-27 09:18:11 +00:00
|
|
|
}
|
|
|
|
|
2019-09-09 06:52:29 +00:00
|
|
|
if (mddev->level == 0 &&
|
|
|
|
!(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
|
|
|
|
mddev->layout = -1;
|
|
|
|
|
2016-08-19 22:34:01 +00:00
|
|
|
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
|
2016-01-06 22:37:13 +00:00
|
|
|
set_bit(MD_HAS_JOURNAL, &mddev->flags);
|
2017-03-09 08:59:57 +00:00
|
|
|
|
2017-08-16 15:13:45 +00:00
|
|
|
if (le32_to_cpu(sb->feature_map) &
|
|
|
|
(MD_FEATURE_PPL | MD_FEATURE_MULTIPLE_PPLS)) {
|
2017-03-09 08:59:57 +00:00
|
|
|
if (le32_to_cpu(sb->feature_map) &
|
|
|
|
(MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL))
|
|
|
|
return -EINVAL;
|
2017-08-16 15:13:45 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) &&
|
|
|
|
(le32_to_cpu(sb->feature_map) &
|
|
|
|
MD_FEATURE_MULTIPLE_PPLS))
|
|
|
|
return -EINVAL;
|
2017-03-09 08:59:57 +00:00
|
|
|
set_bit(MD_HAS_PPL, &mddev->flags);
|
|
|
|
}
|
2005-06-22 00:17:25 +00:00
|
|
|
} else if (mddev->pers == NULL) {
|
2010-05-18 00:17:09 +00:00
|
|
|
/* Insist on a good event counter while assembling, except for
|
|
|
|
* spares (which don't need an event count) */
|
2005-04-16 22:20:36 +00:00
|
|
|
++ev1;
|
2010-05-18 00:17:09 +00:00
|
|
|
if (rdev->desc_nr >= 0 &&
|
|
|
|
rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
|
2015-10-09 04:54:11 +00:00
|
|
|
(le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
|
|
|
|
le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
|
2010-05-18 00:17:09 +00:00
|
|
|
if (ev1 < mddev->events)
|
|
|
|
return -EINVAL;
|
2005-06-22 00:17:25 +00:00
|
|
|
} else if (mddev->bitmap) {
|
|
|
|
/* If adding to array with a bitmap, then we can accept an
|
|
|
|
* older device, but not too old.
|
|
|
|
*/
|
|
|
|
if (ev1 < mddev->bitmap->events_cleared)
|
|
|
|
return 0;
|
2013-12-11 23:13:33 +00:00
|
|
|
if (ev1 < mddev->events)
|
|
|
|
set_bit(Bitmap_sync, &rdev->flags);
|
2006-06-26 07:27:56 +00:00
|
|
|
} else {
|
|
|
|
if (ev1 < mddev->events)
|
|
|
|
/* just a hot-add of a new device, leave raid_disk at -1 */
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->level != LEVEL_MULTIPATH) {
|
|
|
|
int role;
|
2009-08-03 00:59:56 +00:00
|
|
|
if (rdev->desc_nr < 0 ||
|
|
|
|
rdev->desc_nr >= le32_to_cpu(sb->max_dev)) {
|
2015-08-13 21:31:54 +00:00
|
|
|
role = MD_DISK_ROLE_SPARE;
|
2009-08-03 00:59:56 +00:00
|
|
|
rdev->desc_nr = -1;
|
|
|
|
} else
|
|
|
|
role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
|
2005-04-16 22:20:36 +00:00
|
|
|
switch(role) {
|
2015-08-13 21:31:54 +00:00
|
|
|
case MD_DISK_ROLE_SPARE: /* spare */
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
2015-08-13 21:31:54 +00:00
|
|
|
case MD_DISK_ROLE_FAULTY: /* faulty */
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(Faulty, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
2015-08-13 21:31:55 +00:00
|
|
|
case MD_DISK_ROLE_JOURNAL: /* journal device */
|
|
|
|
if (!(le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)) {
|
|
|
|
/* journal device without journal feature */
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: journal device provided without journal feature, ignoring the device\n");
|
2015-08-13 21:31:55 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
set_bit(Journal, &rdev->flags);
|
2015-08-13 21:31:56 +00:00
|
|
|
rdev->journal_tail = le64_to_cpu(sb->journal_tail);
|
2015-12-18 04:19:16 +00:00
|
|
|
rdev->raid_disk = 0;
|
2015-08-13 21:31:55 +00:00
|
|
|
break;
|
2005-04-16 22:20:36 +00:00
|
|
|
default:
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However, updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully up-to-date devices
records that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously they don't (some have old event counts), so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong not to update all the metadata together. Doing that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
rdev->saved_raid_disk = role;
|
2006-06-26 07:27:40 +00:00
|
|
|
if ((le32_to_cpu(sb->feature_map) &
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However updates some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
record that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously it doesn't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Do that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and records that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
MD_FEATURE_RECOVERY_OFFSET)) {
|
2006-06-26 07:27:40 +00:00
|
|
|
rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
|
2013-12-09 01:04:56 +00:00
|
|
|
if (!(le32_to_cpu(sb->feature_map) &
|
|
|
|
MD_FEATURE_RECOVERY_BITMAP))
|
|
|
|
rdev->saved_raid_disk = -1;
|
2019-07-24 09:09:20 +00:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If the array is FROZEN, then the device can't
|
|
|
|
* be in_sync with rest of array.
|
|
|
|
*/
|
|
|
|
if (!test_bit(MD_RECOVERY_FROZEN,
|
|
|
|
&mddev->recovery))
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->raid_disk = role;
|
|
|
|
break;
|
|
|
|
}
|
2005-09-09 23:23:45 +00:00
|
|
|
if (sb->devflags & WriteMostly1)
|
|
|
|
set_bit(WriteMostly, &rdev->flags);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (sb->devflags & FailFast1)
|
|
|
|
set_bit(FailFast, &rdev->flags);
|
2011-12-22 23:17:51 +00:00
|
|
|
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_REPLACEMENT)
|
|
|
|
set_bit(Replacement, &rdev->flags);
|
2005-06-22 00:17:25 +00:00
|
|
|
} else /* MULTIPATH are always insync */
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2005-06-22 00:17:25 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
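The role handling at the end of the validate step can be sketched in plain user-space C as follows. This is a minimal illustration, not the kernel code: `struct dev_state`, `apply_role()` and the two `FEATURE_*` bit values are hypothetical stand-ins, and setting `saved_raid_disk` to the role first is an assumption taken from the commit message (which says that setting moved into the validate_super methods).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real MD_FEATURE_* constants are
 * defined by the kernel's md headers. */
#define FEATURE_RECOVERY_OFFSET 0x002u
#define FEATURE_RECOVERY_BITMAP 0x080u

/* Hypothetical, reduced view of the per-device state touched here. */
struct dev_state {
	int raid_disk;            /* slot the device occupies */
	int saved_raid_disk;      /* slot it is partly in sync with, or -1 */
	bool in_sync;             /* fully in sync with the array */
	uint64_t recovery_offset; /* how far recovery had progressed */
};

/* Apply an on-disk role to a device, following the branch structure
 * shown in the listing above. */
static void apply_role(struct dev_state *d, uint32_t feature_map, int role,
		       uint64_t sb_recovery_offset, bool array_frozen)
{
	assert(role >= 0);

	/* Assumption: the role is remembered as the previous slot first. */
	d->saved_raid_disk = role;

	if (feature_map & FEATURE_RECOVERY_OFFSET) {
		/* Device was mid-recovery: resume from the recorded offset. */
		d->recovery_offset = sb_recovery_offset;
		/* Without the bitmap flag, a bitmap-guided recovery is not
		 * allowed, so forget the previous slot. */
		if (!(feature_map & FEATURE_RECOVERY_BITMAP))
			d->saved_raid_disk = -1;
	} else if (!array_frozen) {
		/* No recovery recorded and the array is not frozen. */
		d->in_sync = true;
	}
	d->raid_disk = role;
}
```

A device whose superblock carries both flags keeps `saved_raid_disk` equal to its role, which is what allows a bitmap-guided recovery to resume after a restart.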
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct mdp_superblock_1 *sb;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev2;
|
2005-04-16 22:20:36 +00:00
|
|
|
int max_dev, i;
|
|
|
|
/* make rdev->sb match mddev and rdev data. */
|
|
|
|
|
2011-07-27 01:00:36 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
sb->feature_map = 0;
|
|
|
|
sb->pad0 = 0;
|
2006-06-26 07:27:40 +00:00
|
|
|
sb->recovery_offset = cpu_to_le64(0);
|
2005-04-16 22:20:36 +00:00
|
|
|
memset(sb->pad3, 0, sizeof(sb->pad3));
|
|
|
|
|
|
|
|
sb->utime = cpu_to_le64((__u64)mddev->utime);
|
|
|
|
sb->events = cpu_to_le64(mddev->events);
|
|
|
|
if (mddev->in_sync)
|
|
|
|
sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
|
2015-09-02 20:49:50 +00:00
|
|
|
else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
|
|
|
|
sb->resync_offset = cpu_to_le64(MaxSector);
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
|
|
|
sb->resync_offset = cpu_to_le64(0);
|
|
|
|
|
2006-10-21 17:24:08 +00:00
|
|
|
sb->cnt_corrected_read = cpu_to_le32(atomic_read(&rdev->corrected_errors));
|
2006-01-06 08:20:52 +00:00
|
|
|
|
2006-02-02 22:28:04 +00:00
|
|
|
sb->raid_disks = cpu_to_le32(mddev->raid_disks);
|
2009-03-31 03:33:13 +00:00
|
|
|
sb->size = cpu_to_le64(mddev->dev_sectors);
|
2009-06-17 22:45:01 +00:00
|
|
|
sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
|
2009-05-25 23:40:59 +00:00
|
|
|
sb->level = cpu_to_le32(mddev->level);
|
|
|
|
sb->layout = cpu_to_le32(mddev->layout);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (test_bit(FailFast, &rdev->flags))
|
|
|
|
sb->devflags |= FailFast1;
|
|
|
|
else
|
|
|
|
sb->devflags &= ~FailFast1;
|
2006-02-02 22:28:04 +00:00
|
|
|
|
2011-08-25 04:43:08 +00:00
|
|
|
if (test_bit(WriteMostly, &rdev->flags))
|
|
|
|
sb->devflags |= WriteMostly1;
|
|
|
|
else
|
|
|
|
sb->devflags &= ~WriteMostly1;
|
2012-05-20 23:27:00 +00:00
|
|
|
sb->data_offset = cpu_to_le64(rdev->data_offset);
|
|
|
|
sb->data_size = cpu_to_le64(rdev->sectors);
|
2011-08-25 04:43:08 +00:00
|
|
|
|
2009-12-14 01:49:52 +00:00
|
|
|
if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
|
|
|
|
sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
|
2005-09-09 23:23:51 +00:00
|
|
|
sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
|
2005-06-22 00:17:27 +00:00
|
|
|
}
|
2006-06-26 07:27:40 +00:00
|
|
|
|
2015-10-09 04:54:12 +00:00
|
|
|
if (rdev->raid_disk >= 0 && !test_bit(Journal, &rdev->flags) &&
|
2009-03-31 03:33:13 +00:00
|
|
|
!test_bit(In_sync, &rdev->flags)) {
|
2009-12-14 01:50:06 +00:00
|
|
|
sb->feature_map |=
|
|
|
|
cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
|
|
|
|
sb->recovery_offset =
|
|
|
|
cpu_to_le64(rdev->recovery_offset);
|
2013-12-09 01:04:56 +00:00
|
|
|
if (rdev->saved_raid_disk >= 0 && mddev->bitmap)
|
|
|
|
sb->feature_map |=
|
|
|
|
cpu_to_le32(MD_FEATURE_RECOVERY_BITMAP);
|
2006-06-26 07:27:40 +00:00
|
|
|
}
|
2015-08-13 21:31:56 +00:00
|
|
|
/* Note: recovery_offset and journal_tail share space */
|
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
sb->journal_tail = cpu_to_le64(rdev->journal_tail);
|
2011-12-22 23:17:51 +00:00
|
|
|
if (test_bit(Replacement, &rdev->flags))
|
|
|
|
sb->feature_map |=
|
|
|
|
cpu_to_le32(MD_FEATURE_REPLACEMENT);
|
2006-06-26 07:27:40 +00:00
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
if (mddev->reshape_position != MaxSector) {
|
|
|
|
sb->feature_map |= cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE);
|
|
|
|
sb->reshape_position = cpu_to_le64(mddev->reshape_position);
|
|
|
|
sb->new_layout = cpu_to_le32(mddev->new_layout);
|
|
|
|
sb->delta_disks = cpu_to_le32(mddev->delta_disks);
|
|
|
|
sb->new_level = cpu_to_le32(mddev->new_level);
|
2009-06-17 22:45:27 +00:00
|
|
|
sb->new_chunk = cpu_to_le32(mddev->new_chunk_sectors);
|
2012-05-20 23:27:00 +00:00
|
|
|
if (mddev->delta_disks == 0 &&
|
|
|
|
mddev->reshape_backwards)
|
|
|
|
sb->feature_map
|
|
|
|
|= cpu_to_le32(MD_FEATURE_RESHAPE_BACKWARDS);
|
2012-05-20 23:27:00 +00:00
|
|
|
if (rdev->new_data_offset != rdev->data_offset) {
|
|
|
|
sb->feature_map
|
|
|
|
|= cpu_to_le32(MD_FEATURE_NEW_OFFSET);
|
|
|
|
sb->new_offset = cpu_to_le32((__u32)(rdev->new_data_offset
|
|
|
|
- rdev->data_offset));
|
|
|
|
}
|
2006-03-27 09:18:11 +00:00
|
|
|
}
|
2005-06-22 00:17:27 +00:00
|
|
|
|
2015-08-18 21:35:54 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
sb->feature_map |= cpu_to_le32(MD_FEATURE_CLUSTERED);
|
|
|
|
|
2011-07-28 01:31:47 +00:00
|
|
|
if (rdev->badblocks.count == 0)
|
|
|
|
/* Nothing to do for bad blocks*/ ;
|
|
|
|
else if (sb->bblog_offset == 0)
|
|
|
|
/* Cannot record bad blocks on this device */
|
|
|
|
md_error(mddev, rdev);
|
|
|
|
else {
|
|
|
|
struct badblocks *bb = &rdev->badblocks;
|
2019-04-04 16:56:13 +00:00
|
|
|
__le64 *bbp = (__le64 *)page_address(rdev->bb_page);
|
2011-07-28 01:31:47 +00:00
|
|
|
u64 *p = bb->page;
|
|
|
|
sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS);
|
|
|
|
if (bb->changed) {
|
|
|
|
unsigned seq;
|
|
|
|
|
|
|
|
retry:
|
|
|
|
seq = read_seqbegin(&bb->lock);
|
|
|
|
|
|
|
|
memset(bbp, 0xff, PAGE_SIZE);
|
|
|
|
|
|
|
|
for (i = 0 ; i < bb->count ; i++) {
|
2012-11-08 00:56:27 +00:00
|
|
|
u64 internal_bb = p[i];
|
2011-07-28 01:31:47 +00:00
|
|
|
u64 store_bb = ((BB_OFFSET(internal_bb) << 10)
|
|
|
|
| BB_LEN(internal_bb));
|
2012-11-08 00:56:27 +00:00
|
|
|
bbp[i] = cpu_to_le64(store_bb);
|
2011-07-28 01:31:47 +00:00
|
|
|
}
|
2012-03-19 01:46:41 +00:00
|
|
|
bb->changed = 0;
|
2011-07-28 01:31:47 +00:00
|
|
|
if (read_seqretry(&bb->lock, seq))
|
|
|
|
goto retry;
|
|
|
|
|
|
|
|
bb->sector = (rdev->sb_start +
|
|
|
|
(int)le32_to_cpu(sb->bblog_offset));
|
|
|
|
bb->size = le16_to_cpu(sb->bblog_size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
max_dev = 0;
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev2, mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (rdev2->desc_nr+1 > max_dev)
|
|
|
|
max_dev = rdev2->desc_nr+1;
|
2007-05-23 20:58:10 +00:00
|
|
|
|
2009-08-03 00:59:57 +00:00
|
|
|
if (max_dev > le32_to_cpu(sb->max_dev)) {
|
|
|
|
int bmask;
|
2007-05-23 20:58:10 +00:00
|
|
|
sb->max_dev = cpu_to_le32(max_dev);
|
2009-08-03 00:59:57 +00:00
|
|
|
rdev->sb_size = max_dev * 2 + 256;
|
|
|
|
bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
|
|
|
|
if (rdev->sb_size & bmask)
|
|
|
|
rdev->sb_size = (rdev->sb_size | bmask) + 1;
|
2010-09-08 06:48:17 +00:00
|
|
|
} else
|
|
|
|
max_dev = le32_to_cpu(sb->max_dev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
for (i=0; i<max_dev;i++)
|
2017-06-12 02:45:55 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_SPARE);
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2015-10-09 04:54:09 +00:00
|
|
|
if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
|
|
|
|
sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2017-03-09 08:59:57 +00:00
|
|
|
if (test_bit(MD_HAS_PPL, &mddev->flags)) {
|
2017-08-16 15:13:45 +00:00
|
|
|
if (test_bit(MD_HAS_MULTIPLE_PPLS, &mddev->flags))
|
|
|
|
sb->feature_map |=
|
|
|
|
cpu_to_le32(MD_FEATURE_MULTIPLE_PPLS);
|
|
|
|
else
|
|
|
|
sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
|
2017-03-09 08:59:57 +00:00
|
|
|
sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
|
|
|
|
sb->ppl.size = cpu_to_le16(rdev->ppl.size);
|
|
|
|
}
|
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev2, mddev) {
|
2005-04-16 22:20:36 +00:00
|
|
|
i = rdev2->desc_nr;
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev2->flags))
|
2015-08-13 21:31:54 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_FAULTY);
|
2005-11-09 05:39:31 +00:00
|
|
|
else if (test_bit(In_sync, &rdev2->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
|
2015-10-09 04:54:09 +00:00
|
|
|
else if (test_bit(Journal, &rdev2->flags))
|
2015-08-13 21:31:55 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_JOURNAL);
|
2009-12-14 01:50:06 +00:00
|
|
|
else if (rdev2->raid_disk >= 0)
|
2006-06-26 07:27:40 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
2015-08-13 21:31:54 +00:00
|
|
|
sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_SPARE);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
sb->sb_csum = calc_sb_1_csum(sb);
|
|
|
|
}
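Two small computations inside super_1_sync() are easy to check in isolation: packing a bad-block entry for the on-disk log, and rounding the superblock size up to the logical block size. The sketch below is user-space C with hypothetical helper names (`pack_bad_block`, `round_up_sb_size`). It takes the start sector and length as separate arguments instead of using the kernel's BB_OFFSET()/BB_LEN() accessors, and the range checks are assumptions added for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one bad-block entry the way the loop above stores it:
 * start sector in the upper bits, length in the low 10 bits.
 * The caller is expected to convert the result to little-endian. */
static uint64_t pack_bad_block(uint64_t start_sector, uint32_t len_sectors)
{
	/* Assumed limits for this sketch: length fits in 10 bits and the
	 * start sector fits in the remaining 54 bits. */
	assert(len_sectors > 0 && len_sectors < 1024);
	assert(start_sector < ((uint64_t)1 << 54));

	return (start_sector << 10) | len_sectors;
}

/* Superblock size for max_dev role slots, rounded up to a multiple of
 * the logical block size (which must be a power of two). */
static uint32_t round_up_sb_size(uint32_t max_dev, uint32_t logical_block_size)
{
	assert(logical_block_size != 0 &&
	       (logical_block_size & (logical_block_size - 1)) == 0);

	uint32_t size = max_dev * 2 + 256;   /* 2 bytes per role + header */
	uint32_t bmask = logical_block_size - 1;

	if (size & bmask)
		size = (size | bmask) + 1;   /* round up to next multiple */
	return size;
}
```

The `(size | bmask) + 1` idiom sets every bit below the block size and then carries into the next multiple, so it is only applied when the size is not already aligned.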
|
|
|
|
|
2020-06-30 07:55:36 +00:00
|
|
|
static sector_t super_1_choose_bm_space(sector_t dev_size)
|
|
|
|
{
|
|
|
|
sector_t bm_space;
|
|
|
|
|
|
|
|
/* if the device is bigger than 8Gig, save 64k for bitmap
|
|
|
|
* usage, if bigger than 200Gig, save 128k
|
|
|
|
*/
|
|
|
|
if (dev_size < 64*2)
|
|
|
|
bm_space = 0;
|
|
|
|
else if (dev_size - 64*2 >= 200*1024*1024*2)
|
|
|
|
bm_space = 128*2;
|
|
|
|
else if (dev_size - 4*2 > 8*1024*1024*2)
|
|
|
|
bm_space = 64*2;
|
|
|
|
else
|
|
|
|
bm_space = 4*2;
|
|
|
|
return bm_space;
|
|
|
|
}
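The thresholds in super_1_choose_bm_space() are expressed in 512-byte sectors, which makes them hard to read at a glance. The following user-space copy (the name `choose_bm_space` and the `sector_t` typedef are stand-ins for this example) keeps the same comparisons so the boundaries can be tested directly. The final assertion is an addition for the sketch, not part of the kernel function.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;  /* counts 512-byte sectors */

/* Sectors to reserve for the bitmap, given the device size in sectors.
 * "N*2" converts KiB to sectors; "N*1024*1024*2" converts GiB to sectors. */
static sector_t choose_bm_space(sector_t dev_size)
{
	sector_t bm_space;

	if (dev_size < 64*2)
		bm_space = 0;                 /* too small for any bitmap */
	else if (dev_size - 64*2 >= 200*1024*1024*2)
		bm_space = 128*2;             /* around 200 GiB and up: 128 KiB */
	else if (dev_size - 4*2 > 8*1024*1024*2)
		bm_space = 64*2;              /* above about 8 GiB: 64 KiB */
	else
		bm_space = 4*2;               /* otherwise: 4 KiB */

	/* Sanity check added for this sketch only. */
	assert(bm_space <= dev_size);
	return bm_space;
}
```

For instance, a 1 GiB device (2097152 sectors) reserves 8 sectors, while a device just over 200 GiB reserves 256 sectors.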
|
|
|
|
|
2008-06-27 22:31:46 +00:00
|
|
|
static unsigned long long
|
2011-10-11 05:45:26 +00:00
|
|
|
super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
|
2008-06-27 22:31:46 +00:00
|
|
|
{
|
|
|
|
struct mdp_superblock_1 *sb;
|
2008-07-21 04:42:12 +00:00
|
|
|
sector_t max_sectors;
|
2009-03-31 03:33:13 +00:00
|
|
|
if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
|
2008-06-27 22:31:46 +00:00
|
|
|
return 0; /* component must fit device */
|
2012-05-20 23:27:00 +00:00
|
|
|
if (rdev->data_offset != rdev->new_data_offset)
|
|
|
|
return 0; /* too confusing */
|
2008-07-11 12:02:23 +00:00
|
|
|
if (rdev->sb_start < rdev->data_offset) {
|
2008-06-27 22:31:46 +00:00
|
|
|
/* minor versions 1 and 2; superblock before data */
|
2021-10-18 10:11:06 +00:00
|
|
|
max_sectors = bdev_nr_sectors(rdev->bdev) - rdev->data_offset;
|
2008-07-21 04:42:12 +00:00
|
|
|
if (!num_sectors || num_sectors > max_sectors)
|
|
|
|
num_sectors = max_sectors;
|
2009-12-14 01:49:52 +00:00
|
|
|
} else if (rdev->mddev->bitmap_info.offset) {
|
2008-06-27 22:31:46 +00:00
|
|
|
/* minor version 0 with bitmap we can't move */
|
|
|
|
return 0;
|
|
|
|
} else {
|
|
|
|
/* minor version 0; superblock after data */
|
2020-06-30 07:55:36 +00:00
|
|
|
sector_t sb_start, bm_space;
|
2021-10-18 10:11:06 +00:00
|
|
|
sector_t dev_size = bdev_nr_sectors(rdev->bdev);
|
2020-06-30 07:55:36 +00:00
|
|
|
|
|
|
|
/* 8K is for superblock */
|
|
|
|
sb_start = dev_size - 8*2;
|
2008-07-11 12:02:23 +00:00
|
|
|
sb_start &= ~(sector_t)(4*2 - 1);
|
2020-06-30 07:55:36 +00:00
|
|
|
|
|
|
|
bm_space = super_1_choose_bm_space(dev_size);
|
|
|
|
|
|
|
|
/* Space that can be used to store data must exclude the
|
|
|
|
* superblock, bitmap space and bad block space (4K)
|
|
|
|
*/
|
|
|
|
max_sectors = sb_start - bm_space - 4*2;
|
|
|
|
|
2008-07-21 04:42:12 +00:00
|
|
|
if (!num_sectors || num_sectors > max_sectors)
|
|
|
|
num_sectors = max_sectors;
|
2021-11-16 10:21:35 +00:00
|
|
|
rdev->sb_start = sb_start;
|
2008-06-27 22:31:46 +00:00
|
|
|
}
|
2011-07-27 01:00:36 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
2008-07-21 04:42:12 +00:00
|
|
|
sb->data_size = cpu_to_le64(num_sectors);
|
2017-03-10 03:27:23 +00:00
|
|
|
sb->super_offset = cpu_to_le64(rdev->sb_start);
|
2008-06-27 22:31:46 +00:00
|
|
|
sb->sb_csum = calc_sb_1_csum(sb);
|
2016-11-18 05:16:11 +00:00
|
|
|
do {
|
|
|
|
md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
|
|
|
|
rdev->sb_page);
|
|
|
|
} while (md_super_wait(rdev->mddev) < 0);
|
2010-11-24 05:36:17 +00:00
|
|
|
return num_sectors;
|
2012-05-20 23:27:00 +00:00
|
|
|
|
|
|
|
}
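For the layout with the superblock after the data (the "minor version 0" branch above), the usable size follows from three subtractions and an alignment step. The sketch below reproduces that arithmetic in user-space C. `struct end_layout`, `layout_sb_at_end()` and `clamp_request()` are hypothetical names, and the size guard is an assumption added so the unsigned subtraction cannot wrap.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;  /* counts 512-byte sectors */

struct end_layout {
	sector_t sb_start;     /* where the superblock is placed */
	sector_t max_sectors;  /* largest data area that still fits */
};

/* Place the superblock 8 KiB from the end, aligned down to 4 KiB, then
 * reserve bitmap space and 4 KiB of bad-block space in front of it. */
static struct end_layout layout_sb_at_end(sector_t dev_size, sector_t bm_space)
{
	struct end_layout l;

	/* Guard added for this sketch: device must hold all reserved areas. */
	assert(dev_size >= bm_space + 32);

	l.sb_start = dev_size - 8*2;             /* 8K is for the superblock */
	l.sb_start &= ~(sector_t)(4*2 - 1);      /* align down to 4K */
	l.max_sectors = l.sb_start - bm_space - 4*2;
	return l;
}

/* A request of 0 means "as large as possible"; larger requests are capped. */
static sector_t clamp_request(sector_t num_sectors, sector_t max_sectors)
{
	if (!num_sectors || num_sectors > max_sectors)
		num_sectors = max_sectors;
	return num_sectors;
}
```

With a 1 GiB device and 8 sectors of bitmap space, the superblock lands at sector 2097136 and at most 2097120 sectors remain for data.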
|
|
|
|
|
|
|
|
static int
|
|
|
|
super_1_allow_new_offset(struct md_rdev *rdev,
|
|
|
|
unsigned long long new_offset)
|
|
|
|
{
|
|
|
|
/* All necessary checks on new >= old have been done */
|
|
|
|
struct bitmap *bitmap;
|
|
|
|
if (new_offset >= rdev->data_offset)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* with 1.0 metadata, there is no metadata to tread on
|
|
|
|
* so we can always move back */
|
|
|
|
if (rdev->mddev->minor_version == 0)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* otherwise we must be sure not to step on
|
|
|
|
* any metadata, so stay:
|
|
|
|
* 36K beyond start of superblock
|
|
|
|
* beyond end of badblocks
|
|
|
|
* beyond write-intent bitmap
|
|
|
|
*/
|
|
|
|
if (rdev->sb_start + (32+4)*2 > new_offset)
|
|
|
|
return 0;
|
|
|
|
bitmap = rdev->mddev->bitmap;
|
|
|
|
if (bitmap && !rdev->mddev->bitmap_info.file &&
|
|
|
|
rdev->sb_start + rdev->mddev->bitmap_info.offset +
|
2012-05-22 03:55:10 +00:00
|
|
|
bitmap->storage.file_pages * (PAGE_SIZE>>9) > new_offset)
|
2012-05-20 23:27:00 +00:00
|
|
|
return 0;
|
|
|
|
if (rdev->badblocks.sector + rdev->badblocks.size > new_offset)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
2008-06-27 22:31:46 +00:00
|
|
|
}
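The order of checks in super_1_allow_new_offset() can be summarised as a small predicate. The version below is a simplified user-space sketch: `struct dev_layout` and `may_use_new_offset()` are hypothetical, and the end of the bitmap is passed in precomputed rather than derived from the bitmap's page count as the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;  /* counts 512-byte sectors */

/* Hypothetical, flattened description of one component device. */
struct dev_layout {
	int minor_version;       /* 0 means superblock after the data */
	sector_t sb_start;       /* first sector of the superblock */
	sector_t data_offset;    /* current start of data */
	bool internal_bitmap;    /* bitmap stored on this device */
	sector_t bitmap_end;     /* first sector past the bitmap */
	sector_t badblocks_end;  /* first sector past the bad-block log */
};

/* Return true if the data start may be moved to new_offset. */
static bool may_use_new_offset(const struct dev_layout *l, sector_t new_offset)
{
	assert(l != NULL);

	if (new_offset >= l->data_offset)
		return true;   /* moving forward never hits metadata */
	if (l->minor_version == 0)
		return true;   /* no metadata in front of the data */
	if (l->sb_start + (32 + 4) * 2 > new_offset)
		return false;  /* must stay 36K beyond superblock start */
	if (l->internal_bitmap && l->bitmap_end > new_offset)
		return false;  /* would overlap the write-intent bitmap */
	if (l->badblocks_end > new_offset)
		return false;  /* would overlap the bad-block log */
	return true;
}
```

Only a move towards the start of the device on 1.1 or 1.2 metadata reaches the three overlap checks; every other case is accepted early.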
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-05-05 23:16:09 +00:00
|
|
|
static struct super_type super_types[] = {
|
2005-04-16 22:20:36 +00:00
|
|
|
[0] = {
|
|
|
|
.name = "0.90.0",
|
|
|
|
.owner = THIS_MODULE,
|
2008-06-27 22:31:46 +00:00
|
|
|
.load_super = super_90_load,
|
|
|
|
.validate_super = super_90_validate,
|
|
|
|
.sync_super = super_90_sync,
|
|
|
|
.rdev_size_change = super_90_rdev_size_change,
|
2012-05-20 23:27:00 +00:00
|
|
|
.allow_new_offset = super_90_allow_new_offset,
|
2005-04-16 22:20:36 +00:00
|
|
|
},
|
|
|
|
[1] = {
|
|
|
|
.name = "md-1",
|
|
|
|
.owner = THIS_MODULE,
|
2008-06-27 22:31:46 +00:00
|
|
|
.load_super = super_1_load,
|
|
|
|
.validate_super = super_1_validate,
|
|
|
|
.sync_super = super_1_sync,
|
|
|
|
.rdev_size_change = super_1_rdev_size_change,
|
2012-05-20 23:27:00 +00:00
|
|
|
.allow_new_offset = super_1_allow_new_offset,
|
2005-04-16 22:20:36 +00:00
|
|
|
},
|
|
|
|
};
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static void sync_super(struct mddev *mddev, struct md_rdev *rdev)
|
2011-06-07 22:51:30 +00:00
|
|
|
{
|
|
|
|
if (mddev->sync_super) {
|
|
|
|
mddev->sync_super(mddev, rdev);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
BUG_ON(mddev->major_version >= ARRAY_SIZE(super_types));
|
|
|
|
|
|
|
|
super_types[mddev->major_version].sync_super(mddev, rdev);
|
|
|
|
}
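The super_types[] table and sync_super() together form a version-indexed dispatch with an optional per-array override. A reduced user-space sketch of that pattern is shown below. All names (`struct array_info`, `struct format_ops`, `sync_array()`) are invented for the example, and `assert()` stands in for the kernel's BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced array description. */
struct array_info {
	int major_version;  /* selects the metadata format */
	int synced_by;      /* records which handler ran, for illustration */
};

struct format_ops {
	const char *name;
	void (*sync_super)(struct array_info *a);
};

static void sync_v0(struct array_info *a) { a->synced_by = 0; }
static void sync_v1(struct array_info *a) { a->synced_by = 1; }

/* Designated initializers keep each index tied to its format version. */
static const struct format_ops formats[] = {
	[0] = { .name = "0.90.0", .sync_super = sync_v0 },
	[1] = { .name = "md-1",   .sync_super = sync_v1 },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Prefer an array-specific handler if one is given; otherwise dispatch
 * through the table after checking the index is in range. */
static void sync_array(struct array_info *a,
		       void (*override)(struct array_info *))
{
	if (override) {
		override(a);
		return;
	}
	assert(a->major_version >= 0 &&
	       (size_t)a->major_version < ARRAY_LEN(formats));
	formats[a->major_version].sync_super(a);
}
```

Checking the index before the table lookup matters because the version number comes from on-disk metadata or user input rather than from a compile-time constant.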
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int match_mddev_units(struct mddev *mddev1, struct mddev *mddev2)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev, *rdev2;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-07-21 07:05:25 +00:00
|
|
|
rcu_read_lock();
|
2015-09-04 06:00:35 +00:00
|
|
|
rdev_for_each_rcu(rdev, mddev1) {
|
|
|
|
if (test_bit(Faulty, &rdev->flags) ||
|
|
|
|
test_bit(Journal, &rdev->flags) ||
|
|
|
|
rdev->raid_disk == -1)
|
|
|
|
continue;
|
|
|
|
rdev_for_each_rcu(rdev2, mddev2) {
|
|
|
|
if (test_bit(Faulty, &rdev2->flags) ||
|
|
|
|
test_bit(Journal, &rdev2->flags) ||
|
|
|
|
rdev2->raid_disk == -1)
|
|
|
|
continue;
|
2020-09-03 05:40:58 +00:00
|
|
|
if (rdev->bdev->bd_disk == rdev2->bdev->bd_disk) {
|
2008-07-21 07:05:25 +00:00
|
|
|
rcu_read_unlock();
|
2007-03-01 04:11:35 +00:00
|
|
|
return 1;
|
2008-07-21 07:05:25 +00:00
|
|
|
}
|
2015-09-04 06:00:35 +00:00
|
|
|
}
|
|
|
|
}
|
2008-07-21 07:05:25 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static LIST_HEAD(pending_raid_disks);
|
|
|
|
|
2009-08-03 00:59:47 +00:00
|
|
|
/*
|
|
|
|
* Try to register data integrity profile for an mddev
|
|
|
|
*
|
|
|
|
* This is called when an array is started and after a disk has been kicked
|
|
|
|
* from the array. It only succeeds if all working and active component devices
|
|
|
|
* are integrity capable with matching profiles.
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
int md_integrity_register(struct mddev *mddev)
|
2009-08-03 00:59:47 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev, *reference = NULL;
|
2009-08-03 00:59:47 +00:00
|
|
|
|
|
|
|
if (list_empty(&mddev->disks))
|
|
|
|
return 0; /* nothing to do */
|
2011-06-08 05:10:08 +00:00
|
|
|
if (!mddev->gendisk || blk_get_integrity(mddev->gendisk))
|
|
|
|
return 0; /* shouldn't register, or already is */
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2009-08-03 00:59:47 +00:00
|
|
|
/* skip spares and non-functional disks */
|
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
|
|
|
continue;
|
|
|
|
if (rdev->raid_disk < 0)
|
|
|
|
continue;
|
|
|
|
if (!reference) {
|
|
|
|
/* Use the first rdev as the reference */
|
|
|
|
reference = rdev;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* does this rdev's profile match the reference profile? */
|
|
|
|
if (blk_integrity_compare(reference->bdev->bd_disk,
|
|
|
|
rdev->bdev->bd_disk) < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2011-03-29 00:09:12 +00:00
|
|
|
if (!reference || !bdev_get_integrity(reference->bdev))
|
|
|
|
return 0;
|
2009-08-03 00:59:47 +00:00
|
|
|
/*
|
|
|
|
* All component devices are integrity capable and have matching
|
|
|
|
* profiles, register the common profile for the md device.
|
|
|
|
*/
|
2015-10-21 17:19:49 +00:00
|
|
|
blk_integrity_register(mddev->gendisk,
|
|
|
|
bdev_get_integrity(reference->bdev));
|
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: data integrity enabled on %s\n", mdname(mddev));
|
2021-05-25 09:46:17 +00:00
|
|
|
if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) ||
|
2021-06-03 09:21:06 +00:00
|
|
|
(mddev->level != 1 && mddev->level != 10 &&
|
2023-06-21 16:51:04 +00:00
|
|
|
bioset_integrity_create(&mddev->io_clone_set, BIO_POOL_SIZE))) {
|
2021-06-03 09:21:07 +00:00
|
|
|
/*
|
|
|
|
* No need to handle the failure of bioset_integrity_create,
|
|
|
|
* because the function is called by md_run() -> pers->run(),
|
|
|
|
* and md_run() calls bioset_exit() -> bioset_integrity_free() in
|
|
|
|
* the failure case.
|
|
|
|
*/
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_err("md: failed to create integrity pool for %s\n",
|
2011-03-17 10:11:05 +00:00
|
|
|
mdname(mddev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2009-08-03 00:59:47 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_integrity_register);
|
|
|
|
|
2016-01-14 00:00:07 +00:00
|
|
|
/*
|
|
|
|
* Attempt to add an rdev, but only if it is consistent with the current
|
|
|
|
* integrity profile
|
|
|
|
*/
|
|
|
|
int md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev)
|
2009-03-31 03:27:02 +00:00
|
|
|
{
|
2012-10-11 02:38:58 +00:00
|
|
|
struct blk_integrity *bi_mddev;
|
|
|
|
|
|
|
|
if (!mddev->gendisk)
|
2016-01-14 00:00:07 +00:00
|
|
|
return 0;
|
2012-10-11 02:38:58 +00:00
|
|
|
|
|
|
|
bi_mddev = blk_get_integrity(mddev->gendisk);
|
2009-03-31 03:27:02 +00:00
|
|
|
|
2009-08-03 00:59:47 +00:00
|
|
|
if (!bi_mddev) /* nothing to do */
|
2016-01-14 00:00:07 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (blk_integrity_compare(mddev->gendisk, rdev->bdev->bd_disk) != 0) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_err("%s: incompatible integrity profile for %pg\n",
|
|
|
|
mdname(mddev), rdev->bdev);
|
2016-01-14 00:00:07 +00:00
|
|
|
return -ENXIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2009-03-31 03:27:02 +00:00
|
|
|
}
|
2009-08-03 00:59:47 +00:00
|
|
|
EXPORT_SYMBOL(md_integrity_add_rdev);
|
2009-03-31 03:27:02 +00:00
|
|
|
|
2021-02-01 13:17:20 +00:00
|
|
|
static bool rdev_read_only(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
return bdev_read_only(rdev->bdev) ||
|
|
|
|
(rdev->meta_bdev && bdev_read_only(rdev->meta_bdev));
|
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2007-03-01 04:11:35 +00:00
|
|
|
char b[BDEVNAME_SIZE];
|
2007-03-27 05:32:14 +00:00
|
|
|
int err;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-04-30 07:52:32 +00:00
|
|
|
/* prevent duplicates */
|
|
|
|
if (find_rdev(mddev, rdev->bdev->bd_dev))
|
|
|
|
return -EEXIST;
|
|
|
|
|
2021-02-01 13:17:20 +00:00
|
|
|
if (rdev_read_only(rdev) && mddev->pers)
|
2017-04-12 22:53:48 +00:00
|
|
|
return -EROFS;
|
|
|
|
|
2009-03-31 03:33:13 +00:00
|
|
|
/* make sure rdev->sectors exceeds mddev->dev_sectors */
|
2015-12-20 23:51:02 +00:00
|
|
|
if (!test_bit(Journal, &rdev->flags) &&
|
|
|
|
rdev->sectors &&
|
|
|
|
(mddev->dev_sectors == 0 || rdev->sectors < mddev->dev_sectors)) {
|
2007-05-23 20:58:10 +00:00
|
|
|
if (mddev->pers) {
|
|
|
|
/* Cannot change size, so fail
|
|
|
|
* If mddev->level <= 0, then we don't care
|
|
|
|
* about aligning sizes (e.g. linear)
|
|
|
|
*/
|
|
|
|
if (mddev->level > 0)
|
|
|
|
return -ENOSPC;
|
|
|
|
} else
|
2009-03-31 03:33:13 +00:00
|
|
|
mddev->dev_sectors = rdev->sectors;
|
2006-01-06 08:20:55 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* Verify rdev->desc_nr is unique.
|
|
|
|
* If it is -1, assign a free number, else
|
|
|
|
* check number is not in use
|
|
|
|
*/
|
2014-09-25 07:00:11 +00:00
|
|
|
rcu_read_lock();
|
2005-04-16 22:20:36 +00:00
|
|
|
if (rdev->desc_nr < 0) {
|
|
|
|
int choice = 0;
|
2014-09-25 07:00:11 +00:00
|
|
|
if (mddev->pers)
|
|
|
|
choice = mddev->raid_disks;
|
2015-04-14 15:43:55 +00:00
|
|
|
while (md_find_rdev_nr_rcu(mddev, choice))
|
2005-04-16 22:20:36 +00:00
|
|
|
choice++;
|
|
|
|
rdev->desc_nr = choice;
|
|
|
|
} else {
|
2015-04-14 15:43:55 +00:00
|
|
|
if (md_find_rdev_nr_rcu(mddev, rdev->desc_nr)) {
|
2014-09-25 07:00:11 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EBUSY;
|
2014-09-25 07:00:11 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-09-25 07:00:11 +00:00
|
|
|
rcu_read_unlock();
|
2015-12-20 23:51:02 +00:00
|
|
|
if (!test_bit(Journal, &rdev->flags) &&
|
|
|
|
mddev->max_disks && rdev->desc_nr >= mddev->max_disks) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s: array is limited to %d devices\n",
|
|
|
|
mdname(mddev), mddev->max_disks);
|
2009-02-06 07:02:46 +00:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
2022-07-13 05:53:17 +00:00
|
|
|
snprintf(b, sizeof(b), "%pg", rdev->bdev);
|
2015-06-25 22:02:36 +00:00
|
|
|
strreplace(b, '/', '!');
|
2007-12-18 06:05:35 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->mddev = mddev;
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: bind<%s>\n", b);
|
2005-11-09 05:39:24 +00:00
|
|
|
|
2019-06-14 09:10:36 +00:00
|
|
|
if (mddev->raid_disks)
|
2019-12-23 09:48:53 +00:00
|
|
|
mddev_create_serial_pool(mddev, rdev, false);
|
2019-06-14 09:10:36 +00:00
|
|
|
|
2007-12-18 06:05:35 +00:00
|
|
|
if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
|
2007-03-27 05:32:14 +00:00
|
|
|
goto fail;
|
2005-11-09 05:39:24 +00:00
|
|
|
|
2020-07-16 04:54:40 +00:00
|
|
|
/* failure here is OK */
|
2020-11-17 07:18:55 +00:00
|
|
|
err = sysfs_create_link(&rdev->kobj, bdev_kobj(rdev->bdev), "block");
|
2010-06-01 09:37:23 +00:00
|
|
|
rdev->sysfs_state = sysfs_get_dirent_safe(rdev->kobj.sd, "state");
|
2020-07-14 23:10:26 +00:00
|
|
|
rdev->sysfs_unack_badblocks =
|
|
|
|
sysfs_get_dirent_safe(rdev->kobj.sd, "unacknowledged_bad_blocks");
|
|
|
|
rdev->sysfs_badblocks =
|
|
|
|
sysfs_get_dirent_safe(rdev->kobj.sd, "bad_blocks");
|
2008-10-21 02:25:28 +00:00
|
|
|
|
2008-07-21 07:05:25 +00:00
|
|
|
list_add_rcu(&rdev->same_set, &mddev->disks);
|
2010-11-13 10:55:17 +00:00
|
|
|
bd_link_disk_holder(rdev->bdev, mddev->gendisk);
|
2009-01-08 21:31:11 +00:00
|
|
|
|
|
|
|
/* May as well allow recovery to be retried once */
|
2011-07-27 01:00:36 +00:00
|
|
|
mddev->recovery_disabled++;
|
2009-03-31 03:27:02 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
2007-03-27 05:32:14 +00:00
|
|
|
|
|
|
|
fail:
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: failed to register dev-%s for %s\n",
|
|
|
|
b, mdname(mddev));
|
2007-03-27 05:32:14 +00:00
|
|
|
return err;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
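Two steps of bind_rdev_to_array() lend themselves to a short stand-alone illustration: choosing a unique descriptor number and turning the device name into a safe sysfs entry name. The sketch below uses a plain array in place of the RCU-protected device list. `nr_in_use()`, `pick_desc_nr()` and `sanitize_name()` are hypothetical helpers, and returning -1 stands in for the kernel's -EBUSY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if nr already appears among the n numbers in used[]. */
static bool nr_in_use(const int *used, size_t n, int nr)
{
	for (size_t i = 0; i < n; i++)
		if (used[i] == nr)
			return true;
	return false;
}

/* Pick a descriptor number.
 * requested < 0: take the first free number at or above first_candidate.
 * requested >= 0: accept it only if unused, otherwise return -1. */
static int pick_desc_nr(const int *used, size_t n, int requested,
			int first_candidate)
{
	assert(first_candidate >= 0);

	if (requested >= 0)
		return nr_in_use(used, n, requested) ? -1 : requested;

	int choice = first_candidate;
	while (nr_in_use(used, n, choice))
		choice++;
	return choice;
}

/* Replace '/' with '!' in place so the name is a single path component. */
static void sanitize_name(char *s)
{
	for (; *s; s++)
		if (*s == '/')
			*s = '!';
}
```

In the listing above the starting candidate is 0 for an inactive array and raid_disks for a running one, so devices added to a running array receive numbers at or above the existing slots.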
|
|
|
|
|
2022-11-29 13:32:55 +00:00
|
|
|
void md_autodetect_dev(dev_t dev);
|
|
|
|
|
2023-06-08 11:02:43 +00:00
|
|
|
/* just for claiming the bdev */
|
|
|
|
static struct md_rdev claim_rdev;
|
|
|
|
|
|
|
|
static void export_rdev(struct md_rdev *rdev, struct mddev *mddev)
|
2022-11-29 13:32:55 +00:00
|
|
|
{
|
|
|
|
pr_debug("md: export_rdev(%pg)\n", rdev->bdev);
|
|
|
|
md_rdev_clear(rdev);
|
|
|
|
#ifndef MODULE
|
|
|
|
if (test_bit(AutoDetected, &rdev->flags))
|
|
|
|
md_autodetect_dev(rdev->bdev->bd_dev);
|
|
|
|
#endif
|
2023-08-25 02:55:32 +00:00
|
|
|
blkdev_put(rdev->bdev,
|
|
|
|
test_bit(Holder, &rdev->flags) ? rdev : &claim_rdev);
|
2022-11-29 13:32:55 +00:00
|
|
|
rdev->bdev = NULL;
|
|
|
|
kobject_put(&rdev->kobj);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void md_kick_rdev_from_array(struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
md: fix duplicate filename for rdev
Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev. However, this
introduces a window in which rdev can be added again before the deletion
has finished, and sysfs will complain about a duplicate filename.
Follow-up patches tried to fix this problem by flushing the workqueue, but
flush_rdev_wq() is just dead code. The sequence in
md_kick_rdev_from_array() is:
1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);
So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never be true; conversely, once the work is queued, rdev can never be
found in the list.
flush_rdev_wq() could be replaced by flush_workqueue() directly, but
this approach is not good:
- the workqueue is global, and synchronizing all raid disks is
not necessary.
- flush_workqueue() can't be called under 'reconfig_mutex', so there is still
a small window between flush_workqueue() and mddev_lock() in which other
contexts can queue new work; hence the problem is not solved completely.
sysfs already has APIs that let an attribute delete itself from its own
writer, specifically sysfs_break/unbreak_active_protection(), and these are
used to support deleting rdev synchronously. Therefore, the above commit can
be reverted, and the sysfs duplicate filename can be avoided.
A new mdadm regression test is proposed as well([1]).
[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
2023-05-23 01:27:27 +00:00
|
|
|
struct mddev *mddev = rdev->mddev;
|
|
|
|
|
2011-01-14 17:43:57 +00:00
|
|
|
bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
|
2008-07-21 07:05:25 +00:00
|
|
|
list_del_rcu(&rdev->same_set);
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: unbind<%pg>\n", rdev->bdev);
|
2019-12-23 09:48:55 +00:00
|
|
|
mddev_destroy_serial_pool(rdev->mddev, rdev, false);
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->mddev = NULL;
|
2005-11-09 05:39:24 +00:00
|
|
|
sysfs_remove_link(&rdev->kobj, "block");
|
2008-10-21 02:25:28 +00:00
|
|
|
sysfs_put(rdev->sysfs_state);
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_put(rdev->sysfs_unack_badblocks);
|
|
|
|
sysfs_put(rdev->sysfs_badblocks);
|
2008-10-21 02:25:28 +00:00
|
|
|
rdev->sysfs_state = NULL;
|
2020-07-14 23:10:26 +00:00
|
|
|
rdev->sysfs_unack_badblocks = NULL;
|
|
|
|
rdev->sysfs_badblocks = NULL;
|
2011-07-28 01:31:46 +00:00
|
|
|
rdev->badblocks.count = 0;
|
2023-05-23 01:27:27 +00:00
|
|
|
|
2008-07-21 07:05:25 +00:00
|
|
|
synchronize_rcu();
|
2023-05-23 01:27:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* kobject_del() will wait for all in progress writers to be done, where
|
|
|
|
* reconfig_mutex is held, hence it can't be called under
|
|
|
|
* reconfig_mutex and it's delayed to mddev_unlock().
|
|
|
|
*/
|
|
|
|
list_add(&rdev->same_set, &mddev->deleting);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static void export_array(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-09-25 07:43:47 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-09-25 07:43:47 +00:00
|
|
|
while (!list_empty(&mddev->disks)) {
|
|
|
|
rdev = list_first_entry(&mddev->disks, struct md_rdev,
|
|
|
|
same_set);
|
2015-04-14 15:43:24 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
mddev->raid_disks = 0;
|
|
|
|
mddev->major_version = 0;
|
|
|
|
}
|
|
|
|
|
2017-03-15 03:05:14 +00:00
|
|
|
static bool set_in_sync(struct mddev *mddev)
|
|
|
|
{
|
2017-10-19 05:08:13 +00:00
|
|
|
lockdep_assert_held(&mddev->lock);
|
MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.
We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.
It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.
mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform an atomic operation here when absolutely
needed.
md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:14 +00:00
|
|
|
if (!mddev->in_sync) {
|
|
|
|
mddev->sync_checkers++;
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
if (!mddev->in_sync &&
|
|
|
|
percpu_ref_is_zero(&mddev->writes_pending)) {
|
2017-03-15 03:05:14 +00:00
|
|
|
mddev->in_sync = 1;
|
2017-03-15 03:05:14 +00:00
|
|
|
/*
|
|
|
|
* Ensure ->in_sync is visible before we clear
|
|
|
|
* ->sync_checkers.
|
|
|
|
*/
|
2017-03-15 03:05:14 +00:00
|
|
|
smp_mb();
|
2017-03-15 03:05:14 +00:00
|
|
|
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
|
|
|
}
|
2017-03-15 03:05:14 +00:00
|
|
|
if (--mddev->sync_checkers == 0)
|
|
|
|
percpu_ref_switch_to_percpu(&mddev->writes_pending);
|
2017-03-15 03:05:14 +00:00
|
|
|
}
|
|
|
|
if (mddev->safemode == 1)
|
|
|
|
mddev->safemode = 0;
|
|
|
|
return mddev->in_sync;
|
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static void sync_sbs(struct mddev *mddev, int nospares)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-06-26 07:27:57 +00:00
|
|
|
/* Update each superblock (in-memory image), but
|
|
|
|
* if we are allowed to, skip spares which already
|
|
|
|
* have the right event counter, or have one earlier
|
|
|
|
* (which would mean they aren't being marked as dirty
|
|
|
|
* with the rest of the array)
|
|
|
|
*/
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2006-06-26 07:27:57 +00:00
|
|
|
if (rdev->sb_events == mddev->events ||
|
|
|
|
(nospares &&
|
|
|
|
rdev->raid_disk < 0 &&
|
|
|
|
rdev->sb_events+1 == mddev->events)) {
|
|
|
|
/* Don't update this superblock */
|
|
|
|
rdev->sb_loaded = 2;
|
|
|
|
} else {
|
2011-06-07 22:51:30 +00:00
|
|
|
sync_super(mddev, rdev);
|
2006-06-26 07:27:57 +00:00
|
|
|
rdev->sb_loaded = 1;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
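The skip rule in sync_sbs() above can be restated as a pure predicate. This is a user-space sketch under stated assumptions: `struct rdev_view` and `sb_update_needed()` are hypothetical names, not kernel API, and only the fields the rule reads are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the per-device fields sync_sbs() reads. */
struct rdev_view {
	uint64_t sb_events;	/* event count last recorded for this device */
	int raid_disk;		/* array slot; negative means no active slot, e.g. a spare */
};

/*
 * Returns true when the in-memory superblock image should be refreshed,
 * false when the device can be left alone, mirroring the condition in
 * sync_sbs().
 */
static bool sb_update_needed(const struct rdev_view *rdev,
			     uint64_t array_events, bool nospares)
{
	/* The device already carries the current event count. */
	if (rdev->sb_events == array_events)
		return false;

	/* A device without an active slot that is exactly one event behind may be skipped. */
	if (nospares && rdev->raid_disk < 0 &&
	    rdev->sb_events + 1 == array_events)
		return false;

	return true;
}
```

In this model a spare at event count 9 is left untouched when the array moves to 10 with `nospares` set, but is refreshed when `nospares` is clear or when it has fallen two or more events behind.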
|
|
|
|
|
2015-09-29 00:21:35 +00:00
|
|
|
static bool does_sb_need_changing(struct mddev *mddev)
|
|
|
|
{
|
2022-04-08 08:37:28 +00:00
|
|
|
struct md_rdev *rdev = NULL, *iter;
|
2015-09-29 00:21:35 +00:00
|
|
|
struct mdp_superblock_1 *sb;
|
|
|
|
int role;
|
|
|
|
|
|
|
|
/* Find a good rdev */
|
2022-04-08 08:37:28 +00:00
|
|
|
rdev_for_each(iter, mddev)
|
|
|
|
if ((iter->raid_disk >= 0) && !test_bit(Faulty, &iter->flags)) {
|
|
|
|
rdev = iter;
|
2015-09-29 00:21:35 +00:00
|
|
|
break;
|
2022-04-08 08:37:28 +00:00
|
|
|
}
|
2015-09-29 00:21:35 +00:00
|
|
|
|
|
|
|
/* No good device found. */
|
|
|
|
if (!rdev)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
sb = page_address(rdev->sb_page);
|
|
|
|
/* Check if a device has become faulty or a spare become active */
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
|
|
|
|
/* Device activated? */
|
2022-04-21 19:45:58 +00:00
|
|
|
if (role == MD_DISK_ROLE_SPARE && rdev->raid_disk >= 0 &&
|
2015-09-29 00:21:35 +00:00
|
|
|
!test_bit(Faulty, &rdev->flags))
|
|
|
|
return true;
|
|
|
|
/* Device turned faulty? */
|
2022-04-21 19:45:58 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags) && (role < MD_DISK_ROLE_MAX))
|
2015-09-29 00:21:35 +00:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check if any mddev parameters have changed */
|
|
|
|
if ((mddev->dev_sectors != le64_to_cpu(sb->size)) ||
|
|
|
|
(mddev->reshape_position != le64_to_cpu(sb->reshape_position)) ||
|
2017-03-10 03:49:12 +00:00
|
|
|
(mddev->layout != le32_to_cpu(sb->layout)) ||
|
2015-09-29 00:21:35 +00:00
|
|
|
(mddev->raid_disks != le32_to_cpu(sb->raid_disks)) ||
|
|
|
|
(mddev->chunk_sectors != le32_to_cpu(sb->chunksize)))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2014-10-29 23:51:31 +00:00
|
|
|
void md_update_sb(struct mddev *mddev, int force_change)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-06-22 00:17:12 +00:00
|
|
|
int sync_req;
|
2006-06-26 07:27:57 +00:00
|
|
|
int nospares = 0;
|
2011-07-28 01:31:47 +00:00
|
|
|
int any_badblocks_changed = 0;
|
2015-10-12 09:21:30 +00:00
|
|
|
int ret = -1;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev)) {
|
2013-04-24 01:42:40 +00:00
|
|
|
if (force_change)
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2013-04-24 01:42:40 +00:00
|
|
|
return;
|
|
|
|
}
|
2015-09-29 00:21:35 +00:00
|
|
|
|
2016-05-02 15:33:09 +00:00
|
|
|
repeat:
|
2015-09-29 00:21:35 +00:00
|
|
|
if (mddev_is_clustered(mddev)) {
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
|
2015-09-29 00:21:35 +00:00
|
|
|
force_change = 1;
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_and_clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags))
|
2016-05-04 02:22:13 +00:00
|
|
|
nospares = 1;
|
2015-10-12 09:21:30 +00:00
|
|
|
ret = md_cluster_ops->metadata_update_start(mddev);
|
2015-09-29 00:21:35 +00:00
|
|
|
/* Has someone else updated the sb? */
|
|
|
|
if (!does_sb_need_changing(mddev)) {
|
2015-10-12 09:21:30 +00:00
|
|
|
if (ret == 0)
|
|
|
|
md_cluster_ops->metadata_update_cancel(mddev);
|
2016-12-08 23:48:19 +00:00
|
|
|
bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
|
|
|
|
BIT(MD_SB_CHANGE_DEVS) |
|
|
|
|
BIT(MD_SB_CHANGE_CLEAN));
|
2015-09-29 00:21:35 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
2016-05-02 15:33:09 +00:00
|
|
|
|
2017-10-17 05:18:36 +00:00
|
|
|
/*
|
|
|
|
* First make sure individual recovery_offsets are correct
|
|
|
|
* curr_resync_completed can only be used during recovery.
|
|
|
|
* During reshape/resync it might use array-addresses rather
|
|
|
|
* than device addresses.
|
|
|
|
*/
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2010-08-16 08:09:31 +00:00
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
mddev->delta_disks >= 0 &&
|
2017-10-17 05:18:36 +00:00
|
|
|
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
|
|
|
|
test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
|
|
|
|
!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
2015-10-09 04:54:12 +00:00
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
2010-08-16 08:09:31 +00:00
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
mddev->curr_resync_completed > rdev->recovery_offset)
|
|
|
|
rdev->recovery_offset = mddev->curr_resync_completed;
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
}
|
2010-08-30 07:33:33 +00:00
|
|
|
if (!mddev->persistent) {
|
2016-12-08 23:48:19 +00:00
|
|
|
clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
|
|
|
clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2011-07-28 01:31:48 +00:00
|
|
|
if (!mddev->external) {
|
2016-12-08 23:48:19 +00:00
|
|
|
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2011-07-28 01:31:48 +00:00
|
|
|
if (rdev->badblocks.changed) {
|
2012-03-19 01:46:41 +00:00
|
|
|
rdev->badblocks.changed = 0;
|
2015-12-25 02:20:34 +00:00
|
|
|
ack_all_badblocks(&rdev->badblocks);
|
2011-07-28 01:31:48 +00:00
|
|
|
md_error(mddev, rdev);
|
|
|
|
}
|
|
|
|
clear_bit(Blocked, &rdev->flags);
|
|
|
|
clear_bit(BlockedBadBlocks, &rdev->flags);
|
|
|
|
wake_up(&rdev->blocked_wait);
|
|
|
|
}
|
|
|
|
}
|
2010-08-16 08:09:31 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2006-08-27 08:23:49 +00:00
|
|
|
|
2015-12-20 23:51:01 +00:00
|
|
|
mddev->utime = ktime_get_real_seconds();
|
2010-08-16 08:09:31 +00:00
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
|
2006-10-03 08:15:46 +00:00
|
|
|
force_change = 1;
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_and_clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags))
|
2006-10-03 08:15:46 +00:00
|
|
|
/* just a clean<->dirty transition, possibly leave spares alone,
|
|
|
|
* though if events isn't the right even/odd, we will have to do
|
|
|
|
* spares after all
|
|
|
|
*/
|
|
|
|
nospares = 1;
|
|
|
|
if (force_change)
|
|
|
|
nospares = 0;
|
|
|
|
if (mddev->degraded)
|
2006-08-27 08:23:49 +00:00
|
|
|
/* If the array is degraded, then skipping spares is both
|
|
|
|
* dangerous and fairly pointless.
|
|
|
|
* Dangerous because a device that was removed from the array
|
|
|
|
* might have an event_count that still looks up-to-date,
|
|
|
|
* so it can be re-added without a resync.
|
|
|
|
* Pointless because if there are any spares to skip,
|
|
|
|
* then a recovery will happen and soon that array won't
|
|
|
|
* be degraded any more and the spare can go back to sleep then.
|
|
|
|
*/
|
2006-10-03 08:15:46 +00:00
|
|
|
nospares = 0;
|
2006-08-27 08:23:49 +00:00
|
|
|
|
2005-06-22 00:17:12 +00:00
|
|
|
sync_req = mddev->in_sync;
|
2006-06-26 07:27:57 +00:00
|
|
|
|
|
|
|
/* If this is just a dirty<->clean transition, and the array is clean
|
|
|
|
* and 'events' is odd, we can roll back to the previous clean state */
|
2006-10-03 08:15:46 +00:00
|
|
|
if (nospares
|
2006-06-26 07:27:57 +00:00
|
|
|
&& (mddev->in_sync && mddev->recovery_cp == MaxSector)
|
2010-05-17 23:28:43 +00:00
|
|
|
&& mddev->can_decrease_events
|
|
|
|
&& mddev->events != 1) {
|
2006-06-26 07:27:57 +00:00
|
|
|
mddev->events--;
|
2010-05-17 23:28:43 +00:00
|
|
|
mddev->can_decrease_events = 0;
|
|
|
|
} else {
|
2006-06-26 07:27:57 +00:00
|
|
|
/* otherwise we have to go forward and ... */
|
|
|
|
mddev->events++;
|
2010-05-17 23:28:43 +00:00
|
|
|
mddev->can_decrease_events = nospares;
|
2006-06-26 07:27:57 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-09-30 05:52:29 +00:00
|
|
|
/*
|
|
|
|
* This 64-bit counter should never wrap.
|
|
|
|
* Either we are in around ~1 trillion A.C., assuming
|
|
|
|
* 1 reboot per second, or we have a bug...
|
|
|
|
*/
|
|
|
|
WARN_ON(mddev->events == 0);
|
2011-07-28 01:31:47 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2011-07-28 01:31:47 +00:00
|
|
|
if (rdev->badblocks.changed)
|
|
|
|
any_badblocks_changed++;
|
2011-07-28 01:31:48 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
|
|
|
set_bit(FaultRecorded, &rdev->flags);
|
|
|
|
}
|
2011-07-28 01:31:47 +00:00
|
|
|
|
2008-02-06 09:39:51 +00:00
|
|
|
sync_sbs(mddev, nospares);
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-10-07 03:23:17 +00:00
|
|
|
pr_debug("md: updating %s RAID superblock on device (in sync %d)\n",
|
|
|
|
mdname(mddev), mddev->in_sync);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-18 17:44:08 +00:00
|
|
|
if (mddev->queue)
|
|
|
|
blk_add_trace_msg(mddev->queue, "md md_update_sb");
|
2016-11-18 05:16:11 +00:00
|
|
|
rewrite:
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_update_sb(mddev->bitmap);
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2006-06-26 07:27:57 +00:00
|
|
|
if (rdev->sb_loaded != 1)
|
|
|
|
continue; /* no noise on spare devices */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However, updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date devices
records that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously they don't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Doing that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and records that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
if (!test_bit(Faulty, &rdev->flags)) {
|
2005-06-22 00:17:28 +00:00
|
|
|
md_super_write(mddev, rdev,
|
2008-07-11 12:02:23 +00:00
|
|
|
rdev->sb_start, rdev->sb_size,
|
2005-06-22 00:17:28 +00:00
|
|
|
rdev->sb_page);
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: (write) %pg's sb offset: %llu\n",
|
|
|
|
rdev->bdev,
|
2011-10-07 03:23:17 +00:00
|
|
|
(unsigned long long)rdev->sb_start);
|
2006-06-26 07:27:57 +00:00
|
|
|
rdev->sb_events = mddev->events;
|
2011-07-28 01:31:47 +00:00
|
|
|
if (rdev->badblocks.size) {
|
|
|
|
md_super_write(mddev, rdev,
|
|
|
|
rdev->badblocks.sector,
|
|
|
|
rdev->badblocks.size << 9,
|
|
|
|
rdev->bb_page);
|
|
|
|
rdev->badblocks.size = 0;
|
|
|
|
}
|
2005-06-22 00:17:28 +00:00
|
|
|
|
2013-12-09 01:04:56 +00:00
|
|
|
} else
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: %pg (skipping faulty)\n",
|
|
|
|
rdev->bdev);
|
2011-10-18 01:16:48 +00:00
|
|
|
|
2005-06-22 00:17:28 +00:00
|
|
|
if (mddev->level == LEVEL_MULTIPATH)
|
2005-04-16 22:20:36 +00:00
|
|
|
/* only need to write one superblock... */
|
|
|
|
break;
|
|
|
|
}
|
2016-11-18 05:16:11 +00:00
|
|
|
if (md_super_wait(mddev) < 0)
|
|
|
|
goto rewrite;
|
2016-12-08 23:48:19 +00:00
|
|
|
/* if there was a failure, MD_SB_CHANGE_DEVS was set, and we re-write super */
|
2005-06-22 00:17:28 +00:00
|
|
|
|
2016-05-02 15:33:09 +00:00
|
|
|
if (mddev_is_clustered(mddev) && ret == 0)
|
|
|
|
md_cluster_ops->metadata_update_finish(mddev);
|
|
|
|
|
2006-10-03 08:15:46 +00:00
|
|
|
if (mddev->in_sync != sync_req ||
|
2016-12-08 23:48:19 +00:00
|
|
|
!bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
|
|
|
|
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN)))
|
2005-06-22 00:17:12 +00:00
|
|
|
/* have to write it out again */
|
|
|
|
goto repeat;
|
2005-06-22 00:17:26 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
2009-04-14 06:28:34 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_completed);
|
2005-06-22 00:17:12 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2011-07-28 01:31:48 +00:00
|
|
|
if (test_and_clear_bit(FaultRecorded, &rdev->flags))
|
|
|
|
clear_bit(Blocked, &rdev->flags);
|
|
|
|
|
|
|
|
if (any_badblocks_changed)
|
2015-12-25 02:20:34 +00:00
|
|
|
ack_all_badblocks(&rdev->badblocks);
|
2011-07-28 01:31:48 +00:00
|
|
|
clear_bit(BlockedBadBlocks, &rdev->flags);
|
|
|
|
wake_up(&rdev->blocked_wait);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-10-29 23:51:31 +00:00
|
|
|
EXPORT_SYMBOL(md_update_sb);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-04-14 15:45:22 +00:00
|
|
|
static int add_bound_rdev(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
struct mddev *mddev = rdev->mddev;
|
|
|
|
int err = 0;
|
2016-01-06 22:37:14 +00:00
|
|
|
bool add_journal = test_bit(Journal, &rdev->flags);
|
2015-04-14 15:45:22 +00:00
|
|
|
|
2016-01-06 22:37:14 +00:00
|
|
|
if (!mddev->pers->hot_remove_disk || add_journal) {
|
2015-04-14 15:45:22 +00:00
|
|
|
/* If there is hot_add_disk but no hot_remove_disk
|
|
|
|
* then added disks are for geometry changes,
|
|
|
|
* and should be added immediately.
|
|
|
|
*/
|
|
|
|
super_types[mddev->major_version].
|
|
|
|
validate_super(mddev, rdev);
|
2016-01-06 22:37:14 +00:00
|
|
|
if (add_journal)
|
|
|
|
mddev_suspend(mddev);
|
2015-04-14 15:45:22 +00:00
|
|
|
err = mddev->pers->hot_add_disk(mddev, rdev);
|
2016-01-06 22:37:14 +00:00
|
|
|
if (add_journal)
|
|
|
|
mddev_resume(mddev);
|
2015-04-14 15:45:22 +00:00
|
|
|
if (err) {
|
2016-06-03 03:32:05 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2015-04-14 15:45:22 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2015-04-14 15:45:22 +00:00
|
|
|
if (mddev->degraded)
|
|
|
|
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2015-04-14 15:45:22 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-03-23 17:34:54 +00:00
|
|
|
/* words written to sysfs files may, or may not, be \n terminated.
|
2006-01-06 08:20:41 +00:00
|
|
|
* We want to accept either case. For this we use cmd_match.
|
|
|
|
*/
|
|
|
|
static int cmd_match(const char *cmd, const char *str)
|
|
|
|
{
|
|
|
|
/* See if cmd, written into a sysfs file, matches
|
|
|
|
* str. They must either be the same, or cmd can
|
|
|
|
* have a trailing newline
|
|
|
|
*/
|
|
|
|
while (*cmd && *str && *cmd == *str) {
|
|
|
|
cmd++;
|
|
|
|
str++;
|
|
|
|
}
|
|
|
|
if (*cmd == '\n')
|
|
|
|
cmd++;
|
|
|
|
if (*str || *cmd)
|
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
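The matching rule used by cmd_match above can be tried outside the kernel. The following is a minimal userspace sketch, assuming nothing beyond standard C; the name `cmd_match_sketch` is hypothetical and not part of md.c. It accepts `cmd` when it equals `str` exactly or equals `str` followed by one trailing newline, and rejects everything else, including a second newline.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the cmd_match rule (hypothetical name):
 * return 1 if cmd is str, or str plus exactly one trailing '\n';
 * return 0 otherwise.
 */
static int cmd_match_sketch(const char *cmd, const char *str)
{
	assert(cmd != NULL && str != NULL);

	/* Walk both strings while they agree. */
	while (*cmd && *str && *cmd == *str) {
		cmd++;
		str++;
	}
	/* Tolerate a single newline, as written by "echo word > file". */
	if (*cmd == '\n')
		cmd++;
	/* Anything left over on either side means no match. */
	if (*str || *cmd)
		return 0;
	return 1;
}
```

This is why both `echo faulty` and `echo -n faulty` written to a sysfs attribute select the same branch in the store handlers, while a prefix such as "fault" does not.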
|
|
|
|
|
2005-11-09 05:39:24 +00:00
|
|
|
struct rdev_sysfs_entry {
|
|
|
|
struct attribute attr;
|
2011-10-11 05:45:26 +00:00
|
|
|
ssize_t (*show)(struct md_rdev *, char *);
|
|
|
|
ssize_t (*store)(struct md_rdev *, const char *, size_t);
|
2005-11-09 05:39:24 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
state_show(struct md_rdev *rdev, char *page)
|
2005-11-09 05:39:24 +00:00
|
|
|
{
|
2016-10-21 14:26:57 +00:00
|
|
|
char *sep = ",";
|
2008-02-06 09:39:57 +00:00
|
|
|
size_t len = 0;
|
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23 21:07:29 +00:00
|
|
|
unsigned long flags = READ_ONCE(rdev->flags);
|
2005-11-09 05:39:24 +00:00
|
|
|
|
2014-12-15 01:56:59 +00:00
|
|
|
if (test_bit(Faulty, &flags) ||
|
2016-10-21 14:27:08 +00:00
|
|
|
(!test_bit(ExternalBbl, &flags) &&
|
|
|
|
rdev->badblocks.unacked_exist))
|
2016-10-21 14:26:57 +00:00
|
|
|
len += sprintf(page+len, "faulty%s", sep);
|
|
|
|
if (test_bit(In_sync, &flags))
|
|
|
|
len += sprintf(page+len, "in_sync%s", sep);
|
|
|
|
if (test_bit(Journal, &flags))
|
|
|
|
len += sprintf(page+len, "journal%s", sep);
|
|
|
|
if (test_bit(WriteMostly, &flags))
|
|
|
|
len += sprintf(page+len, "write_mostly%s", sep);
|
2014-12-15 01:56:59 +00:00
|
|
|
if (test_bit(Blocked, &flags) ||
|
2011-12-08 05:22:48 +00:00
|
|
|
(rdev->badblocks.unacked_exist
|
2016-10-21 14:26:57 +00:00
|
|
|
&& !test_bit(Faulty, &flags)))
|
|
|
|
len += sprintf(page+len, "blocked%s", sep);
|
2014-12-15 01:56:59 +00:00
|
|
|
if (!test_bit(Faulty, &flags) &&
|
2015-10-09 04:54:12 +00:00
|
|
|
!test_bit(Journal, &flags) &&
|
2016-10-21 14:26:57 +00:00
|
|
|
!test_bit(In_sync, &flags))
|
|
|
|
len += sprintf(page+len, "spare%s", sep);
|
|
|
|
if (test_bit(WriteErrorSeen, &flags))
|
|
|
|
len += sprintf(page+len, "write_error%s", sep);
|
|
|
|
if (test_bit(WantReplacement, &flags))
|
|
|
|
len += sprintf(page+len, "want_replacement%s", sep);
|
|
|
|
if (test_bit(Replacement, &flags))
|
|
|
|
len += sprintf(page+len, "replacement%s", sep);
|
|
|
|
if (test_bit(ExternalBbl, &flags))
|
|
|
|
len += sprintf(page+len, "external_bbl%s", sep);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (test_bit(FailFast, &flags))
|
|
|
|
len += sprintf(page+len, "failfast%s", sep);
|
2016-10-21 14:26:57 +00:00
|
|
|
|
|
|
|
if (len)
|
|
|
|
len -= strlen(sep);
|
2011-12-22 23:17:51 +00:00
|
|
|
|
2005-11-09 05:39:24 +00:00
|
|
|
return len+sprintf(page+len, "\n");
|
|
|
|
}
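The separator handling in state_show (append "name,sep" for each set flag, then drop the last separator before the newline) can be illustrated in isolation. This is a rough userspace sketch, not the kernel code: `format_flags` and its `names` array are hypothetical stand-ins for the chain of test_bit() checks, and the caller is assumed to supply a buffer large enough for all names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the state_show formatting pattern (hypothetical helper):
 * write each name followed by the separator, trim the final separator,
 * then terminate with a newline. Returns the number of bytes written,
 * not counting the terminating NUL.
 */
static size_t format_flags(char *page, const char *const *names, size_t n)
{
	const char *sep = ",";
	size_t len = 0;
	size_t i;

	assert(page != NULL && (names != NULL || n == 0));

	for (i = 0; i < n; i++)
		len += sprintf(page + len, "%s%s", names[i], sep);

	/* Only trim if at least one name was emitted. */
	if (len)
		len -= strlen(sep);

	return len + sprintf(page + len, "\n");
}
```

With no names at all the result is a bare newline, which matches how an rdev with no reportable flags would read back as an empty line.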
|
|
|
|
|
2006-06-26 07:27:58 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
state_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2006-06-26 07:27:58 +00:00
|
|
|
{
|
|
|
|
/* can write
|
2011-07-28 01:31:48 +00:00
|
|
|
* faulty - simulates an error
|
2006-06-26 07:27:58 +00:00
|
|
|
* remove - disconnects the device
|
2006-06-26 07:28:01 +00:00
|
|
|
* writemostly - sets write_mostly
|
|
|
|
* -writemostly - clears write_mostly
|
2011-07-28 01:31:48 +00:00
|
|
|
* blocked - sets the Blocked flag
|
|
|
|
* -blocked - clears the Blocked flag and possibly simulates an error
|
2009-04-14 02:01:57 +00:00
|
|
|
* insync - sets Insync providing device isn't active
|
2013-12-09 01:04:56 +00:00
|
|
|
* -insync - clear Insync for a device with a slot assigned,
|
|
|
|
* so that it gets rebuilt based on bitmap
|
2011-07-28 01:31:48 +00:00
|
|
|
* write_error - sets WriteErrorSeen
|
|
|
|
* -write_error - clears WriteErrorSeen
|
2016-11-18 05:16:11 +00:00
|
|
|
* {,-}failfast - set/clear FailFast
|
2006-06-26 07:27:58 +00:00
|
|
|
*/
|
2021-10-13 14:59:33 +00:00
|
|
|
|
|
|
|
struct mddev *mddev = rdev->mddev;
|
2006-06-26 07:27:58 +00:00
|
|
|
int err = -EINVAL;
|
2021-10-13 14:59:33 +00:00
|
|
|
bool need_update_sb = false;
|
|
|
|
|
2006-06-26 07:27:58 +00:00
|
|
|
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
|
|
|
|
md_error(rdev->mddev, rdev);
|
2022-03-22 15:23:38 +00:00
|
|
|
|
|
|
|
if (test_bit(MD_BROKEN, &rdev->mddev->flags))
|
2011-08-25 04:42:51 +00:00
|
|
|
err = -EBUSY;
|
2022-03-22 15:23:38 +00:00
|
|
|
else
|
|
|
|
err = 0;
|
2006-06-26 07:27:58 +00:00
|
|
|
} else if (cmd_match(buf, "remove")) {
|
2016-07-28 16:06:34 +00:00
|
|
|
if (rdev->mddev->pers) {
|
|
|
|
clear_bit(Blocked, &rdev->flags);
|
|
|
|
remove_and_add_spares(rdev->mddev, rdev);
|
|
|
|
}
|
2006-06-26 07:27:58 +00:00
|
|
|
if (rdev->raid_disk >= 0)
|
|
|
|
err = -EBUSY;
|
|
|
|
else {
|
|
|
|
err = 0;
|
2015-10-12 09:21:27 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
err = md_cluster_ops->remove_disk(mddev, rdev);
|
|
|
|
|
|
|
|
if (err == 0) {
|
|
|
|
md_kick_rdev_from_array(rdev);
|
2016-11-04 05:46:03 +00:00
|
|
|
if (mddev->pers) {
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2016-11-04 05:46:03 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2015-10-12 09:21:27 +00:00
|
|
|
}
|
2006-06-26 07:27:58 +00:00
|
|
|
}
|
2006-06-26 07:28:01 +00:00
|
|
|
} else if (cmd_match(buf, "writemostly")) {
|
|
|
|
set_bit(WriteMostly, &rdev->flags);
|
2019-12-23 09:48:53 +00:00
|
|
|
mddev_create_serial_pool(rdev->mddev, rdev, false);
|
2021-10-13 14:59:33 +00:00
|
|
|
need_update_sb = true;
|
2006-06-26 07:28:01 +00:00
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-writemostly")) {
|
2019-12-23 09:48:55 +00:00
|
|
|
mddev_destroy_serial_pool(rdev->mddev, rdev, false);
|
2006-06-26 07:28:01 +00:00
|
|
|
clear_bit(WriteMostly, &rdev->flags);
|
2021-10-13 14:59:33 +00:00
|
|
|
need_update_sb = true;
|
2008-04-30 07:52:32 +00:00
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "blocked")) {
|
|
|
|
set_bit(Blocked, &rdev->flags);
|
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-blocked")) {
|
2011-07-28 01:31:48 +00:00
|
|
|
if (!test_bit(Faulty, &rdev->flags) &&
|
2016-10-21 14:27:08 +00:00
|
|
|
!test_bit(ExternalBbl, &rdev->flags) &&
|
2011-08-30 06:20:17 +00:00
|
|
|
rdev->badblocks.unacked_exist) {
|
2011-07-28 01:31:48 +00:00
|
|
|
/* metadata handler doesn't understand badblocks,
|
|
|
|
* so we need to fail the device
|
|
|
|
*/
|
|
|
|
md_error(rdev->mddev, rdev);
|
|
|
|
}
|
2008-04-30 07:52:32 +00:00
|
|
|
clear_bit(Blocked, &rdev->flags);
|
2011-07-28 01:31:48 +00:00
|
|
|
clear_bit(BlockedBadBlocks, &rdev->flags);
|
2008-04-30 07:52:32 +00:00
|
|
|
wake_up(&rdev->blocked_wait);
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
|
|
|
|
md_wakeup_thread(rdev->mddev->thread);
|
|
|
|
|
2009-04-14 02:01:57 +00:00
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
2006-06-26 07:28:01 +00:00
|
|
|
err = 0;
|
2016-11-18 05:16:11 +00:00
|
|
|
} else if (cmd_match(buf, "failfast")) {
|
|
|
|
set_bit(FailFast, &rdev->flags);
|
2021-10-13 14:59:33 +00:00
|
|
|
need_update_sb = true;
|
2016-11-18 05:16:11 +00:00
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-failfast")) {
|
|
|
|
clear_bit(FailFast, &rdev->flags);
|
2021-10-13 14:59:33 +00:00
|
|
|
need_update_sb = true;
|
2016-11-18 05:16:11 +00:00
|
|
|
err = 0;
|
2015-10-09 04:54:12 +00:00
|
|
|
} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Journal, &rdev->flags)) {
|
2014-09-30 05:24:25 +00:00
|
|
|
if (rdev->mddev->pers == NULL) {
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
rdev->saved_raid_disk = rdev->raid_disk;
|
|
|
|
rdev->raid_disk = -1;
|
|
|
|
err = 0;
|
|
|
|
}
|
2011-07-28 01:31:48 +00:00
|
|
|
} else if (cmd_match(buf, "write_error")) {
|
|
|
|
set_bit(WriteErrorSeen, &rdev->flags);
|
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-write_error")) {
|
|
|
|
clear_bit(WriteErrorSeen, &rdev->flags);
|
|
|
|
err = 0;
|
2011-12-22 23:17:51 +00:00
|
|
|
} else if (cmd_match(buf, "want_replacement")) {
|
|
|
|
/* Any non-spare device that is not a replacement can
|
|
|
|
* become want_replacement at any time, but we then need to
|
|
|
|
* check if recovery is needed.
|
|
|
|
*/
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
2015-10-09 04:54:12 +00:00
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
2011-12-22 23:17:51 +00:00
|
|
|
!test_bit(Replacement, &rdev->flags))
|
|
|
|
set_bit(WantReplacement, &rdev->flags);
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
|
|
|
|
md_wakeup_thread(rdev->mddev->thread);
|
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-want_replacement")) {
|
|
|
|
/* Clearing 'want_replacement' is always allowed.
|
|
|
|
* Once replacement starts it is too late though.
|
|
|
|
*/
|
|
|
|
err = 0;
|
|
|
|
clear_bit(WantReplacement, &rdev->flags);
|
|
|
|
} else if (cmd_match(buf, "replacement")) {
|
|
|
|
/* Can only set a device as a replacement when array has not
|
|
|
|
* yet been started. Once running, replacement is automatic
|
|
|
|
* from spares, or by assigning 'slot'.
|
|
|
|
*/
|
|
|
|
if (rdev->mddev->pers)
|
|
|
|
err = -EBUSY;
|
|
|
|
else {
|
|
|
|
set_bit(Replacement, &rdev->flags);
|
|
|
|
err = 0;
|
|
|
|
}
|
|
|
|
} else if (cmd_match(buf, "-replacement")) {
|
|
|
|
/* Similarly, can only clear Replacement before start */
|
|
|
|
if (rdev->mddev->pers)
|
|
|
|
err = -EBUSY;
|
|
|
|
else {
|
|
|
|
clear_bit(Replacement, &rdev->flags);
|
|
|
|
err = 0;
|
|
|
|
}
|
2015-04-14 15:45:22 +00:00
|
|
|
} else if (cmd_match(buf, "re-add")) {
|
2019-04-02 06:22:14 +00:00
|
|
|
if (!rdev->mddev->pers)
|
|
|
|
err = -EINVAL;
|
|
|
|
else if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) &&
|
|
|
|
rdev->saved_raid_disk >= 0) {
|
2015-04-14 15:45:42 +00:00
|
|
|
/* clear_bit is performed _after_ all the devices
|
|
|
|
* have their local Faulty bit cleared. If any writes
|
|
|
|
* happen in the meantime in the local node, they
|
|
|
|
* will land in the local bitmap, which will be synced
|
|
|
|
* by this node eventually
|
|
|
|
*/
|
|
|
|
if (!mddev_is_clustered(rdev->mddev) ||
|
|
|
|
(err = md_cluster_ops->gather_bitmaps(rdev)) == 0) {
|
|
|
|
clear_bit(Faulty, &rdev->flags);
|
|
|
|
err = add_bound_rdev(rdev);
|
|
|
|
}
|
2015-04-14 15:45:22 +00:00
|
|
|
} else
|
|
|
|
err = -EBUSY;
|
2016-10-21 14:26:57 +00:00
|
|
|
} else if (cmd_match(buf, "external_bbl") && (rdev->mddev->external)) {
|
|
|
|
set_bit(ExternalBbl, &rdev->flags);
|
|
|
|
rdev->badblocks.shift = 0;
|
|
|
|
err = 0;
|
|
|
|
} else if (cmd_match(buf, "-external_bbl") && (rdev->mddev->external)) {
|
|
|
|
clear_bit(ExternalBbl, &rdev->flags);
|
|
|
|
err = 0;
|
2006-06-26 07:27:58 +00:00
|
|
|
}
|
2021-10-13 14:59:33 +00:00
|
|
|
if (need_update_sb)
|
|
|
|
md_update_sb(mddev, 1);
|
2010-06-01 09:37:23 +00:00
|
|
|
if (!err)
|
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2006-06-26 07:27:58 +00:00
|
|
|
return err ? err : len;
|
|
|
|
}
|
2006-07-10 11:44:18 +00:00
|
|
|
static struct rdev_sysfs_entry rdev_state =
|
2014-09-29 22:53:05 +00:00
|
|
|
__ATTR_PREALLOC(state, S_IRUGO|S_IWUSR, state_show, state_store);
|
2005-11-09 05:39:24 +00:00
|
|
|
|
2006-01-06 08:20:52 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
errors_show(struct md_rdev *rdev, char *page)
|
2006-01-06 08:20:52 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%d\n", atomic_read(&rdev->corrected_errors));
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
errors_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2006-01-06 08:20:52 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
unsigned int n;
|
|
|
|
int rv;
|
|
|
|
|
|
|
|
rv = kstrtouint(buf, 10, &n);
|
|
|
|
if (rv < 0)
|
|
|
|
return rv;
|
|
|
|
atomic_set(&rdev->corrected_errors, n);
|
|
|
|
return len;
|
2006-01-06 08:20:52 +00:00
|
|
|
}
|
|
|
|
static struct rdev_sysfs_entry rdev_errors =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(errors, S_IRUGO|S_IWUSR, errors_show, errors_store);
|
2006-01-06 08:20:52 +00:00
|
|
|
|
2006-01-06 08:20:55 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
slot_show(struct md_rdev *rdev, char *page)
|
2006-01-06 08:20:55 +00:00
|
|
|
{
|
2015-10-09 04:54:12 +00:00
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
return sprintf(page, "journal\n");
|
|
|
|
else if (rdev->raid_disk < 0)
|
2006-01-06 08:20:55 +00:00
|
|
|
return sprintf(page, "none\n");
|
|
|
|
else
|
|
|
|
return sprintf(page, "%d\n", rdev->raid_disk);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
slot_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2006-01-06 08:20:55 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
int slot;
|
2008-02-06 09:39:51 +00:00
|
|
|
int err;
|
2015-05-16 11:02:38 +00:00
|
|
|
|
2015-10-09 04:54:12 +00:00
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
return -EBUSY;
|
2006-01-06 08:20:55 +00:00
|
|
|
if (strncmp(buf, "none", 4)==0)
|
|
|
|
slot = -1;
|
2015-05-16 11:02:38 +00:00
|
|
|
else {
|
|
|
|
err = kstrtouint(buf, 10, (unsigned int *)&slot);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
2023-03-05 22:36:25 +00:00
|
|
|
if (slot < 0)
|
|
|
|
/* overflow */
|
|
|
|
return -ENOSPC;
|
2015-05-16 11:02:38 +00:00
|
|
|
}
|
2008-06-27 22:31:31 +00:00
|
|
|
if (rdev->mddev->pers && slot == -1) {
|
2008-02-06 09:39:51 +00:00
|
|
|
/* Setting 'slot' on an active array requires also
|
|
|
|
* updating the 'rd%d' link, and communicating
|
|
|
|
* with the personality with ->hot_*_disk.
|
|
|
|
* For now we only support removing
|
|
|
|
* failed/spare devices. This normally happens automatically,
|
|
|
|
* but not when the metadata is externally managed.
|
|
|
|
*/
|
|
|
|
if (rdev->raid_disk == -1)
|
|
|
|
return -EEXIST;
|
|
|
|
/* personality does all needed checks */
|
2011-06-09 01:42:54 +00:00
|
|
|
if (rdev->mddev->pers->hot_remove_disk == NULL)
|
2008-02-06 09:39:51 +00:00
|
|
|
return -EINVAL;
|
2013-04-24 01:42:41 +00:00
|
|
|
clear_bit(Blocked, &rdev->flags);
|
|
|
|
remove_and_add_spares(rdev->mddev, rdev);
|
|
|
|
if (rdev->raid_disk >= 0)
|
|
|
|
return -EBUSY;
|
2008-02-06 09:39:51 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
|
|
|
|
md_wakeup_thread(rdev->mddev->thread);
|
2008-06-27 22:31:31 +00:00
|
|
|
} else if (rdev->mddev->pers) {
|
|
|
|
/* Activating a spare .. or possibly reactivating
|
2009-04-14 02:01:57 +00:00
|
|
|
* if we ever get bitmaps working here.
|
2008-06-27 22:31:31 +00:00
|
|
|
*/
|
2015-12-18 04:19:16 +00:00
|
|
|
int err;
|
2008-06-27 22:31:31 +00:00
|
|
|
|
|
|
|
if (rdev->raid_disk != -1)
|
|
|
|
return -EBUSY;
|
|
|
|
|
2011-02-02 00:57:13 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &rdev->mddev->recovery))
|
|
|
|
return -EBUSY;
|
|
|
|
|
2008-06-27 22:31:31 +00:00
|
|
|
if (rdev->mddev->pers->hot_add_disk == NULL)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2011-01-13 22:14:34 +00:00
|
|
|
if (slot >= rdev->mddev->raid_disks &&
|
|
|
|
slot >= rdev->mddev->raid_disks + rdev->mddev->delta_disks)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
2008-06-27 22:31:31 +00:00
|
|
|
rdev->raid_disk = slot;
|
|
|
|
if (test_bit(In_sync, &rdev->flags))
|
|
|
|
rdev->saved_raid_disk = slot;
|
|
|
|
else
|
|
|
|
rdev->saved_raid_disk = -1;
|
2011-10-18 01:13:47 +00:00
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2013-12-11 23:13:33 +00:00
|
|
|
clear_bit(Bitmap_sync, &rdev->flags);
|
2020-04-04 21:57:11 +00:00
|
|
|
err = rdev->mddev->pers->hot_add_disk(rdev->mddev, rdev);
|
2015-12-18 04:19:16 +00:00
|
|
|
if (err) {
|
|
|
|
rdev->raid_disk = -1;
|
|
|
|
return err;
|
|
|
|
} else
|
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2020-07-16 04:54:40 +00:00
|
|
|
/* failure here is OK */
|
|
|
|
sysfs_link_rdev(rdev->mddev, rdev);
|
2008-06-27 22:31:31 +00:00
|
|
|
/* don't wakeup anyone, leave that to userspace. */
|
2008-02-06 09:39:51 +00:00
|
|
|
} else {
|
2011-01-13 22:14:34 +00:00
|
|
|
if (slot >= rdev->mddev->raid_disks &&
|
|
|
|
slot >= rdev->mddev->raid_disks + rdev->mddev->delta_disks)
|
2008-02-06 09:39:51 +00:00
|
|
|
return -ENOSPC;
|
|
|
|
rdev->raid_disk = slot;
|
|
|
|
/* assume it is working */
|
2008-02-06 09:39:54 +00:00
|
|
|
clear_bit(Faulty, &rdev->flags);
|
|
|
|
clear_bit(WriteMostly, &rdev->flags);
|
2008-02-06 09:39:51 +00:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2008-02-06 09:39:51 +00:00
|
|
|
}
|
2006-01-06 08:20:55 +00:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_slot =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(slot, S_IRUGO|S_IWUSR, slot_show, slot_store);
|
2006-01-06 08:20:55 +00:00
|
|
|
|
2006-01-06 08:20:56 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
offset_show(struct md_rdev *rdev, char *page)
|
2006-01-06 08:20:56 +00:00
|
|
|
{
|
2006-01-06 08:20:59 +00:00
|
|
|
return sprintf(page, "%llu\n", (unsigned long long)rdev->data_offset);
|
2006-01-06 08:20:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
offset_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2006-01-06 08:20:56 +00:00
|
|
|
{
|
2012-05-20 23:27:00 +00:00
|
|
|
unsigned long long offset;
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtoull(buf, 10, &offset) < 0)
|
2006-01-06 08:20:56 +00:00
|
|
|
return -EINVAL;
|
2008-06-27 22:31:29 +00:00
|
|
|
if (rdev->mddev->pers && rdev->raid_disk >= 0)
|
2006-01-06 08:20:56 +00:00
|
|
|
return -EBUSY;
|
2009-03-31 03:33:13 +00:00
|
|
|
if (rdev->sectors && rdev->mddev->external)
|
2008-02-06 09:39:54 +00:00
|
|
|
/* Must set offset before size, so overlap checks
|
|
|
|
* can be sane */
|
|
|
|
return -EBUSY;
|
2006-01-06 08:20:56 +00:00
|
|
|
rdev->data_offset = offset;
|
2012-07-19 05:59:18 +00:00
|
|
|
rdev->new_data_offset = offset;
|
2006-01-06 08:20:56 +00:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_offset =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(offset, S_IRUGO|S_IWUSR, offset_show, offset_store);
|
2006-01-06 08:20:56 +00:00
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
static ssize_t new_offset_show(struct md_rdev *rdev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)rdev->new_data_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t new_offset_store(struct md_rdev *rdev,
|
|
|
|
const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
unsigned long long new_offset;
|
|
|
|
struct mddev *mddev = rdev->mddev;
|
|
|
|
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtoull(buf, 10, &new_offset) < 0)
|
2012-05-20 23:27:00 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2014-12-10 23:02:10 +00:00
|
|
|
if (mddev->sync_thread ||
|
|
|
|
test_bit(MD_RECOVERY_RUNNING,&mddev->recovery))
|
2012-05-20 23:27:00 +00:00
|
|
|
return -EBUSY;
|
|
|
|
if (new_offset == rdev->data_offset)
|
|
|
|
/* reset is always permitted */
|
|
|
|
;
|
|
|
|
else if (new_offset > rdev->data_offset) {
|
|
|
|
/* must not push array size beyond rdev_sectors */
|
|
|
|
if (new_offset - rdev->data_offset
|
|
|
|
+ mddev->dev_sectors > rdev->sectors)
|
|
|
|
return -E2BIG;
|
|
|
|
}
|
|
|
|
/* Metadata worries about other space details. */
|
|
|
|
|
|
|
|
/* decreasing the offset is inconsistent with a backwards
|
|
|
|
* reshape.
|
|
|
|
*/
|
|
|
|
if (new_offset < rdev->data_offset &&
|
|
|
|
mddev->reshape_backwards)
|
|
|
|
return -EINVAL;
|
|
|
|
/* Increasing offset is inconsistent with forwards
|
|
|
|
* reshape. reshape_direction should be set to
|
|
|
|
* 'backwards' first.
|
|
|
|
*/
|
|
|
|
if (new_offset > rdev->data_offset &&
|
|
|
|
!mddev->reshape_backwards)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (mddev->pers && mddev->persistent &&
|
|
|
|
!super_types[mddev->major_version]
|
|
|
|
.allow_new_offset(rdev, new_offset))
|
|
|
|
return -E2BIG;
|
|
|
|
rdev->new_data_offset = new_offset;
|
|
|
|
if (new_offset > rdev->data_offset)
|
|
|
|
mddev->reshape_backwards = 1;
|
|
|
|
else if (new_offset < rdev->data_offset)
|
|
|
|
mddev->reshape_backwards = 0;
|
|
|
|
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
static struct rdev_sysfs_entry rdev_new_offset =
|
|
|
|
__ATTR(new_offset, S_IRUGO|S_IWUSR, new_offset_show, new_offset_store);
|
|
|
|
|
2006-01-06 08:21:06 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
rdev_size_show(struct md_rdev *rdev, char *page)
|
2006-01-06 08:21:06 +00:00
|
|
|
{
|
2009-03-31 03:33:13 +00:00
|
|
|
return sprintf(page, "%llu\n", (unsigned long long)rdev->sectors / 2);
|
2006-01-06 08:21:06 +00:00
|
|
|
}
|
|
|
|
|
2022-07-19 09:18:19 +00:00
|
|
|
static int md_rdevs_overlap(struct md_rdev *a, struct md_rdev *b)
|
2008-02-06 09:39:54 +00:00
|
|
|
{
|
|
|
|
/* check if two start/length pairs overlap */
|
2022-07-19 09:18:19 +00:00
|
|
|
if (a->data_offset + a->sectors <= b->data_offset)
|
|
|
|
return false;
|
|
|
|
if (b->data_offset + b->sectors <= a->data_offset)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool md_rdev_overlaps(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
struct mddev *mddev;
|
|
|
|
struct md_rdev *rdev2;
|
|
|
|
|
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
|
2022-07-19 09:18:23 +00:00
|
|
|
if (test_bit(MD_DELETED, &mddev->flags))
|
|
|
|
continue;
|
2022-07-19 09:18:19 +00:00
|
|
|
rdev_for_each(rdev2, mddev) {
|
|
|
|
if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
|
|
|
|
md_rdevs_overlap(rdev, rdev2)) {
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return false;
|
2008-02-06 09:39:54 +00:00
|
|
|
}
|
|
|
|
|
2009-03-31 04:00:31 +00:00
|
|
|
static int strict_blocks_to_sectors(const char *buf, sector_t *sectors)
|
|
|
|
{
|
|
|
|
unsigned long long blocks;
|
|
|
|
sector_t new;
|
|
|
|
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtoull(buf, 10, &blocks) < 0)
|
2009-03-31 04:00:31 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (blocks & 1ULL << (8 * sizeof(blocks) - 1))
|
|
|
|
return -EINVAL; /* sector conversion overflow */
|
|
|
|
|
|
|
|
new = blocks * 2;
|
|
|
|
if (new != blocks * 2)
|
|
|
|
return -EINVAL; /* unsigned long long to sector_t overflow */
|
|
|
|
|
|
|
|
*sectors = new;
|
|
|
|
return 0;
|
|
|
|
}
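The conversion in strict_blocks_to_sectors doubles a count of 1K blocks to get 512-byte sectors and refuses values whose top bit is set, since doubling those would overflow. Below is a userspace sketch under stated assumptions: `blocks_to_sectors_sketch` is a hypothetical name, strtoull() stands in for the kernel's kstrtoull() and is only an approximation of its parsing rules, and the second kernel check (narrowing to sector_t) is not modelled because the output type here is as wide as the input.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Sketch of the blocks-to-sectors conversion (hypothetical name).
 * Accepts a decimal number with an optional single trailing newline.
 * Returns 0 on success, -EINVAL on malformed or overflowing input.
 */
static int blocks_to_sectors_sketch(const char *buf,
				    unsigned long long *sectors)
{
	unsigned long long blocks;
	char *end;

	assert(buf != NULL && sectors != NULL);

	/* Assumption: require a leading digit, so signs and leading
	 * whitespace that strtoull() would tolerate are rejected. */
	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;

	errno = 0;
	blocks = strtoull(buf, &end, 10);
	if (errno)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* Top bit set: blocks * 2 would not fit. */
	if (blocks & (1ULL << (8 * sizeof(blocks) - 1)))
		return -EINVAL;

	*sectors = blocks * 2;
	return 0;
}
```

For example, writing "1024" to the size attribute corresponds to 2048 sectors, which is also why rdev_size_show divides rdev->sectors by 2 when reporting the value back.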
|
|
|
|
|
2006-01-06 08:21:06 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:45:26 +00:00
|
|
|
rdev_size_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2006-01-06 08:21:06 +00:00
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *my_mddev = rdev->mddev;
|
2009-03-31 03:33:13 +00:00
|
|
|
sector_t oldsectors = rdev->sectors;
|
2009-03-31 04:00:31 +00:00
|
|
|
sector_t sectors;
|
2008-03-04 22:29:33 +00:00
|
|
|
|
2015-10-09 04:54:12 +00:00
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
return -EBUSY;
|
2009-03-31 04:00:31 +00:00
|
|
|
if (strict_blocks_to_sectors(buf, &sectors) < 0)
|
2008-07-12 00:37:50 +00:00
|
|
|
return -EINVAL;
|
2012-05-20 23:27:00 +00:00
|
|
|
if (rdev->data_offset != rdev->new_data_offset)
|
|
|
|
return -EINVAL; /* too confusing */
|
2008-06-27 22:31:46 +00:00
|
|
|
if (my_mddev->pers && rdev->raid_disk >= 0) {
|
2008-07-12 00:37:50 +00:00
|
|
|
if (my_mddev->persistent) {
|
2009-03-31 03:33:13 +00:00
|
|
|
sectors = super_types[my_mddev->major_version].
|
|
|
|
rdev_size_change(rdev, sectors);
|
|
|
|
if (!sectors)
|
2008-06-27 22:31:46 +00:00
|
|
|
return -EBUSY;
|
2009-03-31 03:33:13 +00:00
|
|
|
} else if (!sectors)
|
2021-10-18 10:11:06 +00:00
|
|
|
sectors = bdev_nr_sectors(rdev->bdev) -
|
2009-03-31 03:33:13 +00:00
|
|
|
rdev->data_offset;
|
2013-02-21 03:33:17 +00:00
|
|
|
if (!my_mddev->pers->resize)
|
|
|
|
/* Cannot change size for RAID0 or Linear etc */
|
|
|
|
return -EINVAL;
|
2008-06-27 22:31:46 +00:00
|
|
|
}
|
2009-03-31 03:33:13 +00:00
|
|
|
if (sectors < my_mddev->dev_sectors)
|
2008-10-13 00:55:11 +00:00
|
|
|
return -EINVAL; /* component must fit device */
|
2008-06-27 22:31:46 +00:00
|
|
|
|
2009-03-31 03:33:13 +00:00
|
|
|
rdev->sectors = sectors;
|
2022-07-19 09:18:19 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that all other rdevs with the same bdev do not overlap. This
|
|
|
|
* check does not provide a hard guarantee, it just helps avoid
|
|
|
|
* dangerous mistakes.
|
|
|
|
*/
|
|
|
|
if (sectors > oldsectors && my_mddev->external &&
|
|
|
|
md_rdev_overlaps(rdev)) {
|
|
|
|
/*
|
|
|
|
* Someone else could have slipped in a size change here, but
|
|
|
|
* doing so is just silly. We put oldsectors back because we
|
|
|
|
* know it is safe, and trust userspace not to race with itself.
|
2008-02-06 09:39:54 +00:00
|
|
|
*/
|
2022-07-19 09:18:19 +00:00
|
|
|
rdev->sectors = oldsectors;
|
|
|
|
return -EBUSY;
|
2008-02-06 09:39:54 +00:00
|
|
|
}
|
2006-01-06 08:21:06 +00:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_size =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(size, S_IRUGO|S_IWUSR, rdev_size_show, rdev_size_store);
|
2006-01-06 08:21:06 +00:00
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t recovery_start_show(struct md_rdev *rdev, char *page)
|
2009-12-13 04:17:12 +00:00
|
|
|
{
|
|
|
|
unsigned long long recovery_start = rdev->recovery_offset;
|
|
|
|
|
|
|
|
if (test_bit(In_sync, &rdev->flags) ||
|
|
|
|
recovery_start == MaxSector)
|
|
|
|
return sprintf(page, "none\n");
|
|
|
|
|
|
|
|
return sprintf(page, "%llu\n", recovery_start);
|
|
|
|
}
|
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_t len)
|
2009-12-13 04:17:12 +00:00
|
|
|
{
|
|
|
|
unsigned long long recovery_start;
|
|
|
|
|
|
|
|
if (cmd_match(buf, "none"))
|
|
|
|
recovery_start = MaxSector;
|
2013-06-01 07:15:16 +00:00
|
|
|
else if (kstrtoull(buf, 10, &recovery_start))
|
2009-12-13 04:17:12 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (rdev->mddev->pers &&
|
|
|
|
rdev->raid_disk >= 0)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
rdev->recovery_offset = recovery_start;
|
|
|
|
if (recovery_start == MaxSector)
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
|
|
|
else
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_recovery_start =
|
|
|
|
__ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
|
|
|
|
|
2015-12-25 02:20:34 +00:00
|
|
|
/* sysfs access to bad-blocks list.
|
|
|
|
* We present two files.
|
|
|
|
* 'bad_blocks' lists sector numbers and lengths of ranges that
|
|
|
|
* are recorded as bad. The list is truncated to fit within
|
|
|
|
* the one-page limit of sysfs.
|
|
|
|
* Writing "sector length" to this file adds an acknowledged
|
|
|
|
* bad block to the list.
|
|
|
|
* 'unacknowledged_bad_blocks' lists bad blocks that have not yet
|
|
|
|
* been acknowledged. Writing to this file adds bad blocks
|
|
|
|
* without acknowledging them. This is largely for testing.
|
|
|
|
*/
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t bb_show(struct md_rdev *rdev, char *page)
|
2011-07-28 01:31:47 +00:00
|
|
|
{
|
|
|
|
return badblocks_show(&rdev->badblocks, page, 0);
|
|
|
|
}
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t bb_store(struct md_rdev *rdev, const char *page, size_t len)
|
2011-07-28 01:31:47 +00:00
|
|
|
{
|
2011-07-28 01:31:48 +00:00
|
|
|
int rv = badblocks_store(&rdev->badblocks, page, len, 0);
|
|
|
|
/* Maybe that ack was all we needed */
|
|
|
|
if (test_and_clear_bit(BlockedBadBlocks, &rdev->flags))
|
|
|
|
wake_up(&rdev->blocked_wait);
|
|
|
|
return rv;
|
2011-07-28 01:31:47 +00:00
|
|
|
}
|
|
|
|
static struct rdev_sysfs_entry rdev_bad_blocks =
|
|
|
|
__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
|
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t ubb_show(struct md_rdev *rdev, char *page)
|
2011-07-28 01:31:47 +00:00
|
|
|
{
|
|
|
|
return badblocks_show(&rdev->badblocks, page, 1);
|
|
|
|
}
|
2011-10-11 05:45:26 +00:00
|
|
|
static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
|
2011-07-28 01:31:47 +00:00
|
|
|
{
|
|
|
|
return badblocks_store(&rdev->badblocks, page, len, 1);
|
|
|
|
}
|
|
|
|
static struct rdev_sysfs_entry rdev_unack_bad_blocks =
|
|
|
|
__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
|
|
|
|
|
2017-03-09 09:00:00 +00:00
|
|
|
static ssize_t
|
|
|
|
ppl_sector_show(struct md_rdev *rdev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
unsigned long long sector;
|
|
|
|
|
|
|
|
if (kstrtoull(buf, 10, &sector) < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (sector != (sector_t)sector)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
|
|
|
|
rdev->raid_disk >= 0)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
if (rdev->mddev->persistent) {
|
|
|
|
if (rdev->mddev->major_version == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if ((sector > rdev->sb_start &&
|
|
|
|
sector - rdev->sb_start > S16_MAX) ||
|
|
|
|
(sector < rdev->sb_start &&
|
|
|
|
rdev->sb_start - sector > -S16_MIN))
|
|
|
|
return -EINVAL;
|
|
|
|
rdev->ppl.offset = sector - rdev->sb_start;
|
|
|
|
} else if (!rdev->mddev->external) {
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
rdev->ppl.sector = sector;
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_ppl_sector =
|
|
|
|
__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
ppl_size_show(struct md_rdev *rdev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%u\n", rdev->ppl.size);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
unsigned int size;
|
|
|
|
|
|
|
|
if (kstrtouint(buf, 10, &size) < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
|
|
|
|
rdev->raid_disk >= 0)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
if (rdev->mddev->persistent) {
|
|
|
|
if (rdev->mddev->major_version == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (size > U16_MAX)
|
|
|
|
return -EINVAL;
|
|
|
|
} else if (!rdev->mddev->external) {
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
rdev->ppl.size = size;
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct rdev_sysfs_entry rdev_ppl_size =
|
|
|
|
__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
|
|
|
|
|
2005-11-09 05:39:24 +00:00
|
|
|
static struct attribute *rdev_default_attrs[] = {
|
|
|
|
&rdev_state.attr,
|
2006-01-06 08:20:52 +00:00
|
|
|
&rdev_errors.attr,
|
2006-01-06 08:20:55 +00:00
|
|
|
&rdev_slot.attr,
|
2006-01-06 08:20:56 +00:00
|
|
|
&rdev_offset.attr,
|
2012-05-20 23:27:00 +00:00
|
|
|
&rdev_new_offset.attr,
|
2006-01-06 08:21:06 +00:00
|
|
|
&rdev_size.attr,
|
2009-12-13 04:17:12 +00:00
|
|
|
&rdev_recovery_start.attr,
|
2011-07-28 01:31:47 +00:00
|
|
|
&rdev_bad_blocks.attr,
|
|
|
|
&rdev_unack_bad_blocks.attr,
|
2017-03-09 09:00:00 +00:00
|
|
|
&rdev_ppl_sector.attr,
|
|
|
|
&rdev_ppl_size.attr,
|
2005-11-09 05:39:24 +00:00
|
|
|
NULL,
|
|
|
|
};
|
2022-01-06 10:03:35 +00:00
|
|
|
ATTRIBUTE_GROUPS(rdev_default);
|
2005-11-09 05:39:24 +00:00
|
|
|
static ssize_t
|
|
|
|
rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
|
|
|
|
{
|
|
|
|
struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
|
2005-11-09 05:39:24 +00:00
|
|
|
|
|
|
|
if (!entry->show)
|
|
|
|
return -EIO;
|
2014-12-15 01:56:59 +00:00
|
|
|
if (!rdev->mddev)
|
2019-06-14 22:41:06 +00:00
|
|
|
return -ENODEV;
|
2014-12-15 01:56:59 +00:00
|
|
|
return entry->show(rdev, page);
|
2005-11-09 05:39:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
rdev_attr_store(struct kobject *kobj, struct attribute *attr,
|
|
|
|
const char *page, size_t length)
|
|
|
|
{
|
|
|
|
struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
|
md: fix duplicate filename for rdev
Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.
Follow up patches try to fix this problem by flushing workqueue, however,
flush_rdev_wq() is just dead code, the progress in
md_kick_rdev_from_array():
1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);
So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.
flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
a small window between flush_workqueue() and mddev_lock() that other
contexts can queue new work, hence the problem is not solved completely.
sysfs already has apis to support delete itself through writer, and
these apis, specifically sysfs_break/unbreak_active_protection(), is used
to support deleting rdev synchronously. Therefore, the above commit can be
reverted, and sysfs duplicate filename can be avoided.
A new mdadm regression test is proposed as well([1]).
[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
2023-05-23 01:27:27 +00:00
|
|
|
struct kernfs_node *kn = NULL;
|
2008-03-04 22:29:33 +00:00
|
|
|
ssize_t rv;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = rdev->mddev;
|
2005-11-09 05:39:24 +00:00
|
|
|
|
|
|
|
if (!entry->store)
|
|
|
|
return -EIO;
|
2006-07-10 11:44:19 +00:00
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EACCES;
|
2023-05-23 01:27:27 +00:00
|
|
|
|
|
|
|
if (entry->store == state_store && cmd_match(page, "remove"))
|
|
|
|
kn = sysfs_break_active_protection(kobj, attr);
|
|
|
|
|
2019-03-27 12:48:21 +00:00
|
|
|
rv = mddev ? mddev_lock(mddev) : -ENODEV;
|
2008-02-06 09:39:55 +00:00
|
|
|
if (!rv) {
|
2008-03-04 22:29:33 +00:00
|
|
|
if (rdev->mddev == NULL)
|
2019-03-27 12:48:21 +00:00
|
|
|
rv = -ENODEV;
|
2008-03-04 22:29:33 +00:00
|
|
|
else
|
|
|
|
rv = entry->store(rdev, page, length);
|
2008-04-30 07:52:28 +00:00
|
|
|
mddev_unlock(mddev);
|
2008-02-06 09:39:55 +00:00
|
|
|
}
|
2023-05-23 01:27:27 +00:00
|
|
|
|
|
|
|
if (kn)
|
|
|
|
sysfs_unbreak_active_protection(kn);
|
|
|
|
|
2008-02-06 09:39:55 +00:00
|
|
|
return rv;
|
2005-11-09 05:39:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void rdev_free(struct kobject *ko)
|
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev = container_of(ko, struct md_rdev, kobj);
|
2005-11-09 05:39:24 +00:00
|
|
|
kfree(rdev);
|
|
|
|
}
|
2010-01-19 01:58:23 +00:00
|
|
|
static const struct sysfs_ops rdev_sysfs_ops = {
|
2005-11-09 05:39:24 +00:00
|
|
|
.show = rdev_attr_show,
|
|
|
|
.store = rdev_attr_store,
|
|
|
|
};
|
2023-02-14 03:19:22 +00:00
|
|
|
static const struct kobj_type rdev_ktype = {
|
2005-11-09 05:39:24 +00:00
|
|
|
.release = rdev_free,
|
|
|
|
.sysfs_ops = &rdev_sysfs_ops,
|
2022-01-06 10:03:35 +00:00
|
|
|
.default_groups = rdev_default_groups,
|
2005-11-09 05:39:24 +00:00
|
|
|
};
|
|
|
|
|
2011-10-11 05:45:26 +00:00
|
|
|
int md_rdev_init(struct md_rdev *rdev)
|
2010-06-01 09:37:26 +00:00
|
|
|
{
|
|
|
|
rdev->desc_nr = -1;
|
|
|
|
rdev->saved_raid_disk = -1;
|
|
|
|
rdev->raid_disk = -1;
|
|
|
|
rdev->flags = 0;
|
|
|
|
rdev->data_offset = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
rdev->new_data_offset = 0;
|
2010-06-01 09:37:26 +00:00
|
|
|
rdev->sb_events = 0;
|
2016-06-17 15:33:10 +00:00
|
|
|
rdev->last_read_error = 0;
|
2011-07-28 01:31:47 +00:00
|
|
|
rdev->sb_loaded = 0;
|
|
|
|
rdev->bb_page = NULL;
|
2010-06-01 09:37:26 +00:00
|
|
|
atomic_set(&rdev->nr_pending, 0);
|
|
|
|
atomic_set(&rdev->read_errors, 0);
|
|
|
|
atomic_set(&rdev->corrected_errors, 0);
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&rdev->same_set);
|
|
|
|
init_waitqueue_head(&rdev->blocked_wait);
|
2011-07-28 01:31:46 +00:00
|
|
|
|
|
|
|
/* Add space to store bad block list.
|
|
|
|
* This reserves the space even on arrays where it cannot
|
|
|
|
* be used - I wonder if that matters
|
|
|
|
*/
|
2015-12-25 02:20:34 +00:00
|
|
|
return badblocks_init(&rdev->badblocks, 0);
|
2010-06-01 09:37:26 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(md_rdev_init);
|
2023-06-08 11:02:43 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Import a device. If 'super_format' >= 0, then sanity check the superblock
|
|
|
|
*
|
|
|
|
* mark the device faulty if:
|
|
|
|
*
|
|
|
|
* - the device is nonexistent (zero size)
|
|
|
|
* - the device has no valid superblock
|
|
|
|
*
|
|
|
|
* a faulty rdev _never_ has rdev->sb set.
|
|
|
|
*/
|
2011-10-11 05:45:26 +00:00
|
|
|
static struct md_rdev *md_import_device(dev_t newdev, int super_format, int super_minor)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2023-08-25 02:55:32 +00:00
|
|
|
struct md_rdev *holder;
|
2005-04-16 22:20:36 +00:00
|
|
|
sector_t size;
|
2022-11-29 13:32:53 +00:00
|
|
|
int err;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-01-06 08:20:32 +00:00
|
|
|
rdev = kzalloc(sizeof(*rdev), GFP_KERNEL);
|
2016-11-02 03:16:49 +00:00
|
|
|
if (!rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2011-07-28 01:31:46 +00:00
|
|
|
err = md_rdev_init(rdev);
|
|
|
|
if (err)
|
2022-11-29 13:32:53 +00:00
|
|
|
goto out_free_rdev;
|
2011-07-28 01:31:46 +00:00
|
|
|
err = alloc_disk_sb(rdev);
|
|
|
|
if (err)
|
2022-11-29 13:32:53 +00:00
|
|
|
goto out_clear_rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-08-25 02:55:32 +00:00
|
|
|
if (super_format == -2) {
|
|
|
|
holder = &claim_rdev;
|
|
|
|
} else {
|
|
|
|
holder = rdev;
|
|
|
|
set_bit(Holder, &rdev->flags);
|
|
|
|
}
|
|
|
|
|
2023-06-08 11:02:55 +00:00
|
|
|
rdev->bdev = blkdev_get_by_dev(newdev, BLK_OPEN_READ | BLK_OPEN_WRITE,
|
2023-08-25 02:55:32 +00:00
|
|
|
holder, NULL);
|
2022-11-29 13:32:53 +00:00
|
|
|
if (IS_ERR(rdev->bdev)) {
|
|
|
|
pr_warn("md: could not open device unknown-block(%u,%u).\n",
|
|
|
|
MAJOR(newdev), MINOR(newdev));
|
|
|
|
err = PTR_ERR(rdev->bdev);
|
|
|
|
goto out_clear_rdev;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-12-18 06:05:35 +00:00
|
|
|
kobject_init(&rdev->kobj, &rdev_ktype);
|
2005-11-09 05:39:24 +00:00
|
|
|
|
2021-10-18 10:11:06 +00:00
|
|
|
size = bdev_nr_bytes(rdev->bdev) >> BLOCK_SIZE_BITS;
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!size) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg has zero or unknown size, marking faulty!\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
err = -EINVAL;
|
2022-11-29 13:32:53 +00:00
|
|
|
goto out_blkdev_put;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (super_format >= 0) {
|
|
|
|
err = super_types[super_format].
|
|
|
|
load_super(rdev, NULL, super_minor);
|
|
|
|
if (err == -EINVAL) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg does not have a valid v%d.%d superblock, not importing!\n",
|
|
|
|
rdev->bdev,
|
2016-11-02 03:16:49 +00:00
|
|
|
super_format, super_minor);
|
2022-11-29 13:32:53 +00:00
|
|
|
goto out_blkdev_put;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
if (err < 0) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: could not read %pg's sb, not importing!\n",
|
|
|
|
rdev->bdev);
|
2022-11-29 13:32:53 +00:00
|
|
|
goto out_blkdev_put;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2008-04-30 07:52:32 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return rdev;
|
|
|
|
|
2022-11-29 13:32:53 +00:00
|
|
|
out_blkdev_put:
|
2023-08-25 02:55:32 +00:00
|
|
|
blkdev_put(rdev->bdev, holder);
|
2022-11-29 13:32:53 +00:00
|
|
|
out_clear_rdev:
|
2012-05-22 03:54:30 +00:00
|
|
|
md_rdev_clear(rdev);
|
2022-11-29 13:32:53 +00:00
|
|
|
out_free_rdev:
|
2005-04-16 22:20:36 +00:00
|
|
|
kfree(rdev);
|
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check a full RAID array for plausibility
|
|
|
|
*/
|
|
|
|
|
md: no longer compare spare disk superblock events in super_load
We have a test case as follow:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.
After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 08:00:03 +00:00
|
|
|
static int analyze_sbs(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int i;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev, *freshest, *tmp;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
freshest = NULL;
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each_safe(rdev, tmp, mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
switch (super_types[mddev->major_version].
|
|
|
|
load_super(rdev, freshest, mddev->minor_version)) {
|
|
|
|
case 1:
|
|
|
|
freshest = rdev;
|
|
|
|
break;
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
default:
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: fatal superblock inconsistency in %pg -- removing from array\n",
|
|
|
|
rdev->bdev);
|
2015-04-14 15:43:24 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2019-10-16 08:00:03 +00:00
|
|
|
/* Cannot find a valid fresh disk */
|
|
|
|
if (!freshest) {
|
|
|
|
pr_warn("md: cannot find a valid disk\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
super_types[mddev->major_version].
|
|
|
|
validate_super(mddev, freshest);
|
|
|
|
|
|
|
|
i = 0;
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each_safe(rdev, tmp, mddev) {
|
2010-04-14 07:02:09 +00:00
|
|
|
if (mddev->max_disks &&
|
|
|
|
(rdev->desc_nr >= mddev->max_disks ||
|
|
|
|
i > mddev->max_disks)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %s: %pg: only %d devices permitted\n",
|
|
|
|
mdname(mddev), rdev->bdev,
|
2016-11-02 03:16:49 +00:00
|
|
|
mddev->max_disks);
|
2015-04-14 15:43:24 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2009-02-06 07:02:46 +00:00
|
|
|
continue;
|
|
|
|
}
|
2014-10-29 23:51:31 +00:00
|
|
|
if (rdev != freshest) {
|
2005-04-16 22:20:36 +00:00
|
|
|
if (super_types[mddev->major_version].
|
|
|
|
validate_super(mddev, rdev)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: kicking non-fresh %pg from array!\n",
|
|
|
|
rdev->bdev);
|
2015-04-14 15:43:24 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
continue;
|
|
|
|
}
|
2014-10-29 23:51:31 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->level == LEVEL_MULTIPATH) {
|
|
|
|
rdev->desc_nr = i++;
|
|
|
|
rdev->raid_disk = rdev->desc_nr;
|
2005-11-09 05:39:31 +00:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2015-10-09 04:54:12 +00:00
|
|
|
} else if (rdev->raid_disk >=
|
|
|
|
(mddev->raid_disks - min(0, mddev->delta_disks)) &&
|
|
|
|
!test_bit(Journal, &rdev->flags)) {
|
2007-05-23 20:58:10 +00:00
|
|
|
rdev->raid_disk = -1;
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2019-10-16 08:00:03 +00:00
|
|
|
|
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2009-12-14 01:49:55 +00:00
|
|
|
/* Read a fixed-point number.
|
|
|
|
* Numbers in sysfs attributes should be in "standard" units where
|
|
|
|
* possible, so time should be in seconds.
|
2014-09-30 04:23:59 +00:00
|
|
|
* However we internally use a much smaller unit such as
|
2009-12-14 01:49:55 +00:00
|
|
|
* milliseconds or jiffies.
|
|
|
|
* This function takes a decimal number with a possible fractional
|
|
|
|
* component, and produces an integer which is the result of
|
|
|
|
* multiplying that number by 10^'scale'.
|
|
|
|
* all without any floating-point arithmetic.
|
|
|
|
*/
|
|
|
|
int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale)
|
|
|
|
{
|
|
|
|
unsigned long result = 0;
|
|
|
|
long decimals = -1;
|
|
|
|
while (isdigit(*cp) || (*cp == '.' && decimals < 0)) {
|
|
|
|
if (*cp == '.')
|
|
|
|
decimals = 0;
|
|
|
|
else if (decimals < scale) {
|
|
|
|
unsigned int value;
|
|
|
|
value = *cp - '0';
|
|
|
|
result = result * 10 + value;
|
|
|
|
if (decimals >= 0)
|
|
|
|
decimals++;
|
|
|
|
}
|
|
|
|
cp++;
|
|
|
|
}
|
|
|
|
if (*cp == '\n')
|
|
|
|
cp++;
|
|
|
|
if (*cp)
|
|
|
|
return -EINVAL;
|
|
|
|
if (decimals < 0)
|
|
|
|
decimals = 0;
|
2019-07-23 20:41:55 +00:00
|
|
|
*res = result * int_pow(10, scale - decimals);
|
2009-12-14 01:49:55 +00:00
|
|
|
return 0;
|
|
|
|
}
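/*
 * Illustrative sketch, not part of md.c: a userspace port of the
 * fixed-point parser above, useful for checking how inputs map to scaled
 * integers. strtoul_scaled() and pow10_ul() are names assumed for this
 * example; pow10_ul() stands in for the kernel's int_pow().
 */

```c
#include <ctype.h>
#include <errno.h>

/* Small integer power of ten; replaces int_pow(10, exp) in this sketch. */
static unsigned long pow10_ul(unsigned int exp)
{
	unsigned long r = 1;

	while (exp--)
		r *= 10;
	return r;
}

/*
 * Parse a decimal number with an optional fraction and store the value
 * multiplied by 10^scale in *res. Fraction digits beyond 'scale' are
 * dropped (truncated), not rounded.
 */
static int strtoul_scaled(const char *cp, unsigned long *res, int scale)
{
	unsigned long result = 0;
	long decimals = -1;	/* -1 until a '.' has been seen */

	while (isdigit((unsigned char)*cp) || (*cp == '.' && decimals < 0)) {
		if (*cp == '.') {
			decimals = 0;
		} else if (decimals < scale) {
			result = result * 10 + (unsigned int)(*cp - '0');
			if (decimals >= 0)
				decimals++;
		}
		cp++;
	}
	if (*cp == '\n')
		cp++;
	if (*cp)
		return -EINVAL;	/* trailing junk, including a second '.' */
	if (decimals < 0)
		decimals = 0;
	*res = result * pow10_ul((unsigned int)(scale - decimals));
	return 0;
}
```

/*
 * With scale 3, "0.2" becomes 200 and "1.2345" becomes 1234: the fourth
 * fraction digit is ignored because only three decimals are kept.
 */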
|
|
|
|
|
2006-06-26 07:27:37 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
safe_delay_show(struct mddev *mddev, char *page)
|
2006-06-26 07:27:37 +00:00
|
|
|
{
|
2023-05-22 07:25:33 +00:00
|
|
|
unsigned int msec = ((unsigned long)mddev->safemode_delay*1000)/HZ;
|
|
|
|
|
|
|
|
return sprintf(page, "%u.%03u\n", msec/1000, msec%1000);
|
2006-06-26 07:27:37 +00:00
|
|
|
}
static ssize_t
safe_delay_store(struct mddev *mddev, const char *cbuf, size_t len)
{
	unsigned long msec;

	if (mddev_is_clustered(mddev)) {
		pr_warn("md: Safemode is disabled for clustered mode\n");
		return -EINVAL;
	}

	if (strict_strtoul_scaled(cbuf, &msec, 3) < 0 || msec > UINT_MAX / HZ)
		return -EINVAL;
	if (msec == 0)
		mddev->safemode_delay = 0;
	else {
		unsigned long old_delay = mddev->safemode_delay;
		unsigned long new_delay = (msec*HZ)/1000;

		if (new_delay == 0)
			new_delay = 1;
		mddev->safemode_delay = new_delay;
		if (new_delay < old_delay || old_delay == 0)
			mod_timer(&mddev->safemode_timer, jiffies+1);
	}
	return len;
}
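/*
 * Illustrative sketch, not part of the driver: the millisecond/tick
 * conversions used by safe_delay_show() and safe_delay_store(), in
 * userspace.  SKETCH_HZ, ticks_to_msec() and msec_to_ticks() are
 * hypothetical names, and 250 is only an assumed tick rate; the kernel's
 * HZ is a build-time option.
 */

```c
#include <assert.h>

#define SKETCH_HZ 250UL	/* assumed tick rate, for illustration only */

/* Ticks to milliseconds, truncating, as the show side does. */
static unsigned int ticks_to_msec(unsigned long ticks)
{
	return (unsigned int)((ticks * 1000) / SKETCH_HZ);
}

/*
 * Milliseconds to ticks.  Zero means "disabled" and is passed through;
 * any non-zero request is rounded up to at least one tick so that it
 * cannot silently turn into "disabled".
 */
static unsigned long msec_to_ticks(unsigned long msec)
{
	unsigned long ticks;

	if (msec == 0)
		return 0;
	ticks = (msec * SKETCH_HZ) / 1000;
	return ticks ? ticks : 1;
}
```

/*
 * At the assumed rate, a 1 ms request becomes one tick and reads back
 * as 4 ms, so a value written to the attribute need not read back
 * unchanged.
 */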
static struct md_sysfs_entry md_safe_delay =
__ATTR(safe_mode_delay, S_IRUGO|S_IWUSR, safe_delay_show, safe_delay_store);

static ssize_t
level_show(struct mddev *mddev, char *page)
{
	struct md_personality *p;
	int ret;
	spin_lock(&mddev->lock);
	p = mddev->pers;
	if (p)
		ret = sprintf(page, "%s\n", p->name);
	else if (mddev->clevel[0])
		ret = sprintf(page, "%s\n", mddev->clevel);
	else if (mddev->level != LEVEL_NONE)
		ret = sprintf(page, "%d\n", mddev->level);
	else
		ret = 0;
	spin_unlock(&mddev->lock);
	return ret;
}

static ssize_t
level_store(struct mddev *mddev, const char *buf, size_t len)
{
	char clevel[16];
	ssize_t rv;
	size_t slen = len;
	struct md_personality *pers, *oldpers;
	long level;
	void *priv, *oldpriv;
	struct md_rdev *rdev;

	if (slen == 0 || slen >= sizeof(clevel))
		return -EINVAL;

	rv = mddev_lock(mddev);
	if (rv)
		return rv;

	if (mddev->pers == NULL) {
		strncpy(mddev->clevel, buf, slen);
		if (mddev->clevel[slen-1] == '\n')
			slen--;
		mddev->clevel[slen] = 0;
		mddev->level = LEVEL_NONE;
		rv = len;
		goto out_unlock;
	}
	rv = -EROFS;
	if (!md_is_rdwr(mddev))
		goto out_unlock;

	/* request to change the personality.  Need to ensure:
	 *  - array is not engaged in resync/recovery/reshape
	 *  - old personality can be suspended
	 *  - new personality will access other array.
	 */

	rv = -EBUSY;
	if (mddev->sync_thread ||
	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
	    mddev->reshape_position != MaxSector ||
	    mddev->sysfs_active)
		goto out_unlock;

	rv = -EINVAL;
	if (!mddev->pers->quiesce) {
		pr_warn("md: %s: %s does not support online personality change\n",
			mdname(mddev), mddev->pers->name);
		goto out_unlock;
	}

	/* Now find the new personality */
	strncpy(clevel, buf, slen);
	if (clevel[slen-1] == '\n')
		slen--;
	clevel[slen] = 0;
	if (kstrtol(clevel, 10, &level))
		level = LEVEL_NONE;

	if (request_module("md-%s", clevel) != 0)
		request_module("md-level-%s", clevel);
	spin_lock(&pers_lock);
	pers = find_pers(level, clevel);
	if (!pers || !try_module_get(pers->owner)) {
		spin_unlock(&pers_lock);
		pr_warn("md: personality %s not loaded\n", clevel);
		rv = -EINVAL;
		goto out_unlock;
	}
	spin_unlock(&pers_lock);

	if (pers == mddev->pers) {
		/* Nothing to do! */
		module_put(pers->owner);
		rv = len;
		goto out_unlock;
	}
	if (!pers->takeover) {
		module_put(pers->owner);
		pr_warn("md: %s: %s does not support personality takeover\n",
			mdname(mddev), clevel);
		rv = -EINVAL;
		goto out_unlock;
	}

	rdev_for_each(rdev, mddev)
		rdev->new_raid_disk = rdev->raid_disk;

	/* ->takeover must set new_* and/or delta_disks
	 * if it succeeds, and may set them when it fails.
	 */
	priv = pers->takeover(mddev);
	if (IS_ERR(priv)) {
		mddev->new_level = mddev->level;
		mddev->new_layout = mddev->layout;
		mddev->new_chunk_sectors = mddev->chunk_sectors;
		mddev->raid_disks -= mddev->delta_disks;
		mddev->delta_disks = 0;
		mddev->reshape_backwards = 0;
		module_put(pers->owner);
		pr_warn("md: %s: %s would not accept array\n",
			mdname(mddev), clevel);
		rv = PTR_ERR(priv);
		goto out_unlock;
	}

	/* Looks like we have a winner */
	mddev_suspend(mddev);
	mddev_detach(mddev);

	spin_lock(&mddev->lock);
	oldpers = mddev->pers;
	oldpriv = mddev->private;
	mddev->pers = pers;
	mddev->private = priv;
	strscpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
	mddev->level = mddev->new_level;
	mddev->layout = mddev->new_layout;
	mddev->chunk_sectors = mddev->new_chunk_sectors;
	mddev->delta_disks = 0;
	mddev->reshape_backwards = 0;
	mddev->degraded = 0;
	spin_unlock(&mddev->lock);

	if (oldpers->sync_request == NULL &&
	    mddev->external) {
		/* We are converting from a no-redundancy array
		 * to a redundancy array and metadata is managed
		 * externally so we need to be sure that writes
		 * won't block due to a need to transition
		 *      clean->dirty
		 * until external management is started.
		 */
		mddev->in_sync = 0;
		mddev->safemode_delay = 0;
		mddev->safemode = 0;
	}

	oldpers->free(mddev, oldpriv);

	if (oldpers->sync_request == NULL &&
	    pers->sync_request != NULL) {
		/* need to add the md_redundancy_group */
		if (sysfs_create_group(&mddev->kobj, &md_redundancy_group))
			pr_warn("md: cannot register extra attributes for %s\n",
				mdname(mddev));
		mddev->sysfs_action = sysfs_get_dirent(mddev->kobj.sd, "sync_action");
		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
	}
	if (oldpers->sync_request != NULL &&
	    pers->sync_request == NULL) {
		/* need to remove the md_redundancy_group */
		if (mddev->to_remove == NULL)
			mddev->to_remove = &md_redundancy_group;
	}

	module_put(oldpers->owner);

	rdev_for_each(rdev, mddev) {
		if (rdev->raid_disk < 0)
			continue;
		if (rdev->new_raid_disk >= mddev->raid_disks)
			rdev->new_raid_disk = -1;
		if (rdev->new_raid_disk == rdev->raid_disk)
			continue;
		sysfs_unlink_rdev(mddev, rdev);
	}
	rdev_for_each(rdev, mddev) {
		if (rdev->raid_disk < 0)
			continue;
		if (rdev->new_raid_disk == rdev->raid_disk)
			continue;
		rdev->raid_disk = rdev->new_raid_disk;
		if (rdev->raid_disk < 0)
			clear_bit(In_sync, &rdev->flags);
		else {
			if (sysfs_link_rdev(mddev, rdev))
				pr_warn("md: cannot register rd%d for %s after level change\n",
					rdev->raid_disk, mdname(mddev));
		}
	}

	if (pers->sync_request == NULL) {
		/* this is now an array without redundancy, so
		 * it must always be in_sync
		 */
		mddev->in_sync = 1;
		del_timer_sync(&mddev->safemode_timer);
	}
	blk_set_stacking_limits(&mddev->queue->limits);
	pers->run(mddev);
	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
	mddev_resume(mddev);
	if (!mddev->thread)
		md_update_sb(mddev, 1);
	sysfs_notify_dirent_safe(mddev->sysfs_level);
	md_new_event();
	rv = len;
out_unlock:
	mddev_unlock(mddev);
	return rv;
}
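/*
 * Illustrative sketch, not part of the driver: the bounded copy with
 * trailing-newline trimming that level_store() applies to the sysfs
 * buffer, written as a userspace helper.  copy_level_name() is a
 * hypothetical name, memcpy() is swapped in for strncpy(), and -1
 * stands in for -EINVAL.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy 'len' bytes of 'buf' into 'dst', drop one trailing newline if
 * present, and NUL-terminate.  Rejects empty input and input that
 * would not leave room for the terminator.
 */
static int copy_level_name(char *dst, size_t dstsize, const char *buf, size_t len)
{
	if (len == 0 || len >= dstsize)
		return -1;
	memcpy(dst, buf, len);
	if (dst[len - 1] == '\n')
		len--;
	dst[len] = '\0';
	return 0;
}
```

/*
 * The length check comes first so that the index len - 1 is valid and
 * the terminator always fits inside the destination.
 */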

static struct md_sysfs_entry md_level =
__ATTR(level, S_IRUGO|S_IWUSR, level_show, level_store);

static ssize_t
layout_show(struct mddev *mddev, char *page)
{
	/* just a number, not meaningful for all levels */
	if (mddev->reshape_position != MaxSector &&
	    mddev->layout != mddev->new_layout)
		return sprintf(page, "%d (%d)\n",
			       mddev->new_layout, mddev->layout);
	return sprintf(page, "%d\n", mddev->layout);
}

static ssize_t
layout_store(struct mddev *mddev, const char *buf, size_t len)
{
	unsigned int n;
	int err;

	err = kstrtouint(buf, 10, &n);
	if (err < 0)
		return err;
	err = mddev_lock(mddev);
	if (err)
		return err;

	if (mddev->pers) {
		if (mddev->pers->check_reshape == NULL)
			err = -EBUSY;
		else if (!md_is_rdwr(mddev))
			err = -EROFS;
		else {
			mddev->new_layout = n;
			err = mddev->pers->check_reshape(mddev);
			if (err)
				mddev->new_layout = mddev->layout;
		}
	} else {
		mddev->new_layout = n;
		if (mddev->reshape_position == MaxSector)
			mddev->layout = n;
	}
	mddev_unlock(mddev);
	return err ?: len;
}
static struct md_sysfs_entry md_layout =
__ATTR(layout, S_IRUGO|S_IWUSR, layout_show, layout_store);

static ssize_t
raid_disks_show(struct mddev *mddev, char *page)
{
	if (mddev->raid_disks == 0)
		return 0;
	if (mddev->reshape_position != MaxSector &&
	    mddev->delta_disks != 0)
		return sprintf(page, "%d (%d)\n", mddev->raid_disks,
			       mddev->raid_disks - mddev->delta_disks);
	return sprintf(page, "%d\n", mddev->raid_disks);
}

static int update_raid_disks(struct mddev *mddev, int raid_disks);

static ssize_t
raid_disks_store(struct mddev *mddev, const char *buf, size_t len)
{
	unsigned int n;
	int err;

	err = kstrtouint(buf, 10, &n);
	if (err < 0)
		return err;

	err = mddev_lock(mddev);
	if (err)
		return err;
	if (mddev->pers)
		err = update_raid_disks(mddev, n);
	else if (mddev->reshape_position != MaxSector) {
		struct md_rdev *rdev;
		int olddisks = mddev->raid_disks - mddev->delta_disks;

		err = -EINVAL;
		rdev_for_each(rdev, mddev) {
			if (olddisks < n &&
			    rdev->data_offset < rdev->new_data_offset)
				goto out_unlock;
			if (olddisks > n &&
			    rdev->data_offset > rdev->new_data_offset)
				goto out_unlock;
		}
		err = 0;
		mddev->delta_disks = n - olddisks;
		mddev->raid_disks = n;
		mddev->reshape_backwards = (mddev->delta_disks < 0);
	} else
		mddev->raid_disks = n;
out_unlock:
	mddev_unlock(mddev);
	return err ? err : len;
}
static struct md_sysfs_entry md_raid_disks =
__ATTR(raid_disks, S_IRUGO|S_IWUSR, raid_disks_show, raid_disks_store);

static ssize_t
uuid_show(struct mddev *mddev, char *page)
{
	return sprintf(page, "%pU\n", mddev->uuid);
}
static struct md_sysfs_entry md_uuid =
__ATTR(uuid, S_IRUGO, uuid_show, NULL);

static ssize_t
chunk_size_show(struct mddev *mddev, char *page)
{
	if (mddev->reshape_position != MaxSector &&
	    mddev->chunk_sectors != mddev->new_chunk_sectors)
		return sprintf(page, "%d (%d)\n",
			       mddev->new_chunk_sectors << 9,
			       mddev->chunk_sectors << 9);
	return sprintf(page, "%d\n", mddev->chunk_sectors << 9);
}

static ssize_t
chunk_size_store(struct mddev *mddev, const char *buf, size_t len)
{
	unsigned long n;
	int err;

	err = kstrtoul(buf, 10, &n);
	if (err < 0)
		return err;

	err = mddev_lock(mddev);
	if (err)
		return err;
	if (mddev->pers) {
		if (mddev->pers->check_reshape == NULL)
			err = -EBUSY;
		else if (!md_is_rdwr(mddev))
			err = -EROFS;
		else {
			mddev->new_chunk_sectors = n >> 9;
			err = mddev->pers->check_reshape(mddev);
			if (err)
				mddev->new_chunk_sectors = mddev->chunk_sectors;
		}
	} else {
		mddev->new_chunk_sectors = n >> 9;
		if (mddev->reshape_position == MaxSector)
			mddev->chunk_sectors = n >> 9;
	}
	mddev_unlock(mddev);
	return err ?: len;
}
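/*
 * Illustrative sketch, not part of the driver: the byte/sector
 * conversion behind the chunk_size attribute, which is written in bytes
 * but stored in 512-byte sectors.  bytes_to_sectors() and
 * sectors_to_bytes() are hypothetical helper names.
 */

```c
#include <assert.h>

/* Bytes to 512-byte sectors; any remainder below one sector is dropped. */
static unsigned long bytes_to_sectors(unsigned long bytes)
{
	return bytes >> 9;
}

/* 512-byte sectors back to bytes, as the show side reports them. */
static unsigned long sectors_to_bytes(unsigned long sectors)
{
	return sectors << 9;
}
```

/*
 * Because the shift truncates, a size that is not a multiple of 512
 * does not survive the round trip: 1000 bytes is stored as one sector
 * and reads back as 512.
 */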
static struct md_sysfs_entry md_chunk_size =
__ATTR(chunk_size, S_IRUGO|S_IWUSR, chunk_size_show, chunk_size_store);

static ssize_t
resync_start_show(struct mddev *mddev, char *page)
{
	if (mddev->recovery_cp == MaxSector)
		return sprintf(page, "none\n");
	return sprintf(page, "%llu\n", (unsigned long long)mddev->recovery_cp);
}

static ssize_t
resync_start_store(struct mddev *mddev, const char *buf, size_t len)
{
	unsigned long long n;
	int err;

	if (cmd_match(buf, "none"))
		n = MaxSector;
	else {
		err = kstrtoull(buf, 10, &n);
		if (err < 0)
			return err;
		if (n != (sector_t)n)
			return -EINVAL;
	}

	err = mddev_lock(mddev);
	if (err)
		return err;
	if (mddev->pers && !test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
		err = -EBUSY;

	if (!err) {
		mddev->recovery_cp = n;
		if (mddev->pers)
			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
	}
	mddev_unlock(mddev);
	return err ?: len;
}
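/*
 * Illustrative sketch, not part of the driver: accepting either the
 * keyword "none" or a decimal sector number, in userspace.
 * parse_resync_start() and NO_RESYNC are hypothetical names; NO_RESYNC
 * only stands in for MaxSector, and strtoull() is swapped in for
 * kstrtoull().  As a simplification, this sketch rejects a leading sign
 * or whitespace outright.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_RESYNC UINT64_MAX	/* hypothetical stand-in for MaxSector */

/* Returns 0 and fills *out on success, or a negative errno value. */
static int parse_resync_start(const char *buf, uint64_t *out)
{
	char *end;
	unsigned long long n;

	if (strcmp(buf, "none") == 0 || strcmp(buf, "none\n") == 0) {
		*out = NO_RESYNC;
		return 0;
	}
	/* reject empty input, signs and leading whitespace up front */
	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;
	errno = 0;
	n = strtoull(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')	/* tolerate one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	*out = n;
	return 0;
}
```

/*
 * Handling the keyword before the numeric parse keeps the sentinel out
 * of the number path, so "none" can never be confused with a value that
 * happens to parse.
 */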
static struct md_sysfs_entry md_resync_start =
__ATTR_PREALLOC(resync_start, S_IRUGO|S_IWUSR,
		resync_start_show, resync_start_store);

[PATCH] md: Set/get state of array via sysfs
This allows the state of an md/array to be directly controlled via sysfs and
adds the ability to stop an array without tearing it down.
Array states/settings:
clear
No devices, no size, no level
Equivalent to STOP_ARRAY ioctl
inactive
May have some settings, but array is not active
all IO results in error
When written, doesn't tear down array, but just stops it
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
Writing this, if accepted, will block until array is quiescent
readonly
no resync can happen. no superblocks get written.
write requests fail
read-auto
like readonly, but behaves like 'clean' on a write request.
clean - no pending writes, but otherwise active.
When written to inactive array, starts without resync
If a write request arrives then
if metadata is known, mark 'dirty' and switch to 'active'.
if not known, block and switch to write-pending
If written to an active array that has pending writes, then fails.
active
fully active: IO and resync can be happening.
When written to inactive array, starts with resync
write-pending (not supported yet)
clean, but writes are blocked waiting for 'active' to be written.
active-idle
like active, but no writes have been seen for a while (100msec).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 07:27:58 +00:00
/*
 * The array state can be:
 *
 * clear
 *     No devices, no size, no level
 *     Equivalent to STOP_ARRAY ioctl
 * inactive
 *     May have some settings, but array is not active
 *        all IO results in error
 *     When written, doesn't tear down array, but just stops it
 * suspended (not supported yet)
 *     All IO requests will block. The array can be reconfigured.
 *     Writing this, if accepted, will block until array is quiescent
 * readonly
 *     no resync can happen. no superblocks get written.
 *     write requests fail
 * read-auto
 *     like readonly, but behaves like 'clean' on a write request.
 *
 * clean - no pending writes, but otherwise active.
 *     When written to inactive array, starts without resync
 *     If a write request arrives then
 *       if metadata is known, mark 'dirty' and switch to 'active'.
 *       if not known, block and switch to write-pending
 *     If written to an active array that has pending writes, then fails.
 * active
 *     fully active: IO and resync can be happening.
 *     When written to inactive array, starts with resync
 *
 * write-pending
 *     clean, but writes are blocked waiting for 'active' to be written.
 *
 * active-idle
 *     like active, but no writes have been seen for a while (100msec).
 *
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. This means the user might think the written data is correctly stored in
the array, but instead garbage was written given that raid0 does striping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.
With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 19:49:00 +00:00
 * broken
 *     Array is failed. It's useful because mounted-arrays aren't stopped
 *     when array is failed, so this state will at least alert the user that
 *     something is wrong.
 */
enum array_state { clear, inactive, suspended, readonly, read_auto, clean, active,
		   write_pending, active_idle, broken, bad_word};
static char *array_states[] = {
|
|
|
"clear", "inactive", "suspended", "readonly", "read-auto", "clean", "active",
|
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. The user might therefore think the written data is correctly stored
in the array, when in fact garbage was written, given that raid0 does striping
(and so requires all its members to be working in order not to corrupt
data). For md/linear, writes to the available members will work fine, but
writes that go to the missing member(s) cause file corruption, because that
portion of the data is never actually written.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written to 'array_state': it only indicates that
one or more members of the array are missing and otherwise acts like 'clean',
so writing it would not make sense.
With this patch, the filesystem reacts much faster to a missing array
member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
to the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested on ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 19:49:00 +00:00
|
|
|
"write-pending", "active-idle", "broken", NULL };
|
2006-06-26 07:27:58 +00:00
|
|
|
|
|
|
|
static int match_word(const char *word, char **list)
|
|
|
|
{
|
|
|
|
int n;
|
|
|
|
for (n=0; list[n]; n++)
|
|
|
|
if (cmd_match(word, list[n]))
|
|
|
|
break;
|
|
|
|
return n;
|
|
|
|
}
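The lookup above can be tried outside the kernel with a small userspace sketch. The names prefixed with demo_ are hypothetical; demo_cmd_match() is an assumed approximation of the kernel's cmd_match() (exact match, tolerating one trailing newline as written by echo), and the enum is assumed to follow the order of the array_states strings.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror the order of the array_states[] strings above. */
enum demo_state {
	demo_clear, demo_inactive, demo_suspended, demo_readonly,
	demo_read_auto, demo_clean, demo_active, demo_write_pending,
	demo_active_idle, demo_broken, demo_bad_word
};

static const char *demo_states[] = {
	"clear", "inactive", "suspended", "readonly", "read-auto", "clean",
	"active", "write-pending", "active-idle", "broken", NULL
};

/*
 * Hypothetical stand-in for cmd_match(): returns 1 when cmd equals str,
 * optionally followed by a single trailing newline, and 0 otherwise.
 */
static int demo_cmd_match(const char *cmd, const char *str)
{
	while (*cmd && *str && *cmd == *str) {
		cmd++;
		str++;
	}
	if (*cmd == '\n')
		cmd++;
	return !(*str || *cmd);
}

/*
 * Same loop shape as match_word(): returns the index of the first
 * matching entry, or the index of the NULL terminator when nothing
 * matches. With the table above that index equals demo_bad_word.
 */
static int demo_match_word(const char *word, const char **list)
{
	int n;

	assert(word != NULL && list != NULL);
	for (n = 0; list[n]; n++)
		if (demo_cmd_match(word, list[n]))
			break;
	return n;
}
```

Because the string table ends with NULL and the enum ends with the sentinel, an unrecognised word needs no separate error path: the loop simply runs off the end and the caller sees the sentinel value.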
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
array_state_show(struct mddev *mddev, char *page)
|
2006-06-26 07:27:58 +00:00
|
|
|
{
|
|
|
|
enum array_state st = inactive;
|
|
|
|
|
2019-09-03 19:49:00 +00:00
|
|
|
if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags)) {
|
2006-06-26 07:27:58 +00:00
|
|
|
switch(mddev->ro) {
|
2022-09-20 02:39:38 +00:00
|
|
|
case MD_RDONLY:
|
2006-06-26 07:27:58 +00:00
|
|
|
st = readonly;
|
|
|
|
break;
|
2022-09-20 02:39:38 +00:00
|
|
|
case MD_AUTO_READ:
|
2006-06-26 07:27:58 +00:00
|
|
|
st = read_auto;
|
|
|
|
break;
|
2022-09-20 02:39:38 +00:00
|
|
|
case MD_RDWR:
|
2017-03-15 03:05:14 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2016-12-08 23:48:19 +00:00
|
|
|
if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
|
2008-02-06 09:39:51 +00:00
|
|
|
st = write_pending;
|
2016-10-24 10:47:28 +00:00
|
|
|
else if (mddev->in_sync)
|
|
|
|
st = clean;
|
2006-06-26 07:27:58 +00:00
|
|
|
else if (mddev->safemode)
|
|
|
|
st = active_idle;
|
|
|
|
else
|
|
|
|
st = active;
|
2017-03-15 03:05:14 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2006-06-26 07:27:58 +00:00
|
|
|
}
|
2019-09-03 19:49:00 +00:00
|
|
|
|
|
|
|
if (test_bit(MD_BROKEN, &mddev->flags) && st == clean)
|
|
|
|
st = broken;
|
|
|
|
} else {
|
2006-06-26 07:27:58 +00:00
|
|
|
if (list_empty(&mddev->disks) &&
|
|
|
|
mddev->raid_disks == 0 &&
|
2009-03-31 03:33:13 +00:00
|
|
|
mddev->dev_sectors == 0)
|
2006-06-26 07:27:58 +00:00
|
|
|
st = clear;
|
|
|
|
else
|
|
|
|
st = inactive;
|
|
|
|
}
|
|
|
|
return sprintf(page, "%s\n", array_states[st]);
|
|
|
|
}
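The decision order in array_state_show() can be condensed into a pure function for experimentation. This is only a sketch: struct sketch_mddev and its boolean fields are hypothetical stand-ins for mddev->pers, the MD_NOT_READY, MD_SB_CHANGE_PENDING and MD_BROKEN flag bits, and list_empty(&mddev->disks). Locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_state {
	sk_clear, sk_inactive, sk_suspended, sk_readonly, sk_read_auto,
	sk_clean, sk_active, sk_write_pending, sk_active_idle, sk_broken
};

enum sketch_ro { SK_RDWR, SK_RDONLY, SK_AUTO_READ };

/* Hypothetical flattened view of the fields the show routine consults. */
struct sketch_mddev {
	bool has_pers;           /* stands in for mddev->pers != NULL */
	bool not_ready;          /* stands in for MD_NOT_READY */
	enum sketch_ro ro;
	bool sb_change_pending;  /* stands in for MD_SB_CHANGE_PENDING */
	bool in_sync;
	bool safemode;
	bool broken;             /* stands in for MD_BROKEN */
	bool no_disks;           /* stands in for list_empty(&mddev->disks) */
	int raid_disks;
	unsigned long long dev_sectors;
};

static enum sketch_state sketch_array_state(const struct sketch_mddev *m)
{
	enum sketch_state st = sk_inactive;

	assert(m != NULL);
	if (m->has_pers && !m->not_ready) {
		switch (m->ro) {
		case SK_RDONLY:
			st = sk_readonly;
			break;
		case SK_AUTO_READ:
			st = sk_read_auto;
			break;
		case SK_RDWR:
			if (m->sb_change_pending)
				st = sk_write_pending;
			else if (m->in_sync)
				st = sk_clean;
			else if (m->safemode)
				st = sk_active_idle;
			else
				st = sk_active;
			break;
		}
		/* 'broken' only ever replaces 'clean', as in the code above. */
		if (m->broken && st == sk_clean)
			st = sk_broken;
	} else if (m->no_disks && m->raid_disks == 0 && m->dev_sectors == 0) {
		st = sk_clear;
	}
	return st;
}
```

Note that in this sketch, as in the function above, a set broken flag is not reported while the array is in any state other than 'clean', for example while a superblock update is pending.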
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int do_md_stop(struct mddev *mddev, int ro, struct block_device *bdev);
|
|
|
|
static int md_set_readonly(struct mddev *mddev, struct block_device *bdev);
|
2011-10-11 05:47:53 +00:00
|
|
|
static int restart_array(struct mddev *mddev);
|
2006-06-26 07:27:58 +00:00
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
array_state_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-06-26 07:27:58 +00:00
|
|
|
{
|
2017-03-15 03:05:14 +00:00
|
|
|
int err = 0;
|
2006-06-26 07:27:58 +00:00
|
|
|
enum array_state st = match_word(buf, array_states);
|
2014-12-15 01:57:01 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->pers && (st == active || st == clean) &&
|
|
|
|
mddev->ro != MD_RDONLY) {
|
2014-12-15 01:57:01 +00:00
|
|
|
/* don't take reconfig_mutex when toggling between
|
|
|
|
* clean and active
|
|
|
|
*/
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
if (st == active) {
|
|
|
|
restart_array(mddev);
|
2016-12-08 23:48:19 +00:00
|
|
|
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2016-10-25 15:07:08 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2014-12-15 01:57:01 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
} else /* st == clean */ {
|
|
|
|
restart_array(mddev);
|
2017-03-15 03:05:14 +00:00
|
|
|
if (!set_in_sync(mddev))
|
2014-12-15 01:57:01 +00:00
|
|
|
err = -EBUSY;
|
|
|
|
}
|
2016-06-30 08:47:09 +00:00
|
|
|
if (!err)
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2014-12-15 01:57:01 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2015-06-12 09:46:44 +00:00
|
|
|
return err ?: len;
|
2014-12-15 01:57:01 +00:00
|
|
|
}
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
err = -EINVAL;
|
2006-06-26 07:27:58 +00:00
|
|
|
switch(st) {
|
|
|
|
case bad_word:
|
|
|
|
break;
|
|
|
|
case clear:
|
|
|
|
/* stopping an active array */
|
2012-07-19 05:59:18 +00:00
|
|
|
err = do_md_stop(mddev, 0, NULL);
|
2006-06-26 07:27:58 +00:00
|
|
|
break;
|
|
|
|
case inactive:
|
|
|
|
/* stopping an active array */
|
2012-07-31 00:04:55 +00:00
|
|
|
if (mddev->pers)
|
2012-07-19 05:59:18 +00:00
|
|
|
err = do_md_stop(mddev, 2, NULL);
|
2012-07-31 00:04:55 +00:00
|
|
|
else
|
2008-02-06 09:39:51 +00:00
|
|
|
err = 0; /* already inactive */
|
2006-06-26 07:27:58 +00:00
|
|
|
break;
|
|
|
|
case suspended:
|
|
|
|
break; /* not supported yet */
|
|
|
|
case readonly:
|
|
|
|
if (mddev->pers)
|
2012-07-19 05:59:18 +00:00
|
|
|
err = md_set_readonly(mddev, NULL);
|
2006-06-26 07:27:58 +00:00
|
|
|
else {
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDONLY;
|
2008-04-30 07:52:30 +00:00
|
|
|
set_disk_ro(mddev->gendisk, 1);
|
2006-06-26 07:27:58 +00:00
|
|
|
err = do_md_run(mddev);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case read_auto:
|
|
|
|
if (mddev->pers) {
|
2022-09-20 02:39:38 +00:00
|
|
|
if (md_is_rdwr(mddev))
|
2012-07-19 05:59:18 +00:00
|
|
|
err = md_set_readonly(mddev, NULL);
|
2022-09-20 02:39:38 +00:00
|
|
|
else if (mddev->ro == MD_RDONLY)
|
2008-04-30 07:52:30 +00:00
|
|
|
err = restart_array(mddev);
|
|
|
|
if (err == 0) {
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_AUTO_READ;
|
2008-04-30 07:52:30 +00:00
|
|
|
set_disk_ro(mddev->gendisk, 0);
|
|
|
|
}
|
2006-06-26 07:27:58 +00:00
|
|
|
} else {
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_AUTO_READ;
|
2006-06-26 07:27:58 +00:00
|
|
|
err = do_md_run(mddev);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case clean:
|
|
|
|
if (mddev->pers) {
|
2015-10-09 04:54:13 +00:00
|
|
|
err = restart_array(mddev);
|
|
|
|
if (err)
|
|
|
|
break;
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2017-03-15 03:05:14 +00:00
|
|
|
if (!set_in_sync(mddev))
|
2008-02-06 09:39:51 +00:00
|
|
|
err = -EBUSY;
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2009-05-07 02:50:57 +00:00
|
|
|
} else
|
|
|
|
err = -EINVAL;
|
2006-06-26 07:27:58 +00:00
|
|
|
break;
|
|
|
|
case active:
|
|
|
|
if (mddev->pers) {
|
2015-10-09 04:54:13 +00:00
|
|
|
err = restart_array(mddev);
|
|
|
|
if (err)
|
|
|
|
break;
|
2016-12-08 23:48:19 +00:00
|
|
|
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2006-06-26 07:27:58 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
err = 0;
|
|
|
|
} else {
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
2008-04-30 07:52:30 +00:00
|
|
|
set_disk_ro(mddev->gendisk, 0);
|
2006-06-26 07:27:58 +00:00
|
|
|
err = do_md_run(mddev);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case write_pending:
|
|
|
|
case active_idle:
|
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the written data is correctly stored in
the array, but instead garbage was written given that raid0 does striping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.
With this patch, the filesystem reacts much faster to the event of a missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 19:49:00 +00:00
|
|
|
case broken:
|
2006-06-26 07:27:58 +00:00
|
|
|
/* these cannot be set */
|
|
|
|
break;
|
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
|
|
|
|
if (!err) {
|
2011-12-08 04:49:12 +00:00
|
|
|
if (mddev->hold_active == UNTIL_IOCTL)
|
|
|
|
mddev->hold_active = 0;
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2008-06-27 22:31:36 +00:00
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2006-06-26 07:27:58 +00:00
|
|
|
}
|
2006-07-10 11:44:18 +00:00
|
|
|
static struct md_sysfs_entry md_array_state =
|
2014-09-29 22:53:05 +00:00
|
|
|
__ATTR_PREALLOC(array_state, S_IRUGO|S_IWUSR, array_state_show, array_state_store);
|
2006-06-26 07:27:58 +00:00
|
|
|
|
2009-12-14 01:49:58 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
max_corrected_read_errors_show(struct mddev *mddev, char *page) {
|
2009-12-14 01:49:58 +00:00
|
|
|
return sprintf(page, "%d\n",
|
|
|
|
atomic_read(&mddev->max_corr_read_errors));
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
max_corrected_read_errors_store(struct mddev *mddev, const char *buf, size_t len)
|
2009-12-14 01:49:58 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
unsigned int n;
|
|
|
|
int rv;
|
2009-12-14 01:49:58 +00:00
|
|
|
|
2015-05-16 11:02:38 +00:00
|
|
|
rv = kstrtouint(buf, 10, &n);
|
|
|
|
if (rv < 0)
|
|
|
|
return rv;
|
2023-05-22 07:25:34 +00:00
|
|
|
if (n > INT_MAX)
|
|
|
|
return -EINVAL;
|
2015-05-16 11:02:38 +00:00
|
|
|
atomic_set(&mddev->max_corr_read_errors, n);
|
|
|
|
return len;
|
2009-12-14 01:49:58 +00:00
|
|
|
}
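The store handler above accepts a decimal count and rejects anything larger than INT_MAX. A minimal userspace sketch of that validation follows, assuming standard `strtoul` in place of the kernel's `kstrtouint`; the function name `parse_max_read_errors` is hypothetical, and the sketch is slightly stricter than the kernel helper (it rejects a leading `+`) and may return different error codes for out-of-range input.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of max_corrected_read_errors_store().
 * Returns the parsed value (0..INT_MAX) or a negative errno value.
 */
long parse_max_read_errors(const char *buf)
{
	char *end;
	unsigned long n;

	/* Reject empty input, signs and leading whitespace up front. */
	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	n = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* Allow a single trailing newline, as sysfs writes usually carry one. */
	if (*end == '\n')
		end++;
	if (*end)
		return -EINVAL;

	/* Same upper bound as the kernel code: the value is stored in an int. */
	if (n > INT_MAX)
		return -EINVAL;
	return (long)n;
}
```

The upper bound matters because the value ends up in an `atomic_t`, which holds a signed int.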
|
|
|
|
|
|
|
|
static struct md_sysfs_entry max_corr_read_errors =
|
|
|
|
__ATTR(max_read_errors, S_IRUGO|S_IWUSR, max_corrected_read_errors_show,
|
|
|
|
max_corrected_read_errors_store);
|
|
|
|
|
2006-01-06 08:21:16 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
null_show(struct mddev *mddev, char *page)
|
2006-01-06 08:21:16 +00:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
new_dev_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-01-06 08:21:16 +00:00
|
|
|
{
|
|
|
|
/* buf must be %d:%d\n? giving major and minor numbers */
|
|
|
|
/* The new device is added to the array.
|
|
|
|
* If the array has a persistent superblock, we read the
|
|
|
|
* superblock to initialise info and check validity.
|
|
|
|
* Otherwise, only checking done is that in bind_rdev_to_array,
|
|
|
|
* which mainly checks size.
|
|
|
|
*/
|
|
|
|
char *e;
|
|
|
|
int major = simple_strtoul(buf, &e, 10);
|
|
|
|
int minor;
|
|
|
|
dev_t dev;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2006-01-06 08:21:16 +00:00
|
|
|
int err;
|
|
|
|
|
|
|
|
if (!*buf || *e != ':' || !e[1] || e[1] == '\n')
|
|
|
|
return -EINVAL;
|
|
|
|
minor = simple_strtoul(e+1, &e, 10);
|
|
|
|
if (*e && *e != '\n')
|
|
|
|
return -EINVAL;
|
|
|
|
dev = MKDEV(major, minor);
|
|
|
|
if (major != MAJOR(dev) ||
|
|
|
|
minor != MINOR(dev))
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2006-01-06 08:21:16 +00:00
|
|
|
if (mddev->persistent) {
|
|
|
|
rdev = md_import_device(dev, mddev->major_version,
|
|
|
|
mddev->minor_version);
|
|
|
|
if (!IS_ERR(rdev) && !list_empty(&mddev->disks)) {
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev0
|
|
|
|
= list_entry(mddev->disks.next,
|
|
|
|
struct md_rdev, same_set);
|
2006-01-06 08:21:16 +00:00
|
|
|
err = super_types[mddev->major_version]
|
|
|
|
.load_super(rdev, rdev0, mddev->minor_version);
|
|
|
|
if (err < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2008-02-06 09:39:54 +00:00
|
|
|
} else if (mddev->external)
|
|
|
|
rdev = md_import_device(dev, -2, -1);
|
|
|
|
else
|
2006-01-06 08:21:16 +00:00
|
|
|
rdev = md_import_device(dev, -1, -1);
|
|
|
|
|
2015-06-25 07:06:40 +00:00
|
|
|
if (IS_ERR(rdev)) {
|
|
|
|
mddev_unlock(mddev);
|
2006-01-06 08:21:16 +00:00
|
|
|
return PTR_ERR(rdev);
|
2015-06-25 07:06:40 +00:00
|
|
|
}
|
2006-01-06 08:21:16 +00:00
|
|
|
err = bind_rdev_to_array(rdev, mddev);
|
|
|
|
out:
|
|
|
|
if (err)
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
2017-07-28 13:49:25 +00:00
|
|
|
if (!err)
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2006-01-06 08:21:16 +00:00
|
|
|
return err ? err : len;
|
|
|
|
}
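The input checks in new_dev_store() can be sketched in userspace as follows. This is an illustration only: `parse_dev` and `SKETCH_MINORBITS` are hypothetical names, `strtoul` stands in for `simple_strtoul` (it also skips leading whitespace, which the kernel helper does not), and the 12-bit major / 20-bit minor split is an assumption about the dev_t layout used to mimic the MKDEV round-trip check.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_MINORBITS 20	/* assumption: mirrors the kernel's MINORBITS */

/*
 * Parse "major:minor" with an optional trailing newline.
 * Returns the packed device number, or a negative errno value.
 */
long long parse_dev(const char *buf)
{
	char *e;
	unsigned long major, minor;

	major = strtoul(buf, &e, 10);
	/* Need "<major>:" followed by something other than a newline. */
	if (!*buf || *e != ':' || !e[1] || e[1] == '\n')
		return -EINVAL;

	minor = strtoul(e + 1, &e, 10);
	/* Only an optional newline may follow the minor number. */
	if (*e && *e != '\n')
		return -EINVAL;

	/* Stand-in for the "major != MAJOR(dev)" round-trip check. */
	if (major >= (1UL << 12) || minor >= (1UL << SKETCH_MINORBITS))
		return -EOVERFLOW;

	return ((long long)major << SKETCH_MINORBITS) | (long long)minor;
}
```

The sketch stops where the kernel function starts importing the device; locking, superblock loading and bind_rdev_to_array() are deliberately left out.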
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_new_device =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(new_dev, S_IWUSR, null_show, new_dev_store);
|
2006-01-06 08:20:47 +00:00
|
|
|
|
2006-10-03 08:15:49 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
bitmap_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-10-03 08:15:49 +00:00
|
|
|
{
|
|
|
|
char *end;
|
|
|
|
unsigned long chunk, end_chunk;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
2006-10-03 08:15:49 +00:00
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2006-10-03 08:15:49 +00:00
|
|
|
if (!mddev->bitmap)
|
|
|
|
goto out;
|
|
|
|
/* buf should be <chunk> <chunk> ... or <chunk>-<chunk> ... (range) */
|
|
|
|
while (*buf) {
|
|
|
|
chunk = end_chunk = simple_strtoul(buf, &end, 0);
|
|
|
|
if (buf == end) break;
|
|
|
|
if (*end == '-') { /* range */
|
|
|
|
buf = end + 1;
|
|
|
|
end_chunk = simple_strtoul(buf, &end, 0);
|
|
|
|
if (buf == end) break;
|
|
|
|
}
|
|
|
|
if (*end && !isspace(*end)) break;
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_dirty_bits(mddev->bitmap, chunk, end_chunk);
|
2009-12-15 02:01:06 +00:00
|
|
|
buf = skip_spaces(end);
|
2006-10-03 08:15:49 +00:00
|
|
|
}
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_unplug(mddev->bitmap); /* flush the bits to disk */
|
2006-10-03 08:15:49 +00:00
|
|
|
out:
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
2006-10-03 08:15:49 +00:00
|
|
|
return len;
|
|
|
|
}
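The chunk-list grammar accepted by bitmap_store() (single chunks or `first-last` ranges, separated by whitespace) can be exercised with a small userspace sketch. The names `parse_chunk_list` and `mark_dirty` are hypothetical; `mark_dirty` only counts chunks and stands in for md_bitmap_dirty_bits(). As in the kernel loop, parsing stops silently at the first malformed token. Note that `strtoul` skips leading whitespace, so this sketch is slightly more lenient than `simple_strtoul`.

```c
#include <ctype.h>
#include <stdlib.h>

/* Running total of chunks "dirtied" by the last parse (illustration only). */
static unsigned long dirty_count;

/* Hypothetical stand-in for md_bitmap_dirty_bits(): count an inclusive range. */
static void mark_dirty(unsigned long first, unsigned long last)
{
	if (last >= first)
		dirty_count += last - first + 1;
}

/* Parse "<chunk> <chunk>-<chunk> ..." and return how many chunks were marked. */
unsigned long parse_chunk_list(const char *buf)
{
	char *end;
	unsigned long chunk, end_chunk;

	dirty_count = 0;
	while (*buf) {
		/* Base 0 accepts decimal, octal (0...) and hex (0x...) input. */
		chunk = end_chunk = strtoul(buf, &end, 0);
		if (buf == end)
			break;
		if (*end == '-') {	/* range */
			buf = end + 1;
			end_chunk = strtoul(buf, &end, 0);
			if (buf == end)
				break;
		}
		if (*end && !isspace((unsigned char)*end))
			break;
		mark_dirty(chunk, end_chunk);
		/* Equivalent of skip_spaces(end). */
		buf = end;
		while (isspace((unsigned char)*buf))
			buf++;
	}
	return dirty_count;
}
```

Like the original, the sketch reports no error for a partially parsed list; a caller that needs strict validation would have to compare the stop position against the end of the input.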
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_bitmap =
|
|
|
|
__ATTR(bitmap_set_bits, S_IWUSR, null_show, bitmap_store);
|
|
|
|
|
2006-01-06 08:20:49 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
size_show(struct mddev *mddev, char *page)
|
2006-01-06 08:20:49 +00:00
|
|
|
{
|
2009-03-31 03:33:13 +00:00
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)mddev->dev_sectors / 2);
|
2006-01-06 08:20:49 +00:00
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int update_size(struct mddev *mddev, sector_t num_sectors);
|
2006-01-06 08:20:49 +00:00
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
size_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-01-06 08:20:49 +00:00
|
|
|
{
|
|
|
|
/* If array is inactive, we can reduce the component size, but
|
|
|
|
* not increase it (except from 0).
|
|
|
|
* If array is active, we can try an on-line resize
|
|
|
|
*/
|
2009-03-31 04:00:31 +00:00
|
|
|
sector_t sectors;
|
|
|
|
int err = strict_blocks_to_sectors(buf, &sectors);
|
2006-01-06 08:20:49 +00:00
|
|
|
|
2009-03-31 03:33:13 +00:00
|
|
|
if (err < 0)
|
|
|
|
return err;
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2006-01-06 08:20:49 +00:00
|
|
|
if (mddev->pers) {
|
2009-03-31 03:33:13 +00:00
|
|
|
err = update_size(mddev, sectors);
|
2016-06-12 09:18:00 +00:00
|
|
|
if (err == 0)
|
|
|
|
md_update_sb(mddev, 1);
|
2006-01-06 08:20:49 +00:00
|
|
|
} else {
|
2009-03-31 03:33:13 +00:00
|
|
|
if (mddev->dev_sectors == 0 ||
|
|
|
|
mddev->dev_sectors > sectors)
|
|
|
|
mddev->dev_sectors = sectors;
|
2006-01-06 08:20:49 +00:00
|
|
|
else
|
|
|
|
err = -ENOSPC;
|
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
2006-01-06 08:20:49 +00:00
|
|
|
return err ? err : len;
|
|
|
|
}
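The rule for an inactive array in size_store() (the component size may be set from 0 or reduced, but not increased) can be stated as a small pure function. This is a sketch with a hypothetical name, `resize_inactive`; it only models the `else` branch above and ignores locking and the on-line resize path.

```c
#include <errno.h>

/*
 * Decide the new component size (in sectors) for an inactive array.
 * Returns the accepted size, or -ENOSPC if the request would grow a
 * non-zero size. Assumes sizes fit in a signed long long.
 */
long long resize_inactive(unsigned long long current_sectors,
			  unsigned long long requested_sectors)
{
	if (current_sectors == 0 || current_sectors > requested_sectors)
		return (long long)requested_sectors;
	return -ENOSPC;
}
```

One consequence of the comparison as written: requesting exactly the current non-zero size is also refused with -ENOSPC.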
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_size =
|
2006-07-10 11:44:18 +00:00
|
|
|
__ATTR(component_size, S_IRUGO|S_IWUSR, size_show, size_store);
|
2006-01-06 08:20:49 +00:00
|
|
|
|
2012-10-29 15:18:08 +00:00
|
|
|
/* Metadata version.
|
2008-02-06 09:39:51 +00:00
|
|
|
* This is one of
|
|
|
|
* 'none' for arrays with no metadata (good luck...)
|
|
|
|
* 'external' for arrays with externally managed metadata,
|
2006-01-06 08:20:50 +00:00
|
|
|
* or N.M for internally known formats
|
|
|
|
*/
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
metadata_show(struct mddev *mddev, char *page)
|
2006-01-06 08:20:50 +00:00
|
|
|
{
|
|
|
|
if (mddev->persistent)
|
|
|
|
return sprintf(page, "%d.%d\n",
|
|
|
|
mddev->major_version, mddev->minor_version);
|
2008-02-06 09:39:51 +00:00
|
|
|
else if (mddev->external)
|
|
|
|
return sprintf(page, "external:%s\n", mddev->metadata_type);
|
2006-01-06 08:20:50 +00:00
|
|
|
else
|
|
|
|
return sprintf(page, "none\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
metadata_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-01-06 08:20:50 +00:00
|
|
|
{
|
|
|
|
int major, minor;
|
|
|
|
char *e;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
2008-10-13 00:55:11 +00:00
|
|
|
/* Changing the details of 'external' metadata is
|
|
|
|
* always permitted. Otherwise there must be
|
|
|
|
* no devices attached to the array.
|
|
|
|
*/
|
2014-12-15 01:57:01 +00:00
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
err = -EBUSY;
|
2008-10-13 00:55:11 +00:00
|
|
|
if (mddev->external && strncmp(buf, "external:", 9) == 0)
|
|
|
|
;
|
|
|
|
else if (!list_empty(&mddev->disks))
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2006-01-06 08:20:50 +00:00
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
2006-01-06 08:20:50 +00:00
|
|
|
if (cmd_match(buf, "none")) {
|
|
|
|
mddev->persistent = 0;
|
2008-02-06 09:39:51 +00:00
|
|
|
mddev->external = 0;
|
|
|
|
mddev->major_version = 0;
|
|
|
|
mddev->minor_version = 90;
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2008-02-06 09:39:51 +00:00
|
|
|
}
|
|
|
|
if (strncmp(buf, "external:", 9) == 0) {
|
2008-02-06 09:39:57 +00:00
|
|
|
size_t namelen = len-9;
|
2008-02-06 09:39:51 +00:00
|
|
|
if (namelen >= sizeof(mddev->metadata_type))
|
|
|
|
namelen = sizeof(mddev->metadata_type)-1;
|
|
|
|
strncpy(mddev->metadata_type, buf+9, namelen);
|
|
|
|
mddev->metadata_type[namelen] = 0;
|
|
|
|
if (namelen && mddev->metadata_type[namelen-1] == '\n')
|
|
|
|
mddev->metadata_type[--namelen] = 0;
|
|
|
|
mddev->persistent = 0;
|
|
|
|
mddev->external = 1;
|
2006-01-06 08:20:50 +00:00
|
|
|
mddev->major_version = 0;
|
|
|
|
mddev->minor_version = 90;
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2006-01-06 08:20:50 +00:00
|
|
|
}
|
|
|
|
major = simple_strtoul(buf, &e, 10);
|
2014-12-15 01:57:01 +00:00
|
|
|
err = -EINVAL;
|
2006-01-06 08:20:50 +00:00
|
|
|
if (e==buf || *e != '.')
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2006-01-06 08:20:50 +00:00
|
|
|
buf = e+1;
|
|
|
|
minor = simple_strtoul(buf, &e, 10);
|
2006-12-22 09:11:41 +00:00
|
|
|
if (e==buf || (*e && *e != '\n') )
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
|
|
|
err = -ENOENT;
|
2007-05-09 09:35:34 +00:00
|
|
|
if (major >= ARRAY_SIZE(super_types) || super_types[major].name == NULL)
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2006-01-06 08:20:50 +00:00
|
|
|
mddev->major_version = major;
|
|
|
|
mddev->minor_version = minor;
|
|
|
|
mddev->persistent = 1;
|
2008-02-06 09:39:51 +00:00
|
|
|
mddev->external = 0;
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
|
|
|
out_unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2006-01-06 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_metadata =
|
2014-09-29 22:53:05 +00:00
|
|
|
__ATTR_PREALLOC(metadata_version, S_IRUGO|S_IWUSR, metadata_show, metadata_store);
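
The three accepted spellings of `metadata_version` ('none', 'external:<name>', and N.M) can be tried out in user space. The following is only a minimal sketch under stated assumptions: `struct md_meta`, `is_word()` and `parse_metadata_version()` are hypothetical names, the 17-byte name buffer is an assumed size, and the kernel's locking, its attached-disks check, and its `super_types` range check are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space mirror of the fields metadata_store() updates. */
struct md_meta {
	int persistent;
	int external;
	int major;
	int minor;
	char type[17];	/* assumed size; the kernel field may differ */
};

/* Match a keyword followed by end of string or a single trailing newline. */
static int is_word(const char *buf, const char *word)
{
	size_t n = strlen(word);

	if (strncmp(buf, word, n) != 0)
		return 0;
	return buf[n] == '\0' || (buf[n] == '\n' && buf[n + 1] == '\0');
}

/* Returns 0 on success or a negative errno value. */
static int parse_metadata_version(const char *buf, struct md_meta *out)
{
	char *end;
	unsigned long major, minor;

	assert(buf != NULL && out != NULL);

	if (is_word(buf, "none")) {
		memset(out, 0, sizeof(*out));
		out->minor = 90;	/* same placeholder version the driver stores */
		return 0;
	}
	if (strncmp(buf, "external:", 9) == 0) {
		size_t len = strlen(buf + 9);

		/* Truncate over-long names instead of rejecting them. */
		if (len >= sizeof(out->type))
			len = sizeof(out->type) - 1;
		memset(out, 0, sizeof(*out));
		memcpy(out->type, buf + 9, len);
		out->type[len] = '\0';
		if (len && out->type[len - 1] == '\n')
			out->type[len - 1] = '\0';
		out->external = 1;
		out->minor = 90;
		return 0;
	}
	/* N.M form: digits, a dot, digits, then end of string or newline. */
	major = strtoul(buf, &end, 10);
	if (end == buf || *end != '.')
		return -EINVAL;
	buf = end + 1;
	minor = strtoul(buf, &end, 10);
	if (end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	memset(out, 0, sizeof(*out));
	out->persistent = 1;
	out->major = (int)major;
	out->minor = (int)minor;
	return 0;
}
```

With this sketch, "1.2\n" is parsed as a persistent 1.2 superblock, "external:imsm\n" stores the name "imsm", and a string such as "1x" is rejected with -EINVAL.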
|
2006-01-06 08:20:50 +00:00
|
|
|
|
2005-11-09 05:39:26 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
action_show(struct mddev *mddev, char *page)
|
2005-11-09 05:39:26 +00:00
|
|
|
{
|
2005-11-09 05:39:44 +00:00
|
|
|
char *type = "idle";
|
2014-12-15 01:56:59 +00:00
|
|
|
unsigned long recovery = mddev->recovery;
|
|
|
|
if (test_bit(MD_RECOVERY_FROZEN, &recovery))
|
2009-05-25 23:41:17 +00:00
|
|
|
type = "frozen";
|
2014-12-15 01:56:59 +00:00
|
|
|
else if (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
|
2022-09-20 02:39:38 +00:00
|
|
|
(md_is_rdwr(mddev) && test_bit(MD_RECOVERY_NEEDED, &recovery))) {
|
2014-12-15 01:56:59 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
|
2006-03-27 09:18:09 +00:00
|
|
|
type = "reshape";
|
2014-12-15 01:56:59 +00:00
|
|
|
else if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
|
|
|
|
if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
|
2005-11-09 05:39:26 +00:00
|
|
|
type = "resync";
|
2014-12-15 01:56:59 +00:00
|
|
|
else if (test_bit(MD_RECOVERY_CHECK, &recovery))
|
2005-11-09 05:39:26 +00:00
|
|
|
type = "check";
|
|
|
|
else
|
|
|
|
type = "repair";
|
2014-12-15 01:56:59 +00:00
|
|
|
} else if (test_bit(MD_RECOVERY_RECOVER, &recovery))
|
2005-11-09 05:39:26 +00:00
|
|
|
type = "recover";
|
2015-07-06 02:26:57 +00:00
|
|
|
else if (mddev->reshape_position != MaxSector)
|
|
|
|
type = "reshape";
|
2005-11-09 05:39:26 +00:00
|
|
|
}
|
|
|
|
return sprintf(page, "%s\n", type);
|
|
|
|
}
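
The decision tree in action_show() is easier to follow when written as a pure function. The sketch below is a user-space model only: the `F_*` bit values and the `action_name()` helper are hypothetical, `rdwr` stands in for md_is_rdwr(mddev), and `reshape_pending` stands in for `mddev->reshape_position != MaxSector`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag bits; the kernel uses MD_RECOVERY_* bit numbers instead. */
enum {
	F_RUNNING   = 1 << 0,
	F_SYNC      = 1 << 1,
	F_RECOVER   = 1 << 2,
	F_REQUESTED = 1 << 3,
	F_CHECK     = 1 << 4,
	F_RESHAPE   = 1 << 5,
	F_FROZEN    = 1 << 6,
	F_NEEDED    = 1 << 7,
};

/* Map a snapshot of the recovery flags to the string sync_action reports. */
static const char *action_name(unsigned int flags, bool rdwr, bool reshape_pending)
{
	if (flags & F_FROZEN)
		return "frozen";
	/* Nothing running and nothing about to run: the array is idle. */
	if (!(flags & F_RUNNING) && !(rdwr && (flags & F_NEEDED)))
		return "idle";
	if (flags & F_RESHAPE)
		return "reshape";
	if (flags & F_SYNC) {
		/* A sync that nobody requested is an ordinary resync. */
		if (!(flags & F_REQUESTED))
			return "resync";
		return (flags & F_CHECK) ? "check" : "repair";
	}
	if (flags & F_RECOVER)
		return "recover";
	if (reshape_pending)
		return "reshape";
	return "idle";
}
```

For instance, `action_name(F_RUNNING | F_SYNC, true, false)` yields "resync", while adding `F_REQUESTED | F_CHECK` to the same flags yields "check".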
|
|
|
|
|
2023-05-29 13:20:33 +00:00
|
|
|
static void stop_sync_thread(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (mddev_lock(mddev))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check again in case MD_RECOVERY_RUNNING is cleared before lock is
|
|
|
|
* held.
|
|
|
|
*/
|
|
|
|
if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (work_pending(&mddev->del_work))
|
|
|
|
flush_workqueue(md_misc_wq);
|
|
|
|
|
md: refactor idle/frozen_sync_thread() to fix deadlock
Our test found the following deadlock in raid10:
1) Issue a normal write, and the write fails:
  raid10_end_write_request
   set_bit(R10BIO_WriteError, &r10_bio->state)
   one_write_done
    reschedule_retry
  // later from md thread
  raid10d
   handle_write_completed
    list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
  // later from md thread
  raid10d
   if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
    list_move(conf->bio_end_io_list.prev, &tmp)
    r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
    raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for the superblock update
2) Trigger a recovery:
  raid10_sync_request
   raise_barrier
Dependency chain 2: the sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
  action_store
   mddev_lock
    md_unregister_thread
     kthread_stop
Dependency chain 3: dropping 'reconfig_mutex' is waiting for the sync thread
4) md thread can't update superblock:
  raid10d
   md_check_recovery
    if (mddev_trylock(mddev))
     md_update_sb
Dependency chain 4: the superblock update is waiting for 'reconfig_mutex'
Hence a cyclic dependency exists; to fix the problem, we must break one
of the links. Dependencies 1 and 2 can't be broken because they are
fundamental to the design. Breaking dependency 4 may be possible if it can
be guaranteed that no io is in flight; however, this requires a new
mechanism, which seems complex. Dependency 3 is a good choice, because
idle/frozen only requires the sync thread to finish, which can be done
asynchronously with code that is already implemented, and
'reconfig_mutex' is not needed anymore.
This patch switches 'idle' and 'frozen' to wait asynchronously for the
sync thread to finish. It also adds a sequence counter that records how
many times a sync thread has finished, so that 'idle' won't keep waiting
on a newly started sync thread.
Note that raid456 has a similar deadlock ([1]), and it has been verified
([2]) that this patch fixes that deadlock as well.
[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
2023-05-29 13:20:35 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
|
|
|
/*
|
|
|
|
* Thread might be blocked waiting for metadata update which will now
|
|
|
|
* never happen
|
|
|
|
*/
|
|
|
|
md_wakeup_thread_directly(mddev->sync_thread);
|
2023-05-29 13:20:33 +00:00
|
|
|
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void idle_sync_thread(struct mddev *mddev)
|
|
|
|
{
|
2023-05-29 13:20:35 +00:00
|
|
|
int sync_seq = atomic_read(&mddev->sync_seq);
|
|
|
|
|
md: add a mutex to synchronize idle and frozen in action_store()
Currently, for idle and frozen, action_store() holds 'reconfig_mutex'
and calls md_reap_sync_thread() to stop the sync thread; however, this
can cause a deadlock (explained in the next patch). To fix the problem,
the following patch will release 'reconfig_mutex' and wait on
'resync_wait', as md_set_readonly() and do_md_stop() do.
Consider that action_store() sets/clears 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems. For example,
'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which
might starve the in-progress 'frozen'. A mutex is added to synchronize
idle and frozen in action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
2023-05-29 13:20:34 +00:00
|
|
|
mutex_lock(&mddev->sync_mutex);
|
2023-05-29 13:20:33 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
|
|
|
stop_sync_thread(mddev);
|
2023-05-29 13:20:35 +00:00
|
|
|
|
|
|
|
wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
|
|
|
|
!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
|
|
|
|
|
2023-05-29 13:20:34 +00:00
|
|
|
mutex_unlock(&mddev->sync_mutex);
|
2023-05-29 13:20:33 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void frozen_sync_thread(struct mddev *mddev)
|
|
|
|
{
|
2023-05-29 13:20:34 +00:00
|
|
|
mutex_lock(&mddev->sync_mutex);
|
2023-05-29 13:20:33 +00:00
|
|
|
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
|
|
|
stop_sync_thread(mddev);
|
2023-05-29 13:20:35 +00:00
|
|
|
|
|
|
|
wait_event(resync_wait, mddev->sync_thread == NULL &&
|
|
|
|
!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
|
|
|
|
|
2023-05-29 13:20:34 +00:00
|
|
|
mutex_unlock(&mddev->sync_mutex);
|
2023-05-29 13:20:33 +00:00
|
|
|
}
|
|
|
|
|
2005-11-09 05:39:26 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
action_store(struct mddev *mddev, const char *page, size_t len)
|
2005-11-09 05:39:26 +00:00
|
|
|
{
|
2005-11-09 05:39:44 +00:00
|
|
|
if (!mddev->pers || !mddev->pers->sync_request)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2009-05-25 23:41:17 +00:00
|
|
|
|
2023-05-29 13:20:33 +00:00
|
|
|
if (cmd_match(page, "idle"))
|
|
|
|
idle_sync_thread(mddev);
|
|
|
|
else if (cmd_match(page, "frozen"))
|
|
|
|
frozen_sync_thread(mddev);
|
|
|
|
else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2005-11-09 05:39:26 +00:00
|
|
|
return -EBUSY;
|
2008-06-27 22:31:41 +00:00
|
|
|
else if (cmd_match(page, "resync"))
|
2015-05-28 07:53:29 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2008-06-27 22:31:41 +00:00
|
|
|
else if (cmd_match(page, "recover")) {
|
2015-05-28 07:53:29 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2008-06-27 22:31:41 +00:00
|
|
|
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
|
|
|
} else if (cmd_match(page, "reshape")) {
|
2006-03-27 09:18:13 +00:00
|
|
|
int err;
|
|
|
|
if (mddev->pers->start_reshape == NULL)
|
|
|
|
return -EINVAL;
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (!err) {
|
2023-05-12 01:56:07 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
2015-12-21 00:01:21 +00:00
|
|
|
err = -EBUSY;
|
2023-05-12 01:56:07 +00:00
|
|
|
} else if (mddev->reshape_position == MaxSector ||
|
|
|
|
mddev->pers->check_reshape == NULL ||
|
|
|
|
mddev->pers->check_reshape(mddev)) {
|
2015-12-21 00:01:21 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
|
|
|
err = mddev->pers->start_reshape(mddev);
|
2023-05-12 01:56:07 +00:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If reshape is still in progress, and
|
|
|
|
* md_check_recovery() can continue to reshape,
|
|
|
|
* don't restart reshape because data can be
|
|
|
|
* corrupted for raid456.
|
|
|
|
*/
|
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2015-12-21 00:01:21 +00:00
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
|
|
|
}
|
2006-03-27 09:18:13 +00:00
|
|
|
if (err)
|
|
|
|
return err;
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
|
2006-03-27 09:18:13 +00:00
|
|
|
} else {
|
2006-01-06 08:20:41 +00:00
|
|
|
if (cmd_match(page, "check"))
|
2005-11-09 05:39:44 +00:00
|
|
|
set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
2006-05-20 21:59:57 +00:00
|
|
|
else if (!cmd_match(page, "repair"))
|
2005-11-09 05:39:44 +00:00
|
|
|
return -EINVAL;
|
2015-05-28 07:53:29 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2005-11-09 05:39:44 +00:00
|
|
|
set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
}
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro == MD_AUTO_READ) {
|
2012-10-11 03:19:39 +00:00
|
|
|
/* A write to sync_action is enough to justify
|
|
|
|
* canceling read-auto mode
|
|
|
|
*/
|
md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for a read-only array:
md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then calls
remove_and_add_spares() directly to try to remove rdevs from, and add
rdevs to, the array.
After this patch:
1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, that the
worker 'sync_work' is not pending, and that there are rdevs that can be
added or removed, then queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits;
This change makes sure that array reconfiguration is independent of the
daemon, and it will be much easier to synchronize it with io, considering
that io may rely on the daemon thread to complete.
Also fix the problem that 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must be
activated first, and only then can remove_and_add_spares() add spares to
the array, as the read-write case does:
1) the daemon sets 'MD_RECOVERY_RUNNING' and registers a new sync thread
to do recovery;
2) recovery is done, and md_do_sync() sets 'MD_RECOVERY_DONE' before
returning;
3) the daemon calls 'pers->spare_active' and clears 'MD_RECOVERY_RUNNING';
4) in the next round, the daemon calls remove_and_add_spares() to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-08-25 03:16:22 +00:00
|
|
|
flush_work(&mddev->sync_work);
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
2012-10-11 03:19:39 +00:00
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
|
|
|
}
|
2006-01-06 08:20:46 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2005-11-09 05:39:26 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_action);
|
2005-11-09 05:39:26 +00:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2013-06-25 06:23:59 +00:00
|
|
|
static struct md_sysfs_entry md_scan_mode =
|
2014-09-29 22:53:05 +00:00
|
|
|
__ATTR_PREALLOC(sync_action, S_IRUGO|S_IWUSR, action_show, action_store);
|
2013-06-25 06:23:59 +00:00
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
last_sync_action_show(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%s\n", mddev->last_sync_action);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_last_scan_mode = __ATTR_RO(last_sync_action);
|
|
|
|
|
2005-11-09 05:39:26 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
mismatch_cnt_show(struct mddev *mddev, char *page)
|
2005-11-09 05:39:26 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n",
|
2012-10-11 03:17:59 +00:00
|
|
|
(unsigned long long)
|
|
|
|
atomic64_read(&mddev->resync_mismatches));
|
2005-11-09 05:39:26 +00:00
|
|
|
}
|
|
|
|
|
2006-07-10 11:44:18 +00:00
|
|
|
static struct md_sysfs_entry md_mismatches = __ATTR_RO(mismatch_cnt);
|
2005-11-09 05:39:26 +00:00
|
|
|
|
2006-01-06 08:21:36 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_min_show(struct mddev *mddev, char *page)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%d (%s)\n", speed_min(mddev),
|
|
|
|
mddev->sync_speed_min ? "local": "system");
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_min_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
unsigned int min;
|
|
|
|
int rv;
|
|
|
|
|
2006-01-06 08:21:36 +00:00
|
|
|
if (strncmp(buf, "system", 6)==0) {
|
2015-05-16 11:02:38 +00:00
|
|
|
min = 0;
|
|
|
|
} else {
|
|
|
|
rv = kstrtouint(buf, 10, &min);
|
|
|
|
if (rv < 0)
|
|
|
|
return rv;
|
|
|
|
if (min == 0)
|
|
|
|
return -EINVAL;
|
2006-01-06 08:21:36 +00:00
|
|
|
}
|
|
|
|
mddev->sync_speed_min = min;
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_sync_min =
|
|
|
|
__ATTR(sync_speed_min, S_IRUGO|S_IWUSR, sync_min_show, sync_min_store);
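
The `sync_speed_min` and `sync_speed_max` handlers accept either the word "system" (stored as 0, meaning "fall back to the system-wide limit") or a non-zero decimal number. A rough user-space equivalent is sketched here. `parse_speed_limit()` is a hypothetical name, and it uses `strtoul()` in place of the kernel's `kstrtouint()`, so corner cases such as a leading '+' are handled differently (this sketch rejects anything that does not start with a digit).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 and fills *out on success, or a negative errno value. */
static int parse_speed_limit(const char *buf, unsigned int *out)
{
	char *end;
	unsigned long v;

	assert(buf != NULL && out != NULL);

	/* "system" selects the global default, which is stored as 0. */
	if (strncmp(buf, "system", 6) == 0) {
		*out = 0;
		return 0;
	}
	/* Assumption of this sketch: only plain digits are accepted. */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;
	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno == ERANGE || v > UINT_MAX)
		return -ERANGE;
	/* Allow one trailing newline, as written by `echo`. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	/* An explicit 0 is refused; "system" is the way to ask for the default. */
	if (v == 0)
		return -EINVAL;
	*out = (unsigned int)v;
	return 0;
}
```

The explicit rejection of 0 mirrors the handlers, which keep 0 reserved as the marker for "use the system setting".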
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_max_show(struct mddev *mddev, char *page)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%d (%s)\n", speed_max(mddev),
|
|
|
|
mddev->sync_speed_max ? "local": "system");
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_max_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
unsigned int max;
|
|
|
|
int rv;
|
|
|
|
|
2006-01-06 08:21:36 +00:00
|
|
|
if (strncmp(buf, "system", 6)==0) {
|
2015-05-16 11:02:38 +00:00
|
|
|
max = 0;
|
|
|
|
} else {
|
|
|
|
rv = kstrtouint(buf, 10, &max);
|
|
|
|
if (rv < 0)
|
|
|
|
return rv;
|
|
|
|
if (max == 0)
|
|
|
|
return -EINVAL;
|
2006-01-06 08:21:36 +00:00
|
|
|
}
|
|
|
|
mddev->sync_speed_max = max;
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_sync_max =
|
|
|
|
__ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
|
|
|
|
|
2007-10-17 06:30:54 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
degraded_show(struct mddev *mddev, char *page)
|
2007-10-17 06:30:54 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%d\n", mddev->degraded);
|
|
|
|
}
|
|
|
|
static struct md_sysfs_entry md_degraded = __ATTR_RO(degraded);
|
2006-01-06 08:21:36 +00:00
|
|
|
|
2008-05-23 20:04:38 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_force_parallel_show(struct mddev *mddev, char *page)
|
2008-05-23 20:04:38 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%d\n", mddev->parallel_resync);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_force_parallel_store(struct mddev *mddev, const char *buf, size_t len)
|
2008-05-23 20:04:38 +00:00
|
|
|
{
|
|
|
|
long n;
|
|
|
|
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtol(buf, 10, &n))
|
2008-05-23 20:04:38 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (n != 0 && n != 1)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
mddev->parallel_resync = n;
|
|
|
|
|
|
|
|
if (mddev->sync_thread)
|
|
|
|
wake_up(&resync_wait);
|
|
|
|
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* force parallel resync, even with shared block devices */
|
|
|
|
static struct md_sysfs_entry md_sync_force_parallel =
|
|
|
|
__ATTR(sync_force_parallel, S_IRUGO|S_IWUSR,
|
|
|
|
sync_force_parallel_show, sync_force_parallel_store);
|
|
|
|
|
2006-01-06 08:21:36 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_speed_show(struct mddev *mddev, char *page)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
|
|
|
unsigned long resync, dt, db;
|
2022-06-08 16:27:54 +00:00
|
|
|
if (mddev->curr_resync == MD_RESYNC_NONE)
|
2009-03-31 04:24:32 +00:00
|
|
|
return sprintf(page, "none\n");
|
2008-03-25 21:24:09 +00:00
|
|
|
resync = mddev->curr_mark_cnt - atomic_read(&mddev->recovery_active);
|
|
|
|
dt = (jiffies - mddev->resync_mark) / HZ;
|
2006-01-06 08:21:36 +00:00
|
|
|
if (!dt) dt++;
|
2008-03-25 21:24:09 +00:00
|
|
|
db = resync - mddev->resync_mark_cnt;
|
|
|
|
return sprintf(page, "%lu\n", db/dt/2); /* K/sec */
|
2006-01-06 08:21:36 +00:00
|
|
|
}
|
|
|
|
|
2006-07-10 11:44:18 +00:00
|
|
|
static struct md_sysfs_entry md_sync_speed = __ATTR_RO(sync_speed);
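
The arithmetic in sync_speed_show() converts "sectors completed since the last mark" into K/sec: divide by the elapsed whole seconds, then by 2 because two 512-byte sectors make one KiB. A small worked sketch follows, with `sync_speed_kib()` as a hypothetical helper and the tick rate passed in explicitly rather than taken from `HZ`.

```c
#include <assert.h>

/*
 * sectors_done:    512-byte sectors completed since the last mark
 * elapsed_jiffies: timer ticks since the last mark
 * hz:              timer ticks per second
 */
static unsigned long sync_speed_kib(unsigned long sectors_done,
				    unsigned long elapsed_jiffies,
				    unsigned long hz)
{
	unsigned long dt;

	assert(hz != 0);
	dt = elapsed_jiffies / hz;	/* whole seconds, truncated */
	if (!dt)
		dt = 1;			/* less than a second: avoid dividing by zero */
	return sectors_done / dt / 2;	/* two sectors per KiB */
}
```

With 2048000 sectors completed over 2500 ticks at 250 ticks per second, the elapsed time is 10 s and the result is 2048000 / 10 / 2 = 102400 K/sec. With less than one second elapsed, the interval is clamped to 1 s, so 2000 sectors are reported as 1000 K/sec.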
|
2006-01-06 08:21:36 +00:00
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
sync_completed_show(struct mddev *mddev, char *page)
|
2006-01-06 08:21:36 +00:00
|
|
|
{
|
2011-01-13 22:14:34 +00:00
|
|
|
unsigned long long max_sectors, resync;
|
2006-01-06 08:21:36 +00:00
|
|
|
|
2009-04-14 06:28:34 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
|
|
|
return sprintf(page, "none\n");
|
|
|
|
|
2022-06-08 16:27:54 +00:00
|
|
|
if (mddev->curr_resync == MD_RESYNC_YIELDED ||
|
|
|
|
mddev->curr_resync == MD_RESYNC_DELAYED)
|
2012-10-11 03:25:57 +00:00
|
|
|
return sprintf(page, "delayed\n");
|
|
|
|
|
2012-05-20 23:28:33 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
2009-03-31 03:33:13 +00:00
|
|
|
max_sectors = mddev->resync_max_sectors;
|
2006-01-06 08:21:36 +00:00
|
|
|
else
|
2009-03-31 03:33:13 +00:00
|
|
|
max_sectors = mddev->dev_sectors;
|
2006-01-06 08:21:36 +00:00
|
|
|
|
2009-04-14 06:28:34 +00:00
|
|
|
resync = mddev->curr_resync_completed;
|
2011-01-13 22:14:34 +00:00
|
|
|
return sprintf(page, "%llu / %llu\n", resync, max_sectors);
|
2006-01-06 08:21:36 +00:00
|
|
|
}
|
|
|
|
|
2014-09-29 22:53:05 +00:00
|
|
|
static struct md_sysfs_entry md_sync_completed =
|
|
|
|
__ATTR_PREALLOC(sync_completed, S_IRUGO, sync_completed_show, NULL);
|
2006-01-06 08:21:36 +00:00
|
|
|
|
2008-06-27 22:31:24 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
min_sync_show(struct mddev *mddev, char *page)
|
2008-06-27 22:31:24 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)mddev->resync_min);
|
|
|
|
}
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
min_sync_store(struct mddev *mddev, const char *buf, size_t len)
|
2008-06-27 22:31:24 +00:00
|
|
|
{
|
|
|
|
unsigned long long min;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
|
|
|
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtoull(buf, 10, &min))
|
2008-06-27 22:31:24 +00:00
|
|
|
return -EINVAL;
|
2014-12-15 01:57:01 +00:00
|
|
|
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
err = -EINVAL;
|
2008-06-27 22:31:24 +00:00
|
|
|
if (min > mddev->resync_max)
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
err = -EBUSY;
|
2008-06-27 22:31:24 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2008-06-27 22:31:24 +00:00
|
|
|
|
2015-03-23 06:36:38 +00:00
|
|
|
/* Round down to multiple of 4K for safety */
|
|
|
|
mddev->resync_min = round_down(min, 8);
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
2008-06-27 22:31:24 +00:00
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return err ?: len;
|
2008-06-27 22:31:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_min_sync =
|
|
|
|
__ATTR(sync_min, S_IRUGO|S_IWUSR, min_sync_show, min_sync_store);
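
min_sync_store() rounds the requested start sector down to a multiple of 8 sectors, which is 8 x 512 = 4096 bytes. The sketch below shows the usual power-of-two rounding trick; `round_down_pow2()` is a hypothetical user-space stand-in for the kernel's `round_down()` macro.

```c
#include <assert.h>

/* Round x down to a multiple of align, where align is a power of two. */
static unsigned long long round_down_pow2(unsigned long long x,
					  unsigned long long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);	/* clear the low bits */
}
```

For example, a request of 1003 sectors is stored as 1000, and anything below 8 is stored as 0.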
|
|
|
|
|
2008-02-06 09:39:52 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
max_sync_show(struct mddev *mddev, char *page)
|
2008-02-06 09:39:52 +00:00
|
|
|
{
|
|
|
|
if (mddev->resync_max == MaxSector)
|
|
|
|
return sprintf(page, "max\n");
|
|
|
|
else
|
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)mddev->resync_max);
|
|
|
|
}
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
max_sync_store(struct mddev *mddev, const char *buf, size_t len)
|
2008-02-06 09:39:52 +00:00
|
|
|
{
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
|
|
|
spin_lock(&mddev->lock);
|
2008-02-06 09:39:52 +00:00
|
|
|
if (strncmp(buf, "max", 3) == 0)
|
|
|
|
mddev->resync_max = MaxSector;
|
|
|
|
else {
|
2008-06-27 22:31:24 +00:00
|
|
|
unsigned long long max;
|
2014-12-15 01:57:01 +00:00
|
|
|
int chunk;
|
|
|
|
|
|
|
|
err = -EINVAL;
|
2013-06-01 07:15:16 +00:00
|
|
|
if (kstrtoull(buf, 10, &max))
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2008-06-27 22:31:24 +00:00
|
|
|
if (max < mddev->resync_min)
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
err = -EBUSY;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (max < mddev->resync_max && md_is_rdwr(mddev) &&
|
2008-02-06 09:39:52 +00:00
|
|
|
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2014-12-15 01:57:01 +00:00
|
|
|
goto out_unlock;
|
2008-02-06 09:39:52 +00:00
|
|
|
|
|
|
|
/* Must be a multiple of chunk_size */
|
2014-12-15 01:57:01 +00:00
|
|
|
chunk = mddev->chunk_sectors;
|
|
|
|
if (chunk) {
|
2009-06-16 07:01:42 +00:00
|
|
|
sector_t temp = max;
|
2014-12-15 01:57:01 +00:00
|
|
|
|
|
|
|
err = -EINVAL;
|
|
|
|
if (sector_div(temp, chunk))
|
|
|
|
goto out_unlock;
|
2008-02-06 09:39:52 +00:00
|
|
|
}
|
|
|
|
mddev->resync_max = max;
|
|
|
|
}
|
|
|
|
wake_up(&mddev->recovery_wait);
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return err ?: len;
|
2008-02-06 09:39:52 +00:00
|
|
|
}
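/*
 * Illustrative userspace sketch, not part of md.c: max_sync_store() above
 * rejects a numeric sync_max that is not a multiple of chunk_sectors, using
 * the remainder returned by sector_div(). The helper below restates that
 * check with a plain modulo; the name sync_max_is_chunk_aligned() is
 * hypothetical.
 */

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when 'max_sectors' would pass the chunk-multiple check. */
static bool sync_max_is_chunk_aligned(unsigned long long max_sectors,
				      int chunk_sectors)
{
	assert(chunk_sectors >= 0);
	if (chunk_sectors == 0)
		return true;	/* no chunk size set: the check is skipped */
	return max_sectors % (unsigned long long)chunk_sectors == 0;
}
```

/*
 * With a 128-sector chunk, 1024 is accepted and 1000 would be rejected
 * with -EINVAL in the real store function.
 */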
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_max_sync =
|
|
|
|
__ATTR(sync_max, S_IRUGO|S_IWUSR, max_sync_show, max_sync_store);
|
|
|
|
|
2006-03-27 09:18:14 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
suspend_lo_show(struct mddev *mddev, char *page)
|
2006-03-27 09:18:14 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-03-27 09:18:14 +00:00
|
|
|
{
|
2017-10-19 01:49:15 +00:00
|
|
|
unsigned long long new;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
2006-03-27 09:18:14 +00:00
|
|
|
|
2015-05-16 11:02:38 +00:00
|
|
|
err = kstrtoull(buf, 10, &new);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
if (new != (sector_t)new)
|
2006-03-27 09:18:14 +00:00
|
|
|
return -EINVAL;
|
2011-01-13 22:14:34 +00:00
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
err = -EINVAL;
|
|
|
|
if (mddev->pers == NULL ||
|
|
|
|
mddev->pers->quiesce == NULL)
|
|
|
|
goto unlock;
|
2017-10-19 01:49:15 +00:00
|
|
|
mddev_suspend(mddev);
|
2011-01-13 22:14:34 +00:00
|
|
|
mddev->suspend_lo = new;
|
2017-10-19 01:49:15 +00:00
|
|
|
mddev_resume(mddev);
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
|
|
|
unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2006-03-27 09:18:14 +00:00
|
|
|
}
|
|
|
|
static struct md_sysfs_entry md_suspend_lo =
|
|
|
|
__ATTR(suspend_lo, S_IRUGO|S_IWUSR, suspend_lo_show, suspend_lo_store);
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
suspend_hi_show(struct mddev *mddev, char *page)
|
2006-03-27 09:18:14 +00:00
|
|
|
{
|
|
|
|
return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_hi);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
|
2006-03-27 09:18:14 +00:00
|
|
|
{
|
2017-10-19 01:49:15 +00:00
|
|
|
unsigned long long new;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
2006-03-27 09:18:14 +00:00
|
|
|
|
2015-05-16 11:02:38 +00:00
|
|
|
err = kstrtoull(buf, 10, &new);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
if (new != (sector_t)new)
|
2006-03-27 09:18:14 +00:00
|
|
|
return -EINVAL;
|
2011-01-13 22:14:34 +00:00
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
err = -EINVAL;
|
2017-10-19 01:49:15 +00:00
|
|
|
if (mddev->pers == NULL)
|
2014-12-15 01:57:01 +00:00
|
|
|
goto unlock;
|
2017-10-19 01:49:15 +00:00
|
|
|
|
|
|
|
mddev_suspend(mddev);
|
2011-01-13 22:14:34 +00:00
|
|
|
mddev->suspend_hi = new;
|
2017-10-19 01:49:15 +00:00
|
|
|
mddev_resume(mddev);
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
|
|
|
unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2006-03-27 09:18:14 +00:00
|
|
|
}
|
|
|
|
static struct md_sysfs_entry md_suspend_hi =
|
|
|
|
__ATTR(suspend_hi, S_IRUGO|S_IWUSR, suspend_hi_show, suspend_hi_store);
|
|
|
|
|
2007-05-09 09:35:38 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
reshape_position_show(struct mddev *mddev, char *page)
|
2007-05-09 09:35:38 +00:00
|
|
|
{
|
|
|
|
if (mddev->reshape_position != MaxSector)
|
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)mddev->reshape_position);
|
|
|
|
strcpy(page, "none\n");
|
|
|
|
return 5;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
reshape_position_store(struct mddev *mddev, const char *buf, size_t len)
|
2007-05-09 09:35:38 +00:00
|
|
|
{
|
2012-05-20 23:27:00 +00:00
|
|
|
struct md_rdev *rdev;
|
2015-05-16 11:02:38 +00:00
|
|
|
unsigned long long new;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
|
|
|
|
2015-05-16 11:02:38 +00:00
|
|
|
err = kstrtoull(buf, 10, &new);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
if (new != (sector_t)new)
|
2007-05-09 09:35:38 +00:00
|
|
|
return -EINVAL;
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
err = -EBUSY;
|
|
|
|
if (mddev->pers)
|
|
|
|
goto unlock;
|
2007-05-09 09:35:38 +00:00
|
|
|
mddev->reshape_position = new;
|
|
|
|
mddev->delta_disks = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
2007-05-09 09:35:38 +00:00
|
|
|
mddev->new_level = mddev->level;
|
|
|
|
mddev->new_layout = mddev->layout;
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
2012-05-20 23:27:00 +00:00
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
rdev->new_data_offset = rdev->data_offset;
|
2014-12-15 01:57:01 +00:00
|
|
|
err = 0;
|
|
|
|
unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2007-05-09 09:35:38 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_reshape_position =
|
|
|
|
__ATTR(reshape_position, S_IRUGO|S_IWUSR, reshape_position_show,
|
|
|
|
reshape_position_store);
|
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
static ssize_t
|
|
|
|
reshape_direction_show(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%s\n",
|
|
|
|
mddev->reshape_backwards ? "backwards" : "forwards");
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
reshape_direction_store(struct mddev *mddev, const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
int backwards = 0;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
if (cmd_match(buf, "forwards"))
|
|
|
|
backwards = 0;
|
|
|
|
else if (cmd_match(buf, "backwards"))
|
|
|
|
backwards = 1;
|
|
|
|
else
|
|
|
|
return -EINVAL;
|
|
|
|
if (mddev->reshape_backwards == backwards)
|
|
|
|
return len;
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2012-05-20 23:27:00 +00:00
|
|
|
/* check if we are allowed to change */
|
|
|
|
if (mddev->delta_disks)
|
2014-12-15 01:57:01 +00:00
|
|
|
err = -EBUSY;
|
|
|
|
else if (mddev->persistent &&
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->major_version == 0)
|
2014-12-15 01:57:01 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
else
|
|
|
|
mddev->reshape_backwards = backwards;
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2012-05-20 23:27:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_reshape_direction =
|
|
|
|
__ATTR(reshape_direction, S_IRUGO|S_IWUSR, reshape_direction_show,
|
|
|
|
reshape_direction_store);
|
|
|
|
|
2009-03-31 04:00:31 +00:00
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
array_size_show(struct mddev *mddev, char *page)
|
2009-03-31 04:00:31 +00:00
|
|
|
{
|
|
|
|
if (mddev->external_size)
|
|
|
|
return sprintf(page, "%llu\n",
|
|
|
|
(unsigned long long)mddev->array_sectors/2);
|
|
|
|
else
|
|
|
|
return sprintf(page, "default\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 05:47:53 +00:00
|
|
|
array_size_store(struct mddev *mddev, const char *buf, size_t len)
|
2009-03-31 04:00:31 +00:00
|
|
|
{
|
|
|
|
sector_t sectors;
|
2014-12-15 01:57:01 +00:00
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2009-03-31 04:00:31 +00:00
|
|
|
|
2016-05-02 15:33:13 +00:00
|
|
|
/* clustered RAID doesn't support changing array_sectors */
|
2017-04-10 06:15:55 +00:00
|
|
|
if (mddev_is_clustered(mddev)) {
|
|
|
|
mddev_unlock(mddev);
|
2016-05-02 15:33:13 +00:00
|
|
|
return -EINVAL;
|
2017-04-10 06:15:55 +00:00
|
|
|
}
|
2016-05-02 15:33:13 +00:00
|
|
|
|
2009-03-31 04:00:31 +00:00
|
|
|
if (strncmp(buf, "default", 7) == 0) {
|
|
|
|
if (mddev->pers)
|
|
|
|
sectors = mddev->pers->size(mddev, 0, 0);
|
|
|
|
else
|
|
|
|
sectors = mddev->array_sectors;
|
|
|
|
|
|
|
|
mddev->external_size = 0;
|
|
|
|
} else {
|
|
|
|
if (strict_blocks_to_sectors(buf, &sectors) < 0)
|
2014-12-15 01:57:01 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
else if (mddev->pers && mddev->pers->size(mddev, 0, 0) < sectors)
|
|
|
|
err = -E2BIG;
|
|
|
|
else
|
|
|
|
mddev->external_size = 1;
|
2009-03-31 04:00:31 +00:00
|
|
|
}
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
if (!err) {
|
|
|
|
mddev->array_sectors = sectors;
|
2020-11-16 14:57:11 +00:00
|
|
|
if (mddev->pers)
|
|
|
|
set_capacity_and_notify(mddev->gendisk,
|
|
|
|
mddev->array_sectors);
|
2011-02-16 02:58:38 +00:00
|
|
|
}
|
2014-12-15 01:57:01 +00:00
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2009-03-31 04:00:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_array_size =
|
|
|
|
__ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
|
|
|
|
array_size_store);
|
2006-03-27 09:18:14 +00:00
|
|
|
|
2017-03-09 09:00:00 +00:00
|
|
|
static ssize_t
|
|
|
|
consistency_policy_show(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
|
|
|
|
ret = sprintf(page, "journal\n");
|
|
|
|
} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
|
|
|
|
ret = sprintf(page, "ppl\n");
|
|
|
|
} else if (mddev->bitmap) {
|
|
|
|
ret = sprintf(page, "bitmap\n");
|
|
|
|
} else if (mddev->pers) {
|
|
|
|
if (mddev->pers->sync_request)
|
|
|
|
ret = sprintf(page, "resync\n");
|
|
|
|
else
|
|
|
|
ret = sprintf(page, "none\n");
|
|
|
|
} else {
|
|
|
|
ret = sprintf(page, "unknown\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
|
|
|
|
{
|
2017-03-09 09:00:03 +00:00
|
|
|
int err = 0;
|
|
|
|
|
2017-03-09 09:00:00 +00:00
|
|
|
if (mddev->pers) {
|
2017-03-09 09:00:03 +00:00
|
|
|
if (mddev->pers->change_consistency_policy)
|
|
|
|
err = mddev->pers->change_consistency_policy(mddev, buf);
|
|
|
|
else
|
|
|
|
err = -EBUSY;
|
2017-03-09 09:00:00 +00:00
|
|
|
} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
|
|
|
|
set_bit(MD_HAS_PPL, &mddev->flags);
|
|
|
|
} else {
|
2017-03-09 09:00:03 +00:00
|
|
|
err = -EINVAL;
|
2017-03-09 09:00:00 +00:00
|
|
|
}
|
2017-03-09 09:00:03 +00:00
|
|
|
|
|
|
|
return err ? err : len;
|
2017-03-09 09:00:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_consistency_policy =
|
|
|
|
__ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
|
|
|
|
consistency_policy_store);
|
|
|
|
|
2019-07-24 09:09:19 +00:00
|
|
|
static ssize_t fail_last_dev_show(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
return sprintf(page, "%d\n", mddev->fail_last_dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set fail_last_dev to true to allow the last device to be forcibly removed
|
|
|
|
* from RAID1/RAID10.
|
|
|
|
*/
|
|
|
|
static ssize_t
|
|
|
|
fail_last_dev_store(struct mddev *mddev, const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
bool value;
|
|
|
|
|
|
|
|
ret = kstrtobool(buf, &value);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (value != mddev->fail_last_dev)
|
|
|
|
mddev->fail_last_dev = value;
|
|
|
|
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
static struct md_sysfs_entry md_fail_last_dev =
|
|
|
|
__ATTR(fail_last_dev, S_IRUGO | S_IWUSR, fail_last_dev_show,
|
|
|
|
fail_last_dev_store);
|
|
|
|
|
2019-12-23 09:48:56 +00:00
|
|
|
static ssize_t serialize_policy_show(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
if (mddev->pers == NULL || (mddev->pers->level != 1))
|
|
|
|
return sprintf(page, "n/a\n");
|
|
|
|
else
|
|
|
|
return sprintf(page, "%d\n", mddev->serialize_policy);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set serialize_policy to true to ensure that write IO is not reordered
|
|
|
|
* for raid1.
|
|
|
|
*/
|
|
|
|
static ssize_t
|
|
|
|
serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
bool value;
|
|
|
|
|
|
|
|
err = kstrtobool(buf, &value);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (value == mddev->serialize_policy)
|
|
|
|
return len;
|
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
if (mddev->pers == NULL || (mddev->pers->level != 1)) {
|
|
|
|
pr_err("md: serialize_policy is only effective for raid1\n");
|
|
|
|
err = -EINVAL;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
if (value)
|
|
|
|
mddev_create_serial_pool(mddev, NULL, true);
|
|
|
|
else
|
|
|
|
mddev_destroy_serial_pool(mddev, NULL, true);
|
|
|
|
mddev->serialize_policy = value;
|
|
|
|
mddev_resume(mddev);
|
|
|
|
unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry md_serialize_policy =
|
|
|
|
__ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
|
|
|
|
serialize_policy_store);
|
|
|
|
|
|
|
|
|
2005-11-09 05:39:23 +00:00
|
|
|
static struct attribute *md_default_attrs[] = {
|
|
|
|
&md_level.attr,
|
2006-06-26 07:27:59 +00:00
|
|
|
&md_layout.attr,
|
2005-11-09 05:39:23 +00:00
|
|
|
&md_raid_disks.attr,
|
2020-07-28 10:01:39 +00:00
|
|
|
&md_uuid.attr,
|
2006-01-06 08:20:47 +00:00
|
|
|
&md_chunk_size.attr,
|
2006-01-06 08:20:49 +00:00
|
|
|
&md_size.attr,
|
2006-06-26 07:28:00 +00:00
|
|
|
&md_resync_start.attr,
|
2006-01-06 08:20:50 +00:00
|
|
|
&md_metadata.attr,
|
2006-01-06 08:21:16 +00:00
|
|
|
&md_new_device.attr,
|
2006-06-26 07:27:37 +00:00
|
|
|
&md_safe_delay.attr,
|
2006-06-26 07:27:58 +00:00
|
|
|
&md_array_state.attr,
|
2007-05-09 09:35:38 +00:00
|
|
|
&md_reshape_position.attr,
|
2012-05-20 23:27:00 +00:00
|
|
|
&md_reshape_direction.attr,
|
2009-03-31 04:00:31 +00:00
|
|
|
&md_array_size.attr,
|
2009-12-14 01:49:58 +00:00
|
|
|
&max_corr_read_errors.attr,
|
2017-03-09 09:00:00 +00:00
|
|
|
&md_consistency_policy.attr,
|
2019-07-24 09:09:19 +00:00
|
|
|
&md_fail_last_dev.attr,
|
2019-12-23 09:48:56 +00:00
|
|
|
&md_serialize_policy.attr,
|
2005-11-09 05:39:40 +00:00
|
|
|
NULL,
|
|
|
|
};
|
|
|
|
|
2021-09-01 11:38:31 +00:00
|
|
|
static const struct attribute_group md_default_group = {
|
|
|
|
.attrs = md_default_attrs,
|
|
|
|
};
|
|
|
|
|
2005-11-09 05:39:40 +00:00
|
|
|
static struct attribute *md_redundancy_attrs[] = {
|
2005-11-09 05:39:26 +00:00
|
|
|
&md_scan_mode.attr,
|
2013-06-25 06:23:59 +00:00
|
|
|
&md_last_scan_mode.attr,
|
2005-11-09 05:39:26 +00:00
|
|
|
&md_mismatches.attr,
|
2006-01-06 08:21:36 +00:00
|
|
|
&md_sync_min.attr,
|
|
|
|
&md_sync_max.attr,
|
|
|
|
&md_sync_speed.attr,
|
2008-05-23 20:04:38 +00:00
|
|
|
&md_sync_force_parallel.attr,
|
2006-01-06 08:21:36 +00:00
|
|
|
&md_sync_completed.attr,
|
2008-06-27 22:31:24 +00:00
|
|
|
&md_min_sync.attr,
|
2008-02-06 09:39:52 +00:00
|
|
|
&md_max_sync.attr,
|
2006-03-27 09:18:14 +00:00
|
|
|
&md_suspend_lo.attr,
|
|
|
|
&md_suspend_hi.attr,
|
2006-10-03 08:15:49 +00:00
|
|
|
&md_bitmap.attr,
|
2007-10-17 06:30:54 +00:00
|
|
|
&md_degraded.attr,
|
2005-11-09 05:39:23 +00:00
|
|
|
NULL,
|
|
|
|
};
|
2021-05-29 10:30:49 +00:00
|
|
|
static const struct attribute_group md_redundancy_group = {
|
2005-11-09 05:39:40 +00:00
|
|
|
.name = NULL,
|
|
|
|
.attrs = md_redundancy_attrs,
|
|
|
|
};
|
|
|
|
|
2021-09-01 11:38:31 +00:00
|
|
|
static const struct attribute_group *md_attr_groups[] = {
|
|
|
|
&md_default_group,
|
|
|
|
&md_bitmap_group,
|
|
|
|
NULL,
|
|
|
|
};
|
|
|
|
|
2005-11-09 05:39:23 +00:00
|
|
|
static ssize_t
|
|
|
|
md_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
|
|
|
|
{
|
|
|
|
struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = container_of(kobj, struct mddev, kobj);
|
2005-11-09 05:39:39 +00:00
|
|
|
ssize_t rv;
|
2005-11-09 05:39:23 +00:00
|
|
|
|
|
|
|
if (!entry->show)
|
|
|
|
return -EIO;
|
2011-12-08 04:49:46 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
2022-07-19 09:18:23 +00:00
|
|
|
if (!mddev_get(mddev)) {
|
2011-12-08 04:49:46 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2014-12-15 01:56:59 +00:00
|
|
|
rv = entry->show(mddev, page);
|
2011-12-08 04:49:46 +00:00
|
|
|
mddev_put(mddev);
|
2005-11-09 05:39:39 +00:00
|
|
|
return rv;
|
2005-11-09 05:39:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
md_attr_store(struct kobject *kobj, struct attribute *attr,
|
|
|
|
const char *page, size_t length)
|
|
|
|
{
|
|
|
|
struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = container_of(kobj, struct mddev, kobj);
|
2005-11-09 05:39:39 +00:00
|
|
|
ssize_t rv;
|
2005-11-09 05:39:23 +00:00
|
|
|
|
|
|
|
if (!entry->store)
|
|
|
|
return -EIO;
|
2006-07-10 11:44:19 +00:00
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EACCES;
|
2011-12-08 04:49:46 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
2022-07-19 09:18:23 +00:00
|
|
|
if (!mddev_get(mddev)) {
|
2011-12-08 04:49:46 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
2014-12-15 01:57:01 +00:00
|
|
|
rv = entry->store(mddev, page, length);
|
2011-12-08 04:49:46 +00:00
|
|
|
mddev_put(mddev);
|
2005-11-09 05:39:39 +00:00
|
|
|
return rv;
|
2005-11-09 05:39:23 +00:00
|
|
|
}
|
|
|
|
|
2022-07-19 09:18:18 +00:00
|
|
|
static void md_kobj_release(struct kobject *ko)
|
2005-11-09 05:39:23 +00:00
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = container_of(ko, struct mddev, kobj);
|
2009-01-08 21:31:09 +00:00
|
|
|
|
|
|
|
if (mddev->sysfs_state)
|
|
|
|
sysfs_put(mddev->sysfs_state);
|
2020-07-14 23:10:26 +00:00
|
|
|
if (mddev->sysfs_level)
|
|
|
|
sysfs_put(mddev->sysfs_level);
|
|
|
|
|
2022-07-19 09:18:15 +00:00
|
|
|
del_gendisk(mddev->gendisk);
|
|
|
|
put_disk(mddev->gendisk);
|
2005-11-09 05:39:23 +00:00
|
|
|
}
|
|
|
|
|
2010-01-19 01:58:23 +00:00
|
|
|
static const struct sysfs_ops md_sysfs_ops = {
|
2005-11-09 05:39:23 +00:00
|
|
|
.show = md_attr_show,
|
|
|
|
.store = md_attr_store,
|
|
|
|
};
|
2023-02-14 03:19:22 +00:00
|
|
|
static const struct kobj_type md_ktype = {
|
2022-07-19 09:18:18 +00:00
|
|
|
.release = md_kobj_release,
|
2005-11-09 05:39:23 +00:00
|
|
|
.sysfs_ops = &md_sysfs_ops,
|
2021-09-01 11:38:31 +00:00
|
|
|
.default_groups = md_attr_groups,
|
2005-11-09 05:39:23 +00:00
|
|
|
};
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
int mdp_major = 0;
|
|
|
|
|
2009-03-04 07:57:25 +00:00
|
|
|
static void mddev_delayed_delete(struct work_struct *ws)
|
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = container_of(ws, struct mddev, del_work);
|
2009-03-04 07:57:25 +00:00
|
|
|
|
|
|
|
kobject_put(&mddev->kobj);
|
|
|
|
}
|
|
|
|
|
2022-07-23 06:24:29 +00:00
|
|
|
struct mddev *md_alloc(dev_t dev, char *name)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2017-04-12 06:26:13 +00:00
|
|
|
/*
|
|
|
|
* If dev is zero, name is the name of a device to allocate with
|
|
|
|
* an arbitrary minor number. It will be "md_???"
|
|
|
|
* If dev is non-zero it must be a device number with a MAJOR of
|
|
|
|
* MD_MAJOR or mdp_major. In this case, if "name" is NULL, then
|
|
|
|
* the device is being created by opening a node in /dev.
|
|
|
|
* If "name" is not NULL, the device is being created by
|
|
|
|
* writing to /sys/module/md_mod/parameters/new_array.
|
|
|
|
*/
|
2006-03-27 09:18:20 +00:00
|
|
|
static DEFINE_MUTEX(disks_mutex);
|
2021-04-12 08:05:30 +00:00
|
|
|
struct mddev *mddev;
|
2005-04-16 22:20:36 +00:00
|
|
|
struct gendisk *disk;
|
2009-01-08 21:31:10 +00:00
|
|
|
int partitioned;
|
|
|
|
int shift;
|
|
|
|
int unit;
|
2021-04-12 08:05:30 +00:00
|
|
|
int error;
|
2009-01-08 21:31:10 +00:00
|
|
|
|
2021-04-12 08:05:30 +00:00
|
|
|
/*
|
|
|
|
* Wait for any previous instance of this device to be completely
|
|
|
|
* removed (mddev_delayed_delete).
|
2009-01-08 21:31:10 +00:00
|
|
|
*/
|
2010-10-15 13:36:08 +00:00
|
|
|
flush_workqueue(md_misc_wq);
|
2009-01-08 21:31:10 +00:00
|
|
|
|
2006-03-27 09:18:20 +00:00
|
|
|
mutex_lock(&disks_mutex);
|
2021-04-12 08:05:30 +00:00
|
|
|
mddev = mddev_alloc(dev);
|
|
|
|
if (IS_ERR(mddev)) {
|
2022-07-19 09:18:16 +00:00
|
|
|
error = PTR_ERR(mddev);
|
|
|
|
goto out_unlock;
|
2021-04-12 08:05:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
partitioned = (MAJOR(mddev->unit) != MD_MAJOR);
|
|
|
|
shift = partitioned ? MdpMinorShift : 0;
|
|
|
|
unit = MINOR(mddev->unit) >> shift;
|
2009-01-08 21:31:10 +00:00
|
|
|
|
2017-04-12 06:26:13 +00:00
|
|
|
if (name && !dev) {
|
2009-01-08 21:31:10 +00:00
|
|
|
/* Need to ensure that 'name' is not a duplicate.
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev2;
|
2009-01-08 21:31:10 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
|
|
|
|
list_for_each_entry(mddev2, &all_mddevs, all_mddevs)
|
|
|
|
if (mddev2->gendisk &&
|
|
|
|
strcmp(mddev2->gendisk->disk_name, name) == 0) {
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
2021-04-12 08:05:30 +00:00
|
|
|
error = -EEXIST;
|
2022-07-19 09:18:16 +00:00
|
|
|
goto out_free_mddev;
|
2009-01-08 21:31:10 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2017-04-12 06:26:13 +00:00
|
|
|
if (name && dev)
|
|
|
|
/*
|
|
|
|
* Creating /dev/mdNNN via "new_array", so adjust hold_active.
|
|
|
|
*/
|
|
|
|
mddev->hold_active = UNTIL_STOP;
|
2009-01-08 21:31:08 +00:00
|
|
|
|
2009-07-01 02:27:21 +00:00
|
|
|
error = -ENOMEM;
|
2021-05-21 05:51:04 +00:00
|
|
|
disk = blk_alloc_disk(NUMA_NO_NODE);
|
|
|
|
if (!disk)
|
2022-07-19 09:18:16 +00:00
|
|
|
goto out_free_mddev;
|
2009-03-31 03:39:39 +00:00
|
|
|
|
2009-01-08 21:31:10 +00:00
|
|
|
disk->major = MAJOR(mddev->unit);
|
2005-04-16 22:20:36 +00:00
|
|
|
disk->first_minor = unit << shift;
|
2021-05-21 05:51:04 +00:00
|
|
|
disk->minors = 1 << shift;
|
2009-01-08 21:31:10 +00:00
|
|
|
if (name)
|
|
|
|
strcpy(disk->disk_name, name);
|
|
|
|
else if (partitioned)
|
2005-04-16 22:20:36 +00:00
|
|
|
sprintf(disk->disk_name, "md_d%d", unit);
|
2005-06-21 04:15:16 +00:00
|
|
|
else
|
2005-04-16 22:20:36 +00:00
|
|
|
sprintf(disk->disk_name, "md%d", unit);
|
|
|
|
disk->fops = &md_fops;
|
|
|
|
disk->private_data = mddev;
|
2021-05-21 05:51:04 +00:00
|
|
|
|
|
|
|
mddev->queue = disk->queue;
|
|
|
|
blk_set_stacking_limits(&mddev->queue->limits);
|
2016-03-30 16:16:53 +00:00
|
|
|
blk_queue_write_cache(mddev->queue, true, true);
|
2020-07-08 12:25:41 +00:00
|
|
|
disk->events |= DISK_EVENT_MEDIA_CHANGE;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->gendisk = disk;
|
2021-09-01 11:38:30 +00:00
|
|
|
error = add_disk(disk);
|
2021-09-01 11:38:33 +00:00
|
|
|
if (error)
|
2022-07-19 09:18:16 +00:00
|
|
|
goto out_put_disk;
|
2011-05-10 07:49:01 +00:00
|
|
|
|
2022-07-19 09:18:15 +00:00
|
|
|
kobject_init(&mddev->kobj, &md_ktype);
|
2018-06-08 00:52:54 +00:00
|
|
|
error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md");
|
2022-07-19 09:18:16 +00:00
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* The disk is already live at this point. Clear the hold flag
|
|
|
|
* and let mddev_put take care of the deletion, as it isn't any
|
|
|
|
* different from a normal close on last release now.
|
|
|
|
*/
|
|
|
|
mddev->hold_active = 0;
|
2022-07-23 06:24:29 +00:00
|
|
|
mutex_unlock(&disks_mutex);
|
|
|
|
mddev_put(mddev);
|
|
|
|
return ERR_PTR(error);
|
2022-07-19 09:18:16 +00:00
|
|
|
}
|
2021-09-01 11:38:33 +00:00
|
|
|
|
|
|
|
kobject_uevent(&mddev->kobj, KOBJ_ADD);
|
|
|
|
mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
|
|
|
|
mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
|
2021-09-01 11:38:32 +00:00
|
|
|
mutex_unlock(&disks_mutex);
|
2022-07-23 06:24:29 +00:00
|
|
|
return mddev;
|
2022-07-19 09:18:16 +00:00
|
|
|
|
|
|
|
out_put_disk:
|
|
|
|
put_disk(disk);
|
|
|
|
out_free_mddev:
|
|
|
|
mddev_free(mddev);
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&disks_mutex);
|
2022-07-23 06:24:29 +00:00
|
|
|
return ERR_PTR(error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int md_alloc_and_put(dev_t dev, char *name)
|
|
|
|
{
|
|
|
|
struct mddev *mddev = md_alloc(dev, name);
|
|
|
|
|
|
|
|
if (IS_ERR(mddev))
|
|
|
|
return PTR_ERR(mddev);
|
|
|
|
mddev_put(mddev);
|
|
|
|
return 0;
|
2009-01-08 21:31:10 +00:00
|
|
|
}
|
|
|
|
|
2020-10-29 14:58:34 +00:00
|
|
|
static void md_probe(dev_t dev)
|
2009-01-08 21:31:10 +00:00
|
|
|
{
|
2020-10-29 14:58:34 +00:00
|
|
|
if (MAJOR(dev) == MD_MAJOR && MINOR(dev) >= 512)
|
|
|
|
return;
|
2017-04-12 06:26:13 +00:00
|
|
|
if (create_on_open)
|
2022-07-23 06:24:29 +00:00
|
|
|
md_alloc_and_put(dev, NULL);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
treewide: Fix function prototypes for module_param_call()
Several function prototypes for the set/get functions defined by
module_param_call() have slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:
@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@
module_param_call(_name, _set_func, _get_func, _arg, _mode);
@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:
drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-10-18 02:04:42 +00:00
|
|
|
static int add_named_array(const char *val, const struct kernel_param *kp)
|
2009-01-08 21:31:10 +00:00
|
|
|
{
|
2017-04-12 06:26:13 +00:00
|
|
|
/*
|
|
|
|
* val must be "md_*" or "mdNNN".
|
|
|
|
* For "md_*" we allocate an array with a large free minor number, and
|
2009-01-08 21:31:10 +00:00
|
|
|
* set the name to val. val must not already be an active name.
|
2017-04-12 06:26:13 +00:00
|
|
|
* For "mdNNN" we allocate an array with the minor number NNN
|
|
|
|
* which must not already be in use.
|
2009-01-08 21:31:10 +00:00
|
|
|
*/
|
|
|
|
int len = strlen(val);
|
|
|
|
char buf[DISK_NAME_LEN];
|
2017-04-12 06:26:13 +00:00
|
|
|
unsigned long devnum;
|
2009-01-08 21:31:10 +00:00
|
|
|
|
|
|
|
while (len && val[len-1] == '\n')
|
|
|
|
len--;
|
|
|
|
if (len >= DISK_NAME_LEN)
|
|
|
|
return -E2BIG;
|
2022-04-01 02:13:17 +00:00
|
|
|
strscpy(buf, val, len+1);
|
2017-04-12 06:26:13 +00:00
|
|
|
if (strncmp(buf, "md_", 3) == 0)
|
2022-07-23 06:24:29 +00:00
|
|
|
return md_alloc_and_put(0, buf);
|
2017-04-12 06:26:13 +00:00
|
|
|
if (strncmp(buf, "md", 2) == 0 &&
|
|
|
|
isdigit(buf[2]) &&
|
|
|
|
kstrtoul(buf+2, 10, &devnum) == 0 &&
|
|
|
|
devnum <= MINORMASK)
|
2022-07-23 06:24:29 +00:00
|
|
|
return md_alloc_and_put(MKDEV(MD_MAJOR, devnum), NULL);
|
2017-04-12 06:26:13 +00:00
|
|
|
|
|
|
|
return -EINVAL;
|
2009-01-08 21:31:10 +00:00
|
|
|
}
|
|
|
|
|
2017-10-17 00:01:48 +00:00
|
|
|
static void md_safemode_timeout(struct timer_list *t)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2017-10-17 00:01:48 +00:00
|
|
|
struct mddev *mddev = from_timer(mddev, t, safemode_timer);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.
We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.
It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.
mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform atomic operations here when absolutely
needed.
md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:14 +00:00
|
|
|
mddev->safemode = 1;
|
|
|
|
if (mddev->external)
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
|
|
|
|
2006-01-06 08:20:15 +00:00
|
|
|
static int start_dirty_degraded;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
int md_run(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-01-06 08:20:36 +00:00
|
|
|
int err;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2011-10-11 05:49:58 +00:00
|
|
|
struct md_personality *pers;
|
2021-12-21 20:06:19 +00:00
|
|
|
bool nowait = true;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-04-16 22:26:42 +00:00
|
|
|
if (list_empty(&mddev->disks))
|
|
|
|
/* cannot run an array with no devices.. */
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (mddev->pers)
|
|
|
|
return -EBUSY;
|
2010-08-08 11:18:03 +00:00
|
|
|
/* Cannot run until previous stop completes properly */
|
|
|
|
if (mddev->sysfs_active)
|
|
|
|
return -EBUSY;
|
2010-04-15 00:13:47 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Analyze all RAID superblock(s)
|
|
|
|
*/
|
2008-02-06 09:39:53 +00:00
|
|
|
if (!mddev->raid_disks) {
|
|
|
|
if (!mddev->persistent)
|
|
|
|
return -EINVAL;
|
md: no longer compare spare disk superblock events in super_load
We have a test case as follows:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When we re-add /dev/sda to the array, it starts recovery.
After the other two disks in md1 are taken offline, the recovery is
interrupted and superblock updates cannot be written to the offline
disks, while the spare disk (/dev/sda) can continue to update its
superblock.
After stopping the array and reassembling it, we found that the array
fails to run, with the following kernel messages:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have a value of 'sb->events' smaller than
that of sda, they are kicked from the array. However, the only
remaining disk, sda, was in the 'spare' state before the stop and cannot
be added to the conf->mirrors[] array. In the end, assembling and running
the raid array fails.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, we do not compare superblock events when the disk is
a spare, the same as in validate_super.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 08:00:03 +00:00
|
|
|
err = analyze_sbs(mddev);
|
|
|
|
if (err)
|
|
|
|
return -EINVAL;
|
2008-02-06 09:39:53 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-01-06 08:20:51 +00:00
|
|
|
if (mddev->level != LEVEL_NONE)
|
|
|
|
request_module("md-level-%d", mddev->level);
|
|
|
|
else if (mddev->clevel[0])
|
|
|
|
request_module("md-%s", mddev->clevel);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Drop all container device buffers, from now on
|
|
|
|
* the only valid external interface is through the md
|
|
|
|
* device.
|
|
|
|
*/
|
2018-02-02 22:13:19 +00:00
|
|
|
mddev->has_superblocks = false;
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
continue;
|
|
|
|
sync_blockdev(rdev->bdev);
|
2007-05-06 21:49:54 +00:00
|
|
|
invalidate_bdev(rdev->bdev);
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro != MD_RDONLY && rdev_read_only(rdev)) {
|
|
|
|
mddev->ro = MD_RDONLY;
|
2017-04-12 22:53:48 +00:00
|
|
|
if (mddev->gendisk)
|
|
|
|
set_disk_ro(mddev->gendisk, 1);
|
|
|
|
}
|
2007-07-17 11:06:12 +00:00
|
|
|
|
2018-02-02 22:13:19 +00:00
|
|
|
if (rdev->sb_page)
|
|
|
|
mddev->has_superblocks = true;
|
|
|
|
|
2007-07-17 11:06:12 +00:00
|
|
|
/* perform some consistency tests on the device.
|
|
|
|
* We don't want the data to overlap the metadata.
|
2009-03-31 03:33:13 +00:00
|
|
|
* Internal Bitmap issues have been handled elsewhere.
|
2007-07-17 11:06:12 +00:00
|
|
|
*/
|
2011-01-13 22:14:34 +00:00
|
|
|
if (rdev->meta_bdev) {
|
|
|
|
/* Nothing to check */;
|
|
|
|
} else if (rdev->data_offset < rdev->sb_start) {
|
2009-03-31 03:33:13 +00:00
|
|
|
if (mddev->dev_sectors &&
|
|
|
|
rdev->data_offset + mddev->dev_sectors
|
2008-07-11 12:02:23 +00:00
|
|
|
> rdev->sb_start) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s: data overlaps metadata\n",
|
|
|
|
mdname(mddev));
|
2007-07-17 11:06:12 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
} else {
|
2008-07-11 12:02:23 +00:00
|
|
|
if (rdev->sb_start + rdev->sb_size/512
|
2007-07-17 11:06:12 +00:00
|
|
|
> rdev->data_offset) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s: metadata overlaps data\n",
|
|
|
|
mdname(mddev));
|
2007-07-17 11:06:12 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2022-09-27 07:58:15 +00:00
|
|
|
nowait = nowait && bdev_nowait(rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2018-05-20 22:25:52 +00:00
|
|
|
if (!bioset_initialized(&mddev->bio_set)) {
|
|
|
|
err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
|
|
|
|
if (err)
|
2023-08-25 03:09:50 +00:00
|
|
|
return err;
|
2017-02-14 15:29:00 +00:00
|
|
|
}
|
2018-05-20 22:25:52 +00:00
|
|
|
if (!bioset_initialized(&mddev->sync_set)) {
|
|
|
|
err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
|
|
|
|
if (err)
|
2021-05-25 09:46:17 +00:00
|
|
|
goto exit_bio_set;
|
|
|
|
}
|
2010-10-26 07:31:13 +00:00
|
|
|
|
2023-06-21 16:51:04 +00:00
|
|
|
if (!bioset_initialized(&mddev->io_clone_set)) {
|
|
|
|
err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
|
|
|
|
offsetof(struct md_io_clone, bio_clone), 0);
|
2023-06-21 16:51:03 +00:00
|
|
|
if (err)
|
|
|
|
goto exit_sync_set;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_lock(&pers_lock);
|
2006-01-06 08:20:51 +00:00
|
|
|
pers = find_pers(mddev->level, mddev->clevel);
|
2006-01-06 08:20:36 +00:00
|
|
|
if (!pers || !try_module_get(pers->owner)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&pers_lock);
|
2006-01-06 08:20:51 +00:00
|
|
|
if (mddev->level != LEVEL_NONE)
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: personality for level %d is not loaded!\n",
|
|
|
|
mddev->level);
|
2006-01-06 08:20:51 +00:00
|
|
|
else
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: personality for level %s is not loaded!\n",
|
|
|
|
mddev->clevel);
|
2018-06-13 15:39:49 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
goto abort;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&pers_lock);
|
2009-03-31 03:39:38 +00:00
|
|
|
if (mddev->level != pers->level) {
|
|
|
|
mddev->level = pers->level;
|
|
|
|
mddev->new_level = pers->level;
|
|
|
|
}
|
2022-04-01 02:13:17 +00:00
|
|
|
strscpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
if (mddev->reshape_position != MaxSector &&
|
2006-03-27 09:18:13 +00:00
|
|
|
pers->start_reshape == NULL) {
|
2006-03-27 09:18:11 +00:00
|
|
|
/* This personality cannot handle reshaping... */
|
|
|
|
module_put(pers->owner);
|
2018-06-13 15:39:49 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
goto abort;
|
2006-03-27 09:18:11 +00:00
|
|
|
}
|
|
|
|
|
2007-03-01 04:11:35 +00:00
|
|
|
if (pers->sync_request) {
|
|
|
|
/* Warn if this is a potentially silly
|
|
|
|
* configuration.
|
|
|
|
*/
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev2;
|
2007-03-01 04:11:35 +00:00
|
|
|
int warned = 0;
|
2009-01-08 21:31:08 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
rdev_for_each(rdev2, mddev) {
|
2007-03-01 04:11:35 +00:00
|
|
|
if (rdev < rdev2 &&
|
2020-09-03 05:40:58 +00:00
|
|
|
rdev->bdev->bd_disk ==
|
|
|
|
rdev2->bdev->bd_disk) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("%s: WARNING: %pg appears to be on the same physical disk as %pg.\n",
|
2016-11-02 03:16:49 +00:00
|
|
|
mdname(mddev),
|
2022-05-12 06:19:13 +00:00
|
|
|
rdev->bdev,
|
|
|
|
rdev2->bdev);
|
2007-03-01 04:11:35 +00:00
|
|
|
warned = 1;
|
|
|
|
}
|
|
|
|
}
|
2009-01-08 21:31:08 +00:00
|
|
|
|
2007-03-01 04:11:35 +00:00
|
|
|
if (warned)
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("True protection against single-disk failure might be compromised.\n");
|
2007-03-01 04:11:35 +00:00
|
|
|
}
|
|
|
|
|
2005-08-27 01:34:16 +00:00
|
|
|
mddev->recovery = 0;
|
2009-03-31 03:33:13 +00:00
|
|
|
/* may be over-ridden by personality */
|
|
|
|
mddev->resync_max_sectors = mddev->dev_sectors;
|
|
|
|
|
2006-01-06 08:20:15 +00:00
|
|
|
mddev->ok_start_degraded = start_dirty_degraded;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (start_readonly && md_is_rdwr(mddev))
|
|
|
|
mddev->ro = MD_AUTO_READ; /* read-only, but switch on first write */
|
[PATCH] md: allow md arrays to be started read-only (module parameter).
When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:36 +00:00
|
|
|
|
2014-12-15 01:56:58 +00:00
|
|
|
err = pers->run(mddev);
|
2008-03-25 23:07:03 +00:00
|
|
|
if (err)
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: pers->run() failed ...\n");
|
2014-12-15 01:56:58 +00:00
|
|
|
else if (pers->size(mddev, 0, 0) < mddev->array_sectors) {
|
2016-11-02 03:16:49 +00:00
|
|
|
WARN_ONCE(!mddev->external_size,
|
|
|
|
"%s: default size too small, but 'external_size' not in effect?\n",
|
|
|
|
__func__);
|
|
|
|
pr_warn("md: invalid array_size %llu > default size %llu\n",
|
|
|
|
(unsigned long long)mddev->array_sectors / 2,
|
|
|
|
(unsigned long long)pers->size(mddev, 0, 0) / 2);
|
2009-03-31 04:00:31 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
}
|
2014-12-15 01:56:58 +00:00
|
|
|
if (err == 0 && pers->sync_request &&
|
2012-05-22 03:55:08 +00:00
|
|
|
(mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
|
2014-06-06 17:43:49 +00:00
|
|
|
struct bitmap *bitmap;
|
|
|
|
|
2018-08-01 22:20:50 +00:00
|
|
|
bitmap = md_bitmap_create(mddev, -1);
|
2014-06-06 17:43:49 +00:00
|
|
|
if (IS_ERR(bitmap)) {
|
|
|
|
err = PTR_ERR(bitmap);
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: failed to create bitmap (%d)\n",
|
|
|
|
mdname(mddev), err);
|
2014-06-06 17:43:49 +00:00
|
|
|
} else
|
|
|
|
mddev->bitmap = bitmap;
|
|
|
|
|
2006-01-06 08:20:16 +00:00
|
|
|
}
|
2019-06-14 09:10:39 +00:00
|
|
|
if (err)
|
|
|
|
goto bitmap_abort;
|
2019-06-19 09:30:46 +00:00
|
|
|
|
|
|
|
if (mddev->bitmap_info.max_write_behind > 0) {
|
2019-12-23 09:48:54 +00:00
|
|
|
bool create_pool = false;
|
2019-06-19 09:30:46 +00:00
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (test_bit(WriteMostly, &rdev->flags) &&
|
2019-12-23 09:48:53 +00:00
|
|
|
rdev_init_serial(rdev))
|
2019-12-23 09:48:54 +00:00
|
|
|
create_pool = true;
|
2019-06-19 09:30:46 +00:00
|
|
|
}
|
2019-12-23 09:48:54 +00:00
|
|
|
if (create_pool && mddev->serial_info_pool == NULL) {
|
2019-12-23 09:48:53 +00:00
|
|
|
mddev->serial_info_pool =
|
|
|
|
mempool_create_kmalloc_pool(NR_SERIAL_INFOS,
|
|
|
|
sizeof(struct serial_info));
|
|
|
|
if (!mddev->serial_info_pool) {
|
2019-06-19 09:30:46 +00:00
|
|
|
err = -ENOMEM;
|
2019-06-14 09:10:39 +00:00
|
|
|
goto bitmap_abort;
|
2019-06-19 09:30:46 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-15 01:56:56 +00:00
|
|
|
if (mddev->queue) {
|
2016-09-30 16:45:40 +00:00
|
|
|
bool nonrot = true;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
2022-04-15 04:52:42 +00:00
|
|
|
if (rdev->raid_disk >= 0 && !bdev_nonrot(rdev->bdev)) {
|
2016-09-30 16:45:40 +00:00
|
|
|
nonrot = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (mddev->degraded)
|
|
|
|
nonrot = false;
|
|
|
|
if (nonrot)
|
2018-03-08 01:10:10 +00:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_NONROT, mddev->queue);
|
2016-09-30 16:45:40 +00:00
|
|
|
else
|
2018-03-08 01:10:10 +00:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_NONROT, mddev->queue);
|
2021-05-25 09:46:17 +00:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_IO_STAT, mddev->queue);
|
2022-02-02 17:24:10 +00:00
|
|
|
|
|
|
|
/* Set the NOWAIT flags if all underlying devices support it */
|
|
|
|
if (nowait)
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_NOWAIT, mddev->queue);
|
2014-12-15 01:56:56 +00:00
|
|
|
}
|
2014-12-15 01:56:58 +00:00
|
|
|
if (pers->sync_request) {
|
2010-06-01 09:37:23 +00:00
|
|
|
if (mddev->kobj.sd &&
|
|
|
|
sysfs_create_group(&mddev->kobj, &md_redundancy_group))
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: cannot register extra attributes for %s\n",
|
|
|
|
mdname(mddev));
|
2010-06-01 09:37:23 +00:00
|
|
|
mddev->sysfs_action = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_action");
|
2020-08-05 00:27:18 +00:00
|
|
|
mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
|
|
|
|
mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
|
2022-09-20 02:39:38 +00:00
|
|
|
} else if (mddev->ro == MD_AUTO_READ)
|
|
|
|
mddev->ro = MD_RDWR;
|
2005-11-09 05:39:42 +00:00
|
|
|
|
2009-12-14 01:49:58 +00:00
|
|
|
atomic_set(&mddev->max_corr_read_errors,
|
|
|
|
MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->safemode = 0;
|
2015-10-22 05:01:25 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
mddev->safemode_delay = 0;
|
|
|
|
else
|
2020-07-20 18:08:52 +00:00
|
|
|
mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->in_sync = 1;
|
2011-01-13 22:14:33 +00:00
|
|
|
smp_wmb();
|
2014-12-15 01:56:58 +00:00
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
mddev->pers = pers;
|
|
|
|
spin_unlock(&mddev->lock);
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev)
|
2011-07-27 01:00:36 +00:00
|
|
|
if (rdev->raid_disk >= 0)
|
2019-06-14 22:41:07 +00:00
|
|
|
sysfs_link_rdev(mddev, rdev); /* failure here is OK */
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->degraded && md_is_rdwr(mddev))
|
2015-07-17 01:57:30 +00:00
|
|
|
/* This ensures that recovering status is reported immediately
|
|
|
|
* via sysfs - until a lack of spares is confirmed.
|
|
|
|
*/
|
|
|
|
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
2005-04-16 22:20:36 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
if (mddev->sb_flags)
|
2006-10-03 08:15:46 +00:00
|
|
|
md_update_sb(mddev, 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
2018-01-24 04:17:38 +00:00
|
|
|
|
2019-06-14 09:10:39 +00:00
|
|
|
bitmap_abort:
|
|
|
|
mddev_detach(mddev);
|
|
|
|
if (mddev->private)
|
|
|
|
pers->free(mddev, mddev->private);
|
|
|
|
mddev->private = NULL;
|
|
|
|
module_put(pers->owner);
|
|
|
|
md_bitmap_destroy(mddev);
|
2018-01-24 04:17:38 +00:00
|
|
|
abort:
|
2023-06-21 16:51:04 +00:00
|
|
|
bioset_exit(&mddev->io_clone_set);
|
2023-06-21 16:51:03 +00:00
|
|
|
exit_sync_set:
|
2019-03-29 17:46:16 +00:00
|
|
|
bioset_exit(&mddev->sync_set);
|
2021-05-25 09:46:17 +00:00
|
|
|
exit_bio_set:
|
|
|
|
bioset_exit(&mddev->bio_set);
|
2018-01-24 04:17:38 +00:00
|
|
|
return err;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_run);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-06-07 15:31:19 +00:00
|
|
|
int do_md_run(struct mddev *mddev)
|
2010-03-29 00:10:42 +00:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
2019-08-20 00:21:09 +00:00
|
|
|
set_bit(MD_NOT_READY, &mddev->flags);
|
2010-03-29 00:10:42 +00:00
|
|
|
err = md_run(mddev);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
2018-08-01 22:20:50 +00:00
|
|
|
err = md_bitmap_load(mddev);
|
2010-06-01 09:37:35 +00:00
|
|
|
if (err) {
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2010-06-01 09:37:35 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2011-06-07 22:49:36 +00:00
|
|
|
|
2015-10-22 05:01:25 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
md_allow_write(mddev);
|
|
|
|
|
2017-11-20 06:17:01 +00:00
|
|
|
/* run start up tasks that require md_thread */
|
|
|
|
md_start(mddev);
|
|
|
|
|
2011-06-07 22:49:36 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
|
|
|
|
|
2020-11-16 14:57:11 +00:00
|
|
|
set_capacity_and_notify(mddev->gendisk, mddev->array_sectors);
|
2019-08-20 00:21:09 +00:00
|
|
|
clear_bit(MD_NOT_READY, &mddev->flags);
|
2011-02-24 06:26:41 +00:00
|
|
|
mddev->changed = 1;
|
2010-03-29 00:10:42 +00:00
|
|
|
kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
|
2019-08-20 00:21:09 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_action);
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
|
2010-03-29 00:10:42 +00:00
|
|
|
out:
|
2019-08-20 00:21:09 +00:00
|
|
|
clear_bit(MD_NOT_READY, &mddev->flags);
|
2010-03-29 00:10:42 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-11-20 06:17:01 +00:00
|
|
|
int md_start(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (mddev->pers->start) {
|
|
|
|
set_bit(MD_RECOVERY_WAIT, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
ret = mddev->pers->start(mddev);
|
|
|
|
clear_bit(MD_RECOVERY_WAIT, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(md_start);
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int restart_array(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct gendisk *disk = mddev->gendisk;
|
2017-04-12 22:53:48 +00:00
|
|
|
struct md_rdev *rdev;
|
|
|
|
bool has_journal = false;
|
|
|
|
bool has_readonly = false;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-07-11 12:02:21 +00:00
|
|
|
/* Complain if it has no devices */
|
2005-04-16 22:20:36 +00:00
|
|
|
if (list_empty(&mddev->disks))
|
2008-07-11 12:02:21 +00:00
|
|
|
return -ENXIO;
|
|
|
|
if (!mddev->pers)
|
|
|
|
return -EINVAL;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (md_is_rdwr(mddev))
|
2008-07-11 12:02:21 +00:00
|
|
|
return -EBUSY;
|
2015-10-09 04:54:13 +00:00
|
|
|
|
2017-04-12 22:53:48 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev_for_each_rcu(rdev, mddev) {
|
|
|
|
if (test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(Faulty, &rdev->flags))
|
|
|
|
has_journal = true;
|
2021-02-01 13:17:21 +00:00
|
|
|
if (rdev_read_only(rdev))
|
2017-04-12 22:53:48 +00:00
|
|
|
has_readonly = true;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
if (test_bit(MD_HAS_JOURNAL, &mddev->flags) && !has_journal)
|
2015-10-09 04:54:13 +00:00
|
|
|
/* Don't restart rw with journal missing/faulty */
|
|
|
|
return -EINVAL;
|
2017-04-12 22:53:48 +00:00
|
|
|
if (has_readonly)
|
|
|
|
return -EROFS;
|
2015-10-09 04:54:13 +00:00
|
|
|
|
2008-07-11 12:02:21 +00:00
|
|
|
mddev->safemode = 0;
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
2008-07-11 12:02:21 +00:00
|
|
|
set_disk_ro(disk, 0);
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
|
2008-07-11 12:02:21 +00:00
|
|
|
/* Kick recovery or resync if necessary */
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2008-07-11 12:02:21 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static void md_clean(struct mddev *mddev)
|
2010-03-29 00:37:13 +00:00
|
|
|
{
|
|
|
|
mddev->array_sectors = 0;
|
|
|
|
mddev->external_size = 0;
|
|
|
|
mddev->dev_sectors = 0;
|
|
|
|
mddev->raid_disks = 0;
|
|
|
|
mddev->recovery_cp = 0;
|
|
|
|
mddev->resync_min = 0;
|
|
|
|
mddev->resync_max = MaxSector;
|
|
|
|
mddev->reshape_position = MaxSector;
|
2023-06-17 05:24:04 +00:00
|
|
|
/* we still need mddev->external in export_rdev, do not clear it yet */
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->persistent = 0;
|
|
|
|
mddev->level = LEVEL_NONE;
|
|
|
|
mddev->clevel[0] = 0;
|
|
|
|
mddev->flags = 0;
|
2016-12-08 23:48:19 +00:00
|
|
|
mddev->sb_flags = 0;
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->metadata_type[0] = 0;
|
|
|
|
mddev->chunk_sectors = 0;
|
|
|
|
mddev->ctime = mddev->utime = 0;
|
|
|
|
mddev->layout = 0;
|
|
|
|
mddev->max_disks = 0;
|
|
|
|
mddev->events = 0;
|
2010-05-17 23:28:43 +00:00
|
|
|
mddev->can_decrease_events = 0;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->delta_disks = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->new_level = LEVEL_NONE;
|
|
|
|
mddev->new_layout = 0;
|
|
|
|
mddev->new_chunk_sectors = 0;
|
2023-02-01 07:59:20 +00:00
|
|
|
mddev->curr_resync = MD_RESYNC_NONE;
|
2012-10-11 03:17:59 +00:00
|
|
|
atomic64_set(&mddev->resync_mismatches, 0);
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->suspend_lo = mddev->suspend_hi = 0;
|
|
|
|
mddev->sync_speed_min = mddev->sync_speed_max = 0;
|
|
|
|
mddev->recovery = 0;
|
|
|
|
mddev->in_sync = 0;
|
2011-02-24 06:26:41 +00:00
|
|
|
mddev->changed = 0;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->degraded = 0;
|
|
|
|
mddev->safemode = 0;
|
2015-06-25 07:01:40 +00:00
|
|
|
mddev->private = NULL;
|
2016-08-12 05:42:38 +00:00
|
|
|
mddev->cluster_info = NULL;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
|
|
|
mddev->bitmap_info.default_offset = 0;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.default_space = 0;
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->bitmap_info.chunksize = 0;
|
|
|
|
mddev->bitmap_info.daemon_sleep = 0;
|
|
|
|
mddev->bitmap_info.max_write_behind = 0;
|
2016-08-12 05:42:38 +00:00
|
|
|
mddev->bitmap_info.nodes = 0;
|
2010-03-29 00:37:13 +00:00
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static void __md_stop_writes(struct mddev *mddev)
|
2010-03-29 01:07:53 +00:00
|
|
|
{
|
2013-05-08 23:48:30 +00:00
|
|
|
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
md: add checkings before flush md_misc_wq
Coly reported a possible circular locking dependency with LOCKDEP enabled,
as quoted below from the detailed report [1].
[ 1607.673903] Chain exists of:
[ 1607.673903] kn->count#256 --> (wq_completion)md_misc -->
(work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946] Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780] CPU0 CPU1
[ 1607.952980] ---- ----
[ 1608.007173] lock((work_completion)(&rdev->del_work));
[ 1608.069690] lock((wq_completion)md_misc);
[ 1608.149887] lock((work_completion)(&rdev->del_work));
[ 1608.242563] lock(kn->count#256);
[ 1608.283238]
[ 1608.283238] *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152] #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
process_one_work+0x42b/0xb30
[ 1608.512399] #1: ffff888a1d3b7e10
((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]
Since the works (rdev->del_work and mddev->del_work) are queued on
md_misc_wq, the lockdep_map lock is held while either of them is running,
and both of them try to take the kernfs lock by calling kobject_del. If
new_dev_store or array_state_store is then triggered by a write to the
related sysfs node, the write operation takes the kernfs lock but also
needs the lockdep_map, because both of them eventually call
flush_workqueue(md_misc_wq), which requires the same lockdep_map lock.
To suppress the lockdep warning, we should flush the workqueue only when the
related work is pending. Since several works are attached to md_misc_wq,
we need to decide which work should be checked:
1. for __md_stop_writes, the purpose of flushing the workqueue is to ensure
the sync thread is started if it was starting, so check whether
mddev->del_work is pending, since md_start_sync is attached to it.
2. __md_stop flushes md_misc_wq to ensure event_work is done, so checking
event_work is enough. We assume raid_{ctr,dtr} -> md_stop -> __md_stop
doesn't need the kernfs lock.
3. both new_dev_store (holds the kernfs lock) and the ADD_NEW_DISK ioctl
(holds bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
completed; this case will be handled in the next patch.
4. md_open flushes the workqueue to ensure the previous md has disappeared,
but it holds bdev->bd_mutex while trying to flush the workqueue, so it is
better to check mddev->del_work as well to avoid a potential lock issue;
this will be done in another patch.
[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2
Cc: Coly Li <colyli@suse.de>
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
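The "flush only when the relevant work is pending" idea described in this commit message can be sketched in plain userspace C. This is an illustration only, not the kernel workqueue API: `toy_work`, `toy_wq` and every `toy_*` function are hypothetical names invented for the example, and the model ignores concurrency entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a work item and its workqueue. */
struct toy_work {
	bool pending;		/* queued but not yet run */
};

struct toy_wq {
	struct toy_work *items[4];
	int nr_items;
	int nr_flushes;		/* how many times a flush really happened */
};

/* Queue a work item; models queue_work() only loosely. */
static void toy_queue_work(struct toy_wq *wq, struct toy_work *w)
{
	assert(wq->nr_items < 4);
	w->pending = true;
	wq->items[wq->nr_items++] = w;
}

/* Run everything that is queued; models an unconditional flush. */
static void toy_flush(struct toy_wq *wq)
{
	for (int i = 0; i < wq->nr_items; i++)
		wq->items[i]->pending = false;
	wq->nr_items = 0;
	wq->nr_flushes++;
}

/*
 * The guarded pattern: only flush when the one work item this caller
 * depends on is pending. Returns true if a flush was performed.
 */
static bool toy_flush_if_pending(struct toy_wq *wq, const struct toy_work *w)
{
	if (!w->pending)
		return false;
	toy_flush(wq);
	return true;
}
```

With nothing queued, `toy_flush_if_pending()` returns without touching the queue, which is the case where the real code avoids taking the workqueue's lockdep_map at all.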
2020-04-04 21:57:07 +00:00
|
|
|
if (work_pending(&mddev->del_work))
|
|
|
|
flush_workqueue(md_misc_wq);
|
2010-03-29 01:07:53 +00:00
|
|
|
if (mddev->sync_thread) {
|
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2022-06-07 02:03:56 +00:00
|
|
|
md_reap_sync_thread(mddev);
|
2010-03-29 01:07:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
del_timer_sync(&mddev->safemode_timer);
|
|
|
|
|
2016-11-21 18:29:19 +00:00
|
|
|
if (mddev->pers && mddev->pers->quiesce) {
|
|
|
|
mddev->pers->quiesce(mddev, 1);
|
|
|
|
mddev->pers->quiesce(mddev, 0);
|
|
|
|
}
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_flush(mddev);
|
2010-03-29 01:07:53 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (md_is_rdwr(mddev) &&
|
2015-10-22 05:01:25 +00:00
|
|
|
((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
|
2016-12-08 23:48:19 +00:00
|
|
|
mddev->sb_flags)) {
|
2010-03-29 01:07:53 +00:00
|
|
|
/* mark array as shutdown cleanly */
|
2015-10-22 05:01:25 +00:00
|
|
|
if (!mddev_is_clustered(mddev))
|
|
|
|
mddev->in_sync = 1;
|
2010-03-29 01:07:53 +00:00
|
|
|
md_update_sb(mddev, 1);
|
|
|
|
}
|
2019-12-23 09:49:00 +00:00
|
|
|
/* disable policy to guarantee rdevs free resources for serialization */
|
|
|
|
mddev->serialize_policy = 0;
|
|
|
|
mddev_destroy_serial_pool(mddev, NULL, true);
|
2010-03-29 01:07:53 +00:00
|
|
|
}
|
2011-01-13 22:14:33 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_stop_writes(struct mddev *mddev)
|
2011-01-13 22:14:33 +00:00
|
|
|
{
|
2013-11-14 06:54:51 +00:00
|
|
|
mddev_lock_nointr(mddev);
|
2011-01-13 22:14:33 +00:00
|
|
|
__md_stop_writes(mddev);
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
}
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_stop_writes);
|
2010-03-29 01:07:53 +00:00
|
|
|
|
2014-12-15 01:56:57 +00:00
|
|
|
static void mddev_detach(struct mddev *mddev)
|
|
|
|
{
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_wait_behind_writes(mddev);
|
2023-01-31 05:17:09 +00:00
|
|
|
if (mddev->pers && mddev->pers->quiesce && !is_md_suspended(mddev)) {
|
2014-12-15 01:56:57 +00:00
|
|
|
mddev->pers->quiesce(mddev, 1);
|
|
|
|
mddev->pers->quiesce(mddev, 0);
|
|
|
|
}
|
2023-08-03 07:17:11 +00:00
|
|
|
md_unregister_thread(mddev, &mddev->thread);
|
2014-12-15 01:56:57 +00:00
|
|
|
if (mddev->queue)
|
|
|
|
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
|
|
|
|
}
|
|
|
|
|
2012-11-18 23:47:48 +00:00
|
|
|
static void __md_stop(struct mddev *mddev)
|
2010-03-29 00:37:13 +00:00
|
|
|
{
|
2014-12-15 01:56:58 +00:00
|
|
|
struct md_personality *pers = mddev->pers;
|
2022-08-17 12:05:13 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2014-12-15 01:56:57 +00:00
|
|
|
mddev_detach(mddev);
|
2015-07-22 00:20:07 +00:00
|
|
|
/* Ensure ->event_work is done */
|
2020-04-04 21:57:07 +00:00
|
|
|
if (mddev->event_work.func)
|
|
|
|
flush_workqueue(md_misc_wq);
|
2014-12-15 01:56:58 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2010-03-29 00:37:13 +00:00
|
|
|
mddev->pers = NULL;
|
2014-12-15 01:56:58 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
md: fix double free of mddev->private in autorun_array()
In drivers/md/md.c, calling autorun_array() can lead to a double free.
When do_md_run() returns an error, autorun_array() calls do_md_stop().
do_md_run() calls md_run(), and md_run() may free the pointer
mddev->private.
do_md_stop() then calls __md_stop(), and __md_stop() frees
mddev->private again without checking it for NULL.
mddev->private is therefore freed twice, so __md_stop() needs to check
whether it is NULL before freeing it.
Signed-off-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
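A minimal userspace sketch of the guard this commit message describes. It assumes the failed-start path clears the pointer after freeing it, since a NULL check in the stop path only helps in that case. `toy_array`, `toy_run_failed` and `toy_stop` are hypothetical names; in the kernel the release goes through the personality's `free` callback rather than `free(3)`.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_array {
	void *private;	/* personality-private data, may already be gone */
	int nr_frees;	/* bookkeeping for the example only */
};

/* Error path of a failed start: frees private data and clears the pointer. */
static void toy_run_failed(struct toy_array *a)
{
	assert(a != NULL);
	free(a->private);
	a->private = NULL;
	a->nr_frees++;
}

/* Stop path: release private data only if it is still attached. */
static void toy_stop(struct toy_array *a)
{
	assert(a != NULL);
	if (a->private) {
		free(a->private);
		a->nr_frees++;
	}
	a->private = NULL;
}
```

Calling `toy_run_failed()` followed by `toy_stop()` releases the buffer exactly once; without the check in `toy_stop()` the release count would be two.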
2021-11-16 02:35:26 +00:00
|
|
|
if (mddev->private)
|
|
|
|
pers->free(mddev, mddev->private);
|
2015-06-25 07:01:40 +00:00
|
|
|
mddev->private = NULL;
|
2014-12-15 01:56:58 +00:00
|
|
|
if (pers->sync_request && mddev->to_remove == NULL)
|
|
|
|
mddev->to_remove = &md_redundancy_group;
|
|
|
|
module_put(pers->owner);
|
2010-04-01 01:08:16 +00:00
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2023-02-22 03:59:16 +00:00
|
|
|
|
|
|
|
bioset_exit(&mddev->bio_set);
|
|
|
|
bioset_exit(&mddev->sync_set);
|
2023-06-21 16:51:04 +00:00
|
|
|
bioset_exit(&mddev->io_clone_set);
|
2018-10-19 14:21:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void md_stop(struct mddev *mddev)
|
|
|
|
{
|
2023-07-08 09:21:53 +00:00
|
|
|
lockdep_assert_held(&mddev->reconfig_mutex);
|
|
|
|
|
2018-10-19 14:21:31 +00:00
|
|
|
/* stop the array and free any attached data structures.
|
|
|
|
* This is called from dm-raid
|
|
|
|
*/
|
2022-08-17 12:05:14 +00:00
|
|
|
__md_stop_writes(mddev);
|
2018-10-19 14:21:31 +00:00
|
|
|
__md_stop(mddev);
|
2012-11-18 23:47:48 +00:00
|
|
|
}
|
|
|
|
|
2010-06-01 09:37:27 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_stop);
|
2010-03-29 00:37:13 +00:00
|
|
|
|
2012-07-19 05:59:18 +00:00
|
|
|
static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
|
2010-03-29 02:23:10 +00:00
|
|
|
{
|
|
|
|
int err = 0;
|
2013-11-14 04:16:17 +00:00
|
|
|
int did_freeze = 0;
|
|
|
|
|
|
|
|
if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
|
|
|
|
did_freeze = 1;
|
|
|
|
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2014-12-10 23:02:10 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2013-11-14 04:16:17 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2023-05-23 02:10:13 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Thread might be blocked waiting for metadata update which will now
|
|
|
|
* never happen
|
|
|
|
*/
|
|
|
|
md_wakeup_thread_directly(mddev->sync_thread);
|
2014-12-10 23:02:10 +00:00
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
if (mddev->external && test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
|
2015-09-24 04:00:51 +00:00
|
|
|
return -EBUSY;
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev_unlock(mddev);
|
2014-12-10 23:02:10 +00:00
|
|
|
wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING,
|
|
|
|
&mddev->recovery));
|
2015-09-24 04:00:51 +00:00
|
|
|
wait_event(mddev->sb_wait,
|
2016-12-08 23:48:19 +00:00
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev_lock_nointr(mddev);
|
|
|
|
|
2010-03-29 02:23:10 +00:00
|
|
|
mutex_lock(&mddev->open_mutex);
|
2014-09-09 04:00:15 +00:00
|
|
|
if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev->sync_thread ||
|
2016-08-12 05:42:37 +00:00
|
|
|
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s still in use.\n", mdname(mddev));
|
2013-11-14 04:16:17 +00:00
|
|
|
if (did_freeze) {
|
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2014-10-28 21:49:50 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2013-11-14 04:16:17 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2010-03-29 02:23:10 +00:00
|
|
|
err = -EBUSY;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (mddev->pers) {
|
2011-01-13 22:14:33 +00:00
|
|
|
__md_stop_writes(mddev);
|
2010-03-29 02:23:10 +00:00
|
|
|
|
|
|
|
err = -ENXIO;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro == MD_RDONLY)
|
2010-03-29 02:23:10 +00:00
|
|
|
goto out;
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDONLY;
|
2010-03-29 02:23:10 +00:00
|
|
|
set_disk_ro(mddev->gendisk, 1);
|
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2014-10-28 21:49:50 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2013-11-14 04:16:17 +00:00
|
|
|
err = 0;
|
2010-03-29 02:23:10 +00:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
mutex_unlock(&mddev->open_mutex);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
[PATCH] md: Set/get state of array via sysfs
This allows the state of an md/array to be directly controlled via sysfs and
adds the ability to stop an array without tearing it down.
Array states/settings:
clear
No devices, no size, no level
Equivalent to STOP_ARRAY ioctl
inactive
May have some settings, but array is not active
all IO results in error
When written, doesn't tear down array, but just stops it
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
Writing this, if accepted, will block until array is quiescent
readonly
no resync can happen. no superblocks get written.
write requests fail
read-auto
like readonly, but behaves like 'clean' on a write request.
clean - no pending writes, but otherwise active.
When written to inactive array, starts without resync
If a write request arrives then
if metadata is known, mark 'dirty' and switch to 'active'.
if not known, block and switch to write-pending
If written to an active array that has pending writes, then fails.
active
fully active: IO and resync can be happening.
When written to inactive array, starts with resync
write-pending (not supported yet)
clean, but writes are blocked waiting for 'active' to be written.
active-idle
like active, but no writes have been seen for a while (100msec).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
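The state names listed in this commit message lend themselves to a small name-to-enum table. The following is a userspace sketch under stated assumptions, not the driver's own parser: the `ST_*` enumerators, `toy_state_names`, `toy_parse_state` and `toy_state_supported` are all illustrative names, and tolerating a trailing newline is an assumption about how text typically arrives from `echo`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* States named in the commit message; the enum names are illustrative. */
enum toy_array_state {
	ST_CLEAR, ST_INACTIVE, ST_SUSPENDED, ST_READONLY, ST_READ_AUTO,
	ST_CLEAN, ST_ACTIVE, ST_WRITE_PENDING, ST_ACTIVE_IDLE,
	ST_BAD_WORD	/* sentinel: input matched no known state */
};

static const char *const toy_state_names[] = {
	"clear", "inactive", "suspended", "readonly", "read-auto",
	"clean", "active", "write-pending", "active-idle",
};

/*
 * Map text written to a sysfs-style attribute to a state.
 * A single trailing newline is tolerated, as "echo clean > file" adds one.
 */
static enum toy_array_state toy_parse_state(const char *buf)
{
	size_t len;

	assert(buf != NULL);
	len = strlen(buf);
	if (len > 0 && buf[len - 1] == '\n')
		len--;
	for (int i = 0; i < ST_BAD_WORD; i++) {
		/* Compare lengths too, so "active" never matches "active-idle". */
		if (strlen(toy_state_names[i]) == len &&
		    strncmp(buf, toy_state_names[i], len) == 0)
			return (enum toy_array_state)i;
	}
	return ST_BAD_WORD;
}

/* States the commit message marks as "not supported yet" are rejected. */
static int toy_state_supported(enum toy_array_state st)
{
	return st != ST_SUSPENDED && st != ST_WRITE_PENDING &&
	       st != ST_BAD_WORD;
}
```

Keeping the names in one table means the read side (state to string) and the write side (string to state) cannot drift apart.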
2006-06-26 07:27:58 +00:00
|
|
|
/* mode:
|
|
|
|
* 0 - completely stop and dis-assemble array
|
|
|
|
* 2 - stop but do not disassemble array
|
|
|
|
*/
|
2014-09-30 04:23:59 +00:00
|
|
|
static int do_md_stop(struct mddev *mddev, int mode,
|
2012-07-19 05:59:18 +00:00
|
|
|
struct block_device *bdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct gendisk *disk = mddev->gendisk;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2013-11-14 04:16:17 +00:00
|
|
|
int did_freeze = 0;
|
|
|
|
|
|
|
|
if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
|
|
|
|
did_freeze = 1;
|
|
|
|
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2014-12-10 23:02:10 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2013-11-14 04:16:17 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2023-05-23 02:10:13 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Thread might be blocked waiting for metadata update which will now
|
|
|
|
* never happen
|
|
|
|
*/
|
|
|
|
md_wakeup_thread_directly(mddev->sync_thread);
|
2014-12-10 23:02:10 +00:00
|
|
|
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev_unlock(mddev);
|
2014-12-10 23:02:10 +00:00
|
|
|
wait_event(resync_wait, (mddev->sync_thread == NULL &&
|
|
|
|
!test_bit(MD_RECOVERY_RUNNING,
|
|
|
|
&mddev->recovery)));
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev_lock_nointr(mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-08-10 02:50:52 +00:00
|
|
|
mutex_lock(&mddev->open_mutex);
|
2014-09-09 04:00:15 +00:00
|
|
|
if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
|
2013-11-14 04:16:17 +00:00
|
|
|
mddev->sysfs_active ||
|
|
|
|
mddev->sync_thread ||
|
2016-08-12 05:42:37 +00:00
|
|
|
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s still in use.\n", mdname(mddev));
|
2010-08-07 11:41:19 +00:00
|
|
|
mutex_unlock(&mddev->open_mutex);
|
2013-11-14 04:16:17 +00:00
|
|
|
if (did_freeze) {
|
|
|
|
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
2014-10-28 21:49:50 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2013-11-14 04:16:17 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2013-08-27 06:44:13 +00:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
2010-08-07 11:41:19 +00:00
|
|
|
if (mddev->pers) {
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev))
|
2010-03-29 02:23:10 +00:00
|
|
|
set_disk_ro(disk, 0);
|
2009-03-31 03:39:39 +00:00
|
|
|
|
2011-01-13 22:14:33 +00:00
|
|
|
__md_stop_writes(mddev);
|
2012-11-18 23:47:48 +00:00
|
|
|
__md_stop(mddev);
|
2010-03-29 00:37:13 +00:00
|
|
|
|
2010-03-29 02:23:10 +00:00
|
|
|
/* tell userspace to handle 'inactive' */
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2006-12-10 10:20:44 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev)
|
2011-07-27 01:00:36 +00:00
|
|
|
if (rdev->raid_disk >= 0)
|
|
|
|
sysfs_unlink_rdev(mddev, rdev);
|
2009-05-07 02:51:06 +00:00
|
|
|
|
2020-11-16 14:57:11 +00:00
|
|
|
set_capacity_and_notify(disk, 0);
|
2010-08-07 11:41:19 +00:00
|
|
|
mutex_unlock(&mddev->open_mutex);
|
2011-02-24 06:26:41 +00:00
|
|
|
mddev->changed = 1;
|
2006-12-10 10:20:44 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev))
|
|
|
|
mddev->ro = MD_RDWR;
|
2010-08-07 11:41:19 +00:00
|
|
|
} else
|
|
|
|
mutex_unlock(&mddev->open_mutex);
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Free resources if final stop
|
|
|
|
*/
|
2006-06-26 07:27:58 +00:00
|
|
|
if (mode == 0) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: %s stopped.\n", mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-12-14 01:49:52 +00:00
|
|
|
if (mddev->bitmap_info.file) {
|
2014-12-15 01:57:00 +00:00
|
|
|
struct file *f = mddev->bitmap_info.file;
|
|
|
|
spin_lock(&mddev->lock);
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.file = NULL;
|
2014-12-15 01:57:00 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
fput(f);
|
2006-02-02 22:28:05 +00:00
|
|
|
}
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
2006-02-02 22:28:05 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
export_array(mddev);
|
|
|
|
|
2010-03-29 00:37:13 +00:00
|
|
|
md_clean(mddev);
|
2009-01-08 21:31:10 +00:00
|
|
|
if (mddev->hold_active == UNTIL_STOP)
|
|
|
|
mddev->hold_active = 0;
|
2010-03-29 02:23:10 +00:00
|
|
|
}
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2010-08-07 11:41:19 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2006-12-10 10:20:50 +00:00
|
|
|
#ifndef MODULE
|
2011-10-11 05:47:53 +00:00
|
|
|
static void autorun_array(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
int err;
|
|
|
|
|
2005-04-16 22:26:42 +00:00
|
|
|
if (list_empty(&mddev->disks))
|
2005-04-16 22:20:36 +00:00
|
|
|
return;
|
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: running: ");
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_cont("<%pg>", rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_cont("\n");
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-10-13 00:55:12 +00:00
|
|
|
err = do_md_run(mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (err) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: do_md_run() returned %d\n", err);
|
2012-07-19 05:59:18 +00:00
|
|
|
do_md_stop(mddev, 0, NULL);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* let's try to run arrays based on all disks that have arrived
|
|
|
|
* until now. (those are in pending_raid_disks)
|
|
|
|
*
|
|
|
|
* the method: pick the first pending disk, collect all disks with
|
|
|
|
* the same UUID, remove all from the pending list and put them into
|
|
|
|
* the 'same_array' list. Then order this list based on superblock
|
|
|
|
* update time (freshest comes first), kick out 'old' disks and
|
|
|
|
* compare superblocks. If everything's fine then run it.
|
|
|
|
*
|
|
|
|
* If "unit" is allocated, then bump its reference count
|
|
|
|
*/
|
|
|
|
static void autorun_devices(int part)
|
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev0, *rdev, *tmp;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: autorun ...\n");
|
2005-04-16 22:20:36 +00:00
|
|
|
while (!list_empty(&pending_raid_disks)) {
|
2006-10-03 08:15:59 +00:00
|
|
|
int unit;
|
2005-04-16 22:20:36 +00:00
|
|
|
dev_t dev;
|
2006-03-27 09:18:07 +00:00
|
|
|
LIST_HEAD(candidates);
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev0 = list_entry(pending_raid_disks.next,
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev, same_set);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: considering %pg ...\n", rdev0->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
INIT_LIST_HEAD(&candidates);
|
2009-01-08 21:31:08 +00:00
|
|
|
rdev_for_each_list(rdev, tmp, &pending_raid_disks)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (super_90_load(rdev, rdev0, 0) >= 0) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: adding %pg ...\n",
|
|
|
|
rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
list_move(&rdev->same_set, &candidates);
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* now we have a set of devices, with all of them having
|
|
|
|
* mostly sane superblocks. It's time to allocate the
|
|
|
|
* mddev.
|
|
|
|
*/
|
2006-10-03 08:15:59 +00:00
|
|
|
if (part) {
|
|
|
|
dev = MKDEV(mdp_major,
|
|
|
|
rdev0->preferred_minor << MdpMinorShift);
|
|
|
|
unit = MINOR(dev) >> MdpMinorShift;
|
|
|
|
} else {
|
|
|
|
dev = MKDEV(MD_MAJOR, rdev0->preferred_minor);
|
|
|
|
unit = MINOR(dev);
|
|
|
|
}
|
|
|
|
if (rdev0->preferred_minor != unit) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: unit number in %pg is bad: %d\n",
|
|
|
|
rdev0->bdev, rdev0->preferred_minor);
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2022-07-23 06:24:29 +00:00
|
|
|
mddev = md_alloc(dev, NULL);
|
|
|
|
if (IS_ERR(mddev))
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
md: split mddev_find
Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with finding or allocating an mddev.
This turns out to fix this bug reported by Zhao Heming.
----------------------------- snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creation and removal.
md_open shouldn't create an mddev when the all_mddevs list doesn't contain
it. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt environment.
*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt
*** script ***
triggers about 1 time in 10 tests
`1 node1="15sp3-mdcluster1"
2 node2="15sp3-mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..100}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
`
I use an mdcluster env to trigger the soft lockup, but it isn't an
mdcluster-specific bug. Stopping an md array in an mdcluster env does more
work than for a non-cluster array, which leaves enough of a time gap for the
kernel to run md_open.
*** stack ***
`ID: 2831 TASK: ffff8dd7223b5040 CPU: 0 COMMAND: "mdadm"
#0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
#1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
#2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
#3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
#4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
#5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
#6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
#7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
#8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
#9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
or:
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
`
*** root cause ***
"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wake up "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. At this time the mddev still exists in the kernel, so
/proc/mdstat still shows the md device, which makes "mdadm --monitor" open
/dev/md0.
The earlier "mdadm -Ss" is a removing action, while the "mdadm --monitor"
open triggers md_open, which is a creating action. The two actions
race.
`<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening
<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
`
In a non-preempt kernel, <thread 2> keeps occupying the current CPU, so
mddev_delayed_delete, which was queued by <thread 1>, can't be
scheduled.
In a preempt kernel, the same race can be triggered, but the kernel doesn't
allow one thread to run on a CPU all the time. After <thread 2> has run for
some time, the later "mdadm -A" (see script line 13 above) will call
md_alloc to allocate a new gendisk for the mddev. That breaks the md_open
check "if (mddev->gendisk != bdev->bd_disk)", so md_open returns 0 to the
caller and the soft lockup is broken.
------------------------------ snip ------------------------------
Cc: stable@vger.kernel.org
Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
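The split described in this commit message, a pure lookup kept separate from a lookup that may allocate, can be sketched in userspace C as follows. It is a simplified model with no locking or reference counting, and `toy_dev`, `toy_all_devs`, `toy_find` and `toy_find_or_alloc` are hypothetical names rather than the driver's functions.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dev {
	int unit;
	struct toy_dev *next;
};

static struct toy_dev *toy_all_devs;	/* stands in for the all_mddevs list */

/* Pure lookup: never creates anything, so an open path can use it safely. */
static struct toy_dev *toy_find(int unit)
{
	for (struct toy_dev *d = toy_all_devs; d; d = d->next)
		if (d->unit == unit)
			return d;
	return NULL;
}

/* Lookup that falls back to allocation; reserved for explicit creation. */
static struct toy_dev *toy_find_or_alloc(int unit)
{
	struct toy_dev *d;

	assert(unit >= 0);
	d = toy_find(unit);
	if (d)
		return d;
	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->unit = unit;
	d->next = toy_all_devs;
	toy_all_devs = d;
	return d;
}
```

An open path that calls only `toy_find()` gets NULL for a device that is being torn down, instead of silently recreating it, which is the behaviour the race analysis above asks for.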
2021-04-03 16:15:29 +00:00
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
if (mddev_lock(mddev))
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: %s locked, cannot run\n", mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
else if (mddev->raid_disks || mddev->major_version
|
|
|
|
|| !list_empty(&mddev->disks)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %s already running, cannot run %pg\n",
|
|
|
|
mdname(mddev), rdev0->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev_unlock(mddev);
|
|
|
|
} else {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: created %s\n", mdname(mddev));
|
2008-02-06 09:39:53 +00:00
|
|
|
mddev->persistent = 1;
|
2009-01-08 21:31:08 +00:00
|
|
|
rdev_for_each_list(rdev, tmp, &candidates) {
|
2005-04-16 22:20:36 +00:00
|
|
|
list_del_init(&rdev->same_set);
|
|
|
|
if (bind_rdev_to_array(rdev, mddev))
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
autorun_array(mddev);
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
}
|
|
|
|
/* on success, candidates will be empty, on error
|
|
|
|
* it won't...
|
|
|
|
*/
|
2009-01-08 21:31:08 +00:00
|
|
|
rdev_for_each_list(rdev, tmp, &candidates) {
|
2008-07-21 07:05:25 +00:00
|
|
|
list_del_init(&rdev->same_set);
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2008-07-21 07:05:25 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev_put(mddev);
|
|
|
|
}
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: ... autorun DONE.\n");
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2006-12-10 10:20:50 +00:00
|
|
|
#endif /* !MODULE */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int get_version(void __user *arg)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdu_version_t ver;
|
|
|
|
|
|
|
|
ver.major = MD_MAJOR_VERSION;
|
|
|
|
ver.minor = MD_MINOR_VERSION;
|
|
|
|
ver.patchlevel = MD_PATCHLEVEL_VERSION;
|
|
|
|
|
|
|
|
if (copy_to_user(arg, &ver, sizeof(ver)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int get_array_info(struct mddev *mddev, void __user *arg)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdu_array_info_t info;
|
2009-09-23 08:06:41 +00:00
|
|
|
int nr, working, insync, failed, spare;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-10-11 02:37:33 +00:00
|
|
|
nr = working = insync = failed = spare = 0;
|
|
|
|
rcu_read_lock();
|
|
|
|
rdev_for_each_rcu(rdev, mddev) {
|
2005-04-16 22:20:36 +00:00
|
|
|
nr++;
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
failed++;
|
|
|
|
else {
|
|
|
|
working++;
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(In_sync, &rdev->flags))
|
2014-09-30 04:23:59 +00:00
|
|
|
insync++;
|
2016-08-12 00:14:45 +00:00
|
|
|
else if (test_bit(Journal, &rdev->flags))
|
|
|
|
/* TODO: add journal count to md_u.h */
|
|
|
|
;
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
|
|
|
spare++;
|
|
|
|
}
|
|
|
|
}
|
2012-10-11 02:37:33 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
info.major_version = mddev->major_version;
|
|
|
|
info.minor_version = mddev->minor_version;
|
|
|
|
info.patch_version = MD_PATCHLEVEL_VERSION;
|
2015-12-20 23:51:01 +00:00
|
|
|
info.ctime = clamp_t(time64_t, mddev->ctime, 0, U32_MAX);
|
2005-04-16 22:20:36 +00:00
|
|
|
info.level = mddev->level;
|
2009-03-31 03:33:13 +00:00
|
|
|
info.size = mddev->dev_sectors / 2;
|
|
|
|
if (info.size != mddev->dev_sectors / 2) /* overflow */
|
2006-02-03 11:03:40 +00:00
|
|
|
info.size = -1;
|
2005-04-16 22:20:36 +00:00
|
|
|
info.nr_disks = nr;
|
|
|
|
info.raid_disks = mddev->raid_disks;
|
|
|
|
info.md_minor = mddev->md_minor;
|
|
|
|
info.not_persistent = !mddev->persistent;
|
|
|
|
|
2015-12-20 23:51:01 +00:00
|
|
|
info.utime = clamp_t(time64_t, mddev->utime, 0, U32_MAX);
|
2005-04-16 22:20:36 +00:00
|
|
|
info.state = 0;
|
|
|
|
if (mddev->in_sync)
|
|
|
|
info.state = (1<<MD_SB_CLEAN);
|
2009-12-14 01:49:52 +00:00
|
|
|
if (mddev->bitmap && mddev->bitmap_info.offset)
|
2014-07-02 01:35:06 +00:00
|
|
|
info.state |= (1<<MD_SB_BITMAP_PRESENT);
|
2014-11-26 18:22:03 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
info.state |= (1<<MD_SB_CLUSTERED);
|
2009-09-23 08:06:41 +00:00
|
|
|
info.active_disks = insync;
|
2005-04-16 22:20:36 +00:00
|
|
|
info.working_disks = working;
|
|
|
|
info.failed_disks = failed;
|
|
|
|
info.spare_disks = spare;
|
|
|
|
|
|
|
|
info.layout = mddev->layout;
|
2009-06-17 22:45:01 +00:00
|
|
|
info.chunk_size = mddev->chunk_sectors << 9;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (copy_to_user(arg, &info, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int get_bitmap_file(struct mddev *mddev, void __user * arg)
|
2005-06-22 00:17:14 +00:00
|
|
|
{
|
|
|
|
mdu_bitmap_file_t *file = NULL; /* too big for stack allocation */
|
2014-12-15 01:57:00 +00:00
|
|
|
char *ptr;
|
2014-12-15 01:57:00 +00:00
|
|
|
int err;
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2015-07-25 14:36:50 +00:00
|
|
|
file = kzalloc(sizeof(*file), GFP_NOIO);
|
2005-06-22 00:17:14 +00:00
|
|
|
if (!file)
|
2014-12-15 01:57:00 +00:00
|
|
|
return -ENOMEM;
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2014-12-15 01:57:00 +00:00
|
|
|
err = 0;
|
|
|
|
spin_lock(&mddev->lock);
|
2015-07-25 14:36:50 +00:00
|
|
|
/* bitmap enabled */
|
|
|
|
if (mddev->bitmap_info.file) {
|
|
|
|
ptr = file_path(mddev->bitmap_info.file, file->pathname,
|
|
|
|
sizeof(file->pathname));
|
|
|
|
if (IS_ERR(ptr))
|
|
|
|
err = PTR_ERR(ptr);
|
|
|
|
else
|
|
|
|
memmove(file->pathname, ptr,
|
|
|
|
sizeof(file->pathname)-(ptr-file->pathname));
|
|
|
|
}
|
2014-12-15 01:57:00 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2014-12-15 01:57:00 +00:00
|
|
|
if (err == 0 &&
|
|
|
|
copy_to_user(arg, file, sizeof(*file)))
|
2005-06-22 00:17:14 +00:00
|
|
|
err = -EFAULT;
|
2014-12-15 01:57:00 +00:00
|
|
|
|
2005-06-22 00:17:14 +00:00
|
|
|
kfree(file);
|
|
|
|
return err;
|
|
|
|
}
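get_bitmap_file() above relies on file_path() returning a pointer somewhere inside the caller's buffer rather than at its start, which is why the result is shifted to the front with memmove(). The sketch below reproduces that pattern in user space. demo_path_at_end() is a hypothetical stand-in that places the string at the end of the buffer; it is an assumption about the helper's behaviour for illustration, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a d_path()-style helper: writes the string
 * at the END of buf and returns a pointer to where the text starts.
 * Returns NULL if the path (plus NUL) does not fit.
 */
static char *demo_path_at_end(const char *path, char *buf, size_t buflen)
{
	size_t need = strlen(path) + 1;	/* include the terminating NUL */

	assert(buf != NULL);
	if (need > buflen)
		return NULL;
	memcpy(buf + buflen - need, path, need);
	return buf + buflen - need;
}

/* Shift the text to the front of buf, as get_bitmap_file() does. */
static int demo_path_to_front(const char *path, char *buf, size_t buflen)
{
	char *ptr = demo_path_at_end(path, buf, buflen);

	if (!ptr)
		return -1;
	/* Source and destination may overlap, hence memmove(), not memcpy(). */
	memmove(buf, ptr, buflen - (size_t)(ptr - buf));
	return 0;
}
```

The copy length `buflen - (ptr - buf)` is exactly the number of bytes from `ptr` to the end of the buffer, so the NUL terminator moves with the text.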
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int get_disk_info(struct mddev *mddev, void __user * arg)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
mdu_disk_info_t info;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (copy_from_user(&info, arg, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2012-10-11 02:37:33 +00:00
|
|
|
rcu_read_lock();
|
2015-04-14 15:43:55 +00:00
|
|
|
rdev = md_find_rdev_nr_rcu(mddev, info.number);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (rdev) {
|
|
|
|
info.major = MAJOR(rdev->bdev->bd_dev);
|
|
|
|
info.minor = MINOR(rdev->bdev->bd_dev);
|
|
|
|
info.raid_disk = rdev->raid_disk;
|
|
|
|
info.state = 0;
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
info.state |= (1<<MD_DISK_FAULTY);
|
2005-11-09 05:39:31 +00:00
|
|
|
else if (test_bit(In_sync, &rdev->flags)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
info.state |= (1<<MD_DISK_ACTIVE);
|
|
|
|
info.state |= (1<<MD_DISK_SYNC);
|
|
|
|
}
|
2015-10-12 23:59:50 +00:00
|
|
|
if (test_bit(Journal, &rdev->flags))
|
2015-08-13 21:31:55 +00:00
|
|
|
info.state |= (1<<MD_DISK_JOURNAL);
|
2005-09-09 23:23:45 +00:00
|
|
|
if (test_bit(WriteMostly, &rdev->flags))
|
|
|
|
info.state |= (1<<MD_DISK_WRITEMOSTLY);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (test_bit(FailFast, &rdev->flags))
|
|
|
|
info.state |= (1<<MD_DISK_FAILFAST);
|
2005-04-16 22:20:36 +00:00
|
|
|
} else {
|
|
|
|
info.major = info.minor = 0;
|
|
|
|
info.raid_disk = -1;
|
|
|
|
info.state = (1<<MD_DISK_REMOVED);
|
|
|
|
}
|
2012-10-11 02:37:33 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (copy_to_user(arg, &info, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
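The state word built in get_disk_info() above follows a fixed precedence: a missing device reports only REMOVED, a faulty device never reports ACTIVE or SYNC, and an in-sync device reports both. A small user-space sketch of that precedence follows. The bit positions are assumed to match md_p.h and are redefined locally; the `DEMO_` flags are hypothetical stand-ins for the rdev flag bits.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed to mirror md_p.h; redefined for a standalone sketch. */
#define DEMO_MD_DISK_FAULTY  0
#define DEMO_MD_DISK_ACTIVE  1
#define DEMO_MD_DISK_SYNC    2
#define DEMO_MD_DISK_REMOVED 3

/* Hypothetical stand-ins for the Faulty and In_sync rdev flag bits. */
enum demo_rdev_flag { DEMO_FAULTY = 1, DEMO_IN_SYNC = 2 };

static uint32_t demo_disk_state(int present, unsigned int flags)
{
	uint32_t state = 0;

	assert((flags & ~(unsigned int)(DEMO_FAULTY | DEMO_IN_SYNC)) == 0);

	if (!present)			/* no rdev with that number */
		return 1u << DEMO_MD_DISK_REMOVED;

	if (flags & DEMO_FAULTY)	/* Faulty wins over In_sync */
		state |= 1u << DEMO_MD_DISK_FAULTY;
	else if (flags & DEMO_IN_SYNC)
		state |= (1u << DEMO_MD_DISK_ACTIVE) | (1u << DEMO_MD_DISK_SYNC);
	return state;
}
```

A present device that is neither faulty nor in sync (for example a spare) ends up with none of these bits set.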
|
|
|
|
|
2020-06-07 15:31:19 +00:00
|
|
|
int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
dev_t dev = MKDEV(info->major, info->minor);
|
|
|
|
|
2014-10-29 23:51:31 +00:00
|
|
|
if (mddev_is_clustered(mddev) &&
|
|
|
|
!(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: Cannot add to clustered mddev.\n",
|
|
|
|
mdname(mddev));
|
2014-10-29 23:51:31 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (info->major != MAJOR(dev) || info->minor != MINOR(dev))
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
|
|
|
if (!mddev->raid_disks) {
|
|
|
|
int err;
|
|
|
|
/* expecting a device which has a superblock */
|
|
|
|
rdev = md_import_device(dev, mddev->major_version, mddev->minor_version);
|
|
|
|
if (IS_ERR(rdev)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: md_import_device returned %ld\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
PTR_ERR(rdev));
|
|
|
|
return PTR_ERR(rdev);
|
|
|
|
}
|
|
|
|
if (!list_empty(&mddev->disks)) {
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev0
|
|
|
|
= list_entry(mddev->disks.next,
|
|
|
|
struct md_rdev, same_set);
|
2009-09-23 08:06:41 +00:00
|
|
|
err = super_types[mddev->major_version]
|
2005-04-16 22:20:36 +00:00
|
|
|
.load_super(rdev, rdev0, mddev->minor_version);
|
|
|
|
if (err < 0) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: %pg has different UUID to %pg\n",
|
|
|
|
rdev->bdev,
|
|
|
|
rdev0->bdev);
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
err = bind_rdev_to_array(rdev, mddev);
|
|
|
|
if (err)
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-06-07 15:31:19 +00:00
|
|
|
* md_add_new_disk can be used once the array is assembled
|
2005-04-16 22:20:36 +00:00
|
|
|
* to add "hot spares". They must already have a superblock
|
|
|
|
* written
|
|
|
|
*/
|
|
|
|
if (mddev->pers) {
|
|
|
|
int err;
|
|
|
|
if (!mddev->pers->hot_add_disk) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: personality does not support diskops!\n",
|
|
|
|
mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2005-09-09 23:23:50 +00:00
|
|
|
if (mddev->persistent)
|
|
|
|
rdev = md_import_device(dev, mddev->major_version,
|
|
|
|
mddev->minor_version);
|
|
|
|
else
|
|
|
|
rdev = md_import_device(dev, -1, -1);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (IS_ERR(rdev)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: md_import_device returned %ld\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
PTR_ERR(rdev));
|
|
|
|
return PTR_ERR(rdev);
|
|
|
|
}
|
2010-12-09 05:36:28 +00:00
|
|
|
/* set saved_raid_disk if appropriate */
|
2005-06-22 00:17:25 +00:00
|
|
|
if (!mddev->persistent) {
|
|
|
|
if (info->state & (1<<MD_DISK_SYNC) &&
|
2011-01-11 22:03:35 +00:00
|
|
|
info->raid_disk < mddev->raid_disks) {
|
2005-06-22 00:17:25 +00:00
|
|
|
rdev->raid_disk = info->raid_disk;
|
2013-12-11 23:13:33 +00:00
|
|
|
clear_bit(Bitmap_sync, &rdev->flags);
|
2011-01-11 22:03:35 +00:00
|
|
|
} else
|
2005-06-22 00:17:25 +00:00
|
|
|
rdev->raid_disk = -1;
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However, updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date devices
records that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously they don't (some have old event counts), so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Doing
that is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and records that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
rdev->saved_raid_disk = rdev->raid_disk;
|
2005-06-22 00:17:25 +00:00
|
|
|
} else
|
|
|
|
super_types[mddev->major_version].
|
|
|
|
validate_super(mddev, rdev);
|
2011-05-11 04:26:20 +00:00
|
|
|
if ((info->state & (1<<MD_DISK_SYNC)) &&
|
2012-07-03 05:59:06 +00:00
|
|
|
rdev->raid_disk != info->raid_disk) {
|
2011-05-11 04:26:20 +00:00
|
|
|
/* This was a hot-add request, but the event counts don't
|
|
|
|
* match, so reject it.
|
|
|
|
*/
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2011-05-11 04:26:20 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2005-11-09 05:39:31 +00:00
|
|
|
clear_bit(In_sync, &rdev->flags); /* just to be sure */
|
2005-09-09 23:23:45 +00:00
|
|
|
if (info->state & (1<<MD_DISK_WRITEMOSTLY))
|
|
|
|
set_bit(WriteMostly, &rdev->flags);
|
2009-03-31 03:33:13 +00:00
|
|
|
else
|
|
|
|
clear_bit(WriteMostly, &rdev->flags);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (info->state & (1<<MD_DISK_FAILFAST))
|
|
|
|
set_bit(FailFast, &rdev->flags);
|
|
|
|
else
|
|
|
|
clear_bit(FailFast, &rdev->flags);
|
2005-09-09 23:23:45 +00:00
|
|
|
|
2015-12-20 23:51:02 +00:00
|
|
|
if (info->state & (1<<MD_DISK_JOURNAL)) {
|
|
|
|
struct md_rdev *rdev2;
|
|
|
|
bool has_journal = false;
|
|
|
|
|
|
|
|
/* make sure no existing journal disk */
|
|
|
|
rdev_for_each(rdev2, mddev) {
|
|
|
|
if (test_bit(Journal, &rdev2->flags)) {
|
|
|
|
has_journal = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2017-10-17 03:24:09 +00:00
|
|
|
if (has_journal || mddev->bitmap) {
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2015-12-20 23:51:02 +00:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
2015-08-13 21:31:55 +00:00
|
|
|
set_bit(Journal, &rdev->flags);
|
2015-12-20 23:51:02 +00:00
|
|
|
}
|
2014-10-29 23:51:31 +00:00
|
|
|
/*
|
|
|
|
* check whether the device shows up in other nodes
|
|
|
|
*/
|
|
|
|
if (mddev_is_clustered(mddev)) {
|
2015-10-01 18:20:27 +00:00
|
|
|
if (info->state & (1 << MD_DISK_CANDIDATE))
|
2014-10-29 23:51:31 +00:00
|
|
|
set_bit(Candidate, &rdev->flags);
|
2015-10-01 18:20:27 +00:00
|
|
|
else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) {
|
2014-10-29 23:51:31 +00:00
|
|
|
/* --add initiated by this node */
|
2015-10-01 18:20:27 +00:00
|
|
|
err = md_cluster_ops->add_new_disk(mddev, rdev);
|
2014-10-29 23:51:31 +00:00
|
|
|
if (err) {
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2014-10-29 23:51:31 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->raid_disk = -1;
|
|
|
|
err = bind_rdev_to_array(rdev, mddev);
|
2015-10-01 18:20:27 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (err)
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2015-10-01 18:20:27 +00:00
|
|
|
|
|
|
|
if (mddev_is_clustered(mddev)) {
|
2016-08-12 05:42:34 +00:00
|
|
|
if (info->state & (1 << MD_DISK_CANDIDATE)) {
|
|
|
|
if (!err) {
|
|
|
|
err = md_cluster_ops->new_disk_ack(mddev,
|
|
|
|
err == 0);
|
|
|
|
if (err)
|
|
|
|
md_kick_rdev_from_array(rdev);
|
|
|
|
}
|
|
|
|
} else {
|
2015-10-01 18:20:27 +00:00
|
|
|
if (err)
|
|
|
|
md_cluster_ops->add_new_disk_cancel(mddev);
|
|
|
|
else
|
|
|
|
err = add_bound_rdev(rdev);
|
|
|
|
}
|
|
|
|
|
|
|
|
} else if (!err)
|
2015-04-14 15:45:22 +00:00
|
|
|
err = add_bound_rdev(rdev);
|
2015-10-01 18:20:27 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2020-06-07 15:31:19 +00:00
|
|
|
/* otherwise, md_add_new_disk is only allowed
|
2005-04-16 22:20:36 +00:00
|
|
|
* for major_version==0 superblocks
|
|
|
|
*/
|
|
|
|
if (mddev->major_version != 0) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: ADD_NEW_DISK not supported\n", mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(info->state & (1<<MD_DISK_FAULTY))) {
|
|
|
|
int err;
|
2008-10-13 00:55:12 +00:00
|
|
|
rdev = md_import_device(dev, -1, 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (IS_ERR(rdev)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: error, md_import_device() returned %ld\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
PTR_ERR(rdev));
|
|
|
|
return PTR_ERR(rdev);
|
|
|
|
}
|
|
|
|
rdev->desc_nr = info->number;
|
|
|
|
if (info->raid_disk < mddev->raid_disks)
|
|
|
|
rdev->raid_disk = info->raid_disk;
|
|
|
|
else
|
|
|
|
rdev->raid_disk = -1;
|
|
|
|
|
|
|
|
if (rdev->raid_disk < mddev->raid_disks)
|
2005-11-09 05:39:31 +00:00
|
|
|
if (info->state & (1<<MD_DISK_SYNC))
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-09-09 23:23:45 +00:00
|
|
|
if (info->state & (1<<MD_DISK_WRITEMOSTLY))
|
|
|
|
set_bit(WriteMostly, &rdev->flags);
|
2016-11-18 05:16:11 +00:00
|
|
|
if (info->state & (1<<MD_DISK_FAILFAST))
|
|
|
|
set_bit(FailFast, &rdev->flags);
|
2005-09-09 23:23:45 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!mddev->persistent) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: nonpersistent superblock ...\n");
|
2021-10-18 10:11:06 +00:00
|
|
|
rdev->sb_start = bdev_nr_sectors(rdev->bdev);
|
2010-11-08 13:39:12 +00:00
|
|
|
} else
|
2011-01-13 22:14:33 +00:00
|
|
|
rdev->sb_start = calc_dev_sboffset(rdev);
|
2009-06-17 22:48:58 +00:00
|
|
|
rdev->sectors = rdev->sb_start;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-01-06 08:20:55 +00:00
|
|
|
err = bind_rdev_to_array(rdev, mddev);
|
|
|
|
if (err) {
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2006-01-06 08:20:55 +00:00
|
|
|
return err;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
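md_add_new_disk() above rejects a request with -EOVERFLOW when the major/minor pair from user space does not survive a round trip through MKDEV(), MAJOR() and MINOR(). The sketch below illustrates why that check catches out-of-range values. It assumes the in-kernel encoding of a 32-bit device number with 20 minor bits; the `demo_` helpers are hypothetical re-implementations for user space, not the kernel macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MINORBITS 20	/* assumed in-kernel minor width */
#define DEMO_MINORMASK ((1u << DEMO_MINORBITS) - 1)

typedef uint32_t demo_dev_t;

static demo_dev_t demo_mkdev(uint32_t ma, uint32_t mi)
{
	return (ma << DEMO_MINORBITS) | mi;	/* high major bits fall off */
}

static uint32_t demo_major(demo_dev_t dev) { return dev >> DEMO_MINORBITS; }
static uint32_t demo_minor(demo_dev_t dev) { return dev & DEMO_MINORMASK; }

/* Return 0 if (major, minor) fits the encoding, -EOVERFLOW otherwise. */
static int demo_check_devnum(int major, int minor)
{
	demo_dev_t dev = demo_mkdev((uint32_t)major, (uint32_t)minor);

	if ((int)demo_major(dev) != major || (int)demo_minor(dev) != minor)
		return -EOVERFLOW;

	/* Once the pair fits, re-encoding the decoded parts is lossless. */
	assert(demo_mkdev(demo_major(dev), demo_minor(dev)) == dev);
	return 0;
}
```

An oversized minor spills into the major field and an oversized major is truncated, so in both cases the decoded pair differs from the input and the request is refused before any device lookup happens.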
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int hot_remove_disk(struct mddev *mddev, dev_t dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
md: fix NULL dereference of mddev->pers in remove_and_add_spares()
We hit a NULL pointer BUG as follows:
[ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[ 151.762039] Oops: 0000 [#1] SMP PTI
[ 151.762406] Modules linked in:
[ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[ 151.771272] Call Trace:
[ 151.771542] md_ioctl+0x1df2/0x1e10
[ 151.771906] ? __switch_to+0x129/0x440
[ 151.772295] ? __schedule+0x244/0x850
[ 151.772672] blkdev_ioctl+0x4bd/0x970
[ 151.773048] block_ioctl+0x39/0x40
[ 151.773402] do_vfs_ioctl+0xa4/0x610
[ 151.773770] ? dput.part.23+0x87/0x100
[ 151.774151] ksys_ioctl+0x70/0x80
[ 151.774493] __x64_sys_ioctl+0x16/0x20
[ 151.774877] do_syscall_64+0x5b/0x180
[ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9
For raid6, when two disks of the array are offline, two spare disks can
be added to the array. If the system reboots before recovery of the
spare disks completes, mdadm thinks it is OK to restart the degraded
array via md_ioctl(). Since the disks in raid6 are not only_parity(),
raid5_run() will abort when there is no PPL feature or the
'start_dirty_degraded' parameter is not set. Therefore, mddev->pers is NULL.
However, mddev->raid_disks has been set and is not cleared when
raid5_run() aborts. mdadm can then issue 'HOT_REMOVE_DISK' through
md_ioctl() to remove a disk, which finally causes a NULL pointer
dereference in remove_and_add_spares().
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-04 10:08:10 +00:00
|
|
|
if (!mddev->pers)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev = find_rdev(mddev, dev);
|
|
|
|
if (!rdev)
|
|
|
|
return -ENXIO;
|
|
|
|
|
2015-09-28 15:27:26 +00:00
|
|
|
if (rdev->raid_disk < 0)
|
|
|
|
goto kick_rdev;
|
2014-06-07 06:44:51 +00:00
|
|
|
|
2013-04-24 01:42:41 +00:00
|
|
|
clear_bit(Blocked, &rdev->flags);
|
|
|
|
remove_and_add_spares(mddev, rdev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (rdev->raid_disk >= 0)
|
|
|
|
goto busy;
|
|
|
|
|
2015-09-28 15:27:26 +00:00
|
|
|
kick_rdev:
|
md/cluster: fix deadlock when node is doing resync job
md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages exclusively.
While sending a message, a node can concurrently receive a message from another node.
When a node is doing a resync job, grabbing token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------        --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                               send RESYNCING
                                + set MD_CLUSTER_SEND_LOCK
                                + wait for holding token_lockres:EX
                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)
                            d.
                            recv_daemon //METADATA_UPDATED from A
                             process_metadata_update
                              + (mddev_trylock(mddev) ||
                                 MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED.
This blocks the other node from sending messages.
b. B does sync jobs, which send RESYNCING at intervals.
This is blocked waiting to hold the token_lockres:EX lock.
c. B runs "mdadm --remove", which sends REMOVE.
This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
d. B receives the METADATA_UPDATED message sent by A in step <a>.
This is blocked by step <c>: the mddev lock is already held, so
wait_event can't take the mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c is "update sb". In this patch, step c is
"mdadm --remove".
To fix this issue, we can follow the approach of
metadata_update_start, which performs the same lock_token grab.
lock_comm can use the same steps to avoid the deadlock, by moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
This slightly enlarges the window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe and breaks the deadlock.
Repro steps (I triggered it only 3 times in hundreds of tests):
Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
The test script hangs when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```
Finally, thanks to Xiao for the solution.
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-19 11:41:34 +00:00
|
|
|
if (mddev_is_clustered(mddev)) {
|
|
|
|
if (md_cluster_ops->remove_disk(mddev, rdev))
|
|
|
|
goto busy;
|
|
|
|
}
|
2015-04-14 15:44:44 +00:00
|
|
|
|
2015-04-14 15:43:24 +00:00
|
|
|
md_kick_rdev_from_array(rdev);
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2016-11-04 05:46:03 +00:00
|
|
|
if (mddev->thread)
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
else
|
|
|
|
md_update_sb(mddev, 1);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
busy:
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_debug("md: cannot remove active disk %pg from %s ...\n",
|
|
|
|
rdev->bdev, mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int hot_add_disk(struct mddev *mddev, dev_t dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int err;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!mddev->pers)
|
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
if (mddev->major_version != 0) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: HOT_ADD may only be used with version-0 superblocks.\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
mdname(mddev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
if (!mddev->pers->hot_add_disk) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: personality does not support diskops!\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
mdname(mddev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2008-10-13 00:55:12 +00:00
|
|
|
rdev = md_import_device(dev, -1, 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (IS_ERR(rdev)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: error, md_import_device() returned %ld\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
PTR_ERR(rdev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mddev->persistent)
|
2011-01-13 22:14:33 +00:00
|
|
|
rdev->sb_start = calc_dev_sboffset(rdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
2021-10-18 10:11:06 +00:00
|
|
|
rdev->sb_start = bdev_nr_sectors(rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-06-17 22:48:58 +00:00
|
|
|
rdev->sectors = rdev->sb_start;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_warn("md: can not hot-add faulty %pg disk to %s!\n",
|
|
|
|
rdev->bdev, mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
err = -EINVAL;
|
|
|
|
goto abort_export;
|
|
|
|
}
|
2014-06-07 06:44:51 +00:00
|
|
|
|
2005-11-09 05:39:31 +00:00
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->desc_nr = -1;
|
2006-10-06 07:44:04 +00:00
|
|
|
rdev->saved_raid_disk = -1;
|
2006-01-06 08:20:55 +00:00
|
|
|
err = bind_rdev_to_array(rdev, mddev);
|
|
|
|
if (err)
|
2015-09-29 00:21:35 +00:00
|
|
|
goto abort_export;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The rest had better be atomic; we can have disk failures
|
|
|
|
* noticed in interrupt contexts ...
|
|
|
|
*/
|
|
|
|
|
|
|
|
rdev->raid_disk = -1;
|
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2016-11-04 05:46:03 +00:00
|
|
|
if (!mddev->thread)
|
|
|
|
md_update_sb(mddev, 1);
|
2021-12-21 20:06:19 +00:00
|
|
|
/*
|
|
|
|
* If the new disk does not support REQ_NOWAIT,
|
|
|
|
* disable on the whole MD.
|
|
|
|
*/
|
2022-09-27 07:58:15 +00:00
|
|
|
if (!bdev_nowait(rdev->bdev)) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_info("%s: Disabling nowait because %pg does not support nowait\n",
|
|
|
|
mdname(mddev), rdev->bdev);
|
2021-12-21 20:06:19 +00:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, mddev->queue);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Kick recovery, maybe this spare has to be added to the
|
|
|
|
* array immediately.
|
|
|
|
*/
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
abort_export:
|
2023-06-08 11:02:43 +00:00
|
|
|
export_rdev(rdev, mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int set_bitmap_file(struct mddev *mddev, int fd)
|
2005-06-22 00:17:14 +00:00
|
|
|
{
|
2014-04-09 02:25:40 +00:00
|
|
|
int err = 0;
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2005-09-09 23:23:45 +00:00
|
|
|
if (mddev->pers) {
|
2014-08-08 05:40:24 +00:00
|
|
|
if (!mddev->pers->quiesce || !mddev->thread)
|
2005-09-09 23:23:45 +00:00
|
|
|
return -EBUSY;
|
|
|
|
if (mddev->recovery || mddev->sync_thread)
|
|
|
|
return -EBUSY;
|
|
|
|
/* we should be able to change the bitmap.. */
|
|
|
|
}
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2005-09-09 23:23:45 +00:00
|
|
|
if (fd >= 0) {
|
2014-04-09 02:25:40 +00:00
|
|
|
struct inode *inode;
|
2014-12-15 01:57:00 +00:00
|
|
|
struct file *f;
|
|
|
|
|
|
|
|
if (mddev->bitmap || mddev->bitmap_info.file)
|
2005-09-09 23:23:45 +00:00
|
|
|
return -EEXIST; /* cannot add when bitmap is present */
|
2023-06-15 06:48:39 +00:00
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_MD_BITMAP_FILE)) {
|
|
|
|
pr_warn("%s: bitmap files not supported by this kernel\n",
|
|
|
|
mdname(mddev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2023-06-15 06:48:40 +00:00
|
|
|
pr_warn("%s: using deprecated bitmap file support\n",
|
|
|
|
mdname(mddev));
|
2023-06-15 06:48:39 +00:00
|
|
|
|
2014-12-15 01:57:00 +00:00
|
|
|
f = fget(fd);
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2014-12-15 01:57:00 +00:00
|
|
|
if (f == NULL) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: error: failed to get bitmap file\n",
|
|
|
|
mdname(mddev));
|
2005-09-09 23:23:45 +00:00
|
|
|
return -EBADF;
|
|
|
|
}
|
|
|
|
|
2014-12-15 01:57:00 +00:00
|
|
|
inode = f->f_mapping->host;
|
2014-04-09 02:25:40 +00:00
|
|
|
if (!S_ISREG(inode->i_mode)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: error: bitmap file must be a regular file\n",
|
|
|
|
mdname(mddev));
|
2014-04-09 02:25:40 +00:00
|
|
|
err = -EBADF;
|
2014-12-15 01:57:00 +00:00
|
|
|
} else if (!(f->f_mode & FMODE_WRITE)) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: error: bitmap file must open for write\n",
|
|
|
|
mdname(mddev));
|
2014-04-09 02:25:40 +00:00
|
|
|
err = -EBADF;
|
|
|
|
} else if (atomic_read(&inode->i_writecount) != 1) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: error: bitmap file is already in use\n",
|
|
|
|
mdname(mddev));
|
2014-04-09 02:25:40 +00:00
|
|
|
err = -EBUSY;
|
|
|
|
}
|
|
|
|
if (err) {
|
2014-12-15 01:57:00 +00:00
|
|
|
fput(f);
|
2005-09-09 23:23:45 +00:00
|
|
|
return err;
|
|
|
|
}
|
2014-12-15 01:57:00 +00:00
|
|
|
mddev->bitmap_info.file = f;
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0; /* file overrides offset */
|
2005-09-09 23:23:45 +00:00
|
|
|
} else if (mddev->bitmap == NULL)
|
|
|
|
return -ENOENT; /* cannot remove what isn't there */
|
|
|
|
err = 0;
|
|
|
|
if (mddev->pers) {
|
2010-06-01 09:37:35 +00:00
|
|
|
if (fd >= 0) {
|
2014-06-06 17:43:49 +00:00
|
|
|
struct bitmap *bitmap;
|
|
|
|
|
2018-08-01 22:20:50 +00:00
|
|
|
bitmap = md_bitmap_create(mddev, -1);
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_suspend(mddev);
|
2014-06-06 17:43:49 +00:00
|
|
|
if (!IS_ERR(bitmap)) {
|
|
|
|
mddev->bitmap = bitmap;
|
2018-08-01 22:20:50 +00:00
|
|
|
err = md_bitmap_load(mddev);
|
2015-02-25 00:44:11 +00:00
|
|
|
} else
|
|
|
|
err = PTR_ERR(bitmap);
|
2017-10-17 02:46:43 +00:00
|
|
|
if (err) {
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2017-10-17 02:46:43 +00:00
|
|
|
fd = -1;
|
|
|
|
}
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_resume(mddev);
|
2017-10-17 02:46:43 +00:00
|
|
|
} else if (fd < 0) {
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_suspend(mddev);
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_resume(mddev);
|
2006-06-26 07:27:43 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (fd < 0) {
|
2014-12-15 01:57:00 +00:00
|
|
|
struct file *f = mddev->bitmap_info.file;
|
|
|
|
if (f) {
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
mddev->bitmap_info.file = NULL;
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
fput(f);
|
|
|
|
}
|
2005-09-09 23:23:45 +00:00
|
|
|
}
|
|
|
|
|
2005-06-22 00:17:14 +00:00
|
|
|
return err;
|
|
|
|
}
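set_bitmap_file() above refuses a descriptor unless it refers to a regular file that is open for writing (it also requires a single writer, which has no direct user-space equivalent). A caller can pre-check the first two conditions before issuing the ioctl. The sketch below is a hypothetical user-space helper built on fstat() and fcntl(); the name `demo_check_bitmap_fd` and the choice to map every failure to -EBADF are assumptions for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return 0 if fd looks acceptable as a bitmap file, -EBADF otherwise.
 * Mirrors only the S_ISREG and FMODE_WRITE checks; the kernel's
 * i_writecount test cannot be reproduced from user space.
 */
static int demo_check_bitmap_fd(int fd)
{
	struct stat st;
	int fl;

	if (fstat(fd, &st) < 0)
		return -EBADF;		/* not an open descriptor */
	if (!S_ISREG(st.st_mode))
		return -EBADF;		/* must be a regular file */

	fl = fcntl(fd, F_GETFL);
	if (fl < 0)
		return -EBADF;
	if ((fl & O_ACCMODE) == O_RDONLY)
		return -EBADF;		/* must be open for write */
	return 0;
}
```

Passing this check does not guarantee the ioctl succeeds: the kernel still returns -EEXIST when a bitmap is already present and -EBUSY when the file has another writer.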
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2020-06-07 15:31:19 +00:00
|
|
|
* md_set_array_info is used in two different ways.
|
2005-04-16 22:20:36 +00:00
|
|
|
* The original usage is when creating a new array.
|
|
|
|
* In this usage, raid_disks is > 0 and it together with
|
|
|
|
* level, size, not_persistent, layout and chunk_size determine the
|
|
|
|
* shape of the array.
|
|
|
|
* This will always create an array with a type-0.90.0 superblock.
|
|
|
|
* The newer usage is when assembling an array.
|
|
|
|
* In this case raid_disks will be 0, and the major_version field is
|
|
|
|
* used to determine which style super-blocks are to be found on the devices.
|
|
|
|
* The minor_version and patch_version numbers are also kept in case the
|
|
|
|
* super_block handler wishes to interpret them.
|
|
|
|
*/
|
2020-06-07 15:31:19 +00:00
|
|
|
int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
if (info->raid_disks == 0) {
|
|
|
|
/* just setting version number for superblock loading */
|
|
|
|
if (info->major_version < 0 ||
|
2007-05-09 09:35:34 +00:00
|
|
|
info->major_version >= ARRAY_SIZE(super_types) ||
|
2005-04-16 22:20:36 +00:00
|
|
|
super_types[info->major_version].name == NULL) {
|
|
|
|
/* maybe try to auto-load a module? */
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: superblock version %d not known\n",
|
2005-04-16 22:20:36 +00:00
|
|
|
info->major_version);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
mddev->major_version = info->major_version;
|
|
|
|
mddev->minor_version = info->minor_version;
|
|
|
|
mddev->patch_version = info->patch_version;
|
2006-12-22 09:11:41 +00:00
|
|
|
mddev->persistent = !info->not_persistent;
|
2009-12-30 01:08:49 +00:00
|
|
|
/* ensure mddev_put doesn't delete this now that there
|
|
|
|
* is some minimal configuration.
|
|
|
|
*/
|
2015-12-20 23:51:01 +00:00
|
|
|
mddev->ctime = ktime_get_real_seconds();
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
mddev->major_version = MD_MAJOR_VERSION;
|
|
|
|
mddev->minor_version = MD_MINOR_VERSION;
|
|
|
|
mddev->patch_version = MD_PATCHLEVEL_VERSION;
|
2015-12-20 23:51:01 +00:00
|
|
|
mddev->ctime = ktime_get_real_seconds();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
mddev->level = info->level;
|
2006-01-17 06:14:57 +00:00
|
|
|
mddev->clevel[0] = 0;
|
2009-03-31 03:33:13 +00:00
|
|
|
mddev->dev_sectors = 2 * (sector_t)info->size;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->raid_disks = info->raid_disks;
|
|
|
|
/* don't set md_minor, it is determined by which /dev/md* was
|
|
|
|
* opened
|
|
|
|
*/
|
|
|
|
if (info->state & (1<<MD_SB_CLEAN))
|
|
|
|
mddev->recovery_cp = MaxSector;
|
|
|
|
else
|
|
|
|
mddev->recovery_cp = 0;
|
|
|
|
mddev->persistent = ! info->not_persistent;
|
2008-02-06 09:39:51 +00:00
|
|
|
mddev->external = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
mddev->layout = info->layout;
|
2019-09-09 06:52:29 +00:00
|
|
|
if (mddev->level == 0)
|
|
|
|
/* Cannot trust RAID0 layout info here */
|
|
|
|
mddev->layout = -1;
|
2009-06-17 22:45:01 +00:00
|
|
|
mddev->chunk_sectors = info->chunk_size >> 9;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
if (mddev->persistent) {
|
2017-02-28 20:31:28 +00:00
|
|
|
mddev->max_disks = MD_SB_DISKS;
|
|
|
|
mddev->flags = 0;
|
|
|
|
mddev->sb_flags = 0;
|
2016-12-08 23:48:19 +00:00
|
|
|
}
|
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.default_space = 64*2 - (MD_SB_BYTES >> 9);
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
2005-11-28 21:44:12 +00:00
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Generate a 128 bit UUID
|
|
|
|
*/
|
|
|
|
get_random_bytes(mddev->uuid, 16);
|
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
mddev->new_level = mddev->level;
|
2009-06-17 22:45:27 +00:00
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
2006-03-27 09:18:11 +00:00
|
|
|
mddev->new_layout = mddev->layout;
|
|
|
|
mddev->delta_disks = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
2006-03-27 09:18:11 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
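The unit conversions performed in md_set_array_info() can be sketched in isolation. This is a minimal user-space sketch, not kernel code: every sketch_* name is hypothetical, 512-byte sectors are assumed (as the ">> 9" shifts suggest), and the 4096-byte value standing in for MD_SB_BYTES is an assumption, since its definition is not shown in this file.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, as implied by the ">> 9" shifts. */
#define SKETCH_SECTOR_SHIFT 9
/* Assumption: stand-in for MD_SB_BYTES; the value is not taken from this file. */
#define SKETCH_SB_BYTES 4096u

/*
 * info->size is multiplied by two to obtain sectors, so it counts 1 KiB
 * blocks. Widen before multiplying so large sizes do not overflow 32 bits.
 * The caller is expected to pass a non-negative size.
 */
static uint64_t sketch_size_to_sectors(int32_t size_kib)
{
	return 2 * (uint64_t)size_kib;
}

/* info->chunk_size is in bytes; chunk_sectors is in sectors. */
static uint32_t sketch_chunk_to_sectors(uint32_t chunk_bytes)
{
	return chunk_bytes >> SKETCH_SECTOR_SHIFT;
}

/* The default bitmap offset is the superblock size expressed in sectors. */
static uint32_t sketch_bitmap_default_offset(void)
{
	return SKETCH_SB_BYTES >> SKETCH_SECTOR_SHIFT;
}

/* The default bitmap space is what remains of a 64 KiB (128-sector) area. */
static uint32_t sketch_bitmap_default_space(void)
{
	return 64 * 2 - (SKETCH_SB_BYTES >> SKETCH_SECTOR_SHIFT);
}
```

Under these assumptions a 512 KiB chunk becomes 1024 sectors, and the bitmap defaults come out as offset 8 and space 120 sectors.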
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_set_array_sectors(struct mddev *mddev, sector_t array_sectors)
|
2009-03-31 03:59:03 +00:00
|
|
|
{
|
2017-10-19 05:08:13 +00:00
|
|
|
lockdep_assert_held(&mddev->reconfig_mutex);
|
2009-03-31 04:00:31 +00:00
|
|
|
|
|
|
|
if (mddev->external_size)
|
|
|
|
return;
|
|
|
|
|
2009-03-31 03:59:03 +00:00
|
|
|
mddev->array_sectors = array_sectors;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_set_array_sectors);
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int update_size(struct mddev *mddev, sector_t num_sectors)
|
2006-01-06 08:20:49 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2006-01-06 08:20:49 +00:00
|
|
|
int rv;
|
2008-07-11 12:02:22 +00:00
|
|
|
int fit = (num_sectors == 0);
|
2017-03-01 08:42:40 +00:00
|
|
|
sector_t old_dev_sectors = mddev->dev_sectors;
|
2016-05-02 15:33:13 +00:00
|
|
|
|
2006-01-06 08:20:49 +00:00
|
|
|
if (mddev->pers->resize == NULL)
|
|
|
|
return -EINVAL;
|
2008-07-11 12:02:22 +00:00
|
|
|
/* The "num_sectors" is the number of sectors of each device that
|
|
|
|
* is used. This can only make sense for arrays with redundancy.
|
|
|
|
* linear and raid0 always use whatever space is available. We can only
|
|
|
|
* consider changing this number if no resync or reconstruction is
|
|
|
|
* happening, and if the new size is acceptable. It must fit before the
|
2008-07-11 12:02:23 +00:00
|
|
|
* sb_start or, if that is <data_offset, it must fit before the size
|
2008-07-11 12:02:22 +00:00
|
|
|
* of each device. If num_sectors is zero, we find the largest size
|
|
|
|
* that fits.
|
2006-01-06 08:20:49 +00:00
|
|
|
*/
|
2014-12-10 23:02:10 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
|
|
|
|
mddev->sync_thread)
|
2006-01-06 08:20:49 +00:00
|
|
|
return -EBUSY;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev))
|
2014-05-28 03:39:21 +00:00
|
|
|
return -EROFS;
|
2012-05-22 03:55:27 +00:00
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2009-03-31 03:33:13 +00:00
|
|
|
sector_t avail = rdev->sectors;
|
2006-10-28 17:38:30 +00:00
|
|
|
|
2008-07-11 12:02:22 +00:00
|
|
|
if (fit && (num_sectors == 0 || num_sectors > avail))
|
|
|
|
num_sectors = avail;
|
|
|
|
if (avail < num_sectors)
|
2006-01-06 08:20:49 +00:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
2008-07-11 12:02:22 +00:00
|
|
|
rv = mddev->pers->resize(mddev, num_sectors);
|
2017-02-24 03:15:23 +00:00
|
|
|
if (!rv) {
|
2017-03-01 08:42:40 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
md_cluster_ops->update_size(mddev, old_dev_sectors);
|
|
|
|
else if (mddev->queue) {
|
2020-11-16 14:57:11 +00:00
|
|
|
set_capacity_and_notify(mddev->gendisk,
|
|
|
|
mddev->array_sectors);
|
2017-02-24 03:15:23 +00:00
|
|
|
}
|
|
|
|
}
|
2006-01-06 08:20:49 +00:00
|
|
|
return rv;
|
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int update_raid_disks(struct mddev *mddev, int raid_disks)
|
2006-01-06 08:20:54 +00:00
|
|
|
{
|
|
|
|
int rv;
|
2012-05-20 23:27:00 +00:00
|
|
|
struct md_rdev *rdev;
|
2006-01-06 08:20:54 +00:00
|
|
|
/* change the number of raid disks */
|
2006-03-27 09:18:13 +00:00
|
|
|
if (mddev->pers->check_reshape == NULL)
|
2006-01-06 08:20:54 +00:00
|
|
|
return -EINVAL;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev))
|
2014-05-28 03:39:21 +00:00
|
|
|
return -EROFS;
|
2006-01-06 08:20:54 +00:00
|
|
|
if (raid_disks <= 0 ||
|
2010-04-14 07:02:09 +00:00
|
|
|
(mddev->max_disks && raid_disks >= mddev->max_disks))
|
2006-01-06 08:20:54 +00:00
|
|
|
return -EINVAL;
|
2014-12-10 23:02:10 +00:00
|
|
|
if (mddev->sync_thread ||
|
|
|
|
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
|
md/cluster: block reshape with remote resync job
A reshape request should be blocked while a resync job is ongoing. In a cluster
environment, a node can start a resync job even if the resync command was not
executed on it; e.g., the user executes "mdadm --grow" on node A, and sometimes
node B starts the resync job. However, the current update_raid_disks() only
checks the local recovery status, which is incomplete. As a result, the user can
execute "mdadm --grow" successfully on the local node while the remote node
refuses to reshape because it is running a resync job. The inconsistent handling
causes the array to enter an unexpected state. If the user does not notice this
issue and continues executing mdadm commands, the array eventually stops working.
Fix this issue by blocking the reshape request. When a node executes "--grow"
and detects an ongoing resync, it should stop and report an error to the user.
The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-19 11:41:33 +00:00
|
|
|
test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
|
2014-12-10 23:02:10 +00:00
|
|
|
mddev->reshape_position != MaxSector)
|
2006-01-06 08:20:54 +00:00
|
|
|
return -EBUSY;
|
2012-05-20 23:27:00 +00:00
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (mddev->raid_disks < raid_disks &&
|
|
|
|
rdev->data_offset < rdev->new_data_offset)
|
|
|
|
return -EINVAL;
|
|
|
|
if (mddev->raid_disks > raid_disks &&
|
|
|
|
rdev->data_offset > rdev->new_data_offset)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2006-03-27 09:18:13 +00:00
|
|
|
mddev->delta_disks = raid_disks - mddev->raid_disks;
|
2012-05-20 23:27:00 +00:00
|
|
|
if (mddev->delta_disks < 0)
|
|
|
|
mddev->reshape_backwards = 1;
|
|
|
|
else if (mddev->delta_disks > 0)
|
|
|
|
mddev->reshape_backwards = 0;
|
2006-03-27 09:18:13 +00:00
|
|
|
|
|
|
|
rv = mddev->pers->check_reshape(mddev);
|
2012-05-20 23:27:00 +00:00
|
|
|
if (rv < 0) {
|
2011-01-31 00:57:42 +00:00
|
|
|
mddev->delta_disks = 0;
|
2012-05-20 23:27:00 +00:00
|
|
|
mddev->reshape_backwards = 0;
|
|
|
|
}
|
2006-01-06 08:20:54 +00:00
|
|
|
return rv;
|
|
|
|
}
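The direction bookkeeping in update_raid_disks() can be sketched on its own. This is a simplified user-space sketch with hypothetical names (struct sketch_reshape, sketch_plan_reshape, sketch_finish); it leaves out the busy checks and the per-device data_offset checks, and only models how delta_disks and reshape_backwards are set and rolled back.

```c
#include <assert.h>

/* Hypothetical stand-in for the two mddev fields involved. */
struct sketch_reshape {
	int delta_disks;
	int reshape_backwards;
};

/*
 * Shrinking reshapes run backwards, growing reshapes run forwards, and an
 * unchanged disk count leaves the direction flag as it was.
 */
static void sketch_plan_reshape(struct sketch_reshape *r,
				int cur_disks, int new_disks)
{
	r->delta_disks = new_disks - cur_disks;
	if (r->delta_disks < 0)
		r->reshape_backwards = 1;
	else if (r->delta_disks > 0)
		r->reshape_backwards = 0;
}

/* Roll both fields back when the personality rejects the request (rv < 0). */
static int sketch_finish(struct sketch_reshape *r, int rv)
{
	if (rv < 0) {
		r->delta_disks = 0;
		r->reshape_backwards = 0;
	}
	return rv;
}
```

The rollback step matters because both fields are written before check_reshape() is consulted; without it a rejected request would leave a stale pending change behind.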
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* update_array_info is used to change the configuration of an
|
|
|
|
* on-line array.
|
|
|
|
* The version, ctime, level, size, raid_disks, not_persistent, layout, chunk_size
|
|
|
|
* fields in the info are checked against the array.
|
|
|
|
* Any differences that cannot be handled will cause an error.
|
|
|
|
* Normally, only one change can be managed at a time.
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int rv = 0;
|
|
|
|
int cnt = 0;
|
2005-09-09 23:23:45 +00:00
|
|
|
int state = 0;
|
|
|
|
|
|
|
|
/* calculate expected state, ignoring low bits */
|
2009-12-14 01:49:52 +00:00
|
|
|
if (mddev->bitmap && mddev->bitmap_info.offset)
|
2005-09-09 23:23:45 +00:00
|
|
|
state |= (1 << MD_SB_BITMAP_PRESENT);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (mddev->major_version != info->major_version ||
|
|
|
|
mddev->minor_version != info->minor_version ||
|
|
|
|
/* mddev->patch_version != info->patch_version || */
|
|
|
|
mddev->ctime != info->ctime ||
|
|
|
|
mddev->level != info->level ||
|
|
|
|
/* mddev->layout != info->layout || */
|
2015-06-11 01:41:10 +00:00
|
|
|
mddev->persistent != !info->not_persistent ||
|
2009-06-17 22:45:01 +00:00
|
|
|
mddev->chunk_sectors != info->chunk_size >> 9 ||
|
2005-09-09 23:23:45 +00:00
|
|
|
/* ignore bottom 8 bits of state, and allow SB_BITMAP_PRESENT to change */
|
|
|
|
((state^info->state) & 0xfffffe00)
|
|
|
|
)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
|
|
|
/* Check there is only one change */
|
2009-03-31 03:33:13 +00:00
|
|
|
if (info->size >= 0 && mddev->dev_sectors / 2 != info->size)
|
|
|
|
cnt++;
|
|
|
|
if (mddev->raid_disks != info->raid_disks)
|
|
|
|
cnt++;
|
|
|
|
if (mddev->layout != info->layout)
|
|
|
|
cnt++;
|
|
|
|
if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT))
|
|
|
|
cnt++;
|
|
|
|
if (cnt == 0)
|
|
|
|
return 0;
|
|
|
|
if (cnt > 1)
|
|
|
|
return -EINVAL;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (mddev->layout != info->layout) {
|
|
|
|
/* Change layout
|
|
|
|
* we don't need to do anything at the md level, the
|
|
|
|
* personality will take care of it all.
|
|
|
|
*/
|
2009-06-17 22:47:55 +00:00
|
|
|
if (mddev->pers->check_reshape == NULL)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
2009-06-17 22:47:42 +00:00
|
|
|
else {
|
|
|
|
mddev->new_layout = info->layout;
|
2009-06-17 22:47:55 +00:00
|
|
|
rv = mddev->pers->check_reshape(mddev);
|
2009-06-17 22:47:42 +00:00
|
|
|
if (rv)
|
|
|
|
mddev->new_layout = mddev->layout;
|
|
|
|
return rv;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-03-31 03:33:13 +00:00
|
|
|
if (info->size >= 0 && mddev->dev_sectors / 2 != info->size)
|
2008-07-11 12:02:22 +00:00
|
|
|
rv = update_size(mddev, (sector_t)info->size * 2);
|
2006-01-06 08:20:49 +00:00
|
|
|
|
2006-01-06 08:20:54 +00:00
|
|
|
if (mddev->raid_disks != info->raid_disks)
|
|
|
|
rv = update_raid_disks(mddev, info->raid_disks);
|
|
|
|
|
2005-09-09 23:23:45 +00:00
|
|
|
if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) {
|
2014-06-07 06:44:51 +00:00
|
|
|
if (mddev->pers->quiesce == NULL || mddev->thread == NULL) {
|
|
|
|
rv = -EINVAL;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (mddev->recovery || mddev->sync_thread) {
|
|
|
|
rv = -EBUSY;
|
|
|
|
goto err;
|
|
|
|
}
|
2005-09-09 23:23:45 +00:00
|
|
|
if (info->state & (1<<MD_SB_BITMAP_PRESENT)) {
|
2014-06-06 17:43:49 +00:00
|
|
|
struct bitmap *bitmap;
|
2005-09-09 23:23:45 +00:00
|
|
|
/* add the bitmap */
|
2014-06-07 06:44:51 +00:00
|
|
|
if (mddev->bitmap) {
|
|
|
|
rv = -EEXIST;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (mddev->bitmap_info.default_offset == 0) {
|
|
|
|
rv = -EINVAL;
|
|
|
|
goto err;
|
|
|
|
}
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset =
|
|
|
|
mddev->bitmap_info.default_offset;
|
2012-05-22 03:55:07 +00:00
|
|
|
mddev->bitmap_info.space =
|
|
|
|
mddev->bitmap_info.default_space;
|
2018-08-01 22:20:50 +00:00
|
|
|
bitmap = md_bitmap_create(mddev, -1);
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_suspend(mddev);
|
2014-06-06 17:43:49 +00:00
|
|
|
if (!IS_ERR(bitmap)) {
|
|
|
|
mddev->bitmap = bitmap;
|
2018-08-01 22:20:50 +00:00
|
|
|
rv = md_bitmap_load(mddev);
|
2015-02-25 00:44:11 +00:00
|
|
|
} else
|
|
|
|
rv = PTR_ERR(bitmap);
|
2005-09-09 23:23:45 +00:00
|
|
|
if (rv)
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_resume(mddev);
|
2005-09-09 23:23:45 +00:00
|
|
|
} else {
|
|
|
|
/* remove the bitmap */
|
2014-06-07 06:44:51 +00:00
|
|
|
if (!mddev->bitmap) {
|
|
|
|
rv = -ENOENT;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (mddev->bitmap->storage.file) {
|
|
|
|
rv = -EINVAL;
|
|
|
|
goto err;
|
|
|
|
}
|
2015-12-20 23:51:00 +00:00
|
|
|
if (mddev->bitmap_info.nodes) {
|
|
|
|
/* hold PW on all the bitmap locks */
|
|
|
|
if (md_cluster_ops->lock_all_bitmaps(mddev) <= 0) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("md: can't change bitmap to none since the array is in use by more than one node\n");
|
2015-12-20 23:51:00 +00:00
|
|
|
rv = -EPERM;
|
|
|
|
md_cluster_ops->unlock_all_bitmaps(mddev);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
mddev->bitmap_info.nodes = 0;
|
|
|
|
md_cluster_ops->leave(mddev);
|
2020-07-20 18:08:53 +00:00
|
|
|
module_put(md_cluster_mod);
|
2020-07-20 18:08:52 +00:00
|
|
|
mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
|
2015-12-20 23:51:00 +00:00
|
|
|
}
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_suspend(mddev);
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_destroy(mddev);
|
2017-10-17 02:46:43 +00:00
|
|
|
mddev_resume(mddev);
|
2009-12-14 01:49:52 +00:00
|
|
|
mddev->bitmap_info.offset = 0;
|
2005-09-09 23:23:45 +00:00
|
|
|
}
|
|
|
|
}
|
2006-10-03 08:15:46 +00:00
|
|
|
md_update_sb(mddev, 1);
|
2014-06-07 06:44:51 +00:00
|
|
|
return rv;
|
|
|
|
err:
|
2005-04-16 22:20:36 +00:00
|
|
|
return rv;
|
|
|
|
}
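The "only one change at a time" rule in update_array_info() can be sketched as a standalone counting function. This is a user-space sketch with hypothetical names (struct sketch_cfg, sketch_count_changes); the bit position 8 for the bitmap-present flag is an assumption, chosen to be consistent with the 0xfffffe00 mask, which leaves bits 0 through 8 free to differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: hypothetical stand-in for MD_SB_BITMAP_PRESENT. */
#define SKETCH_BITMAP_PRESENT 8

/* Hypothetical subset of the fields compared between array and request. */
struct sketch_cfg {
	int64_t size_kib;	/* a negative request means "leave the size alone" */
	int raid_disks;
	int layout;
	uint32_t state;
};

/*
 * Returns 0 when nothing differs, 1 when exactly one property differs,
 * and -EINVAL when the request tries to change more than one property.
 */
static int sketch_count_changes(const struct sketch_cfg *cur,
				const struct sketch_cfg *req)
{
	int cnt = 0;

	if (req->size_kib >= 0 && cur->size_kib != req->size_kib)
		cnt++;
	if (cur->raid_disks != req->raid_disks)
		cnt++;
	if (cur->layout != req->layout)
		cnt++;
	if ((cur->state ^ req->state) & (1u << SKETCH_BITMAP_PRESENT))
		cnt++;

	if (cnt > 1)
		return -EINVAL;
	return cnt;
}
```

Counting first and acting afterwards keeps the dispatch simple: once the count is known to be exactly one, each of the later "if" blocks can assume it is the only change being applied.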
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int set_disk_faulty(struct mddev *mddev, dev_t dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2012-10-11 02:37:33 +00:00
|
|
|
int err = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (mddev->pers == NULL)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2012-10-11 02:37:33 +00:00
|
|
|
rcu_read_lock();
|
2017-12-27 09:31:40 +00:00
|
|
|
rdev = md_find_rdev_rcu(mddev, dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!rdev)
|
2012-10-11 02:37:33 +00:00
|
|
|
err = -ENODEV;
|
|
|
|
else {
|
|
|
|
md_error(mddev, rdev);
|
2022-03-22 15:23:38 +00:00
|
|
|
if (test_bit(MD_BROKEN, &mddev->flags))
|
2012-10-11 02:37:33 +00:00
|
|
|
err = -EBUSY;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return err;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2008-04-25 16:57:58 +00:00
|
|
|
/*
|
|
|
|
* We have a problem here: there is no easy way to give a CHS
|
|
|
|
* virtual geometry. We currently pretend that we have 2 heads and
|
|
|
|
* 4 sectors (with a BIG number of cylinders...). This drives
|
|
|
|
* dosfs just mad... ;-)
|
|
|
|
*/
|
2006-01-08 09:02:50 +00:00
|
|
|
static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
|
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = bdev->bd_disk->private_data;
|
2006-01-08 09:02:50 +00:00
|
|
|
|
|
|
|
geo->heads = 2;
|
|
|
|
geo->sectors = 4;
|
2010-03-28 23:51:42 +00:00
|
|
|
geo->cylinders = mddev->array_sectors / 8;
|
2006-01-08 09:02:50 +00:00
|
|
|
return 0;
|
|
|
|
}
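The fake geometry arithmetic in md_getgeo() can be sketched as follows. This is a user-space sketch: struct sketch_geo is a hypothetical stand-in for struct hd_geometry, and its field widths (in particular a 16-bit cylinders field) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hd_geometry; field widths are assumed. */
struct sketch_geo {
	uint8_t heads;
	uint8_t sectors;
	uint16_t cylinders;
};

static void sketch_getgeo(struct sketch_geo *geo, uint64_t array_sectors)
{
	geo->heads = 2;
	geo->sectors = 4;
	/*
	 * 2 heads * 4 sectors = 8 sectors per cylinder. The quotient is
	 * truncated to the width of the cylinders field, so large arrays
	 * wrap around in this sketch.
	 */
	geo->cylinders = (uint16_t)(array_sectors / 8);
}
```

With these assumed widths, any array larger than 65535 cylinders reports a wrapped value, which is one reason the geometry should be treated as a compatibility fiction rather than a description of the array.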
|
|
|
|
|
2014-01-15 15:58:52 +00:00
|
|
|
static inline bool md_ioctl_valid(unsigned int cmd)
|
|
|
|
{
|
|
|
|
switch (cmd) {
|
|
|
|
case ADD_NEW_DISK:
|
|
|
|
case GET_ARRAY_INFO:
|
|
|
|
case GET_BITMAP_FILE:
|
|
|
|
case GET_DISK_INFO:
|
|
|
|
case HOT_ADD_DISK:
|
|
|
|
case HOT_REMOVE_DISK:
|
|
|
|
case RAID_VERSION:
|
|
|
|
case RESTART_ARRAY_RW:
|
|
|
|
case RUN_ARRAY:
|
|
|
|
case SET_ARRAY_INFO:
|
|
|
|
case SET_BITMAP_FILE:
|
|
|
|
case SET_DISK_FAULTY:
|
|
|
|
case STOP_ARRAY:
|
|
|
|
case STOP_ARRAY_RO:
|
2014-10-29 23:51:31 +00:00
|
|
|
case CLUSTERED_DISK_NACK:
|
2014-01-15 15:58:52 +00:00
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-09-20 02:39:37 +00:00
|
|
|
static int __md_set_array_info(struct mddev *mddev, void __user *argp)
|
|
|
|
{
|
|
|
|
mdu_array_info_t info;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (!argp)
|
|
|
|
memset(&info, 0, sizeof(info));
|
|
|
|
else if (copy_from_user(&info, argp, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (mddev->pers) {
|
|
|
|
err = update_array_info(mddev, &info);
|
|
|
|
if (err)
|
|
|
|
pr_warn("md: couldn't update array info. %d\n", err);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!list_empty(&mddev->disks)) {
|
|
|
|
pr_warn("md: array %s already has disks!\n", mdname(mddev));
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mddev->raid_disks) {
|
|
|
|
pr_warn("md: array %s already initialised!\n", mdname(mddev));
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = md_set_array_info(mddev, &info);
|
|
|
|
if (err)
|
|
|
|
pr_warn("md: couldn't set array info. %d\n", err);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2023-06-08 11:02:55 +00:00
|
|
|
static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
|
2005-04-16 22:20:36 +00:00
|
|
|
unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
void __user *argp = (void __user *)arg;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = NULL;
|
2017-04-06 03:16:33 +00:00
|
|
|
bool did_set_md_closing = false;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-01-15 15:58:52 +00:00
|
|
|
if (!md_ioctl_valid(cmd))
|
|
|
|
return -ENOTTY;
|
|
|
|
|
2011-12-22 23:17:26 +00:00
|
|
|
switch (cmd) {
|
|
|
|
case RAID_VERSION:
|
|
|
|
case GET_ARRAY_INFO:
|
|
|
|
case GET_DISK_INFO:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EACCES;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Commands dealing with the RAID driver but not any
|
|
|
|
* particular array:
|
|
|
|
*/
|
2012-12-11 02:39:21 +00:00
|
|
|
switch (cmd) {
|
|
|
|
case RAID_VERSION:
|
|
|
|
err = get_version(argp);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2012-12-11 02:39:21 +00:00
|
|
|
default:;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commands creating/starting a new array:
|
|
|
|
*/
|
|
|
|
|
2008-03-02 15:31:15 +00:00
|
|
|
mddev = bdev->bd_disk->private_data;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!mddev) {
|
|
|
|
BUG();
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2012-10-11 02:37:33 +00:00
|
|
|
/* Some actions do not require the mutex */
|
|
|
|
switch (cmd) {
|
|
|
|
case GET_ARRAY_INFO:
|
|
|
|
if (!mddev->raid_disks && !mddev->external)
|
|
|
|
err = -ENODEV;
|
|
|
|
else
|
|
|
|
err = get_array_info(mddev, argp);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2012-10-11 02:37:33 +00:00
|
|
|
|
|
|
|
case GET_DISK_INFO:
|
|
|
|
if (!mddev->raid_disks && !mddev->external)
|
|
|
|
err = -ENODEV;
|
|
|
|
else
|
|
|
|
err = get_disk_info(mddev, argp);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2012-10-11 02:37:33 +00:00
|
|
|
|
|
|
|
case SET_DISK_FAULTY:
|
|
|
|
err = set_disk_faulty(mddev, new_decode_dev(arg));
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2014-12-15 01:57:00 +00:00
|
|
|
|
|
|
|
case GET_BITMAP_FILE:
|
|
|
|
err = get_bitmap_file(mddev, argp);
|
|
|
|
goto out;
|
|
|
|
|
2012-10-11 02:37:33 +00:00
|
|
|
}
|
|
|
|
|
2013-04-02 06:38:55 +00:00
|
|
|
if (cmd == HOT_REMOVE_DISK)
|
|
|
|
/* need to ensure recovery thread has run */
|
|
|
|
wait_event_interruptible_timeout(mddev->sb_wait,
|
|
|
|
!test_bit(MD_RECOVERY_NEEDED,
|
2016-12-08 23:48:18 +00:00
|
|
|
&mddev->recovery),
|
2013-04-02 06:38:55 +00:00
|
|
|
msecs_to_jiffies(5000));
|
2013-08-27 06:44:13 +00:00
|
|
|
if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
|
|
|
|
/* Need to flush page cache, and ensure no-one else opens
|
|
|
|
* and writes
|
|
|
|
*/
|
|
|
|
mutex_lock(&mddev->open_mutex);
|
2014-09-09 04:00:15 +00:00
|
|
|
if (mddev->pers && atomic_read(&mddev->openers) > 1) {
|
2013-08-27 06:44:13 +00:00
|
|
|
mutex_unlock(&mddev->open_mutex);
|
|
|
|
err = -EBUSY;
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2013-08-27 06:44:13 +00:00
|
|
|
}
|
2020-10-22 01:21:28 +00:00
|
|
|
if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
|
|
|
|
mutex_unlock(&mddev->open_mutex);
|
|
|
|
err = -EBUSY;
|
|
|
|
goto out;
|
|
|
|
}
|
2017-04-06 03:16:33 +00:00
|
|
|
did_set_md_closing = true;
|
2013-08-27 06:44:13 +00:00
|
|
|
mutex_unlock(&mddev->open_mutex);
|
|
|
|
sync_blockdev(bdev);
|
|
|
|
}
|
md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for a read-only array:
md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then
calls remove_and_add_spares() directly to try to remove and add rdevs
from the array.
After this patch:
1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, that the
worker 'sync_work' is not pending, and that there are rdevs that can be added
or removed, then queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits.
This change makes sure that array reconfiguration is independent of the
daemon, and makes it much easier to synchronize it with io, considering
that io may rely on the daemon thread to complete.
Also fix a problem that 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must
be activated first, and then remove_and_add_spares() can add spares to the
array, as the read-write case does:
1) the daemon sets 'MD_RECOVERY_RUNNING' and registers a new sync thread to do
recovery;
2) recovery is done; md_do_sync() sets 'MD_RECOVERY_DONE' before returning;
3) the daemon calls 'pers->spare_active' and clears 'MD_RECOVERY_RUNNING';
4) in the next round of the daemon, remove_and_add_spares() is called to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-08-25 03:16:22 +00:00
|
|
|
|
|
|
|
if (!md_is_rdwr(mddev))
|
|
|
|
flush_work(&mddev->sync_work);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
|
|
|
|
err, cmd);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
if (cmd == SET_ARRAY_INFO) {
|
2022-09-20 02:39:37 +00:00
|
|
|
err = __md_set_array_info(mddev, argp);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commands querying/configuring an existing array:
|
|
|
|
*/
|
2005-06-22 00:17:14 +00:00
|
|
|
/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
|
2006-12-22 09:11:41 +00:00
|
|
|
* RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
|
2008-02-06 09:39:55 +00:00
|
|
|
if ((!mddev->raid_disks && !mddev->external)
|
|
|
|
&& cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
|
|
|
|
&& cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
|
|
|
|
&& cmd != GET_BITMAP_FILE) {
|
2005-04-16 22:20:36 +00:00
|
|
|
err = -ENODEV;
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commands even a read-only array can execute:
|
|
|
|
*/
|
2012-12-11 02:39:21 +00:00
|
|
|
switch (cmd) {
|
|
|
|
case RESTART_ARRAY_RW:
|
|
|
|
err = restart_array(mddev);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
case STOP_ARRAY:
|
|
|
|
err = do_md_stop(mddev, 0, bdev);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
case STOP_ARRAY_RO:
|
|
|
|
err = md_set_readonly(mddev, bdev);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2013-04-24 01:42:41 +00:00
|
|
|
case HOT_REMOVE_DISK:
|
|
|
|
err = hot_remove_disk(mddev, new_decode_dev(arg));
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2013-04-24 01:42:41 +00:00
|
|
|
|
md: Allow devices to be re-added to a read-only array.
When assembling an array incrementally we might want to make
it device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.
We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.
Currently an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto. This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.
So:
- remove the "md_update_sb()" call from add_new_disk(). This isn't
really needed as just adding a disk doesn't require a metadata
update. Instead, just set MD_CHANGE_DEVS. This will effect a
metadata update soon enough, once the array is not read-only.
- Allow the ADD_NEW_DISK ioctl to succeed without activating a
read-auto array, providing the MD_DISK_SYNC flag is set.
In this case, the device will be rejected if it cannot be added
with the correct device number, or has an incorrect event count.
- Teach remove_and_add_spares() to be careful about adding spares
when the array is read-only (or read-mostly) - only add devices
that are thought to be in-sync, and only do it if the array is
in-sync itself.
- In md_check_recovery, use remove_and_add_spares in the read-only
case, rather than open coding just the 'remove' part of it.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 01:42:42 +00:00
|
|
|
case ADD_NEW_DISK:
|
|
|
|
/* We can support ADD_NEW_DISK on read-only arrays
|
2016-03-21 11:19:30 +00:00
|
|
|
* only if we are re-adding a preexisting device.
|
2013-04-24 01:42:42 +00:00
|
|
|
* So require mddev->pers and MD_DISK_SYNC.
|
|
|
|
*/
|
|
|
|
if (mddev->pers) {
|
|
|
|
mdu_disk_info_t info;
|
|
|
|
if (copy_from_user(&info, argp, sizeof(info)))
|
|
|
|
err = -EFAULT;
|
|
|
|
else if (!(info.state & (1<<MD_DISK_SYNC)))
|
|
|
|
/* Need to clear read-only for this */
|
|
|
|
break;
|
|
|
|
else
|
2020-06-07 15:31:19 +00:00
|
|
|
err = md_add_new_disk(mddev, &info);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2013-04-24 01:42:42 +00:00
|
|
|
}
|
|
|
|
break;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The remaining ioctls are changing the state of the
|
[PATCH] md: allow md arrays to be started read-only (module parameter).
When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:36 +00:00
|
|
|
* superblock, so we do not allow them on read-only arrays.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev) && mddev->pers) {
|
|
|
|
if (mddev->ro != MD_AUTO_READ) {
|
2005-11-09 05:39:36 +00:00
|
|
|
err = -EROFS;
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-11-09 05:39:36 +00:00
|
|
|
}
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
/* mddev_unlock will wake thread */
|
|
|
|
/* If a device failed while we were read-only, we
|
|
|
|
* need to make sure the metadata is updated now.
|
|
|
|
*/
|
|
|
|
if (test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)) {
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
wait_event(mddev->sb_wait,
|
|
|
|
!test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags) &&
|
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
|
|
|
|
mddev_lock_nointr(mddev);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
switch (cmd) {
|
|
|
|
case ADD_NEW_DISK:
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2012-12-11 02:39:21 +00:00
|
|
|
mdu_disk_info_t info;
|
|
|
|
if (copy_from_user(&info, argp, sizeof(info)))
|
|
|
|
err = -EFAULT;
|
|
|
|
else
|
2020-06-07 15:31:19 +00:00
|
|
|
err = md_add_new_disk(mddev, &info);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2012-12-11 02:39:21 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-10-29 23:51:31 +00:00
|
|
|
case CLUSTERED_DISK_NACK:
|
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
md_cluster_ops->new_disk_ack(mddev, false);
|
|
|
|
else
|
|
|
|
err = -EINVAL;
|
|
|
|
goto unlock;
|
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
case HOT_ADD_DISK:
|
|
|
|
err = hot_add_disk(mddev, new_decode_dev(arg));
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
case RUN_ARRAY:
|
|
|
|
err = do_md_run(mddev);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
case SET_BITMAP_FILE:
|
|
|
|
err = set_bitmap_file(mddev, (int)arg);
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-06-22 00:17:14 +00:00
|
|
|
|
2012-12-11 02:39:21 +00:00
|
|
|
default:
|
|
|
|
err = -EINVAL;
|
2014-09-30 05:46:41 +00:00
|
|
|
goto unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2014-09-30 05:46:41 +00:00
|
|
|
unlock:
|
md: make devices disappear when they are no longer needed.
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, which is a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop of the device being opened
and closed, then re-opened due to the 'ADD' event from the first open,
and then closed and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do is
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-08 21:31:10 +00:00
|
|
|
if (mddev->hold_active == UNTIL_IOCTL &&
|
|
|
|
err != -EINVAL)
|
|
|
|
mddev->hold_active = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev_unlock(mddev);
|
2014-09-30 05:46:41 +00:00
|
|
|
out:
|
2017-04-06 03:16:33 +00:00
|
|
|
if (did_set_md_closing)
|
|
|
|
clear_bit(MD_CLOSING, &mddev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
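The read-only check that md_ioctl() performs before the superblock-changing ioctls can be modelled in user space. The following is a minimal sketch, not the kernel code: the enum, its values, and the function name are hypothetical stand-ins, and the sysfs notification, recovery flag, and metadata wait are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the md read-only states (names and values assumed). */
enum model_ro_state { MODEL_RDWR, MODEL_RDONLY, MODEL_AUTO_READ };

/*
 * Model of the check made before ioctls that change the superblock.
 * Returns 0 if the ioctl may proceed, or -EROFS on a truly read-only array.
 */
static int model_prepare_state_change(enum model_ro_state *ro, bool has_personality)
{
	if (*ro == MODEL_RDWR || !has_personality)
		return 0;		/* already writable, or array not running */
	if (*ro != MODEL_AUTO_READ)
		return -EROFS;		/* true read-only: refuse the ioctl */
	*ro = MODEL_RDWR;		/* auto-read: switch to read-write first */
	return 0;
}
```

An array in the auto-read state is promoted to read-write by the first such ioctl, while an array explicitly set read-only rejects it with -EROFS and keeps its state.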
|
2009-12-14 01:50:05 +00:00
|
|
|
#ifdef CONFIG_COMPAT
|
2023-06-08 11:02:55 +00:00
|
|
|
static int md_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
|
2009-12-14 01:50:05 +00:00
|
|
|
unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
|
|
|
switch (cmd) {
|
|
|
|
case HOT_REMOVE_DISK:
|
|
|
|
case HOT_ADD_DISK:
|
|
|
|
case SET_DISK_FAULTY:
|
|
|
|
case SET_BITMAP_FILE:
|
|
|
|
/* These take an integer arg, do not convert */
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
arg = (unsigned long)compat_ptr(arg);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return md_ioctl(bdev, mode, cmd, arg);
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_COMPAT */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-11-03 10:00:13 +00:00
|
|
|
static int md_set_read_only(struct block_device *bdev, bool ro)
|
|
|
|
{
|
|
|
|
struct mddev *mddev = bdev->bd_disk->private_data;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (!mddev->raid_disks && !mddev->external) {
|
|
|
|
err = -ENODEV;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Transitioning to read-auto need only happen for arrays that call
|
|
|
|
* md_write_start and which are not ready for writes yet.
|
|
|
|
*/
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!ro && mddev->ro == MD_RDONLY && mddev->pers) {
|
2020-11-03 10:00:13 +00:00
|
|
|
err = restart_array(mddev);
|
|
|
|
if (err)
|
|
|
|
goto out_unlock;
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_AUTO_READ;
|
2020-11-03 10:00:13 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2023-06-08 11:02:55 +00:00
|
|
|
static int md_open(struct gendisk *disk, blk_mode_t mode)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2022-07-19 09:18:24 +00:00
|
|
|
struct mddev *mddev;
|
2005-04-16 22:20:36 +00:00
|
|
|
int err;
|
|
|
|
|
2022-07-19 09:18:24 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
2023-06-08 11:02:36 +00:00
|
|
|
mddev = mddev_get(disk->private_data);
|
2022-07-19 09:18:24 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
2012-05-22 03:55:32 +00:00
|
|
|
if (!mddev)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2022-07-19 09:18:24 +00:00
|
|
|
err = mutex_lock_interruptible(&mddev->open_mutex);
|
|
|
|
if (err)
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
2022-07-19 09:18:24 +00:00
|
|
|
err = -ENODEV;
|
|
|
|
if (test_bit(MD_CLOSING, &mddev->flags))
|
|
|
|
goto out_unlock;
|
2016-08-12 05:42:37 +00:00
|
|
|
|
2008-07-21 07:05:25 +00:00
|
|
|
atomic_inc(&mddev->openers);
|
2009-08-10 02:50:52 +00:00
|
|
|
mutex_unlock(&mddev->open_mutex);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-06-08 11:02:36 +00:00
|
|
|
disk_check_media_change(disk);
|
2022-07-19 09:18:24 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&mddev->open_mutex);
|
|
|
|
out:
|
|
|
|
mddev_put(mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2023-06-08 11:02:37 +00:00
|
|
|
static void md_release(struct gendisk *disk)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-09-30 04:23:59 +00:00
|
|
|
struct mddev *mddev = disk->private_data;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-10-03 21:33:23 +00:00
|
|
|
BUG_ON(!mddev);
|
2008-07-21 07:05:25 +00:00
|
|
|
atomic_dec(&mddev->openers);
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev_put(mddev);
|
|
|
|
}
|
2011-02-24 06:26:41 +00:00
|
|
|
|
2020-07-08 12:25:41 +00:00
|
|
|
static unsigned int md_check_events(struct gendisk *disk, unsigned int clearing)
|
2011-02-24 06:26:41 +00:00
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = disk->private_data;
|
2020-07-08 12:25:41 +00:00
|
|
|
unsigned int ret = 0;
|
2011-02-24 06:26:41 +00:00
|
|
|
|
2020-07-08 12:25:41 +00:00
|
|
|
if (mddev->changed)
|
|
|
|
ret = DISK_EVENT_MEDIA_CHANGE;
|
2011-02-24 06:26:41 +00:00
|
|
|
mddev->changed = 0;
|
2020-07-08 12:25:41 +00:00
|
|
|
return ret;
|
2011-02-24 06:26:41 +00:00
|
|
|
}
|
2020-07-08 12:25:41 +00:00
|
|
|
|
2022-07-19 09:18:17 +00:00
|
|
|
static void md_free_disk(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
struct mddev *mddev = disk->private_data;
|
|
|
|
|
2022-07-19 09:18:23 +00:00
|
|
|
mddev_free(mddev);
|
2022-07-19 09:18:17 +00:00
|
|
|
}
|
|
|
|
|
2020-06-07 15:31:19 +00:00
|
|
|
const struct block_device_operations md_fops =
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
.owner = THIS_MODULE,
|
2020-07-01 08:59:43 +00:00
|
|
|
.submit_bio = md_submit_bio,
|
2008-03-02 15:31:15 +00:00
|
|
|
.open = md_open,
|
|
|
|
.release = md_release,
|
2009-05-26 02:57:36 +00:00
|
|
|
.ioctl = md_ioctl,
|
2009-12-14 01:50:05 +00:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
.compat_ioctl = md_compat_ioctl,
|
|
|
|
#endif
|
2006-01-08 09:02:50 +00:00
|
|
|
.getgeo = md_getgeo,
|
2020-07-08 12:25:41 +00:00
|
|
|
.check_events = md_check_events,
|
2020-11-03 10:00:13 +00:00
|
|
|
.set_read_only = md_set_read_only,
|
2022-07-19 09:18:17 +00:00
|
|
|
.free_disk = md_free_disk,
|
2005-04-16 22:20:36 +00:00
|
|
|
};
|
|
|
|
|
2014-09-30 04:23:59 +00:00
|
|
|
static int md_thread(void *arg)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:48:23 +00:00
|
|
|
struct md_thread *thread = arg;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* md_thread is a 'system-thread', its priority should be very
|
|
|
|
* high. We avoid resource deadlocks individually in each
|
|
|
|
* raid personality. (RAID5 does preallocation) We also use RR and
|
|
|
|
* the very same RT priority as kswapd, thus we will never get
|
|
|
|
* into a priority inversion deadlock.
|
|
|
|
*
|
|
|
|
* we definitely have to have equal or higher priority than
|
|
|
|
* bdflush, otherwise bdflush will deadlock if there are too
|
|
|
|
* many dirty RAID5 blocks.
|
|
|
|
*/
|
|
|
|
|
2005-10-20 04:23:47 +00:00
|
|
|
allow_signal(SIGKILL);
|
2005-09-09 23:23:56 +00:00
|
|
|
while (!kthread_should_stop()) {
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-11-15 08:09:12 +00:00
|
|
|
/* We need to wait INTERRUPTIBLE so that
|
|
|
|
* we don't add to the load-average.
|
|
|
|
* That means we need to be sure no signals are
|
|
|
|
* pending
|
|
|
|
*/
|
|
|
|
if (signal_pending(current))
|
|
|
|
flush_signals(current);
|
|
|
|
|
|
|
|
wait_event_interruptible_timeout
|
|
|
|
(thread->wqueue,
|
|
|
|
test_bit(THREAD_WAKEUP, &thread->flags)
|
2016-11-21 18:29:18 +00:00
|
|
|
|| kthread_should_stop() || kthread_should_park(),
|
2005-11-15 08:09:12 +00:00
|
|
|
thread->timeout);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-01-13 22:13:53 +00:00
|
|
|
clear_bit(THREAD_WAKEUP, &thread->flags);
|
2016-11-21 18:29:18 +00:00
|
|
|
if (kthread_should_park())
|
|
|
|
kthread_parkme();
|
2011-01-13 22:13:53 +00:00
|
|
|
if (!kthread_should_stop())
|
2012-10-11 02:34:00 +00:00
|
|
|
thread->run(thread);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2005-09-09 23:23:56 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-05-23 02:10:17 +00:00
|
|
|
static void md_wakeup_thread_directly(struct md_thread __rcu *thread)
|
2023-05-23 02:10:13 +00:00
|
|
|
{
|
2023-05-23 02:10:17 +00:00
|
|
|
struct md_thread *t;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
t = rcu_dereference(thread);
|
|
|
|
if (t)
|
|
|
|
wake_up_process(t->tsk);
|
|
|
|
rcu_read_unlock();
|
2023-05-23 02:10:13 +00:00
|
|
|
}
|
|
|
|
|
2023-05-23 02:10:17 +00:00
|
|
|
void md_wakeup_thread(struct md_thread __rcu *thread)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2023-05-23 02:10:17 +00:00
|
|
|
struct md_thread *t;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
t = rcu_dereference(thread);
|
|
|
|
if (t) {
|
|
|
|
pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
|
|
|
|
set_bit(THREAD_WAKEUP, &t->flags);
|
|
|
|
wake_up(&t->wqueue);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2023-05-23 02:10:17 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_wakeup_thread);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-10-11 02:34:00 +00:00
|
|
|
struct md_thread *md_register_thread(void (*run) (struct md_thread *),
|
|
|
|
struct mddev *mddev, const char *name)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:48:23 +00:00
|
|
|
struct md_thread *thread;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-10-11 05:48:23 +00:00
|
|
|
thread = kzalloc(sizeof(struct md_thread), GFP_KERNEL);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!thread)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
init_waitqueue_head(&thread->wqueue);
|
|
|
|
|
|
|
|
thread->run = run;
|
|
|
|
thread->mddev = mddev;
|
2005-06-22 00:17:14 +00:00
|
|
|
thread->timeout = MAX_SCHEDULE_TIMEOUT;
|
2009-09-23 08:09:45 +00:00
|
|
|
thread->tsk = kthread_run(md_thread, thread,
|
|
|
|
"%s_%s",
|
|
|
|
mdname(thread->mddev),
|
2012-07-03 05:56:52 +00:00
|
|
|
name);
|
2005-09-09 23:23:56 +00:00
|
|
|
if (IS_ERR(thread->tsk)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
kfree(thread);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
return thread;
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_register_thread);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-08-03 07:17:11 +00:00
|
|
|
void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2023-08-03 07:17:11 +00:00
|
|
|
struct md_thread *thread = rcu_dereference_protected(*threadp,
|
|
|
|
lockdep_is_held(&mddev->reconfig_mutex));
|
2022-04-29 08:49:09 +00:00
|
|
|
|
2023-05-23 02:10:17 +00:00
|
|
|
if (!thread)
|
2022-04-29 08:49:09 +00:00
|
|
|
return;
|
2023-05-23 02:10:17 +00:00
|
|
|
|
|
|
|
rcu_assign_pointer(*threadp, NULL);
|
|
|
|
synchronize_rcu();
|
2005-09-09 23:23:56 +00:00
|
|
|
|
2022-04-29 08:49:09 +00:00
|
|
|
pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
|
2005-09-09 23:23:56 +00:00
|
|
|
kthread_stop(thread->tsk);
|
2005-04-16 22:20:36 +00:00
|
|
|
kfree(thread);
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_unregister_thread);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_error(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2005-11-09 05:39:31 +00:00
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
return;
|
2008-04-30 07:52:32 +00:00
|
|
|
|
2011-07-28 01:31:48 +00:00
|
|
|
if (!mddev->pers || !mddev->pers->error_handler)
|
2005-04-16 22:20:36 +00:00
|
|
|
return;
|
2022-03-22 15:23:38 +00:00
|
|
|
mddev->pers->error_handler(mddev, rdev);
|
|
|
|
|
2023-03-06 13:03:17 +00:00
|
|
|
if (mddev->pers->level == 0 || mddev->pers->level == LEVEL_LINEAR)
|
|
|
|
return;
|
|
|
|
|
2022-03-22 15:23:38 +00:00
|
|
|
if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
|
2008-06-27 22:31:41 +00:00
|
|
|
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2005-04-16 22:20:36 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2022-03-22 15:23:38 +00:00
|
|
|
if (!test_bit(MD_BROKEN, &mddev->flags)) {
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
2010-07-26 01:49:55 +00:00
|
|
|
if (mddev->event_work.func)
|
2010-10-15 13:36:08 +00:00
|
|
|
queue_work(md_misc_wq, &mddev->event_work);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_error);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* seq_file implementation /proc/mdstat */
|
|
|
|
|
|
|
|
static void status_unused(struct seq_file *seq)
|
|
|
|
{
|
|
|
|
int i = 0;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
seq_printf(seq, "unused devices: ");
|
|
|
|
|
2009-01-08 21:31:08 +00:00
|
|
|
list_for_each_entry(rdev, &pending_raid_disks, same_set) {
|
2005-04-16 22:20:36 +00:00
|
|
|
i++;
|
2022-05-12 06:19:13 +00:00
|
|
|
seq_printf(seq, "%pg ", rdev->bdev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
if (!i)
|
|
|
|
seq_printf(seq, "<none>");
|
|
|
|
|
|
|
|
seq_printf(seq, "\n");
|
|
|
|
}
|
|
|
|
|
2015-07-02 07:12:58 +00:00
|
|
|
static int status_resync(struct seq_file *seq, struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2009-05-07 02:49:35 +00:00
|
|
|
sector_t max_sectors, resync, res;
|
2019-06-13 14:11:41 +00:00
|
|
|
unsigned long dt, db = 0;
|
|
|
|
sector_t rt, curr_mark_cnt, resync_mark_cnt;
|
|
|
|
int scale, recovery_active;
|
2006-03-27 09:18:04 +00:00
|
|
|
unsigned int per_milli;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-05-20 23:28:33 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
2009-05-07 02:49:35 +00:00
|
|
|
max_sectors = mddev->resync_max_sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
2009-05-07 02:49:35 +00:00
|
|
|
max_sectors = mddev->dev_sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-07-02 07:12:58 +00:00
|
|
|
resync = mddev->curr_resync;
|
2022-06-08 16:27:54 +00:00
|
|
|
if (resync < MD_RESYNC_ACTIVE) {
|
2015-07-02 07:12:58 +00:00
|
|
|
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
|
|
|
|
/* Still cleaning up */
|
|
|
|
resync = max_sectors;
|
md: Ensure resync is reported after it starts
The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:
mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
File exists
This is because the background mdadm process, which is responsible for
cleaning up this backup file gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file if a reshape is
going and, if it is not, it waits for an event on the file or times
out in 5 seconds. On faster machines, the reshape may complete before
the 5 seconds times out, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.
mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.
The reason mdstat does not report that the reshape has started is due to an issue
in status_resync(). recovery_active is subtracted from curr_resync which
will result in a value of zero for the first chunk of reshaped data, and
the resulting read will report no reshape in progress.
To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-08 16:27:55 +00:00
|
|
|
} else if (resync > max_sectors) {
|
2017-11-30 16:33:30 +00:00
|
|
|
resync = max_sectors;
|
2022-06-08 16:27:55 +00:00
|
|
|
} else {
|
2023-03-10 07:38:51 +00:00
|
|
|
res = atomic_read(&mddev->recovery_active);
|
|
|
|
/*
|
|
|
|
* Resync has started, but the subtraction has overflowed or
|
|
|
|
* yielded one of the special values. Force it to active to
|
|
|
|
* ensure the status reports an active resync.
|
|
|
|
*/
|
|
|
|
if (resync < res || resync - res < MD_RESYNC_ACTIVE)
|
2022-06-08 16:27:55 +00:00
|
|
|
resync = MD_RESYNC_ACTIVE;
|
2023-03-10 07:38:51 +00:00
|
|
|
else
|
|
|
|
resync -= res;
|
2022-06-08 16:27:55 +00:00
|
|
|
}
|
2015-07-02 07:12:58 +00:00
|
|
|
|
2022-06-08 16:27:54 +00:00
|
|
|
if (resync == MD_RESYNC_NONE) {
|
2018-07-02 08:26:25 +00:00
|
|
|
if (test_bit(MD_RESYNCING_REMOTE, &mddev->recovery)) {
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Faulty, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset != MaxSector &&
|
|
|
|
rdev->recovery_offset) {
|
|
|
|
seq_printf(seq, "\trecover=REMOTE");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
if (mddev->reshape_position != MaxSector)
|
|
|
|
seq_printf(seq, "\treshape=REMOTE");
|
|
|
|
else
|
|
|
|
seq_printf(seq, "\tresync=REMOTE");
|
|
|
|
return 1;
|
|
|
|
}
|
2015-07-02 07:12:58 +00:00
|
|
|
if (mddev->recovery_cp < MaxSector) {
|
|
|
|
seq_printf(seq, "\tresync=PENDING");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2022-06-08 16:27:54 +00:00
|
|
|
if (resync < MD_RESYNC_ACTIVE) {
|
2015-07-02 07:12:58 +00:00
|
|
|
seq_printf(seq, "\tresync=DELAYED");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2014-09-30 05:52:29 +00:00
|
|
|
WARN_ON(max_sectors == 0);
|
2006-03-27 09:18:04 +00:00
|
|
|
/* Pick 'scale' such that (resync>>scale)*1000 will fit
|
2009-05-07 02:49:35 +00:00
|
|
|
* in a sector_t, and (max_sectors>>scale) will fit in a
|
2006-03-27 09:18:04 +00:00
|
|
|
* u32, as those are the requirements for sector_div.
|
|
|
|
* Thus 'scale' must be at least 10
|
|
|
|
*/
|
|
|
|
scale = 10;
|
|
|
|
if (sizeof(sector_t) > sizeof(unsigned long)) {
|
2009-05-07 02:49:35 +00:00
|
|
|
while ( max_sectors/2 > (1ULL<<(scale+32)))
|
2006-03-27 09:18:04 +00:00
|
|
|
scale++;
|
|
|
|
}
|
|
|
|
res = (resync>>scale)*1000;
|
2009-05-07 02:49:35 +00:00
|
|
|
sector_div(res, (u32)((max_sectors>>scale)+1));
|
2006-03-27 09:18:04 +00:00
|
|
|
|
|
|
|
per_milli = res;
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-03-27 09:18:04 +00:00
|
|
|
int i, x = per_milli/50, y = 20-x;
|
2005-04-16 22:20:36 +00:00
|
|
|
seq_printf(seq, "[");
|
|
|
|
for (i = 0; i < x; i++)
|
|
|
|
seq_printf(seq, "=");
|
|
|
|
seq_printf(seq, ">");
|
|
|
|
for (i = 0; i < y; i++)
|
|
|
|
seq_printf(seq, ".");
|
|
|
|
seq_printf(seq, "] ");
|
|
|
|
}
|
2006-03-27 09:18:04 +00:00
|
|
|
seq_printf(seq, " %s =%3u.%u%% (%llu/%llu)",
|
2006-03-27 09:18:09 +00:00
|
|
|
(test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)?
|
|
|
|
"reshape" :
|
2006-10-03 08:15:57 +00:00
|
|
|
(test_bit(MD_RECOVERY_CHECK, &mddev->recovery)?
|
|
|
|
"check" :
|
|
|
|
(test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?
|
|
|
|
"resync" : "recovery"))),
|
|
|
|
per_milli/10, per_milli % 10,
|
2009-05-07 02:49:35 +00:00
|
|
|
(unsigned long long) resync/2,
|
|
|
|
(unsigned long long) max_sectors/2);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* dt: time from mark until now
|
|
|
|
* db: blocks written from mark until now
|
|
|
|
* rt: remaining time
|
2009-05-07 02:49:35 +00:00
|
|
|
*
|
2019-06-13 14:11:41 +00:00
|
|
|
* rt is a sector_t, which is always 64bit now. We are keeping
|
|
|
|
* the original algorithm, but it is not really necessary.
|
|
|
|
*
|
|
|
|
* Original algorithm:
|
|
|
|
* So we divide before multiply in case it is 32bit and close
|
|
|
|
* to the limit.
|
|
|
|
* We scale the divisor (db) by 32 to avoid losing precision
|
|
|
|
* near the end of resync when the number of remaining sectors
|
|
|
|
* is close to 'db'.
|
|
|
|
* We then divide rt by 32 after multiplying by db to compensate.
|
|
|
|
* The '+1' avoids division by zero if db is very small.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
dt = ((jiffies - mddev->resync_mark) / HZ);
|
|
|
|
if (!dt) dt++;
|
2019-06-13 14:11:41 +00:00
|
|
|
|
|
|
|
curr_mark_cnt = mddev->curr_mark_cnt;
|
|
|
|
recovery_active = atomic_read(&mddev->recovery_active);
|
|
|
|
resync_mark_cnt = mddev->resync_mark_cnt;
|
|
|
|
|
|
|
|
if (curr_mark_cnt >= (recovery_active + resync_mark_cnt))
|
|
|
|
db = curr_mark_cnt - (recovery_active + resync_mark_cnt);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-05-07 02:49:35 +00:00
|
|
|
rt = max_sectors - resync; /* number of remaining sectors */
|
2019-06-13 14:11:41 +00:00
|
|
|
rt = div64_u64(rt, db/32+1);
|
2009-05-07 02:49:35 +00:00
|
|
|
rt *= dt;
|
|
|
|
rt >>= 5;
|
|
|
|
|
|
|
|
seq_printf(seq, " finish=%lu.%lumin", (unsigned long)rt / 60,
|
|
|
|
((unsigned long)rt % 60)/6);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-07-10 11:44:16 +00:00
|
|
|
seq_printf(seq, " speed=%ldK/sec", db/2/dt);
|
2015-07-02 07:12:58 +00:00
|
|
|
return 1;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
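The progress and finish-time arithmetic in status_resync() can be tried outside the kernel. The sketch below assumes plain 64-bit integers in place of sector_t and ordinary division in place of sector_div() and div64_u64(); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Progress in tenths of a percent, following the scaling used above. */
static unsigned int model_per_milli(uint64_t resync, uint64_t max_sectors)
{
	int scale = 10;
	uint64_t res;

	/* Same loop as the original; for ordinary array sizes scale stays 10. */
	while (max_sectors / 2 > (1ULL << (scale + 32)))
		scale++;
	res = (resync >> scale) * 1000;
	res /= (uint32_t)((max_sectors >> scale) + 1);
	return (unsigned int)res;
}

/* Estimated seconds remaining, given db sectors completed in the last dt seconds. */
static uint64_t model_remaining_seconds(uint64_t max_sectors, uint64_t resync,
					uint64_t db, uint64_t dt)
{
	uint64_t rt = max_sectors - resync;	/* sectors still to do */

	rt /= db / 32 + 1;	/* divisor scaled down by 32; '+1' avoids division by zero */
	rt *= dt;
	rt >>= 5;		/* undo the scaling by 32 */
	return rt;
}
```

For example, with resync at 500000 of 1000000 sectors the first function yields 499 (printed as 49.9%), and with db = 64000 and dt = 10 the second yields 77 seconds.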
|
|
|
|
|
|
|
|
static void *md_seq_start(struct seq_file *seq, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct list_head *tmp;
|
|
|
|
loff_t l = *pos;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-03-17 14:04:39 +00:00
|
|
|
if (l == 0x10000) {
|
|
|
|
++*pos;
|
|
|
|
return (void *)2;
|
|
|
|
}
|
|
|
|
if (l > 0x10000)
|
2005-04-16 22:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
if (!l--)
|
|
|
|
/* header */
|
|
|
|
return (void*)1;
|
|
|
|
|
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_for_each(tmp,&all_mddevs)
|
|
|
|
if (!l--) {
|
2011-10-11 05:47:53 +00:00
|
|
|
mddev = list_entry(tmp, struct mddev, all_mddevs);
|
2022-07-19 09:18:23 +00:00
|
|
|
if (!mddev_get(mddev))
|
|
|
|
continue;
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
return mddev;
|
|
|
|
}
|
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
if (!l--)
|
|
|
|
return (void*)2;/* tail */
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct list_head *tmp;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *next_mddev, *mddev = v;
|
2022-07-19 09:18:23 +00:00
|
|
|
struct mddev *to_put = NULL;
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
++*pos;
|
|
|
|
if (v == (void*)2)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
spin_lock(&all_mddevs_lock);
|
2022-07-19 09:18:23 +00:00
|
|
|
if (v == (void*)1) {
|
2005-04-16 22:20:36 +00:00
|
|
|
tmp = all_mddevs.next;
|
2022-07-19 09:18:23 +00:00
|
|
|
} else {
|
|
|
|
to_put = mddev;
|
2005-04-16 22:20:36 +00:00
|
|
|
tmp = mddev->all_mddevs.next;
|
2014-09-30 04:23:59 +00:00
|
|
|
}
|
2022-07-19 09:18:23 +00:00
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
if (tmp == &all_mddevs) {
|
|
|
|
next_mddev = (void*)2;
|
|
|
|
*pos = 0x10000;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
next_mddev = list_entry(tmp, struct mddev, all_mddevs);
|
|
|
|
if (mddev_get(next_mddev))
|
|
|
|
break;
|
|
|
|
mddev = next_mddev;
|
|
|
|
tmp = mddev->all_mddevs.next;
|
2022-07-22 00:27:55 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2022-07-19 09:18:23 +00:00
|
|
|
if (to_put)
|
2023-09-14 15:24:16 +00:00
|
|
|
mddev_put(to_put);
|
2005-04-16 22:20:36 +00:00
|
|
|
return next_mddev;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
static void md_seq_stop(struct seq_file *seq, void *v)
|
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = v;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (mddev && v != (void*)1 && v != (void*)2)
|
|
|
|
mddev_put(mddev);
|
|
|
|
}
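The position handling in md_seq_start() can be summarised as: position 0 is the header token, positions 1..n address the n arrays, position n+1 is the tail token, and 0x10000 is a shortcut straight to the tail. The sketch below models only that mapping in userspace. The name seq_start_sketch, the TOKEN_* and POS_TAIL macros and the array of strings standing in for the all_mddevs list are assumptions for illustration; locking and the mddev_get() reference counting are left out.

```c
#include <stddef.h>

#define TOKEN_HEADER	((void *)1)
#define TOKEN_TAIL	((void *)2)
#define POS_TAIL	0x10000

/* Hypothetical: devs is an array of n records standing in for all_mddevs. */
static void *seq_start_sketch(char **devs, size_t n, long long *pos)
{
	long long l = *pos;

	if (l == POS_TAIL) {
		++*pos;			/* a later start returns NULL */
		return TOKEN_TAIL;
	}
	if (l > POS_TAIL)
		return NULL;
	if (l == 0)
		return TOKEN_HEADER;
	l--;				/* positions 1..n address devices */
	if ((size_t)l < n)
		return devs[l];
	if ((size_t)l == n)
		return TOKEN_TAIL;
	return NULL;
}
```

The two small integer tokens are what md_seq_show() and md_seq_stop() compare against to tell the header and tail apart from a real struct mddev pointer.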
|
|
|
|
|
|
|
|
static int md_seq_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev = v;
|
2009-03-31 03:33:13 +00:00
|
|
|
sector_t sectors;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (v == (void*)1) {
|
2011-10-11 05:49:58 +00:00
|
|
|
struct md_personality *pers;
|
2005-04-16 22:20:36 +00:00
|
|
|
seq_printf(seq, "Personalities : ");
|
|
|
|
spin_lock(&pers_lock);
|
2006-01-06 08:20:36 +00:00
|
|
|
list_for_each_entry(pers, &pers_list, list)
|
|
|
|
seq_printf(seq, "[%s] ", pers->name);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
seq_printf(seq, "\n");
|
2011-07-12 18:48:39 +00:00
|
|
|
seq->poll_event = atomic_read(&md_event_count);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
if (v == (void*)2) {
|
|
|
|
status_unused(seq);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-12-15 01:56:58 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev->pers || mddev->raid_disks || !list_empty(&mddev->disks)) {
|
|
|
|
seq_printf(seq, "%s : %sactive", mdname(mddev),
|
|
|
|
mddev->pers ? "" : "in");
|
|
|
|
if (mddev->pers) {
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro == MD_RDONLY)
|
2005-04-16 22:20:36 +00:00
|
|
|
seq_printf(seq, " (read-only)");
|
2022-09-20 02:39:38 +00:00
|
|
|
if (mddev->ro == MD_AUTO_READ)
|
2008-03-10 18:43:47 +00:00
|
|
|
seq_printf(seq, " (auto-read-only)");
|
2005-04-16 22:20:36 +00:00
|
|
|
seq_printf(seq, " %s", mddev->pers->name);
|
|
|
|
}
|
|
|
|
|
2009-03-31 03:33:13 +00:00
|
|
|
sectors = 0;
|
2014-12-15 01:56:59 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev_for_each_rcu(rdev, mddev) {
|
2022-05-12 06:19:13 +00:00
|
|
|
seq_printf(seq, " %pg[%d]", rdev->bdev, rdev->desc_nr);
|
|
|
|
|
2005-09-09 23:23:45 +00:00
|
|
|
if (test_bit(WriteMostly, &rdev->flags))
|
|
|
|
seq_printf(seq, "(W)");
|
2015-10-12 23:59:50 +00:00
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
seq_printf(seq, "(J)");
|
2005-11-09 05:39:31 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
seq_printf(seq, "(F)");
|
|
|
|
continue;
|
2011-12-22 23:17:51 +00:00
|
|
|
}
|
|
|
|
if (rdev->raid_disk < 0)
|
2005-09-09 23:24:00 +00:00
|
|
|
seq_printf(seq, "(S)"); /* spare */
|
2011-12-22 23:17:51 +00:00
|
|
|
if (test_bit(Replacement, &rdev->flags))
|
|
|
|
seq_printf(seq, "(R)");
|
2009-03-31 03:33:13 +00:00
|
|
|
sectors += rdev->sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-12-15 01:56:59 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!list_empty(&mddev->disks)) {
|
|
|
|
if (mddev->pers)
|
|
|
|
seq_printf(seq, "\n %llu blocks",
|
2008-07-21 07:05:22 +00:00
|
|
|
(unsigned long long)
|
|
|
|
mddev->array_sectors / 2);
|
2005-04-16 22:20:36 +00:00
|
|
|
else
|
|
|
|
seq_printf(seq, "\n %llu blocks",
|
2009-03-31 03:33:13 +00:00
|
|
|
(unsigned long long)sectors / 2);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2005-09-09 23:24:00 +00:00
|
|
|
if (mddev->persistent) {
|
|
|
|
if (mddev->major_version != 0 ||
|
|
|
|
mddev->minor_version != 90) {
|
|
|
|
seq_printf(seq," super %d.%d",
|
|
|
|
mddev->major_version,
|
|
|
|
mddev->minor_version);
|
|
|
|
}
|
2008-02-06 09:39:51 +00:00
|
|
|
} else if (mddev->external)
|
|
|
|
seq_printf(seq, " super external:%s",
|
|
|
|
mddev->metadata_type);
|
|
|
|
else
|
2005-09-09 23:24:00 +00:00
|
|
|
seq_printf(seq, " super non-persistent");
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (mddev->pers) {
|
2008-10-13 00:55:12 +00:00
|
|
|
mddev->pers->status(seq, mddev);
|
2014-09-30 04:23:59 +00:00
|
|
|
seq_printf(seq, "\n ");
|
2005-11-09 05:39:41 +00:00
|
|
|
if (mddev->pers->sync_request) {
|
2015-07-02 07:12:58 +00:00
|
|
|
if (status_resync(seq, mddev))
|
2005-11-09 05:39:41 +00:00
|
|
|
seq_printf(seq, "\n ");
|
|
|
|
}
|
2005-06-22 00:17:14 +00:00
|
|
|
} else
|
|
|
|
seq_printf(seq, "\n ");
|
|
|
|
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_status(seq, mddev->bitmap);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
seq_printf(seq, "\n");
|
|
|
|
}
|
2014-12-15 01:56:58 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2014-09-30 04:23:59 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-05-07 02:49:37 +00:00
|
|
|
static const struct seq_operations md_seq_ops = {
|
2005-04-16 22:20:36 +00:00
|
|
|
.start = md_seq_start,
|
|
|
|
.next = md_seq_next,
|
|
|
|
.stop = md_seq_stop,
|
|
|
|
.show = md_seq_show,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int md_seq_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
2011-07-12 18:48:39 +00:00
|
|
|
struct seq_file *seq;
|
2005-04-16 22:20:36 +00:00
|
|
|
int error;
|
|
|
|
|
|
|
|
error = seq_open(file, &md_seq_ops);
|
[PATCH] md: make /proc/mdstat pollable
With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.
It is similar to the poll-ability of /proc/mounts, though different in that:
We always report that the file is readable (because face it, it is, even if
only for EOF).
We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).
Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.
md_new_event takes an 'mddev' which isn't currently used, but it will be soon.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:20:30 +00:00
|
|
|
if (error)
|
2011-07-12 18:48:39 +00:00
|
|
|
return error;
|
|
|
|
|
|
|
|
seq = file->private_data;
|
|
|
|
seq->poll_event = atomic_read(&md_event_count);
|
2005-04-16 22:20:36 +00:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2014-04-09 04:33:51 +00:00
|
|
|
static int md_unloading;
|
2017-07-03 10:39:46 +00:00
|
|
|
static __poll_t mdstat_poll(struct file *filp, poll_table *wait)
|
2006-01-06 08:20:30 +00:00
|
|
|
{
|
2011-07-12 18:48:39 +00:00
|
|
|
struct seq_file *seq = filp->private_data;
|
2017-07-03 10:39:46 +00:00
|
|
|
__poll_t mask;
|
2006-01-06 08:20:30 +00:00
|
|
|
|
2014-04-09 04:33:51 +00:00
|
|
|
if (md_unloading)
|
2018-02-11 22:34:03 +00:00
|
|
|
return EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
|
2006-01-06 08:20:30 +00:00
|
|
|
poll_wait(filp, &md_event_waiters, wait);
|
|
|
|
|
|
|
|
/* always allow read */
|
2018-02-11 22:34:03 +00:00
|
|
|
mask = EPOLLIN | EPOLLRDNORM;
|
2006-01-06 08:20:30 +00:00
|
|
|
|
2011-07-12 18:48:39 +00:00
|
|
|
if (seq->poll_event != atomic_read(&md_event_count))
|
2018-02-11 22:34:03 +00:00
|
|
|
mask |= EPOLLERR | EPOLLPRI;
|
2006-01-06 08:20:30 +00:00
|
|
|
return mask;
|
|
|
|
}
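The mask logic of mdstat_poll() is small enough to model in userspace. The sketch below is an illustration under assumptions, not the kernel code: mdstat_mask and mdstat_changed are hypothetical names, the POLL* constants from <poll.h> stand in for the kernel's EPOLL* values, and plain integers stand in for seq->poll_event and md_event_count.

```c
#define _XOPEN_SOURCE 700	/* for POLLRDNORM under strict ISO C modes */
#include <poll.h>

/* Hypothetical mirror of the mask computed by mdstat_poll(). */
static short mdstat_mask(int seen_event, int current_event, int unloading)
{
	short mask = POLLIN | POLLRDNORM;	/* always allow read */

	/* A changed event count (or module unload) is reported as exceptional. */
	if (unloading || seen_event != current_event)
		mask |= POLLERR | POLLPRI;
	return mask;
}

/* What a userspace watcher would test in pollfd.revents. */
static int mdstat_changed(short revents)
{
	return (revents & POLLPRI) != 0;
}
```

Because the file always reports readable, a watcher that waits for POLLIN returns immediately; waiting for POLLPRI, as the commit message says mdadm does, is what detects a change.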
|
|
|
|
|
2020-02-04 01:37:17 +00:00
|
|
|
static const struct proc_ops mdstat_proc_ops = {
|
|
|
|
.proc_open = md_seq_open,
|
|
|
|
.proc_read = seq_read,
|
|
|
|
.proc_lseek = seq_lseek,
|
|
|
|
.proc_release = seq_release,
|
|
|
|
.proc_poll = mdstat_poll,
|
2005-04-16 22:20:36 +00:00
|
|
|
};
|
|
|
|
|
2011-10-11 05:49:58 +00:00
|
|
|
int register_md_personality(struct md_personality *p)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: %s personality registered for level %d\n",
|
|
|
|
p->name, p->level);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_lock(&pers_lock);
|
2006-01-06 08:20:36 +00:00
|
|
|
list_add_tail(&p->list, &pers_list);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(register_md_personality);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-10-11 05:49:58 +00:00
|
|
|
int unregister_md_personality(struct md_personality *p)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: %s personality unregistered\n", p->name);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_lock(&pers_lock);
|
2006-01-06 08:20:36 +00:00
|
|
|
list_del_init(&p->list);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(unregister_md_personality);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-08-13 02:32:55 +00:00
|
|
|
int register_md_cluster_operations(struct md_cluster_operations *ops,
|
|
|
|
struct module *module)
|
2014-03-29 15:01:53 +00:00
|
|
|
{
|
2015-08-13 02:32:55 +00:00
|
|
|
int ret = 0;
|
2014-03-29 15:01:53 +00:00
|
|
|
spin_lock(&pers_lock);
|
2015-08-13 02:32:55 +00:00
|
|
|
if (md_cluster_ops != NULL)
|
|
|
|
ret = -EALREADY;
|
|
|
|
else {
|
|
|
|
md_cluster_ops = ops;
|
|
|
|
md_cluster_mod = module;
|
|
|
|
}
|
2014-03-29 15:01:53 +00:00
|
|
|
spin_unlock(&pers_lock);
|
2015-08-13 02:32:55 +00:00
|
|
|
return ret;
|
2014-03-29 15:01:53 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(register_md_cluster_operations);
|
|
|
|
|
|
|
|
int unregister_md_cluster_operations(void)
|
|
|
|
{
|
|
|
|
spin_lock(&pers_lock);
|
|
|
|
md_cluster_ops = NULL;
|
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(unregister_md_cluster_operations);
|
|
|
|
|
|
|
|
int md_setup_cluster(struct mddev *mddev, int nodes)
|
|
|
|
{
|
2020-07-20 18:08:52 +00:00
|
|
|
int ret;
|
2016-09-05 02:17:28 +00:00
|
|
|
if (!md_cluster_ops)
|
|
|
|
request_module("md-cluster");
|
2014-03-29 15:01:53 +00:00
|
|
|
spin_lock(&pers_lock);
|
2016-09-05 02:17:28 +00:00
|
|
|
/* ensure module won't be unloaded */
|
2014-03-29 15:01:53 +00:00
|
|
|
if (!md_cluster_ops || !try_module_get(md_cluster_mod)) {
|
2021-12-26 02:24:11 +00:00
|
|
|
pr_warn("can't find md-cluster module or get its reference.\n");
|
2014-03-29 15:01:53 +00:00
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
spin_unlock(&pers_lock);
|
|
|
|
|
2020-07-20 18:08:52 +00:00
|
|
|
ret = md_cluster_ops->join(mddev, nodes);
|
|
|
|
if (!ret)
|
|
|
|
mddev->safemode_delay = 0;
|
|
|
|
return ret;
|
2014-03-29 15:01:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void md_cluster_stop(struct mddev *mddev)
|
|
|
|
{
|
2014-03-29 15:20:02 +00:00
|
|
|
if (!md_cluster_ops)
|
|
|
|
return;
|
2014-03-29 15:01:53 +00:00
|
|
|
md_cluster_ops->leave(mddev);
|
|
|
|
module_put(md_cluster_mod);
|
|
|
|
}
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
static int is_mddev_idle(struct mddev *mddev, int init)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-09-30 04:23:59 +00:00
|
|
|
struct md_rdev *rdev;
|
2005-04-16 22:20:36 +00:00
|
|
|
int idle;
|
2009-03-31 03:27:02 +00:00
|
|
|
int curr_events;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
idle = 1;
|
2008-07-21 07:05:25 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev_for_each_rcu(rdev, mddev) {
|
2020-09-03 05:40:59 +00:00
|
|
|
struct gendisk *disk = rdev->bdev->bd_disk;
|
2020-11-24 08:36:54 +00:00
|
|
|
curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
|
2009-03-31 03:27:02 +00:00
|
|
|
atomic_read(&disk->sync_io);
|
2007-07-17 11:06:12 +00:00
|
|
|
/* sync IO will cause sync_io to increase before the disk_stats
|
|
|
|
* as sync_io is counted when a request starts, and
|
|
|
|
* disk_stats is counted when it completes.
|
|
|
|
* So resync activity will cause curr_events to be smaller than
|
|
|
|
* when there was no such activity.
|
|
|
|
* non-sync IO will cause disk_stat to increase without
|
|
|
|
* increasing sync_io so curr_events will (eventually)
|
|
|
|
* be larger than it was before. Once it becomes
|
|
|
|
* substantially larger, the test below will cause
|
|
|
|
* the array to appear non-idle, and resync will slow
|
|
|
|
* down.
|
|
|
|
* If there is a lot of outstanding resync activity when
|
|
|
|
* we set last_event to curr_events, then all that activity
|
|
|
|
* completing might cause the array to appear non-idle
|
|
|
|
* and resync will be slowed down even though there might
|
|
|
|
* not have been non-resync activity. This will only
|
|
|
|
* happen once though. 'last_events' will soon reflect
|
|
|
|
* the state where there is little or no outstanding
|
|
|
|
* resync requests, and further resync activity will
|
|
|
|
* always make curr_events less than last_events.
|
2005-11-18 09:11:01 +00:00
|
|
|
*
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2009-03-31 03:27:02 +00:00
|
|
|
if (init || curr_events - rdev->last_events > 64) {
|
2005-04-16 22:20:36 +00:00
|
|
|
rdev->last_events = curr_events;
|
|
|
|
idle = 0;
|
|
|
|
}
|
|
|
|
}
|
2008-07-21 07:05:25 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
return idle;
|
|
|
|
}
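The idle test in is_mddev_idle() compares, per device, the sectors counted in the disk statistics minus the sectors counted as resync IO against the value remembered from the previous check. A minimal userspace sketch of that comparison follows. The struct dev_events type and the dev_is_idle name are assumptions for illustration, and the two counters are passed in as plain integers instead of being read from part_stat_read_accum() and disk->sync_io.

```c
#include <stdbool.h>

struct dev_events {		/* hypothetical stand-in for struct md_rdev */
	int last_events;
};

/*
 * completed_sectors: sectors accounted in the disk statistics (on completion)
 * sync_sectors:      sectors accounted as resync IO (on submission)
 * Returns true when the device looks idle, false otherwise.
 */
static bool dev_is_idle(struct dev_events *d, int completed_sectors,
			int sync_sectors, bool init)
{
	int curr_events = completed_sectors - sync_sectors;

	/* More than 64 sectors of non-resync IO since last time: not idle. */
	if (init || curr_events - d->last_events > 64) {
		d->last_events = curr_events;
		return false;
	}
	return true;
}
```

Pure resync traffic raises both counters equally, so curr_events does not grow and the device keeps looking idle. Only other IO pushes the difference past the 64-sector threshold.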
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_done_sync(struct mddev *mddev, int blocks, int ok)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
/* another "blocks" (512byte) blocks have been synced */
|
|
|
|
atomic_sub(blocks, &mddev->recovery_active);
|
|
|
|
wake_up(&mddev->recovery_wait);
|
|
|
|
if (!ok) {
|
md: restart recovery cleanly after device failure.
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23 20:04:39 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2012-11-19 11:57:34 +00:00
|
|
|
set_bit(MD_RECOVERY_ERROR, &mddev->recovery);
|
2005-04-16 22:20:36 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
// stop recovery, signal do_sync ....
|
|
|
|
}
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_done_sync);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-06-22 00:17:12 +00:00
|
|
|
/* md_write_start(mddev, bi)
|
|
|
|
* If we need to update some array metadata (e.g. 'active' flag
|
2005-06-22 00:17:26 +00:00
|
|
|
* in superblock) before writing, schedule a superblock update
|
|
|
|
* and wait for it to complete.
|
2017-06-05 06:49:39 +00:00
|
|
|
* A return value of 'false' means that the write wasn't recorded
|
|
|
|
 * and cannot proceed as the array is being suspended.
|
2005-06-22 00:17:12 +00:00
|
|
|
*/
|
2017-06-05 06:49:39 +00:00
|
|
|
bool md_write_start(struct mddev *mddev, struct bio *bi)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2008-06-27 22:31:36 +00:00
|
|
|
int did_change = 0;
|
2018-02-02 22:13:19 +00:00
|
|
|
|
2005-06-22 00:17:12 +00:00
|
|
|
if (bio_data_dir(bi) != WRITE)
|
2017-06-05 06:49:39 +00:00
|
|
|
return true;
|
2005-06-22 00:17:12 +00:00
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
BUG_ON(mddev->ro == MD_RDONLY);
|
|
|
|
if (mddev->ro == MD_AUTO_READ) {
|
[PATCH] md: allow md arrays to be started read-only (module parameter).
When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:36 +00:00
|
|
|
/* need to switch to read/write */
|
md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for a read-only array:
md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then it
calls remove_and_add_spares() directly to try to remove and add rdevs
from the array.
After this patch:
1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, that the
worker 'sync_work' is not pending, and that there are rdevs that can be
added or removed; then it queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits;
This change makes sure that array reconfiguration is independent of the
daemon, and it will be much easier to synchronize it with io, considering
that io may rely on the daemon thread to complete.
Also fix a problem where 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must
be activated first, and then remove_and_add_spares() can add spares to the
array, as the read-write case does:
1) the daemon sets 'MD_RECOVERY_RUNNING' and registers a new sync thread to
do recovery;
2) recovery is done, md_do_sync() sets 'MD_RECOVERY_DONE' before returning;
3) the daemon calls 'pers->spare_active' and clears 'MD_RECOVERY_RUNNING';
4) in the next round of the daemon, remove_and_add_spares() is called to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-08-25 03:16:22 +00:00
|
|
|
flush_work(&mddev->sync_work);
|
2022-09-20 02:39:38 +00:00
|
|
|
mddev->ro = MD_RDWR;
|
2005-11-09 05:39:36 +00:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
2008-03-04 22:29:32 +00:00
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
2008-06-27 22:31:36 +00:00
|
|
|
did_change = 1;
|
2005-11-09 05:39:36 +00:00
|
|
|
}
|
MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.
We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.
It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.
mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform atomic operations here when absolutely
needed.
md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:14 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
percpu_ref_get(&mddev->writes_pending);
|
2017-03-15 03:05:14 +00:00
|
|
|
smp_mb(); /* Match smp_mb in set_in_sync() */
|
2008-04-30 07:52:30 +00:00
|
|
|
if (mddev->safemode == 1)
|
|
|
|
mddev->safemode = 0;
|
2017-03-15 03:05:14 +00:00
|
|
|
/* sync_checkers is always 0 when writes_pending is in per-cpu mode */
|
2017-08-08 06:56:36 +00:00
|
|
|
if (mddev->in_sync || mddev->sync_checkers) {
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2005-06-22 00:17:26 +00:00
|
|
|
if (mddev->in_sync) {
|
|
|
|
mddev->in_sync = 0;
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
|
|
|
set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2005-06-22 00:17:26 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2008-06-27 22:31:36 +00:00
|
|
|
did_change = 1;
|
2005-06-22 00:17:26 +00:00
|
|
|
}
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2005-06-22 00:17:12 +00:00
|
|
|
}
|
2017-03-15 03:05:14 +00:00
|
|
|
rcu_read_unlock();
|
2008-06-27 22:31:36 +00:00
|
|
|
if (did_change)
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2018-02-02 22:13:19 +00:00
|
|
|
if (!mddev->has_superblocks)
|
|
|
|
return true;
|
2008-05-23 20:04:36 +00:00
|
|
|
wait_event(mddev->sb_wait,
|
2017-10-05 05:23:16 +00:00
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
|
2023-01-31 05:17:09 +00:00
|
|
|
is_md_suspended(mddev));
|
2017-06-05 06:49:39 +00:00
|
|
|
if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
|
|
|
|
percpu_ref_put(&mddev->writes_pending);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_write_start);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it to count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requests with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:12 +00:00
|
|
|
/* md_write_inc can only be called when md_write_start() has
|
|
|
|
* already been called at least once for the current request.
|
|
|
|
* It increments the counter and is useful when a single request
|
|
|
|
* is split into several parts. Each part causes an increment and
|
|
|
|
* so needs a matching md_write_end().
|
|
|
|
* Unlike md_write_start(), it is safe to call md_write_inc() inside
|
|
|
|
* a spinlocked region.
|
|
|
|
*/
|
|
|
|
void md_write_inc(struct mddev *mddev, struct bio *bi)
|
|
|
|
{
|
|
|
|
if (bio_data_dir(bi) != WRITE)
|
|
|
|
return;
|
2022-09-20 02:39:38 +00:00
|
|
|
WARN_ON_ONCE(mddev->in_sync || !md_is_rdwr(mddev));
|
2017-03-15 03:05:14 +00:00
|
|
|
percpu_ref_get(&mddev->writes_pending);
|
2017-03-15 03:05:12 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_write_inc);
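The pairing rule described in the md_write_inc() comment can be sketched in userspace. This is only an illustrative model, not the kernel code: `struct write_counter` and the `model_*` helpers are hypothetical names, and a plain `long` stands in for the percpu refcount `writes_pending`.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for mddev->writes_pending. */
struct write_counter {
	long pending;
};

/* Models md_write_start(): takes the first reference for a request. */
static void model_write_start(struct write_counter *wc)
{
	wc->pending++;
}

/* Models md_write_inc(): only valid while a reference is already held. */
static void model_write_inc(struct write_counter *wc)
{
	assert(wc->pending > 0);
	wc->pending++;
}

/* Models md_write_end(): drops exactly one reference. */
static void model_write_end(struct write_counter *wc)
{
	assert(wc->pending > 0);
	wc->pending--;
}

/*
 * One request split into nparts pieces: the setup reference keeps the
 * count elevated while each part takes its own reference, and every
 * reference is matched by one model_write_end().
 * Returns the count left pending once everything has completed.
 */
static long model_split_request(struct write_counter *wc, int nparts)
{
	model_write_start(wc);
	for (int i = 0; i < nparts; i++)
		model_write_inc(wc);
	model_write_end(wc);		/* drop the setup reference */
	for (int i = 0; i < nparts; i++)
		model_write_end(wc);	/* each part completes */
	return wc->pending;
}
```

Starting from a zeroed counter, `model_split_request(&wc, 4)` leaves the count balanced at zero, which is the invariant the comment asks callers to preserve.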
|
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_write_end(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2017-03-15 03:05:14 +00:00
|
|
|
percpu_ref_put(&mddev->writes_pending);
|
|
|
|
|
|
|
|
if (mddev->safemode == 2)
|
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
else if (mddev->safemode_delay)
|
|
|
|
/* The roundup() ensures this only performs locking once
|
|
|
|
* every ->safemode_delay jiffies
|
|
|
|
*/
|
|
|
|
mod_timer(&mddev->safemode_timer,
|
|
|
|
roundup(jiffies, mddev->safemode_delay) +
|
|
|
|
mddev->safemode_delay);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2017-03-15 03:05:14 +00:00
|
|
|
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_write_end);
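The effect of the roundup() in md_write_end() can be checked with plain arithmetic. The sketch below is a userspace approximation under stated assumptions: `round_up_to()` and `safemode_expiry()` are hypothetical helpers that mirror the expression passed to mod_timer(), with ordinary integers standing in for jiffies.

```c
#include <assert.h>

/* Smallest multiple of step that is >= x (same arithmetic as roundup()). */
static unsigned long round_up_to(unsigned long x, unsigned long step)
{
	assert(step > 0);
	return ((x + step - 1) / step) * step;
}

/*
 * Expiry value computed the way md_write_end() computes it:
 * roundup(jiffies, delay) + delay.
 */
static unsigned long safemode_expiry(unsigned long now, unsigned long delay)
{
	return round_up_to(now, delay) + delay;
}
```

With a delay of 200, `safemode_expiry(1001, 200)` and `safemode_expiry(1200, 200)` both give 1400, while `safemode_expiry(1201, 200)` gives 1600. Every call inside one delay-sized window therefore asks for the same expiry, which is the unchanged-timeout case the commit message says mod_timer() already handles cheaply.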
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-02-04 07:50:43 +00:00
|
|
|
/* This is used by raid0 and raid10 */
|
|
|
|
void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
|
|
|
|
struct bio *bio, sector_t start, sector_t size)
|
|
|
|
{
|
|
|
|
struct bio *discard_bio = NULL;
|
|
|
|
|
2022-04-15 04:52:57 +00:00
|
|
|
if (__blkdev_issue_discard(rdev->bdev, start, size, GFP_NOIO,
|
2021-02-04 07:50:43 +00:00
|
|
|
&discard_bio) || !discard_bio)
|
|
|
|
return;
|
|
|
|
|
|
|
|
bio_chain(discard_bio, bio);
|
|
|
|
bio_clone_blkg_association(discard_bio, bio);
|
|
|
|
if (mddev->gendisk)
|
|
|
|
trace_block_bio_remap(discard_bio,
|
|
|
|
disk_devt(mddev->gendisk),
|
|
|
|
bio->bi_iter.bi_sector);
|
|
|
|
submit_bio_noacct(discard_bio);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(md_submit_discard_bio);
|
|
|
|
|
2023-06-21 16:51:04 +00:00
|
|
|
static void md_end_clone_io(struct bio *bio)
|
2021-12-10 09:31:15 +00:00
|
|
|
{
|
2023-06-21 16:51:04 +00:00
|
|
|
struct md_io_clone *md_io_clone = bio->bi_private;
|
|
|
|
struct bio *orig_bio = md_io_clone->orig_bio;
|
|
|
|
struct mddev *mddev = md_io_clone->mddev;
|
2021-05-25 09:46:17 +00:00
|
|
|
|
|
|
|
orig_bio->bi_status = bio->bi_status;
|
|
|
|
|
2023-06-21 16:51:04 +00:00
|
|
|
if (md_io_clone->start_time)
|
|
|
|
bio_end_io_acct(orig_bio, md_io_clone->start_time);
|
|
|
|
|
2021-05-25 09:46:17 +00:00
|
|
|
bio_put(bio);
|
|
|
|
bio_endio(orig_bio);
|
2023-02-03 05:13:44 +00:00
|
|
|
percpu_ref_put(&mddev->active_io);
|
2021-05-25 09:46:17 +00:00
|
|
|
}
|
|
|
|
|
2023-06-21 16:51:04 +00:00
|
|
|
static void md_clone_bio(struct mddev *mddev, struct bio **bio)
|
2021-05-25 09:46:17 +00:00
|
|
|
{
|
2022-02-02 16:01:09 +00:00
|
|
|
struct block_device *bdev = (*bio)->bi_bdev;
|
2023-06-21 16:51:04 +00:00
|
|
|
struct md_io_clone *md_io_clone;
|
|
|
|
struct bio *clone =
|
|
|
|
bio_alloc_clone(bdev, *bio, GFP_NOIO, &mddev->io_clone_set);
|
|
|
|
|
|
|
|
md_io_clone = container_of(clone, struct md_io_clone, bio_clone);
|
|
|
|
md_io_clone->orig_bio = *bio;
|
|
|
|
md_io_clone->mddev = mddev;
|
|
|
|
if (blk_queue_io_stat(bdev->bd_disk->queue))
|
|
|
|
md_io_clone->start_time = bio_start_io_acct(*bio);
|
|
|
|
|
|
|
|
clone->bi_end_io = md_end_clone_io;
|
|
|
|
clone->bi_private = md_io_clone;
|
|
|
|
*bio = clone;
|
|
|
|
}
|
2021-05-25 09:46:17 +00:00
|
|
|
|
2023-06-21 16:51:04 +00:00
|
|
|
void md_account_bio(struct mddev *mddev, struct bio **bio)
|
|
|
|
{
|
2023-02-03 05:13:44 +00:00
|
|
|
percpu_ref_get(&mddev->active_io);
|
2023-06-21 16:51:04 +00:00
|
|
|
md_clone_bio(mddev, bio);
|
2021-05-25 09:46:17 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(md_account_bio);
|
|
|
|
|
2007-01-26 08:57:11 +00:00
|
|
|
/* md_allow_write(mddev)
|
|
|
|
* Calling this ensures that the array is marked 'active' so that writes
|
|
|
|
* may proceed without blocking. It is important to call this before
|
|
|
|
* attempting a GFP_KERNEL allocation while holding the mddev lock.
|
|
|
|
* Must be called with mddev_lock held.
|
|
|
|
*/
|
2017-05-08 09:56:55 +00:00
|
|
|
void md_allow_write(struct mddev *mddev)
|
2007-01-26 08:57:11 +00:00
|
|
|
{
|
|
|
|
if (!mddev->pers)
|
2017-05-08 09:56:55 +00:00
|
|
|
return;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev))
|
2017-05-08 09:56:55 +00:00
|
|
|
return;
|
2008-06-27 22:31:27 +00:00
|
|
|
if (!mddev->pers->sync_request)
|
2017-05-08 09:56:55 +00:00
|
|
|
return;
|
2007-01-26 08:57:11 +00:00
|
|
|
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2007-01-26 08:57:11 +00:00
|
|
|
if (mddev->in_sync) {
|
|
|
|
mddev->in_sync = 0;
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
|
|
|
set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2007-01-26 08:57:11 +00:00
|
|
|
if (mddev->safemode_delay &&
|
|
|
|
mddev->safemode == 0)
|
|
|
|
mddev->safemode = 1;
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2007-01-26 08:57:11 +00:00
|
|
|
md_update_sb(mddev, 0);
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
2017-05-08 09:56:55 +00:00
|
|
|
/* wait for the dirty state to be recorded in the metadata */
|
|
|
|
wait_event(mddev->sb_wait,
|
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
|
2007-01-26 08:57:11 +00:00
|
|
|
} else
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2007-01-26 08:57:11 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(md_allow_write);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#define SYNC_MARKS 10
|
|
|
|
#define SYNC_MARK_STEP (3*HZ)
|
2012-10-31 00:59:10 +00:00
|
|
|
#define UPDATE_FREQUENCY (5*60*HZ)
|
2012-10-11 02:34:00 +00:00
|
|
|
void md_do_sync(struct md_thread *thread)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2012-10-11 02:34:00 +00:00
|
|
|
struct mddev *mddev = thread->mddev;
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev2;
|
2019-06-14 22:41:07 +00:00
|
|
|
unsigned int currspeed = 0, window;
|
2014-08-07 13:37:41 +00:00
|
|
|
sector_t max_sectors, j, io_sectors, recovery_done;
|
2005-04-16 22:20:36 +00:00
|
|
|
unsigned long mark[SYNC_MARKS];
|
2012-10-31 00:59:10 +00:00
|
|
|
unsigned long update_time;
|
2005-04-16 22:20:36 +00:00
|
|
|
sector_t mark_cnt[SYNC_MARKS];
|
|
|
|
int last_mark, m;
|
|
|
|
sector_t last_check;
|
2005-06-22 00:17:13 +00:00
|
|
|
int skipped = 0;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2013-06-25 06:23:59 +00:00
|
|
|
char *desc, *action = NULL;
|
md:Add blk_plug in sync_thread.
Adding a blk_plug to sync_thread increases the performance of sync.
Because sync_thread did not use a blk_plug, bios were not merged well
during a RAID sync.
Testing environment:
SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller.
OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
x86_64 x86_64 x86_64 GNU/Linux.
RAID5: four ST31000524NS disk.
Without blk_plug: recovery speed is about 63M/sec.
With blk_plug: recovery speed is about 120M/sec.
Using blktrace:
blktrace -d /dev/sdb -w 60 -o -|blkparse -i -
without blk_plug:
Total (8,16):
Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB
Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB
Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 0
add blk_plug:
Total (8,16):
Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB
Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB
Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 3384
The merge ratio increases markedly.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
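The merge-ratio claim can be made concrete from the blktrace totals quoted above. A minimal sketch, assuming "merge ratio" means read merges divided by reads queued; `merge_ratio()` is a hypothetical helper, not part of blktrace or the kernel.

```c
#include <assert.h>

/* Fraction of queued reads that were merged into an existing request. */
static double merge_ratio(unsigned long merges, unsigned long queued)
{
	assert(queued > 0);
	return (double)merges / (double)queued;
}
```

Using the figures from the two traces, `merge_ratio(23533, 309811)` (without blk_plug) stays below 0.08, while `merge_ratio(424743, 428697)` (with blk_plug) is above 0.99.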
2012-07-03 02:12:26 +00:00
|
|
|
struct blk_plug plug;
|
2016-05-02 15:33:08 +00:00
|
|
|
int ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* just in case the thread restarts... */
|
2017-11-20 06:17:01 +00:00
|
|
|
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
|
2005-04-16 22:20:36 +00:00
|
|
|
return;
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev)) {/* never try to sync a read-only array */
|
2014-05-28 03:39:23 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2006-06-26 07:27:40 +00:00
|
|
|
return;
|
2014-05-28 03:39:23 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-05-02 15:33:08 +00:00
|
|
|
if (mddev_is_clustered(mddev)) {
|
|
|
|
ret = md_cluster_ops->resync_start(mddev);
|
|
|
|
if (ret)
|
|
|
|
goto skip;
|
|
|
|
|
2016-06-03 03:32:04 +00:00
|
|
|
set_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags);
|
2016-05-02 15:33:08 +00:00
|
|
|
if (!(test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
|
|
|
|
&& ((unsigned long long)mddev->curr_resync_completed
|
|
|
|
< (unsigned long long)mddev->resync_max_sectors))
|
|
|
|
goto skip;
|
|
|
|
}
|
|
|
|
|
2006-10-03 08:15:57 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
2013-06-25 06:23:59 +00:00
|
|
|
if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
|
2006-10-03 08:15:57 +00:00
|
|
|
desc = "data-check";
|
2013-06-25 06:23:59 +00:00
|
|
|
action = "check";
|
|
|
|
} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
|
2006-10-03 08:15:57 +00:00
|
|
|
desc = "requested-resync";
|
2013-06-25 06:23:59 +00:00
|
|
|
action = "repair";
|
|
|
|
} else
|
2006-10-03 08:15:57 +00:00
|
|
|
desc = "resync";
|
|
|
|
} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
|
|
|
desc = "reshape";
|
|
|
|
else
|
|
|
|
desc = "recovery";
|
|
|
|
|
2013-06-25 06:23:59 +00:00
|
|
|
mddev->last_sync_action = action ?: desc;
|
|
|
|
|
2022-06-08 16:27:54 +00:00
|
|
|
/*
|
2005-04-16 22:20:36 +00:00
|
|
|
* Before starting a resync we must have set curr_resync to
|
|
|
|
* MD_RESYNC_DELAYED (2), and then checked that every "conflicting" array has curr_resync
|
|
|
|
* less than ours. When we find one that is the same or higher
|
|
|
|
* we wait on resync_wait. To avoid deadlock, we reduce curr_resync
|
|
|
|
* to MD_RESYNC_YIELDED (1) if we choose to yield (based arbitrarily on the address of the mddev structure).
|
|
|
|
* This will mean we have to start checking from the beginning again.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
|
|
|
do {
|
2016-08-16 12:26:08 +00:00
|
|
|
int mddev2_minor = -1;
|
2022-06-08 16:27:54 +00:00
|
|
|
mddev->curr_resync = MD_RESYNC_DELAYED;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
try_again:
|
2009-12-30 04:25:23 +00:00
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto skip;
|
2022-07-19 09:18:20 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
|
2022-07-19 09:18:23 +00:00
|
|
|
if (test_bit(MD_DELETED, &mddev2->flags))
|
|
|
|
continue;
|
2005-04-16 22:20:36 +00:00
|
|
|
if (mddev2 == mddev)
|
|
|
|
continue;
|
2008-05-23 20:04:38 +00:00
|
|
|
if (!mddev->parallel_resync
|
|
|
|
&& mddev2->curr_resync
|
|
|
|
&& match_mddev_units(mddev, mddev2)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
DEFINE_WAIT(wq);
|
2022-06-08 16:27:54 +00:00
|
|
|
if (mddev < mddev2 &&
|
|
|
|
mddev->curr_resync == MD_RESYNC_DELAYED) {
|
2005-04-16 22:20:36 +00:00
|
|
|
/* arbitrarily yield */
|
2022-06-08 16:27:54 +00:00
|
|
|
mddev->curr_resync = MD_RESYNC_YIELDED;
|
2005-04-16 22:20:36 +00:00
|
|
|
wake_up(&resync_wait);
|
|
|
|
}
|
2022-06-08 16:27:54 +00:00
|
|
|
if (mddev > mddev2 &&
|
|
|
|
mddev->curr_resync == MD_RESYNC_YIELDED)
|
2005-04-16 22:20:36 +00:00
|
|
|
/* no need to wait here, we can wait the next
|
|
|
|
* time 'round when curr_resync == MD_RESYNC_DELAYED
|
|
|
|
*/
|
|
|
|
continue;
|
2008-09-19 01:49:54 +00:00
|
|
|
/* We need to wait 'interruptible' so as not to
|
|
|
|
* contribute to the load average, and not to
|
|
|
|
* be caught by 'softlockup'
|
|
|
|
*/
|
|
|
|
prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
|
2013-11-19 01:02:01 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
|
2005-10-26 08:58:58 +00:00
|
|
|
mddev2->curr_resync >= mddev->curr_resync) {
|
2016-08-16 12:26:08 +00:00
|
|
|
if (mddev2_minor != mddev2->md_minor) {
|
|
|
|
mddev2_minor = mddev2->md_minor;
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: delaying %s of %s until %s has finished (they share one or more physical units)\n",
|
|
|
|
desc, mdname(mddev),
|
|
|
|
mdname(mddev2));
|
2016-08-16 12:26:08 +00:00
|
|
|
}
|
2022-07-19 09:18:20 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2008-09-19 01:49:54 +00:00
|
|
|
if (signal_pending(current))
|
|
|
|
flush_signals(current);
|
2005-04-16 22:20:36 +00:00
|
|
|
schedule();
|
|
|
|
finish_wait(&resync_wait, &wq);
|
|
|
|
goto try_again;
|
|
|
|
}
|
|
|
|
finish_wait(&resync_wait, &wq);
|
|
|
|
}
|
|
|
|
}
|
2022-07-19 09:18:20 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
2022-06-08 16:27:54 +00:00
|
|
|
} while (mddev->curr_resync < MD_RESYNC_DELAYED);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-06-26 07:27:40 +00:00
|
|
|
j = 0;
|
2005-11-09 05:39:26 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
/* resync follows the size requested by the personality,
|
2005-06-22 00:17:13 +00:00
|
|
|
* which defaults to physical size, but can be virtual size
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
max_sectors = mddev->resync_max_sectors;
|
2012-10-11 03:17:59 +00:00
|
|
|
atomic64_set(&mddev->resync_mismatches, 0);
|
2006-06-26 07:27:40 +00:00
|
|
|
/* we don't use the checkpoint if there's a bitmap */
|
2008-06-27 22:31:24 +00:00
|
|
|
if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
|
|
|
|
j = mddev->resync_min;
|
|
|
|
else if (!mddev->bitmap)
|
2006-06-26 07:27:40 +00:00
|
|
|
j = mddev->recovery_cp;
|
2008-06-27 22:31:24 +00:00
|
|
|
|
2018-10-18 08:37:47 +00:00
|
|
|
} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
|
2012-05-20 23:28:33 +00:00
|
|
|
max_sectors = mddev->resync_max_sectors;
|
2018-10-18 08:37:47 +00:00
|
|
|
/*
|
|
|
|
* If the original node aborts reshaping then we continue the
|
|
|
|
* reshaping, so set j again to avoid restarting the reshape from the
|
|
|
|
* very beginning
|
|
|
|
*/
|
|
|
|
if (mddev_is_clustered(mddev) &&
|
|
|
|
mddev->reshape_position != MaxSector)
|
|
|
|
j = mddev->reshape_position;
|
|
|
|
} else {
|
2005-04-16 22:20:36 +00:00
|
|
|
/* recovery follows the physical size of devices */
|
2009-03-31 03:33:13 +00:00
|
|
|
max_sectors = mddev->dev_sectors;
|
2006-06-26 07:27:40 +00:00
|
|
|
j = MaxSector;
|
2009-12-13 04:17:06 +00:00
|
|
|
rcu_read_lock();
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each_rcu(rdev, mddev)
|
2006-06-26 07:27:40 +00:00
|
|
|
if (rdev->raid_disk >= 0 &&
|
2015-10-09 04:54:12 +00:00
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
2006-06-26 07:27:40 +00:00
|
|
|
!test_bit(Faulty, &rdev->flags) &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset < j)
|
|
|
|
j = rdev->recovery_offset;
|
2009-12-13 04:17:06 +00:00
|
|
|
rcu_read_unlock();
|
2014-07-02 02:04:14 +00:00
|
|
|
|
|
|
|
/* If there is a bitmap, we need to make sure all
|
|
|
|
* writes that started before we added a spare
|
|
|
|
* complete before we start doing a recovery.
|
|
|
|
* Otherwise the write might complete and (via
|
|
|
|
* bitmap_endwrite) set a bit in the bitmap after the
|
|
|
|
* recovery has checked that bit and skipped that
|
|
|
|
* region.
|
|
|
|
*/
|
|
|
|
if (mddev->bitmap) {
|
|
|
|
mddev->pers->quiesce(mddev, 1);
|
|
|
|
mddev->pers->quiesce(mddev, 0);
|
|
|
|
}
|
2006-06-26 07:27:40 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: %s of RAID array %s\n", desc, mdname(mddev));
|
|
|
|
pr_debug("md: minimum _guaranteed_ speed: %d KB/sec/disk.\n", speed_min(mddev));
|
|
|
|
pr_debug("md: using maximum available idle IO bandwidth (but not more than %d KB/sec) for %s.\n",
|
|
|
|
speed_max(mddev), desc);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-03-31 03:27:02 +00:00
|
|
|
is_mddev_idle(mddev, 1); /* this initializes IO event counters */
|
2006-06-26 07:27:40 +00:00
|
|
|
|
2005-06-22 00:17:13 +00:00
|
|
|
io_sectors = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
for (m = 0; m < SYNC_MARKS; m++) {
|
|
|
|
mark[m] = jiffies;
|
2005-06-22 00:17:13 +00:00
|
|
|
mark_cnt[m] = io_sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
last_mark = 0;
|
|
|
|
mddev->resync_mark = mark[last_mark];
|
|
|
|
mddev->resync_mark_cnt = mark_cnt[last_mark];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Tune reconstruction:
|
|
|
|
*/
|
2019-06-14 22:41:07 +00:00
|
|
|
window = 32 * (PAGE_SIZE / 512);
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: using %dk window, over a total of %lluk.\n",
|
|
|
|
window/2, (unsigned long long)max_sectors/2);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
atomic_set(&mddev->recovery_active, 0);
|
|
|
|
last_check = 0;
|
|
|
|
|
2023-02-01 07:59:20 +00:00
|
|
|
if (j >= MD_RESYNC_ACTIVE) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: resuming %s of %s from checkpoint.\n",
|
|
|
|
desc, mdname(mddev));
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev->curr_resync = j;
|
2012-10-11 03:25:57 +00:00
|
|
|
} else
|
2022-06-08 16:27:54 +00:00
|
|
|
mddev->curr_resync = MD_RESYNC_ACTIVE; /* no longer delayed */
|
2011-01-13 22:14:34 +00:00
|
|
|
mddev->curr_resync_completed = j;
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_completed);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2012-10-31 00:59:10 +00:00
|
|
|
update_time = jiffies;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
md: Add blk_plug in sync_thread.
Adding a blk_plug in sync_thread increases sync performance.
Because sync_thread did not use blk_plug, bios merged poorly during a
RAID sync.
Testing environment:
SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller.
OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
x86_64 x86_64 x86_64 GNU/Linux.
RAID5: four ST31000524NS disk.
Without blk_plug: recovery speed about 63M/Sec;
With blk_plug: recovery speed about 120M/Sec.
Using blktrace:
blktrace -d /dev/sdb -w 60 -o -|blkparse -i -
without blk_plug:
Total (8,16):
Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB
Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB
Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 0
add blk_plug:
Total (8,16):
Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB
Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB
Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 3384
The ratio of merge will be markedly increased.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 02:12:26 +00:00
|
|
|
blk_start_plug(&plug);
|
2005-04-16 22:20:36 +00:00
|
|
|
while (j < max_sectors) {
|
2005-06-22 00:17:13 +00:00
|
|
|
sector_t sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-06-22 00:17:13 +00:00
|
|
|
skipped = 0;
|
2009-03-31 03:33:13 +00:00
|
|
|
|
2009-05-26 02:57:21 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
((mddev->curr_resync > mddev->curr_resync_completed &&
|
|
|
|
(mddev->curr_resync - mddev->curr_resync_completed)
|
|
|
|
> (max_sectors >> 4)) ||
|
2012-10-31 00:59:10 +00:00
|
|
|
time_after_eq(jiffies, update_time + UPDATE_FREQUENCY) ||
|
2009-05-26 02:57:21 +00:00
|
|
|
(j - mddev->curr_resync_completed)*2
|
2015-07-17 02:06:02 +00:00
|
|
|
>= mddev->resync_max - mddev->curr_resync_completed ||
|
|
|
|
mddev->curr_resync_completed > mddev->resync_max
|
2009-05-26 02:57:21 +00:00
|
|
|
)) {
|
2009-03-31 03:33:13 +00:00
|
|
|
/* time to update curr_resync_completed */
|
|
|
|
wait_event(mddev->recovery_wait,
|
|
|
|
atomic_read(&mddev->recovery_active) == 0);
|
2011-01-13 22:14:34 +00:00
|
|
|
mddev->curr_resync_completed = j;
|
2012-10-31 00:59:10 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
|
|
|
|
j > mddev->recovery_cp)
|
|
|
|
mddev->recovery_cp = j;
|
2012-10-31 00:59:10 +00:00
|
|
|
update_time = jiffies;
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_completed);
|
2009-03-31 03:33:13 +00:00
|
|
|
}
|
2009-04-14 06:28:34 +00:00
|
|
|
|
2013-11-19 01:02:01 +00:00
|
|
|
while (j >= mddev->resync_max &&
|
|
|
|
!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
|
2009-07-01 03:15:35 +00:00
|
|
|
/* As this condition is controlled by user-space,
|
|
|
|
* we can block indefinitely, so use '_interruptible'
|
|
|
|
* to avoid triggering warnings.
|
|
|
|
*/
|
|
|
|
flush_signals(current); /* just in case */
|
|
|
|
wait_event_interruptible(mddev->recovery_wait,
|
|
|
|
mddev->resync_max > j
|
2013-11-19 01:02:01 +00:00
|
|
|
|| test_bit(MD_RECOVERY_INTR,
|
|
|
|
&mddev->recovery));
|
2009-07-01 03:15:35 +00:00
|
|
|
}
|
2009-04-14 06:28:34 +00:00
|
|
|
|
2013-11-19 01:02:01 +00:00
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
break;
|
2009-04-14 06:28:34 +00:00
|
|
|
|
2015-02-19 05:04:40 +00:00
|
|
|
sectors = mddev->pers->sync_request(mddev, j, &skipped);
|
2005-06-22 00:17:13 +00:00
|
|
|
if (sectors == 0) {
|
md: restart recovery cleanly after device failure.
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23 20:04:39 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2013-11-19 01:02:01 +00:00
|
|
|
break;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2005-06-22 00:17:13 +00:00
|
|
|
|
|
|
|
if (!skipped) { /* actual IO requested */
|
|
|
|
io_sectors += sectors;
|
|
|
|
atomic_add(sectors, &mddev->recovery_active);
|
|
|
|
}
|
|
|
|
|
2011-07-28 01:39:24 +00:00
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
break;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
j += sectors;
|
2015-07-24 03:27:08 +00:00
|
|
|
if (j > max_sectors)
|
|
|
|
/* when skipping, extra large numbers can be returned. */
|
|
|
|
j = max_sectors;
|
2023-02-01 07:59:20 +00:00
|
|
|
if (j >= MD_RESYNC_ACTIVE)
|
2012-10-11 03:25:57 +00:00
|
|
|
mddev->curr_resync = j;
|
2006-07-10 11:44:16 +00:00
|
|
|
mddev->curr_mark_cnt = io_sectors;
|
[PATCH] md: make /proc/mdstat pollable
With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.
It is similar to the poll-ability of /proc/mounts, though different in that:
We always report that the file is readable (because face it, it is, even if
only for EOF).
We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).
Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.
md_new_event takes an 'mddev' which isn't currently used, but it will be soon.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:20:30 +00:00
|
|
|
if (last_check == 0)
|
2011-07-28 01:39:24 +00:00
|
|
|
/* this is the earliest that rebuild will be
|
|
|
|
* visible in /proc/mdstat
|
|
|
|
*/
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2005-06-22 00:17:13 +00:00
|
|
|
|
|
|
|
if (last_check + window > io_sectors || j == max_sectors)
|
2005-04-16 22:20:36 +00:00
|
|
|
continue;
|
|
|
|
|
2005-06-22 00:17:13 +00:00
|
|
|
last_check = io_sectors;
|
2005-04-16 22:20:36 +00:00
|
|
|
repeat:
|
|
|
|
if (time_after_eq(jiffies, mark[last_mark] + SYNC_MARK_STEP )) {
|
|
|
|
/* step marks */
|
|
|
|
int next = (last_mark+1) % SYNC_MARKS;
|
|
|
|
|
|
|
|
mddev->resync_mark = mark[next];
|
|
|
|
mddev->resync_mark_cnt = mark_cnt[next];
|
|
|
|
mark[next] = jiffies;
|
2005-06-22 00:17:13 +00:00
|
|
|
mark_cnt[next] = io_sectors - atomic_read(&mddev->recovery_active);
|
2005-04-16 22:20:36 +00:00
|
|
|
last_mark = next;
|
|
|
|
}
|
|
|
|
|
2013-11-19 01:02:01 +00:00
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
break;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this loop exits only when we are either slower than
|
|
|
|
* the 'hard' speed limit, or the system was IO-idle for
|
|
|
|
* a jiffy.
|
|
|
|
* the system might be non-idle CPU-wise, but we only care
|
|
|
|
* about not overloading the IO subsystem. (things like an
|
|
|
|
* e2fsck being done on the RAID array should execute fast)
|
|
|
|
*/
|
|
|
|
cond_resched();
|
|
|
|
|
2014-08-07 13:37:41 +00:00
|
|
|
recovery_done = io_sectors - atomic_read(&mddev->recovery_active);
|
|
|
|
currspeed = ((unsigned long)(recovery_done - mddev->resync_mark_cnt))/2
|
2005-06-22 00:17:13 +00:00
|
|
|
/((jiffies-mddev->resync_mark)/HZ +1) +1;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-01-06 08:21:36 +00:00
|
|
|
if (currspeed > speed_min(mddev)) {
|
2015-02-19 05:55:00 +00:00
|
|
|
if (currspeed > speed_max(mddev)) {
|
2005-11-18 09:11:01 +00:00
|
|
|
msleep(500);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto repeat;
|
|
|
|
}
|
2015-02-19 05:55:00 +00:00
|
|
|
if (!is_mddev_idle(mddev, 0)) {
|
|
|
|
/*
|
|
|
|
* Give other IO more of a chance.
|
|
|
|
* The faster the devices, the less we wait.
|
|
|
|
*/
|
|
|
|
wait_event(mddev->recovery_wait,
|
|
|
|
!atomic_read(&mddev->recovery_active));
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: %s: %s %s.\n",mdname(mddev), desc,
|
|
|
|
test_bit(MD_RECOVERY_INTR, &mddev->recovery)
|
|
|
|
? "interrupted" : "done");
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* this also signals 'finished resyncing' to md_stop
|
|
|
|
*/
|
|
|
|
blk_finish_plug(&plug);
|
2005-04-16 22:20:36 +00:00
|
|
|
wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
|
|
|
|
|
2015-07-24 03:27:08 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
|
2022-06-08 16:27:54 +00:00
|
|
|
mddev->curr_resync >= MD_RESYNC_ACTIVE) {
|
2015-07-24 03:27:08 +00:00
|
|
|
mddev->curr_resync_completed = mddev->curr_resync;
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_completed);
|
2015-07-24 03:27:08 +00:00
|
|
|
}
|
2015-02-19 05:04:40 +00:00
|
|
|
mddev->pers->sync_request(mddev, max_sectors, &skipped);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
|
2023-01-31 07:07:19 +00:00
|
|
|
mddev->curr_resync > MD_RESYNC_ACTIVE) {
|
2006-06-26 07:27:40 +00:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
|
|
|
|
if (mddev->curr_resync >= mddev->recovery_cp) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: checkpointing %s of %s.\n",
|
|
|
|
desc, mdname(mddev));
|
2012-11-19 11:57:34 +00:00
|
|
|
if (test_bit(MD_RECOVERY_ERROR,
|
|
|
|
&mddev->recovery))
|
|
|
|
mddev->recovery_cp =
|
|
|
|
mddev->curr_resync_completed;
|
|
|
|
else
|
|
|
|
mddev->recovery_cp =
|
|
|
|
mddev->curr_resync;
|
2006-06-26 07:27:40 +00:00
|
|
|
}
|
|
|
|
} else
|
|
|
|
mddev->recovery_cp = MaxSector;
|
|
|
|
} else {
|
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
mddev->curr_resync = MaxSector;
|
2017-10-17 05:18:36 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
|
|
|
|
rcu_read_lock();
|
|
|
|
rdev_for_each_rcu(rdev, mddev)
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
mddev->delta_disks >= 0 &&
|
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(Faulty, &rdev->flags) &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset < mddev->curr_resync)
|
|
|
|
rdev->recovery_offset = mddev->curr_resync;
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2006-06-26 07:27:40 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2012-02-07 01:01:51 +00:00
|
|
|
skip:
|
2016-06-03 03:32:04 +00:00
|
|
|
/* set CHANGE_PENDING here since maybe another update is needed,
|
|
|
|
* so other nodes are informed. It should be harmless for normal
|
|
|
|
* raid */
|
2016-12-08 23:48:19 +00:00
|
|
|
set_mask_bits(&mddev->sb_flags, 0,
|
|
|
|
BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
|
2015-09-30 18:20:35 +00:00
|
|
|
|
md: fix a potential deadlock of raid5/raid10 reshape
There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.
How does the deadlock happen?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
write through critical data into raid5 emulated block device. So it
waits for raid5 kernel thread handling stripes in order to finish its
I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
called first in order to reap the raid5 resync thread. That is,
raid5_finish_reshape() will be called. In this function, it will try
to update conf and call VFS revalidate_disk() to grow the raid5
emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.
The raid5 kernel thread is an emulated block device. It is responsible for
handling I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.
For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.
2017/12/29:
Thanks to Guoqing Jiang <gqjiang@suse.com>,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since it has been called
before.
Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-22 05:34:46 +00:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
|
|
|
|
mddev->delta_disks > 0 &&
|
|
|
|
mddev->pers->finish_reshape &&
|
|
|
|
mddev->pers->size &&
|
|
|
|
mddev->queue) {
|
|
|
|
mddev_lock_nointr(mddev);
|
|
|
|
md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
|
|
|
|
mddev_unlock(mddev);
|
2020-11-16 14:57:11 +00:00
|
|
|
if (!mddev_is_clustered(mddev))
|
|
|
|
set_capacity_and_notify(mddev->gendisk,
|
|
|
|
mddev->array_sectors);
|
|
|
|
}
|
|
|
|
|
2014-12-15 01:57:01 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2009-12-14 01:49:48 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
|
|
|
|
/* We completed so min/max setting can be forgotten if used. */
|
|
|
|
if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
|
|
|
|
mddev->resync_min = 0;
|
|
|
|
mddev->resync_max = MaxSector;
|
|
|
|
} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
|
|
|
|
mddev->resync_min = mddev->curr_resync_completed;
|
2015-07-02 07:12:58 +00:00
|
|
|
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
|
2022-06-08 16:27:54 +00:00
|
|
|
mddev->curr_resync = MD_RESYNC_NONE;
|
2014-12-15 01:57:01 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
wake_up(&resync_wait);
|
md/raid5: fix a deadlock in the case that reshape is interrupted
If reshape is in progress and io across reshape_position is issued, such
io will wait for reshape to make progress(see details in the case that
make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY).
It has been reported several times that if the system reboots while growing
raid5 to raid6, array assembly will hang indefinitely ([1, 2]). This is
because the following deadlock is triggered:
1) a normal io is waiting for reshape to progress, this io can be from
system-udevd or mdadm.
2) while assemble, mdadm tries to suspend the array, hence
'reconfig_mutex' is held and mddev_suspend() must wait for normal io
to be done.
3) daemon thread can't start reshape because 'reconfig_mutex' can't be
held.
1) and 3) cannot be broken because they are fundamental to the design. To
break 2), the following are possible solutions that I can think of:
a) Let mddev_suspend() fail is not a good option, because this will
break many scenarios since mddev_suspend() doesn't fail before.
b) Fail the io that is waiting for reshape to make progress from
mddev_suspend().
c) Return false for the io that is waiting for reshape to make
progress from raid5_make_request(), and these io will wait for
suspend to be done in md_handle_request(), where 'active_io' is
not grabbed.
c) sounds better than b), however, b) is used because it's easy and
straightforward, and it's verified that mdadm can assemble in this case.
On the other hand, c) breaks the logic that mddev_suspend() will wait
for submitted io to be completely handled.
Fix the problem by checking reshape in mddev_suspend(), if reshape can't
make progress and there are still some io waiting for reshape, fail
those io.
[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/
Reported-by: Jove <jovetoo@gmail.com>
Reported-by: David Gilmour <dgilmour76@gmail.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
2023-05-12 01:56:10 +00:00
|
|
|
wake_up(&mddev->sb_wait);
|
2005-04-16 22:20:36 +00:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2008-02-06 09:39:52 +00:00
|
|
|
return;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2006-03-27 09:18:10 +00:00
|
|
|
EXPORT_SYMBOL_GPL(md_do_sync);
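The throttling arithmetic in md_do_sync() above can be tried out in isolation. What follows is a minimal userspace sketch, not kernel code: MODEL_HZ, the model_* names, and the idle flag passed in by the caller are all assumptions made for illustration, standing in for HZ, the in-kernel variables, and is_mddev_idle(). It reproduces the window size, the currspeed estimate (KB/sec, from 512-byte sectors), and the three-way decision between continuing, waiting for in-flight I/O, and sleeping.

```c
#include <stdint.h>

/* Hypothetical userspace model of the resync throttle; not kernel code. */
enum { MODEL_HZ = 100 };	/* assumed tick rate, stands in for HZ */

enum model_action { MODEL_CONTINUE, MODEL_WAIT_FOR_IO, MODEL_SLEEP };

/* Window in 512-byte sectors, mirroring "window = 32 * (PAGE_SIZE / 512)". */
static unsigned long model_window_sectors(unsigned long page_size)
{
	return 32 * (page_size / 512);
}

/*
 * Speed estimate in KB/sec since the last mark. Sectors are halved to get
 * KB, and the two "+ 1" terms avoid a zero divisor and a zero result.
 */
static unsigned long model_currspeed(uint64_t recovery_done, uint64_t mark_cnt,
				     unsigned long now, unsigned long mark)
{
	return ((unsigned long)(recovery_done - mark_cnt)) / 2
		/ ((now - mark) / MODEL_HZ + 1) + 1;
}

/*
 * Decision taken once per window: above the maximum we sleep and re-check,
 * between minimum and maximum we yield only if other I/O is active, and at
 * or below the guaranteed minimum we never back off.
 */
static enum model_action model_throttle(unsigned long speed,
					unsigned long speed_min,
					unsigned long speed_max, int idle)
{
	if (speed > speed_min) {
		if (speed > speed_max)
			return MODEL_SLEEP;
		if (!idle)
			return MODEL_WAIT_FOR_IO;
	}
	return MODEL_CONTINUE;
}
```

With a 4096-byte page the window comes out at 256 sectors, which matches the "window/2" value of 128k printed by the pr_debug() call earlier in the function.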
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-08-25 03:16:19 +00:00
|
|
|
static bool rdev_removeable(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
/* rdev is not used. */
|
|
|
|
if (rdev->raid_disk < 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* There is still inflight io, don't remove this rdev. */
|
|
|
|
if (atomic_read(&rdev->nr_pending))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* An error occurred but has not yet been acknowledged by the metadata
|
|
|
|
* handler, don't remove this rdev.
|
|
|
|
*/
|
|
|
|
if (test_bit(Blocked, &rdev->flags))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* Faulty rdev is not used, it's safe to remove it. */
|
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* Journal disk can only be removed if it's faulty. */
|
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 'In_sync' is cleared while 'raid_disk' is valid, which means
|
|
|
|
* replacement has just become active from pers->spare_active(), and
|
|
|
|
* then pers->hot_remove_disk() will replace this rdev with replacement.
|
|
|
|
*/
|
|
|
|
if (!test_bit(In_sync, &rdev->flags))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
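The order of the checks in rdev_removeable() matters: the three "keep it" tests run before the Faulty test, so a faulty device with pending I/O or an unacknowledged error is still retained. The sketch below models that ordering in plain userspace C. The flag bits, struct model_rdev, and model_removeable() are hypothetical stand-ins for the kernel's rdev->flags bits and struct md_rdev, chosen only for illustration.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the kernel's rdev->flags bits. */
enum model_flag {
	M_FAULTY  = 1u << 0,
	M_IN_SYNC = 1u << 1,
	M_BLOCKED = 1u << 2,
	M_JOURNAL = 1u << 3,
};

/* Hypothetical, reduced stand-in for struct md_rdev. */
struct model_rdev {
	int raid_disk;		/* < 0 means the array is not using it */
	int nr_pending;		/* in-flight I/O count */
	unsigned int flags;
};

/* Same decision order as rdev_removeable() above. */
static bool model_removeable(const struct model_rdev *r)
{
	if (r->raid_disk < 0)
		return false;	/* not in the array, nothing to remove */
	if (r->nr_pending)
		return false;	/* I/O still in flight */
	if (r->flags & M_BLOCKED)
		return false;	/* error not yet acknowledged */
	if (r->flags & M_FAULTY)
		return true;	/* faulty and idle: safe to remove */
	if (r->flags & M_JOURNAL)
		return false;	/* journal only goes when faulty */
	/* Active slot without In_sync: a replacement has taken over. */
	return !(r->flags & M_IN_SYNC);
}
```

A healthy, in-sync member is therefore never removable, while a faulty member becomes removable only once its pending count drops to zero and Blocked is cleared.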
|
|
|
|
|
2023-08-25 03:16:20 +00:00
|
|
|
static bool rdev_is_spare(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
return !test_bit(Candidate, &rdev->flags) && rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(Faulty, &rdev->flags);
|
|
|
|
}
|
|
|
|
|
2023-08-25 03:16:21 +00:00
|
|
|
static bool rdev_addable(struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
/* rdev is already used, don't add it again. */
|
|
|
|
if (test_bit(Candidate, &rdev->flags) || rdev->raid_disk >= 0 ||
|
|
|
|
test_bit(Faulty, &rdev->flags))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* Allow adding a journal disk. */
|
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* Allow adding if the array is read-write. */
|
|
|
|
if (md_is_rdwr(rdev->mddev))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For a read-only array, only allow re-adding an rdev. And if a bitmap is
|
|
|
|
* used, don't allow re-adding an rdev that is too old.
|
|
|
|
*/
|
|
|
|
if (rdev->saved_raid_disk >= 0 && !test_bit(Bitmap_sync, &rdev->flags))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
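The read-only rule at the end of rdev_addable() is the least obvious part: on a read-only array only a device that previously held a slot may come back, and not if it needs a bitmap-based resync. Here is a small userspace sketch of the whole predicate. All names (the A_* bits, struct add_rdev, model_addable(), and the array_rdwr field that replaces the md_is_rdwr(rdev->mddev) call) are assumptions for illustration rather than kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the kernel's rdev->flags bits. */
enum add_flag {
	A_FAULTY      = 1u << 0,
	A_JOURNAL     = 1u << 1,
	A_CANDIDATE   = 1u << 2,
	A_BITMAP_SYNC = 1u << 3,
};

/* Hypothetical, reduced stand-in for struct md_rdev plus array state. */
struct add_rdev {
	int raid_disk;		/* >= 0 means already in a slot */
	int saved_raid_disk;	/* >= 0 means it held a slot before */
	unsigned int flags;
	bool array_rdwr;	/* stands in for md_is_rdwr(rdev->mddev) */
};

/* Same decision order as rdev_addable() above. */
static bool model_addable(const struct add_rdev *r)
{
	/* Already used, still a candidate, or faulty: never add. */
	if ((r->flags & (A_CANDIDATE | A_FAULTY)) || r->raid_disk >= 0)
		return false;
	if (r->flags & A_JOURNAL)
		return true;	/* journal disks may always be added */
	if (r->array_rdwr)
		return true;	/* read-write array accepts any spare */
	/* Read-only array: re-add only, and not if the device is too old. */
	return r->saved_raid_disk >= 0 && !(r->flags & A_BITMAP_SYNC);
}
```

In this model a brand-new spare (saved_raid_disk < 0) is accepted by a read-write array but refused by a read-only one, which is the distinction the final check encodes.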
|
|
|
|
|
md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for a read-only array:
md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then
calls remove_and_add_spares() directly to try to remove and add rdevs
from the array.
After this patch:
1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, the
worker 'sync_work' is not pending, and there are rdevs that can be added
or removed, then it queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits;
This change makes sure that array reconfiguration is independent of the
daemon, and makes it much easier to synchronize with io, considering
that io may rely on the daemon thread to complete.
Also fix a problem where 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must
be activated first, and then remove_and_add_spares() can add spares to the
array, as the read-write case does:
1) daemon sets 'MD_RECOVERY_RUNNING' and registers a new sync thread to do
recovery;
2) recovery is done, md_do_sync() sets 'MD_RECOVERY_DONE' before returning;
3) daemon calls 'pers->spare_active' and clears 'MD_RECOVERY_RUNNING';
4) in the next round of the daemon, call remove_and_add_spares() to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-08-25 03:16:22 +00:00
|
|
|
static bool md_spares_need_change(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (rdev_removeable(rdev) || rdev_addable(rdev))
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2013-04-24 01:42:41 +00:00
|
|
|
static int remove_and_add_spares(struct mddev *mddev,
|
|
|
|
struct md_rdev *this)
|
2007-03-01 04:11:48 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2007-03-01 04:11:48 +00:00
|
|
|
int spares = 0;
|
2012-01-08 13:46:41 +00:00
|
|
|
int removed = 0;
|
2016-06-02 06:19:53 +00:00
|
|
|
bool remove_some = false;
|
2007-03-01 04:11:48 +00:00
|
|
|
|
2018-02-02 22:19:30 +00:00
|
|
|
if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
|
|
|
/* Mustn't remove devices when resync thread is running */
|
|
|
|
return 0;
|
|
|
|
|
2016-06-02 06:19:53 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if ((this == NULL || rdev == this) &&
|
|
|
|
rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Blocked, &rdev->flags) &&
|
|
|
|
test_bit(Faulty, &rdev->flags) &&
|
|
|
|
atomic_read(&rdev->nr_pending)==0) {
|
|
|
|
/* Faulty non-Blocked devices with nr_pending == 0
|
|
|
|
* never get nr_pending incremented,
|
|
|
|
* never get Faulty cleared, and never get Blocked set.
|
|
|
|
* So we can synchronize_rcu now rather than once per device
|
|
|
|
*/
|
|
|
|
remove_some = true;
|
|
|
|
set_bit(RemoveSynchronized, &rdev->flags);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (remove_some)
|
|
|
|
synchronize_rcu();
|
|
|
|
rdev_for_each(rdev, mddev) {
|
2013-04-24 01:42:41 +00:00
|
|
|
if ((this == NULL || rdev == this) &&
|
2023-08-25 03:16:19 +00:00
|
|
|
(test_bit(RemoveSynchronized, &rdev->flags) ||
|
|
|
|
rdev_removeable(rdev))) {
|
2007-03-01 04:11:48 +00:00
|
|
|
if (mddev->pers->hot_remove_disk(
|
2011-12-22 23:17:51 +00:00
|
|
|
mddev, rdev) == 0) {
|
2011-07-27 01:00:36 +00:00
|
|
|
sysfs_unlink_rdev(mddev, rdev);
|
2018-04-26 04:46:29 +00:00
|
|
|
rdev->saved_raid_disk = rdev->raid_disk;
|
2007-03-01 04:11:48 +00:00
|
|
|
rdev->raid_disk = -1;
|
2012-01-08 13:46:41 +00:00
|
|
|
removed++;
|
2007-03-01 04:11:48 +00:00
|
|
|
}
|
|
|
|
}
|
2016-06-02 06:19:53 +00:00
|
|
|
if (remove_some && test_bit(RemoveSynchronized, &rdev->flags))
|
|
|
|
clear_bit(RemoveSynchronized, &rdev->flags);
|
|
|
|
}
|
|
|
|
|
2013-03-07 22:24:26 +00:00
|
|
|
if (removed && mddev->kobj.sd)
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
|
2007-03-01 04:11:48 +00:00
|
|
|
|
2015-09-28 15:27:26 +00:00
|
|
|
if (this && removed)
|
2013-04-24 01:42:41 +00:00
|
|
|
goto no_add;
|
|
|
|
|
2012-03-19 01:46:39 +00:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2015-09-28 15:27:26 +00:00
|
|
|
if (this && this != rdev)
|
|
|
|
continue;
|
2023-08-25 03:16:20 +00:00
|
|
|
if (rdev_is_spare(rdev))
|
|
|
|
spares++;
|
2023-08-25 03:16:21 +00:00
|
|
|
if (!rdev_addable(rdev))
|
md: Allow devices to be re-added to a read-only array.
When assembling an array incrementally we might want to make
the device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.
We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.
Currently an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto. This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.
So:
- remove the "md_update_sb()" call from add_new_disk(). This isn't
really needed as just adding a disk doesn't require a metadata
update. Instead, just set MD_CHANGE_DEVS. This will effect a
metadata update soon enough, once the array is not read-only.
- Allow the ADD_NEW_DISK ioctl to succeed without activating a
read-auto array, providing the MD_DISK_SYNC flag is set.
In this case, the device will be rejected if it cannot be added
with the correct device number, or has an incorrect event count.
- Teach remove_and_add_spares() to be careful about adding spares
when the array is read-only (or read-mostly) - only add devices
that are thought to be in-sync, and only do it if the array is
in-sync itself.
- In md_check_recovery, use remove_and_add_spares in the read-only
case, rather than open coding just the 'remove' part of it.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 01:42:42 +00:00
|
|
|
continue;
|
2023-08-25 03:16:21 +00:00
|
|
|
if (!test_bit(Journal, &rdev->flags))
|
2015-12-20 23:51:02 +00:00
|
|
|
rdev->recovery_offset = 0;
|
2020-04-04 21:57:11 +00:00
|
|
|
if (mddev->pers->hot_add_disk(mddev, rdev) == 0) {
|
2020-07-16 04:54:40 +00:00
|
|
|
/* failure here is OK */
|
|
|
|
sysfs_link_rdev(mddev, rdev);
|
2015-12-20 23:51:02 +00:00
|
|
|
if (!test_bit(Journal, &rdev->flags))
|
|
|
|
spares++;
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
md: restart recovery cleanly after device failure.
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23 20:04:39 +00:00
|
|
|
}
|
2007-03-01 04:11:48 +00:00
|
|
|
}
|
2013-04-24 01:42:41 +00:00
|
|
|
no_add:
|
2012-09-19 02:54:22 +00:00
|
|
|
if (removed)
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2007-03-01 04:11:48 +00:00
|
|
|
return spares;
|
|
|
|
}
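The mark-then-remove pattern in remove_and_add_spares() can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_dev`, `model_barrier()` and `model_remove_faulty()` are hypothetical names that do not exist in md.c, the barrier merely counts calls in place of synchronize_rcu(), and the model ignores both rdev_removeable() and the possibility that hot_remove_disk() fails.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct md_rdev. */
struct model_dev {
	int  raid_disk;   /* slot in the array, -1 when not a member */
	bool faulty;      /* models the Faulty flag */
	bool blocked;     /* models the Blocked flag */
	int  nr_pending;  /* in-flight requests */
	bool marked;      /* models the RemoveSynchronized flag */
};

static int model_barrier_calls;

/* Stand-in for synchronize_rcu(); it only counts how often it is called. */
static void model_barrier(void)
{
	model_barrier_calls++;
}

/* Returns the number of devices taken out of their slots. */
static int model_remove_faulty(struct model_dev *devs, size_t n)
{
	bool any = false;
	int removed = 0;
	size_t i;

	/* Pass 1: mark every faulty, non-blocked, idle member. */
	for (i = 0; i < n; i++) {
		struct model_dev *d = &devs[i];

		if (d->raid_disk >= 0 && d->faulty && !d->blocked &&
		    d->nr_pending == 0) {
			d->marked = true;
			any = true;
		}
	}

	/* A single barrier covers all marked devices. */
	if (any)
		model_barrier();

	/* Pass 2: remove the marked devices and clear the mark. */
	for (i = 0; i < n; i++) {
		struct model_dev *d = &devs[i];

		if (!d->marked)
			continue;
		d->raid_disk = -1;
		d->marked = false;
		removed++;
	}
	return removed;
}
```

The point of the two passes is that the cost of the barrier is paid once per call rather than once per device, which is what the comment above the `remove_some = true` assignment describes.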
|
2011-01-13 22:14:33 +00:00
|
|
|
|
2023-08-25 03:16:17 +00:00
|
|
|
static bool md_choose_sync_action(struct mddev *mddev, int *spares)
|
|
|
|
{
|
|
|
|
/* Check if reshape is in progress first. */
|
|
|
|
if (mddev->reshape_position != MaxSector) {
|
|
|
|
if (mddev->pers->check_reshape == NULL ||
|
|
|
|
mddev->pers->check_reshape(mddev) != 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove any failed drives, then add spares if possible. Spares are
|
|
|
|
* also removed and re-added, to allow the personality to fail the
|
|
|
|
* re-add.
|
|
|
|
*/
|
|
|
|
*spares = remove_and_add_spares(mddev, NULL);
|
|
|
|
if (*spares) {
|
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
|
|
|
|
|
|
|
|
/* Start new recovery. */
|
|
|
|
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check if recovery is in progress. */
|
|
|
|
if (mddev->recovery_cp < MaxSector) {
|
|
|
|
set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Defer choosing between resync/check/repair to md_do_sync(). */
|
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* Nothing to be done */
|
|
|
|
return false;
|
|
|
|
}
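The priority order that md_choose_sync_action() applies (reshape first, then recovery onto newly added spares, then resync, then an already requested sync) can be modelled outside the kernel. The following is a minimal sketch, not the kernel implementation: `model_array`, `model_action` and `model_choose_action()` are hypothetical names, and the fields only approximate the state the real function reads from struct mddev.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MaxSector. */
#define MODEL_MAX_SECTOR UINT64_MAX

enum model_action {
	MODEL_NONE,
	MODEL_RESHAPE,
	MODEL_RECOVER,
	MODEL_SYNC,
};

/* Hypothetical snapshot of the state the decision depends on. */
struct model_array {
	uint64_t reshape_position; /* MODEL_MAX_SECTOR when no reshape is pending */
	bool     reshape_ok;       /* models check_reshape() existing and returning 0 */
	int      spares_added;     /* models the result of remove_and_add_spares() */
	uint64_t recovery_cp;      /* MODEL_MAX_SECTOR when the array is in sync */
	bool     sync_requested;   /* models MD_RECOVERY_SYNC already being set */
};

static enum model_action model_choose_action(const struct model_array *a)
{
	/* A pending reshape wins, or blocks everything if it cannot proceed. */
	if (a->reshape_position != MODEL_MAX_SECTOR)
		return a->reshape_ok ? MODEL_RESHAPE : MODEL_NONE;

	/* Newly added spares need a recovery. */
	if (a->spares_added > 0)
		return MODEL_RECOVER;

	/* An unfinished resync checkpoint means the array is not in sync. */
	if (a->recovery_cp < MODEL_MAX_SECTOR)
		return MODEL_SYNC;

	/* A sync was requested explicitly; its exact kind is decided later. */
	if (a->sync_requested)
		return MODEL_SYNC;

	return MODEL_NONE;
}
```

Note that in this model a pending reshape that cannot proceed returns MODEL_NONE even when spares are available, mirroring the early `return false` in the function above.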
|
|
|
|
|
2014-09-29 22:10:42 +00:00
|
|
|
static void md_start_sync(struct work_struct *ws)
|
|
|
|
{
|
2023-08-25 03:16:16 +00:00
|
|
|
struct mddev *mddev = container_of(ws, struct mddev, sync_work);
|
md: delay choosing sync action to md_start_sync()
Before this patch, for a read-write array:
1) md_check_recovery() finds that something needs to be done, and it'll
try to grab 'reconfig_mutex'. The cases in which md_check_recovery() needs
to do something:
- array is not suspended;
- super_block needs to be updated;
- 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set;
- unusual case related to safemode;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
md_check_recovery() will try to choose a sync action, and then queue a
work md_start_sync().
3) md_start_sync() registers sync_thread;
After this patch,
1) is the same;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
queue a work md_start_sync() directly;
3) md_start_sync() will try to choose a sync action, and then register
sync_thread();
Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2)
and 3) and md_do_sync() always run serially and can never run
concurrently, so this change should not introduce any behavior change for now.
Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING'
without protection in the error path, which might affect the logic in
md_check_recovery().
The advantage of this change is that array reconfiguration is
independent from the daemon now, and it'll be much easier to synchronize it
with io, considering that io may rely on the daemon thread to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com
2023-08-25 03:16:18 +00:00
|
|
|
int spares = 0;
|
|
|
|
|
|
|
|
mddev_lock_nointr(mddev);
|
|
|
|
|
md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for a read-only array:
md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then
calls remove_and_add_spares() directly to try to remove and add rdevs
from the array.
After this patch:
1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, that the
worker 'sync_work' is not pending, and that there are rdevs that can be
added or removed, then it queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits;
This change makes sure that array reconfiguration is independent from
the daemon, and it'll be much easier to synchronize it with io, considering
that io may rely on the daemon thread to be done.
Also fix a problem that 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must
be activated first, and then remove_and_add_spares() can add spares to
the array, like what the read-write case does:
1) daemon sets 'MD_RECOVERY_RUNNING', registers a new sync thread to do
recovery;
2) recovery is done, md_do_sync() sets 'MD_RECOVERY_DONE' before returning;
3) daemon calls 'pers->spare_active', and clears 'MD_RECOVERY_RUNNING';
4) in the next round of the daemon, call remove_and_add_spares() to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-08-25 03:16:22 +00:00
|
|
|
if (!md_is_rdwr(mddev)) {
|
|
|
|
/*
|
|
|
|
* On a read-only array we can:
|
|
|
|
* - remove failed devices
|
|
|
|
* - add already-in_sync devices if the array itself is in-sync.
|
|
|
|
* As we only add devices that are already in-sync, we can
|
|
|
|
* activate the spares immediately.
|
|
|
|
*/
|
|
|
|
remove_and_add_spares(mddev, NULL);
|
|
|
|
goto not_running;
|
|
|
|
}
|
|
|
|
|
2023-08-25 03:16:18 +00:00
|
|
|
if (!md_choose_sync_action(mddev, &spares))
|
|
|
|
goto not_running;
|
|
|
|
|
|
|
|
if (!mddev->pers->sync_request)
|
|
|
|
goto not_running;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We are adding a device or devices to an array which has the bitmap
|
|
|
|
* stored on all devices. So make sure all bitmap pages get written.
|
|
|
|
*/
|
|
|
|
if (spares)
|
|
|
|
md_bitmap_write_all(mddev->bitmap);
|
2015-09-30 18:20:35 +00:00
|
|
|
|
2023-05-23 02:10:17 +00:00
|
|
|
rcu_assign_pointer(mddev->sync_thread,
|
|
|
|
md_register_thread(md_do_sync, mddev, "resync"));
|
2014-09-29 22:10:42 +00:00
|
|
|
if (!mddev->sync_thread) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_warn("%s: could not start resync thread...\n",
|
|
|
|
mdname(mddev));
|
2014-09-29 22:10:42 +00:00
|
|
|
/* leave the spares where they are, it shouldn't hurt */
|
2023-08-25 03:16:18 +00:00
|
|
|
goto not_running;
|
|
|
|
}
|
|
|
|
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
2014-09-29 22:10:42 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_action);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2023-08-25 03:16:18 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
not_running:
|
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
|
|
|
|
wake_up(&resync_wait);
|
|
|
|
if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
|
|
|
|
mddev->sysfs_action)
|
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_action);
|
2014-09-29 22:10:42 +00:00
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* This routine is regularly called by all per-raid-array threads to
|
|
|
|
* deal with generic issues like resync and super-block update.
|
|
|
|
* Raid personalities that don't have a thread (linear/raid0) do not
|
|
|
|
* need this as they never do any recovery or update the superblock.
|
|
|
|
*
|
|
|
|
* It does not do any resync itself, but rather "forks" off other threads
|
|
|
|
* to do that as needed.
|
|
|
|
* When it is determined that resync is needed, we set MD_RECOVERY_RUNNING in
|
|
|
|
* "->recovery" and create a thread at ->sync_thread.
|
2008-05-23 20:04:39 +00:00
|
|
|
* When the thread finishes it sets MD_RECOVERY_DONE
|
2005-04-16 22:20:36 +00:00
|
|
|
* and wakes up this thread which will reap the thread and finish up.
|
|
|
|
* This thread also removes any faulty devices (with nr_pending == 0).
|
|
|
|
*
|
|
|
|
* The overall approach is:
|
|
|
|
* 1/ if the superblock needs updating, update it.
|
|
|
|
* 2/ If a recovery thread is running, don't do anything else.
|
|
|
|
* 3/ If recovery has finished, clean up, possibly marking spares active.
|
|
|
|
* 4/ If there are any faulty devices, remove them.
|
|
|
|
* 5/ If array is degraded, try to add spare devices
|
|
|
|
* 6/ If array has spares or is not in-sync, start a resync thread.
|
|
|
|
*/
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_check_recovery(struct mddev *mddev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2018-10-03 05:04:41 +00:00
|
|
|
if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
|
|
|
|
/* Write superblock - thread that called mddev_suspend()
|
|
|
|
* holds reconfig_mutex for us.
|
|
|
|
*/
|
|
|
|
set_bit(MD_UPDATING_SB, &mddev->flags);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags))
|
|
|
|
md_update_sb(mddev, 0);
|
|
|
|
clear_bit_unlock(MD_UPDATING_SB, &mddev->flags);
|
|
|
|
wake_up(&mddev->sb_wait);
|
|
|
|
}
|
|
|
|
|
2023-01-31 05:17:09 +00:00
|
|
|
if (is_md_suspended(mddev))
|
2011-06-08 05:10:08 +00:00
|
|
|
return;
|
|
|
|
|
2005-06-22 00:17:16 +00:00
|
|
|
if (mddev->bitmap)
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_daemon_work(mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-06-22 00:17:11 +00:00
|
|
|
if (signal_pending(current)) {
|
2008-04-30 07:52:30 +00:00
|
|
|
if (mddev->pers->sync_request && !mddev->external) {
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: %s in immediate safe mode\n",
|
|
|
|
mdname(mddev));
|
2005-06-22 00:17:11 +00:00
|
|
|
mddev->safemode = 2;
|
|
|
|
}
|
|
|
|
flush_signals(current);
|
|
|
|
}
|
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev) &&
|
|
|
|
!test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
|
2008-08-05 05:54:13 +00:00
|
|
|
return;
|
2005-04-16 22:20:36 +00:00
|
|
|
if ( ! (
|
2016-12-08 23:48:19 +00:00
|
|
|
(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
|
2005-04-16 22:20:36 +00:00
|
|
|
test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
|
2005-06-22 00:17:11 +00:00
|
|
|
test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
|
2008-04-30 07:52:30 +00:00
|
|
|
(mddev->external == 0 && mddev->safemode == 1) ||
|
MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.
We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.
It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.
mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.
md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 03:05:14 +00:00
|
|
|
(mddev->safemode == 2
|
2005-06-22 00:17:11 +00:00
|
|
|
&& !mddev->in_sync && mddev->recovery_cp == MaxSector)
|
2005-04-16 22:20:36 +00:00
|
|
|
))
|
|
|
|
return;
|
2005-06-22 00:17:11 +00:00
|
|
|
|
2006-03-27 09:18:20 +00:00
|
|
|
if (mddev_trylock(mddev)) {
|
2019-08-20 00:21:09 +00:00
|
|
|
bool try_set_sync = mddev->safemode != 0;
|
2005-06-22 00:17:11 +00:00
|
|
|
|
2017-08-12 03:34:45 +00:00
|
|
|
if (!mddev->external && mddev->safemode == 1)
|
md: always clear ->safemode when md_check_recovery gets the mddev lock.
If ->safemode == 1, md_check_recovery() will try to get the mddev lock
and perform various other checks.
If mddev->in_sync is zero, it will call set_in_sync, and clear
->safemode. However if mddev->in_sync is not zero, ->safemode will not
be cleared.
When md_check_recovery() drops the mddev lock, the thread is woken
up again. Normally it would just check if there was anything else to
do, find nothing, and go to sleep. However as ->safemode was not
cleared, it will take the mddev lock again, then wake itself up
when unlocking.
This results in an infinite loop, repeatedly calling
md_check_recovery(), which RCU or the soft-lockup detector
will eventually complain about.
Prior to commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), safemode would only be set to one when the
writes_pending counter reached zero, and would be cleared again
when writes_pending is incremented. Since that patch, safemode
is set more freely, but is not reliably cleared.
So in md_check_recovery() clear ->safemode before checking ->in_sync.
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (4.12+)
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reported-by: David R <david@unsolicited.net>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08 06:56:36 +00:00
|
|
|
mddev->safemode = 0;
|
|
|
|
|
2022-09-20 02:39:38 +00:00
|
|
|
if (!md_is_rdwr(mddev)) {
|
2015-06-17 02:31:46 +00:00
|
|
|
struct md_rdev *rdev;
|
2023-08-25 03:16:22 +00:00
|
|
|
|
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
|
|
|
/* sync_work already queued. */
|
|
|
|
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
2015-06-17 02:31:46 +00:00
|
|
|
if (!mddev->external && mddev->in_sync)
|
2023-08-25 03:16:22 +00:00
|
|
|
/*
|
|
|
|
* 'Blocked' flag not needed as failed devices
|
2015-06-17 02:31:46 +00:00
|
|
|
* will be recorded if array switched to read/write.
|
|
|
|
* Leaving it set will prevent the device
|
|
|
|
* from being removed.
|
|
|
|
*/
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
clear_bit(Blocked, &rdev->flags);
|
2023-08-25 03:16:22 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* There is no thread, but we need to call
|
2013-12-11 23:13:33 +00:00
|
|
|
* ->spare_active and clear saved_raid_disk
|
|
|
|
*/
|
2014-05-29 01:40:03 +00:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
2022-06-07 02:03:56 +00:00
|
|
|
md_reap_sync_thread(mddev);
|
2023-08-25 03:16:22 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Let md_start_sync() remove and add rdevs to the
|
|
|
|
* array.
|
|
|
|
*/
|
|
|
|
if (md_spares_need_change(mddev)) {
|
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
queue_work(md_misc_wq, &mddev->sync_work);
|
|
|
|
}
|
|
|
|
|
2015-07-17 01:57:30 +00:00
|
|
|
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
|
2013-12-11 23:13:33 +00:00
|
|
|
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2016-12-08 23:48:19 +00:00
|
|
|
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
2023-08-25 03:16:22 +00:00
|
|
|
|
2008-08-05 05:54:13 +00:00
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
2015-12-20 23:50:59 +00:00
|
|
|
if (mddev_is_clustered(mddev)) {
|
2021-04-08 07:44:15 +00:00
|
|
|
struct md_rdev *rdev, *tmp;
|
2015-12-20 23:50:59 +00:00
|
|
|
/* kick the device if another node issued a
|
|
|
|
* remove disk.
|
|
|
|
*/
|
2021-04-08 07:44:15 +00:00
|
|
|
rdev_for_each_safe(rdev, tmp, mddev) {
|
2015-12-20 23:50:59 +00:00
|
|
|
if (test_and_clear_bit(ClusterRemove, &rdev->flags) &&
|
|
|
|
rdev->raid_disk < 0)
|
|
|
|
md_kick_rdev_from_array(rdev);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-08-20 00:21:09 +00:00
|
|
|
if (try_set_sync && !mddev->external && !mddev->in_sync) {
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2017-03-15 03:05:14 +00:00
|
|
|
set_in_sync(mddev);
|
2014-12-15 01:56:56 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2005-06-22 00:17:11 +00:00
|
|
|
}
|
|
|
|
|
2016-12-08 23:48:19 +00:00
|
|
|
if (mddev->sb_flags)
|
2006-10-03 08:15:46 +00:00
|
|
|
md_update_sb(mddev, 0);
|
2005-06-22 00:17:12 +00:00
|
|
|
|
2023-05-29 13:20:37 +00:00
|
|
|
/*
|
|
|
|
* Never start a new sync thread if MD_RECOVERY_RUNNING is
|
|
|
|
* still set.
|
|
|
|
*/
|
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
|
|
|
|
if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
|
|
|
|
/* resync/recovery still happening */
|
|
|
|
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(!mddev->sync_thread))
|
|
|
|
goto unlock;
|
|
|
|
|
2022-06-07 02:03:56 +00:00
|
|
|
md_reap_sync_thread(mddev);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto unlock;
|
|
|
|
}
|
2023-05-29 13:20:37 +00:00
|
|
|
|
2008-06-27 22:31:41 +00:00
|
|
|
/* Set RUNNING before clearing NEEDED to avoid
|
|
|
|
* any transients in the value of "sync_action".
|
|
|
|
*/
|
2012-10-11 03:25:57 +00:00
|
|
|
mddev->curr_resync_completed = 0;
|
2014-12-15 01:57:01 +00:00
|
|
|
spin_lock(&mddev->lock);
|
2008-06-27 22:31:41 +00:00
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
2014-12-15 01:57:01 +00:00
|
|
|
spin_unlock(&mddev->lock);
|
2005-11-09 05:39:26 +00:00
|
|
|
/* Clear some bits that don't mean anything, but
|
|
|
|
* might be left set
|
|
|
|
*/
|
|
|
|
clear_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
md: delay choosing sync action to md_start_sync()
Before this patch, for a read-write array:
1) md_check_recovery() finds that something needs to be done, and it
tries to grab 'reconfig_mutex'. The cases where md_check_recovery() needs
to do something:
- array is not suspended;
- super_block needs to be updated;
- 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set;
- unusual case related to safemode;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
md_check_recovery() tries to choose a sync action, and then queues a
work md_start_sync().
3) md_start_sync() registers sync_thread;
After this patch,
1) is the same;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
queue a work md_start_sync() directly;
3) md_start_sync() tries to choose a sync action, and then registers
sync_thread;
Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2)
and 3) and md_do_sync() always run serially and can never run
concurrently, so this change should not introduce any behavior change for now.
Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING'
without protection in the error path, which might affect the logic in
md_check_recovery().
The advantage of this change is that array reconfiguration is
independent from the daemon now, and it will be much easier to synchronize
it with io, considering that io may rely on the daemon thread to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com
2023-08-25 03:16:18 +00:00
|
|
|
if (test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) &&
|
|
|
|
!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
|
2023-08-25 03:16:16 +00:00
|
|
|
queue_work(md_misc_wq, &mddev->sync_work);
|
2023-08-25 03:16:18 +00:00
|
|
|
} else {
|
2008-06-27 22:31:41 +00:00
|
|
|
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
2014-12-10 23:02:10 +00:00
|
|
|
wake_up(&resync_wait);
|
2008-06-27 22:31:41 +00:00
|
|
|
}
|
2023-08-25 03:16:18 +00:00
|
|
|
|
2014-09-29 22:10:42 +00:00
|
|
|
unlock:
|
|
|
|
wake_up(&mddev->sb_wait);
|
2005-04-16 22:20:36 +00:00
|
|
|
mddev_unlock(mddev);
|
|
|
|
}
|
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_check_recovery);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-06-07 02:03:56 +00:00
|
|
|
void md_reap_sync_thread(struct mddev *mddev)
|
2013-04-24 01:42:43 +00:00
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
2018-10-18 08:37:44 +00:00
|
|
|
sector_t old_dev_sectors = mddev->dev_sectors;
|
|
|
|
bool is_reshaped = false;
|
2013-04-24 01:42:43 +00:00
|
|
|
|
Revert "md: unlock mddev before reap sync_thread in action_store"
This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:
list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18
This is because work is requeued while it's still inside workqueue:
t1:                             t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
                                md_check_recovery
                                 mddev_try_lock
                                 /*
                                  * once MD_RECOVERY_DONE is set, new sync_thread
                                  * can start.
                                  */
                                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                                 INIT_WORK(&mddev->del_work, md_start_sync)
                                 queue_work(md_misc_wq, &mddev->del_work)
                                  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
                                  // set pending bit
                                  insert_work
                                   list_add_tail
                                 mddev_unlock
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:
// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted
The above commit is reverted to fix the problem; the deadlock that the
reverted commit tried to fix will be fixed in the following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
2023-05-29 13:20:32 +00:00
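The requeue problem described in the revert message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_work`, `toy_queue`, and the `toy_*` helpers are hypothetical names invented for illustration and are not kernel APIs. The model assumes, as the message describes, that queueing is guarded by a pending bit and that re-initialising a work item clears that bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_CAP 4

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct toy_work {
	bool pending;	/* models the work item's pending bit */
};

struct toy_queue {
	struct toy_work *slots[TOY_QUEUE_CAP];
	size_t len;
};

/* Models re-initialising a work item: the pending bit is reset too. */
static void toy_init_work(struct toy_work *w)
{
	w->pending = false;
}

/* Models queueing: an item whose pending bit is set is not inserted again. */
static bool toy_queue_work(struct toy_queue *q, struct toy_work *w)
{
	if (w->pending)
		return false;
	assert(q->len < TOY_QUEUE_CAP);
	w->pending = true;
	q->slots[q->len++] = w;
	return true;
}

/* Counts how many queue entries refer to the same work item. */
static size_t toy_count_entries(const struct toy_queue *q,
				const struct toy_work *w)
{
	size_t n = 0;

	for (size_t i = 0; i < q->len; i++)
		if (q->slots[i] == w)
			n++;
	return n;
}
```

In this model, queueing the same item twice is rejected by the pending bit, but calling `toy_init_work()` between the two attempts lets the item be inserted a second time while it is still queued. In the real workqueue the item is linked into a list, so the second insertion is what corrupts the list.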
|
|
|
/* resync has finished, collect result */
|
2023-08-03 07:17:11 +00:00
|
|
|
md_unregister_thread(mddev, &mddev->sync_thread);
|
md: refactor idle/frozen_sync_thread() to fix deadlock
Our test found the following deadlock in raid10:
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
raid10_sync_request
raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
action_store
mddev_lock
md_unregister_thread
kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence a cyclic dependency exists. In order to fix the problem, we must
break one of the dependencies. Dependencies 1 and 2 can't be broken because
they are fundamental to the design. Breaking dependency 4 may be possible if
it can be guaranteed that no io can be inflight; however, this requires a new
mechanism which seems complex. Dependency 3 is a good choice, because
idle/frozen only requires the sync thread to finish, which can be done
asynchronously and is already implemented, and 'reconfig_mutex' is not
needed anymore.
This patch switches 'idle' and 'frozen' to wait for the sync thread to be
done asynchronously, and also adds a sequence counter to record how
many times the sync thread is done, so that 'idle' won't keep waiting on a
newly started sync thread.
Note that raid456 has a similar deadlock ([1]), and it's verified [2] that
this deadlock can be fixed by this patch as well.
[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
2023-05-29 13:20:35 +00:00
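The role of the sequence counter mentioned in the message above can be sketched as a wait condition. This is a minimal userspace illustration, not the driver's code: `toy_sync_state` and `toy_idle_wait_done()` are hypothetical names, and the sketch assumes the waiter records the counter before it starts waiting.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state an 'idle' waiter looks at. */
struct toy_sync_state {
	bool running;	/* models MD_RECOVERY_RUNNING */
	int sync_seq;	/* bumped each time a sync thread is reaped */
};

/*
 * Returns true when a waiter that recorded seq_at_start may stop waiting:
 * either no sync thread is running, or the thread it was waiting for has
 * already been reaped (the counter moved), even if a new one has started.
 */
static bool toy_idle_wait_done(const struct toy_sync_state *s,
			       int seq_at_start)
{
	return !s->running || s->sync_seq != seq_at_start;
}
```

Without the counter, a waiter that only checked the running flag could keep waiting when a new sync thread starts right after the old one finishes. Comparing against the recorded sequence number lets it return as soon as the original thread is done.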
|
|
|
atomic_inc(&mddev->sync_seq);
|
|
|
|
|
2013-04-24 01:42:43 +00:00
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
|
2019-07-24 09:09:21 +00:00
|
|
|
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
|
|
|
|
mddev->degraded != mddev->raid_disks) {
|
2013-04-24 01:42:43 +00:00
|
|
|
/* success...*/
|
|
|
|
/* activate any spares */
|
|
|
|
if (mddev->pers->spare_active(mddev)) {
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
|
2016-12-08 23:48:19 +00:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2013-04-24 01:42:43 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
2018-10-18 08:37:44 +00:00
|
|
|
mddev->pers->finish_reshape) {
|
2013-04-24 01:42:43 +00:00
|
|
|
mddev->pers->finish_reshape(mddev);
|
2018-10-18 08:37:44 +00:00
|
|
|
if (mddev_is_clustered(mddev))
|
|
|
|
is_reshaped = true;
|
|
|
|
}
|
2013-04-24 01:42:43 +00:00
|
|
|
|
|
|
|
/* If array is no longer degraded, then any saved_raid_disk
|
md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
wherever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However, updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
records that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously they don't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Doing that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-09 01:04:56 +00:00
|
|
|
* information must be scrapped.
|
2013-04-24 01:42:43 +00:00
|
|
|
*/
|
2013-12-09 01:04:56 +00:00
|
|
|
if (!mddev->degraded)
|
|
|
|
rdev_for_each(rdev, mddev)
|
2013-04-24 01:42:43 +00:00
|
|
|
rdev->saved_raid_disk = -1;
|
|
|
|
|
|
|
|
md_update_sb(mddev, 1);
|
2016-12-08 23:48:19 +00:00
|
|
|
/* MD_SB_CHANGE_PENDING should be cleared by md_update_sb, so we can
|
2016-06-03 03:32:04 +00:00
|
|
|
* call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by
|
|
|
|
* clustered raid */
|
|
|
|
if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
|
|
|
|
md_cluster_ops->resync_finish(mddev);
|
2013-04-24 01:42:43 +00:00
|
|
|
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
2015-06-12 10:05:04 +00:00
|
|
|
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
|
2013-04-24 01:42:43 +00:00
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
2018-10-18 08:37:44 +00:00
|
|
|
/*
|
|
|
|
* We call md_cluster_ops->update_size here because sync_size could
|
|
|
|
* be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
|
|
|
|
* so it is time to update size across cluster.
|
|
|
|
*/
|
|
|
|
if (mddev_is_clustered(mddev) && is_reshaped
|
|
|
|
&& !test_bit(MD_CLOSING, &mddev->flags))
|
|
|
|
md_cluster_ops->update_size(mddev, old_dev_sectors);
|
2013-04-24 01:42:43 +00:00
|
|
|
/* flag recovery needed just to double check */
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
2022-06-08 16:27:56 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_completed);
|
2013-04-24 01:42:43 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_action);
|
2021-10-04 15:34:53 +00:00
|
|
|
md_new_event();
|
2013-04-24 01:42:43 +00:00
|
|
|
if (mddev->event_work.func)
|
|
|
|
queue_work(md_misc_wq, &mddev->event_work);
|
2023-05-29 13:20:36 +00:00
|
|
|
wake_up(&resync_wait);
|
2013-04-24 01:42:43 +00:00
|
|
|
}
|
2014-09-30 06:15:38 +00:00
|
|
|
EXPORT_SYMBOL(md_reap_sync_thread);
|
2013-04-24 01:42:43 +00:00
|
|
|
|
2011-10-11 05:47:53 +00:00
|
|
|
void md_wait_for_blocked_rdev(struct md_rdev *rdev, struct mddev *mddev)
|
2008-04-30 07:52:32 +00:00
|
|
|
{
|
2010-06-01 09:37:23 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2008-04-30 07:52:32 +00:00
|
|
|
wait_event_timeout(rdev->blocked_wait,
|
2011-07-28 01:31:48 +00:00
|
|
|
!test_bit(Blocked, &rdev->flags) &&
|
|
|
|
!test_bit(BlockedBadBlocks, &rdev->flags),
|
2008-04-30 07:52:32 +00:00
|
|
|
msecs_to_jiffies(5000));
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_wait_for_blocked_rdev);
|
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
void md_finish_reshape(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
/* called by personality module when reshape completes. */
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (rdev->data_offset > rdev->new_data_offset)
|
|
|
|
rdev->sectors += rdev->data_offset - rdev->new_data_offset;
|
|
|
|
else
|
|
|
|
rdev->sectors -= rdev->new_data_offset - rdev->data_offset;
|
|
|
|
rdev->data_offset = rdev->new_data_offset;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_finish_reshape);
|
2011-07-28 01:31:46 +00:00
|
|
|
|
2015-12-25 02:20:34 +00:00
|
|
|
/* Bad block management */
|
2011-07-28 01:31:46 +00:00
|
|
|
|
2015-12-25 02:20:34 +00:00
|
|
|
/* Returns 1 on success, 0 on failure */
|
2011-10-11 05:45:26 +00:00
|
|
|
int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
|
2012-05-20 23:27:00 +00:00
|
|
|
int is_new)
|
2011-07-28 01:31:46 +00:00
|
|
|
{
|
2016-05-04 02:22:13 +00:00
|
|
|
struct mddev *mddev = rdev->mddev;
|
2012-05-20 23:27:00 +00:00
|
|
|
int rv;
|
|
|
|
if (is_new)
|
|
|
|
s += rdev->new_data_offset;
|
|
|
|
else
|
|
|
|
s += rdev->data_offset;
|
2015-12-25 02:20:34 +00:00
|
|
|
rv = badblocks_set(&rdev->badblocks, s, sectors, 0);
|
|
|
|
if (rv == 0) {
|
2011-07-28 01:31:46 +00:00
|
|
|
/* Make sure they get written out promptly */
|
2016-10-21 14:26:57 +00:00
|
|
|
if (test_bit(ExternalBbl, &rdev->flags))
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_unack_badblocks);
|
2011-12-08 05:26:08 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_state);
|
2016-12-08 23:48:19 +00:00
|
|
|
set_mask_bits(&mddev->sb_flags, 0,
|
|
|
|
BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING));
|
2011-07-28 01:31:46 +00:00
|
|
|
md_wakeup_thread(rdev->mddev->thread);
|
2015-12-25 02:20:34 +00:00
|
|
|
return 1;
|
|
|
|
} else
|
|
|
|
return 0;
|
2011-07-28 01:31:46 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rdev_set_badblocks);
|
|
|
|
|
2012-05-20 23:27:00 +00:00
|
|
|
int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
|
|
|
|
int is_new)
|
2011-07-28 01:31:46 +00:00
|
|
|
{
|
2016-10-21 14:26:57 +00:00
|
|
|
int rv;
|
2012-05-20 23:27:00 +00:00
|
|
|
if (is_new)
|
|
|
|
s += rdev->new_data_offset;
|
|
|
|
else
|
|
|
|
s += rdev->data_offset;
|
2016-10-21 14:26:57 +00:00
|
|
|
rv = badblocks_clear(&rdev->badblocks, s, sectors);
|
|
|
|
if ((rv == 0) && test_bit(ExternalBbl, &rdev->flags))
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(rdev->sysfs_badblocks);
|
2016-10-21 14:26:57 +00:00
|
|
|
return rv;
|
2011-07-28 01:31:46 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
|
|
|
|
|
2005-05-05 23:16:09 +00:00
|
|
|
static int md_notify_reboot(struct notifier_block *this,
|
|
|
|
unsigned long code, void *x)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2022-07-19 09:18:21 +00:00
|
|
|
struct mddev *mddev, *n;
|
2011-09-23 09:40:45 +00:00
|
|
|
int need_delay = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-07-19 09:18:21 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
|
2022-07-19 09:18:23 +00:00
|
|
|
if (!mddev_get(mddev))
|
|
|
|
continue;
|
2022-07-19 09:18:21 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
2012-03-19 01:46:37 +00:00
|
|
|
if (mddev_trylock(mddev)) {
|
2012-04-24 00:23:16 +00:00
|
|
|
if (mddev->pers)
|
|
|
|
__md_stop_writes(mddev);
|
2014-05-05 23:36:08 +00:00
|
|
|
if (mddev->persistent)
|
|
|
|
mddev->safemode = 2;
|
2012-03-19 01:46:37 +00:00
|
|
|
mddev_unlock(mddev);
|
2011-09-23 09:40:45 +00:00
|
|
|
}
|
2012-03-19 01:46:37 +00:00
|
|
|
need_delay = 1;
|
2022-07-19 09:18:21 +00:00
|
|
|
mddev_put(mddev);
|
|
|
|
spin_lock(&all_mddevs_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2022-07-19 09:18:21 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2012-03-19 01:46:37 +00:00
|
|
|
/*
|
|
|
|
* certain more exotic SCSI devices are known to be
|
|
|
|
* volatile wrt too early system reboots. While the
|
|
|
|
* right place to handle this issue is the given
|
|
|
|
* driver, we do want to have a safe RAID driver ...
|
|
|
|
*/
|
|
|
|
if (need_delay)
|
2022-03-03 23:19:33 +00:00
|
|
|
msleep(1000);
|
2012-03-19 01:46:37 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return NOTIFY_DONE;
|
|
|
|
}
|
|
|
|
|
2005-05-05 23:16:09 +00:00
|
|
|
static struct notifier_block md_notifier = {
|
2005-04-16 22:20:36 +00:00
|
|
|
.notifier_call = md_notify_reboot,
|
|
|
|
.next = NULL,
|
|
|
|
.priority = INT_MAX, /* before any real devices */
|
|
|
|
};
|
|
|
|
|
|
|
|
static void md_geninit(void)
|
|
|
|
{
|
2011-10-07 03:23:17 +00:00
|
|
|
pr_debug("md: sizeof(mdp_super_t) = %d\n", (int)sizeof(mdp_super_t));
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-02-04 01:37:17 +00:00
|
|
|
proc_create("mdstat", S_IRUGO, NULL, &mdstat_proc_ops);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2005-05-05 23:16:09 +00:00
|
|
|
static int __init md_init(void)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2010-10-15 13:36:08 +00:00
|
|
|
int ret = -ENOMEM;
|
|
|
|
|
2011-01-25 13:35:54 +00:00
|
|
|
md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
|
2010-10-15 13:36:08 +00:00
|
|
|
if (!md_wq)
|
|
|
|
goto err_wq;
|
|
|
|
|
|
|
|
md_misc_wq = alloc_workqueue("md_misc", 0, 0);
|
|
|
|
if (!md_misc_wq)
|
|
|
|
goto err_misc_wq;
|
|
|
|
|
2023-05-29 13:11:04 +00:00
|
|
|
md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND,
|
|
|
|
0);
|
|
|
|
if (!md_bitmap_wq)
|
|
|
|
goto err_bitmap_wq;
|
|
|
|
|
2020-10-29 14:58:34 +00:00
|
|
|
ret = __register_blkdev(MD_MAJOR, "md", md_probe);
|
|
|
|
if (ret < 0)
|
2010-10-15 13:36:08 +00:00
|
|
|
goto err_md;
|
|
|
|
|
2020-10-29 14:58:34 +00:00
|
|
|
ret = __register_blkdev(0, "mdp", md_probe);
|
|
|
|
if (ret < 0)
|
2010-10-15 13:36:08 +00:00
|
|
|
goto err_mdp;
|
|
|
|
mdp_major = ret;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
register_reboot_notifier(&md_notifier);
|
2023-03-02 20:46:09 +00:00
|
|
|
raid_table_header = register_sysctl("dev/raid", raid_table);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
md_geninit();
|
2008-10-13 00:55:12 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-10-15 13:36:08 +00:00
|
|
|
err_mdp:
|
|
|
|
unregister_blkdev(MD_MAJOR, "md");
|
|
|
|
err_md:
|
2023-05-29 13:11:04 +00:00
|
|
|
destroy_workqueue(md_bitmap_wq);
|
|
|
|
err_bitmap_wq:
|
2010-10-15 13:36:08 +00:00
|
|
|
destroy_workqueue(md_misc_wq);
|
|
|
|
err_misc_wq:
|
|
|
|
destroy_workqueue(md_wq);
|
|
|
|
err_wq:
|
|
|
|
return ret;
|
|
|
|
}
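The error labels at the end of md_init() release resources in the reverse order of acquisition, so a failure at any step undoes only what was already set up. A minimal user-space sketch of the same ladder follows; `demo_ctx`, `demo_init`, `demo_exit` and the `fail_at` parameter are hypothetical names used only for illustration, with malloc()/free() standing in for alloc_workqueue()/destroy_workqueue().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for md_wq, md_misc_wq and md_bitmap_wq. */
struct demo_ctx {
	void *wq;
	void *misc_wq;
	void *bitmap_wq;
};

/*
 * fail_at selects which step is forced to fail (1..3); 0 lets every
 * step run normally. On failure, only the resources acquired before
 * the failing step are released, newest first.
 */
static int demo_init(struct demo_ctx *ctx, int fail_at)
{
	int ret = -ENOMEM;

	ctx->wq = ctx->misc_wq = ctx->bitmap_wq = NULL;

	ctx->wq = (fail_at == 1) ? NULL : malloc(16);
	if (!ctx->wq)
		goto err_wq;

	ctx->misc_wq = (fail_at == 2) ? NULL : malloc(16);
	if (!ctx->misc_wq)
		goto err_misc_wq;

	ctx->bitmap_wq = (fail_at == 3) ? NULL : malloc(16);
	if (!ctx->bitmap_wq)
		goto err_bitmap_wq;

	return 0;

err_bitmap_wq:
	free(ctx->misc_wq);
	ctx->misc_wq = NULL;
err_misc_wq:
	free(ctx->wq);
	ctx->wq = NULL;
err_wq:
	return ret;
}

/* Normal teardown: release everything, newest first. */
static void demo_exit(struct demo_ctx *ctx)
{
	free(ctx->bitmap_wq);
	free(ctx->misc_wq);
	free(ctx->wq);
}
```

Each label is named after the step that failed, not after the resource it frees, which matches the naming used in md_init().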
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-08-21 15:33:39 +00:00
|
|
|
static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
|
2014-06-07 06:53:00 +00:00
|
|
|
{
|
2015-08-21 15:33:39 +00:00
|
|
|
struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
|
2021-04-08 07:44:15 +00:00
|
|
|
struct md_rdev *rdev2, *tmp;
|
2015-08-21 15:33:39 +00:00
|
|
|
int role, ret;
|
2014-06-07 06:53:00 +00:00
|
|
|
|
2017-03-01 08:42:40 +00:00
|
|
|
/*
|
|
|
|
* If size is changed in another node then we need to
|
|
|
|
* do resize as well.
|
|
|
|
*/
|
|
|
|
if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
|
|
|
|
ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
|
|
|
|
if (ret)
|
|
|
|
pr_info("md-cluster: resize failed\n");
|
|
|
|
else
|
2018-08-01 22:20:50 +00:00
|
|
|
md_bitmap_update_sb(mddev->bitmap);
|
2017-03-01 08:42:40 +00:00
|
|
|
}
|
|
|
|
|
2015-08-21 15:33:39 +00:00
|
|
|
/* Check for change of roles in the active devices */
|
2021-04-08 07:44:15 +00:00
|
|
|
rdev_for_each_safe(rdev2, tmp, mddev) {
|
2015-08-21 15:33:39 +00:00
|
|
|
if (test_bit(Faulty, &rdev2->flags))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Check if the roles changed */
|
|
|
|
role = le16_to_cpu(sb->dev_roles[rdev2->desc_nr]);
|
2015-10-01 18:20:27 +00:00
|
|
|
|
|
|
|
if (test_bit(Candidate, &rdev2->flags)) {
|
2022-04-21 19:45:58 +00:00
|
|
|
if (role == MD_DISK_ROLE_FAULTY) {
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_info("md: Removing Candidate device %pg because add failed\n",
|
|
|
|
rdev2->bdev);
|
2015-10-01 18:20:27 +00:00
|
|
|
md_kick_rdev_from_array(rdev2);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
clear_bit(Candidate, &rdev2->flags);
|
|
|
|
}
|
|
|
|
|
2015-08-21 15:33:39 +00:00
|
|
|
if (role != rdev2->raid_disk) {
|
2018-10-18 08:37:45 +00:00
|
|
|
/*
|
|
|
|
* A spare got activated, unless a reshape is happening.
|
|
|
|
*/
|
2022-04-21 19:45:58 +00:00
|
|
|
if (rdev2->raid_disk == -1 && role != MD_DISK_ROLE_SPARE &&
|
2018-10-18 08:37:45 +00:00
|
|
|
!(le32_to_cpu(sb->feature_map) &
|
|
|
|
MD_FEATURE_RESHAPE_ACTIVE)) {
|
2015-08-21 15:33:39 +00:00
|
|
|
rdev2->saved_raid_disk = role;
|
|
|
|
ret = remove_and_add_spares(mddev, rdev2);
|
2022-05-12 06:19:13 +00:00
|
|
|
pr_info("Activated spare: %pg\n",
|
|
|
|
rdev2->bdev);
|
2016-05-02 15:33:14 +00:00
|
|
|
/* wake up mddev->thread here, so the array can
|
|
|
|
* perform resync with the newly activated disk */
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
md_wakeup_thread(mddev->thread);
|
2015-08-21 15:33:39 +00:00
|
|
|
}
|
|
|
|
/* device faulty
|
|
|
|
* We just want to do the minimum to mark the disk
|
|
|
|
* as faulty. The recovery is performed by the
|
|
|
|
* one who initiated the error.
|
|
|
|
*/
|
2022-04-21 19:45:58 +00:00
|
|
|
if (role == MD_DISK_ROLE_FAULTY ||
|
|
|
|
role == MD_DISK_ROLE_JOURNAL) {
|
2015-08-21 15:33:39 +00:00
|
|
|
md_error(mddev, rdev2);
|
|
|
|
clear_bit(Blocked, &rdev2->flags);
|
|
|
|
}
|
|
|
|
}
|
2014-06-07 06:53:00 +00:00
|
|
|
}
|
2015-08-21 15:33:39 +00:00
|
|
|
|
md/cluster: block reshape with remote resync job
A reshape request should be blocked while a resync job is ongoing. In a cluster
environment, a node can start a resync job even if the resync command was not
executed on it; e.g., the user executes "mdadm --grow" on node A, and sometimes
node B starts the resync job. However, the current update_raid_disks() only checks
the local recovery status, which is incomplete. As a result, the user can
execute "mdadm --grow" successfully on the local node while the remote node refuses
to do the reshape job because it is doing a resync job. The inconsistent handling
causes the array to enter an unexpected state. If the user does not notice this
issue and continues executing mdadm commands, the array eventually stops working.
Fix this issue by blocking the reshape request. When a node executes "--grow"
and detects an ongoing resync, it should stop and report an error to the user.
The following script reproduces the issue with ~100% probability
(two nodes share 3 iSCSI LUNs: sdg/sdh/sdi; each LUN is 1GB).
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-19 11:41:33 +00:00
|
|
|
if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
|
|
|
|
ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
|
|
|
|
if (ret)
|
|
|
|
pr_warn("md: updating array disks failed. %d\n", ret);
|
|
|
|
}
|
2015-08-21 15:33:39 +00:00
|
|
|
|
md-cluster/raid10: support add disk under grow mode
For the clustered raid10 scenario, all nodes need to
know that a new disk has been added to the array. The
reshape caused by adding a new member only needs to happen on
one node, but the other nodes should know about the change.
A reshape reads data from somewhere (which is already
used by the array) and writes data to an unused region. Obviously, it
is bad if one node is reading data from an address while another
node is writing to the same address. Since we have already
implemented suspending writes in the resyncing area, we can
just broadcast the reading address to the other nodes to avoid the
trouble.
The master node calls reshape_request and then updates the sb
during the reshape period. To avoid the above trouble, we call
resync_info_update to send a RESYNCING message in reshape_request.
From the slave node's view, it then receives two types of messages:
1. RESYNCING message
The slave node adds the address (where the master node is reading data from)
to the suspend list.
2. METADATA_UPDATED message
Once the slave nodes know that reshaping has started on the master node,
it is time to update the reshape position and call start_reshape to
follow the master node's step. After the reshape is done, only the reshape
position needs to be updated, so the majority of the reshaping work
happens on the master node.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 08:37:42 +00:00
|
|
|
/*
|
|
|
|
* mddev->delta_disks has already been updated in update_raid_disks,
|
|
|
|
* so it is time to check reshape.
|
|
|
|
*/
|
|
|
|
if (test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) &&
|
|
|
|
(le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
|
|
|
|
/*
|
|
|
|
* reshape is happening in the remote node, we need to
|
|
|
|
* update reshape_position and call start_reshape.
|
|
|
|
*/
|
2019-04-04 16:56:10 +00:00
|
|
|
mddev->reshape_position = le64_to_cpu(sb->reshape_position);
|
2018-10-18 08:37:42 +00:00
|
|
|
if (mddev->pers->update_reshape_pos)
|
|
|
|
mddev->pers->update_reshape_pos(mddev);
|
|
|
|
if (mddev->pers->start_reshape)
|
|
|
|
mddev->pers->start_reshape(mddev);
|
|
|
|
} else if (test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) &&
|
|
|
|
mddev->reshape_position != MaxSector &&
|
|
|
|
!(le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
|
|
|
|
/* reshape has just finished on another node. */
|
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
if (mddev->pers->update_reshape_pos)
|
|
|
|
mddev->pers->update_reshape_pos(mddev);
|
|
|
|
}
|
|
|
|
|
2015-08-21 15:33:39 +00:00
|
|
|
/* Finally set the event to be up to date */
|
|
|
|
mddev->events = le64_to_cpu(sb->events);
|
|
|
|
}
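The reshape handling at the end of check_sb_changes() reduces to a three-way decision on the remote-resync flag, the superblock's reshape-active bit, and the local reshape position. A minimal sketch of that decision as a pure function follows; the enum, the function name and `DEMO_MAX_SECTOR` (a stand-in for MaxSector) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for MaxSector, which marks "no reshape in progress". */
#define DEMO_MAX_SECTOR UINT64_MAX

enum demo_reshape_action {
	DEMO_NOTHING,          /* leave reshape state untouched */
	DEMO_FOLLOW_RESHAPE,   /* copy reshape_position, then start_reshape */
	DEMO_RESHAPE_FINISHED  /* reset reshape_position to MaxSector */
};

/*
 * resyncing_remote:  MD_RESYNCING_REMOTE is set in mddev->recovery
 * sb_reshape_active: MD_FEATURE_RESHAPE_ACTIVE is set in sb->feature_map
 * local_position:    current mddev->reshape_position
 */
static enum demo_reshape_action
demo_pick_reshape_action(bool resyncing_remote, bool sb_reshape_active,
			 uint64_t local_position)
{
	if (!resyncing_remote)
		return DEMO_NOTHING;
	if (sb_reshape_active)
		return DEMO_FOLLOW_RESHAPE;
	if (local_position != DEMO_MAX_SECTOR)
		return DEMO_RESHAPE_FINISHED;
	return DEMO_NOTHING;
}
```

In both non-trivial cases the real code also calls the personality's update_reshape_pos hook when one is provided; the sketch only models which branch is taken.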
|
|
|
|
|
|
|
|
static int read_rdev(struct mddev *mddev, struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct page *swapout = rdev->sb_page;
|
|
|
|
struct mdp_superblock_1 *sb;
|
|
|
|
|
|
|
|
/* Store the sb page of the rdev in the swapout temporary
|
|
|
|
* variable in case we err in the future
|
|
|
|
*/
|
|
|
|
rdev->sb_page = NULL;
|
2016-11-02 03:16:49 +00:00
|
|
|
err = alloc_disk_sb(rdev);
|
|
|
|
if (err == 0) {
|
|
|
|
ClearPageUptodate(rdev->sb_page);
|
|
|
|
rdev->sb_loaded = 0;
|
|
|
|
err = super_types[mddev->major_version].
|
|
|
|
load_super(rdev, NULL, mddev->minor_version);
|
|
|
|
}
|
2015-08-21 15:33:39 +00:00
|
|
|
if (err < 0) {
|
|
|
|
pr_warn("%s: %d Could not reload rdev(%d) err: %d. Restoring old values\n",
|
|
|
|
__func__, __LINE__, rdev->desc_nr, err);
|
2016-11-02 03:16:49 +00:00
|
|
|
if (rdev->sb_page)
|
|
|
|
put_page(rdev->sb_page);
|
2015-08-21 15:33:39 +00:00
|
|
|
rdev->sb_page = swapout;
|
|
|
|
rdev->sb_loaded = 1;
|
|
|
|
return err;
|
2014-06-07 06:53:00 +00:00
|
|
|
}
|
|
|
|
|
2015-08-21 15:33:39 +00:00
|
|
|
sb = page_address(rdev->sb_page);
|
|
|
|
/* Read the offset only if MD_FEATURE_RECOVERY_OFFSET
|
|
|
|
* is set
|
|
|
|
*/
|
|
|
|
|
|
|
|
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET))
|
|
|
|
rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
|
|
|
|
|
|
|
|
/* The other node finished recovery, call spare_active to set
|
|
|
|
* device In_sync and mddev->degraded
|
|
|
|
*/
|
|
|
|
if (rdev->recovery_offset == MaxSector &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
mddev->pers->spare_active(mddev))
|
2020-07-14 23:10:26 +00:00
|
|
|
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
|
2015-08-21 15:33:39 +00:00
|
|
|
|
|
|
|
put_page(swapout);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void md_reload_sb(struct mddev *mddev, int nr)
|
|
|
|
{
|
2022-04-08 08:47:15 +00:00
|
|
|
struct md_rdev *rdev = NULL, *iter;
|
2015-08-21 15:33:39 +00:00
|
|
|
int err;
|
|
|
|
|
|
|
|
/* Find the rdev */
|
2022-04-08 08:47:15 +00:00
|
|
|
rdev_for_each_rcu(iter, mddev) {
|
|
|
|
if (iter->desc_nr == nr) {
|
|
|
|
rdev = iter;
|
2015-08-21 15:33:39 +00:00
|
|
|
break;
|
2022-04-08 08:47:15 +00:00
|
|
|
}
|
2015-08-21 15:33:39 +00:00
|
|
|
}
|
|
|
|
|
2022-04-08 08:47:15 +00:00
|
|
|
if (!rdev) {
|
2015-08-21 15:33:39 +00:00
|
|
|
pr_warn("%s: %d Could not find rdev with nr %d\n", __func__, __LINE__, nr);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = read_rdev(mddev, rdev);
|
|
|
|
if (err < 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
check_sb_changes(mddev, rdev);
|
|
|
|
|
|
|
|
/* Read all rdev's to update recovery_offset */
|
2018-04-09 09:01:21 +00:00
|
|
|
rdev_for_each_rcu(rdev, mddev) {
|
|
|
|
if (!test_bit(Faulty, &rdev->flags))
|
|
|
|
read_rdev(mddev, rdev);
|
|
|
|
}
|
2014-06-07 06:53:00 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(md_reload_sb);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef MODULE
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Searches all registered partitions for autorun RAID arrays
|
|
|
|
* at boot time.
|
|
|
|
*/
|
2007-10-17 06:30:52 +00:00
|
|
|
|
2016-06-08 16:20:16 +00:00
|
|
|
static DEFINE_MUTEX(detected_devices_mutex);
|
2007-10-17 06:30:52 +00:00
|
|
|
static LIST_HEAD(all_detected_devices);
|
|
|
|
struct detected_devices_node {
|
|
|
|
struct list_head list;
|
|
|
|
dev_t dev;
|
|
|
|
};
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
void md_autodetect_dev(dev_t dev)
|
|
|
|
{
|
2007-10-17 06:30:52 +00:00
|
|
|
struct detected_devices_node *node_detected_dev;
|
|
|
|
|
|
|
|
node_detected_dev = kzalloc(sizeof(*node_detected_dev), GFP_KERNEL);
|
|
|
|
if (node_detected_dev) {
|
|
|
|
node_detected_dev->dev = dev;
|
2016-06-08 16:20:16 +00:00
|
|
|
mutex_lock(&detected_devices_mutex);
|
2007-10-17 06:30:52 +00:00
|
|
|
list_add_tail(&node_detected_dev->list, &all_detected_devices);
|
2016-06-08 16:20:16 +00:00
|
|
|
mutex_unlock(&detected_devices_mutex);
|
2007-10-17 06:30:52 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2020-06-06 13:00:24 +00:00
|
|
|
void md_autostart_arrays(int part)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *rdev;
|
2007-10-17 06:30:52 +00:00
|
|
|
struct detected_devices_node *node_detected_dev;
|
|
|
|
dev_t dev;
|
|
|
|
int i_scanned, i_passed;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-17 06:30:52 +00:00
|
|
|
i_scanned = 0;
|
|
|
|
i_passed = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_info("md: Autodetecting RAID arrays.\n");
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-06-08 16:20:16 +00:00
|
|
|
mutex_lock(&detected_devices_mutex);
|
2007-10-17 06:30:52 +00:00
|
|
|
while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) {
|
|
|
|
i_scanned++;
|
|
|
|
node_detected_dev = list_entry(all_detected_devices.next,
|
|
|
|
struct detected_devices_node, list);
|
|
|
|
list_del(&node_detected_dev->list);
|
|
|
|
dev = node_detected_dev->dev;
|
|
|
|
kfree(node_detected_dev);
|
2016-09-14 21:26:54 +00:00
|
|
|
mutex_unlock(&detected_devices_mutex);
|
2007-07-17 11:06:11 +00:00
|
|
|
rdev = md_import_device(dev, 0, 90);
|
2016-09-14 21:26:54 +00:00
|
|
|
mutex_lock(&detected_devices_mutex);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (IS_ERR(rdev))
|
|
|
|
continue;
|
|
|
|
|
2014-09-30 05:52:29 +00:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
2005-04-16 22:20:36 +00:00
|
|
|
continue;
|
2014-09-30 05:52:29 +00:00
|
|
|
|
2008-03-04 22:29:31 +00:00
|
|
|
set_bit(AutoDetected, &rdev->flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
list_add(&rdev->same_set, &pending_raid_disks);
|
2007-10-17 06:30:52 +00:00
|
|
|
i_passed++;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2016-06-08 16:20:16 +00:00
|
|
|
mutex_unlock(&detected_devices_mutex);
|
2007-10-17 06:30:52 +00:00
|
|
|
|
2016-11-02 03:16:49 +00:00
|
|
|
pr_debug("md: Scanned %d and added %d devices.\n", i_scanned, i_passed);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
autorun_devices(part);
|
|
|
|
}
|
|
|
|
|
2006-12-10 10:20:50 +00:00
|
|
|
#endif /* !MODULE */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
static __exit void md_exit(void)
|
|
|
|
{
|
2022-07-19 09:18:22 +00:00
|
|
|
struct mddev *mddev, *n;
|
2014-04-09 04:33:51 +00:00
|
|
|
int delay = 1;
|
2005-06-21 04:15:16 +00:00
|
|
|
|
2009-03-31 03:27:02 +00:00
|
|
|
unregister_blkdev(MD_MAJOR, "md");
|
2005-04-16 22:20:36 +00:00
|
|
|
unregister_blkdev(mdp_major, "mdp");
|
|
|
|
unregister_reboot_notifier(&md_notifier);
|
|
|
|
unregister_sysctl_table(raid_table_header);
|
2014-04-09 04:33:51 +00:00
|
|
|
|
|
|
|
/* We cannot unload the modules while some process is
|
|
|
|
* waiting for us in select() or poll() - wake them up
|
|
|
|
*/
|
|
|
|
md_unloading = 1;
|
|
|
|
while (waitqueue_active(&md_event_waiters)) {
|
|
|
|
/* not safe to leave yet */
|
|
|
|
wake_up(&md_event_waiters);
|
|
|
|
msleep(delay);
|
|
|
|
delay += delay;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
remove_proc_entry("mdstat", NULL);
|
2014-04-09 04:33:51 +00:00
|
|
|
|
2022-07-19 09:18:22 +00:00
|
|
|
spin_lock(&all_mddevs_lock);
|
|
|
|
list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
|
2022-07-19 09:18:23 +00:00
|
|
|
if (!mddev_get(mddev))
|
|
|
|
continue;
|
2022-07-19 09:18:22 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
export_array(mddev);
|
2017-02-06 02:41:39 +00:00
|
|
|
mddev->ctime = 0;
|
md: make devices disappear when they are no longer needed.
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, thus creating a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop of the device being opened
and closed, then re-opened due to the 'ADD' event from the first open,
and then closed and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do is
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-08 21:31:10 +00:00
|
|
|
mddev->hold_active = 0;
|
2017-02-06 02:41:39 +00:00
|
|
|
/*
|
2022-07-19 09:18:22 +00:00
|
|
|
* As the mddev is now fully clear, mddev_put will schedule
|
|
|
|
* the mddev for destruction by a workqueue, and the
|
2017-02-06 02:41:39 +00:00
|
|
|
* destroy_workqueue() below will wait for that to complete.
|
|
|
|
*/
|
2022-07-19 09:18:22 +00:00
|
|
|
mddev_put(mddev);
|
|
|
|
spin_lock(&all_mddevs_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2022-07-19 09:18:22 +00:00
|
|
|
spin_unlock(&all_mddevs_lock);
|
|
|
|
|
2010-10-15 13:36:08 +00:00
|
|
|
destroy_workqueue(md_misc_wq);
|
2023-05-29 13:11:04 +00:00
|
|
|
destroy_workqueue(md_bitmap_wq);
|
2010-10-15 13:36:08 +00:00
|
|
|
destroy_workqueue(md_wq);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
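The wake-up loop in md_exit() starts with a 1 ms sleep and doubles it on every pass (`delay += delay`). A small sketch of the resulting schedule follows; `total_backoff_ms` is a hypothetical helper for illustration only and is not part of md.c.

```c
/*
 * Total milliseconds slept after `rounds` passes of a loop whose delay
 * starts at 1 ms and doubles each time, as in md_exit().
 */
static unsigned long total_backoff_ms(unsigned int rounds)
{
	unsigned long delay = 1;
	unsigned long total = 0;

	while (rounds--) {
		total += delay;   /* msleep(delay) in the real loop */
		delay += delay;   /* double the delay for the next pass */
	}
	return total;
}
```

Four passes sleep 1 + 2 + 4 + 8 = 15 ms in total, so the loop stays cheap when waiters leave quickly but backs off if they do not.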
|
|
|
|
|
2007-07-09 18:56:42 +00:00
|
|
|
subsys_initcall(md_init);
|
2005-04-16 22:20:36 +00:00
|
|
|
module_exit(md_exit)
|
|
|
|
|
treewide: Fix function prototypes for module_param_call()
Several function prototypes for the set/get functions defined by
module_param_call() have slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:
@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@
module_param_call(_name, _set_func, _get_func, _arg, _mode);
@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:
drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-10-18 02:04:42 +00:00
|
|
|
static int get_ro(char *buffer, const struct kernel_param *kp)
|
[PATCH] md: allow md arrays to be started read-only (module parameter).
When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 05:39:36 +00:00
|
|
|
{
|
2020-05-11 08:23:25 +00:00
|
|
|
return sprintf(buffer, "%d\n", start_readonly);
|
2005-11-09 05:39:36 +00:00
|
|
|
}
|
2017-10-18 02:04:42 +00:00
|
|
|
static int set_ro(const char *val, const struct kernel_param *kp)
|
2005-11-09 05:39:36 +00:00
|
|
|
{
|
2015-05-16 11:02:38 +00:00
|
|
|
return kstrtouint(val, 10, (unsigned int *)&start_readonly);
|
2005-11-09 05:39:36 +00:00
|
|
|
}
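get_ro() and set_ro() form a getter/setter pair: the setter parses a decimal string into start_readonly, and the getter prints it back with a trailing newline. A rough user-space analogue follows, assuming strtoul() as a stand-in for kstrtouint(); `demo_start_readonly`, `demo_set_ro` and `demo_get_ro` are hypothetical names, and the accepted input format only approximates what the kernel helper allows.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the module's start_readonly variable. */
static int demo_start_readonly;

/*
 * Parse an unsigned decimal value, optionally followed by one newline.
 * Returns 0 on success or a negative errno value, leaving the stored
 * value untouched on failure.
 */
static int demo_set_ro(const char *val)
{
	char *end;
	unsigned long parsed;

	/* strtoul() would skip whitespace and accept '-'; reject both. */
	if (!isdigit((unsigned char)val[0]) && val[0] != '+')
		return -EINVAL;

	errno = 0;
	parsed = strtoul(val, &end, 10);
	if (end == val)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (errno == ERANGE || parsed > UINT_MAX)
		return -ERANGE;

	/* Mirrors the (unsigned int *) cast used by set_ro(). */
	demo_start_readonly = (int)parsed;
	return 0;
}

/* Print the current value followed by a newline; returns the length. */
static int demo_get_ro(char *buffer, size_t size)
{
	return snprintf(buffer, size, "%d\n", demo_start_readonly);
}
```

Writing `"1\n"` (what `echo 1 >` produces) sets the value to 1, while a non-numeric string such as `"abc"` is rejected with -EINVAL and leaves the previous value in place.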
|
|
|
|
|
2006-07-10 11:44:18 +00:00
|
|
|
module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
|
|
|
|
module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
|
2009-01-08 21:31:10 +00:00
|
|
|
module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
|
2017-04-12 06:26:13 +00:00
|
|
|
module_param(create_on_open, bool, S_IRUSR|S_IWUSR);
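The `start_ro` parameter makes new arrays start in an 'auto-ro' mode: internal I/O (superblock updates, resync, recovery) is held back until the first write request arrives, at which point the array switches to 'rw'. A minimal sketch of that state machine follows; all `demo_` names are hypothetical, and the use of -EROFS for writes to a true read-only array is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

enum demo_ro_mode {
	DEMO_RW,       /* normal operation */
	DEMO_RO,       /* true read-only, e.g. after 'mdadm -r' */
	DEMO_AUTO_RO   /* read-only until the first write arrives */
};

struct demo_array {
	enum demo_ro_mode mode;
};

/* A new array starts in auto-ro when start_ro is set, otherwise rw. */
static void demo_array_start(struct demo_array *a, bool start_ro)
{
	a->mode = start_ro ? DEMO_AUTO_RO : DEMO_RW;
}

/* Superblock updates, resync and recovery only run in rw mode. */
static bool demo_internal_io_allowed(const struct demo_array *a)
{
	return a->mode == DEMO_RW;
}

/* The first write flips auto-ro to rw; true ro rejects the write. */
static int demo_write_request(struct demo_array *a)
{
	if (a->mode == DEMO_RO)
		return -EROFS;
	if (a->mode == DEMO_AUTO_RO)
		a->mode = DEMO_RW;
	return 0;
}
```

The point of the intermediate state is that an array can be assembled, for instance before resuming from a suspend-to-disk image, without md writing anything to the member devices on its own.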
|
2005-11-09 05:39:36 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
MODULE_LICENSE("GPL");
|
2009-12-14 01:49:58 +00:00
|
|
|
MODULE_DESCRIPTION("MD RAID framework");
|
2005-08-04 19:53:32 +00:00
|
|
|
MODULE_ALIAS("md");
|
2005-08-27 01:34:15 +00:00
|
|
|
MODULE_ALIAS_BLOCKDEV_MAJOR(MD_MAJOR);
|