A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.
'chunksize' and 'time_base' must be set before 'location'
can be set.
'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.
'time_base' and 'backlog' can be updated at any time.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
safe_delay_store can parse fixed point numbers (for fractions
of a second). We will want to do that for another sysfs
file soon, so factor out the code.
Signed-off-by: NeilBrown <neilb@suse.de>
For md arrays were metadata is managed externally, the kernel does not
know about a superblock so the superblock offset is 0.
If we want to have a write-intent-bitmap near the end of the
devices of such an array, we should support sector_t sized offset.
We need offset be possibly negative for when the bitmap is before
the metadata, so use loff_t instead.
Also add sanity check that bitmap does not overlap with data.
Signed-off-by: NeilBrown <neilb@suse.de>
As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.
Signed-off-by: NeilBrown <neilb@suse.de>
... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.
Signed-off-by: NeilBrown <neilb@suse.de>
In preparation for making bitmap fields configurable via sysfs,
start tidying up by making a single structure to contain the
configuration fields.
Signed-off-by: NeilBrown <neilb@suse.de>
This will allow us to stop writeout to portions of the array
while they are resynced by someone else - e.g. another node in
a cluster.
Signed-off-by: NeilBrown <neilb@suse.de>
The post-barrier-flush is sent by md as soon as make_request on the
barrier write completes. For raid5, the data might not be in the
per-device queues yet. So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.
We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Previously barriers were only supported on RAID1. This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.
When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.
The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.
The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.
RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.
Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.
That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
If a resync/recovery/check/repair is interrupted for some reason, it
can be useful to know exactly where it got up to.
So in that case, do not clear curr_resync_completed.
Initialise it when starting a resync/recovery/... instead.
Signed-off-by: NeilBrown <neilb@suse.de>
When a 'check' or 'repair' finished we should clear resync_min
so that a future check/repair will cover the whole array (by default).
However if it is interrupted, we should update resync_min to
where we got up to, so that when the check/repair continues it
just does the remainder of the array.
Signed-off-by: NeilBrown <neilb@suse.de>
A write intent bitmap can be removed from an array while the
array is active.
When this happens, all IO is suspended and flushed before the
bitmap is removed.
However it is possible that bitmap_daemon_work is still running to
clear old bits from the bitmap. If it is, it can dereference the
bitmap after it has been freed.
So introduce a new mutex to protect bitmap_daemon_work and get it
before destroying a bitmap.
This is suitable for any current -stable kernel.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
If the snapshot we are merging became invalid (e.g. it ran out of
space) redirect all I/O directly to the origin device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Set 'merge_failed' flag if a snapshot fails to merge. Update
snapshot_status() to report "Merge failed" if 'merge_failed' is set.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
s->store->type->prepare_merge returns the number of chunks that can be
copied linearly working backwards from the returned chunk number.
For example, if it returns 3 chunks with old_chunk == 10 and new_chunk
== 20, then chunk 20 can be copied to 10, chunk 19 to 9 and 18 to 8.
Until now kcopyd only copied one chunk at a time. This patch now copies
the full set at once.
Consequently, snapshot_merge_process() needs to delay the merging of all
chunks if any have writes in progress, not just the first chunk in the
region that is to be merged.
snapshot-merge's performance is now comparable to the original
snapshot-origin target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When there is one merging snapshot and other non-merging snapshots,
snapshot_merge_process() must make exceptions in the non-merging
snapshots.
Use a sequence count to resolve the race between I/O to chunks that are
about to be merged. The count increases each time an exception
reallocation finishes. Use wait_event() to wait until the count
changes.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Track writes to chunks that are currently being merged and delay merging
a chunk until all writes to that chunk finish.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
While a set of chunks is being merged, any overlapping writes need to be
queued.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merging is started when origin is resumed and it is stopped when
origin is suspended or when the merging snapshot is destroyed or
errors are detected.
Merging is not yet interlocked with writes: this will be handled in
subsequent patches.
The code relies on callbacks from a private kcopyd thread.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merging more than one snapshot is not supported, so prevent
this happening.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Sets num_flush_requests=2 to support flushing both the origin and cow
devices used by the snapshot-merge target.
Also, snapshot_ctr() now gets the origin device using FMODE_WRITE if the
target is snapshot-merge (which writes to the origin device).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The snapshot-merge target should not allocate new exceptions because the
intent is to merge all of its exceptions as quickly and safely as
possible.
This patch introduces the snapshot-merge mapping function and updates
__origin_write() so that it doesn't allocate exceptions on any snapshots
that are being merged.
If a write request to a merging snapshot device is to be dispatched
directly to the origin (because the chunk is not remapped or was already
merged), snapshot_merge_map() must make exceptions in other snapshots so
calls do_origin().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
To track the completion of exceptions relating to the same location on
the device, the current code selects one exception as primary_pe, links
the other exceptions to it and uses reference counting to wait until all
the reallocations are complete.
It is considered too complicated to extend this code to handle the new
snapshot-merge target, where sets of non-overlapping chunks would also
need to become linked.
Instead, a simpler (but less efficient) approach is taken. Bios are
linked to one exception. When it completes, bios are simply retried,
and if other related exceptions are still outstanding, they'll get
queued again to wait for another one.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The snapshot-merge target allows a snapshot to be merged back into the
snapshot's origin device.
One anticipated use of snapshot merging is the rollback of filesystems
to back out problematic system upgrades.
This patch adds snapshot-merge target management to both
dm_snapshot_init() and dm_snapshot_exit(). As an initial place-holder,
snapshot-merge is identical to the snapshot target. Documentation is
provided.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add functions that decide how many consecutive chunks of snapshot to
merge back into the origin next and to update the metadata afterwards.
prepare_merge provides a pointer to the most recent still-to-be-merged
chunk and returns how many previous ones are consecutive and can be
processed together.
commit_merge removes the nr_merged most-recent chunks permanently from
the exception store. The number must not exceed that returned by
prepare_merge.
Introduce NUM_SNAPSHOT_HDR_CHUNKS to show where the snapshot header
chunk is accounted for.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the __chunk_is_tracked() loop into a separate function as we will
also need to call it from the write path in the rare case of conflicting
writes to the same chunk.
Originally introduced in commit a8d41b59f3f5a7ac19452ef442a7fc1b5fa17366
("dm snapshot: fix race during exception creation").
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
To support the merging of snapshots back into their origin we need
to trigger exceptions in other snapshots not being merged without
any incoming bio on the origin device. The bio parameter to
__origin_write() becomes optional and the sector needs supplying
separately.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch rejects messages that can generate I/O while the device
itself is suspended.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the exported dm_suspended() function so that targets
can check whether or not they are suspended.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch renames dm_suspended() to dm_suspended_md() and
keeps it internal to dm.
No functional change.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch moves DMF_SUSPENDED flag set before postsuspend.
No one should care about the ordering, because the flag set and
the postsuspend are protected by a single lock, md->suspend_lock,
and all strict flag-checkers take the lock.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The default plain IV is 32-bit only.
This plain64 IV provides a compatible mode for encrypted devices bigger
than 4TB.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Permit in-use snapshot exception data to be 'handed over' from one
snapshot instance to another. This is a pre-requisite for patches
that allow the changes made in a snapshot device to be merged back into
its origin device and also allows device resizing.
The basic call sequence is:
dmsetup load new_snapshot (referencing the existing in-use cow device)
- the ctr code detects that the cow is already in use and allows the
two snapshot target instances to be linked together
dmsetup suspend original_snapshot
dmsetup resume new_snapshot
- the new_snapshot becomes live, and if anything now tries to access
the original one it will receive -EIO
dmsetup remove original_snapshot
(There can only be two snapshot targets referencing the same cow device
simultaneously.)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When swapping a new table into place, retain the old table until
its replacement is in place.
An old check for an empty table is removed because this is enforced
in populate_table().
__unbind() becomes redundant when followed by __bind().
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When replacing a mapped device's table during a 'resume', delay the
destruction of the old table until the new one is successfully in place.
This will make it easier for a later patch to transfer internal state
information from the old table to the new one (something we do not currently
support) while giving us more options for reversion if a later part
of the operation fails.
Devices are always in the suspended state during dm_swap_table().
This patch reinforces the requirement that all I/O must have been
flushed from the table targets while in this state (including any in
workqueues). In the case of 'noflush' suspending, unprocessed
I/O should have been 'pushed back' to the dm core prior to this point,
for resubmission after the new table is in place.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add the flag DM_QUERY_INACTIVE_TABLE_FLAG to the ioctls to return
infomation about the loaded-but-not-yet-active table instead of the live
table. Prior to this patch it was impossible to obtain this information
until the device had been 'resumed'.
Userspace dmsetup and libdevmapper support the flag as of version 1.02.40.
e.g. dmsetup info --inactive vg1-lv1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Accept empty barriers in dm-io.
dm-io will process empty write barrier requests just like the other
read/write requests.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reject messages that can generate I/O while the device itself
is suspended.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a mutex to allow possible creators of new work to synchronize with
flushing work queues.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Once we begin deleting a device, prevent any further messages being sent
to targets of its table (to avoid races).
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add dm_deleting_md to check whether or not a given mapped
device is currently being deleted.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch stops the remaining dm-mpath activity during the suspend
sequence by flushing workqueues in postsuspend function.
The current dm-mpath target may not be quiet even after suspend completes
because some workqueues (e.g. device_handler's work, event handling)
are not flushed during the suspend sequence, even though suspended
devices/targets are supposed to be quiet in this state.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds barrier support for request-based dm.
CORE DESIGN
The design is basically same as bio-based dm, which emulates barrier
by mapping empty barrier bios before/after a barrier I/O.
But request-based dm has been using struct request_queue for I/O
queueing, so the block-layer's barrier mechanism can be used.
o Summary of the block-layer's behavior (which is depended by dm-core)
Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
I/O barrier. It means that when an I/O requiring barrier is found
in the request_queue, the block-layer makes pre-flush request and
post-flush request just before and just after the I/O respectively.
After the ordered sequence starts, the block-layer waits for all
in-flight I/Os to complete, then gives drivers the pre-flush request,
the barrier I/O and the post-flush request one by one.
It means that the request_queue is stopped automatically by
the block-layer until drivers complete each sequence.
o dm-core
For the barrier I/O, treats it as a normal I/O, so no additional
code is needed.
For the pre/post-flush request, flushes caches by the followings:
1. Make the number of empty barrier requests required by target's
num_flush_requests, and map them (dm_rq_barrier()).
2. Waits for the mapped barriers to complete (dm_rq_barrier()).
If error has occurred, save the error value to md->barrier_error
(dm_end_request()).
(*) Basically, the first reported error is taken.
But -EOPNOTSUPP supersedes any error and DM_ENDIO_REQUEUE
follows.
3. Requeue the pre/post-flush request if the error value is
DM_ENDIO_REQUEUE. Otherwise, completes with the error value
(dm_rq_barrier_work()).
The pre/post-flush work above is done in the kernel thread (kdmflush)
context, since memory allocation which might sleep is needed in
dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is
an irq-disabled context.
Also, clones of the pre/post-flush request share an original, so
such clones can't be completed using the softirq context.
Instead, complete them in the context of underlying device drivers.
It should be safe since there is no I/O dispatching during
the completion of such clones.
For suspend, the workqueue of kdmflush needs to be flushed after
the request_queue has been stopped. Otherwise, the next flush work
can be kicked even after the suspend completes.
TARGET INTERFACE
No new interface is added.
Just use the existing num_flush_requests in struct target_type
as same as bio-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch moves dm_end_request() to make the next patch more readable.
No functional change.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>