The dm-bufio interface allows you to do cached I/O on devices,
holding recently-read blocks in memory and performing delayed writes.
We don't use buffer cache or page cache already present in the kernel, because:
* we need to handle block sizes larger than a page
* we can't allocate memory to perform reads or we'd have deadlocks
Currently, when a cache is required, we limit its size to a fraction of
available memory. Usage can be viewed and changed in
/sys/module/dm_bufio/parameters/ .
The first user is thin provisioning, but more dm users are planned.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.
The thin provisioning pool device will use this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.
The initial implementation of the thin provisioning target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce the concept of a singleton table which contains exactly one target.
If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
will ensure that any table that includes that target contains no others.
The thin provisioning pool target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces dm_kcopyd_zero() to make it easy to use
kcopyd to write zeros into the requested areas instead
instead of copying. It is implemented by passing a NULL
copying source to dm_kcopyd_copy().
The forthcoming thin provisioning target uses this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Since set_current_state() contains a memory barrier in it,
an additional barrier isn't needed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
printk_ratelimit() shares global ratelimiting state with all
other subsystems, so its usage is discouraged. Instead,
define and use dm's local state.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
underlying devices are non-rotational. Tools like ureadahead will
schedule IOs differently based on the rotational flag.
With this patch, I see boot time go from 7.75 s to 7.46 s on my device.
Suggested-by: J. Richard Barnette <jrbarnette@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This is a fairly serious bug in RAID10.
When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional. It would normally
be possible to recover the data, but that would need care and is not
guaranteed.
This bug was introduced in commit
2bb77736ae5dca0a189829fbb7379d43364a9dac
which first appeared in 3.1.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
In 3.0 we changed the way recovery_disabled was handle so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.
So initialise the 'conf' value to one less that the mddev value, so
the will only be the same when explicitly set that way.
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9
which was in 2.6.36.
There is a small window of time between when a device fails and when
it is removed from the array. During this time we might still read
from it, but we won't write to it - so it is possible that we could
read stale data.
We didn't need the test of 'Faulty' before because the test on
In_sync is sufficient. Since we started allowing reads from the early
part of non-In_sync devices we need a test on Faulty too.
This is suitable for any kernel from 2.6.36 onwards, though the patch
might need a bit of tweaking in 3.0 and earlier.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
bio originally has the functionality to set the complete cpu, but
it is broken.
Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken. The only thing keeping
it serves is creating more confusion and possibly more bugs."
And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".
So this patch tries to remove all the work of controling complete cpu
from a bio.
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix memory leak introduced by commit a6e50b409d3f9e0833e69c3c9cca822e8fa4adbb
(dm snapshot: skip reading origin when overwriting complete chunk).
When allocating a set of jobs from kc->job_pool, job->master_job must be
set (to point to itself) so that the mempool item gets freed when the
master_job completes.
master_job was introduced by commit c6ea41fbbe08f270a8edef99dc369faf809d1bd6
(dm kcopyd: preallocate sub jobs to avoid deadlock)
Reported-by: Michael Leun <ml@newton.leun.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).
Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a device to an active array it can be meaningful to set
the 'insync' flag. This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.
Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.
So clear In_sync after moving its value to saved_raid_disk.
Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread. This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.
However writeback request devices to appear to be congested after a
while so it can make some guesstimate of throughput. The infinite
queue defeats that (note that RAID5 has already has a finite queue so
it doesn't suffer from this problem).
So impose a limit on the number of pending write requests. By default
it is 1024 which seems to be generally suitable. Make it configurable
via module option just in case someone finds a regression.
Signed-off-by: NeilBrown <neilb@suse.de>
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.
Signed-off-by: NeilBrown <neilb@suse.de>
When md assembles a RAID0 array it prints out lots of info which
is really just for debugging, so convert that to pr_debug.
It also prints out the resulting configuration which could be
interesting, so keep that as 'printk' but tidy it up a bit.
Signed-off-by: NeilBrown <neilb@suse.de>
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In the 'abort' branch of run(), 'conf' cannot possibly be NULL,
so remove the test.
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There wasn't much and it is inconsistent.
Also rearrange fields to keep related fields together.
Reported-by: Aapo Laine <aapo.laine@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
If optional discard support in dm-crypt is enabled, discards requests
bypass the crypt queue and blocks of the underlying device are discarded.
For the read path, discarded blocks are handled the same as normal
ciphertext blocks, thus decrypted.
So if the underlying device announces discarded regions return zeroes,
dm-crypt must disable this flag because after decryption there is just
random noise instead of zeroes.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix off-by-one error in validation of write_mostly.
The user-supplied value given for the 'write_mostly' argument must be an
index starting at 0. The validation of the supplied argument failed to
check for 'N' ('>' vs '>='), which would have caused an access beyond the
end of the array.
Reported-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Commit a63a5cf (dm: improve block integrity support) introduced a
two-phase initialization of a DM device's integrity profile. This
patch avoids dereferencing a NULL 'template_disk' pointer in
blk_integrity_register() if there is an integrity profile mismatch in
dm_table_set_integrity().
This can occur if the integrity profiles for stacked devices in a DM
table are changed between the call to dm_table_prealloc_integrity() and
dm_table_set_integrity().
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org # 2.6.39
If no arguments were provided to the corrupt_bio_byte feature an error
should be returned immediately.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.
1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for alot.
* drivers/md/md.c: md_notify_reboot() - only impose a delay if
there was at least one MD device to be stopped during reboot
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>