aboutsummaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2009-05-27md: raid5: change incorrect usage of 'min' macro to 'min_t'NeilBrown
A recent patch to raid5.c use min on an int and a sector_t. This isn't allowed. So change it to min_t(sector_t,x,y). Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: don't use locked_ioctl.NeilBrown
md has no need for the BKL - it does its own locking. So md_ioctl doesn't need to be a locked_ioctl. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: don't update curr_resync_completed without also updating reshape_position.NeilBrown
In order for the metadata to always be consistent, we mustn't updated curr_resync_completed without also updating reshape_position. The reshape code updates both at the same time. However since commit 97e4f42d62badb0f9fbc27c013e89bc1336a03bc the common md_do_sync will sometimes update curr_resync_completed but is not in a position to update reshape_position. So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is happening, so reshape_position might change), don't update curr_resync_completed in md_do_sync, leave it to the per-personality reshape code. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: raid5: avoid sector values going negative when testing reshape progress.NeilBrown
As sector_t in unsigned, we cannot afford to let 'safepos' etc go negative. So replace a -= b; by a -= min(b,a); Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: export 'frozen' resync state through sysfsNeilBrown
The md resync engine has a 'frozen' state which ensures that no resync/recovery. This is used to avoid races. Export this state through the 'sync_action' sysfs attribute so that user-space can benefit and also avoid some races. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: bitmap: improve bitmap maintenance code.NeilBrown
The code for checking which bits in the bitmap can be cleared has 2 problems: 1/ it repeatedly takes and drops a spinlock, where it would make more sense to just hold on to it most of the time. 2/ it doesn't make use of some opportunities to skip large sections of the bitmap This patch fixes those. It will only affect CPU consumption, not correctness. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: improve errno return when setting array_sizeNeilBrown
Instead of always returns EINVAL if anything goes wrong when setting the array size, add the option of E2BIG if the size requested is too large. This makes it easier for user-space to be sure what went wrong. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26md: always update level / chunk_size / layout when writing v1.x metadata.NeilBrown
We previously didn't update these fields when writing the metadata because they could never change. They can now, so we better write them. v0.90 metadata always updated these fields. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: remove rd%d links immediately after stopping an array.NeilBrown
md maintains link in sys/mdXX/md/ to identify which device has which role in the array. e.g. rd2 -> dev-sda indicates that the device with role '2' in the array is sda. These links are only present when the array is active. They are created immediately after ->run is called, and so should be removed immediately after ->stop is called. However they are currently removed a little bit later, and it is possible for ->run to be called again, thus adding these links, before they are removed. So move the removal earlier so they are consistently only present when the array is active. Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: remove ability to explicit set an inactive array to 'clean'.NeilBrown
Being able to write 'clean' to an 'array_state' of an inactive array to activate it in 'clean' mode is both unnecessary and inconvenient. It is unnecessary because the same can be achieved by writing 'active'. This activates and array, but it still remains 'clean' until the first write. It is inconvenient because writing 'clean' is more often used to cause an 'active' array to revert to 'clean' mode (thus blocking any writes until a 'write-pending' is promoted to 'active'). Allowing 'clean' to both activate an array and mark an active array as clean can lead to races: One program writes 'clean' to mark the active array as clean at the same time as another program writes 'inactive' to deactivate (stop) and active array. Depending on which writes first, the array could be deactivated and immediately reactivated which isn't what was desired. So just disable the use of 'clean' to activate an array. This avoids a race that can be triggered with mdadm-3.0 and external metadata, so it suitable for -stable. Reported-by: Rafal Marszewski <rafal.marszewski@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: <stable@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: constify VFTsJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: tidy up status_resync to handle large arrays.NeilBrown
Two problems in status_resync. 1/ It still used Kilobytes as the basic block unit, while most code now uses sectors uniformly. 2/ It doesn't allow for the possibility that max_sectors exceeds the range of "unsigned long". So - change "max_blocks" to "max_sectors", and store sector numbers in there and in 'resync' - Make 'rt' a 'sector_t' so it can temporarily hold the number of remaining sectors. - use sector_div rather than normal division. - change the magic '100' used to preserve precision to '32'. + making it a power of 2 makes division easier + it doesn't need to be as large as it was chosen when we averaged speed over the entire run. Now we average speed over the last 30 seconds or so. Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: fix some (more) errors with bitmaps on devices larger than 2TB.NeilBrown
If a write intent bitmap covers more than 2TB, we sometimes work with values beyond 32bit, so these need to be sector_t. This patches add the required casts to some unsigned longs that are being shifted up. This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with member devices that are larger than 2TB. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Cc: stable@kernel.org
2009-05-07md/raid10: don't clear bitmap during recovery if array will still be degraded.NeilBrown
If we have a raid10 with multiple missing devices, and we recover just one of these to a spare, then we risk (depending on the bitmap and array chunk size) clearing bits of the bitmap for which recovery isn't complete (because a device is still missing). This can lead to a subsequent "re-add" being recovered without any IO happening, which would result in loss of data. This patch takes the safe approach of not clearing bitmap bits if the array will still be degraded. This patch is suitable for all active -stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07md: fix loading of out-of-date bitmap.NeilBrown
When md is loading a bitmap which it knows is out of date, it fills each page with 1s and writes it back out again. However the write_page call makes used of bitmap->file_pages and bitmap->last_page_size which haven't been set correctly yet. So this can sometimes fail. Move the setting of file_pages and last_page_size to before the call to write_page. This bug can cause the assembly on an array to fail, thus making the data inaccessible. Hence I think it is a suitable candidate for -stable. Cc: stable@kernel.org Reported-by: Vojtech Pavlik <vojtech@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-20Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds
* 'for-linus' of git://neil.brown.name/md: md: support bitmaps on RAID10 arrays larger then 2 terabytes md: update sync_completed and reshape_position even more often. md: improve usefulness and accuracy of sysfs file md/sync_completed. md: allow setting newly added device to 'in_sync' via sysfs. md: tiny md.h cleanups
2009-04-20md: support bitmaps on RAID10 arrays larger then 2 terabytesNeilBrown
.. and other arrays with components larger than 2 terabytes. We use a "long" rather than a "sector_t" in part of the bitmap size calculations, which is sad. Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-17md: update sync_completed and reshape_position even more often.NeilBrown
There are circumstances when a user-space process might need to "oversee" a resync/reshape process. For example when doing an in-place reshape of a raid5, it is prudent to take a backup of each section before reshaping it as this is the only way to provide safety against an unplanned shutdown (i.e. crash/power failure). The sync_max sysfs value can be used to stop the resync from advancing beyond a particular point. So user-space can: suspend IO to the first section and back it up set 'sync_max' to the end of the section wait for 'sync_completed' to reach that point resume IO on the first section and move on to the next section. However this process requires the kernel and user-space to run in lock-step which could introduce unnecessary delays. It would be better if a 'double buffered' approach could be used with userspace and kernel space working on different sections with the 'next' section always ready when the 'current' section is finished. One problem with implementing this is that sync_completed is only guaranteed to be updated when the sync process reaches sync_max. (it is updated on a time basis at other times, but it is hard to rely on that). This defeats some of the double buffering. With this patch, sync_completed (and reshape_position) get updated as the current position approaches sync_max, so there is room for userspace to advance sync_max early without losing updates. To be precise, sync_completed is updated when the current sync position reaches half way between the current value of sync_completed and the value of sync_max. This will usually be a good time for user space to update sync_max. If sync_max does not get updated, the updates to sync_completed (together with associated metadata updates) will occur at an exponentially increasing frequency which will get unreasonably fast (one update every page) immediately before the process hits sync_max and stops. So the update rate will be unreasonably fast only for an insignificant period of time. Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-15block: move bio list helpers into bio.hChristoph Hellwig
It's used by DM and MD and generally useful, so move the bio list helpers into bio.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-14md: improve usefulness and accuracy of sysfs file md/sync_completed.NeilBrown
The sync_completed file reports how much of a resync (or recovery or reshape) has been completed. However due to the possibility of out-of-order completion of writes, it is not certain to be accurate. We have an internal value - mddev->curr_resync_completed - which is an accurate value (though it might not always be quite so uptodate). So: - make curr_resync_completed be uptodate a little more often, particularly when raid5 reshape updates status in the metadata - report curr_resync_completed in the sysfs file - allow poll/select to report all updates to md/sync_completed. This makes sync_completed completed usable by any external metadata handler that wants to record this status information in its metadata. Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-14md: allow setting newly added device to 'in_sync' via sysfs.NeilBrown
When adding devices to an active array via sysfs, there is currently no way to mark a device as 'in-sync' which is useful when incrementally assembling an array. So add that option. Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-14md: tiny md.h cleanupsChristoph Hellwig
- update inclusion guard and make sure it covers the whole file - remove superflous #ifdef CONFIG_BLOCK - make sure all required headers are included so that new users aren't required to include others before Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-09dm kcopyd: fix callback raceMikulas Patocka
If the thread calling dm_kcopyd_copy is delayed due to scheduling inside split_job/segment_complete and the subjobs complete before the loop in split_job completes, the kcopyd callback could be invoked from the thread that called dm_kcopyd_copy instead of the kcopyd workqueue. dm_kcopyd_copy -> split_job -> segment_complete -> job->fn() Snapshots depend on the fact that callbacks are called from the singlethreaded kcopyd workqueue and expect that there is no racing between individual callbacks. The racing between callbacks can lead to corruption of exception store and it can also mean that exception store callbacks are called twice for the same exception - a likely reason for crashes reported inside pending_complete() / remove_exception(). This patch fixes two problems: 1. job->fn being called from the thread that submitted the job (see above). - Fix: hand over the completion callback to the kcopyd thread. 2. job->fn(read_err, write_err, job->context); in segment_complete reports the error of the last subjob, not the union of all errors. - Fix: pass job->write_err to the callback to report all error bits (it is done already in run_complete_job) Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm kcopyd: prepare for callback race fixMikulas Patocka
Use a variable in segment_complete() to point to the dm_kcopyd_client struct and only release job->pages in run_complete_job() if any are defined. These changes are needed by the next patch. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: implement basic barrier supportMikulas Patocka
Barriers are submitted to a worker thread that issues them in-order. The thread is modified so that when it sees a barrier request it waits for all pending IO before the request then submits the barrier and waits for it. (We must wait, otherwise it could be intermixed with following requests.) Errors from the barrier request are recorded in a per-device barrier_error variable. There may be only one barrier request in progress at once. For now, the barrier request is converted to a non-barrier request when sending it to the underlying device. This patch guarantees correct barrier behavior if the underlying device doesn't perform write-back caching. The same requirement existed before barriers were supported in dm. Bottom layer barrier support (sending barriers by target drivers) and handling devices with write-back caches will be done in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: remove dm_request loopMikulas Patocka
Remove queue_io return value and a loop in dm_request. IO may be submitted to a worker thread with queue_io(). queue_io() sets DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this point on, requests are submitted from dm_request again. This will be used for processing barriers. Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread even if DMF_QUEUE_IO_TO_THREAD was not set. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: rework queueing and suspensionMikulas Patocka
Rework shutting down on suspend and document the associated rules. Drop write lock in __split_and_process_bio to allow more processing concurrency. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: simplify dm_request loopAlasdair G Kergon
Refactor the code in dm_request(). Require the new DMF_BLOCK_FOR_SUSPEND flag on readahead bios we will discard so we don't drop such bios while processing a barrier. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: split DMF_BLOCK_IO flag into twoAlasdair G Kergon
Split the DMF_BLOCK_IO flag into two. DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a device. DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a worker thread for later processing. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: rearrange dm_wq_workAlasdair G Kergon
Refactor dm_wq_work() to make later patch more readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: remove limited barrier supportMikulas Patocka
Prepare for full barrier implementation: first remove the restricted support. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09dm: add integrity supportMartin K. Petersen
This patch provides support for data integrity passthrough in the device mapper. - If one or more component devices support integrity an integrity profile is preallocated for the DM device. - If all component devices have compatible profiles the DM device is flagged as capable. - Handle integrity metadata when splitting and cloning bios. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-06md/raid1: fix build breakageAlexander Beregalov
Fix this build error: drivers/md/raid1.c: In function 'raid1_congested': drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared BDI_write_congested was changed in commit 1faa16d228 ("block: change the request allocation/congestion logic to be sync/async based") Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06md/raid1 - don't assume newly allocated bvecs are initialised.NeilBrown
Since commit d3f761104b097738932afcc310fbbbbfb007ef92 newly allocated bvecs aren't initialised to NULL, so we have to be more careful about freeing a bio which only managed to get a few pages allocated to it. Otherwise the resync process crashes. This patch is appropriate for 2.6.29-stable. Cc: stable@kernel.org Cc: "Jens Axboe" <jens.axboe@oracle.com> Reported-by: Gabriele Tozzi <gabriele@tozzi.eu> Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (36 commits) dm: set queue ordered mode dm: move wait queue declaration dm: merge pushback and deferred bio lists dm: allow uninterruptible wait for pending io dm: merge __flush_deferred_io into caller dm: move bio_io_error into __split_and_process_bio dm: rename __split_bio dm: remove unnecessary struct dm_wq_req dm: remove unnecessary work queue context field dm: remove unnecessary work queue type field dm: bio list add bio_list_add_head dm snapshot: persistent fix dtr cleanup dm snapshot: move status to exception store dm snapshot: move ctr parsing to exception store dm snapshot: use DMEMIT macro for status dm snapshot: remove dm_snap header dm snapshot: remove dm_snap header use dm exception store: move cow pointer dm exception store: move chunk_fields dm exception store: move dm_target pointer ...
2009-04-03Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds
* 'for-linus' of git://neil.brown.name/md: (53 commits) md/raid5 revise rules for when to update metadata during reshape md/raid5: minor code cleanups in make_request. md: remove CONFIG_MD_RAID_RESHAPE config option. md/raid5: be more careful about write ordering when reshaping. md: don't display meaningless values in sysfs files resync_start and sync_speed md/raid5: allow layout and chunksize to be changed on active array. md/raid5: reshape using largest of old and new chunk size md/raid5: prepare for allowing reshape to change layout md/raid5: prepare for allowing reshape to change chunksize. md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. Documentation/md.txt update md: allow number of drives in raid5 to be reduced md/raid5: change reshape-progress measurement to cope with reshaping backwards. md: add explicit method to signal the end of a reshape. md/raid5: enhance raid5_size to work correctly with negative delta_disks md/raid5: drop qd_idx from r6_state md/raid6: move raid6 data processing to raid6_pq.ko md: raid5 run(): Fix max_degraded for raid level 4. md: 'array_size' sysfs attribute md: centralize ->array_sectors modifications ...
2009-04-02dm: set queue ordered modeMikulas Patocka
Set queue ordered mode. It doesn't really matter what we set here because we don't ever put any requests on the queue. But we need to set something other than QUEUE_ORDERED_NONE so that __generic_make_request passes barrier requests to us. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: move wait queue declarationMikulas Patocka
Move wait queue declaration and unplug to dm_wait_for_completion. The purpose is to minimize duplicate code in the further patches. The patch reorders functions a little bit. It doesn't change any functionality. For proper non-deadlock operation, add_wait_queue must happen before set_current_state(interruptible) and before the test for !atomic_read(&md->pending). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: merge pushback and deferred bio listsMikulas Patocka
Merge pushback and deferred lists into one list - use deferred list for both deferred and pushed-back bios. This will be needed for proper support of barrier bios: it is impossible to support ordering correctly with two lists because the requests on both lists will be mixed up. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: allow uninterruptible wait for pending ioMikulas Patocka
Allow uninterruptible wait for pending IOs. Add argument "interruptible" to dm_wait_for_completion that specifies either interruptible or uninterruptible waiting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: merge __flush_deferred_io into callerMikulas Patocka
Merge __flush_deferred_io() into the only caller, dm_wq_work(). There's no need to have a function that has only one caller. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: move bio_io_error into __split_and_process_bioMikulas Patocka
Move the bio_io_error() calls directly into __split_and_process_bio(). This avoids some code duplication in later patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: rename __split_bioMikulas Patocka
Rename __split_bio() to __split_and_process_bio() because it not only splits the bio to serveral parts, but also submits them to target drivers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: remove unnecessary struct dm_wq_reqMikulas Patocka
Remove struct dm_wq_req and move "work" directly into struct mapped_device. In the revised implementation, the thread will do just one type of work (processing the queue). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: remove unnecessary work queue context fieldMikulas Patocka
Remove the context field from struct dm_wq_req because we will no longer need it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: remove unnecessary work queue type fieldMikulas Patocka
Remove "type" field from struct dm_wq_req because we no longer need it to have more than one value. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm: bio list add bio_list_add_headMikulas Patocka
Introduce a function that adds a bio to the head of the list for use by the patch that will support barriers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm snapshot: persistent fix dtr cleanupJonathan Brassow
The persistent exception store destructor does not properly account for all conditions in which it can be called. If it is called after 'ctr' but before 'read_metadata' (e.g. if something else in 'snapshot_ctr' fails) then it will attempt to free areas of memory that haven't been allocated yet. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm snapshot: move status to exception storeJonathan Brassow
Let the exception store types print out their status through the new API, rather than having the snapshot code do it. Adjust the buffer position to allow for the preceding DMEMIT in the arguments to type->status(). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02dm snapshot: move ctr parsing to exception storeJonathan Brassow
First step of having the exception stores parse their own arguments - generalizing the interface. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>