1. 13 Dec, 2009 14 commits
    • md: add 'recovery_start' per-device sysfs attribute · 06e3c817
      Dan Williams authored
      
      Enable external metadata arrays to manage rebuild checkpointing via an
      md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset.
      
      Also update resync_start_store to allow 'none' to be written, for
      consistency.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
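
      For illustration, the read side of such an rdev attribute can be sketched
      as below; the handler name and md types are assumptions here, not a quote
      of the patch.  It reports 'none' while the device is fully in-sync and
      otherwise the checkpoint in sectors.

          /* Hedged sketch: type and field names assumed. */
          static ssize_t recovery_start_show(mdk_rdev_t *rdev, char *page)
          {
                  unsigned long long recovery_start = rdev->recovery_offset;

                  /* a fully recovered device has no meaningful checkpoint */
                  if (test_bit(In_sync, &rdev->flags) || recovery_start == MaxSector)
                          return sprintf(page, "none\n");

                  return sprintf(page, "%llu\n", recovery_start);
          }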
    • md: rcu_read_lock() walk of mddev->disks in md_do_sync() · 4e59ca7d
      Dan Williams authored
      
      Other walks of this list are either under rcu_read_lock() or the list
      mutation lock (mddev_lock()).  This protects against the improbable case of a
      disk being removed from the array at the start of md_do_sync().
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
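
      The usual shape of such a protected walk, as a sketch (the list-member
      and flag names are assumed; the real md_do_sync loop does more):

          /* Find the lowest per-device recovery checkpoint without racing
           * against concurrent removal of an rdev from the array. */
          static sector_t min_recovery_offset(mddev_t *mddev)
          {
                  mdk_rdev_t *rdev;
                  sector_t min_offset = MaxSector;

                  rcu_read_lock();
                  list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
                          if (rdev->raid_disk >= 0 &&
                              !test_bit(Faulty, &rdev->flags) &&
                              !test_bit(In_sync, &rdev->flags) &&
                              rdev->recovery_offset < min_offset)
                                  min_offset = rdev->recovery_offset;
                  rcu_read_unlock();

                  return min_offset;
          }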
    • md: integrate spares into array at earliest opportunity. · 93be75ff
      NeilBrown authored
      
      As v1.x metadata can record that a member of the array is
      not completely recovered, it makes sense to record that a
      spare has become a regular member of the array at the earliest
      opportunity.
      So remove the tests on "recovery_offset > 0" in super_1_sync
      as they really aren't needed, and schedule a metadata update
      immediately after adding spares to a degraded array.
      
      This means that if a crash happens immediately after a recovery
      starts, the new device will be included in the array and recovery will
      continue from wherever it was up to.  Previously this didn't happen
      unless recovery was at least 1/16 of the way through.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move compat_ioctl handling into md.c · aa98aa31
      Arnd Bergmann authored
      
      The RAID ioctls are only implemented in md.c, so the
      handling for them should also be moved there from
      fs/compat_ioctl.c.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Andre Noll <maan@systemlinux.org>
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
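
      The wrapper this implies is small; a simplified sketch (the exact command
      list here is an assumption): commands that pass plain integers go through
      unchanged, pointer arguments from 32-bit callers go through compat_ptr().

          #ifdef CONFIG_COMPAT
          static int md_compat_ioctl(struct block_device *bdev, fmode_t mode,
                                     unsigned int cmd, unsigned long arg)
          {
                  switch (cmd) {
                  case HOT_REMOVE_DISK:
                  case HOT_ADD_DISK:
                  case SET_DISK_FAULTY:
                          break;          /* integer argument, no conversion */
                  default:
                          arg = (unsigned long)compat_ptr(arg);
                          break;
                  }
                  return md_ioctl(bdev, mode, cmd, arg);
          }
          #endif /* CONFIG_COMPAT */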
    • md: add MODULE_DESCRIPTION for all md related modules. · 0efb9e61
      NeilBrown authored
      
      Suggested by Oren Held <orenhe@il.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid: improve MD/raid10 handling of correctable read errors. · 1e50915f
      Robert Becker authored
      
      We've noticed severe lasting performance degradation of our raid
      arrays when we have drives that yield large amounts of media errors.
      The raid10 module will queue each failed read for retry, and will
      also call fix_read_error() to perform the read recovery.
      Read recovery is performed while the array is frozen, so repeated
      recovery attempts can degrade the performance of the array for
      extended periods of time.
      
      With this patch I propose adding a per md device max number of
      corrected read attempts.  Each rdev will maintain a count of
      read correction attempts in the rdev->read_errors field (not
      used currently for raid10). When we enter fix_read_error()
      we'll check to see when the last read error occurred, and
      divide the read error count by 2 for every hour since the
      last read error. If at that point our read error count
      exceeds the read error threshold, we'll fail the raid device.
      
      In addition, this patch adds sysfs nodes (get/set) for
      the per-md max_read_errors attribute and the rdev->read_errors
      attribute, and adds some printk's to indicate when
      fix_read_error fails to repair an rdev.
      
      For testing I used debugfs->fail_make_request to inject
      IO errors to the rdev while doing IO to the raid array.
      Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
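
      The decay rule above (halve the stored count once per idle hour, then
      compare against the per-array maximum) can be sketched independently of
      the md data structures; the function below is illustrative only.

          #include <stdbool.h>
          #include <time.h>

          static bool read_error_limit_exceeded(unsigned int *read_errors,
                                                time_t *last_error, time_t now,
                                                unsigned int max_read_errors)
          {
                  unsigned long hours = (unsigned long)(now - *last_error) / 3600;

                  if (hours)                       /* halve once per idle hour */
                          *read_errors >>= (hours > 31 ? 31 : hours);
                  *last_error = now;
                  *read_errors += 1;               /* count the current error */
                  return *read_errors > max_read_errors;
          }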
    • md: support updating bitmap parameters via sysfs. · 43a70507
      NeilBrown authored
      
      A new attribute directory 'bitmap' in 'md' is created which
      contains files for configuring the bitmap.
      'location' identifies where the bitmap is: 'none', 'file',
      or a sector offset from the metadata.
      Writing 'location' can create or remove a bitmap.
      Adding a 'file' bitmap this way is not yet supported.
      'chunksize' and 'time_base' must be set before 'location'
      can be set.
      
      'chunksize' can be set before creating a bitmap, but is
      currently always overridden by the bitmap superblock.
      
      'time_base' and 'backlog' can be updated at any time.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
    • md: factor out parsing of fixed-point numbers · 72e02075
      NeilBrown authored
      
      safe_delay_store can parse fixed point numbers (for fractions
      of a second).  We will want to do that for another sysfs
      file soon, so factor out the code.
      Signed-off-by: NeilBrown <neilb@suse.de>
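
      A self-contained sketch of that kind of fixed-point parsing (the
      in-kernel helper's name and signature are not quoted here): with
      scale 3, "2.5" becomes 2500, i.e. seconds expressed as milliseconds.

          #include <errno.h>

          /* Parse "integer[.fraction]" into an integer scaled by 10^scale. */
          static int parse_fixed_point(const char *buf, int scale, unsigned long *res)
          {
                  unsigned long value = 0;
                  int decimals = -1;               /* -1 until a '.' is seen */

                  for (; *buf && *buf != '\n'; buf++) {
                          if (*buf == '.' && decimals < 0) {
                                  decimals = 0;
                                  continue;
                          }
                          if (*buf < '0' || *buf > '9')
                                  return -EINVAL;
                          value = value * 10 + (*buf - '0');
                          if (decimals >= 0)
                                  decimals++;
                  }
                  if (decimals < 0)
                          decimals = 0;
                  for (; decimals < scale; decimals++)     /* pad missing digits */
                          value *= 10;
                  for (; decimals > scale; decimals--)     /* drop excess digits */
                          value /= 10;
                  *res = value;
                  return 0;
          }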
    • md: move offset, daemon_sleep and chunksize out of bitmap structure · 42a04b50
      NeilBrown authored
      
      ... and into bitmap_info.  These are all configuration parameters
      that need to be set before the bitmap is created.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: collect bitmap-specific fields into one structure. · c3d9714e
      NeilBrown authored
      
      In preparation for making bitmap fields configurable via sysfs,
      start tidying up by making a single structure to contain the
      configuration fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: support barrier requests on all personalities. · a2826aa9
      NeilBrown authored
      
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so need
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty -  we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero-length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero-length barriers succeeds, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers is submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
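
      The per-device step amounts to sending an empty barrier bio to every
      active member.  A simplified sketch (completion handling and the
      reference counting that lets the real code sleep in bio_alloc() while
      off the RCU lock are omitted; the callback name is assumed):

          static void submit_zero_length_barriers(mddev_t *mddev)
          {
                  mdk_rdev_t *rdev;

                  rcu_read_lock();
                  list_for_each_entry_rcu(rdev, &mddev->disks, same_set) {
                          struct bio *bio;

                          if (rdev->raid_disk < 0 ||
                              test_bit(Faulty, &rdev->flags))
                                  continue;
                          bio = bio_alloc(GFP_NOIO, 0);    /* zero-length bio */
                          bio->bi_bdev = rdev->bdev;
                          bio->bi_end_io = md_end_barrier; /* assumed callback */
                          bio->bi_private = rdev;
                          submit_bio(WRITE_BARRIER, bio);  /* pre-2.6.37 flag */
                  }
                  rcu_read_unlock();
          }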
    • md: don't reset curr_resync_completed after an interrupted resync · efa59339
      NeilBrown authored
      
      If a resync/recovery/check/repair is interrupted for some reason, it
      can be useful to know exactly where it got up to.
      So in that case, do not clear curr_resync_completed.
      Initialise it when starting a resync/recovery/... instead.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: adjust resync_min usefully when resync aborts. · c07b70ad
      NeilBrown authored
      
      When a 'check' or 'repair' finishes we should clear resync_min
      so that a future check/repair will cover the whole array (by default).
      However if it is interrupted, we should update resync_min to
      where we got up to, so that when the check/repair continues it
      just does the remainder of the array.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: protect against bitmap removal while being updated. · aa5cbd10
      NeilBrown authored
      
      A write intent bitmap can be removed from an array while the
      array is active.
      When this happens, all IO is suspended and flushed before the
      bitmap is removed.
      However it is possible that bitmap_daemon_work is still running to
      clear old bits from the bitmap.  If it is, it can dereference the
      bitmap after it has been freed.
      
      So introduce a new mutex to protect bitmap_daemon_work and get it
      before destroying a bitmap.
      
      This is suitable for any current -stable kernel.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
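
      The pattern being added is the usual "serialize the worker against
      teardown" one; a sketch with assumed names (the actual mutex lives in
      the md structures):

          void bitmap_daemon_work_sketch(mddev_t *mddev)
          {
                  mutex_lock(&mddev->bitmap_mutex);        /* assumed field */
                  if (mddev->bitmap) {
                          /* ... age old bits, write out dirty pages ... */
                  }
                  mutex_unlock(&mddev->bitmap_mutex);
          }

          static void destroy_bitmap_sketch(mddev_t *mddev)
          {
                  struct bitmap *bitmap;

                  mutex_lock(&mddev->bitmap_mutex);
                  bitmap = mddev->bitmap;
                  mddev->bitmap = NULL;            /* work now sees no bitmap */
                  mutex_unlock(&mddev->bitmap_mutex);

                  bitmap_free(bitmap);             /* assumed helper; safe now */
          }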
  2. 18 Nov, 2009 1 commit
  3. 13 Nov, 2009 1 commit
    • md: allow v0.91 metadata to record devices as being active but not in-sync. · 0261cd9f
      NeilBrown authored
      
      This is a combination that didn't really make sense before.
      However when a reshape is converting e.g. raid5 -> raid6, the extra
      device is not fully in-sync, but is certainly active and contains
      important data.
      So allow that state to be meaningful and in particular get
      the 'recovery_offset' value (which is needed for any non-in-sync
      active device) from the reshape_position.
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 12 Nov, 2009 1 commit
    • sysctl drivers: Remove dead binary sysctl support · 894d2491
      Eric W. Biederman authored
      
      Now that sys_sysctl is a wrapper around /proc/sys, all of
      the binary sysctl support elsewhere in the tree is
      dead code.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Corey Minyard <minyard@acm.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
      Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  5. 11 Nov, 2009 1 commit
    • md: factor out updating of 'recovery_offset'. · 5e865106
      NeilBrown authored
      
      Each device has its own 'recovery_offset' showing how far
      recovery has progressed on the device.
      As the only real significance of this is the fact that it can
      be stored in the metadata and recovered at restart, and as
      only 1.x metadata can do this, we were only updating
      'recovery_offset' to 'curr_resync_completed' when updating
      v1.x metadata.
      But this is wrong, and we will shortly make limited use of this
      field in v0.90 metadata.
      
      So move the update into common code.
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 05 Nov, 2009 1 commit
    • md: don't clear endpoint for resync when resync is interrupted. · 24395a85
      NeilBrown authored
      
      If a 'sync_max' has been set (via sysfs), it is wrong to clear it
      until a resync (or reshape or recovery ...) has actually reached that
      point.
      So if a resync is interrupted (e.g. by device failure),
      leave 'resync_max' unchanged.
      
      This is particularly important for 'reshape' operations that do not
      change the size of the array.  For such operations mdadm needs to
      monitor the reshape taking rolling backups of the section being
      reshaped.  If resync_max gets cleared, the reshape can get ahead of
      mdadm and then the backups that mdadm creates are useless.
      
      This is suitable for 2.6.31.y stable kernels.
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 16 Oct, 2009 1 commit
    • md: Fix handling of raid5 array which is being reshaped to fewer devices. · 5e5e3e78
      NeilBrown authored
      
      When a raid5 (or raid6) array is being reshaped to have fewer devices,
      conf->raid_disks is the latter and hence smaller number of devices.
      However sometimes we want to use a number which is the total number of
      currently required devices - the larger of the 'old' and 'new' sizes.
      Before we implemented reducing the number of devices, this was always
      'new' i.e. ->raid_disks.
      Now we need max(raid_disks, previous_raid_disks) in those places.
      
      This particularly affects assembling an array that was shutdown while
      in the middle of a reshape to fewer devices.
      
      md.c needs a similar fix when interpreting the md metadata.
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 23 Sep, 2009 3 commits
  9. 22 Sep, 2009 1 commit
  10. 17 Aug, 2009 1 commit
  11. 12 Aug, 2009 2 commits
    • md: allow upper limit for resync/reshape to be set when array is read-only · 4d484a4a
      NeilBrown authored
      
      Normally we only allow the upper limit for a reshape to be decreased
      when the array is not performing a sync/recovery/reshape, otherwise there
      could be races.  But if an array is part-way through a reshape when it
      is assembled, the reshape is started immediately, leaving no window
      to set an upper bound.
      
      If the array is started read-only, the reshape will be suspended until
      the array becomes writable, so that provides a window during which it
      is perfectly safe to reduce the upper limit of a reshape.
      
      So: allow the upper limit (sync_max) to be reduced even if the reshape
      thread is running, as long as the array is still read-only.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: never advance 'events' counter by more than 1. · 51d5668c
      NeilBrown authored
      
      When assembling arrays, md allows two devices to have different event
      counts as long as the difference is only '1'.  This is to cope with
      a system failure between updating the metadata on two different
      devices.
      
      However there are currently times when we update the event count by
      2.  This was done to keep the event count even when the array is clean
      and odd when it is dirty, which allows us to avoid writing common
      updates to spare devices and so allows those spares to go to sleep.
      
      This is bad for the above reason.  So change it to never increase by
      two.  This means that the alignment between 'odd/even' and
      'clean/dirty' might take a little longer to attain, but that is only a
      small cost.  The spares will get a few more updates but that will
      still be spared (;-) most updates and can still go to sleep.
      
      Prior to this patch there was a small chance that after a crash an
      array would fail to assemble due to the overly large event count
      mismatch.
      Signed-off-by: NeilBrown <neilb@suse.de>
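
      The assembly rule the first paragraph relies on reduces to a one-line
      check; a trivial sketch of the rule (not md code):

          /* Two superblocks may be merged at assembly time if their event
           * counts differ by at most one. */
          static int events_close_enough(unsigned long long a, unsigned long long b)
          {
                  unsigned long long diff = (a > b) ? a - b : b - a;

                  return diff <= 1;
          }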
  12. 09 Aug, 2009 1 commit
    • Remove deadlock potential in md_open · c8c00a69
      NeilBrown authored
      A recent commit:
        commit 449aad3e
      
      introduced the possibility of an A-B/B-A deadlock between
      bd_mutex and reconfig_mutex.
      
      __blkdev_get holds bd_mutex while calling md_open which takes
         reconfig_mutex,
      do_md_run is always called with reconfig_mutex held, and it now
         takes bd_mutex in the call to revalidate_disk.
      
      This potential deadlock was not caught by lockdep due to the
      use of mutex_lock_interruptible_nested which was introduced
      by
         commit d63a5a74
      to avoid a warning of an impossible deadlock.
      
      It is quite possible to split reconfig_mutex into two locks.
      One protects the array data structures while it is being
      reconfigured, the other ensures that an array is never even partially
      open while it is being deactivated.
      In particular, the second lock prevents an open from completing
      between the time when do_md_stop checks if there are any active opens,
      and the time when the array is either set read-only, or when ->pers is
      set to NULL.  So we can be certain that no IO is in flight as the
      array is being destroyed.
      
      So create a new lock, open_mutex, just to ensure exclusion between
      'open' and 'stop'.
      
      This avoids the deadlock and also avoids the lockdep warning mentioned
      in commit d63a5a74.
      
      Reported-by: "Mike Snitzer" <snitzer@gmail.com>
      Reported-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
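
      The open/stop exclusion described above, as a simplified sketch (only
      the fields needed to show the idea; not the full md_open/do_md_stop):

          static int md_open_sketch(mddev_t *mddev)
          {
                  int err = mutex_lock_interruptible(&mddev->open_mutex);

                  if (err)
                          return err;
                  atomic_inc(&mddev->openers);     /* counted while 'stop' is excluded */
                  mutex_unlock(&mddev->open_mutex);
                  return 0;
          }

          static int do_md_stop_sketch(mddev_t *mddev)
          {
                  mutex_lock(&mddev->open_mutex);
                  if (atomic_read(&mddev->openers) > 0) {
                          mutex_unlock(&mddev->open_mutex);
                          return -EBUSY;           /* someone holds it open */
                  }
                  /* set read-only or clear ->pers; no IO can be in flight */
                  mutex_unlock(&mddev->open_mutex);
                  return 0;
          }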
  13. 02 Aug, 2009 5 commits
    • md: Use revalidate_disk to effect changes in size of device. · 449aad3e
      NeilBrown authored
      
      As revalidate_disk calls check_disk_size_change, it will cause
      any capacity change of a gendisk to be propagated to the blockdev
      inode.  So use that instead of mucking about with locks and
      i_size_write.
      
      Also add a call to revalidate_disk in do_md_run and a few other places
      where the gendisk capacity is changed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Handle growth of v1.x metadata correctly. · 70471daf
      NeilBrown authored
      
      The v1.x metadata does not have a fixed size and can grow
      when devices are added.
      If it grows enough to require an extra sector of storage,
      we need to update the 'sb_size' to match.
      
      Without this, md can write out an incomplete superblock with a
      bad checksum, which will be rejected when trying to re-assemble
      the array.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: avoid array overflow with bad v1.x metadata · 3673f305
      NeilBrown authored
      
      We trust the 'desc_nr' field in v1.x metadata enough to use it
      as an index in an array.  This isn't really safe.
      So range-check the value first.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: when a level change reduces the number of devices, remove the excess. · 3a981b03
      NeilBrown authored
      
      When an array is changed from RAID6 to RAID5, fewer drives are
      needed.  So any device that is made superfluous by the level
      conversion must be marked as not-active.
      For the RAID6->RAID5 conversion, this will be a drive which only
      has 'Q' blocks on it.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Push down data integrity code to personalities. · ac5e7113
      Andre Noll authored
      
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
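
      The core of the registration step is a comparison loop over the active
      members; a hedged sketch (block-layer helpers of that era, md field
      names assumed):

          static int integrity_register_sketch(mddev_t *mddev)
          {
                  mdk_rdev_t *rdev, *reference = NULL;

                  list_for_each_entry(rdev, &mddev->disks, same_set) {
                          if (rdev->raid_disk < 0)
                                  continue;        /* skip spares */
                          if (!reference) {
                                  reference = rdev;
                                  continue;
                          }
                          if (blk_integrity_compare(reference->bdev->bd_disk,
                                                    rdev->bdev->bd_disk) < 0)
                                  return -EINVAL;  /* profiles differ */
                  }
                  if (!reference || !bdev_get_integrity(reference->bdev))
                          return 0;                /* nothing to register */
                  if (blk_integrity_register(mddev->gendisk,
                                             bdev_get_integrity(reference->bdev)))
                          return -EINVAL;
                  return 0;
          }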
  14. 08 Jul, 2009 1 commit
  15. 30 Jun, 2009 4 commits
  16. 17 Jun, 2009 2 commits
    • md: Move check for bitmap presence to personality code. · 0894cc30
      Andre Noll authored
      
      If the superblock of a component device indicates the presence of a
      bitmap but the corresponding raid personality does not support bitmaps
      (raid0, linear, multipath, faulty), then something is seriously wrong
      and we'd better refuse to run such an array.
      
      Currently, this check is performed while the superblocks are examined,
      i.e. before entering personality code. Therefore the generic md layer
      must know which raid levels support bitmaps and which do not.
      
      This patch avoids this layer violation without adding identical code
      to various personalities. This is accomplished by introducing a new
      public function to md.c, md_check_no_bitmap(), which replaces the
      hard-coded checks in the superblock loading functions.
      
      A call to md_check_no_bitmap() is added to the ->run method of each
      personality which does not support bitmaps and assembly is aborted
      if at least one component device contains a bitmap.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove chunksize rounding from common code. · 8190e754
      NeilBrown authored
      
      It is easiest to round sizes to multiples of chunk size in
      the personality code for those personalities which care.
      Those personalities now do the rounding, so we can
      remove that function from common code.
      
      Also remove the upper bound on the size of a chunk, and the lower
      bound on the size of a device (1 chunk), neither of which really buys
      us anything.
      Signed-off-by: NeilBrown <neilb@suse.de>
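
      The rounding that moves into the personalities is just a round-down to
      a whole number of chunks, e.g. (sketch; assumes the chunk size in
      sectors is a power of two):

          static inline unsigned long long round_to_chunks(unsigned long long sectors,
                                                           unsigned int chunk_sectors)
          {
                  return sectors & ~((unsigned long long)chunk_sectors - 1);
          }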