1. 17 Dec, 2009 3 commits
  2. 15 Dec, 2009 1 commit
  3. 11 Nov, 2009 4 commits
  4. 14 Oct, 2009 1 commit
  5. 13 Oct, 2009 1 commit
    • Chris Mason's avatar
      Btrfs: avoid tree log commit when there are no changes · 257c62e1
      Chris Mason authored
      
      rpm has a habit of running fdatasync when the file hasn't
      changed.  We already detect if a file hasn't been changed
      in the current transaction but it might have been sent to
      the tree-log in this transaction and not changed since
      the last call to fsync.
      
      In this case, we want to avoid a tree log sync, which includes
      a number of synchronous writes and barriers.  This commit
      extends the existing tracking of the last transaction to change
      a file to also track the last sub-transaction.
      
      The end result is that rpm -ivh and -Uvh are roughly twice as fast,
      and on par with ext3.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      257c62e1
  6. 09 Oct, 2009 3 commits
  7. 08 Oct, 2009 3 commits
    • Josef Bacik's avatar
      Btrfs: release delalloc reservations on extent item insertion · 32c00aff
      Josef Bacik authored
      
      This patch fixes an issue with the delalloc metadata space reservation
      code.  The problem is we used to free the reservation as soon as we
      allocated the delalloc region.  The problem with this is if we are not
      inserting an inline extent, we don't actually insert the extent item until
      after the ordered extent is written out.  This patch does 3 things,
      
      1) It moves the reservation clearing stuff into the ordered code, so when
      we remove the ordered extent we remove the reservation.
      2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
      delalloc bits in the cases where we want to clear the metadata reservation
      when we clear the delalloc extent, in the case that we do an inline extent
      or we invalidate the page.
      3) It adds another waitqueue to the space info so that when we start a fs
      wide delalloc flush, anybody else who also hits that area will simply wait
      for the flush to finish and then try to make their allocation.
      
      This has been tested thoroughly to make sure we did not regress on
      performance.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      32c00aff
    • Chris Mason's avatar
      Btrfs: delay clearing EXTENT_DELALLOC for compressed extents · a3429ab7
      Chris Mason authored
      
      When compression is on, the cow_file_range code is farmed off to
      worker threads.  This allows us to do significant CPU work in parallel
      on SMP machines.
      
      But it is a delicate balance around when we clear flags and how.  In
      the past we cleared the delalloc flag immediately, which was safe
      because the pages stayed locked.
      
      But this is causing problems with the newest ENOSPC code, and with the
      recent extent state cleanups we can now clear the delalloc bit at the
      same time the uncompressed code does.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a3429ab7
    • Chris Mason's avatar
      Btrfs: cleanup extent_clear_unlock_delalloc flags · a791e35e
      Chris Mason authored
      
      extent_clear_unlock_delalloc has a growing set of ugly parameters
      that is very difficult to read and maintain.
      
      This switches to a flag field and well named flag defines.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a791e35e
  8. 01 Oct, 2009 2 commits
  9. 28 Sep, 2009 1 commit
    • Josef Bacik's avatar
      Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik authored
      
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then once we've done our
      modifications and such, just call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in a ENOSPC
      related panic.  This patch also turns off the metadata_ratio stuff in order to
      allow users to more efficiently use their disk space.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io tree's, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are handled prior to any actually dirty'ing occurs,
      and then we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of doing any of the actual dirty'ing lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9ed74f2d
  10. 24 Sep, 2009 2 commits
    • Yan, Zheng's avatar
      Btrfs: don't rename file into dummy directory · f679a840
      Yan, Zheng authored
      
      A recent change enforces only one access point to each subvolume. The first
      directory entry (the one added when the subvolume/snapshot was created) is
      treated as valid access point, all other subvolume links are linked to dummy
      empty directories. The dummy directories are temporary inodes that only in
      memory, so we can not rename file into them.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      f679a840
    • Yan, Zheng's avatar
      Btrfs: check size of inode backref before adding hardlink · a5719521
      Yan, Zheng authored
      
      For every hardlink in btrfs, there is a corresponding inode back
      reference. All inode back references for hardlinks in a given
      directory are stored in single b-tree item. The size of b-tree item
      is limited by the size of b-tree leaf, so we can only create limited
      number of hardlinks to a given file in a directory.
      
      The original code lacks of the check, it oops if the number of
      hardlinks goes over the limit. This patch fixes the issue by adding
      check to btrfs_link and btrfs_rename.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a5719521
  11. 22 Sep, 2009 2 commits
  12. 21 Sep, 2009 3 commits
    • Yan, Zheng's avatar
      Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng authored
      
      This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contains links to other subvolumes can be destroyed.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      76dda93c
    • Yan, Zheng's avatar
      Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng authored
      
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to other subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating hard link to
      directory, and has the very similar problems.
      
      The aim of this patch is enforcing there is only one access point to
      each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as valid access point.
      The first directory entry is distinguished by checking root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support, the code
      allows rename subvolume link across subvolumes.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4df27c4d
    • Yan, Zheng's avatar
      Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng authored
      
      The new back reference format does not allow reusing objectid of
      deleted snapshot/subvol. So we use ++highest_objectid to allocate
      objectid for new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate objectid for both new inode
      and new snapshot/subvolume, so this patch removes 'find hole' code in
      btrfs_find_free_objectid.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      13a8a7c8
  13. 18 Sep, 2009 1 commit
  14. 16 Sep, 2009 1 commit
  15. 11 Sep, 2009 7 commits
    • Chris Mason's avatar
      Btrfs: zero page past end of inline file items · 93c82d57
      Chris Mason authored
      
      When btrfs_get_extent is reading inline file items for readpage,
      it needs to copy the inline extent into the page.  If the
      inline extent doesn't cover all of the page, that means there
      is a hole in the file, or that our file is smaller than one
      page.
      
      readpage does zeroing for the case where the file is smaller than one
      page, but nobody is currently zeroing for the case where there is
      a hole after the inline item.
      
      This commit changes btrfs_get_extent to zero fill the page past
      the end of the inline item.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      93c82d57
    • Chris Mason's avatar
      Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason authored
      
      This closes a whole where the page may be written before
      the page_mkwrite caller has a chance to dirty it
      
      (thanks to Nick Piggin)
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      50a9b214
    • Chris Mason's avatar
      Btrfs: Fix extent replacment race · a1ed835e
      Chris Mason authored
      
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a1ed835e
    • Chris Mason's avatar
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason authored
      
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      8b62b72b
    • Chris Mason's avatar
      Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason authored
      
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9655d298
    • Chris Mason's avatar
      Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason authored
      
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      2c64c53d
    • Chris Mason's avatar
      Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason authored
      
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to phyiscal ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      890871be
  16. 21 Aug, 2009 1 commit
  17. 07 Aug, 2009 1 commit
  18. 22 Jul, 2009 1 commit
  19. 12 Jul, 2009 1 commit
  20. 02 Jul, 2009 1 commit