aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/file.c
AgeCommit message (Collapse)Author
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits) Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree Btrfs: allow treeid==0 in the inode lookup ioctl Btrfs: return keys for large items to the search ioctl Btrfs: fix key checks and advance in the search ioctl Btrfs: buffer results in the space_info ioctl Btrfs: use __u64 types in ioctl.h Btrfs: fix search_ioctl key advance Btrfs: fix gfp flags masking in the compression code Btrfs: don't look at bio flags after submit_bio btrfs: using btrfs_stack_device_id() get devid btrfs: use memparse Btrfs: add a "df" ioctl for btrfs Btrfs: cache the extent state everywhere we possibly can V2 Btrfs: cache ordered extent when completing io Btrfs: cache extent state in find_delalloc_range Btrfs: change the ordered tree to use a spinlock instead of a mutex Btrfs: finish read pages in the order they are submitted btrfs: fix btrfs_mkdir goto for no free objectids Btrfs: flush data on snapshot creation Btrfs: make df be a little bit more understandable ...
2010-03-15Btrfs: cache the extent state everywhere we possibly can V2Josef Bacik
This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached() and pass around a extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boots on my random write test. I have not converted some things, like the relocation and ioctl's, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed the clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like this lock_extent_bits() clear delalloc bits unlock_extent_cached() without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: btrfs_mark_extent_written uses the wrong slot
2010-02-12Btrfs: btrfs_mark_extent_written uses the wrong slotShaohua Li
My test do: fallocate a big file and do write. The file is 512M, but after file write is done btrfs-debug-tree shows: item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53 extent data disk byte 1103101952 nr 536870912 extent data offset 0 nr 399634432 ram 536870912 extent compression 0 Looks like a regression introducted by 6c7d54ac87f338c479d9729e8392eca3f76e11e1, where we set wrong slot. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: apply updated fallocate i_size fix Btrfs: do not try and lookup the file extent when finishing ordered io Btrfs: Fix oopsen when dropping empty tree. Btrfs: remove BUG_ON() due to mounting bad filesystem Btrfs: make error return negative in btrfs_sync_file() Btrfs: fix race between allocate and release extent buffer.
2010-02-04Btrfs: make error return negative in btrfs_sync_file()Roel Kluin
It appears the error return should be negative Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix possible panic on unmount Btrfs: deal with NULL acl sent to btrfs_set_acl Btrfs: fix regression in orphan cleanup Btrfs: Fix race in btrfs_mark_extent_written Btrfs, fix memory leaks in error paths Btrfs: align offsets for btrfs_ordered_update_i_size btrfs: fix missing last-entry in readdir(3)
2010-01-17Btrfs: Fix race in btrfs_mark_extent_writtenYan, Zheng
Fix bug reported by Johannes Hirte. The reason of that bug is btrfs_del_items is called after btrfs_duplicate_item and btrfs_del_items triggers tree balance. The fix is check that case and call btrfs_search_slot when needed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17Merge branch btrfs-master into for-linusChris Mason
Conflicts: fs/btrfs/acl.c
2009-12-17Btrfs: Fix btrfs_drop_extent_cache for skip pinned caseYan, Zheng
The check for skip pinned case is wrong, it may breaks the while loop too soon. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15Btrfs: Rewrite btrfs_drop_extentsYan, Zheng
Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can avoid calling lock_extent within transaction. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-10vfs: Implement proper O_SYNC semanticsChristoph Hellwig
While Linux provided an O_SYNC flag basically since day 1, it took until Linux 2.4.0-test12pre2 to actually get it implemented for filesystems, since that day we had generic_osync_around with only minor changes and the great "For now, when the user asks for O_SYNC, we'll actually give O_DSYNC" comment. This patch intends to actually give us real O_SYNC semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC patches which are required before this patch it's actually surprisingly simple, we just need to figure out when to set the datasync flag to vfs_fsync_range and when not. This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's numerical value to keep binary compatibility, and adds a new real O_SYNC flag. To guarantee backwards compatiblity it is defined as expanding to both the O_DSYNC and the new additional binary flag (__O_SYNC) to make sure we are backwards-compatible when compiled against the new headers. This also means that all places that don't care about the differences can just check O_DSYNC and get the right behaviour for O_SYNC, too - only places that actuall care need to check __O_SYNC in addition. Drivers and network filesystems have been updated in a fail safe way to always do the full sync magic if O_DSYNC is set. The few places setting O_SYNC for lower layers are kept that way for now to stay failsafe. We enforce that O_DSYNC is set when __O_SYNC is set early in the open path to make sure we always get these sane options. Note that parisc really screwed up their headers as they already define a O_DSYNC that has always been a no-op. We try to repair it by using it for the new O_DSYNC and redefinining O_SYNC to send both the traditional O_SYNC numerical value _and_ the O_DSYNC one. Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger@sun.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-10-15Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: always pin metadata in discard mode Btrfs: enable discard support Btrfs: add -o discard option Btrfs: properly wait log writers during log sync Btrfs: fix possible ENOSPC problems with truncate Btrfs: fix btrfs acl #ifdef checks Btrfs: streamline tree-log btree block writeout Btrfs: avoid tree log commit when there are no changes Btrfs: only write one super copy during fsync
2009-10-13Btrfs: avoid tree log commit when there are no changesChris Mason
rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix file clone ioctl for bookend extents Btrfs: fix uninit compiler warning in cow_file_range_nocow Btrfs: constify dentry_operations Btrfs: optimize back reference update during btrfs_drop_snapshot Btrfs: remove negative dentry when deleting subvolumne Btrfs: optimize fsync for the single writer case Btrfs: async delalloc flushing under space pressure Btrfs: release delalloc reservations on extent item insertion Btrfs: delay clearing EXTENT_DELALLOC for compressed extents Btrfs: cleanup extent_clear_unlock_delalloc flags Btrfs: fix possible softlockup in the allocator Btrfs: fix deadlock on async thread startup
2009-10-08Btrfs: release delalloc reservations on extent item insertionJosef Bacik
This patch fixes an issue with the delalloc metadata space reservation code. The problem is we used to free the reservation as soon as we allocated the delalloc region. The problem with this is if we are not inserting an inline extent, we don't actually insert the extent item until after the ordered extent is written out. This patch does 3 things, 1) It moves the reservation clearing stuff into the ordered code, so when we remove the ordered extent we remove the reservation. 2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear delalloc bits in the cases where we want to clear the metadata reservation when we clear the delalloc extent, in the case that we do an inline extent or we invalidate the page. 3) It adds another waitqueue to the space info so that when we start a fs wide delalloc flush, anybody else who also hits that area will simply wait for the flush to finish and then try to make their allocation. This has been tested thoroughly to make sure we did not regress on performance. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix data space leak fix Btrfs: remove duplicates of filemap_ helpers Btrfs: take i_mutex before generic_write_checks Btrfs: fix arguments to btrfs_wait_on_page_writeback_range Btrfs: fix deadlock with free space handling and user transactions Btrfs: fix error cases for ioctl transactions Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code Btrfs: introduce missing kfree Btrfs: Fix setting umask when POSIX ACLs are not enabled Btrfs: proper -ENOSPC handling
2009-10-01const: constify remaining file_operationsAlexey Dobriyan
[akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01Btrfs: remove duplicates of filemap_ helpersChristoph Hellwig
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of local copies of the functions. For filemap_fdatawait_range that also means replacing the awkward old wait_on_page_writeback_range calling convention with the regular filemap byte offsets. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01Merge branch 'master' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus
2009-10-01Btrfs: take i_mutex before generic_write_checksChris Mason
btrfs_file_write was incorrectly calling generic_write_checks without taking i_mutex. This lead to problems with racing around i_size when doing O_APPEND writes. The fix here is to move i_mutex higher. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28Btrfs: proper -ENOSPC handlingJosef Bacik
At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-27const: mark struct vm_struct_operationsAlexey Dobriyan
* mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-11Merge branch 'master' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
2009-09-11Btrfs: Fix extent replacment raceChris Mason
Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: reduce CPU usage in the extent_state treeChris Mason
Btrfs is currently mirroring some of the page state bits into its extent state tree. The goal behind this was to use it in supporting blocksizes other than the page size. But, we don't currently support that, and we're using quite a lot of CPU on the rb tree and its spin lock. This commit starts a series of cleanups to reduce the amount of work done in the extent state tree as part of each IO. This commit: * Adds the ability to lock an extent in the state tree and also set other bits. The idea is to do locking and delalloc in one call * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using a combination of the page bits and the ordered write code for this instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: switch extent_map to a rw lockChris Mason
There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: optimize set extent bitChris Mason
The Btrfs set_extent_bit call currently searches the rbtree every time it needs to find more extent_state objects to fill the requested operation. This adds a simple test with rb_next to see if the next object in the tree was adjacent to the one we just found. If so, we skip the search and just use the next object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-12headers: smp_lock.h reduxAlexey Dobriyan
* Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-02Btrfs: don't log the inode in file_write while growing the fileChris Mason
2009-06-10Btrfs: fdatasync should skip metadata writeoutHisashi Hifumi
In btrfs, fdatasync and fsync are identical, but fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 1980.6540s total number of events: 10001 total time taken by event execution: 1192.9804 per-request statistics: min: 0.0000s avg: 0.1193s max: 15.3720s approx. 95 percentile: 0.7257s Threads fairness: events (avg/stddev): 625.0625/151.32 execution time (avg/stddev): 74.5613/9.46 -2.6.30-rc8-patched Test execution summary: total time: 1695.9118s total number of events: 10000 total time taken by event execution: 871.3214 per-request statistics: min: 0.0000s avg: 0.0871s max: 10.4644s approx. 95 percentile: 0.4787s Threads fairness: events (avg/stddev): 625.0000/131.86 execution time (avg/stddev): 54.4576/8.98 Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)Yan Zheng
This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27Btrfs: remove #if 0 codeChris Mason
Btrfs had some old code sitting around under #if 0, this drops it. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24Btrfs: fix fallocate deadlock on inode extent lockChris Mason
The btrfs fallocate call takes an extent lock on the entire range being fallocated, and then runs through insert_reserved_extent on each extent as they are allocated. The problem with this is that btrfs_drop_extents may decide to try and take the same extent lock fallocate was already holding. The solution used here is to push down knowledge of the range that is already locked going into btrfs_drop_extents. It turns out that at least one other caller had the same bug. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-21Btrfs: fix btrfs fallocate oops and deadlockChris Mason
Btrfs fallocate was incorrectly starting a transaction with a lock held on the extent_io tree for the file, which could deadlock. Strictly speaking it was using join_transaction which would be safe, but it is better to move the transaction outside of the lock. When preallocated extents are overwritten, btrfs_mark_buffer_dirty was being called on an unlocked buffer. This was triggering an assertion and oops because the lock is supposed to be held. The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had been run. btrfs_del_item takes care of dirtying things, so the solution is a to skip the btrfs_mark_buffer_dirty call in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20Btrfs: add a priority queue to the async thread helpersChris Mason
Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a higher priority. But, the checksumming helper threads prevent it from being fully effective. There are two problems. First, a big queue of pending checksumming will delay the synchronous IO behind other lower priority writes. Second, the checksumming uses an ordered async work queue. The ordering makes sure that IOs are sent to the block layer in the same order they are sent to the checksumming threads. Usually this gives us less seeky IO. But, when we start mixing IO priorities, the lower priority IO can delay the higher priority IO. This patch solves both problems by adding a high priority list to the async helper threads, and a new btrfs_set_work_high_prio(), which is used to make put a new async work item onto the higher priority list. The ordering is still done on high priority IO, but all of the high priority bios are ordered separately from the low priority bios. This ordering is purely an IO optimization, it is not involved in data or metadata integrity. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-31Btrfs: add extra flushing for renames and truncatesChris Mason
Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24Btrfs: tree logging unlink/rename fixesChris Mason
The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24Btrfs: leave btree locks spinning more oftenChris Mason
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24Btrfs: do extent allocation and reference count updates in the backgroundChris Mason
The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-20Btrfs: add better -ENOSPC handlingJosef Bacik
This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20Btrfs: check file pointer in btrfs_sync_fileChris Mason
fsync can be called by NFS with a null file pointer, and btrfs was oopsing in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21Btrfs: fix tree logs parallel syncYan Zheng
To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch has following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert root item into the log root tree immediately after log tree is allocated. Root item for log tree is inserted when log tree get synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first sync the log tree; then updates corresponding root item in the log root tree; sync the log root tree; then update the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21Btrfs: removed unused #include <version.h>'sHuang Weiyi
Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-06Btrfs: don't change file extent's ram_bytes in btrfs_drop_extentsYan Zheng
btrfs_drop_extents doesn't change file extent's ram_bytes in the case of booked extent. To be consistent, we should also not change ram_bytes when truncating existing extent. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-05Btrfs: Fix checkpatch.pl warningsChris Mason
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05Btrfs: Fix memset length in btrfs_file_writeyanhai zhu
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-12Btrfs: fix nodatasum handling in balancing codeYan Zheng
Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-08Btrfs: Fix compressed checksum fsync log copiesChris Mason
The fsync logging code makes sure to onl copy the relevant checksum for each extent based on the file extent pointers it finds. But for compressed extents, it needs to copy the checksum for the entire extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>