aboutsummaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2009-06-17ext4: avoid unnecessary spinlock in critical POSIX ACL pathTheodore Ts'o
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem to do POSIX ACL checks on any files not owned by the caller, and it does this for every single pathname component that it looks up. That obviously can be pretty expensive if the filesystem isn't careful about it, especially with locking. That's doubly sad, since the common case tends to be that there are no ACL's associated with the files in question. ext4 already caches the ACL data so that it doesn't have to look it up over and over again, but it does so by taking the inode->i_lock spinlock on every lookup. Which is a noticeable overhead even if it's a private lock, especially on CPU's where the serialization is expensive (eg Intel Netburst aka 'P4'). For the special case of not actually having any ACL's, all that locking is unnecessary. Even if somebody else were to be changing the ACL's on another CPU, we simply don't care - if we've seen a NULL ACL, we might as well use it. So just load the ACL speculatively without any locking, and if it was NULL, just use it. If it's non-NULL (either because we had a cached entry, or because the cache hasn't been filled in at all), it means that we'll need to get the lock and re-load it properly. (This commit was ported from a patch originally authored by Linus for ext3.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11Push BKL down into ->remount_fs()Alessio Igor Bogani
[xfs, btrfs, capifs, shmem don't need BKL, exempt] Signed-off-by: Alessio Igor Bogani <abogani@texware.it> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11->write_super lock_super pushdownChristoph Hellwig
Push down lock_super into ->write_super instances and remove it from the caller. Following filesystem don't need ->s_lock in ->write_super and are skipped: * bfs, nilfs2 - no other uses of s_lock and have internal locks in ->write_super * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in ->write_super * xfs - no other uses of s_lock and uses internal lock (buffer lock on superblock buffer) to serialize ->write_super. Also xfs_fs_write_super is superflous and will go away in the next merge window Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11Push lock_super() into the ->remount_fs() of filesystems that care about itAl Viro
Note that since we can't run into contention between remount_fs and write_super (due to exclusion on s_umount), we have to care only about filesystems that touch lock_super() on their own. Out of those ext3, ext4, hpfs, sysv and ufs do need it; fat doesn't since its ->remount_fs() only accesses assign-once data (basically, it's "we have no atime on directories and only have atime on files for vfat; force nodiratime and possibly noatime into *flags"). [folded a build fix from hch] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11push BKL down into ->put_superChristoph Hellwig
Move BKL into ->put_super from the only caller. A couple of filesystems had trivial enough ->put_super (only kfree and NULLing of s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs, hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most of them probably don't need it, but I'd rather sort that out individually. Preferably after all the other BKL pushdowns in that area. [AV: original used to move lock_super() down as well; these changes are removed since we don't do lock_super() at all in generic_shutdown_super() now] [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11No need to do lock_super() for exclusion in generic_shutdown_super()Al Viro
We can't run into contention on it. All other callers of lock_super() either hold s_umount (and we have it exclusive) or hold an active reference to superblock in question, which prevents the call of generic_shutdown_super() while the reference is held. So we can replace lock_super(s) with get_fs_excl() in generic_shutdown_super() (and corresponding change for unlock_super(), of course). Since ext4 expects s_lock held for its put_super, take lock_super() into it. The rest of filesystems do not care at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11remove ->write_super call in generic_shutdown_superChristoph Hellwig
We just did a full fs writeout using sync_filesystem before, and if that's not enough for the filesystem it can perform it's own writeout in ->put_super, which many filesystems already do. Move a call to foofs_write_super into every foofs_put_super for now to guarantee identical behaviour until it's cleaned up by the individual filesystem maintainers. Exceptions: - affs already has identical copy & pasted code at the beginning of affs_put_super so no need to do it twice. - xfs does the right thing without it and I have changes pending for the xfs tree touching this are so I don't really need conflicts here.. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits) block: add request clone interface (v2) floppy: fix hibernation ramdisk: remove long-deprecated "ramdisk=" boot-time parameter fs/bio.c: add missing __user annotation block: prevent possible io_context->refcount overflow Add serial number support for virtio_blk, V4a block: Add missing bounce_pfn stacking and fix comments Revert "block: Fix bounce limit setting in DM" cciss: decode unit attention in SCSI error handling code cciss: Remove no longer needed sendcmd reject processing code cciss: change SCSI error handling routines to work with interrupts enabled. cciss: separate error processing and command retrying code in sendcmd_withirq_core() cciss: factor out fix target status processing code from sendcmd functions cciss: simplify interface of sendcmd() and sendcmd_withirq() cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code cciss: Use schedule_timeout_uninterruptible in SCSI error handling code block: needs to set the residual length of a bidi request Revert "block: implement blkdev_readpages" block: Fix bounce limit setting in DM Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt ... Manually fix conflicts with tracing updates in: block/blk-sysfs.c drivers/ide/ide-atapi.c drivers/ide/ide-cd.c drivers/ide/ide-floppy.c drivers/ide/ide-tape.c include/trace/events/block.h kernel/trace/blktrace.c
2009-06-10ext4: Avoid corrupting the uninitialized bit in the extent during truncateAneesh Kumar K.V
The unitialized bit was not properly getting preserved in in an extent which is partially truncated because the it was geting set to the value of the first extent to be removed or truncated as part of the truncate operation, and if there are multiple extents are getting removed or modified as part of the truncate operation, it is only the last extent which will might be partially truncated, and its uninitalized bit is not necessarily the same as the first extent to be truncated. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09ext4: Don't treat a truncation of a zero-length file as replace-via-truncateTheodore Ts'o
If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's no need to treat this as a true file truncation, so we shouldn't activate the replace-via-truncate hueristic. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-08ext4: fix dx_map_entry to support 256k directory blocksToshiyuki Okajima
The dx_map_entry structure doesn't support over 64KB block size by current usage of its member("offs"). Because "offs" treats an offset of copies of the ext4_dir_entry_2 structure as is. This member size is 16 bits. But real offset for over 64KB(256KB) block size needs 18 bits. However, real offset keeps 4 byte boundary, so lower 2 bits is not used. Therefore, we do the following to fix this limitation: For "store": we divide the real offset by 4 and then store this result to "offs" member. For "use": we multiply "offs" member by 4 and then use this result as real offset. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05ext4: truncate the file properly if we fail to copy data from userspaceAneesh Kumar K.V
In generic_perform_write if we fail to copy the user data we don't update the inode->i_size. We should truncate the file in the above case so that we don't have blocks allocated outside inode->i_size. Add the inode to orphan list in the same transaction as block allocation This ensures that if we crash in between the recovery would do the truncate. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05ext4: Avoid leaking blocks after a block allocation failureAneesh Kumar K.V
We should add inode to the orphan list in the same transaction as block allocation. This ensures that if we crash after a failed block allocation and before we do a vmtruncate we don't leak block (ie block marked as used in bitmap but not claimed by the inode). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04ext4: Change all super.c messages to print the deviceEric Sandeen
This patch changes ext4 super.c to include the device name with all warning/error messages, by using a new utility function ext4_msg. It's a rather large patch, but very mechanic. I left debug printks alone. This is a straightforward port of a patch which Andi Kleen did for ext3. Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()Jan Kara
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This seems to be a relict from some old days and setting disksize in this function does not make much sense. Currently it was set only by ext4_getblk(). Since the parameter has some effect only if create == 1, it is easy to check by grepping through the sources that the three callers which end up calling ext4_getblk() with create == 1 (ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set disksize themselves. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03ext4: super.c whitespace cleanupAndreas Dilger
Cleanup of whitespace and formatting. Initially driven by confusing indents for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd cleanup some other 80-column wrapping and other indenting problems at the same time. Signed-off-by: Andreas Dilger <adilger@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25ext4: Clean up calls to ext4_get_group_desc()Theodore Ts'o
If the caller isn't planning on modifying the block group descriptors, there's no need to pass in a pointer to a struct buffer_head. Nuking this saves a tiny amount of CPU time and stack space usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25ext4: remove unused function __ext4_write_dirty_metadataTheodore Ts'o
The __ext4_write_dirty_metadata() function was introduced by commit 0390131b, "ext4: Allow ext4 to run without a journal", but nothing ever used the function, either then or since. So let's remove it and save a bit of space. Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-22block: Do away with the notion of hardsect_sizeMartin K. Petersen
Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-17ext4: Fix memory leak in ext4_fill_super() in case of a failed mountManish Katiyar
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17ext4: down i_data_sem only for read when walking tree for fiemapTheodore Ts'o
Not sure why I put this in as down_write originally; all we are doing is walking the tree, nothing will change under us and concurrent reads should be no problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17ext4: Add a comprehensive block validity check to ext4_get_blocks()Theodore Ts'o
To catch filesystem bugs or corruption which could lead to the filesystem getting severly damaged, this patch adds a facility for tracking all of the filesystem metadata blocks by contiguous regions in a red-black tree. This allows quick searching of the tree to locate extents which might overlap with filesystem metadata blocks. This facility is also used by the multi-block allocator to assure that it is not allocating blocks out of the system zone, as well as by the routines used when reading indirect blocks and extents information from disk to make sure their contents are valid. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-15ext4: Fix race in ext4_inode_info.i_cached_extentTheodore Ts'o
If two CPU's simultaneously call ext4_ext_get_blocks() at the same time, there is nothing protecting the i_cached_extent structure from being used and updated at the same time. This could potentially cause the wrong location on disk to be read or written to, including potentially causing the corruption of the block group descriptors and/or inode table. This bug has been in the ext4 code since almost the very beginning of ext4's development. Fortunately once the data is stored in the page cache cache, ext4_get_blocks() doesn't need to be called, so trying to replicate this problem to the point where we could identify its root cause was *extremely* difficult. Many thanks to Kevin Shanahan for working over several months to be able to reproduce this easily so we could finally nail down the cause of the corruption. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-05-14ext4: Clear the unwritten buffer_head flag after the extent is initializedAneesh Kumar K.V
The BH_Unwritten flag indicates that the buffer is allocated on disk but has not been written; that is, the disk was part of a persistent preallocation area. That flag should only be set when a get_blocks() function is looking up a inode's logical to physical block mapping. When ext4_get_blocks_wrap() is called with create=1, the uninitialized extent is converted into an initialized one, so the BH_Unwritten flag is no longer appropriate. Hence, we need to make sure the BH_Unwritten is not left set, since the combination of BH_Mapped and BH_Unwritten is not allowed; among other things, it will result ext4's get_block() to be called over and over again during the write_begin phase of write(2). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_stateTheodore Ts'o
The ext4_get_blocks() function was depending on the value of bh_result->b_state as an input parameter to decide whether or not update the delalloc accounting statistics by calling ext4_da_update_reserve_space(). We now use a separate flag, EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to requests this update, so that all callers of ext4_get_blocks() can clear map_bh.b_state before calling ext4_get_blocks() without worrying about any consistency issues. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()Theodore Ts'o
The static function ext4_da_get_block_write() was only used by mpage_da_map_blocks(). So to simplify the code, merge that function into mpage_da_map_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12ext4: Use a fake block number for delayed new buffer_headAneesh Kumar K.V
Use a very large unsigned number (~0xffff) as as the fake block number for the delayed new buffer. The VFS should never try to write out this number, but if it does, this will make it obvious. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13ext4: Fix sub-block zeroing for writes into preallocated extentsAneesh Kumar K.V
We need to mark the buffer_head mapping preallocated space as new during write_begin. Otherwise we don't zero out the page cache content properly for a partial write. This will cause file corruption with preallocation. Now that we mark the buffer_head new we also need to have a valid buffer_head blocknr so that unmap_underlying_metadata() unmaps the correct block. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12ext4: Add BUG_ON debugging checks to noalloc_get_block_write()Theodore Ts'o
Enforce that noalloc_get_block_write() is only called to map one block at a time, and that it always is successful in finding a mapping for given an inode's logical block block number if it is called with create == 1. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14ext4: Add documentation to the ext4_*get_block* functionsTheodore Ts'o
This adds more documentation to various internal functions in fs/ext4/inode.c, most notably ext4_ind_get_blocks(), ext4_da_get_block_write(), ext4_da_get_block_prep(), ext4_normal_get_block_write(). In addition, the static function ext4_normal_get_block_write() has been renamed noalloc_get_block_write(), since it is used in many places far beyond ext4_normal_writepage(). Plenty of warnings have been added to the noalloc_get_block_write() function, since the way it is used is amazingly fragile. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14ext4: Define a new set of flags for ext4_get_blocks()Theodore Ts'o
The functions ext4_get_blocks(), ext4_ext_get_blocks(), and ext4_ind_get_blocks() used an ad-hoc set of integer variables used as boolean flags passed in as arguments. Use a single flags parameter and a setandard set of bitfield flags instead. This saves space on the call stack, and it also makes the code a bit more understandable. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()Theodore Ts'o
Another function rename for clarity's sake. The _wrap prefix simply confuses people, and didn't add much people trying to follow the code paths. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks()Theodore Ts'o
The static function ext4_get_blocks_handle() is badly named. Of *course* it takes a handle. Since its counterpart for extent-based file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12ext4: Simplify function signature for ext4_da_get_block_write() Theodore Ts'o
The function ext4_da_get_block_write() is called in exactly one write, and the last argument, create, is always 1. Remove it to simplify the code slightly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-15ext4: Fix spinlock assertions on UP systemsVincent Minet
On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails which triggers a BUG_ON() call. This patch fixes it by using assert_spin_locked instead. Signed-off-by: Vincent Minet <vincent@vincent-minet.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02ext4: Convert ext4_lock_group to use sb_bgl_lockAneesh Kumar K.V
We have sb_bgl_lock() and ext4_group_info.bb_state bit spinlock to protech group information. The later is only used within mballoc code. Consolidate them to use sb_bgl_lock(). This makes the mballoc.c code much simpler and also avoid confusion with two locks protecting same info. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02ext4: fix the length returned by fiemap for an unallocated extentTheodore Ts'o
If the file's blocks have not yet been allocated because of delayed allocation, the length of the extent returned by fiemap is incorrect. This commit fixes this bug. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: fix for fiemap last-block testEric Sandeen
Carl Henrik Lunde reported and debugged this; the test for the last allocated block was comparing bytes to blocks in this test: if (logical + length - 1 == EXT_MAX_BLOCK || ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK) flags |= FIEMAP_EXTENT_LAST; so any extent which ended right at 4G was stopping the extent walk. Just replacing these values with the extent block & length should fix it. Also give blksize_bits a saner type, and reverse the order of the tests to make the more likely case tested first. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no> Tested-by: Carl Henrik Lunde <chlunde@ping.uio.no> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02ext4: hook fiemap operation for directoriesAneesh Kumar K.V
Add fiemap callback for directories Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Make the length of the mb_history file tunableCurt Wohlgemuth
In memory-constrained systems with many partitions, the ~68K for each partition for the mb_history buffer can be excessive. This patch adds a new mount option, mb_history_length, as well as a way of setting the default via a module parameter (or via a sysfs parameter in /sys/module/ext4/parameter/default_mb_history_length). If the mb_history_length is set to zero, the mb_history facility is disabled entirely. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Move fs/ext4/group.h into ext4.hTheodore Ts'o
Move the function prototypes in group.h into ext4.h so they are all defined in one place. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Move fs/ext4/namei.h into ext4.hTheodore Ts'o
The fs/ext4/namei.h header file had only a single function declaration, and should have never been a standalone file. Move it into ext4.h, where should have been from the beginning. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-03ext4: Move the ext4_sb.h header file into ext4.hTheodore Ts'o
There is no longer a reason for a separate ext4_sb.h header file, so move it into ext4.h just to make life easier for developers to find the relevant data structures and typedefs. Should also speed up compiles slightly, too. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Move the ext4_i.h header file into ext4.hTheodore Ts'o
There is no longer a reason for a separate ext4_i.h header file, so move it into ext4.h just to make life easier for developers to find the relevant data structures and typedefs. Should also speed up compiles slightly, too. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Don't avoid using BLOCK_UNINIT block groups in mballocTheodore Ts'o
By avoiding the use of not-yet-used block groups (i.e., block groups with the BLOCK_UNINIT flag), mballoc had a tendency to create large files with large non-contiguous gaps. In addition avoiding the use of new block groups had a tendency to push regular file data into the first block group in a flex_bg group, which slows down the speed of e2fsck pass 2, since it has a tendency to seek much more. For example: Before Patch After Patch Time in seconds Time in seconds Real / User/ Sys MB/s Real / User/ Sys MB/s Pass 1 8.52 / 2.21 / 0.46 20.43 8.84 / 4.97 / 1.11 19.68 Pass 2 21.16 / 1.02 / 1.86 11.30 6.54 / 1.77 / 1.78 36.39 Pass 3 0.01 / 0.00 / 0.00 139.00 0.01 / 0.01 / 0.00 128.90 Pass 4 0.16 / 0.15 / 0.00 0.00 0.17 / 0.17 / 0.00 0.00 Pass 5 2.52 / 1.99 / 0.09 0.79 2.31 / 1.78 / 0.06 0.86 Total 32.40 / 5.11 / 2.49 12.81 17.99 / 8.75 / 2.98 23.01 This was on a sample 80 gig root filesystem which was approximately 50% full. Note the improved e2fsck pass 2 performance, by over a factor of 3, due to a decreased number of seeks. (The total amount of I/O in pass 2 was unchanged; the layout of the directory blocks was simply much better from e2fsck's's perspective.) Other changes as a result of this patch on this sample filesystem: Before Patch After Patch # of non-contig files 762 779 # of non-contig directories 571 570 # of BLOCK_UNINIT bg's 307 293 # of INODE_UNINIT bg's 503 503 Out of 640 block groups, of which 333 were in use, this patch caused an extra 14 block groups to be utilized. The number of non-contiguous files did go up slightly, but when measured against the 99.9% of the files (603,154) which were contiguously allocated, this is pretty insignificant. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andreas Dilger <adilger@sun.com>
2009-04-25ext4: Replace lock/unlock_super() with an explicit lock for resizingTheodore Ts'o
Use a separate lock to protect s_groups_count and the other block group descriptors which get changed via an on-line resize operation, so we can stop overloading the use of lock_super(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25ext4: Replace lock/unlock_super() with an explicit lock for the orphan listTheodore Ts'o
Use a separate lock to protect the orphan list, so we can stop overloading the use of lock_super(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: ext4_mark_recovery_complete() doesn't need to use lock_superTheodore Ts'o
The function ext4_mark_recovery_complete() is called from two call paths: either (a) while mounting the filesystem, in which case there's no danger of any other CPU calling write_super() until the mount is completed, and (b) while remounting the filesystem read-write, in which case the fs core has already locked the superblock. This also allows us to take out a very vile unlock_super()/lock_super() pair in ext4_remount(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25ext4: Remove outdated comment about lock_super()Theodore Ts'o
ext4_fill_super() is no longer called by read_super(), and it is no longer called with the superblock locked. The unlock_super()/lock_super() is no longer present, so this comment is entirely superfluous. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01ext4: Avoid races caused by on-line resizing and SMP memory reorderingTheodore Ts'o
Ext4's on-line resizing adds a new block group and then, only at the last step adjusts s_groups_count. However, it's possible on SMP systems that another CPU could see the updated the s_group_count and not see the newly initialized data structures for the just-added block group. For this reason, it's important to insert a SMP read barrier after reading s_groups_count and before reading any (for example) the new block group descriptors allowed by the increased value of s_groups_count. Unfortunately, we rather blatently violate this locking protocol documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes happen relatively rarely, and (2) it seems rare that the filesystem code will immediately try to use just-added block group before any memory ordering issues resolve themselves. So apparently problems here are relatively hard to hit, since ext3 has been vulnerable to the same issue for years with no one apparently complaining. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>