aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2009-04-24Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix potential inode allocation soft lockup in Orlov allocator ext4: Make the extent validity check more paranoid jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records ext4: really print the find_group_flex fallback warning only once
2009-04-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Larger buffer for encrypted symlink targets eCryptfs: Lock lower directory inode mutex during lookup eCryptfs: Remove ecryptfs_unlink_sigs warnings eCryptfs: Fix data corruption when using ecryptfs_passthrough eCryptfs: Print FNEK sig properly in /proc/mounts eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev() eCryptfs: Copy lower inode attrs before dentry instantiation
2009-04-24Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: [S390] update default configuration. [S390] omit frame pointers on s390 when possible [S390] Use tape_generic_offline directly. [S390] /proc/stat idle field for idle cpus [S390] appldata: avoid deadlock with appldata_mem [S390] ipl: fix compile breakage
2009-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Ensure that the inode goal block settings are updated GFS2: Fix bug in block allocation bitops: Add __ffs64 bitop
2009-04-24Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: cache prio_tree root in cfqq->p_root cfq-iosched: fix bug with aliased request and cooperation detection cfq-iosched: clear ->prio_trees[] on cfqd alloc block: fix intermittent dm timeout based oops umem: fix request_queue lock warning block: simplify I/O stat accounting pktcdvd.h should include mempool.h cfq-iosched: use the default seek distance when there aren't enough seek samples cfq-iosched: make seek_mean converge more quickly block: make blk_abort_queue() ignore non-request based devices block: include empty disks in /proc/diskstats bio: use bio_kmalloc() in copy/map functions bio: fix bio_kmalloc() block: fix queue bounce limit setting block: fix SG_IO vector request data length handling scatterlist: make sure sg_miter_next() doesn't return 0 sized mappings
2009-04-24check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/Oleg Nesterov
write_lock(&current->fs->lock) guarantees we can't wrongly miss LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock() instead of ->siglock to iterate over the sub-threads. We must see all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it takes fs->lock too. With or without this patch we can miss the freshly cloned thread and set LSM_UNSAFE_SHARE, we don't care. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> [ Fixed lock/unlock typo - Hugh ] Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24do_execve() must not clear fs->in_exec if it was set by another threadOleg Nesterov
If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec unconditionally. This is wrong if we race with our sub-thread which also does do_execve: Two threads T1 and T2 and another process P, all share the same ->fs. T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since ->fs is shared, we set LSM_UNSAFE but not ->in_exec. P exits and decrements fs->users. T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not shared, we set fs->in_exec. T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and return to the user-space. T1 does clone(CLONE_FS /* without CLONE_THREAD */). T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with another process. Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change do_execve() to clear ->in_exec depending on res. When do_execve() suceeds, it is safe to clear ->in_exec unconditionally. It can be set only if we don't share ->fs with another process, and since we already killed all sub-threads either ->in_exec == 0 or we are the only user of this ->fs. Also, we do not need fs->lock to clear fs->in_exec. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-23[S390] /proc/stat idle field for idle cpusMartin Schwidefsky
The cpu idle field in the output of /proc/stat is too small for cpus that have been idle for more than a tick. Add the architecture hook arch_idle_time that allows to add the not accounted idle time of a sleeping cpu without waking the cpu. The s390 implementation of arch_idle_time uses the already existing s390_idle_data per_cpu variable to find the sleep time of a neighboring idle cpu. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-04-23GFS2: Ensure that the inode goal block settings are updatedSteven Whitehouse
GFS2 has a goal block associated with each inode indicating the search start position for future block allocations (in fact there are two, but thats for backward compatibility with GFS1 as they are set to identical locations in GFS2). In some circumstances, depending on the ordering of updates to the inode it was possible for the goal block settings to not be updated on disk. This patch ensures that the goal block will always get updated, thus reducing the potential for searching the same (already allocated) blocks again when looking for free space during block allocation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-23GFS2: Fix bug in block allocationSteven Whitehouse
The new bitfit algorithm was counting from the wrong end of 64 bit words in the bitfield. This fixes it by using __ffs64 instead of fls64 Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-22ext4: Fix potential inode allocation soft lockup in Orlov allocatorTheodore Ts'o
If the Orlov allocator is having trouble finding an appropriate block group, the fallback code could loop forever, causing a soft lockup warning in find_group_orlov(): BUG: soft lockup - CPU#0 stuck for 61s! [cp:11728] ... Pid: 11728, comm: cp Not tainted (2.6.30-rc1-dirty #77) Lenovo EIP: 0060:[<c021650e>] EFLAGS: 00000246 CPU: 0 EIP is at ext4_get_group_desc+0x54/0x9d ... Call Trace: [<c0218021>] find_group_orlov+0x2ee/0x334 [<c0120a5f>] ? sched_clock+0x8/0xb [<c02188e3>] ext4_new_inode+0x2cf/0xb1a Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-22ext4: Make the extent validity check more paranoidTheodore Ts'o
Instead of just checking that the extent block number is greater or equal than s_first_data_block, make sure it it is not pointing into the block group descriptors, since that is clearly wrong. This helps prevent filesystem from getting very badly corrupted in case an extent block is corrupted. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-22eCryptfs: Larger buffer for encrypted symlink targetsTyler Hicks
When using filename encryption with eCryptfs, the value of the symlink in the lower filesystem is encrypted and stored as a Tag 70 packet. This results in a longer symlink target than if the target value wasn't encrypted. Users were reporting these messages in their syslog: [ 45.653441] ecryptfs_parse_tag_70_packet: max_packet_size is [56]; real packet size is [51] [ 45.653444] ecryptfs_decode_and_decrypt_filename: Could not parse tag 70 packet from filename; copying through filename as-is This was due to bufsiz, one the arguments in readlink(), being used to when allocating the buffer passed to the lower inode's readlink(). That symlink target may be very large, but when decoded and decrypted, could end up being smaller than bufsize. To fix this, the buffer passed to the lower inode's readlink() will always be PATH_MAX in size when filename encryption is enabled. Any necessary truncation occurs after the decoding and decrypting. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: Lock lower directory inode mutex during lookupTyler Hicks
This patch locks the lower directory inode's i_mutex before calling lookup_one_len() to find the appropriate dentry in the lower filesystem. This bug was found thanks to the warning set in commit 2f9092e1. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: Remove ecryptfs_unlink_sigs warningsTyler Hicks
A feature was added to the eCryptfs umount helper to automatically unlink the keys used for an eCryptfs mount from the kernel keyring upon umount. This patch keeps the unrecognized mount option warnings for ecryptfs_unlink_sigs out of the logs. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: Fix data corruption when using ecryptfs_passthroughTyler Hicks
ecryptfs_passthrough is a mount option that allows eCryptfs to allow data to be written to non-eCryptfs files in the lower filesystem. The passthrough option was causing data corruption due to it not always being treated as a non-eCryptfs file. The first 8 bytes of an eCryptfs file contains the decrypted file size. This value was being written to the non-eCryptfs files, too. Also, extra 0x00 characters were being written to make the file size a multiple of PAGE_CACHE_SIZE. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: Print FNEK sig properly in /proc/mountsTyler Hicks
The filename encryption key signature is not properly displayed in /proc/mounts. The "ecryptfs_sig=" mount option name is displayed for all global authentication tokens, included those for filename keys. This patch checks the global authentication token flags to determine if the key is a FEKEK or FNEK and prints the appropriate mount option name before the signature. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev()Tyler Hicks
If data is NULL, msg_ctx->msg is set to NULL and then dereferenced afterwards. ecryptfs_send_raw_message() is the only place that ecryptfs_send_miscdev() is called with data being NULL, but the only caller of that function (ecryptfs_process_helo()) is never called. In short, there is currently no way to trigger the NULL pointer dereference. This patch removes the two unused functions and modifies ecryptfs_send_miscdev() to remove the NULL dereferences. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22eCryptfs: Copy lower inode attrs before dentry instantiationTyler Hicks
Copies the lower inode attributes to the upper inode before passing the upper inode to d_instantiate(). This is important for security_d_instantiate(). The problem was discovered by a user seeing SELinux denials like so: type=AVC msg=audit(1236812817.898:47): avc: denied { 0x100000 } for pid=3584 comm="httpd" name="testdir" dev=ecryptfs ino=943872 scontext=root:system_r:httpd_t:s0 tcontext=root:object_r:httpd_sys_content_t:s0 tclass=file Notice target class is file while testdir is really a directory, confusing the permission translation (0x100000) due to the wrong i_mode. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22bio: use bio_kmalloc() in copy/map functionsTejun Heo
Impact: remove possible deadlock condition There is no reason to use mempool backed allocation for map functions. Also, because kern mapping is used inside LLDs (e.g. for EH), using mempool backed allocation can lead to deadlock under extreme conditions (mempool already consumed by the time a request reached EH and requests are blocked on EH). Switch copy/map functions to bio_kmalloc(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22bio: fix bio_kmalloc()Tejun Heo
Impact: fix bio_kmalloc() and its destruction path bio_kmalloc() was broken in two ways. * bvec_alloc_bs() first allocates bvec using kmalloc() and then ignores it and allocates again like non-kmalloc bvecs. * bio_kmalloc_destructor() didn't check for and free bio integrity data. This patch fixes the above problems. kmalloc patch is separated out from bio_alloc_bioset() and allocates the requested number of bvecs as inline bvecs. * bio_alloc_bioset() no longer takes NULL @bs. None other than bio_kmalloc() used it and outside users can't know how it was allocated anyway. * Define and use BIO_POOL_NONE so that pool index check in bvec_free_bs() triggers if inline or kmalloc allocated bvec gets there. * Relocate destructors on top of each allocation function so that how they're used is more clear. Jens Axboe suggested allocating bvecs inline. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix btrfs fallocate oops and deadlock Btrfs: use the right node in reada_for_balance Btrfs: fix oops on page->mapping->host during writepage Btrfs: add a priority queue to the async thread helpers Btrfs: use WRITE_SYNC for synchronous writes
2009-04-21hugetlbfs: return negative error code for bad mount optionAkinobu Mita
This fixes the following BUG: # mount -o size=MM -t hugetlbfs none /huge hugetlbfs: Bad value 'MM' for mount option 'size=MM' ------------[ cut here ]------------ kernel BUG at fs/super.c:996! Due to BUG_ON(!mnt->mnt_sb); in vfs_kern_mount(). Also, remove unused #include <linux/quotaops.h> Cc: William Irwin <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-21Btrfs: fix btrfs fallocate oops and deadlockChris Mason
Btrfs fallocate was incorrectly starting a transaction with a lock held on the extent_io tree for the file, which could deadlock. Strictly speaking it was using join_transaction which would be safe, but it is better to move the transaction outside of the lock. When preallocated extents are overwritten, btrfs_mark_buffer_dirty was being called on an unlocked buffer. This was triggering an assertion and oops because the lock is supposed to be held. The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had been run. btrfs_del_item takes care of dirtying things, so the solution is a to skip the btrfs_mark_buffer_dirty call in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix page_mkwrite() return code GFS2: Clear dirty bit at end of inode glock sync
2009-04-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: reiserfs: fix j_last_flush_trans_id type fs: Mark get_filesystem_list() as __init function. kill vfs_stat_fd / vfs_lstat_fd Separate out common fstatat code into vfs_fstatat ecryptfs: use memdup_user() ncpfs: use memdup_user() xfs: use memdup_user() sysfs: use memdup_user() btrfs: use memdup_user() xattr: use memdup_user() autofs4: use memchr() in invalid_string() Documentation/filesystems: remove out of date reference to BKL being held Fix i_mutex vs. readdir handling in nfsd fs/compat_ioctl: fix build when !BLOCK Fix autofs_expire() No need for crossing to mountpoint in audit_tag_tree() Safer nfsd_cross_mnt() Touch all affected namespaces on propagation of mount Fix AUTOFS_DEV_IOCTL_REQUESTER_CMD
2009-04-21NFS: Fix the XDR iovec calculation in nfs3_xdr_setaclargsTrond Myklebust
Commit ae46141ff08f1965b17c531b571953c39ce8b9e2 (NFSv3: Fix posix ACL code) introduces a bug in the calculation of the XDR header iovec. In the case where we are inlining the acls, we need to adjust the length of the iovec req->rq_svec, in addition to adjusting the total buffer length. Tested-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Tested-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-20fs: Mark get_filesystem_list() as __init function.Tetsuo Handa
"int get_filesystem_list(char * buf)" is called by only "static void __init get_fs_names(char *page)". We can mark get_filesystem_list() as "__init". Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20kill vfs_stat_fd / vfs_lstat_fdChristoph Hellwig
There's really no reason to keep vfs_stat_fd and vfs_lstat_fd with Oleg's vfs_fstatat. Use vfs_fstatat for the few cases having the directory fd, and switch all others to vfs_stat / vfs_lstat. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Separate out common fstatat code into vfs_fstatatOleg Drokin
This is a version incorporating Christoph's suggestion. Separate out common *fstatat functionality into a single function instead of duplicating it all over the code. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20ecryptfs: use memdup_user()Li Zefan
Remove open-coded memdup_user(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20ncpfs: use memdup_user()Li Zefan
Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20xfs: use memdup_user()Li Zefan
Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20sysfs: use memdup_user()Li Zefan
Remove open-coded memdup_user(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20btrfs: use memdup_user()Li Zefan
Remove open-coded memdup_user(). Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may cause pagefault, it's pointless to pass GFP_NOFS to kmalloc(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20xattr: use memdup_user()Li Zefan
Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20autofs4: use memchr() in invalid_string()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Fix i_mutex vs. readdir handling in nfsdDavid Woodhouse
Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a bug to generic code which had been extant for a long time in the XFS version -- it started to call through into lookup_one_len() and hence into the file systems' ->lookup() methods without i_mutex held on the directory. This patch fixes it by locking the directory's i_mutex again before calling the filldir functions. The original deadlocks which commit 14f7dd63 was designed to avoid are still avoided, because they were due to fs-internal locking, not i_mutex. While we're at it, fix the return type of nfsd_buffered_readdir() which should be a __be32 not an int -- it's an NFS errno, not a Linux errno. And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM. Sparse would have caught that, if it wasn't so busy bitching about __cold__. Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery code") introduced a similar problem with calling lookup_one_len() without i_mutex, which this patch also addresses. To fix that, it was necessary to fix the called functions so that they expect i_mutex to be held; that part was done by J. Bruce Fields. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Umm-I-can-live-with-that-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp> Tested-by: J. Bruce Fields <bfields@citi.umich.edu> LKML-Reference: <8036.1237474444@jrobl> Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20fs/compat_ioctl: fix build when !BLOCKAlexander Beregalov
In file included from fs/compat_ioctl.c:61: include/linux/loop.h:59: error: field 'lo_bio_list' has incomplete type Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Fix autofs_expire()Al Viro
mnt should remain the same for all iterations through the list; as it is, if we have a busy mount, mnt follows into it and isn't restored for the next iteration. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20No need for crossing to mountpoint in audit_tag_tree()Al Viro
is_under() will DTRT anyway. And yes, is_subdir() behaviour is intentional. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Safer nfsd_cross_mnt()Al Viro
AFAICS, we have a subtle bug there: if we have crossed mountpoint *and* it got mount --move'd away, we'll be holding only one reference to fs containing dentry - exp->ex_path.mnt. IOW, we ought to dput() before exp_put(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Touch all affected namespaces on propagation of mountAl Viro
We shouldn't just touch the namespace of current process Caught-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Fix AUTOFS_DEV_IOCTL_REQUESTER_CMDAl Viro
Missing conversion from kernel to userland dev_t; this sucker breaks as soon as we get sufficiently many autofs mounts for new_encode_dev(s_dev) != s_dev. Note: this is the minimal fix. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Btrfs: use the right node in reada_for_balanceChris Mason
reada_for_balance was using the wrong index into the path node array, so it wasn't reading the right blocks. We never directly used the results of the read done by this function because the btree search is started over at the end. This fixes reada_for_balance to reada in the correct node and to avoid searching past the last slot in the node. It also makes sure to hold the parent lock while we are finding the nodes to read. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20Btrfs: fix oops on page->mapping->host during writepageChris Mason
The extent_io writepage call updates the writepage index in the inode as it makes progress. But, it was doing the update after unlocking the page, which isn't legal because page->mapping can't be trusted once the page is unlocked. This lead to an oops, especially common with compression turned on. The fix here is to update the writeback index before unlocking the page. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20Btrfs: add a priority queue to the async thread helpersChris Mason
Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a higher priority. But, the checksumming helper threads prevent it from being fully effective. There are two problems. First, a big queue of pending checksumming will delay the synchronous IO behind other lower priority writes. Second, the checksumming uses an ordered async work queue. The ordering makes sure that IOs are sent to the block layer in the same order they are sent to the checksumming threads. Usually this gives us less seeky IO. But, when we start mixing IO priorities, the lower priority IO can delay the higher priority IO. This patch solves both problems by adding a high priority list to the async helper threads, and a new btrfs_set_work_high_prio(), which is used to make put a new async work item onto the higher priority list. The ordering is still done on high priority IO, but all of the high priority bios are ordered separately from the low priority bios. This ordering is purely an IO optimization, it is not involved in data or metadata integrity. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20Btrfs: use WRITE_SYNC for synchronous writesChris Mason
Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for writes we plan on waiting on in the near future. This patch mirrors recent changes in other filesystems and the generic code to use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for other latency critical writes. Btrfs uses async worker threads for checksumming before the write is done, and then again to actually submit the bios. The bio submission code just runs a per-device list of bios that need to be sent down the pipe. This list is split into low priority and high priority lists so the WRITE_SYNC IO happens first. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20GFS2: Fix page_mkwrite() return codeSteven Whitehouse
This allows for the possibility of returning VM_FAULT_OOM as well as VM_FAULT_SIGBUS. This ensures that the correct action is taken. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-20GFS2: Clear dirty bit at end of inode glock syncSteven Whitehouse
The dirty bit can get set during the inode glock sync. Its too complicated to change that at the moment, so this is the quick fix - to clear the bit again at the end of the function. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>