From 9d9b87c1218be78ddecbc85ec3bb91c79c1d56ab Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Wed, 4 Feb 2009 17:35:38 -0500 Subject: lockd: fix regression in lockd's handling of blocked locks If a client requests a blocking lock, is denied, then requests it again, then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP set, because we've already queued a block and don't need the locks code to do it again. But that means vfs_lock_file() will return -EAGAIN instead of FILE_LOCK_DENIED. So we still need to translate that -EAGAIN return into a nlm_lck_blocked error in this case, and put ourselves back on lockd's block list. The bug was introduced by bde74e4bc64415b1 "locks: add special return value for asynchronous locks". Thanks to Frank van Maarseveen for the report; his original test case was essentially for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done Tested-by: Frank van Maarseveen Reported-by: Frank van Maarseveen Cc: Miklos Szeredi Signed-off-by: J. Bruce Fields --- fs/lockd/svclock.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c index 6063a8e4b9f..763b78a6e9d 100644 --- a/fs/lockd/svclock.c +++ b/fs/lockd/svclock.c @@ -427,7 +427,7 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file, goto out; case -EAGAIN: ret = nlm_lck_denied; - goto out; + break; case FILE_LOCK_DEFERRED: if (wait) break; @@ -443,6 +443,10 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file, goto out; } + ret = nlm_lck_denied; + if (!wait) + goto out; + ret = nlm_lck_blocked; /* Append to list of blocked */ -- cgit v1.2.3 From 284b066af41579f62649048fdec5c5e7091703e6 Mon Sep 17 00:00:00 2001 From: Chris Mason Date: Mon, 9 Feb 2009 16:22:03 -0500 Subject: Btrfs: don't use spin_is_contended Btrfs was using spin_is_contended to see if it should drop locks before doing extent allocations during btrfs_search_slot. The idea was to avoid expensive searches in the tree unless the lock was actually contended. But, spin_is_contended is specific to the ticket spinlocks on x86, so this is causing compile errors everywhere else. In practice, the contention could easily appear some time after we started doing the extent allocation, and it makes more sense to always drop the lock instead. Signed-off-by: Chris Mason --- fs/btrfs/ctree.c | 3 +-- fs/btrfs/locking.c | 22 ---------------------- fs/btrfs/locking.h | 2 -- 3 files changed, 1 insertion(+), 26 deletions(-) (limited to 'fs') diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 551177c0011..35443cc4b9a 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1530,8 +1530,7 @@ again: * for higher level blocks, try not to allocate blocks * with the block and the parent locks held. */ - if (level > 0 && !prealloc_block.objectid && - btrfs_path_lock_waiting(p, level)) { + if (level > 0 && !prealloc_block.objectid) { u32 size = b->len; u64 hint = b->start; diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index 68fd9ccf180..9ebe9385129 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -236,25 +236,3 @@ int btrfs_tree_locked(struct extent_buffer *eb) return test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags) || spin_is_locked(&eb->lock); } - -/* - * btrfs_search_slot uses this to decide if it should drop its locks - * before doing something expensive like allocating free blocks for cow. - */ -int btrfs_path_lock_waiting(struct btrfs_path *path, int level) -{ - int i; - struct extent_buffer *eb; - - for (i = level; i <= level + 1 && i < BTRFS_MAX_LEVEL; i++) { - eb = path->nodes[i]; - if (!eb) - break; - smp_mb(); - if (spin_is_contended(&eb->lock) || - waitqueue_active(&eb->lock_wq)) - return 1; - } - return 0; -} - diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index d92e707f587..6bb0afbff92 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -26,8 +26,6 @@ int btrfs_tree_locked(struct extent_buffer *eb); int btrfs_try_tree_lock(struct extent_buffer *eb); int btrfs_try_spin_lock(struct extent_buffer *eb); -int btrfs_path_lock_waiting(struct btrfs_path *path, int level); - void btrfs_set_lock_blocking(struct extent_buffer *eb); void btrfs_clear_lock_blocking(struct extent_buffer *eb); #endif -- cgit v1.2.3 From 5a6fe125950676015f5108fb71b2a67441755003 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 10 Feb 2009 14:02:27 +0000 Subject: Do not account for the address space used by hugetlbfs using VM_ACCOUNT When overcommit is disabled, the core VM accounts for pages used by anonymous shared, private mappings and special mappings. It keeps track of VMAs that should be accounted for with VM_ACCOUNT and VMAs that never had a reserve with VM_NORESERVE. Overcommit for hugetlbfs is much riskier than overcommit for base pages due to contiguity requirements. It avoids overcommiting on both shared and private mappings using reservation counters that are checked and updated during mmap(). This ensures (within limits) that hugepages exist in the future when faults occurs or it is too easy to applications to be SIGKILLed. As hugetlbfs makes its own reservations of a different unit to the base page size, VM_ACCOUNT should never be set. Even if the units were correct, we would double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may be set because an application can request no reserves be made for hugetlbfs at the risk of getting killed later. With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This breaks the accounting for both the core VM and hugetlbfs, can trigger an OOM storm when hugepage pools are too small lockups and corrupted counters otherwise are used. This patch brings hugetlbfs more in line with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set. Signed-off-by: Mel Gorman Signed-off-by: Linus Torvalds --- fs/hugetlbfs/inode.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'fs') diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 6903d37af03..9b800d97a68 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -108,7 +108,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h), - len >> huge_page_shift(h), vma)) + len >> huge_page_shift(h), vma, + vma->vm_flags)) goto out; ret = 0; @@ -947,7 +948,7 @@ static int can_do_hugetlb_shm(void) can_do_mlock()); } -struct file *hugetlb_file_setup(const char *name, size_t size) +struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag) { int error = -ENOMEM; struct file *file; @@ -981,7 +982,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size) error = -ENOMEM; if (hugetlb_reserve_pages(inode, 0, - size >> huge_page_shift(hstate_inode(inode)), NULL)) + size >> huge_page_shift(hstate_inode(inode)), NULL, + acctflag)) goto out_inode; d_instantiate(dentry, inode); -- cgit v1.2.3 From 8fe4cd0dc5ea43760c59eb256404188272cc95dd Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 11 Feb 2009 13:04:25 -0800 Subject: jbd: fix return value of journal_start_commit() journal_start_commit() returns 1 if either a transaction is committing or the function has queued a transaction commit. But it returns 0 if we raced with somebody queueing the transaction commit as well. This resulted in ext3_sync_fs() not functioning correctly (description from Arthur Jones): In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. This can be reproduced with a script created by Eric Sandeen : #!/bin/bash umount /mnt/test2 mount /dev/sdb4 /mnt/test2 rm -f /mnt/test2/* dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512 touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link umount /mnt/test2 mount /dev/sdb4 /mnt/test2 ls /mnt/test2/ This patch fixes journal_start_commit() to always return 1 when there's a transaction committing or queued for commit. Cc: Eric Sandeen Cc: Mike Snitzer Cc: Signed-off-by: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/jbd/journal.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) (limited to 'fs') diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index 9e4fa52d7dc..e79c07812af 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -427,7 +427,7 @@ int __log_space_left(journal_t *journal) } /* - * Called under j_state_lock. Returns true if a transaction was started. + * Called under j_state_lock. Returns true if a transaction commit was started. */ int __log_start_commit(journal_t *journal, tid_t target) { @@ -495,7 +495,8 @@ int journal_force_commit_nested(journal_t *journal) /* * Start a commit of the current running transaction (if any). Returns true - * if a transaction was started, and fills its tid in at *ptid + * if a transaction is going to be committed (or is currently already + * committing), and fills its tid in at *ptid */ int journal_start_commit(journal_t *journal, tid_t *ptid) { @@ -505,15 +506,19 @@ int journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - ret = __log_start_commit(journal, tid); - if (ret && ptid) + __log_start_commit(journal, tid); + /* There's a running transaction and we've just made sure + * it's commit has been scheduled. */ + if (ptid) *ptid = tid; - } else if (journal->j_committing_transaction && ptid) { + ret = 1; + } else if (journal->j_committing_transaction) { /* * If ext3_write_super() recently started a commit, then we * have to wait for completion of that transaction */ - *ptid = journal->j_committing_transaction->t_tid; + if (ptid) + *ptid = journal->j_committing_transaction->t_tid; ret = 1; } spin_unlock(&journal->j_state_lock); -- cgit v1.2.3 From 02ac597c9b86af49b2016aa98aee20ab59dbf0d2 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 11 Feb 2009 13:04:26 -0800 Subject: ext3: revert "ext3: wait on all pending commits in ext3_sync_fs" This reverts commit c87591b719737b4e91eb1a9fa8fd55a4ff1886d6. Since journal_start_commit() is now fixed to return 1 when we started a transaction commit, there's some transaction waiting to be committed or there's a transaction already committing, we don't need to call ext3_force_commit() in ext3_sync_fs(). Furthermore ext3_force_commit() can unnecessarily create sync transaction which is expensive so it's worthwhile to remove it when we can. Cc: Eric Sandeen Cc: Signed-off-by: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext3/super.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'fs') diff --git a/fs/ext3/super.c b/fs/ext3/super.c index b70d90e08a3..4a970411a45 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2428,12 +2428,13 @@ static void ext3_write_super (struct super_block * sb) static int ext3_sync_fs(struct super_block *sb, int wait) { - sb->s_dirt = 0; - if (wait) - ext3_force_commit(sb); - else - journal_start_commit(EXT3_SB(sb)->s_journal, NULL); + tid_t target; + sb->s_dirt = 0; + if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) { + if (wait) + log_wait_commit(EXT3_SB(sb)->s_journal, target); + } return 0; } -- cgit v1.2.3 From 0e4a9b59282914fe057ab17027f55123964bc2e2 Mon Sep 17 00:00:00 2001 From: Carsten Otte Date: Wed, 11 Feb 2009 13:04:37 -0800 Subject: ext2/xip: refuse to change xip flag during remount with busy inodes For a reason that I was unable to understand in three months of debugging, mount ext2 -o remount stopped working properly when remounting from regular operation to xip, or the other way around. According to a git bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL rework in the vm: commit 70688e4dd1647f0ceb502bbd5964fa344c5eb411 Author: Nick Piggin Date: Mon Apr 28 02:13:02 2008 -0700 xip: support non-struct page backed memory In the failing scenario, the filesystem is mounted read only via root= kernel parameter on s390x. During remount (in rc.sysinit), the inodes of the bash binary and its libraries are busy and cannot be invalidated (the bash which is running rc.sysinit resides on subject filesystem). Afterwards, another bash process (running ifup-eth) recurses into a subshell, runs dup_mm (via fork). Some of the mappings in this bash process were created from inodes that could not be invalidated during remount. Both parent and child process crash some time later due to inconsistencies in their address spaces. The issue seems to be timing sensitive, various attempts to recreate it have failed. This patch refuses to change the xip flag during remount in case some inodes cannot be invalidated. This patch keeps users from running into that issue. [akpm@linux-foundation.org: cleanup] Signed-off-by: Carsten Otte Cc: Nick Piggin Cc: Jared Hulbert Cc: Martin Schwidefsky Cc: Heiko Carstens Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext2/super.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'fs') diff --git a/fs/ext2/super.c b/fs/ext2/super.c index da8bdeaa2e6..7c6e3606f0e 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -1185,9 +1185,12 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) es = sbi->s_es; if (((sbi->s_mount_opt & EXT2_MOUNT_XIP) != (old_mount_opt & EXT2_MOUNT_XIP)) && - invalidate_inodes(sb)) - ext2_warning(sb, __func__, "busy inodes while remounting "\ - "xip remain in cache (no functional problem)"); + invalidate_inodes(sb)) { + ext2_warning(sb, __func__, "refusing change of xip flag " + "with busy inodes while remounting"); + sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; + sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; + } if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) return 0; if (*flags & MS_RDONLY) { -- cgit v1.2.3