aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-01-26fix oops in fs/9p late mount failureAl Viro
if 9P ->get_sb() fails late (at root inode or root dentry allocation), we'll hit its ->kill_sb() with NULL ->s_root Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26fix leak in romfs_fill_super()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26get rid of pointless checks after simple_pin_fs()Al Viro
if we'd just got success from it, vfsmount won't be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26Fix failure exits in bfs_fill_super()Al Viro
double iput(), leaks... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26fix affs parse_options()Al Viro
Error handling in that sucker got broken back in 2003. If function returns 0 on failure, it's not nice to add return -EINVAL into it. Adding return 1 on other failure exits is also not a good thing (and yes, original success exits with 1 and some of failure exits with 0 are still there; so's the original logics in callers). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26Fix remount races with symlink handling in affsAl Viro
A couple of fields in affs_sb_info is used in follow_link() and symlink() for handling AFFS "absolute" symlinks. Need locking against affs_remount() updates. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26Fix a leak in affs_fill_super()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26fnctl: f_modown should call write_lock_irqsave/restoreGreg Kroah-Hartman
Commit 703625118069f9f8960d356676662d3db5a9d116 exposed that f_modown() should call write_lock_irqsave instead of just write_lock_irq so that because a caller could have a spinlock held and it would not be good to renable interrupts. Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Tavis Ormandy <taviso@google.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-25Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Drop EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE flag ext4: Fix quota accounting error with fallocate ext4: Handle -EDQUOT error on write
2010-01-25eventfd - allow atomic read and waitqueue removeDavide Libenzi
KVM needs a wait to atomically remove themselves from the eventfd ->poll() wait queue head, in order to handle correctly their IRQfd deassign operation. This patch introduces such API, plus a way to read an eventfd from its context. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-01-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: tty: fix race in tty_fasync serial: serial_cs: oxsemi quirk breaks resume serial: imx: bit &/| confusion serial: Fix crash if the minimum rate of the device is > 9600 baud serial-core: resume serial hardware with no_console_suspend serial: 8250_pnp: use wildcard for serial Wacom tablets nozomi: quick fix for the close/close bug compat_ioctl: Supress "unknown cmd" message on serial /dev/console
2010-01-21Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: fs/bio.c: fix shadows sparse warning drbd: The kernel code is now equivalent to out of tree release 8.3.7 drbd: Allow online resizing of DRBD devices while peer not reachable (needs to be explicitly forced) drbd: Don't go into StandAlone mode when authentification failes because of network error drivers/block/drbd/drbd_receiver.c: correct NULL test cfq-iosched: Respect ioprio_class when preempting genhd: overlapping variable definition block: removed unused as_io_context DM: Fix device mapper topology stacking block: bdev_stack_limits wrapper block: Fix discard alignment calculation and printing block: Correct handling of bottom device misaligment drbd: check on CONFIG_LBDAF, not LBD drivers/block/drbd: Correct NULL test drbd: Silenced an assert that could triggered after changing write ordering method drbd: Kconfig fix drbd: Fix for a race between IO and a detach operation [Bugz 262] drbd: Use drbd_crypto_is_hash() instead of an open coded check
2010-01-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: ecryptfs: use after free ecryptfs: Eliminate useless code ecryptfs: fix interpose/interpolate typos in comments ecryptfs: pass matching flags to interpose as defined and used there ecryptfs: remove unnecessary d_drop calls in ecryptfs_link ecryptfs: don't ignore return value from lock_rename ecryptfs: initialize private persistent file before dereferencing pointer eCryptfs: Remove mmap from directory operations eCryptfs: Add getattr function eCryptfs: Use notify_change for truncating lower inodes
2010-01-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix possible panic on unmount Btrfs: deal with NULL acl sent to btrfs_set_acl Btrfs: fix regression in orphan cleanup Btrfs: Fix race in btrfs_mark_extent_written Btrfs, fix memory leaks in error paths Btrfs: align offsets for btrfs_ordered_update_i_size btrfs: fix missing last-entry in readdir(3)
2010-01-20compat_ioctl: Supress "unknown cmd" message on serial /dev/consoleAtsushi Nemoto
After the commit fb07a5f8 ("compat_ioctl: remove all VT ioctl handling"), I got this error message on 64-bit mips kernel with 32-bit busybox userland: ioctl32(init:1): Unknown cmd fd(0) cmd(00005600){t:'V';sz:0} arg(7fd76480) on /dev/console The cmd 5600 is VT_OPENQRY. The busybox's init issues this ioctl to know vt-console or serial-console. If the console was serial console, VT ioctls are not handled by the serial driver. And by quick search, I found some programs using VT_GETMODE to check vt-console is available or not. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-19ecryptfs: use after freeDan Carpenter
The "full_alg_name" variable is used on a couple error paths, so we shouldn't free it until the end. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: stable@kernel.org Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: Eliminate useless codeJulia Lawall
The variable lower_dentry is initialized twice to the same (side effect-free) expression. Drop one initialization. A simplified version of the semantic match that finds this problem is: (http://coccinelle.lip6.fr/) // <smpl> @forall@ idexpression *x; identifier f!=ERR_PTR; @@ x = f(...) ... when != x ( x = f(...,<+...x...+>,...) | * x = f(...) ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: fix interpose/interpolate typos in commentsErez Zadok
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Acked-by: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: pass matching flags to interpose as defined and used thereErez Zadok
ecryptfs_interpose checks if one of the flags passed is ECRYPTFS_INTERPOSE_FLAG_D_ADD, defined as 0x00000001 in ecryptfs_kernel.h. But the only user of ecryptfs_interpose to pass a non-zero flag to it, has hard-coded the value as "1". This could spell trouble if any of these values changes in the future. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Cc: Dustin Kirkland <kirkland@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: remove unnecessary d_drop calls in ecryptfs_linkErez Zadok
Unnecessary because it would unhash perfectly valid dentries, causing them to have to be re-looked up the next time they're needed, which presumably is right after. Signed-off-by: Aseem Rastogi <arastogi@cs.sunysb.edu> Signed-off-by: Shrikar archak <shrikar84@gmail.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Cc: Saumitra Bhanage <sbhanage@cs.sunysb.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: don't ignore return value from lock_renameErez Zadok
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Cc: Dustin Kirkland <kirkland@canonical.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19ecryptfs: initialize private persistent file before dereferencing pointerErez Zadok
Ecryptfs_open dereferences a pointer to the private lower file (the one stored in the ecryptfs inode), without checking if the pointer is NULL. Right afterward, it initializes that pointer if it is NULL. Swap order of statements to first initialize. Bug discovered by Duckjin Kang. Signed-off-by: Duckjin Kang <fromdj2k@gmail.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Cc: Dustin Kirkland <kirkland@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19eCryptfs: Remove mmap from directory operationsTyler Hicks
Adrian reported that mkfontscale didn't work inside of eCryptfs mounts. Strace revealed the following: open("./", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3 fcntl64(3, F_GETFD) = 0x1 (flags FD_CLOEXEC) open("./fonts.scale", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 getdents(3, /* 80 entries */, 32768) = 2304 open("./.", O_RDONLY) = 5 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 fstat64(5, {st_mode=S_IFDIR|0755, st_size=16384, ...}) = 0 mmap2(NULL, 16384, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7fcf000 close(5) = 0 --- SIGBUS (Bus error) @ 0 (0) --- +++ killed by SIGBUS +++ The mmap2() on a directory was successful, resulting in a SIGBUS signal later. This patch removes mmap() from the list of possible ecryptfs_dir_fops so that mmap() isn't possible on eCryptfs directory files. https://bugs.launchpad.net/ecryptfs/+bug/400443 Reported-by: Adrian C. <anrxc@sysphere.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19eCryptfs: Add getattr functionTyler Hicks
The i_blocks field of an eCryptfs inode cannot be trusted, but generic_fillattr() uses it to instantiate the blocks field of a stat() syscall when a filesystem doesn't implement its own getattr(). Users have noticed that the output of du is incorrect on newly created files. This patch creates ecryptfs_getattr() which calls into the lower filesystem's getattr() so that eCryptfs can use its kstat.blocks value after calling generic_fillattr(). It is important to note that the block count includes the eCryptfs metadata stored in the beginning of the lower file plus any padding used to fill an extent before encryption. https://bugs.launchpad.net/ecryptfs/+bug/390833 Reported-by: Dominic Sacré <dominic.sacre@gmx.de> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19eCryptfs: Use notify_change for truncating lower inodesTyler Hicks
When truncating inodes in the lower filesystem, eCryptfs directly invoked vmtruncate(). As Christoph Hellwig pointed out, vmtruncate() is a filesystem helper function, but filesystems may need to do more than just a call to vmtruncate(). This patch moves the lower inode truncation out of ecryptfs_truncate() and renames the function to truncate_upper(). truncate_upper() updates an iattr for the lower inode to indicate if the lower inode needs to be truncated upon return. ecryptfs_setattr() then calls notify_change(), using the updated iattr for the lower inode, to complete the truncation. For eCryptfs functions needing to truncate, ecryptfs_truncate() is reintroduced as a simple way to truncate the upper inode to a specified size and then truncate the lower inode accordingly. https://bugs.launchpad.net/bugs/451368 Reported-by: Christoph Hellwig <hch@lst.de> Acked-by: Dustin Kirkland <kirkland@canonical.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19fs/bio.c: fix shadows sparse warningThiago Farina
fs/bio.c:81:33: warning: symbol 'bslab' shadows an earlier one fs/bio.c:74:25: originally declared here Signed-off-by: Thiago Farina <tfransosi@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-18Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: xfs_swap_extents needs to handle dynamic fork offsets xfs: fix missing error check in xfs_rtfree_range xfs: fix stale inode flush avoidance xfs: Remove inode iolock held check during allocation xfs: reclaim all inodes by background tree walks xfs: Avoid inodes in reclaim when flushing from inode cache xfs: reclaim inodes under a write lock
2010-01-17Btrfs: fix possible panic on unmountJosef Bacik
We can race with the unmount of an fs and the stopping of a kthread where we will free the block group before we're done using it. The reason for this is because we do not hold a reference on the block group while its caching, since the allocator drops its reference once it exits or moves on to the next block group. This patch fixes the problem by taking a reference to the block group before we start caching and dropping it when we're done to make sure all accesses to the block group are safe. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Btrfs: deal with NULL acl sent to btrfs_set_aclChris Mason
It is legal for btrfs_set_acl to be sent a NULL acl. This makes sure we don't dereference it. A similar patch was sent by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Btrfs: fix regression in orphan cleanupJosef Bacik
Currently orphan cleanup only ever gets triggered if we cross subvolumes during a lookup, which means that if we just mount a plain jane fs that has orphans in it, they will never get cleaned up. This results in panic's like these http://www.kerneloops.org/oops.php?number=1109085 where adding an orphan entry results in -EEXIST being returned and we panic. In order to fix this, we check to see on lookup if our root has had the orphan cleanup done, and if not go ahead and do it. This is easily reproduceable by running this testcase #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <string.h> #include <unistd.h> #include <stdio.h> int main(int argc, char **argv) { char data[4096]; char newdata[4096]; int fd1, fd2; memset(data, 'a', 4096); memset(newdata, 'b', 4096); while (1) { int i; fd1 = creat("file1", 0666); if (fd1 < 0) break; for (i = 0; i < 512; i++) write(fd1, data, 4096); fsync(fd1); close(fd1); fd2 = creat("file2", 0666); if (fd2 < 0) break; ftruncate(fd2, 4096 * 512); for (i = 0; i < 512; i++) write(fd2, newdata, 4096); close(fd2); i = rename("file2", "file1"); unlink("file1"); } return 0; } and then pulling the power on the box, and then trying to run that test again when the box comes back up. I've tested this locally and it fixes the problem. Thanks to Tomas Carnecky for helping me track this down initially. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Btrfs: Fix race in btrfs_mark_extent_writtenYan, Zheng
Fix bug reported by Johannes Hirte. The reason of that bug is btrfs_del_items is called after btrfs_duplicate_item and btrfs_del_items triggers tree balance. The fix is check that case and call btrfs_search_slot when needed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Btrfs, fix memory leaks in error pathsJiri Slaby
Stanse found 2 memory leaks in relocate_block_group and __btrfs_map_block. cluster and multi are not freed/assigned on all paths. Fix that. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Btrfs: align offsets for btrfs_ordered_update_i_sizeYan, Zheng
Some callers of btrfs_ordered_update_i_size can now pass in a NULL for the ordered extent to update against. This makes sure we properly align the offset they pass in when deciding how much to bump the on disk i_size. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17btrfs: fix missing last-entry in readdir(3)Jan Engelhardt
parent 49313cdac7b34c9f7ecbb1780cfc648b1c082cd7 (v2.6.32-1-g49313cd) commit ff48c08e1c05c67e8348ab6f8a24de8034e0e34d Author: Jan Engelhardt <jengelh@medozas.de> Date: Wed Dec 9 22:57:36 2009 +0100 Btrfs: fix missing last-entry in readdir(3) When one does a 32-bit readdir(3), the last entry of a directory is missing. This is however not due to passing a large value to filldir, but it seems to have to do with glibc doing telldir or something quirky. In any case, this patch fixes it in practice. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: do_add_mount() should sanitize mnt_flags CIFS shouldn't make mountpoints shrinkable mnt_flags fixes in do_remount() attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared() may_umount() needs namespace_sem Fix configfs leak Fix the -ESTALE handling in do_filp_open() ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error path Fix ACC_MODE() for real Unrot uml mconsole a bit hppfs: handle ->put_link() Kill 9p readlink() fix autofs/afs/etc. magic mountpoint breakage
2010-01-16nommu: fix shared mmap after truncate shrinkage problemsDavid Howells
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16nommu: fix race between ramfs truncation and shared mmapDavid Howells
Fix the race between the truncation of a ramfs file and an attempt to make a shared mmap of region of that file. The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to verify that the file region is made of contiguous pages and to find its base address - but there isn't any locking to guarantee this region until vma_prio_tree_insert() is called by add_vma_to_mm(). Note that moving the functionality into f_op->mmap() doesn't help as that is also called before vma_prio_tree_insert(). Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it does its checks. This means that this function will wait whilst mmaps take place. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16do_add_mount() should sanitize mnt_flagsAl Viro
MNT_WRITE_HOLD shouldn't leak into new vfsmount and neither should MNT_SHARED (the latter will be set properly, along with the rest of shared-subtree data structures) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16CIFS shouldn't make mountpoints shrinkableAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16mnt_flags fixes in do_remount()Al Viro
* need vfsmount_lock over modifying it * need to preserve MNT_SHARED/MNT_UNBINDABLE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()Al Viro
race in mnt_flags update Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16may_umount() needs namespace_semAl Viro
otherwise it races with clone_mnt() changing mnt_share/mnt_slaves Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-15inotify: only warn once for inotify problemsEric Paris
inotify will WARN() if it finds that the idr and the fsnotify internals somehow got out of sync. It was only supposed to do this once but due to this stupid bug it would warn every single time a problem was detected. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15inotify: do not reuse watch descriptorsEric Paris
Since commit 7e790dd5fc937bc8d2400c30a05e32a9e9eef276 ("inotify: fix error paths in inotify_update_watch") inotify changed the manor in which it gave watch descriptors back to userspace. Previous to this commit inotify acted like the following: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 2 but after this patch inotify would return watch descriptors like so: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 1 which I saw as equivalent to opening an fd where open(file) = 1; close(1); open(file) = 1; seemed perfectly reasonable. The issue is that quite a bit of userspace apparently relies on the behavior in which watch descriptors will not be quickly reused. KDE relies on it, I know some selinux packages rely on it, and I have heard complaints from other random sources such as debian bug 558981. Although the man page implies what we do is ok, we broke userspace so this patch almost reverts us to the old behavior. It is still slightly racey and I have patches that would fix that, but they are rather large and this will fix it for all real world cases. The race is as follows: - task1 creates a watch and blocks in idr_new_watch() before it updates the hint. - task2 creates a watch and updates the hint. - task1 updates the hint with it's older wd - task removes the watch created by task2 - task adds a new watch and will reuse the wd originally given to task2 it requires moving some locking around the hint (last_wd) but this should solve it for the real world and be -stable safe. As a side effect this patch papers over a bug in the lib/idr code which is causing a large number WARN's to pop on people's system and many reports in kerneloops.org. I'm working on the root cause of that idr bug seperately but this should make inotify immune to that issue. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15xfs: xfs_swap_extents needs to handle dynamic fork offsetsDave Chinner
When swapping extents, we can corrupt inodes by swapping data forks that are in incompatible formats. This is caused by the two indoes having different fork offsets due to the presence of an attribute fork on an attr2 filesystem. xfs_fsr tries to be smart about setting the fork offset, but the trick it plays only works on attr1 (old fixed format attribute fork) filesystems. Changing the way xfs_fsr sets up the attribute fork will prevent this situation from ever occurring, so in the kernel code we can get by with a preventative fix - check that the data fork in the defragmented inode is in a format valid for the inode it is being swapped into. This will lead to files that will silently and potentially repeatedly fail defragmentation, so issue a warning to the log when this particular failure occurs to let us know that xfs_fsr needs updating/fixing. To help identify how to improve xfs_fsr to avoid this issue, add trace points for the inodes being swapped so that we can determine why the swap was rejected and to confirm that the code is making the right decisions and modifications when swapping forks. A further complication is even when the swap is allowed to proceed when the fork offset is different between the two inodes then value for the maximum number of extents the data fork can hold can be wrong. Make sure these are also set correctly after the swap occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: fix missing error check in xfs_rtfree_rangeDave Chinner
When xfs_rtfind_forw() returns an error, the block is returned uninitialised. xfs_rtfree_range() is not checking the error return, so could be using an uninitialised block number for modifying bitmap summary info. The problem was found by gcc when compiling the *userspace* libxfs code - it is an copy of the kernel code with the exact same bug. gcc gives an uninitialised variable warning on the userspace code but not on the kernel code. You gotta love the consistency (Mmmm, slightly chewy today!). Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: fix stale inode flush avoidanceDave Chinner
When reclaiming stale inodes, we need to guarantee that inodes are unpinned before returning with a "clean" status. If we don't we can reclaim inodes that are pinned, leading to use after free in the transaction subsystem as transactions complete. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: Remove inode iolock held check during allocationDave Chinner
lockdep complains about a the lock not being initialised as we do an ASSERT based check that the lock is not held before we initialise it to catch inodes freed with the lock held. lockdep does this check for us in the lock initialisation code, so remove the ASSERT to stop the lockdep warning. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: reclaim all inodes by background tree walksDave Chinner
We cannot do direct inode reclaim without taking the flush lock to ensure that we do not reclaim an inode under IO. We check the inode is clean before doing direct reclaim, but this is not good enough because the inode flush code marks the inode clean once it has copied the in-core dirty state to the backing buffer. It is the flush lock that determines whether the inode is still under IO, even though it is marked clean, and the inode is still required at IO completion so we can't reclaim it even though it is clean in core. Hence the requirement that we need to take the flush lock even on clean inodes because this guarantees that the inode writeback IO has completed and it is safe to reclaim the inode. With delayed write inode flushing, we coul dend up waiting a long time on the flush lock even for a clean inode. The background reclaim already handles this efficiently, so avoid all the problems by killing the direct reclaim path altogether. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: Avoid inodes in reclaim when flushing from inode cacheDave Chinner
The reclaim code will handle flushing of dirty inodes before reclaim occurs, so avoid them when determining whether an inode is a candidate for flushing to disk when walking the radix trees. This is based on a test patch from Christoph Hellwig. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>