aboutsummaryrefslogtreecommitdiff
path: root/include/linux/fs.h
AgeCommit message (Collapse)Author
2008-04-21[PATCH] move a bunch of declarations to fs/internal.hAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21Merge branch 'semaphore' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: Deprecate the asm/semaphore.h files in feature-removal-schedule. Convert asm/semaphore.h users to linux/semaphore.h security: Remove unnecessary inclusions of asm/semaphore.h lib: Remove unnecessary inclusions of asm/semaphore.h kernel: Remove unnecessary inclusions of asm/semaphore.h include: Remove unnecessary inclusions of asm/semaphore.h fs: Remove unnecessary inclusions of asm/semaphore.h drivers: Remove unnecessary inclusions of asm/semaphore.h net: Remove unnecessary inclusions of asm/semaphore.h arch: Remove unnecessary inclusions of asm/semaphore.h
2008-04-19[PATCH] r/o bind mounts: debugging for missed callsDave Hansen
There have been a few oopses caused by 'struct file's with NULL f_vfsmnts. There was also a set of potentially missed mnt_want_write()s from dentry_open() calls. This patch provides a very simple debugging framework to catch these kinds of bugs. It will WARN_ON() them, but should stop us from having any oopses or mnt_writer count imbalances. I'm quite convinced that this is a good thing because it found bugs in the stuff I was working on as soon as I wrote it. [hch: made it conditional on a debug option. But it's still a little bit too ugly] [hch: merged forced remount r/o fix from Dave and akpm's fix for the fix] Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19[PATCH] merge open_namei() and do_filp_open()Christoph Hellwig
open_namei() will, in the future, need to take mount write counts over its creation and truncation (via may_open()) operations. It needs to keep these write counts until any potential filp that is created gets __fput()'d. This gets complicated in the error handling and becomes very murky as to how far open_namei() actually got, and whether or not that mount write count was taken. That makes it a bad interface. All that the current do_filp_open() really does is allocate the nameidata on the stack, then call open_namei(). So, this merges those two functions and moves filp_open() over to namei.c so it can be close to its buddy: do_filp_open(). It also gets a kerneldoc comment in the process. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-18Convert asm/semaphore.h users to linux/semaphore.hMatthew Wilcox
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-02-19make struct def_blk_aops staticAdrian Bunk
This patch makes the needlessly global struct def_blk_aops static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-02-14vfs: add explanation of I_DIRTY_DATASYNC bitJan Kara
Add explanation of I_DIRTY_DATASYNC bit. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Joern Engel <joern@logfs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08fs/char_dev.c: chrdev_open marked static and removed from fs.hDenis Cheng
There is an outdated comment in serial_core.c also fixed. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mount options: add generic_show_options()Miklos Szeredi
Add a new s_options field to struct super_block. Filesystems can save mount options passed to them in mount or remount. It is automatically freed when the superblock is destroyed. A new helper function, generic_show_options() is introduced, which uses this field to display the mount options in /proc/mounts. Another helper function, save_mount_options() may be used by filesystems to save the options in the super block. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kill do_generic_mapping_readChristoph Hellwig
do_generic_mapping_read was used by gfs2 for internals reads, but this use of the interface was rather suboptimal (as was the whole interface) and has been replaced by an internal helper now. This patch kills do_generic_mapping_read and surrounding damage in preparation of additional cleanups for the buffered read path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08libfs: rename simple_attr_close to simple_attr_releaseChristoph Hellwig
simple_attr_close implementes ->release so it should be named accordingly. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <stefano.brivio@polimi.it> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg KH <greg@kroah.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08libfs: allow error return from simple attributesChristoph Hellwig
Sometimes simple attributes might need to return an error, e.g. for acquiring a mutex interruptibly. In fact we have that situation in spufs already which is the original user of the simple attributes. This patch merged the temporarily forked attributes in spufs back into the main ones and allows to return errors. [akpm@linux-foundation.org: build fix] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <stefano.brivio@polimi.it> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg KH <greg@kroah.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07iget: remove iget() and the read_inode() super op as being obsoleteDavid Howells
Remove the old iget() call and the read_inode() superblock operation it uses as these are really obsolete, and the use of read_inode() does not produce proper error handling (no distinction between ENOMEM and EIO when marking an inode bad). Furthermore, this removes the temptation to use iget() to find an inode by number in a filesystem from code outside that filesystem. iget_locked() should be used instead. A new function is added in an earlier patch (iget_failed) that is to be called to mark an inode as bad, unlock it and release it should the get routine fail. Mark iget() and read_inode() as being obsolete and remove references to them from the documentation. Typically a filesystem will be modified such that the read_inode function becomes an internal iget function, for example the following: void thingyfs_read_inode(struct inode *inode) { ... } would be changed into something like: struct inode *thingyfs_iget(struct super_block *sp, unsigned long ino) { struct inode *inode; int ret; inode = iget_locked(sb, ino); if (!inode) return ERR_PTR(-ENOMEM); if (!(inode->i_state & I_NEW)) return inode; ... unlock_new_inode(inode); return inode; error: iget_failed(inode); return ERR_PTR(ret); } and then thingyfs_iget() would be called rather than iget(), for example: ret = -EINVAL; inode = iget(sb, ino); if (!inode || is_bad_inode(inode)) goto error; becomes: inode = thingyfs_iget(sb, ino); if (IS_ERR(inode)) { ret = PTR_ERR(inode); goto error; } Note that is_bad_inode() does not need to be called. The error returned by thingyfs_iget() should render it unnecessary. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07iget: introduce a function to register iget failureDavid Howells
Introduce a function to register failure in an inode construction path. This includes marking the inode under construction as bad, unlocking it and releasing it. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07VFS: swap do_ioctl and vfs_ioctl namesErez Zadok
Rename old vfs_ioctl to do_ioctl, because the comment above it clearly indicates that it is an internal function not to be exported to modules; therefore it should have a more traditional do_XXX name. The new do_ioctl is exported in fs.h but not to modules. Rename the old do_ioctl to vfs_ioctl because the names vfs_XXX should preferably be reserved to callable VFS functions which modules may call, as many other vfs_XXX functions already do. Export the new vfs_ioctl to GPL modules so others can use it (including Unionfs and eCryptfs). Add DocBook for new vfs_ioctl. [akpm@linux-foundation.org: fix build] Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06get rid of NR_OPEN and introduce a sysctl_nr_openEric Dumazet
NR_OPEN (historically set to 1024*1024) actually forbids processes to open more than 1024*1024 handles. Unfortunatly some production servers hit the not so 'ridiculously high value' of 1024*1024 file descriptors per process. Changing NR_OPEN is not considered safe because of vmalloc space potential exhaust. This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to 1024*1024, so that admins can decide to change this limit if their workload needs it. [akpm@linux-foundation.org: export it for sparc64] Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06fs: use list_for_each_entry_reverse and kill sb_entryAkinobu Mita
Use list_for_each_entry_reverse for super_blocks list and remove unused sb_entry macro. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06Document I_SYNC and I_DATASYNCJoern Engel
After some archeology (see http://logfs.org/logfs/inode_state_bits) I finally figured out what the three I_DIRTY bits do. Maybe others would prefer less effort to reach this insight. Signed-off-by: Joern Engel <joern@logfs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06proper prototype for get_filesystem_list()Adrian Bunk
Ad a proper prototype for migration_init() in include/linux/fs.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits) Jesper Juhl is the new trivial patches maintainer Documentation: mention email-clients.txt in SubmittingPatches fs/binfmt_elf.c: spello fix do_invalidatepage() comment typo fix Documentation/filesystems/porting fixes typo fixes in net/core/net_namespace.c typo fix in net/rfkill/rfkill.c typo fixes in net/sctp/sm_statefuns.c lib/: Spelling fixes kernel/: Spelling fixes include/scsi/: Spelling fixes include/linux/: Spelling fixes include/asm-m68knommu/: Spelling fixes include/asm-frv/: Spelling fixes fs/: Spelling fixes drivers/watchdog/: Spelling fixes drivers/video/: Spelling fixes drivers/ssb/: Spelling fixes drivers/serial/: Spelling fixes drivers/scsi/: Spelling fixes ...
2008-02-03pid-namespaces-vs-locks-interactionVitaliy Gusev
fcntl(F_GETLK,..) can return pid of process for not current pid namespace (if process is belonged to the several namespaces). It is true also for pids in /proc/locks. So correct behavior is saving pointer to the struct pid of the process lock owner. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-02-03include/linux/: Spelling fixesJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-01-28ext4: Add inode version support in ext4Jean Noel Cordenner
This patch adds 64-bit inode version support to ext4. The lower 32 bits are stored in the osd1.linux1.l_i_version field while the high 32 bits are stored in the i_version_hi field newly created in the ext4_inode. This field is incremented in case the ext4_inode is large enough. A i_version mount option has been added to enable the feature. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
2008-01-28vfs: Add 64 bit i_version supportJean Noel Cordenner
The i_version field of the inode is changed to be a 64-bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the "ctime" time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530. This first part concerns the vfs, it converts the 32-bit i_version in the generic inode to a 64-bit, a flag is added in the super block in order to check if the feature is enabled and the i_version is incremented in the vfs. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
2008-01-24kobject: convert main fs kobject to use kobject_createGreg Kroah-Hartman
This also renames fs_subsys to fs_kobj to catch all current users with a build error instead of a build warning which can easily be missed. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-22exportfs: make struct export_operations constChristoph Hellwig
Now that nfsd has stopped writing to the find_exported_dentry member we an mark the export_operations const Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: <linux-ext4@vger.kernel.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Chinner <dgc@sgi.com> Cc: Timothy Shimmin <tes@sgi.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Chris Mason <mason@suse.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-21[PATCH] new helpers - collect_mounts() and release_collected_mounts()Al Viro
Get a snapshot of a subtree, creating private clones of vfsmounts for all its components and release such snapshot resp. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-10-20fix do_sys_open() prototypeJason Uhlenkott
Fix an argument name in do_sys_open()'s prototype. Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19pid namespaces: introduce MS_KERNMOUNT flagPavel Emelyanov
This flag tells the .get_sb callback that this is a kern_mount() call so that it can trust *data pointer to be valid in-kernel one. If this flag is passed from the user process, it is cleared since the *data pointer is not a valid kernel object. Running a few steps forward - this will be needed for proc to create the superblock and store a valid pid namespace on it during the namespace creation. The reason, why the namespace cannot live without proc mount is described in the appropriate patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19put declaration of put_filesystem() in fs.hMiklos Szeredi
Declarations go into headers. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Ram Pai <linuxram@us.ibm.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18VFS: allow filesystems to implement atomic open+truncateMiklos Szeredi
Add a new attribute flag ATTR_OPEN, with the meaning: "truncation was initiated by open() due to the O_TRUNC flag". This way filesystems wanting to implement truncation within their ->open() method can ignore such truncate requests. This is a quick & dirty hack, but it comes for free. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Implement file posix capabilitiesSerge E. Hallyn
Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fsccaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17introduce I_SYNCJoern Engel
I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. [bunk@stusta.de: make fs/inode.c:wake_up_inode() static] Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix ntfs with sb_has_dirty_inodes()Fengguang Wu
NTFS's if-condition on dirty inodes is not complete. Fix it with sb_has_dirty_inodes(). Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix periodic superblock dirty inode flushingKen Chen
Current -mm tree has bucketful of bug fixes in periodic writeback path. However, we still hit a glitch where dirty pages on a given inode aren't completely flushed to the disk, and system will accumulate large amount of dirty pages beyond what dirty_expire_interval is designed for. The problem is __sync_single_inode() will move an inode to sb->s_dirty list even when there are more pending dirty pages on that inode. If there is another inode with a small number of dirty pages, we hit a case where the loop iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0. Thus leaving the inode that has large amount of dirty pages behind and it has to wait for another dirty_writeback_interval before we flush it again. We effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval. If the rate of dirtying is sufficiently high, the system will start accumulate a large number of dirty pages. So fix it by having another sb->s_more_io list on which to park the inode while we iterate through sb->s_io and to allow each dirty inode which resides on that sb to have an equal chance of flushing some amount of dirty pages. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Fix f_version type: should be u64 instead of unsigned longMathieu Desnoyers
Fix f_version type: should be u64 instead of long There is a type inconsistency between struct inode i_version and struct file f_version. fs.h: struct inode u64 i_version; and struct file unsigned long f_version; Users do: fs/ext3/dir.c: if (filp->f_version != inode->i_version) { So why isn't f_version a u64 ? It becomes a problem if versions gets higher than 2^32 and we are on an architecture where longs are 32 bits. This patch changes the f_version type to u64, and updates the users accordingly. It applies to 2.6.23-rc2-mm2. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Martin Bligh <mbligh@google.com> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: <linux-ext4@vger.kernel.org> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17make fs/libfs.c:simple_commit_write() staticAdrian Bunk
simple_commit_write() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fs: remove the unused mempages parameterDenis Cheng
Since the mempages parameter is actually not used, they should be removed. Now there is only files_init use the mempages parameter, files_init(mempages); but I don't think the adaptation to mempages in files_init is really useful; and if files_init also changed to the prototype void (*func)(void), the wrapper vfs_caches_init would also not need the mempages parameter. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Remove sysctl.h from fs.hAlexey Dobriyan
Rrrr, addition of sysctl.h to fs.h was't very smart, because simple editing of the former will buy you big recompile, where it shouldn't have to. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16fs: remove some AOP_TRUNCATED_PAGENick Piggin
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and GFS2 were converted to the new aops, so we can make some simplifications for that. [michal.k.k.piotrowski@gmail.com: fix warning] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16fs: new cont helpersNick Piggin
Rework the generic block "cont" routines to handle the new aops. Supporting cont_prepare_write would take quite a lot of code to support, so remove it instead (and we later convert all filesystems to use it). write_begin gets passed AOP_FLAG_CONT_EXPAND when called from generic_cont_expand, so filesystems can avoid the old hacks they used. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16fs: introduce write_begin, write_end, and perform_write aopsNick Piggin
These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). [mark.fasheh@oracle.com: API design contributions, code review and fixes] [akpm@linux-foundation.org: various fixes] [dmonakhov@sw.ru: new aop block_write_begin fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16mm: buffered write iteratorNick Piggin
Add an iterator data structure to operate over an iovec. Add usercopy operators needed by generic_file_buffered_write, and convert that function over. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16readahead: combine file_ra_state.prev_index/prev_offset into prev_posFengguang Wu
Combine the file_ra_state members unsigned long prev_index unsigned int prev_offset into loff_t prev_pos It is more consistent and better supports huge files. Thanks to Peter for the nice proposal! [akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16readahead: mmap read-around simplificationFengguang Wu
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16readahead: compacting file_ra_stateFengguang Wu
Use 'unsigned int' instead of 'unsigned long' for readahead sizes. This helps reduce memory consumption on 64bit CPU when a lot of files are opened. CC: Andi Kleen <andi@firstfloor.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15Merge branch 'locks' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'locks' of git://linux-nfs.org/~bfields/linux: nfsd: remove IS_ISMNDLCK macro Rework /proc/locks via seq_files and seq_list helpers fs/locks.c: use list_for_each_entry() instead of list_for_each() NFS: clean up explicit check for mandatory locks AFS: clean up explicit check for mandatory locks 9PFS: clean up explicit check for mandatory locks GFS2: clean up explicit check for mandatory locks Cleanup macros for distinguishing mandatory locks Documentation: move locks.txt in filesystems/ locks: add warning about mandatory locking races Documentation: move mandatory locking documentation to filesystems/ locks: Fix potential OOPS in generic_setlease() Use list_first_entry in locks_wake_up_blocks locks: fix flock_lock_file() comment Memory shortage can result in inconsistent flocks state locks: kill redundant local variable locks: reverse order of posix_locks_conflict() arguments
2007-10-14lockdep: annotate dir vs file i_mutexPeter Zijlstra
On Mon, 2007-09-24 at 22:13 -0400, Steven Rostedt wrote: > The circular lock seems to be this: > > #1: > > sys_mmap2: down_write(&mm->mmap_sem); > nfs_revalidate_mapping: mutex_lock(&inode->i_mutex); > > > #0: > > vfs_readdir: mutex_lock(&inode->i_mutex); > - during the readdir (filldir64), we take a user fault (missing page?) > and call do_page_fault - > do_page_fault: down_read(&mm->mmap_sem); > > > So it does indeed look like a circular locking. Now the question is, "is > this a bug?". Looking like the inode of #1 must be a file or something > else that you can mmap and the inode of #0 seems it must be a directory. > I would say "no". > > Now if you can readdir on a file or mmap a directory, then this could be > an issue. > > Otherwise, I'd love to see someone teach lockdep about this issue! ;-) Make a distinction between file and dir usage of i_mutex. The inode should be complete and unused at unlock_new_inode(), re-init i_mutex depending on its type. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15lockdep: per filesystem inode lock classPeter Zijlstra
Give each filesystem its own inode lock class. The various filesystems have different locking order wrt the inode locks; esp. the pseudo filesystems differ from the rest. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-09Rework /proc/locks via seq_files and seq_list helpersPavel Emelyanov
Currently /proc/locks is shown with a proc_read function, but its behavior is rather complex as it has to manually handle current offset and buffer length. On the other hand, files that show objects from lists can be easily reimplemented using the sequential files and the seq_list_XXX() helpers. This saves (as usually) 16 lines of code and more than 200 from the .text section. [akpm@linux-foundation.org: no externs in C] [akpm@linux-foundation.org: warning fixes] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>