From e0ca87391694dfacd01465d5c01c579c3b8b63e0 Mon Sep 17 00:00:00 2001 From: Evgeniy Polyakov Date: Fri, 27 Mar 2009 15:04:29 +0300 Subject: Staging: Pohmelfs: Added IO permissions and priorities. Signed-off-by: Evgeniy Polyakov Signed-off-by: Greg Kroah-Hartman --- Documentation/filesystems/pohmelfs/design_notes.txt | 5 +++-- Documentation/filesystems/pohmelfs/info.txt | 21 +++++++++++++++++---- 2 files changed, 20 insertions(+), 6 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt index 6d6db60d567..dcf83358716 100644 --- a/Documentation/filesystems/pohmelfs/design_notes.txt +++ b/Documentation/filesystems/pohmelfs/design_notes.txt @@ -56,9 +56,10 @@ workloads and can fully utilize the bandwidth to the servers when doing bulk data transfers. POHMELFS clients operate with a working set of servers and are capable of balancing read-only -operations (like lookups or directory listings) between them. +operations (like lookups or directory listings) between them according to IO priorities. Administrators can add or remove servers from the set at run-time via special commands (described -in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers. +in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers, which are connected +with write permission turned on. IO priority and permissions can be changed in run-time. POHMELFS is capable of full data channel encryption and/or strong crypto hashing. One can select any kernel supported cipher, encryption mode, hash type and operation mode diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt index 4e3d5015708..db2e4139362 100644 --- a/Documentation/filesystems/pohmelfs/info.txt +++ b/Documentation/filesystems/pohmelfs/info.txt @@ -1,6 +1,8 @@ POHMELFS usage information. -Mount options: +Mount options. +All but index, number of crypto threads and maximum IO size can changed via remount. + idx=%u Each mountpoint is associated with a special index via this option. Administrator can add or remove servers from the given index, so all mounts, @@ -52,16 +54,27 @@ mcache_timeout=%u Usage examples. -Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx +Add server server1.net:1025 into the working set with index $idx with appropriate hash algorithm and key file and cipher algorithm, mode and key file: -$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key +$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key Mount filesystem with given index $idx to /mnt mountpoint. Client will connect to all servers specified in the working set via previous command: mount -t pohmel -o idx=$idx q /mnt -One can add or remove servers from working set after mounting too. +Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw): +$cfg A modify -a server1.net -p 1025 -i $idx -I 1 + +Change IO priority to 123 (node with the highest priority gets read requests). +$cfg A modify -a server1.net -p 1025 -i $idx -P 123 +One can check currect status of all connections in the mountstats file: +# cat /proc/$PID/mountstats +... +device none mounted on /mnt with fstype pohmel +idx addr(:port) socket_type protocol active priority permissions +0 server1.net:1026 1 6 1 250 1 +0 server2.net:1025 1 6 1 123 3 Server installation. -- cgit v1.2.3 From 66672fefaa91802fec51c3fe0cc55bc9baea5a2d Mon Sep 17 00:00:00 2001 From: Adrian McMenamin Date: Mon, 20 Apr 2009 18:38:28 -0700 Subject: Documentation/filesystems: remove out of date reference to BKL being held Documentation/filesystems/vfs.txt incorrectly states that the kernel is locked during the call to statfs (Documentation/filesystems/Locking correctly says it is not). This patch removes the offending sentence. remove reference to BKL being held in statfs Signed-off-by: Adrian McMenamin Signed-off-by: Randy Dunlap Cc: Alexander Viro Signed-off-by: Al Viro --- Documentation/filesystems/vfs.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index deeeed0faa8..f49eecf2e57 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -277,8 +277,7 @@ or bottom half). unfreeze_fs: called when VFS is unlocking a filesystem and making it writable again. - statfs: called when the VFS needs to get filesystem statistics. This - is called with the kernel lock held + statfs: called when the VFS needs to get filesystem statistics. remount_fs: called when the filesystem is remounted. This is called with the kernel lock held -- cgit v1.2.3 From 91ac033d8377552d3654501a105ab55bf546940e Mon Sep 17 00:00:00 2001 From: Marc Dionne Date: Thu, 23 Apr 2009 11:21:55 +0100 Subject: CacheFiles: Fix the documentation to use the correct credential pointer names Adjust the CacheFiles documentation to use the correct names of the credential pointers in task_struct. The documentation was using names from the old versions of the credentials patches. Signed-off-by: Marc Dionne Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- Documentation/filesystems/caching/cachefiles.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt index c78a49b7bba..748a1ae49e1 100644 --- a/Documentation/filesystems/caching/cachefiles.txt +++ b/Documentation/filesystems/caching/cachefiles.txt @@ -407,7 +407,7 @@ A NOTE ON SECURITY ================== CacheFiles makes use of the split security in the task_struct. It allocates -its own task_security structure, and redirects current->act_as to point to it +its own task_security structure, and redirects current->cred to point to it when it acts on behalf of another process, in that process's context. The reason it does this is that it calls vfs_mkdir() and suchlike rather than @@ -429,9 +429,9 @@ This means it may lose signals or ptrace events for example, and affects what the process looks like in /proc. So CacheFiles makes use of a logical split in the security between the -objective security (task->sec) and the subjective security (task->act_as). The -objective security holds the intrinsic security properties of a process and is -never overridden. This is what appears in /proc, and is what is used when a +objective security (task->real_cred) and the subjective security (task->cred). +The objective security holds the intrinsic security properties of a process and +is never overridden. This is what appears in /proc, and is what is used when a process is the target of an operation by some other process (SIGKILL for example). -- cgit v1.2.3 From b827e496c893de0c0f142abfaeb8730a2fd6b37f Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Thu, 30 Apr 2009 15:08:16 -0700 Subject: mm: close page_mkwrite races Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil Cc: Trond Myklebust Signed-off-by: Nick Piggin Cc: Valdis Kletnieks Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/Locking | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 76efe5b71d7..3120f8dd2c3 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -512,16 +512,24 @@ locking rules: BKL mmap_sem PageLocked(page) open: no yes close: no yes -fault: no yes -page_mkwrite: no yes no +fault: no yes can return with page locked +page_mkwrite: no yes can return with page locked access: no yes - ->page_mkwrite() is called when a previously read-only page is -about to become writeable. The file system is responsible for -protecting against truncate races. Once appropriate action has been -taking to lock out truncate, the page range should be verified to be -within i_size. The page mapping should also be checked that it is not -NULL. + ->fault() is called when a previously not present pte is about +to be faulted in. The filesystem must find and return the page associated +with the passed in "pgoff" in the vm_fault structure. If it is possible that +the page may be truncated and/or invalidated, then the filesystem must lock +the page, then ensure it is not already truncated (the page lock will block +subsequent truncate), and then return with VM_FAULT_LOCKED, and the page +locked. The VM will unlock the page. + + ->page_mkwrite() is called when a previously read-only pte is +about to become writeable. The filesystem again must ensure that there are +no truncate/invalidate races, and then return with the page locked. If +the page has been truncated, the filesystem should not look up a new page +like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which +will cause the VM to retry the fault. ->access() is called when get_user_pages() fails in acces_process_vm(), typically used to debug a process through -- cgit v1.2.3