aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2007-02-05[GFS2/DLM] use sysfsRandy Dunlap
With CONFIG_DLM=m, CONFIG_PROC_FS=n, and CONFIG_SYSFS=n, kernel build fails with: WARNING: "kernel_subsys" [fs/gfs2/locking/dlm/lock_dlm.ko] undefined! WARNING: "kernel_subsys" [fs/dlm/dlm.ko] undefined! WARNING: "kernel_subsys" [fs/configfs/configfs.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 Since fs/dlm/lockspace.c and fs/gfs2/locking/dlm/sysfs.c use kernel_subsys, they should either DEPEND on it or SELECT it. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] make lock_dlm drop_count tunable in sysfsDavid Teigland
We want to be able to change or disable the default drop_count (number at which the dlm asks gfs to limit the the number of locks it's holding). Add it to the collection of sysfs tunables for an fs. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] increase default lock limitDavid Teigland
Increase the number of locks at which point the dlm begins asking gfs to reduce its lock usage. The default value is largely arbitrary, but the current value of 50,000 ends up limiting performance unnecessarily for too many users. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix list corruption in lops.cSteven Whitehouse
The patch below appears to fix the list corruption that we are seeing on occasion. Although the transaction structure is private to a single thread, when the queued structures are dismantled during an in-core commit, its possible for a different thread to be trying to add the same structure to another, new, transaction at the same time. To avoid this, this patch takes the log spinlock during this operation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix recursive locking attempt with NFSSteven Whitehouse
In certain cases, its possible for NFS to call the lookup code while holding the glock (when doing a readdirplus operation) so we need to check for that and not try and lock the glock twice. This also fixes a typo in a previous NFS related GFS2 patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] can miss clearing resend flagDavid Teigland
A long, complicated sequence of events, beginning with the RESEND flag not being cleared on an lkb, can result in an unlock never completing. - lkb on waiters list for remote lookup - the remote node is both the dir node and the master node, so it optimizes the lookup into a request and sends a request reply back - the request reply is saved on the requestqueue to be processed after recovery - recovery runs dlm_recover_waiters_pre() which sets RESEND flag so the lookup will be resent after recovery - end of recovery: process_requestqueue takes saved request reply which removes the lkb off the waitesr list, _without_ clearing the RESEND flag - end of recovery: dlm_recover_waiters_post() doesn't do anything with the now completed lookup lkb (would usually clear RESEND) - later, the node unmounts, unlocks this lkb that still has RESEND flag set - the lkb is on the waiters list again, now for unlock, when recovery occurs, dlm_recover_waiters_pre() shows the lkb for unlock with RESEND set, doesn't do anything since the master still exists - end of recovery: dlm_recover_waiters_post() takes this lkb off the waiters list because it has the RESEND flag set, then reports an error because unlocks are never supposed to be handled in recover_waiters_post(). - later, the unlock reply is received, doesn't find the lkb on the waiters list because recover_waiters_post() has wrongly removed it. - the unlock operation has been lost, and we're left with a stray granted lock - unmount spins waiting for the unlock to complete The visible evidence of this problem will be a node where gfs umount is spinning, the dlm waiters list will be empty, and the dlm locks list will show a granted lock. The fix is simply to clear the RESEND flag when taking an lkb off the waiters list. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] saved dlm message can be droppedDavid Teigland
dlm_receive_message() returns 0 instead of returning 'error'. What would happen is that process_requestqueue would take a saved message off the requestqueue and call receive_message on it. receive_message would then see that recovery had been aborted, set error to EINTR, and 'goto out', expecting that the error would be returned. Instead, 0 was always returned, so process_requestqueue would think that the message had been processed and delete it instead of saving it to process next time. This means the message (usually an unlock in my tests) would be lost. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] Make sock_sem into a mutexPatrick Caulfield
Now that there can be multiple dlm_recv threads running we need to prevent two recvs running for the same connection - it's unlikely but it can happen and it causes message corruption. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix typo in glock.cSteven Whitehouse
This is a one letter typo fix in glock.c, spotted by Rob Kenna. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] use CURRENT_TIME_SEC instead of get_seconds in gfs2Eric Sandeen
I was looking something else up and came across this... I don't honestly have a good reason to change it other than to make it like every other Linux filesystem in this regard. ;-) It doesn't functionally change anything, but makes some lines shorter. :) I'm also curious; why does gfs2 have 64-bits of on-disk timestamps, but not in timespec_t format, and only stores second resolutions? Seems like you're halfway to sub-second resolutions already. I suppose if that gets implemented then all of the below should instead be CURRENT_TIME not CURRENT_TIME_SEC. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Compile fix for glock.cSteven Whitehouse
This one liner got missed from the previous patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Remove queue_empty() functionSteven Whitehouse
This function is not longer required since we do not do recursive locking in the glock layer. As a result all its callers can be replaceed with list_empty() calls. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix lowcomms receivingPatrick Caulfield
This patch fixes a bug whereby data on a newly accepted connection would be ignored if it arrived soon after the accept. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Tidy up glops callsSteven Whitehouse
This patch doesn't make any changes to the ordering of the various operations related to glocking, but it does tidy up the calls to the glops.c functions to make the structure more obvious. The two functions: gfs2_glock_xmote_th() and gfs2_glock_drop_th() can be made static within glock.c since they are called by every set of glock operations. The xmote_th and drop_th glock operations are then made conditional upon those two routines existing and called from the previously mentioned functions in glock.c respectively. Also it can be seen that the go_sync operation isn't needed since it can easily be replaced by calls to xmote_bh and drop_bh respectively. This results in no longer (confusingly) calling back into routines in glock.c from glops.c and also reducing the glock operations by one member. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] lowcomms tidyPatrick Caulfield
This patch removes some redundant fields from the connection structure and adds some lockdep annotation to remove spurious warnings. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Remove local exclusive glock modeSteven Whitehouse
Here is a patch for GFS2 to remove the local exclusive flag. In the places it was used, mutex's are always held earlier in the call path, so it appears redundant in the LM_ST_SHARED case. Also, the GFS2 holders were setting local exclusive in any case where the requested lock was LM_ST_EXCLUSIVE. So the other places in the glock code where the flag was tested have been replaced with tests for the lock state being LM_ST_EXCLUSIVE in order to ensure the logic is the same as before (i.e. LM_ST_EXCLUSIVE is always locally exclusive as well as globally exclusive). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Remove unused go_callback operationSteven Whitehouse
This is never used, so we might as well remove it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Remove the "greedy" function from glock.[ch]Steven Whitehouse
The "greedy" code was an attempt to retain glocks for a minimum length of time when they relate to mmap()ed files. The current implementation of this feature is not, however, ideal in that it required allocating memory in order to do this and its overly complicated. It also misses the mark by ignoring the other I/O operations which are just as likely to suffer from the same problem. So the plan is to remove this now and then add the functionality back as part of the glock state machine at a later date (and thus take into account all the possible users of this feature) Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Shrink gfs2_inode memory by halfSteven Whitehouse
Here is something I spotted (while looking for something entirely different) the other day. Rather than using a completion in each and every struct gfs2_holder, this removes it in favour of hashed wait queues, thus saving a considerable amount of memory both on the stack (where a number of gfs2_holder structures are allocated) and in particular in the gfs2_inode which has 8 gfs2_holder structures embedded within it. As a result on x86_64 the gfs2_inode shrinks from 2488 bytes to 1912 bytes, a saving of 576 bytes per inode (no thats not a typo!). In actual practice we get a much better result than that since now that a gfs2_inode is under the 2048 byte barrier, we get two per 4k slab page effectively halving the amount of memory required to store gfs2_inodes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Remove max_atomic_write tunableSteven Whitehouse
This removes an unused sysfs tunable parameter. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Clean up/speed up readdirSteven Whitehouse
This removes the extra filldir callback which gfs2 was using to enclose an attempt at readahead for inodes during readdir. The code was too complicated and also hurts performance badly in the case that the getdents64/readdir call isn't being followed by stat() and it wasn't even getting it right all the time when it was. As a result, on my test box an "ls" of a directory containing 250000 files fell from about 7mins (freshly mounted, so nothing cached) to between about 15 to 25 seconds. When the directory content was cached, the time taken fell from about 3mins to about 4 or 5 seconds. Interestingly in the cached case, running "ls -l" once reduced the time taken for subsequent runs of "ls" to about 6 secs even without this patch. Now it turns out that there was a special case of glocks being used for prefetching the metadata, but because of the timeouts for these locks (set to 10 secs) the metadata was being timed out before it was being used and this the prefetch code was constantly trying to prefetch the same data over and over. Calling "ls -l" meant that the inodes were brought into memory and once the inodes are cached, the glocks are not disposed of until the inodes are pushed out of the cache, thus extending the lifetime of the glocks, and thus bringing down the time for subsequent runs of "ls" considerably. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Add writepages for "data=writeback" mountsSteven Whitehouse
It occurred to me that although a gfs2 specific writepages for ordered writes and journaled data would be tricky, by hooking writepages only for "data=writeback" mounts we could take advantage of not needing buffer heads (we don't use them on the read side, nor have we for some time) and create much larger I/Os for the block layer. Using blktrace both before and after, its possible to see that for large I/Os, most of the requests generated through writepages are now 1024 sectors after this patch is applied as opposed to 8 sectors before. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix master recoveryDavid Teigland
If master recovery happens on an rsb in one recovery sequence, then that sequence is aborted before lock recovery happens, then in the next sequence, we rely on the previous master recovery (which may now be invalid due to another node ignoring a lookup result) and go on do to the lock recovery where we get stuck due to an invalid master value. recovery cycle begins: master of rsb X has left nodes A and B send node C an rcom lookup for X to find the new master C gets lookup from B first, sets B as new master, and sends reply back to B C gets lookup from A next, and sends reply back to A saying B is master A gets lookup reply from C and sets B as the new master in the rsb recovery cycle on A, B and C is aborted to start a new recovery B gets lookup reply from C and ignores it since there's a new recovery recovery cycle begins: some other node has joined B doesn't think it's the master of X so it doesn't rebuild it in the directory C looks up the master of X, no one is master, so it becomes new master B looks up the master of X, finds it's C A believes that B is the master of X, so it sends its lock to B B sends an error back to A A resends this repeats forever, the incorrect master value on A is never corrected The fix is to do master recovery on an rsb that still has the NEW_MASTER flag set from an earlier recovery sequence, and therefore didn't complete lock recovery. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix user unlockingDavid Teigland
When a user process exits, we clear all the locks it holds. There is a problem, though, with locks that the process had begun unlocking before it exited. We couldn't find the lkb's that were in the process of being unlocked remotely, to flag that they are DEAD. To solve this, we move lkb's being unlocked onto a new list in the per-process structure that tracks what locks the process is holding. We can then go through this list to flag the necessary lkb's when clearing locks for a process when it exits. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] Use workqueues for dlm lowcommsPatrick Caulfield
This patch converts the DLM TCP lowcomms to use workqueues rather than using its own daemon functions. Simultaneously removing a lot of code and making it more scalable on multi-processor machines. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] make gfs2_change_nlink_i() staticAdrian Bunk
On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote: >... > Changes since 2.6.20-rc3-mm1: >... > git-gfs2-nmw.patch >... > git trees >... This patch makes the needlessly globlal gfs2_change_nlink_i() static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] gfs2 knows of directories which it chooses not to displayRobert Peterson
This is for Red Hat bugzilla bug bz #222302: Moving a virtual IP from node to node between two NFS-over-GFS2 servers was causing one of the GFS2 servers to become confused and reference a deleted inode. The problem was due to vfs dentries that did not reference the gfs2_dops and therefore didn't call the gfs2 revalidate code to revalidate a dentry after a directory had been deleted & recreated. This patch is a crosswrite from a RHEL4 bug found in GFS1 as bz #190756 and it is against the latest -nmw git tree. Signed-off-by: Robert Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] expose dlm_config_info fields in configfsDavid Teigland
Make the dlm_config_info values readable and writeable via configfs entries. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] add config entry to enable log_debugDavid Teigland
Add a new dlm_config_info field to enable log_debug output and change log_debug() to use it. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] rename dlm_config_info fieldsDavid Teigland
Add a "ci_" prefix to the fields in the dlm_config_info struct so that we can use macros to add configfs functions to access them (in a later patch). No functional changes in this patch, just naming changes. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] change some log_error to log_debugDavid Teigland
Some common, non-error messages should use log_debug instead of log_error so they can be turned off. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix gfs2_rename deadlockS. Wendy Cheng
Second round of gfs2_rename lock re-ordering to allow Anaconda adding root partition on top of gfs2. Previous to this patch the recursive lock detector in glock.c can be triggered due to attempting to lock the rgrp twice. This fixes it by checking to see whether the rgrp is already locked. This fixes Red Hat bugzilla #221237 Signed-off-by: S. Wendy Cheng <wcheng@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] BZ 217008 fsfuzzer fix.Russell Cattelan
Update the quilt header comments to match the code changes. Change gfs2_lookup_simple to return an error in the case of a NULL inode. The callers of gfs2_lookup_simple do not check for NULL in the no entry case and such would end up dereferencing a NULL ptr. This fixes: http://projects.info-pull.com/mokb/MOKB-15-11-2006.html Signed-off-by: Russell Cattelan <cattelan@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix ordering of page disposal vs. glock_dqSteven Whitehouse
In case of unlinked files with dirty pages GFS2 wasn't clearing the pages in quite the right order. This patch clears the pages earlier (before the qlock_dq) to avoid the situation that the release of the glock results in attempting to write back data that has already been deallocated. This fixes Red Hat bugzilla: #220117 Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] Fix spin lock already unlocked bugPatrick Caulfield
I just noticed this message when testing some other changes I'd made to lowcomms (to use workqueues) but the problem seems to be in the current git trees too. I'm amazed no-one has seen it. BUG: spinlock already unlocked on CPU#1, dlm_recoverd/16868 Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] Fix schedule() callsPatrick Caulfield
I was a little over-enthusiastic turning schedule() calls int cond_sched() when fixing the DLM for Andrew Morton. These four should really be calls to schedule() or the dlm can busy-wait. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fix change nlink deadlockS. Wendy Cheng
Bugzilla 215088 Fix deadlock in gfs2_change_nlink() while installing RHEL5 into GFS2 partition. The gfs2_rename() apparently needs block allocation for the new name (into the directory) where it requires rg locks. At the same time, while updating the nlink count for the replaced file, gfs2_change_nlink() tries to return the inode meta-data back to resource group where it needs rg locks too. Our logic doesn't allow process to acquire these locks recursively by the same process (RHEL installer) that results a BUG call. This only happens within rename code path and only if the destination file exists before the rename operation. Signed-off-by: S. Wendy Cheng <wcheng@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] Fail over to readpage for stuffed filesSteven Whitehouse
This is partially derrived from a patch written by Russell Cattelan. It fixes a bug where there is a race between readpages and truncate by ignoring readpages for stuffed files. This is ok because a stuffed file will never be more than one block (minus sizeof(struct gfs2_dinode)) in size and block size is always less than page size, so we do not lose anything efficiency-wise by not doing readahead for stuffed files. They will have already been "read ahead" by the action of reading the inode in, in the first place. This is the remaining part of the fix for Red Hat bugzilla #218966 which had not yet made it upstream. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Russell Cattelan <cattelan@redhat.com>
2007-02-05[GFS2] Fix DIO deadlockSteven Whitehouse
This patch fixes Red Hat bugzilla #212627 in which a deadlock occurs due to trying to take the i_mutex while holding a glock. The correct locking order is defined as i_mutex -> glock in all cases. I've left dealing with allocating writes. I know that we need to do that, but for now this should do the trick. We don't need to take the i_mutex on write, because the VFS has already taken it for us. On read we don't need it since the glock is enough protection. The reason that I've made some of the checks into a separate function is that we'll need to do the checks again in the allocating write case eventually, so this is partly in preparation for this. Likewise the return value test of != 1 might look a bit odd and thats because we'll need a third return value in case of requiring an allocation. I've made the change to deferred mode on the glock to ensure flushing read caches on other nodes. I notice that (using blktrace to look at whats going on) we appear to do a better job of large I/Os than ext3 after this patch (in terms of not splitting up the I/Os). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Wendy Cheng <wcheng@redhat.com>
2007-02-05[DLM] fs/dlm/lowcomms-tcp.c: remove 2 functionsAdrian Bunk
Remove the following unused functions: - lowcomms_send_message() - lowcomms_max_buffer_size() Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix lost flags in stub repliesDavid Teigland
When the dlm fakes an unlock/cancel reply from a failed node using a stub message struct, it wasn't setting the flags in the stub message. So, in the process of receiving the fake message the lkb flags would be updated and cleared from the zero flags in the message. The problem observed in tests was the loss of the USER flag which caused the dlm to think a user lock was a kernel lock and subsequently fail an assertion checking the validity of the ast/callback field. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix receive_request() lvb copyingDavid Teigland
LVB's are not sent as part of new requests, but the code receiving the request was copying data into the lvb anyway. The space in the message where it mistakenly thought the lvb lived actually contained the resource name, so it wound up incorrectly copying this name data into the lvb. Fix is to just create the lvb, not copy junk into it. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix send_args() lvb copyingDavid Teigland
The send_args() function is used to copy parameters into a message for a number different message types. Only some of those types are set up beforehand (in create_message) to include space for sending lvb data. send_args was wrongly copying the lvb for all message types as long as the lock had an lvb. This means that the lvb data was being written past the end of the message into unknown space. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] add version checkDavid Teigland
Check if we receive a message from another lockspace member running a version of the dlm with an incompatible inter-node message protocol. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix old rcom messagesDavid Teigland
A reply to a recovery message will often be received after the relevant recovery sequence has aborted and the next recovery sequence has begun. We need to ignore replies to these old messages from the previous recovery. There's already a way to do this for synchronous recovery requests using the rc_id number, but not for async. Each recovery sequence already has a locally unique sequence number associated with it. This patch adds a field to the rcom (recovery message) structure where this recovery sequence number can be placed, rc_seq. When a node sends a reply to a recovery request, it copies the rc_seq number it received into rc_seq_reply. When the first node receives the reply to its recovery message, it will check whether rc_seq_reply matches the current recovery sequence number, ls_recover_seq, and if not then it ignores the old reply. An old, inadequate approach to filtering out old replies (checking if the current stage of recovery has moved back to the start) has been removed from two spots. The protocol version number is changed to reflect the different rcom structures. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[DLM] fix resend rcom lockDavid Teigland
There's a chance the new master of resource hasn't learned it's the new master before another node sends it a lock during recovery. The node sending the lock needs to resend if this happens. - A sends a master lookup for resource R to C - B sends a master lookup for resource R to C - C receives A's lookup, assigns A to be master of R and sends a reply back to A - C receives B's lookup and sends a reply back to B saying that A is the master - B receives lookup reply from C and sends its lock for R to A - A receives lock from B, doesn't think it's the master of R and sends an error back to B - A receives lookup reply from C and becomes master of R - B gets error back from A and resends its lock back to A (this resending is what this patch does) - A receives lock from B, it now sees it's the master of R and takes the lock Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-05[GFS2] don't try to lockfs after shutdownDavid Teigland
If an fs has already been shut down, a lockfs callback should do nothing. An fs that's been shut down can't acquire locks or do anything with respect to the cluster. Also, remove FIXME comment in withdraw function. The missing bits of the withdraw procedure are now all done by user space. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-02-03[PATCH] revert blockdev direct io back to 2.6.19 versionAndrew Morton
Andrew Vasquez is reporting as-iosched oopses and a 65% throughput slowdown due to the recent special-casing of direct-io against blockdevs. We don't know why either of these things are occurring. The patch minimally reverts us back to the 2.6.19 code for a 2.6.20 release. Cc: Andrew Vasquez <andrew.vasquez@qlogic.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-03[PATCH] aio: fix buggy put_ioctx call in aio_complete - v2Ken Chen
An AIO bug was reported that sleeping function is being called in softirq context: BUG: warning at kernel/mutex.c:132/__mutex_lock_common() Call Trace: [<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0 [<a000000100577ba0>] mutex_lock+0x20/0x40 [<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0 [<a00000010018c0c0>] __put_ioctx+0xc0/0x240 [<a00000010018d470>] aio_complete+0x2f0/0x420 [<a00000010019cc80>] finished_one_bio+0x200/0x2a0 [<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200 [<a00000010019d260>] dio_bio_end_aio+0x60/0x80 [<a00000010014acd0>] bio_endio+0x110/0x1c0 [<a0000001002770e0>] __end_that_request_first+0x180/0xba0 [<a000000100277b90>] end_that_request_chunk+0x30/0x60 [<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod] [<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod] [<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod] [<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod] [<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod] [<a000000100277d20>] blk_done_softirq+0x160/0x1c0 [<a000000100083e00>] __do_softirq+0x200/0x240 [<a000000100083eb0>] do_softirq+0x70/0xc0 See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2 flush_workqueue() is not allowed to be called in the softirq context. However, aio_complete() called from I/O interrupt can potentially call put_ioctx with last ref count on ioctx and triggers bug. It is simply incorrect to perform ioctx freeing from aio_complete. The bug is trigger-able from a race between io_destroy() and aio_complete(). A possible scenario: cpu0 cpu1 io_destroy aio_complete wait_for_all_aios { __aio_put_req ... ctx->reqs_active--; if (!ctx->reqs_active) return; } ... put_ioctx(ioctx) put_ioctx(ctx); __put_ioctx bam! Bug trigger! The real problem is that the condition check of ctx->reqs_active in wait_for_all_aios() is incorrect that access to reqs_active is not being properly protected by spin lock. This patch adds that protective spin lock, and at the same time removes all duplicate ref counting for each kiocb as reqs_active is already used as a ref count for each active ioctx. This also ensures that buggy call to flush_workqueue() in softirq context is eliminated. Signed-off-by: "Ken Chen" <kenchen@google.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: <stable@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-01[PATCH] procfs: Fix listing of /proc/NOT_A_TGID/taskGuillaume Chazarain
Listing /proc/PID/task were PID is not a TGID should not result in duplicated entries. [g ~]$ pidof thunderbird-bin 2751 [g ~]$ ls /proc/2751/task 2751 2770 2771 2824 2826 2834 2835 2851 2853 [g ~]$ ls /proc/2770/task 2751 2770 2771 2824 2826 2834 2835 2851 2853 2770 2771 2824 2826 2834 2835 2851 2853 [g ~]$ Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>