aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2006-01-09[SCSI] Kill the SCSI softirq handlingJens Axboe
This patch moves the SCSI softirq handling to the block layer version. There should be no functional changes. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-01-09[BLOCK] ll_rw_blk: Enable out-of-order request completions through softirqJens Axboe
Request completion can be a quite heavy process, since it needs to iterate through the entire request and complete the bio's it holds. This patch adds blk_complete_request() which moves this processing into a dedicated block softirq. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-01-08[PATCH] remove semicolons from save_flags()Andrew Morton
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Eliminate __attribute__ ((packed)) warnings for gcc-4.1Jan Blunck
Since version 4.1 the gcc is warning about ignored attributes. This patch is using the equivalent attribute on the struct instead of on each of the structure or union members. GCC Manual: "Specifying Attributes of Types packed This attribute, attached to struct or union type definition, specifies that each member of the structure or union is placed to minimize the memory required. When attached to an enum definition, it indicates that the smallest integral type should be used. Specifying this attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members." Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] parport: bring back an unused phase for ppdev ioctlMarko Kohtala
Earlier fix removed unused phase, but that changed the values for other phases. Since these are exposed to userspace through ppdev, it is safer not to change them. Restore the unused phase value. Signed-off-by: Marko Kohtala <marko.kohtala@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Split out screen_info from tty.hBrian Gerst
This makes it possible for boot code to use screen_info without dragging in all of tty.h. Signed-off-by: Brian Gerst <bgerst@didntduck.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] PTRACE_SYSEMU is only for i386 and clashes with other ptrace codes ↵Paolo 'Blaisorblade' Giarrusso
of other archs PTRACE_SYSEMU{,_SINGLESTEP} is actually arch specific, for now, and the current allocated number clashes with a ptrace code of frv, i.e. PTRACE_GETFDPIC. I should have submitted this much earlier, anyway we get no breakage for this. CC: Daniel Jacobowitz <dan@debian.org> Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] shrink struct pageAndrew Morton
Reduce the size of the pageframe for NR_CPUS>4, CONFIG_PREEMPT back to the minimal size by unionising both ->private and ->mapping with the pagetable lock. It uses an anonymous struct and hence requires gcc-3.x. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] aio: reorder kiocb structure elements to make sync iocb setup fasterBenjamin LaHaise
Reorder members of the kiocb structure to make sync kiocb setup faster. By setting the elements sequentially, the write combining buffers on the CPU are able to combine the writes into a single burst, which results in fewer cache cycles being consumed, freeing them up for other code. This results in a 10-20KB/s[*] increase on the bw_unix part of LMbench on my test system. * The improvement varies based on what other patches are in the system, as there are a number of bottlenecks, so this number is not absolutely accurate. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] remove gcc-2 checksAndrew Morton
Remove various things which were checking for gcc-1.x and gcc-2.x compilers. From: Adrian Bunk <bunk@stusta.de> Some documentation updates and removes some code paths for gcc < 3.2. Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Abandon gcc-2.95.xAndrew Morton
There's one scsi driver which doesn't compile due to weird __VA_ARGS__ tricks and the rather useful scsi/sd.c is currently getting an ICE. None of the new SAS code compiles, due to extensive use of anonymous unions. The V4L guys are very good at exploiting the gcc-2.95.x macro expansion bug (_why_ does each driver need to implement its own debug macros?) and various people keep on sneaking in anonymous unions, which are rather nice. Plus anonymous unions are rather useful. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] remove unused blkp field in percpu_dataEric Dumazet
I found that blkp field was not used in kernel tree. As most of the times NR_CPUS is a power of two and kmalloc() memory blocks too, this extra field basically doubles the memory space allocated in __alloc_percpu() to store the 'struct percpu_data' (for example, if NR_CPUS=8 on i386, kmalloc(4*8+4) returns a 64 bytes block instead of a 32 bytes block after this patch) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fs: remove s_old_blocksize from struct super_blockPekka Enberg
This patch inlines the single user of struct super_block field s_old_blocksize and removes the field. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] shrink dentry structEric Dumazet
Some long time ago, dentry struct was carefully tuned so that on 32 bits UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple of memory cache lines. Then RCU was added and dentry struct enlarged by two pointers, with nice results for SMP, but not so good on UP, because breaking the above tuning (128 + 8 = 136 bytes) This patch reverts this unwanted side effect, by using an union (d_u), where d_rcu and d_child are placed so that these two fields can share their memory needs. At the time d_free() is called (and d_rcu is really used), d_child is known to be empty and not touched by the dentry freeing. Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so the previous content of d_child is not needed if said dentry was unhashed but still accessed by a CPU because of RCU constraints) As dentry cache easily contains millions of entries, a size reduction is worth the extra complexity of the ugly C union. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Ian Kent <raven@themaw.net> Cc: Paul Jackson <pj@sgi.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] shared mounts: cleanupMiklos Szeredi
Small cleanups in shared mounts code. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Ram Pai <linuxram@us.ibm.com> Cc: <viro@parcelfarce.linux.theplanet.co.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Add block_device_operations.getgeo block device methodChristoph Hellwig
HDIO_GETGEO is implemented in most block drivers, and all of them have to duplicate the code to copy the structure to userspace, as well as getting the start sector. This patch moves that to common code [1] and adds a ->getgeo method to fill out the raw kernel hd_geometry structure. For many drivers this means ->ioctl can go away now. [1] the s390 block drivers are odd in this respect. xpram sets ->start to 4 always which seems more than odd, and the dasd driver shifts the start offset around, probably because of it's non-standard sector size. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@suse.de> Cc: <mike.miller@hp.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: Markus Lidel <Markus.Lidel@shadowconnect.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] sigaction should clear all signals on SIG_IGN, not just < 32George Anzinger
While rooting aroung in the signal code trying to understand how to fix the SIG_IGN ploy (set sig handler to SIG_IGN and flood system with high speed repeating timers) I came across what, I think, is a problem in sigaction() in that when processing a SIG_IGN request it flushes signals from 1 to SIGRTMIN and leaves the rest. Attempt to fix this. Signed-off-by: George Anzinger <george@mvista.com> Cc: Roland McGrath <roland@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] keys: Permit running process to instantiate keysDavid Howells
Make it possible for a running process (such as gssapid) to be able to instantiate a key, as was requested by Trond Myklebust for NFS4. The patch makes the following changes: (1) A new, optional key type method has been added. This permits a key type to intercept requests at the point /sbin/request-key is about to be spawned and do something else with them - passing them over the rpc_pipefs files or netlink sockets for instance. The uninstantiated key, the authorisation key and the intended operation name are passed to the method. (2) The callout_info is no longer passed as an argument to /sbin/request-key to prevent unauthorised viewing of this data using ps or by looking in /proc/pid/cmdline. This means that the old /sbin/request-key program will not work with the patched kernel as it will expect to see an extra argument that is no longer there. A revised keyutils package will be made available tomorrow. (3) The callout_info is now attached to the authorisation key. Reading this key will retrieve the information. (4) A new field has been added to the task_struct. This holds the authorisation key currently active for a thread. Searches now look here for the caller's set of keys rather than looking for an auth key in the lowest level of the session keyring. This permits a thread to be servicing multiple requests at once and to switch between them. Note that this is per-thread, not per-process, and so is usable in multithreaded programs. The setting of this field is inherited across fork and exec. (5) A new keyctl function (KEYCTL_ASSUME_AUTHORITY) has been added that permits a thread to assume the authority to deal with an uninstantiated key. Assumption is only permitted if the authorisation key associated with the uninstantiated key is somewhere in the thread's keyrings. This function can also clear the assumption. (6) A new magic key specifier has been added to refer to the currently assumed authorisation key (KEY_SPEC_REQKEY_AUTH_KEY). (7) Instantiation will only proceed if the appropriate authorisation key is assumed first. The assumed authorisation key is discarded if instantiation is successful. (8) key_validate() is moved from the file of request_key functions to the file of permissions functions. (9) The documentation is updated. From: <Valdis.Kletnieks@vt.edu> Build fix. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Alexander Zangerl <az@bond.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] keys: Permit key expiry time to be setDavid Howells
Add a new keyctl function that allows the expiry time to be set on a key or removed from a key, provided the caller has attribute modification access. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Alexander Zangerl <az@bond.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Fix some problems with truncate and mtime semantics.NeilBrown
SUS requires that when truncating a file to the size that it currently is: truncate and ftruncate should NOT modify ctime or mtime O_TRUNC SHOULD modify ctime and mtime. Currently mtime and ctime are always modified on most local filesystems (side effect of ->truncate) or never modified (on NFS). With this patch: ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when an update of these times is required whether size changes or not (via a new argument to do_truncate). This allows NFS to do the right thing for O_TRUNC. inode_setattr nolonger forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE sets the size to it's current value. This allows local filesystems to do the right thing for f?truncate. Also, the logic in inode_setattr is changed a bit so there are two return points. One returns the error from vmtruncate if it failed, the other returns 0 (there can be no other failure). Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change requested, we now fall-through and mark_inode_dirty. If a filesystem did not have a ->truncate function, then vmtruncate will have changed i_size, without marking the inode as 'dirty', and I think this is wrong. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Permit multiple inclusion of linux/pagevec.hDavid Howells
Make it possible to include linux/pagevec.h multiple times without incurring errors due to duplicate definitions. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] use ptrace_get_task_struct in various placesChristoph Hellwig
The ptrace_get_task_struct() helper that I added as part of the ptrace consolidation is useful in variety of places that currently opencode it. Switch them to the common helpers. Add a ptrace_traceme() helper that needs to be explicitly called, and simplify the ptrace_get_task_struct() interface. We don't need the request argument now, and we return the task_struct directly, using ERR_PTR() for error returns. It's a bit more code in the callers, but we have two sane routines that do one thing well now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: cleanup, change relayfs_file_* to relay_file_*Tom Zanussi
This patch renames relayfs_file_operations to relay_file_operations, and the file operations themselves from relayfs_XXX to relay_file_XXX, to make it more clear that they refer to relay files. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: add support for global relay buffersTom Zanussi
This patch adds the optional is_global outparam to the create_buf_file() callback. This can be used by clients to create a single global relayfs buffer instead of the default per-cpu buffers. This was suggested as being useful for certain debugging applications where it's more convenient to be able to get all the data from a single channel without having to go to the bother of dealing with per-cpu files. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: add support for relay files in other filesystemsTom Zanussi
This patch adds a couple of callback functions that allow a client to hook into relay_open()/close() and supply the files that will be used to represent the channel buffers; the default implementation if no callbacks are defined is to create the files in relayfs. This is to support the creation and use of relay files in other filesystems such as debugfs, as implied by the fact that relayfs_file_operations are exported. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: remove unused alloc/destroy_inode()Tom Zanussi
Since we're no longer using relayfs_inode_info, remove relayfs_alloc_inode() and relayfs_destroy_inode() along with the relayfs inode cache. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: add relayfs_remove_file()Tom Zanussi
This patch adds and exports relayfs_remove_file(), for API symmetry (with relayfs_create_file()). Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] relayfs: export relayfs_create_file() with fileops paramTom Zanussi
This patch adds a mandatory fileops param to relayfs_create_file() and exports that function so that clients can use it to create files defined by their own set of file operations, in relayfs. The purpose is to allow relayfs applications to create their own set of 'control' files alongside their relay files in relayfs rather than having to create them in /proc or debugfs for instance. relayfs_create_file() is also used by relay_open_buf() to create the relay files for a channel. In this case, a pointer to relayfs_file_operations is passed in, along with a pointer to the buffer associated with the file. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] ELF: symbol table type additionsJan Beulich
Needed for the Novell kernel debugger and perhaps some per-cpu data on x86_64 in the future. Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] rcu file: use atomic primitivesNick Piggin
Use atomic_inc_not_zero for rcu files instead of special case rcuref. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] move rtc_interrupt() prototype to rtc.hAdrian Bunk
This patch moves the rtc_interrupt() prototype to rtc.h and removes the prototypes from C files. It also renames static rtc_interrupt() functions in arch/arm/mach-integrator/time.c and arch/sh64/kernel/time.c to avoid compile problems. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Paul Gortmaker <p_gortmaker@yahoo.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fat: support a truncate() for expanding size (generic_cont_expand)OGAWA Hirofumi
This patch changes generic_cont_expand(), in order to share the code with fatfs. - Use vmtruncate() if ->prepare_write() returns a error. Even if ->prepare_write() returns an error, it may already have added some blocks. So, this truncates blocks outside of ->i_size by vmtruncate(). - Add generic_cont_expand_simple(). The generic_cont_expand_simple() assumes that ->prepare_write() can handle the block boundary. With this, we don't need to care the extra byte. And for expanding a file size by truncate(), fatfs uses the added generic_cont_expand_simple(). Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] export/change sync_page_range/_nolock()OGAWA Hirofumi
This exports/changes the sync_page_range/_nolock(). The fatfs needs sync_page_range/_nolock() for expanding truncate, and changes "size_t count" to "loff_t count". Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fat: support ->direct_IO()OGAWA Hirofumi
This patch add to support of ->direct_IO() for mostly read. The user of this seems to want to use for streaming read. So, current direct I/O has limitation, it can only overwrite. (For write operation, mainly we need to handle the hole etc..) Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] IRQ type flagsRussell King
Some ARM platforms have the ability to program the interrupt controller to detect various interrupt edges and/or levels. For some platforms, this is critical to setup correctly, particularly those which the setting is dependent on the device. Currently, ARM drivers do (eg) the following: err = request_irq(irq, ...); set_irq_type(irq, IRQT_RISING); However, if the interrupt has previously been programmed to be level sensitive (for whatever reason) then this will cause an interrupt storm. Hence, if we combine set_irq_type() with request_irq(), we can then safely set the type prior to unmasking the interrupt. The unfortunate problem is that in order to support this, these flags need to be visible outside of the ARM architecture - drivers such as smc91x need these flags and they're cross-architecture. Finally, the SA_TRIGGER_* flag passed to request_irq() should reflect the property that the device would like. The IRQ controller code should do its best to select the most appropriate supported mode. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] new driver synclink_gtPaul Fulghum
New character device driver for the SyncLink GT and SyncLink AC families of synchronous and asynchronous serial adapters Signed-off-by: Paul Fulghum <paulkf@microgate.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fix more missing includesTim Schmielau
Include fixes for 2.6.14-git11. Should allow to remove sched.h from module.h on i386, x86_64, arm, ia64, ppc, ppc64, and s390. Probably more to come since I haven't yet checked the other archs. Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: remove test for null cpuset from alloc code pathPaul Jackson
Remove a couple of more lines of code from the cpuset hooks in the page allocation code path. There was a check for a NULL cpuset pointer in the routine cpuset_update_task_memory_state() that was only needed during system boot, after the memory subsystem was initialized, before the cpuset subsystem was initialized, to catch a NULL task->cpuset pointer. Add a cpuset_init_early() routine, just before the mem_init() call in init/main.c, that sets up just enough of the init tasks cpuset structure to render cpuset_update_task_memory_state() calls harmless. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: rebind vma mempolicies fixPaul Jackson
Fix more of longstanding bug in cpuset/mempolicy interaction. NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains internal state for each mempolicy, tracking what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies. When a tasks cpuset memory placement changes, whether because the cpuset changed, or because the task was attached to a different cpuset, then the tasks mempolicies have to be rebound to the new cpuset placement, so as to preserve the cpuset-relative numbering of the nodes in that policy. An earlier fix handled such mempolicy rebinding for mempolicies attached to a task. This fix rebinds mempolicies attached to vma's (address ranges in a tasks address space.) Due to the need to hold the task->mm->mmap_sem semaphore while updating vma's, the rebinding of vma mempolicies has to be done when the cpuset memory placement is changed, at which time mmap_sem can be safely acquired. The tasks mempolicy is rebound later, when the task next attempts to allocate memory and notices that its task->cpuset_mems_generation is out-of-date with its cpusets mems_generation. Because walking the tasklist to find all tasks attached to a changing cpuset requires holding tasklist_lock, a spinlock, one cannot update the vma's of the affected tasks while doing the tasklist scan. In general, one cannot acquire a semaphore (which can sleep) while already holding a spinlock (such as tasklist_lock). So a list of mm references has to be built up during the tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem acquired, and the vma's in that mm rebound. Once the tasklist lock is dropped, affected tasks may fork new tasks, before their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to point to the cpuset being rebound (there can only be one; cpuset modifications are done under a global 'manage_sem' semaphore), and the mpol_copy code that is used to copy a tasks mempolicies during fork catches such forking tasks, and ensures their children are also rebound. When a task is moved to a different cpuset, it is easier, as there is only one task involved. It's mm->vma's are scanned, using the same mpol_rebind_policy() as used above. It may happen that both the mpol_copy hook and the update done via the tasklist scan update the same mm twice. This is ok, as the mempolicies of each vma in an mm keep track of what mems_allowed they are relative to, and safely no-op a second request to rebind to the same nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: number_of_cpusets optimizationPaul Jackson
Easy little optimization hack to avoid actually having to call cpuset_zone_allowed() and check mems_allowed, in the main page allocation routine, __alloc_pages(). This saves several CPU cycles per page allocation on systems not using cpusets. A counter is updated each time a cpuset is created or removed, and whenever there is only one cpuset in the system, it must be the root cpuset, which contains all CPUs and all Memory Nodes. In that case, when the counter is one, all allocations are allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: numa_policy_rebind cleanupPaul Jackson
Cleanup, reorganize and make more robust the mempolicy.c code to rebind mempolicies relative to the containing cpuset after a tasks memory placement changes. The real motivator for this cleanup patch is to lay more groundwork for the upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's after the containing cpuset memory placement changes. NUMA mempolicies are constrained by the cpuset their task is a member of. When either (1) a task is moved to a different cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be recalculated, relative to their new cpuset placement. The old code used an unreliable method of determining what was the old mems_allowed constraining the mempolicy. It just looked at the tasks mems_allowed value. This sort of worked with the present code, that just rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken, referring to the old nodes. But in an upcoming patch, the vma mempolicies will be rebound as well. Then the order in which the various task and vma mempolicies are updated will no longer be deterministic, and one can no longer count on the task->mems_allowed holding the old value for as long as needed. It's not even clear if the current code was guaranteed to work reliably for task mempolicies. So I added a mems_allowed field to each mempolicy, stating exactly what mems_allowed the policy is relative to, and updated synchronously and reliably anytime that the mempolicy is rebound. Also removed a useless wrapper routine, numa_policy_rebind(), and had its caller, cpuset_update_task_memory_state(), call directly to the rewritten policy_rebind() routine, and made that rebind routine extern instead of static, and added a "mpol_" prefix to its name, making it mpol_rebind_policy(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: implement cpuset_mems_allowedPaul Jackson
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code needed, to obtain the mems_allowed vector of a cpuset, and replaced the workaround in sys_migrate_pages() to call this new method. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: combine refresh_mems and update_memsPaul Jackson
The important code paths through alloc_pages_current() and alloc_page_vma(), by which most kernel page allocations go, both called cpuset_update_current_mems_allowed(), which in turn called refresh_mems(). -Both- of these latter two routines did a tasklock, got the tasks cpuset pointer, and checked for out of date cpuset->mems_generation. That was a silly duplication of code and waste of CPU cycles on an important code path. Consolidated those two routines into a single routine, called cpuset_update_task_memory_state(), since it updates more than just mems_allowed. Changed all callers of either routine to call the new consolidated routine. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: memory pressure meterPaul Jackson
Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: mempolicy one more nodemask conversionPaul Jackson
Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous conversion had left one routine using bitmaps, since it involved a corresponding change to kernel/cpuset.c Fix that interface by replacing with a simple macro that calls nodes_subset(), or if !CONFIG_CPUSET, returns (1). Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] slob: introduce the SLOB allocatorMatt Mackall
configurable replacement for slab allocator This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is disabled, the kernel falls back to using the 'SLOB' allocator. SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer, similar to the original Linux kmalloc allocator that SLAB replaced. It's signicantly smaller code and is more memory efficient. But like all similar allocators, it scales poorly and suffers from fragmentation more than SLAB, so it's only appropriate for small systems. It's been tested extensively in the Linux-tiny tree. I've also stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not recommended). Here's a comparison for otherwise identical builds, showing SLOB saving nearly half a megabyte of RAM: $ size vmlinux* text data bss dec hex filename 3336372 529360 190812 4056544 3de5e0 vmlinux-slab 3323208 527948 190684 4041840 3dac70 vmlinux-slob $ size mm/{slab,slob}.o text data bss dec hex filename 13221 752 48 14021 36c5 mm/slab.o 1896 52 8 1956 7a4 mm/slob.o /proc/meminfo: SLAB SLOB delta MemTotal: 27964 kB 27980 kB +16 kB MemFree: 24596 kB 25092 kB +496 kB Buffers: 36 kB 36 kB 0 kB Cached: 1188 kB 1188 kB 0 kB SwapCached: 0 kB 0 kB 0 kB Active: 608 kB 600 kB -8 kB Inactive: 808 kB 812 kB +4 kB HighTotal: 0 kB 0 kB 0 kB HighFree: 0 kB 0 kB 0 kB LowTotal: 27964 kB 27980 kB +16 kB LowFree: 24596 kB 25092 kB +496 kB SwapTotal: 0 kB 0 kB 0 kB SwapFree: 0 kB 0 kB 0 kB Dirty: 4 kB 12 kB +8 kB Writeback: 0 kB 0 kB 0 kB Mapped: 560 kB 556 kB -4 kB Slab: 1756 kB 0 kB -1756 kB CommitLimit: 13980 kB 13988 kB +8 kB Committed_AS: 4208 kB 4208 kB 0 kB PageTables: 28 kB 28 kB 0 kB VmallocTotal: 1007312 kB 1007312 kB 0 kB VmallocUsed: 48 kB 48 kB 0 kB VmallocChunk: 1007264 kB 1007264 kB 0 kB (this work has been sponsored in part by CELF) From: Ingo Molnar <mingo@elte.hu> Fix 32-bitness bugs in mm/slob.c. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] remove get_task_struct_rcu()Paul E. McKenney
The latest set of signal-RCU patches does not use get_task_struct_rcu(). Attached is a patch that removes it. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] RCU signal handlingIngo Molnar
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked instead of tasklist_lock read-locked. This is a scalability improvement on SMP and a preemption-latency improvement under PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: William Irwin <wli@holomorphy.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp ↵Ravikiran G Thirumalai
macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] mempolicies: unexport get_vma_policy()Christoph Lameter
Since the numa_maps functionality is now in mempolicy.c we no longer need to export get_vma_policy(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>