aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2006-02-05mm/slab.c (non-NUMA): Fix compile warning and clean up codeLinus Torvalds
The non-NUMA case would do an unmatched "free_alien_cache()" on an alien pointer that had never been allocated. It might not matter from a code generation standpoint (since in the non-NUMA case, the code doesn't actually _do_ anything), but it not only results in a compiler warning, it's really really ugly too. Fix the compiler warning by just having a matching dummy allocation. That also avoids an unnecessary #ifdef in the code. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] NUMA slab locking fixes: fix cpu down and up lockingRavikiran G Thirumalai
This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] NUMA slab locking fixes: irq disabling from cahep->spinlock to l3 lockRavikiran G Thirumalai
Earlier, we had to disable on chip interrupts while taking the cachep->spinlock because, at cache_grow, on every addition of a slab to a slab cache, we incremented colour_next which was protected by the cachep->spinlock, and cache_grow could occur at interrupt context. Since, now we protect the per-node colour_next with the node's list_lock, we do not need to disable on chip interrupts while taking the per-cache spinlock, but we just need to disable interrupts when taking the per-node kmem_list3 list_lock. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] NUMA slab locking fixes: move color_next to l3Ravikiran G Thirumalai
colour_next is used as an index to add a colouring offset to a new slab in the cache (colour_off * colour_next). Now with the NUMA aware slab allocator, it makes sense to colour slabs added on the same node sequentially with colour_next. This patch moves the colouring index "colour_next" per-node by placing it on kmem_list3 rather than kmem_cache. This also helps simplify locking for CPU up and down paths. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] hugetlb: add comment explaining reasons for Bus ErrorsChristoph Lameter
I just spent some time researching a Bus Error. Turns out that the huge page fault handler can return VM_FAULT_SIGBUS for various conditions where no huge page is available. Add a note explaining the reasoning in the source. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] percpu data: only iterate over possible CPUsEric Dumazet
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of cpudata, instead of allocating memory only for possible cpus. As a preparation for changing that, we need to convert various 0 -> NR_CPUS loops to use for_each_cpu(). (The above only applies to users of asm-generic/percpu.h. powerpc has gone it alone and is presently only allocating memory for present CPUs, so it's currently corrupting memory). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@suse.de> Cc: Anton Blanchard <anton@samba.org> Acked-by: William Irwin <wli@holomorphy.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-04[PATCH] x86_64: Fix memory policy build without CONFIG_HUGETLBFSChen, Kenneth W
> mm/mempolicy.c: In function `huge_zonelist': > mm/mempolicy.c:1045: error: `HPAGE_SHIFT' undeclared (first use in this function) > mm/mempolicy.c:1045: error: (Each undeclared identifier is reported only once > mm/mempolicy.c:1045: error: for each function it appears in.) > make[1]: *** [mm/mempolicy.o] Error 1 Need to wrap huge_zonelist function with CONFIG_HUGETLBFS. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: fix sparse warningRandy Dunlap
mm/slab.c:1522:13: error: incompatible types for operation (&) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] mm/slab: add kernel-doc for one functionRandy.Dunlap
Fix kernel-doc for calculate_slab_order(). Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: fix kzalloc and kstrdup caller report for CONFIG_DEBUG_SLABPekka Enberg
Fix kzalloc() and kstrdup() caller report for CONFIG_DEBUG_SLAB. We must pass the caller to __cache_alloc() instead of directly doing __builtin_return_address(0) there; otherwise kzalloc() and kstrdup() are reported as the allocation site instead of the real one. Thanks to Valdis Kletnieks for reporting the problem and Steven Rostedt for the original idea. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] dump_stack() in oom handlerAndrew Morton
Sometimes it's nice to know who's calling. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: replace kmem_cache_t with struct kmem_cachePekka Enberg
Replace uses of kmem_cache_t with proper struct kmem_cache in mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: rename ac_data to cpu_cache_getPekka Enberg
Rename the ac_data() function to more descriptive cpu_cache_get(). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: extract virt_to_{cache|slab}Pekka Enberg
Introduce virt_to_cache() and virt_to_slab() functions to reduce duplicate code and introduce a proper abstraction should we want to support other kind of mapping for address to slab and cache (eg. for vmalloc() or I/O memory). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: reduce inliningPekka Enberg
From: Manfred Spraul <manfred@colorfullife.com> Reduce the amount of inline functions in slab to the functions that are used in the hot path: - no inline for debug functions - no __always_inline, inline is already __always_inline - remove inline from a few numa support functions. Before: text data bss dec hex filename 13588 752 48 14388 3834 mm/slab.o (defconfig) 16671 2492 48 19211 4b0b mm/slab.o (numa) After: text data bss dec hex filename 13366 752 48 14166 3756 mm/slab.o (defconfig) 16230 2492 48 18770 4952 mm/slab.o (numa) Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: extract slab_{put|get}_objMatthew Dobson
Create two helper functions slab_get_obj() and slab_put_obj() to replace duplicated code in mm/slab.c Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: extract slab_destroy_objs()Matthew Dobson
Create a helper function, slab_destroy_objs() which called from slab_destroy(). This makes slab_destroy() smaller and more readable, and moves ifdefs outside the function body. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: cache_estimate cleanupSteven Rostedt
Clean up cache_estimate() in mm/slab.c and improves the algorithm from O(n) to O(1). We first calculate the maximum number of objects a slab can hold after struct slab and kmem_bufctl_t for each object has been given enough space. After that, to respect alignment rules, we decrease the number of objects if necessary. As required padding is at most align-1 and memory of obj_size is at least align, it is always enough to decrease number of objects by one. The optimization was originally made by Balbir Singh with more improvements from Steven Rostedt. Manfred Spraul provider further modifications: no loop at all for the off-slab case and added comments to explain the background. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: have index_of bug at compile timeSteven Rostedt
I noticed the code for index_of is a creative way of finding the cache index using the compiler to optimize to a single hard coded number. But I couldn't help noticing that it uses two methods to let you know that someone used it wrong. One is at compile time (the correct way), and the other is at run time (not good). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: minor cleanup to kmem_cache_alloc_nodeChristoph Lameter
Clean up kmem_cache_alloc_node a bit. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] slab: distinguish between object and buffer sizeManfred Spraul
An object cache has two different object lengths: - the amount of memory available for the user (object size) - the amount of memory allocated internally (buffer size) This patch does some renames to make the code reflect that better. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: Avoid writeback / page_migrate() methodChristoph Lameter
Migrate a page with buffers without requiring writeback This introduces a new address space operation migratepage() that may be used by a filesystem to implement its own version of page migration. A version is provided that migrates buffers attached to pages. Some filesystems (ext2, ext3, xfs) are modified to utilize this feature. The swapper address space operation are modified so that a regular migrate_page() will occur for anonymous pages without writeback (migrate_pages forces every anonymous page to have a swap entry). Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages()Christoph Lameter
Modify policy layer to support direct page migration - Add migrate_pages_to() allowing the migration of a list of pages to a a specified node or to vma with a specific allocation policy in sets of MIGRATE_CHUNK_SIZE pages - Modify do_migrate_pages() to do a staged move of pages from the source nodes to the target nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: remove_from_swap() to remove swap ptesChristoph Lameter
Add remove_from_swap remove_from_swap() allows the restoration of the pte entries that existed before page migration occurred for anonymous pages by walking the reverse maps. This reduces swap use and establishes regular pte's without the need for page faults. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: migrate_pages() extensionChristoph Lameter
Add direct migration support with fall back to swap. Direct migration support on top of the swap based page migration facility. This allows the direct migration of anonymous pages and the migration of file backed pages by dropping the associated buffers (requires writeout). Fall back to swap out if necessary. The patch is based on lots of patches from the hotplug project but the code was restructured, documented and simplified as much as possible. Note that an additional patch that defines the migrate_page() method for filesystems is necessary in order to avoid writeback for anonymous and file backed pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: PageSwapCache checksChristoph Lameter
Check for PageSwapCache after looking up and locking a swap page. The page migration code may change a swap pte to point to a different page under lock_page(). If that happens then the vm must retry the lookup operation in the swap space to find the correct page number. There are a couple of locations in the VM where a lock_page() is done on a swap page. In these locations we need to check afterwards if the page was migrated. If the page was migrated then the old page that was looked up before was freed and no longer has the PageSwapCache bit set. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Reclaim slab during zone reclaimChristoph Lameter
If large amounts of zone memory are used by empty slabs then zone_reclaim becomes uneffective. This patch shakes the slab a bit. The problem with this patch is that the slab reclaim is not containable to a zone. Thus slab reclaim may affect the whole system and be extremely slow. This also means that we cannot determine how many pages were freed in this zone. Thus we need to go off node for at least one allocation. The functionality is disabled by default. We could modify the shrinkers to take a zone parameter but that would be quite invasive. Better ideas are welcome. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Zone reclaim: Allow modification of zone reclaim behaviorChristoph Lameter
In some situations one may want zone_reclaim to behave differently. For example a process writing large amounts of memory will spew unto other nodes to cache the writes if many pages in a zone become dirty. This may impact the performance of processes running on other nodes. Allowing writes during reclaim puts a stop to that behavior and throttles the process by restricting the pages to the local zone. Similarly one may want to contain processes to local memory by enabling regular swap behavior during zone_reclaim. Off node memory allocation can then be controlled through memory policies and cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: configurable off node allocation period.Christoph Lameter
Currently the zone_reclaim code has a fixed window of 30 seconds of off node allocations should a local zone have no unused pagecache pages left. Reclaim will be attempted again after this timeout period to avoid repeated useless scans for memory. This is also useful to established sufficiently large off node allocation chunks to relieve the local node. It may be beneficial to adjust that time period for some special situations. For example if memory use was exceeding node capacity one may want to give up for longer periods of time. If memory spikes intermittendly then one may want to shorten the time period to reduce the number of off node allocations. This patch allows just that.... Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: partial scans instead of full scanChristoph Lameter
Instead of scanning all the pages in a zone, imitate real swap and scan only a portion of the pages and gradually scan more if we do not free up enough pages. This avoids a zone suddenly loosing all unused pagecache pages (we may after all access some of these again so they deserve another chance) but it still frees up large chunks of memory if a zone only contains unused pagecache pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: do not unmap file backed pagesChristoph Lameter
zone_reclaim should leave that to the real swapper. We are only interested in evicting unmapped pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Use 32 bit division in slab_put_obj()Benjamin LaHaise
Improve the performance of slab_put_obj(). Without the cast, gcc considers ptrdiff_t a 64 bit signed integer and ends up emitting code to use a full signed 128 bit divide on EM64T, which is substantially slower than a 32 bit unsigned divide. I noticed this when looking at the profile of a case where the slab balance is just on edge and thrashes back and forth freeing a block. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: minor fixesChristoph Lameter
- If we only reclaim nr_pages then its okay to stay on node. Switch from > to >= for the comparison. - vm_table[] entry for zone_reclaim_mode is a bit screwed up. - Add empty lines around shrink_zone to show that this is the central function to be called. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] mm: improve function of sc->may_writepageChristoph Lameter
Make sc->may_writepage control the writeout behavior of shrink_list. Remove the laptop_mode trick from shrink_list and instead set may_writepage in try_to_free_pages properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: reclaim on memory only node supportChristoph Lameter
Zone reclaim is usually only run on the local node. Headless nodes do not have any local processors. This patch checks for headless nodes and performs zone reclaim on them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Optimize off-node performance of zone reclaimChristoph Lameter
Ensure that the performance of off node pages stays the same as before. Off node pagefault tests showed an 18% drop in performance without this patch. - Increase the timeout to 30 seconds to reduce the overhead. - Move all code possible out of the off node hot path for zone reclaim (Sorry Andrew, the struct initialization had to be sacrificed). The read_page_state() bit us there. - Check first for the timeout before any other checks. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] __cpuinit functions wrongly marked __meminitAshok Raj
__meminit has overzelously been modified and crept its way into marking cpuup callbacks as __meminit. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] mm: optimize numa policy handling in slab allocatorChristoph Lameter
Move the interrupt check from slab_node into ___cache_alloc and adds an "unlikely()" to avoid pipeline stalls on some architectures. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] NUMA policies in the slab allocator V2Christoph Lameter
This patch fixes a regression in 2.6.14 against 2.6.13 that causes an imbalance in memory allocation during bootup. The slab allocator in 2.6.13 is not numa aware and simply calls alloc_pages(). This means that memory policies may control the behavior of alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE resulting in the spreading out of allocations during bootup over all available nodes. The slab allocator in 2.6.13 has only a single list of slab pages. As a result the per cpu slab cache and the spinlock controlled page lists may contain slab entries from off node memory. The slab allocator in 2.6.13 makes no effort to discern the locality of an entry on its lists. The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages explicitly by calling alloc_pages_node(). The NUMA slab allocator manages slab entries by having lists of available slab pages for each node. The per cpu slab cache can only contain slab entries associated with the node local to the processor. This guarantees that the default allocation mode of the slab allocator always assigns local memory if available. Setting MPOL_INTERLEAVE as a default policy during bootup has no effect anymore. In 2.6.14 all node unspecific slab allocations are performed on the boot processor. This means that most of key data structures are allocated on one node. Most processors will have to refer to these structures making the boot node a potential bottleneck. This may reduce performance and cause unnecessary memory pressure on the boot node. This patch implements NUMA policies in the slab layer. There is the need of explicit application of NUMA memory policies by the slab allcator itself since the NUMA slab allocator does no longer let the page_allocator control locality. The check for policies is made directly at the beginning of __cache_alloc using current->mempolicy. The memory policy is already frequently checked by the page allocator (alloc_page_vma() and alloc_page_current()). So it is highly likely that the cacheline is present. For MPOL_INTERLEAVE kmalloc() will spread out each request to one node after another so that an equal distribution of allocations can be obtained during bootup. It is not possible to push the policy check to lower layers of the NUMA slab allocator since the per cpu caches are now only containing slab entries from the current node. If the policy says that the local node is not to be preferred or forbidden then there is no point in checking the slab cache or local list of slab pages. The allocation better be directed immediately to the lists containing slab entries for the allowed set of nodes. This way of applying policy also fixes another strange behavior in 2.6.13. alloc_pages() is controlled by the memory allocation policy of the current process. It could therefore be that one process is running with MPOL_INTERLEAVE and would f.e. obtain a new page following that policy since no slab entries are in the lists anymore. A page can typically be used for multiple slab entries but lets say that the current process is only using one. The other entries are then added to the slab lists. These are now non local entries in the slab lists despite of the possible availability of local pages that would provide faster access and increase the performance of the application. Another process without MPOL_INTERLEAVE may now run and expect a local slab entry from kmalloc(). However, there are still these free slab entries from the off node page obtained from the other process via MPOL_INTERLEAVE in the cache. The process will then get an off node slab entry although other slab entries may be available that are local to that process. This means that the policy if one process may contaminate the locality of the slab caches for other processes. This patch in effect insures that a per process policy is followed for the allocation of slab entries and that there cannot be a memory policy influence from one process to another. A process with default policy will always get a local slab entry if one is available. And the process using memory policies will get its memory arranged as requested. Off-node slab allocation will require the use of spinlocks and will make the use of per cpu caches not possible. A process using memory policies to redirect allocations offnode will have to cope with additional lock overhead in addition to the latency added by the need to access a remote slab entry. Changes V1->V2 - Remove #ifdef CONFIG_NUMA by moving forward declaration into prior #ifdef CONFIG_NUMA section. - Give the function determining the node number to use a saner name. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] sem2mutex: mm/slab.cIngo Molnar
Convert mm/swapfile.c's swapon_sem to swapon_mutex. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] Zone reclaim: Reclaim logicChristoph Lameter
Some bits for zone reclaim exists in 2.6.15 but they are not usable. This patch fixes them up, removes unused code and makes zone reclaim usable. Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below the watermarks even if other zones still have enough pages available. Zone reclaim is of particular importance for NUMA machines. It can be more beneficial to reclaim a page than taking the performance penalties that come with allocating a page on a remote zone. Zone reclaim is enabled if the maximum distance to another node is higher than RECLAIM_DISTANCE, which may be defined by an arch. By default RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same component (enclosure or motherboard) on IA64. The meaning of the NUMA distance information seems to vary by arch. If zone reclaim is not successful then no further reclaim attempts will occur for a certain time period (ZONE_RECLAIM_INTERVAL). This patch was discussed before. See http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] Zone reclaim: resurrect may_swapChristoph Lameter
Zone reclaim has a huge impact on NUMA performance (f.e. our maximum throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination of numa nodes destroys locality if one just does a large copy operation which results in performance dropping for good until reboot). This patch: Resurrect may_swap in struct scan_control Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] Simplify migrate_page_addChristoph Lameter
Simplify migrate_page_add after feedback from Hugh. This also allows us to drop one parameter from migrate_page_add. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] mm: migration page refcounting fixNick Piggin
Migration code currently does not take a reference to target page properly, so between unlocking the pte and trying to take a new reference to the page with isolate_lru_page, anything could happen to it. Fix this by holding the pte lock until we get a chance to elevate the refcount. Other small cleanups while we're here. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] mm: dirty_exceeded speedupAndrew Morton
Ravikiran reports that this variable is bouncing all around nodes on NUMA machines, causing measurable performance problems. Fix that up by only writing to it when it actually changed. And put it in a new cacheline to prevent it sharing with other things (this happened). Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16[PATCH] x86_64: add __meminit for memory hotplugMatt Tolentino
Add __meminit to the __init lineup to ensure functions default to __init when memory hotplug is not enabled. Replace __devinit with __meminit on functions that were changed when the memory hotplug code was introduced. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] cpuset oom lock fixPaul Jackson
The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] Add tmpfs options for memory placement policiesRobin Holt
Anything that writes into a tmpfs filesystem is liable to disproportionately decrease the available memory on a particular node. Since there's no telling what sort of application (e.g. dd/cp/cat) might be dropping large files there, this lets the admin choose the appropriate default behavior for their site's situation. Introduce a tmpfs mount option which allows specifying a memory policy and a second option to specify the nodelist for that policy. With the default policy, tmpfs will behave as it does today. This patch adds support for preferred, bind, and interleave policies. The default policy will cause pages to be added to tmpfs files on the node which is doing the writing. Some jobs expect a single process to create and manage the tmpfs files. This results in a node which has a significantly reduced number of free pages. With this patch, the administrator can specify the policy and nodes for that policy where they would prefer allocations. This patch was originally written by Brent Casavant and Hugh Dickins. I added support for the bind and preferred policies and the mpol_nodelist mount option. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12Merge git://oss.sgi.com:8090/oss/git/xfs-2.6Linus Torvalds
2006-01-12[PATCH] memmap_init_zone(): remove uneccesary page++Greg Ungerer
Remove unecessary page++ from memmap_init_zone loop. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>