aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2007-05-07SLAB: don't allocate empty shared cachesEric Dumazet
We can avoid allocating empty shared caches and avoid unecessary check of cache->limit. We save some memory. We avoid bringing into CPU cache unecessary cache lines. All accesses to l3->shared are already checking NULL pointers so this patch is safe. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07SLAB: use num_possible_cpus() in enable_cpucache()Eric Dumazet
The existing comment in mm/slab.c is *perfect*, so I reproduce it : /* * CPU bound tasks (e.g. network routing) can exhibit cpu bound * allocation behaviour: Most allocs on one cpu, most free operations * on another cpu. For these cases, an efficient object passing between * cpus is necessary. This is provided by a shared array. The array * replaces Bonwick's magazine layer. * On uniprocessor, it's functionally equivalent (but less efficient) * to a larger limit. Thus disabled by default. */ As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way a preprocessor #if can detect if the machine is UP or SMP. Better to use num_possible_cpus(). This means on UP we allocate a 'size=0 shared array', to be more efficient. Another patch can later avoid the allocations of 'empty shared arrays', to save some memory. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07readahead: code cleanupJan Kara
Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to prev_offset. Also update of prev_index in do_generic_mapping_read() is now moved close to the update of prev_offset. [wfg@mail.ustc.edu.cn: fix it] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07readahead: improve heuristic detecting sequential readsJan Kara
Introduce ra.offset and store in it an offset where the previous read ended. This way we can detect whether reads are really sequential (and thus we should not mark the page as accessed repeatedly) or whether they are random and just happen to be in the same page (and the page should really be marked accessed again). Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07Add unitialized_var() macro for suppressing gcc warningsBorislav Petkov
Introduce a macro for suppressing gcc from generating a warning about a probable uninitialized state of a variable. Example: - spinlock_t *ptl; + spinlock_t *uninitialized_var(ptl); Not a happy solution, but those warnings are obnoxious. - Using the usual pointlessly-set-it-to-zero approach wastes several bytes of text. - Using a macro means we can (hopefully) do something else if gcc changes cause the `x = x' hack to stop working - Using a macro means that people who are worried about hiding true bugs can easily turn it off. Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07mm: simplify filemap_nopageNick Piggin
Identical block is duplicated twice: contrary to the comment, we have been re-reading the page *twice* in filemap_nopage rather than once. If any retry logic or anything is needed, it belongs in lower levels anyway. Only retry once. Linus agrees. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07add pfn_valid_within helper for sub-MAX_ORDER hole detectionAndy Whitcroft
Generally we work under the assumption that memory the mem_map array is contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie. that if we have validated any page within this MAX_ORDER_NR_PAGES block we need not check any other. This is not true when CONFIG_HOLES_IN_ZONE is set and we must check each and every reference we make from a pfn. Add a pfn_valid_within() helper which should be used when scanning pages within a MAX_ORDER_NR_PAGES block when we have already checked the validility of the block normally with pfn_valid(). This can then be optimised away when we do not have holes within a MAX_ORDER_NR_PAGES block of pages. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07allow oom_adj of saintly processesJoshua N Pritikin
If the badness of a process is zero then oom_adj>0 has no effect. This patch makes sure that the oom_adj shift actually increases badness points appropriately. Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07mm: make read_cache_page synchronousNick Piggin
Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate calls. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07slab: ensure cache_alloc_refill terminatesPekka Enberg
If slab->inuse is corrupted, cache_alloc_refill can enter an infinite loop as detailed by Michael Richardson in the following post: <http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch those cases. Cc: Michael Richardson <mcr@sandelman.ca> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07mm: remove gcc workaroundNick Piggin
Minimum gcc version is 3.2 now. However, with likely profiling, even modern gcc versions cannot always eliminate the call. Replace the placeholder functions with the more conventional empty static inlines, which should be optimal for everyone. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07Use ZVC counters to establish exact size of dirtyable pagesChristoph Lameter
We can use the global ZVC counters to establish the exact size of the LRU and the free pages. This allows a more accurate determination of the dirty ratio. This patch will fix the broken ratio calculations if large amounts of memory are allocated to huge pags or other consumers that do not put the pages on to the LRU. Notes: - I did not add NR_SLAB_RECLAIMABLE to the calculation of the dirtyable pages. Those may be reclaimable but they are at this point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered then a huge number of reclaimable pages would stop writeback from occurring. - This patch used to be in mm as the last one in a series of patches. It was removed when Linus updated the treatment of highmem because there was a conflict. I updated the patch to follow Linus' approach. This patch is neede to fulfill the claims made in the beginning of the patchset that is now in Linus' tree. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07Safer nr_node_ids and nr_node_ids determination and initial valuesChristoph Lameter
The nr_cpu_ids value is currently only calculated in smp_init. However, it may be needed before (SLUB needs it on kmem_cache_init!) and other kernel components may also want to allocate dynamically sized per cpu array before smp_init. So move the determination of possible cpus into sched_init() where we already loop over all possible cpus early in boot. Also initialize both nr_node_ids and nr_cpu_ids with the highest value they could take. If we have accidental users before these values are determined then the current valud of 0 may cause too small per cpu and per node arrays to be allocated. If it is set to the maximum possible then we only waste some memory for early boot users. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07Add apply_to_page_range() which applies a function to a pte rangeJeremy Fitzhardinge
Add a new mm function apply_to_page_range() which applies a given function to every pte in a given virtual address range in a given mm structure. This is a generic alternative to cut-and-pasting the Linux idiomatic pagetable walking code in every place that a sequence of PTEs must be accessed. Although this interface is intended to be useful in a wide range of situations, it is currently used specifically by several Xen subsystems, for example: to ensure that pagetables have been allocated for a virtual address range, and to construct batched special pagetable update requests to map I/O memory (in ioremap()). [akpm@linux-foundation.org: fix warning, unpleasantly] Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@waste.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07slab: introduce kreallocPekka Enberg
This introduce krealloc() that reallocates memory while keeping the contents unchanged. The allocator avoids reallocation if the new size fits the currently used cache. I also added a simple non-optimized version for mm/slob.c for compatibility. [akpm@linux-foundation.org: fix warnings] Acked-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-06[MM]: sparse_init() should be __init.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-02[PATCH] x86-64: skip cache_free_alien() on non NUMASiddha, Suresh B
Set use_alien_caches to 0 on non NUMA platforms. And avoid calling the cache_free_alien() when use_alien_caches is not set. This will avoid the cache miss that happens while dereferencing slabp to get nodeid. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pagesJeremy Fitzhardinge
Xen and VMI both have special requirements when mapping a highmem pte page into the kernel address space. These can be dealt with by adding a new kmap_atomic_pte() function for mapping highptes, and hooking it into the paravirt_ops infrastructure. Xen specifically wants to map the pte page RO, so this patch exposes a helper function, kmap_atomic_prot, which maps the page with the specified page protections. This also adds a kmap_flush_unused() function to clear out the cached kmap mappings. Xen needs this to clear out any potential stray RW mappings of pages which will become part of a pagetable. [ Zach - vmi.c will need some attention after this patch. It wasn't immediately obvious to me what needs to be done. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Zachary Amsden <zach@vmware.com>
2007-05-02[PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destructionJeremy Fitzhardinge
Add hooks to allow a paravirt implementation to track the lifetime of an mm. Paravirtualization requires three hooks, but only two are needed in common code. They are: arch_dup_mmap, which is called when a new mmap is created at fork arch_exit_mmap, which is called when the last process reference to an mm is dropped, which typically happens on exit and exec. The third hook is activate_mm, which is called from the arch-specific activate_mm() macro/function, and so doesn't need stub versions for other architectures. It's called when an mm is first used. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: linux-arch@vger.kernel.org Cc: James Bottomley <James.Bottomley@SteelEye.com> Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02[PATCH] x86-64: Fix vmalloc_32 to really allocate <4GB on 64bit platformsAndi Kleen
Ugly ifdef, but should handle all 64bit platforms that have suitable zones. On some like Altix it's probably impossible without IOMMU use to get memory <4GB this way, but they have to live with that. Signed-off-by: Andi Kleen <ak@suse.de>
2007-04-27Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (38 commits) [S390] SPIN_LOCK_UNLOCKED cleanup in drivers/s390 [S390] Clean up smp code in preparation for some larger changes. [S390] Remove debugging junk. [S390] Switch etr from tasklet to workqueue. [S390] split page_test_and_clear_dirty. [S390] Processor degradation notification. [S390] vtime: cleanup per_cpu usage. [S390] crypto: cleanup. [S390] sclp: fix coding style. [S390] vmlogrdr: stop IUCV connection in vmlogrdr_release. [S390] sclp: initialize early. [S390] ctc: kmalloc->kzalloc/casting cleanups. [S390] zfcpdump support. [S390] dasd: Add ipldev parameter. [S390] dasd: Add sysfs attribute status and generate uevents. [S390] Improved kernel stack overflow checking. [S390] Get rid of console setup functions. [S390] No execute support cleanup. [S390] Minor fault path optimization. [S390] Use generic bug. ...
2007-04-27Change default dirty-writeback limitsLinus Torvalds
Do this really early in the 2.6.22-rc series, so that we'll get feedback. And don't change by half measures. Just cut the default dirty limit to a quarter of what it was, and see if anybody even notices. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-27[S390] split page_test_and_clear_dirty.Martin Schwidefsky
The page_test_and_clear_dirty primitive really consists of two operations, page_test_dirty and the page_clear_dirty. The combination of the two is not an atomic operation, so it makes more sense to have two separate operations instead of one. In addition to the improved readability of the s390 version of SetPageUptodate, it now avoids the page_test_dirty operation which is an insert-storage-key-extended (iske) instruction which is an expensive operation. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-04-24page migration: fix NR_FILE_PAGES accountingChristoph Lameter
NR_FILE_PAGES must be accounted for depending on the zone that the page belongs to. If we replace the page in the radix tree then we may have to shift the count to another zone. Suggested-by: Ethan Solomita <solo@google.com> Eventually-typed-in-by: Christoph Lameter <clameter@sgi.com> Cc: Martin Bligh <mbligh@mbligh.org> Cc: <stable@kernel.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24fix OOM killing processes wrongly thought MPOL_BINDHugh Dickins
I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with "No available memory (MPOL_BIND)". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24oom: kill all threads that share mm with killed taskDavid Rientjes
oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-12[PATCH] nommu: fix bug ip_conntrack does not work on nommuWu, Bryan
num_physpages is not exported out in mm/nommu.c, so the ip_conntrack module link will fail. Signed-off-by: Bryan Wu <bryan.wu@analog.com> Acked-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: [S390] cio: Fix handling of interrupt for csch(). [S390] page_mkclean data corruption.
2007-04-04[PATCH] SLAB: Mention slab name when listing corrupt objectsDavid Howells
Mention the slab name when listing corrupt objects. Although the function that released the memory is mentioned, that is frequently ambiguous as such functions often release several pieces of memory. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04[S390] page_mkclean data corruption.Martin Schwidefsky
The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which added the page_test_and_clear_dirty call to page_mkclean and the git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes the "nasty and subtle race in shared mmap'ed page writeback" problem in clear_page_dirty_for_io cause data corruption on s390. The effect of the two changes is that for every call to clear_page_dirty_for_io a page_test_and_clear_dirty is done. If the per page dirty bit is set set_page_dirty is called. Strangly clear_page_dirty_for_io is called for not-uptodate pages, e.g. over this call-chain: [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130 [<000000000007c494>] generic_writepages+0x258/0x3e0 [<000000000007c692>] do_writepages+0x76/0x7c [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4 [<00000000000c831a>] sync_sb_inodes+0x23e/0x398 [<00000000000c8802>] writeback_inodes+0x12e/0x140 [<000000000007b9ee>] wb_kupdate+0xd2/0x178 [<000000000007cca2>] pdflush+0x162/0x23c The bad news now is that page_test_and_clear_dirty might claim that a not-uptodate page is dirty since SetPageUptodate which resets the per page dirty bit has not yet been called. The page writeback that follows clobbers the data on disk. The simplest solution to this problem is to move the call to page_test_and_clear_dirty under the "if (page_mapped(page))". If a file backed page is mapped it is uptodate. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-03-29[PATCH] mm: fix xip issue with /dev/zeroCarsten Otte
Fix the bug, that reading into xip mapping from /dev/zero fills the user page table with ZERO_PAGE() entries. Later on, xip cannot tell which pages have been ZERO_PAGE() filled by access to a sparse mapping, and which ones origin from /dev/zero. It will unmap ZERO_PAGE from all mappings when filling the sparse hole with data. xip does now use its own zeroed page for its sparse mappings. Please apply. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29[PATCH] holepunch: fix mmap_sem i_mutex deadlockHugh Dickins
sys_madvise has down_write of mmap_sem, then madvise_remove calls vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise deadlocks from that ordering. madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since madvise_remove doesn't split or merge vmas, it's easy to handle this case with a NULL prev, without restructuring sys_madvise. (Though sad to retake mmap_sem when it's unlikely to be needed, and certainly down_read is sufficient for MADV_REMOVE, unlike the other madvices.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29[PATCH] holepunch: fix disconnected pages after second truncateHugh Dickins
shmem_truncate_range has its own truncate_inode_pages_range, to free any pages racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when this might have happened. But holepunching gets no chance to clear that flag at the start of vmtruncate_range, so it's always set (unless a truncate came just before), so holepunch almost always does this second truncate_inode_pages_range. shmem holepunch has unlikely swap<->file races hereabouts whatever we do (without a fuller rework than is fit for this release): I was going to skip the second truncate in the punch_hole case, but Miklos points out that would make holepunch correctness more vulnerable to swapoff. So keep the second truncate, but follow it by an unmap_mapping_range to eliminate the disconnected pages (freed from pagecache while still mapped in userspace) that it might have left behind. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29[PATCH] holepunch: fix shmem_truncate_range punch lockingHugh Dickins
Miklos Szeredi observes that during truncation of shmem page directories, info->lock is released to improve latency (after lowering i_size and next_index to exclude races); but this is quite wrong for holepunching, which receives no such protection from i_size or next_index, and is left vulnerable to races with shmem_unuse, shmem_getpage and shmem_writepage. Hold info->lock throughout when holepunching? No, any user could prevent rescheduling for far too long. Instead take info->lock just when needed: in shmem_free_swp when removing the swap entries, and whenever removing a directory page from the level above. But so long as we remove before scanning, we can safely skip taking the lock at the lower levels, except at misaligned start and end of the hole. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29[PATCH] holepunch: fix shmem_truncate_range punching too farHugh Dickins
Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare circumstances, because shmem_truncate_range() erroneously removes partially truncated directory pages at the end of the range: later reclaim on pages pointing to these removed directories triggers the BUG. Indeed, and it can also cause data loss beyond the hole. Fix this as in the patch proposed by Miklos, but distinguish between "limit" (how far we need to search: ignore truncation's next_index optimization in the holepunch case - if there are races it's more consistent to act on the whole range specified) and "upper_limit" (how far we can free directory pages: generally we must be careful to keep partially punched pages, but can relax at end of file - i_size being held stable by i_mutex). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cs> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27block: blk_max_pfn is somtimes wrongVasily Tarasov
There is a small problem in handling page bounce. At the moment blk_max_pfn equals max_pfn, which is in fact not maximum possible _number_ of a page frame, but the _amount_ of page frames. For example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not 0xFFFF. request_queue structure has a member q->bounce_pfn and queue needs bounce pages for the pages _above_ this limit. This routine is handled by blk_queue_bounce(), where the following check is produced: if (q->bounce_pfn >= blk_max_pfn) return; Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn equals 0x10000. In such situation the check above fails and for each bio we always fall down for iterating over pages tied to the bio. I want to notice, that for quite a big range of device drivers (ide, md, ...) such problem doesn't happen because they use BLK_BOUNCE_ANY for bounce_pfn. BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and then the check above doesn't fail. But for other drivers, which obtain reuired value from drivers, it fails. For example sata_nv uses ATA_DMA_MASK or dev->dma_mask. I propose to use (max_pfn - 1) for blk_max_pfn. And the same for blk_max_low_pfn. The patch also cleanses some checks related with bounce_pfn. Signed-off-by: Vasily Tarasov <vtaras@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-22[PATCH] NOMMU: make SYSV SHM nattch work correctlyDavid Howells
Make the SYSV SHM nattch counter work correctly by forcing multiple VMAs to be produced to represent MAP_SHARED segments, even if they overlap exactly. Using this test program: http://people.redhat.com/~dhowells/doshm.c Run as: doshm sysv I can see nattch going from one before the patch: # /doshm sysv Command: sysv shmid: 65536 memory: 0xc3700000 c0b00000-c0b04000 rw-p 00000000 00:00 0 c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so c3520000-c352278c rw-p 00000000 00:0b 13763417 /doshm c3584000-c35865e8 r-xs 00000000 00:0b 13763417 /doshm c3588000-c358aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3590000-c359b6c0 rw-p 00000000 00:00 0 c3620000-c3640000 rwxp 00000000 00:00 0 c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted) c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted) nattch 1 To two after the patch: # /doshm sysv Command: sysv shmid: 0 memory: 0xc3700000 c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so c3320000-c3340000 rwxp 00000000 00:00 0 c3530000-c35325e8 r-xs 00000000 00:0b 13763417 /doshm c3534000-c353678c rw-p 00000000 00:0b 13763417 /doshm c3538000-c353aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3590000-c359b6c0 rw-p 00000000 00:00 0 c35a4000-c35a8000 rw-p 00000000 00:00 0 c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted) c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted) nattch 2 That's +1 to nattch for each shmat() made. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22[PATCH] NOMMU: supply get_unmapped_area() to fix NOMMU SYSV SHMDavid Howells
Supply a get_unmapped_area() to fix NOMMU SYSV SHM support. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16[PATCH] oom fix: prevent oom from killing a process with children/sibling ↵Ankita Garg
unkillable Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16[PATCH] nfs: fix congestion controlPeter Zijlstra
The current NFS client congestion logic is severly broken, it marks the backing device congested during each nfs_writepages() call but doesn't mirror this in nfs_writepage() which makes for deadlocks. Also it implements its own waitqueue. Replace this by a more regular congestion implementation that puts a cap on the number of active writeback pages and uses the bdi congestion waitqueue. Also always use an interruptible wait since it makes sense to be able to SIGKILL the process even for mounts without 'intr'. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16[PATCH] dio: invalidate clean pages before dio writeZach Brown
This patch fixes a user-triggerable oops that was reported by Leonid Ananiev as archived at http://lkml.org/lkml/2007/2/8/337. dio writes invalidate clean pages that intersect the written region so that subsequent buffered reads go to disk to read the new data. If this fails the interface tries to tell the caller that the cache is inconsistent by returning EIO. Before this patch we had the problem where this invalidation failure would clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c. Both fs/aio.c and bio completion call aio_complete() and we reference freed memory, usually oopsing. This patch addresses this problem by invalidating before the write so that we can cleanly return -EIO before ->direct_IO() has had a chance to return -EIOCBQUEUED. There is a compromise here. During the dio write we can fault in mmap()ed pages which intersect the written range with get_user_pages() if the user provided them for the source buffer. This is a crazy thing to do, but we can make it mostly work in most cases by trying the invalidation again. The compromise is that we won't return an error if this second invalidation fails if it's an AIO write and we have -EIOCBQUEUED. This was tested by having two processes race performing large O_DIRECT and buffered ordered writes. Within minutes ext3 would see a race between ext3_releasepage() and jbd holding a reference on ordered data buffers and would cause invalidation to fail, panicing the box. The test can be found in the 'aio_dio_bugs' test group in test.kernel.org/autotest. After this patch the test passes. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16[PATCH] mm: fix madvise infinine loopNick Piggin
madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the call covers a region from the start of a vma, and extending past that vma. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05[PATCH] Page migration: Fix vma flag checkingChristoph Lameter
Currently we do not check for vma flags if sys_move_pages is called to move individual pages. If sys_migrate_pages is called to move pages then we check for vm_flags that indicate a non migratable vma but that still includes VM_LOCKED and we can migrate mlocked pages. Extract the vma_migratable check from mm/mempolicy.c, fix it and put it into migrate.h so that is can be used from both locations. Problem was spotted by Lee Schermerhorn Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05[PATCH] shmem and simple const super_operationsHugh Dickins
shmem's super_operations were missed from the recent const-ification; and simple_fill_super()'s, which can share with get_sb_pseudo()'s. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] VM: invalidate_inode_pages2_range() should not exit earlyTrond Myklebust
Fix invalidate_inode_pages2_range() so that it does not immediately exit just because a single page in the specified range could not be removed. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] adapt page_lock_anon_vma() to PREEMPT_RCUOleg Nesterov
page_lock_anon_vma() uses spin_lock() to block RCU. This doesn't work with PREEMPT_RCU, we have to do rcu_read_lock() explicitely. Otherwise, it is theoretically possible that slab returns anon_vma's memory to the system before we do spin_unlock(&anon_vma->lock). [ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged yet, and may never be. Regardless, this patch is conceptually the right thing to do, even if it doesn't matter at this point. - Linus ] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocationsAndrew Morton
throttle_vm_writeout() is designed to wait for the dirty levels to subside. But if the caller holds IO or FS locks, we might be holding up that writeout. So change it to take a single nap to give other devices a chance to clean some memory, then return. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] Bug in MM_RB debuggingDavid Miller
The code is seemingly trying to make sure that rb_next() brings us to successive increasing vma entries. But the two variables, prev and pend, used to perform these checks, are never advanced. Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] Rename PG_checked to PG_owner_priv_1Nick Piggin
Rename PG_checked to PG_owner_priv_1 to reflect its availablilty as a private flag for use by the owner/allocator of the page. In the case of pagecache pages (which might be considered to be owned by the mm), filesystems may use the flag. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01[PATCH] kernel-doc fixes for 2.6.20-git15 (non-drivers)Randy Dunlap
Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>