aboutsummaryrefslogtreecommitdiff
path: root/include/linux/mm.h
AgeCommit message (Collapse)Author
2006-06-30[PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.hChristoph Lameter
NOTE: ZVC are *not* the lightweight event counters. ZVCs are reliable whereas event counters do not need to be. Zone based VM statistics are necessary to be able to determine what the state of memory in one zone is. In a NUMA system this can be helpful for local reclaim and other memory optimizations that may be able to shift VM load in order to get more balanced memory use. It is also useful to know how the computing load affects the memory allocations on various zones. This patchset allows the retrieval of that data from userspace. The patchset introduces a framework for counters that is a cross between the existing page_stats --which are simply global counters split per cpu-- and the approach of deferred incremental updates implemented for nr_pagecache. Small per cpu 8 bit counters are added to struct zone. If the counter exceeds certain thresholds then the counters are accumulated in an array of atomic_long in the zone and in a global array that sums up all zone values. The small 8 bit counters are next to the per cpu page pointers and so they will be in high in the cpu cache when pages are allocated and freed. Access to VM counter information for a zone and for the whole machine is then possible by simply indexing an array (Thanks to Nick Piggin for pointing out that approach). The access to the total number of pages of various types does no longer require the summing up of all per cpu counters. Benefits of this patchset right now: - Ability for UP and SMP configuration to determine how memory is balanced between the DMA, NORMAL and HIGHMEM zones. - loops over all processors are avoided in writeback and reclaim paths. We can avoid caching the writeback information because the needed information is directly accessible. - Special handling for nr_pagecache removed. - zone_reclaim_interval vanishes since VM stats can now determine when it is worth to do local reclaim. - Fast inline per node page state determination. - Accurate counters in /sys/devices/system/node/node*/meminfo. Current counters are counting simply which processor allocated a page somewhere and guestimate based on that. So the counters were not useful to show the actual distribution of page use on a specific zone. - The swap_prefetch patch requires per node statistics in order to figure out when processors of a node can prefetch. This patch provides some of the needed numbers. - Detailed VM counters available in more /proc and /sys status files. References to earlier discussions: V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2 V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2 V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2 V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2 Performance tests with AIM7 did not show any regressions. Seems to be a tad faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes fixes for s390/arm/uml arch code. This patch: Move counter code from page_alloc.c/page-flags.h to vmstat.c/h. Create vmstat.c/vmstat.h by separating the counter code and the proc functions. Move the vm_stat_text array before zoneinfo_show. [akpm@osdl.org: s390 build fix] [akpm@osdl.org: HOTPLUG_CPU build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex debugIngo Molnar
Runtime debugging functionality for rt-mutexes. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: introduce debug_check_no_locks_freed()Ingo Molnar
Add debug_check_no_locks_freed(), as a central inline to add bad-lock-free-debugging functionality to. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] vdso: randomize the i386 vDSO by moving it into a vmaIngo Molnar
Move the i386 VDSO down into a vma and thus randomize it. Besides the security implications, this feature also helps debuggers, which can COW a vma-backed VDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMWare) too, which typically live in the same high-mapped address space as the VDSO, hence whenever the VDSO is used, they get lots of guest pagefaults and have to fix such guest accesses up - which slows things down instead of speeding things up (the primary purpose of the VDSO). There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support for older glibcs that still rely on a prelinked high-mapped VDSO. Newer distributions (using glibc 2.3.3 or later) can turn this option off. Turning it off is also recommended for security reasons: attackers cannot use the predictable high-mapped VDSO page as syscall trampoline anymore. There is a new vdso=[0|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned on/off. (This version of the VDSO-randomization patch also has working ELF coredumping, the previous patch crashed in the coredumping code.) This code is a combined work of the exec-shield VDSO randomization code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell started this patch and i completed it. [akpm@osdl.org: cleanups] [akpm@osdl.org: compile fix] [akpm@osdl.org: compile fix 2] [akpm@osdl.org: compile fix 3] [akpm@osdl.org: revernt MAXMEM change] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] page migration: Support a vma migration functionChristoph Lameter
Hooks for calling vma specific migration functions With this patch a vma may define a vma->vm_ops->migrate function. That function may perform page migration on its own (some vmas may not contain page structs and therefore cannot be handled by regular page migration. Pages in a vma may require special preparatory treatment before migration is possible etc) . Only mmap_sem is held when the migration function is called. The migrate() function gets passed two sets of nodemasks describing the source and the target of the migration. The flags parameter either contains MPOL_MF_MOVE which means that only pages used exclusively by the specified mm should be moved or MPOL_MF_MOVE_ALL which means that pages shared with other processes should also be moved. The migration function returns 0 on success or an error condition. An error condition will prevent regular page migration from occurring. On its own this patch cannot be included since there are no users for this functionality. But it seems that the uncached allocator will need this functionality at some point. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] mm: remove VM_LOCKED before remap_pfn_range and drop VM_SHMChristoph Lameter
Remove VM_LOCKED before remap_pfn range from device drivers and get rid of VM_SHM. remap_pfn_range() already sets VM_IO. There is no need to set VM_SHM since it does nothing. VM_LOCKED is of no use since the remap_pfn_range does not place pages on the LRU. The pages are therefore never subject to swap anyways. Remove all the vm_flags settings before calling remap_pfn_range. After removing all the vm_flag settings no use of VM_SHM is left. Drop it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] add page_mkwrite() vm_operations methodDavid Howells
Add a new VMA operation to notify a filesystem or other driver about the MMU generating a fault because userspace attempted to write to a page mapped through a read-only PTE. This facility permits the filesystem or driver to: (*) Implement storage allocation/reservation on attempted write, and so to deal with problems such as ENOSPC more gracefully (perhaps by generating SIGBUS). (*) Delay making the page writable until the contents have been written to a backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS. It permits the filesystem to have some guarantee about the state of the cache. (*) Account and limit number of dirty pages. This is one piece of the puzzle needed to make shared writable mapping work safely in FUSE. Needed by cachefs (Or is it cachefiles? Or fscache? <head spins>). At least four other groups have stated an interest in it or a desire to use the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like EXT3 really ought to use it to deal with the case of shared-writable mmap encountering ENOSPC before we permit the page to be dirtied. From: Peter Zijlstra <a.p.zijlstra@chello.nl> get_user_pages(.write=1, .force=1) can generate COW hits on read-only shared mappings, this patch traps those as mkpage_write candidates and fails to handle them the old way. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Joel Becker <Joel.Becker@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] zone handle unaligned zone boundariesAndy Whitcroft
The buddy allocator has a requirement that boundaries between contigious zones occur aligned with the the MAX_ORDER ranges. Where they do not we will incorrectly merge pages cross zone boundaries. This can lead to pages from the wrong zone being handed out. Originally the buddy allocator would check that buddies were in the same zone by referencing the zone start and end page frame numbers. This was removed as it became very expensive and the buddy allocator already made the assumption that zones boundaries were aligned. It is clear that not all configurations and architectures are honouring this alignment requirement. Therefore it seems safest to reintroduce support for non-aligned zone boundaries. This patch introduces a new check when considering a page a buddy it compares the zone_table index for the two pages and refuses to merge the pages where they do not match. The zone_table index is unique for each node/zone combination when FLATMEM/DISCONTIGMEM is enabled and for each section/zone combination when SPARSEMEM is enabled (a SPARSEMEM section is at least a MAX_ORDER size). Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26Don't include linux/config.h from anywhere else in include/David Woodhouse
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-10[PATCH] Fix buddy list race that could lead to page lru list corruptionsNick Piggin
Rohit found an obscure bug causing buddy list corruption. page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0) to determine whether or not a free page's buddy is itself free and in the buddy lists. Each of the conjuncts may be true at different times due to unrelated conditions, so the non-atomic page_is_buddy test may find each conjunct to be true even if they were not both true at the same time (ie. the page was not on the buddy lists). Signed-off-by: Martin Bligh <mbligh@google.com> Signed-off-by: Rohit Seth <rohitseth@google.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: optimise page_countNick Piggin
Optimise page_count compound page test and make it consistent with similar functions. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] remove set_page_count() outside mm/Nick Piggin
set_page_count usage outside mm/ is limited to setting the refcount to 1. Remove set_page_count from outside mm/, and replace those users with init_page_count() and set_page_refcounted(). This allows more debug checking, and tighter control on how code is allowed to play around with page->_count. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: nommu use compound pagesNick Piggin
Now that compound page handling is properly fixed in the VM, move nommu over to using compound pages rather than rolling their own refcounting. nommu vm page refcounting is broken anyway, but there is no need to have divergent code in the core VM now, nor when it gets fixed. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: David Howells <dhowells@redhat.com> (Needs testing, please). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: make __put_page internalNick Piggin
Remove __put_page from outside the core mm/. It is dangerous because it does not handle compound pages nicely, and misses 1->0 transitions. If a user later appears that really needs the extra speed we can reevaluate. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan: use unsigned longsAndrew Morton
Turn basically everything in vmscan.c into `unsigned long'. This is to avoid the possibility that some piece of code in there might decide to operate upon more than 4G (or even 2G) of pages in one hit. This might be silly, but we'll need it one day. Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: split highorder pagesNick Piggin
Have an explicit mm call to split higher order pages into individual pages. Should help to avoid bugs and be more explicit about the code's intention. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: de-skew page refcountingNick Piggin
atomic_add_unless (atomic_inc_not_zero) no longer requires an offset refcount to function correctly. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: simplify vmscan vs release refcountingNick Piggin
The VM has an interesting race where a page refcount can drop to zero, but it is still on the LRU lists for a short time. This was solved by testing a 0->1 refcount transition when picking up pages from the LRU, and dropping the refcount in that case. Instead, use atomic_add_unless to ensure we never pick up a 0 refcount page from the LRU, thus a 0 refcount page will never have its refcount elevated until it is allocated again. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20[PATCH] Fix undefined symbols for nommu architectureLuke Yang
Signed-off-by: Luke Yang <luke.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] x86_64: Add boot option to disable randomized mappings and cleanupAndi Kleen
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution installation it's easiest if it's a boot time option. Also I moved the variable to a more appropiate place and make it independent from sysctl And marked __read_mostly which it is. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] mm: compound release fixNick Piggin
Compound pages on SMP systems can now often be freed from pagetables via the release_pages path. This uses put_page_testzero which does not handle compound pages at all. Releasing constituent pages from process mappings decrements their count to a large negative number and leaks the reference at the head page - net result is a memory leak. The problem was hidden because the debug check in put_page_testzero itself actually did take compound pages into consideration. Fix the bug and the debug check. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] mark several functions __always_inlineIngo Molnar
Arjan van de Ven <arjan@infradead.org> Mark a number of functions as 'must inline'. The functions affected by this patch need to be inlined because they use knowledge that their arguments are constant so that most of the function optimizes away. At this point this patch does not change behavior, it's for documentation only (and for future patches in the inline series) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] move capable() to capability.hRandy.Dunlap
- Move capable() from sched.h to capability.h; - Use <linux/capability.h> where capable() is used (in include/, block/, ipc/, kernel/, a few drivers/, mm/, security/, & sound/; many more drivers/ to go) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] fix/simplify mutex debugging codeDavid Woodhouse
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as arguments instead, since all its callers were just calculating the 'to' address for themselves anyway... (and sometimes doing so badly). Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] mutex subsystem, more debugging codeIngo Molnar
more mutex debugging: check for held locks during memory freeing, task exit, enable sysrq printouts, etc. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-08[PATCH] shrink struct pageAndrew Morton
Reduce the size of the pageframe for NR_CPUS>4, CONFIG_PREEMPT back to the minimal size by unionising both ->private and ->mapping with the pagetable lock. It uses an anonymous struct and hence requires gcc-3.x. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] set_page_count() macro safetyAvishay Traeger
Fix set_page_count() macro to handle complex arguments. Signed-off-by: Avishay Traeger <atraeger@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] drop-pagecacheAndrew Morton
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] FRV: Make futex code compilable on nommu [try #2]David Howells
Make the futex code compilable and usable on NOMMU by making the attempt to handle page faults conditional on CONFIG_MMU. If this is not enabled, then we can assume that EFAULT returned from futex_atomic_op_inuser() is not recoverable, and that the address lies outside of valid memory. handle_mm_fault() is made to BUG if called on NOMMU without attempting to invoke the actual handler (__handle_mm_fault). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMUDavid Howells
The attached patch makes the SYSV IPC shared memory facilities use the new ramfs facilities on a no-MMU kernel. The following changes are made: (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to allow the IPC SHM facilities to commune with the tiny-shmem and shmem code. (2) ramfs files now need resizing using do_truncate() rather than by modifying the inode size directly (see shmem_file_setup()). This causes ramfs to attempt to bind a block of pages of sufficient size to the inode. (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Shut up warnings in ipc/shm.cRussell King
Fix two warnings in ipc/shm.c ipc/shm.c:122: warning: statement with no effect ipc/shm.c:560: warning: statement with no effect by converting the macros to empty inline functions. For safety, let's do all three. This also has the advantage that typechecking gets performed even without CONFIG_SHMEM enabled. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing storeBadari Pulavarty
Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] reiser4: vfs: add truncate_inode_pages_range()Hans Reiser
This patch makes truncate_inode_pages_range from truncate_inode_pages. truncate_inode_pages became a one-liner call to truncate_inode_pages_range. Reiser4 needs truncate_inode_pages_ranges because it tries to keep correspondence between existences of metadata pointing to data pages and pages to which those metadata point to. So, when metadata of certain part of file is removed from filesystem tree, only pages of corresponding range are to be truncated. (Needed by the madvise(MADV_REMOVE) patch) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-16Make sure we copy pages inserted with "vm_insert_page()" on forkLinus Torvalds
The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-11Remove (at least temporarily) the "incomplete PFN mapping" supportLinus Torvalds
With the previous commit, we can handle arbitrary shared re-mappings even without this complexity, and since the only known private mappings are for strange users of /dev/mem (which never create an incomplete one), there seems to be no reason to support it. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-30VM: add "vm_insert_page()" functionLinus Torvalds
This is what a lot of drivers will actually want to use to insert individual pages into a user VMA. It doesn't have the old PageReserved restrictions of remap_pfn_range(), and it doesn't complain about partial remappings. The page you insert needs to be a nice clean kernel allocation, so you can't insert arbitrary page mappings with this, but that's not what people want. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29VM: add common helper function to create the page tablesLinus Torvalds
This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29Support strange discontiguous PFN remappingsLinus Torvalds
These get created by some drivers that don't generally even want a pfn remapping at all, but would really mostly prefer to just map pages they've allocated individually instead. For now, create a helper function that turns such an incomplete PFN remapping call into a loop that does that explicit mapping. In the long run we almost certainly want to export a totally different interface for that, though. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28mm: re-architect the VM_UNPAGED logicLinus Torvalds
This replaces the (in my opinion horrible) VM_UNMAPPED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: VM_UNPAGEDHugh Dickins
Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: unifdefed PageCompoundHugh Dickins
It looks like snd_xxx is not the only nopage to be using PageReserved as a way of holding a high-order page together: which no longer works, but is masked by our failure to free from VM_RESERVED areas. We cannot fix that bug without first substituting another way to hold the high-order page together, while farming out the 0-order pages from within it. That's just what PageCompound is designed for, but it's been kept under CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out- of-line put_page), doesn't slow down what most needs to be fast (already using hugetlb), and unifies the way we handle high-order pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-18[PARISC] Fix compile warning caused by conflicting types of expand_upwards()Matthew Wilcox
Fix compile warning caused by conflicting types of expand_upwards. IA64 requires it to not be static inline, as it's used outside mm/mmap.c Signed-off-by: Matthew Wilcox <willy@parisc-linux.org> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
2005-11-14Merge x86-64 update from AndiLinus Torvalds
2005-11-14[PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_tAndi Kleen
Has been introduced for x86-64 at some point to save memory in struct page, but has been obsolete for some time. Just remove it. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] readahead commentaryAndrew Morton
Add a few comments surrounding the generic readahead API. Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE offsets into pagecache. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug: sysfs and add/remove functionsDave Hansen
This adds generic memory add/remove and supporting functions for memory hotplug into a new file as well as a memory hotplug kernel config option. Individual architecture patches will follow. For now, disable memory hotplug when swsusp is enabled. There's a lot of churn there right now. We'll fix it up properly once it calms down. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: split page table lockHugh Dickins
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: follow_page with inner ptlockHugh Dickins
Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: kill check_user_page_readableHugh Dickins
check_user_page_readable is a problematic variant of follow_page. It's used only by oprofile's i386 and arm backtrace code, at interrupt time, to establish whether a userspace stackframe is currently readable. This is problematic, because we want to push the page_table_lock down inside follow_page, and later split it; whereas oprofile is doing a spin_trylock on it (in the i386 case, forgotten in the arm case), and needs that to pin perhaps two pages spanned by the stackframe (which might be covered by different locks when we split). I think oprofile is going about this in the wrong way: it doesn't need to know the area is readable (neither i386 nor arm uses read protection of user pages), it doesn't need to pin the memory, it should simply __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've not got around to devising the sparse __user annotations for this. Then we can eliminate check_user_page_readable, and return to a single follow_page without the __follow_page variants. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: unmap_vmas with inner ptlockHugh Dickins
Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>