Age | Commit message (Collapse) | Author |
|
Remove three degrees of obfuscation, left over from when we had
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
built when CONFIG_MMU, so don't need such conditions at all.
Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
169 defconfigs: leave those to evolve in due course.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
in page->mapping, with the higher bits a pointer to the anon_vma; and have
defined PageKsm(page) as that with NULL anon_vma.
But KSM swapping will need to store a pointer there: so in preparation for
that, now define PAGE_MAPPING_FLAGS as the low two bits, including
PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
other use for the bit emerges).
Declare page_rmapping(page) to return the pointer part of page->mapping,
and page_anon_vma(page) to return the anon_vma pointer when that's what it
is. Use these in a few appropriate places: notably, unuse_vma() has been
testing page->mapping, but is better to be testing page_anon_vma() (cases
may be added in which flag bits are set without any pointer).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
being met
If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, if the zone is below the min watermark then kswapd needs to
continue working without delay as there is a danger of an increased rate
of GFP_ATOMIC allocation failure.
This patch changes the conditions under which kswapd waits on congestion
by only going to sleep if the min watermarks are being met.
[mel@csn.ul.ie: add stats to track how relevant the logic is]
[mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
asleep
After kswapd balances all zones in a pgdat, it goes to sleep. In the
event of no IO congestion, kswapd can go to sleep very shortly after the
high watermark was reached. If there are a constant stream of allocations
from parallel processes, it can mean that kswapd went to sleep too quickly
and the high watermark is not being maintained for sufficient length time.
This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.
This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Frans Pop <elendil@planet.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Seems that page_io.c doesn't really need to know that page_private(page)
is the swp_entry 'val'. Rework map_swap_page() to do what its name says
and map a page to a page offset in the swap space.
The only other caller of map_swap_page() is internal to mm/swapfile.c and
it does want to map a swap entry to the 'sector'. So rename
map_swap_page() to map_swap_entry(), make it 'static' and and implement
map_swap_page() as a wrapper around that.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Reorder (and comment) the fields of swap_info_struct, to make better
use of its cachelines: it's good for swap_duplicate() in particular
if unsigned int max and swap_map are near the start.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While we're fiddling with the swap_map values, let's assign a particular
value to shmem/tmpfs swap pages: their swap counts are never incremented,
and it helps swapoff's try_to_unuse() a little if it can immediately
distinguish those pages from process pages.
Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
we might as well use that 0xbf value for SWAP_MAP_SHMEM.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Swap is duplicated (reference count incremented by one) whenever the same
swap page is inserted into another mm (when forking finds a swap entry in
place of a pte, or when reclaim unmaps a pte to insert the swap entry).
swap_info_struct's vmalloc'ed swap_map is the array of these reference
counts: but what happens when the unsigned short (or unsigned char since
the preceding patch) is full? (and its high bit is kept for a cache flag)
We then lose track of it, never freeing, leaving it in use until swapoff:
at which point we _hope_ that a single pass will have found all instances,
assume there are no more, and will lose user data if we're wrong.
Swapping of KSM pages has not yet been enabled; but it is implemented,
and makes it very easy for a user to overflow the maximum swap count:
possible with ordinary process pages, but unlikely, even when pid_max
has been raised from PID_MAX_DEFAULT.
This patch implements swap count continuations: when the count overflows,
a continuation page is allocated and linked to the original vmalloc'ed
map page, and this used to hold the continuation counts for that entry
and its neighbours. These continuation pages are seldom referenced:
the common paths all work on the original swap_map, only referring to
a continuation page when the low "digit" of a count is incremented or
decremented through SWAP_MAP_MAX.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
chars: it's still very unusual to reach a swap count of 126, and the
next patch allows it to be extended indefinitely.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Though swap_count() is useful, I'm finding that swap_has_cache() and
encode_swapmap() obscure what happens in the swap_map entry, just at
those points where I need to understand it. Remove them, and pass
more usable "usage" values to scan_swap_map(), swap_entry_free() and
__swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make better use of the space by folding first swap_extent into its
swap_info_struct, instead of just the list_head: swap partitions need
only that one, and for others it's used as a circular list anyway.
[jirislaby@gmail.com: fix crash on double swapon]
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
to reserve an array of about 30 of them in bss, when most people will
want only one. Change swap_info[] to an array of pointers.
That does need a "type" field in the structure: pack it as a char with
next type and short prio (aha, char is unsigned by default on PowerPC).
Use the (admittedly peculiar) name "type" throughout for this index.
/proc/swaps does not take swap_lock: I wouldn't want it to, but do take
care with barriers when adding a new item to the array (never removed).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The swap_info_struct is mostly private to mm/swapfile.c, with only
one other in-tree user: get_swap_bio(). Adjust its interface to
map_swap_page(), so that we can then remove get_swap_info_struct().
But there is a popular user out-of-tree, TuxOnIce: so leave the
declaration of swap_info_struct in linux/swap.h.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nigel Cunningham <ncunningham@crca.org.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Objects passed to NODEMASK_ALLOC() are relatively small in size and are
backed by slab caches that are not of large order, traditionally never
greater than PAGE_ALLOC_COSTLY_ORDER.
Thus, using GFP_KERNEL for these allocations on large machines when
CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
the allocation attempt, each time invoking both direct reclaim or the oom
killer.
This is of particular interest when using NODEMASK_ALLOC() from a
mempolicy context (either directly in mm/mempolicy.c or the mempolicy
constrained hugetlb allocations) since the oom killer always kills current
when allocations are constrained by mempolicies. So for all present use
cases in the kernel, current would end up being oom killed when direct
reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
current would have sacrificed itself upon returning.
This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
All current use cases either directly from hugetlb code or indirectly via
NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
killer when the slab allocator needs to allocate additional pages.
The side-effect of this change is that all current use cases of either
NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
when the allocation fails (never for CONFIG_NODES_SHIFT <= 8). All
current use cases were audited and do have appropriate error handling at
this time.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Offload the registration and unregistration of per node hstate sysfs
attributes to a worker thread rather than attempt the
allocation/attachment or detachment/freeing of the attributes in the
context of the memory hotplug handler.
I don't know that this is absolutely required, but the registration can
sleep in allocations and other mem hot plug handlers do it this way. If
it turns out this is NOT required, we can drop this patch.
N.B., Only tested build, boot, libhugetlbfs regression.
i.e., no memory hotplug testing.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
there are no present pages left.
In such a situation, kswapd must also be stopped since it has nothing left
to do.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add the per huge page size control/query attributes to the per node
sysdevs:
/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
nr_hugepages - r/w
free_huge_pages - r/o
surplus_huge_pages - r/o
The patch attempts to re-use/share as much of the existing global hstate
attribute initialization and handling, and the "nodes_allowed" constraint
processing as possible.
Calling set_max_huge_pages() with no node indicates a change to global
hstate parameters. In this case, any non-default task mempolicy will be
used to generate the nodes_allowed mask. A valid node id indicates an
update to that node's hstate parameters, and the count argument specifies
the target count for the specified node. From this info, we compute the
target global count for the hstate and construct a nodes_allowed node mask
contain only the specified node.
Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.
With this patch:
(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./ ../ free_hugepages nr_hugepages surplus_hugepages
Starting from:
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 0
Node 2 HugePages_Free: 0
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0
vm.nr_hugepages = 0
Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
[Note that this is equivalent to:
numactl -m 2 hugeadmin --pool-pages-min 2M:+16
]
Yields:
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 16
Node 2 HugePages_Free: 16
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0
vm.nr_hugepages = 16
Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 8
Node 2 HugePages_Free: 8
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
to generic header 'linux/numa.h' for use in generic code. NUMA_NO_NODE
replaces bare '-1' where it's used in this series to indicate "no node id
specified". Ultimately, it can be used to replace the -1 elsewhere where
it is used similarly.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Factor init_nodemask_of_node() out of the nodemask_of_node() macro.
This will be used to populate the huge pages "nodes_allowed" nodemask for
a single node when basing nodes_allowed on a preferred/local mempolicy or
when a persistent huge page pool page count is modified via a per node
sysfs attribute.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is a series of patches to provide control over the location of the
allocation and freeing of persistent huge pages on a NUMA platform.
Please consider for merging into mmotm.
This series uses two mechanisms to constrain the nodes from which
persistent huge pages are allocated: 1) the task NUMA mempolicy of the
task modifying a new sysctl "nr_hugepages_mempolicy", based on a
suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
attributes have been added [in V4] to each node system device under:
/sys/devices/node/node[0-9]*/hugepages
The per node attibutes allow direct assignment of a huge page count on a
specific node, regardless of the task's mempolicy or cpuset constraints.
This patch:
NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
It's perfectly reasonable to use this macro to allocate a nodemask_t,
which is anonymous, either dynamically or on the stack depending on
NODES_SHIFT.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
i2c-core: i2c bus should support PM entries in struct dev_pm_ops
i2c: Get rid of I2C_CLIENT_MODULE_PARM
i2c: Drop I2C_CLIENT_INSMOD_2 to 8
i2c: Drop I2C_CLIENT_INSMOD_1
i2c: Get rid of struct i2c_client_address_data
i2c: Drop the kind parameter from detect callbacks
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6:
pcmcia: CodingStyle fixes
pcmcia: remove unused IRQ_FIRST_SHARED
|
|
There is no user left of I2C_CLIENT_MODULE_PARM, so we can finally
get rid of this ugly macro.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>
|
|
These macros simply declare an enum, so drivers might as well declare
it themselves. This puts an end to the arbitrary limit of 8 chip types
per i2c driver.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>
|
|
This macro simply declares an enum, so drivers might as well declare
it themselves.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>
|
|
Struct i2c_client_address_data only contains one field at this point,
which makes its usefulness questionable. Get rid of it and pass simple
address lists around instead.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>
|
|
The "kind" parameter always has value -1, and nobody is using it any
longer, so we can remove it.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>
|
|
* 'next-spi' of git://git.secretlab.ca/git/linux-2.6: (23 commits)
spi: fix probe/remove section markings
Add OMAP spi100k driver
spi-imx: don't access struct device directly but use dev_get_platdata
spi-imx: Add mx25 support
spi-imx: use positive logic to distinguish cpu variants
spi-imx: correct check for platform_get_irq failing
ARM: NUC900: Add spi driver support for nuc900
spi: SuperH MSIOF SPI Master driver V2
spi: fix spidev compilation failure when VERBOSE is defined
spi/au1550_spi: fix setupxfer not to override cfg with zeros
spi/mpc8xxx: don't use __exit_p to wrap plat_mpc8xxx_spi_remove
spi/i.MX: fix broken error handling for gpio_request
spi/i.mx: drain MXC SPI transfer buffer when probing device
MAINTAINERS: add SPI co-maintainer.
spi/xilinx_spi: fix incorrect casting
spi/mpc52xx-spi: minor cleanups
xilinx_spi: add a platform driver using the xilinx_spi common module.
xilinx_spi: add support for the DS570 IP.
xilinx_spi: Switch to iomem functions and support little endian.
xilinx_spi: Split into of driver and generic part.
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf sched: Fix build failure on sparc
perf bench: Add "all" pseudo subsystem and "all" pseudo suite
perf tools: Introduce perf_session class
perf symbols: Ditch dso->find_symbol
perf symbols: Allow lookups by symbol name too
perf symbols: Add missing "Variables" entry to map_type__name
perf symbols: Add support for 'variable' symtabs
perf symbols: Introduce ELF counterparts to symbol_type__is_a
perf symbols: Introduce symbol_type__is_a
perf symbols: Rename kthreads to kmaps, using another abstraction for it
perf tools: Allow building for ARM
hw-breakpoints: Handle bad modify_user_hw_breakpoint off-case return value
perf tools: Allow cross compiling
tracing, slab: Fix no callsite ifndef CONFIG_KMEMTRACE
tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACING
Trivial conflict due to different fixes to modify_user_hw_breakpoint()
in include/linux/hw_breakpoint.h
|
|
Global variable declarations must match the definitions in section attributes
as the compiler is at liberty to vary the method it uses to access a variable,
depending on the section it is in.
When building the FRV arch, I now see:
drivers/built-in.o: In function `pci_apply_final_quirks':
drivers/pci/quirks.c:2606: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
drivers/pci/quirks.c:2623: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
drivers/pci/quirks.c:2630: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
because the declaration of pci_dfl_cache_line_size in linux/pci.h does not
match the definition in drivers/pci/pci.c.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If there is no hardware breakpoint support, modify_user_hw_breakpoint()
tries to return a NULL pointer through as an 'int' return value:
In file included from kernel/exit.c:53:
include/linux/hw_breakpoint.h: In function 'modify_user_hw_breakpoint':
include/linux/hw_breakpoint.h:96: warning: return makes integer from pointer without a cast
Return 0 instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'for-linus' of git://neil.brown.name/md: (27 commits)
md: add 'recovery_start' per-device sysfs attribute
md: rcu_read_lock() walk of mddev->disks in md_do_sync()
md: integrate spares into array at earliest opportunity.
md: move compat_ioctl handling into md.c
md: revise Kconfig help for MD_MULTIPATH
md: add MODULE_DESCRIPTION for all md related modules.
raid: improve MD/raid10 handling of correctable read errors.
md/raid10: print more useful messages on device failure.
md/bitmap: update dirty flag when bitmap bits are explicitly set.
md: Support write-intent bitmaps with externally managed metadata.
md/bitmap: move setting of daemon_lastrun out of bitmap_read_sb
md: support updating bitmap parameters via sysfs.
md: factor out parsing of fixed-point numbers
md: support bitmap offset appropriate for external-metadata arrays.
md: remove needless setting of thread->timeout in raid10_quiesce
md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
md: move offset, daemon_sleep and chunksize out of bitmap structure
md: collect bitmap-specific fields into one structure.
md/raid1: add takeover support for raid5->raid1
md: add honouring of suspend_{lo,hi} to raid1.
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (58 commits)
mfd: Add twl6030 regulator subdevices
regulator: Add support for twl6030 regulators
rtc: Add twl6030 RTC support
mfd: Add support for twl6030 irq framework
mfd: Rename twl4030_ routines in twl-regulator.c
mfd: Rename twl4030_ routines in rtc-twl.c
mfd: Rename all twl4030_i2c*
mfd: Rename twl4030* driver files to enable re-use
mfd: Clarify twl4030 return value for read and write
mfd: Add all twl4030 regulators to the twl4030 mfd driver
mfd: Don't set mc13783 ADREFMODE for touch conversions
mfd: Remove ezx-pcap defines for custom led gpio encoding
mfd: Near complete mc13783 rewrite
mfd: Remove build time warning for WM835x register default tables
mfd: Force I2C to be built in when building WM831x
mfd: Don't allow wm831x to be built as a module
mfd: Fix incorrect error check for wm8350-core
mfd: Fix twl4030 warning
gpiolib: Implement gpio_to_irq() for wm831x
mfd: Remove default selection of AB4500
...
|
|
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (75 commits)
NFS: Fix nfs_migrate_page()
rpc: remove unneeded function parameter in gss_add_msg()
nfs41: Invoke RECLAIM_COMPLETE on all new client ids
SUNRPC: IS_ERR/PTR_ERR confusion
NFSv41: Fix a potential state leakage when restarting nfs4_close_prepare
nfs41: Handle NFSv4.1 session errors in the delegation recall code
nfs41: Retry delegation return if it failed with session error
nfs41: Handle session errors during delegation return
nfs41: Mark stateids in need of reclaim if state manager gets stale clientid
NFS: Fix up the declaration of nfs4_restart_rpc when NFSv4 not configured
nfs41: Don't clear DRAINING flag on NFS4ERR_STALE_CLIENTID
nfs41: nfs41_setup_state_renewal
NFSv41: More cleanups
NFSv41: Fix up some bugs in the NFS4CLNT_SESSION_DRAINING code
NFSv41: Clean up slot table management
NFSv41: Fix nfs4_proc_create_session
nfs41: Invoke RECLAIM_COMPLETE
nfs41: RECLAIM_COMPLETE functionality
nfs41: RECLAIM_COMPLETE XDR functionality
Cleanup some NFSv4 XDR decode comments
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
|
|
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This patch adds initial support for creating twl6030 PMIC
specific voltage regulators in the twl mfd driver.
Board specific regulator configurations will have to be passed from
respective board files.
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Balaji T K <balajitk@ti.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This patch updates the regulator driver to add support
for TWL6030 PMIC specific LDO regulators.
SMPS resources are not yet supported for TWL6030 and
also .set_mode and .get_status for LDO's are yet to
be implemented for TWL6030.
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Balaji T K <balajitk@ti.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This patch adds support for phoenix interrupt framework. New iInterrupt
status register A, B, C are introduced in Phoenix and are cleared on write.
Due to the differences in interrupt handling with respect to TWL4030,
twl6030-irq.c is created for TWL6030 PMIC
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Balaji T K <balajitk@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This patch renames function names like twl4030_i2c_write_u8,
twl4030_i2c_read_u8 to twl_i2c_write_u8, twl_i2c_read_u8
and also common variable in twl-core.c
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Balaji T K <balajitk@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
The upcoming TWL6030 is companion chip for OMAP4 like the current TWL4030
for OMAP3. The common modules like RTC, Regulator creates opportunity
to re-use the most of the code from twl4030.
This patch renames few common drivers twl4030* files to twl* to enable
the code re-use.
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Balaji T K <balajitk@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
|
|
Add all twl4030 regulators to the twl4030 mfd driver and
twl4030_platform_data
Signed-off-by: Juha Keski-Saari <ext-juha.1.keski-saari@nokia.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
We used these, in a first version of leds-pcap driver, in order to encode gpio
enabling and gpio inversion for a led inside the variable used for the gpio
number. In the new leds-pcap driver we rely on gpio_is_valid() to derive if a
led is gpio enabled and we have a dedicated flag to tell if the gpio value has
to be inverted.
Signed-off-by: Antonio Ospite <ospite@studenti.unina.it>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This fixes several things while still providing the old API:
- simplify and fix locking
- better error handling
- don't ack all irqs making it impossible to detect a reset of the
rtc
- use a timeout variant to wait for completion of ADC conversion
- provide platform-data to regulator subdevice (This allows making
struct mc13783 opaque for other drivers after the regulator driver is
updated to use its platform_data.)
- expose all interrupts
- use threaded irq
After all users in mainline are converted to the new API, some things
(e.g. mc13783-private.h) can go away.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
Replace the wm831x-local IRQ infrastructure with genirq, allowing access
to the diagnostic infrastructure of genirq and allowing us to implement
interrupt support for the GPIOs. The switchover is done within the
wm831x specific IRQ API, further patches will convert the individual
drivers to use genirq directly.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
TWL5031 introduces two new interrupts in PIH. Moreover, BCI
has changed remarkably and, thus, it's disabled when TWL5031
is in use.
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This is done as simple code transformation, the semantics of the
IRQ API provided by the core are are still very different to those
of genirq (mainly with regard to masking).
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
Add function to allow the configuation fo the VDCDC2 register by
external users, to allow changing of the standard and low-power
running modes.
This is needed, for example, for the Simtec IM2440D20 where we need
to use the low-power mode to shutdown the LDO/DCDC that are not needed
during suspend (saving substantial power) and the runtime use of the
low-power mode to change VCore.
Signed-off-by: Ben Dooks <ben@simtec.co.uk>
Signed-off-by: Simtec Linux Team <linux@simtec.co.uk>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|