aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-04-02sysctl: fix suid_dumpable and lease-break-time sysctlsMatthew Wilcox
Arne de Bruijn points out that commit 76fdbb25f963de5dc1e308325f0578a2f92b1c2d ("coredump masking: bound suid_dumpable sysctl") mistakenly limits lease-break-time instead of suid_dumpable. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Reported-by: Arne de Bruijn <kernelbt@arbruijn.dds.nl> Cc: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlersSerge E. Hallyn
As pointed out by Cedric Le Goater (in response to Alexey's original comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define the proc_handlers. Change that. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02workqueue: avoid recursion in run_workqueue()Lai Jiangshan
1) lockdep will complain when run_workqueue() performs recursion. 2) The recursive implementation of run_workqueue() means that flush_workqueue() and its documentation are inconsistent. This may hide deadlocks and other bugs. 3) The recursion in run_workqueue() will poison cwq->current_work, but flush_work() and __cancel_work_timer(), etcetera need a reliable cwq->current_work. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace_untrace: fix the SIGNAL_STOP_STOPPED checkOleg Nesterov
This bug is ancient too. ptrace_untrace() must not resume the task if the group stop in progress, we should set TASK_STOPPED instead. Unfortunately, we still have problems here: - if the process/thread was traced, SIGNAL_STOP_STOPPED does not necessary means this thread group is stopped. - ptrace breaks the bookkeeping of ->group_stop_count. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logicOleg Nesterov
Another ancient bug. Consider this trivial test-case, int main(void) { int pid = fork(); if (pid) { ptrace(PTRACE_ATTACH, pid, NULL, NULL); wait(NULL); ptrace(PTRACE_DETACH, pid, NULL, NULL); } else { pause(); printf("WE HAVE A KERNEL BUG!!!\n"); } return 0; } the child must not "escape" for sys_pause(), but it can and this was seen in practice. This is because ptrace_detach does: if (!child->exit_state) wake_up_process(child); this wakeup can happen after this child has already restarted sys_pause(), because it gets another wakeup from ptrace_untrace(). With or without this patch, perhaps sys_pause() needs a fix. But this wakeup also breaks the SIGNAL_STOP_STOPPED logic in ptrace_untrace(). Remove this wakeup. The caller saw this task in TASK_TRACED state, and unless it was SIGKILL'ed in between __ptrace_unlink()->ptrace_untrace() should handle this case correctly. If it was SIGKILL'ed, we don't need to wakup the dying tracee too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02forget_original_parent: do not abuse child->ptrace_entryOleg Nesterov
By discussion with Roland. - Use ->sibling instead of ->ptrace_entry to chain the need to be release_task'd childs. Nobody else can use ->sibling, this task is EXIT_DEAD and nobody can find it on its own list. - rename ptrace_dead to dead_childs. - Now that we don't have the "parallel" untrace code, change back reparent_thread() to return void, pass dead_childs as an argument. Actually, I don't understand why do we notify /sbin/init when we reparent a zombie, probably it is better to reap it unconditionally. [akpm@linux-foundation.org: s/childs/children/] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02forget_original_parent: split out the un-ptrace partOleg Nesterov
By discussion with Roland. - Rename ptrace_exit() to exit_ptrace(), and change it to do all the necessary work with ->ptraced list by its own. - Move this code from exit.c to ptrace.c - Update the comment in ptrace_detach() to explain the rechecking of the child->ptrace. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: fix a zombie leak if /sbin/init ignores SIGCHLDOleg Nesterov
If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked. reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in this case. This means that nobody except us can reap it, the detached task is not visible to do_wait(). Change reparent_thread() to return a boolean (like __pthread_detach) to indicate that the thread is dead and must be released. Also change forget_original_parent() to add the child to ptrace_dead list in this case. The naming becomes insane, the next patch does the cleanup. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: fix the "is it traced" checkOleg Nesterov
reparent_thread() uses ptrace_reparented() to check whether this thread is ptraced, in that case we should not notify the new parent. But ptrace_reparented() is not exactly correct when the reparented thread is traced by /sbin/init, because forget_original_parent() has already changed ->real_parent. Currently, the only problem is the false notification. But with the next patch the kernel crash in this (yes, pathological) case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: don't call kill_orphaned_pgrp() if task_detached()Oleg Nesterov
If task_detached(p) == T, then either a) p is not the main thread, we will find the group leader on the ->children list. or b) p is the group leader but its ->exit_state = EXIT_DEAD. This can only happen when the last sub-thread has died, but in that case that thread has already called kill_orphaned_pgrp() from exit_notify(). In both cases kill_orphaned_pgrp() looks bogus. Move the task_detached() check up and simplify the code, this is also right from the "common sense" pov: we should do nothing with the detached childs, except move them to the new parent's ->children list. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: fix possible zombie leak on PTRACE_DETACHOleg Nesterov
When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it has already passed exit_notify() we can leak a zombie, because a) ptracing disables the auto-reaping logic, and b) ->real_parent was not notified about the child's death. ptrace_detach() should follow the ptrace_exit's logic, change the code accordingly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Tested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit()Oleg Nesterov
No functional changes, preparation for the next patch. Move the "should we release this child" logic into the separate handler, __ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: simplify ptrace_exit()->ignoring_children() pathOleg Nesterov
ignoring_children() takes parent->sighand->siglock and checks k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the "really" wrong result even if we race with sigaction(SIGCHLD). If we read the "stale" sa_handler/sa_flags we can pretend it was changed right after the check. Remove spin_lock(->siglock), and kill "int ign" which caches the result of ignoring_children() which becomes rather trivial. Perhaps it makes sense to export this helper, do_notify_parent() can use it too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: kill __ptrace_detach(), fix ->exit_state checkOleg Nesterov
Move the code from __ptrace_detach() to its single caller and kill this helper. Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks. Actually, I think task_is_stopped_or_traced() makes more sense, but this needs another patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: SI_USER: Masquerade si_pid when crossing pid ns boundarySukadev Bhattiprolu
When sending a signal to a descendant namespace, set ->si_pid to 0 since the sender does not have a pid in the receiver's namespace. Note: - If rt_sigqueueinfo() sets si_code to SI_USER when sending a signal across a pid namespace boundary, the value in ->si_pid will be cleared to 0. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect cinit from blocked fatal signalsSukadev Bhattiprolu
Normally SIG_DFL signals to global and container-init are dropped early. But if a signal is blocked when it is posted, we cannot drop the signal since the receiver may install a handler before unblocking the signal. Once this signal is queued however, the receiver container-init has no way of knowing if the signal was sent from an ancestor or descendant namespace. This patch ensures that contianer-init drops all SIG_DFL signals in get_signal_to_deliver() except SIGKILL/SIGSTOP. If SIGSTOP/SIGKILL originate from a descendant of container-init they are never queued (i.e dropped in sig_ignored() in an earler patch). If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued and container-init processes the signal. IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global or container-init, the signal must have been generated internally or must have come from an ancestor ns and we process the signal. Further, the signal_group_exit() check was needed to cover the case of a multi-threaded init sending SIGKILL to other threads when doing an exit() or exec(). But since the new sig_kernel_only() check covers the SIGKILL, the signal_group_exit() check is no longer needed and can be removed. Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: zap_pid_ns_process() should use force_sig()Sukadev Bhattiprolu
send_signal() assumes that signals with SEND_SIG_PRIV are generated from within the same namespace. So any nested container-init processes become immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes(). Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect cinit from unblocked SIG_DFL signalsSukadev Bhattiprolu
Drop early any SIG_DFL or SIG_IGN signals to container-init from within the same container. But queue SIGSTOP and SIGKILL to the container-init if they are from an ancestor container. Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the container can still terminate the container-init. That will be addressed in the next patch. Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits in a follow-on patch. Until then, this patch is just a preparatory step. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: add from_ancestor_ns parameter to send_signal()Sukadev Bhattiprolu
send_signal() (or its helper) needs to determine the pid namespace of the sender. But a signal sent via kill_pid_info_as_uid() comes from within the kernel and send_signal() does not need to determine the pid namespace of the sender. So define a helper for send_signal() which takes an additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid() use that helper directly. The 'from_ancestor_ns' parameter will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect init from unwanted signals moreOleg Nesterov
(This is a modified version of the patch submitted by Oleg Nesterov http://lkml.org/lkml/2008/11/18/249 and tries to address comments that came up in that discussion) init ignores the SIG_DFL signals but we queue them anyway, including SIGKILL. This is mostly OK, the signal will be dropped silently when dequeued, but the pending SIGKILL has 2 bad implications: - it implies fatal_signal_pending(), so we confuse things like wait_for_completion_killable/lock_page_killable. - for the sub-namespace inits, the pending SIGKILL can mask (legacy_queue) the subsequent SIGKILL from the parent namespace which must kill cinit reliably. (preparation, cinits don't have SIGNAL_UNKILLABLE yet) The patch can't help when init is ptraced, but ptracing of init is not "safe" anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: remove 'handler' parameter to tracehook functionsOleg Nesterov
Container-init must behave like global-init to processes within the container and hence it must be immune to unhandled fatal signals from within the container (i.e SIG_DFL signals that terminate the process). But the same container-init must behave like a normal process to processes in ancestor namespaces and so if it receives the same fatal signal from a process in ancestor namespace, the signal must be processed. Implementing these semantics requires that send_signal() determine pid namespace of the sender but since signals can originate from workqueues/ interrupt-handlers, determining pid namespace of sender may not always be possible or safe. This patchset implements the design/simplified semantics suggested by Oleg Nesterov. The simplified semantics for container-init are: - container-init must never be terminated by a signal from a descendant process. - container-init must never be immune to SIGKILL from an ancestor namespace (so a process in parent namespace must always be able to terminate a descendant container). - container-init may be immune to unhandled fatal signals (like SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP are the only reliable signals to a container-init from ancestor namespace. This patch: Based on an earlier patch submitted by Oleg Nesterov and comments from Roland McGrath (http://lkml.org/lkml/2008/11/19/258). The handler parameter is currently unused in the tracehook functions. Besides, the tracehook functions are called with siglock held, so the functions can check the handler if they later need to. Removing the parameter simiplifies changes to sig_ignored() in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02do_wait: fix waiting for the group stop with the dead leaderOleg Nesterov
do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is not true if the leader is already dead. Check SIGNAL_STOP_STOPPED instead and use signal->group_exit_code. Trivial test-case: void *tfunc(void *arg) { pause(); return NULL; } int main(void) { pthread_t thr; pthread_create(&thr, NULL, tfunc, NULL); pthread_exit(NULL); return 0; } It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but bash can't see this. The bug is very old, and it was reported multiple times. This patch was sent more than a year ago (http://marc.info/?t=119713920000003) but it was ignored. This change also fixes other oddities (but not all) in this area. For example, before this patch: $ sleep 100 ^Z [1]+ Stopped sleep 100 $ strace -p `pidof sleep` Process 11442 attached - interrupt to quit strace hangs in do_wait(), because ->exit_code was already consumed by bash. After this patch, strace happily proceeds: --- SIGTSTP (Stopped) @ 0 (0) --- restart_syscall(<... resuming interrupted call ...> To me, this looks much more "natural" and correct. Another example. Let's suppose we have the main thread M and sub-thread T, the process is stopped, and its parent did wait(WSTOPPED). Now we can ptrace T but not M. This looks at least strange to me. Imho, do_wait() should not confuse the per-thread ptrace stops with the per-process job control stops. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Kaz Kylheku <kkylheku@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusetsDavid Rientjes
Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a specific cpu. Thus, their set of allowed cpus shall not change. This patch prevents such threads from attaching to non-root cpusets. They do not have mempolicies that restrict them to a subset of system nodes and, since their cpumask may never change, they cannot use any of the features of cpusets. The tasks will forever be a member of the root cpuset and will be returned when listing the tasks attached to that cpuset. Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpusets: allow cpusets to be configured/built on non-SMP systemsPaul Menage
Allow cpusets to be configured/built on non-SMP systems Currently it's impossible to build cpusets under UML on x86-64, since cpusets depends on SMP and x86-64 UML doesn't support SMP. There's code in cpusets that doesn't depend on SMP. This patch surrounds the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to allow cpusets to build/run on UP systems (for testing purposes under UML). Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpusets: replace zone allowed functions with node allowedDavid Rientjes
The cpuset_zone_allowed() variants are actually only a function of the zone's node. Cc: Paul Menage <menage@google.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpuset: remove struct cpuset_hotplug_scannerLi Zefan
Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpuset: avoid changing cpuset's mems when errno returnedLi Zefan
When writing to cpuset.mems, cpuset has to update its mems_allowed before calling update_tasks_nodemask(), but this function might return -ENOMEM. To avoid this rare case, we allocate the memory before changing mems_allowed, and then pass to update_tasks_nodemask(). Similar to what update_cpumask() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpuset: rewrite update_tasks_nodemask()Li Zefan
This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's mems_allowed. Not only simplify the code largely, but also avoid allocating an array to hold mm pointers of all the tasks in the cpuset. This array can be big (size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a chance to fail the allocation when under memory stress. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cpuset: fix possible races in cpu/memory hotplugLi Zefan
Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected by callback_mutex, otherwise the reader may read wrong cpus/mems. This is cpuset's lock rule. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memcg: fix OOM killer under memcgKAMEZAWA Hiroyuki
This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02debug cgroup: remove unneeded cgroup_lockLi Zefan
Since we are in cgroup write handler, so the cgrp is valid, so we don't have to hold cgroup_mutex when calling cgroup_task_count(). One similar example is in cgroup_tasks_open(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: don't change release_agent when remount failedLi Zefan
Remount can fail in either case: - wrong mount options is specified, or option 'noprefix' is changed. - a to-be-added subsys is already mounted/active. When using remount to change 'release_agent', for the above former failure case, remount will return errno with release_agent unchanged, but for the latter case, remount will return EBUSY with relase_agent changed, which is unexpected I think: # mount -t cgroup -o cpu xxx /cgrp1 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2 # cat /cgrp2/release_agent agent1 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 not mounted already, or bad option # cat /cgrp2/release_agent agent1 <-- ok # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 is busy # cat /cgrp2/release_agent agent2 <-- unexpected! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: show correct file modeLi Zefan
We have some read-only files and write-only files, but currently they are all set to 0644, which is counter-intuitive and cause trouble for some cgroup tools like libcgroup. This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's own files' file mode, and for the most cases cft->mode can be default to 0 and cgroup will figure out proper mode. Acked-by: Paul Menage <menage@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02kernel/cgroup.c: kfree(NULL) is legalJesper Juhl
Reduces object file size a bit: Before: $ size kernel/cgroup.o text data bss dec hex filename 21593 7804 4924 34321 8611 kernel/cgroup.o After: $ size kernel/cgroup.o text data bss dec hex filename 21537 7744 4924 34205 859d kernel/cgroup.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroup: fix frequent -EBUSY at rmdirKAMEZAWA Hiroyuki
In following situation, with memory subsystem, /groupA use_hierarchy==1 /01 some tasks /02 some tasks /03 some tasks /04 empty When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim is triggered and the kernel walks tree under groupA. In this case, rmdir /groupA/04 fails with -EBUSY frequently because of temporal refcnt from the kernel. In general. cgroup can be rmdir'd if there are no children groups and no tasks. Frequent fails of rmdir() is not useful to users. (And the reason for -EBUSY is unknown to users.....in most cases) This patch tries to modify above behavior, by - retries if css_refcnt is got by someone. - add "return value" to pre_destroy() and allows subsystem to say "we're really busy!" Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroup: CSS ID supportKAMEZAWA Hiroyuki
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code. This patch attaches unique ID to each css and provides following. - css_lookup(subsys, id) returns pointer to struct cgroup_subysys_state of id. - css_get_next(subsys, id, rootid, depth, foundid) returns the next css under "root" by scanning When cgroup_subsys->use_id is set, an id for css is maintained. The cgroup framework only parepares - css_id of root css for subsys - id is automatically attached at creation of css. - id is *not* freed automatically. Because the cgroup framework don't know lifetime of cgroup_subsys_state. free_css_id() function is provided. This must be called by subsys. There are several reasons to develop this. - Saving space .... For example, memcg's swap_cgroup is array of pointers to cgroup. But it is not necessary to be very fast. By replacing pointers(8bytes per ent) to ID (2byes per ent), we can reduce much amount of memory usage. - Scanning without lock. CSS_ID provides "scan id under this ROOT" function. By this, scanning css under root can be written without locks. ex) do { rcu_read_lock(); next = cgroup_get_next(subsys, id, root, &found); /* check sanity of next here */ css_tryget(); rcu_read_unlock(); id = found + 1 } while(...) Characteristics: - Each css has unique ID under subsys. - Lifetime of ID is controlled by subsys. - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy - Allowed ID is 1-65535, ID 0 is UNUSED ID. Design Choices: - scan-by-ID v.s. scan-by-tree-walk. As /proc's pid scan does, scan-by-ID is robust when scanning is done by following kind of routine. scan -> rest a while(release a lock) -> conitunue from interrupted memcg's hierarchical reclaim does this. - When subsys->use_id is set, # of css in the system is limited to 65535. [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroupsGrzegorz Nosek
The ns_proxy cgroup allows moving processes to child cgroups only one level deep at a time. This commit relaxes this restriction and makes it possible to attach tasks directly to grandchild cgroups, e.g.: ($pid is in the root cgroup) echo $pid > /cgroup/CG1/CG2/tasks Previously this operation would fail with -EPERM and would have to be performed as two steps: echo $pid > /cgroup/CG1/tasks echo $pid > /cgroup/CG1/CG2/tasks Also, the target cgroup no longer needs to be empty to move a task there. Signed-off-by: Grzegorz Nosek <root@localdomain.pl> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02Simplify copy_thread()Alexey Dobriyan
First argument unused since 2.3.11. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02nommu: fix a number of issues with the per-MM VMA patchDavid Howells
Fix a number of issues with the per-MM VMA patch: (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on a NOMMU system with more than 2G pages. Makes no difference on a 32-bit system. (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value, lest it overflow. (3) Move the allocation of the vm_area_struct slab back for fork.c. (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs. (5) Use BUG_ON() rather than if () BUG(). (6) Make the default validate_nommu_regions() a static inline rather than a #define. (7) Make free_page_series()'s objection to pages with a refcount != 1 more informative. (8) Adjust the __put_nommu_region() banner comment to indicate that the semaphore must be held for writing. (9) Limit the number of warnings about munmaps of non-mmapped regions. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02Merge branch 'tracing/core-v2' into tracing-for-linusIngo Molnar
Conflicts: include/linux/slub_def.h lib/Kconfig.debug mm/slob.c mm/slub.c
2009-04-01epoll keyed wakeups: add __wake_up_locked_key() and __wake_up_sync_key()Davide Libenzi
This patchset introduces wakeup hints for some of the most popular (from epoll POV) devices, so that epoll code can avoid spurious wakeups on its waiters. The problem with epoll is that the callback-based wakeups do not, ATM, carry any information about the events the wakeup is related to. So the only choice epoll has (not being able to call f_op->poll() from inside the callback), is to add the file* to a ready-list and resolve the real events later on, at epoll_wait() (or its own f_op->poll()) time. This can cause spurious wakeups, since the wake_up() itself might be for an event the caller is not interested into. The rate of these spurious wakeup can be pretty high in case of many network sockets being monitored. By allowing devices to report the events the wakeups refer to (at least the two major classes - POLLIN/POLLOUT), we are able to spare useless wakeups by proper handling inside the epoll's poll callback. Epoll will have in any case to call f_op->poll() on the file* later on, since the change to be done in order to have the full event set sent via wakeup, is too invasive for the way our f_op->poll() system works (the full event set is calculated inside the poll function - there are too many of them to even start thinking the change - also poll/select would need change too). Epoll is changed in a way that both devices which send event hints, and the ones that don't, are correctly handled. The former will gain some efficiency though. As a general rule for devices, would be to add an event mask by using key-aware wakeup macros, when making up poll wait queues. I tested it (together with the epoll's poll fix patch Andrew has in -mm) and wakeups for the supported devices are correctly filtered. Test program available here: http://www.xmailserver.org/epoll_test.c This patch: Nothing revolutionary here. Just using the available "key" that our wakeup core already support. The __wake_up_locked_key() was no brainer, since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers around __wake_up_common(). The __wake_up_sync() function had a body, so the choice was between borrowing the body for __wake_up_sync_key() and calling it from __wake_up_sync(), or make an inline and calling it from both. I chose the former since in most archs it all resolves to "mov $0, REG; jmp ADDR". Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: William Lee Irwin III <wli@movementarian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01pm: rework includes, remove arch ifdefsMagnus Damm
Make the following header file changes: - remove arch ifdefs and asm/suspend.h from linux/suspend.h - add asm/suspend.h to disk.c (for arch_prepare_suspend()) - add linux/io.h to swsusp.c (for ioremap()) - x86 32/64 bit compile fixes Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: fix proc_dointvec_userhz_jiffies "breakage"Alexey Dobriyan
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838 On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange way from the user's point of view: # echo 500 >/proc/sys/vm/dirty_writeback_centisecs # cat /proc/sys/vm/dirty_writeback_centisecs 499 So, we have 5000 jiffies converted to only 499 clock ticks and reported back. TICK_NSEC = 999848 ACTHZ = 256039 Keeping in-kernel variable in units passed from userspace will fix issue of course, but this probably won't be right for every sysctl. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: introduce for_each_populated_zone() macroKOSAKI Motohiro
Impact: cleanup In almost cases, for_each_zone() is used with populated_zone(). It's because almost function doesn't need memoryless node information. Therefore, for_each_populated_zone() can help to make code simplify. This patch has no functional change. [akpm@linux-foundation.org: small cleanup] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01ring-buffer: do not remove reader page from list on ring buffer freeSteven Rostedt
Impact: prevent possible memory leak The reader page of the ring buffer is special. Although it points into the ring buffer, it is not part of the actual buffer. It is a page used by the reader to swap with a page in the ring buffer. Once the swap is made, the new reader page is again outside the buffer. Even though the reader page points into the buffer, it is really pointing to residual data. Note, this data is used by the reader. reader page | v (prev) +---+ (next) +----------| |----------+ | +---+ | v v +---+ +---+ +---+ -->| |------->| |------->| |---> <--| |<-------| |<-------| |<--- +---+ +---+ +---+ ^ ^ ^ \ | / ------- Buffer--------- If we perform a list_del_init() on the reader page we will actually remove the last page the reader swapped with and not the reader page itself. This will cause that page to not be freed, and thus is a memory leak. Luckily, the only user of the ring buffer so far is ftrace. And ftrace will not free its ring buffer after it allocates it. There is no current possible memory leak. But once there are other users, or if ftrace dynamically creates and frees its ring buffer, then this would be a memory leak. This patch fixes the leak for future cases. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01function-graph: allow unregistering twiceSteven Rostedt
Impact: fix to permanent disabling of function graph tracer There should be nothing to prevent a tracer from unregistering a function graph callback more than once. This can simplify error paths. But currently, the counter does not account for mulitple unregistering of the function graph callback. If it happens, the function graph tracer will be permanently disabled. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31Get rid of indirect include of fs_struct.hAl Viro
Don't pull it in sched.h; very few files actually need it and those can include directly. sched.h itself only needs forward declaration of struct fs_struct; Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31New locking/refcounting for fs_structAl Viro
* all changes of current->fs are done under task_lock and write_lock of old fs->lock * refcount is not atomic anymore (same protection) * its decrements are done when removing reference from current; at the same time we decide whether to free it. * put_fs_struct() is gone * new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do execve() and only subthreads share fs_struct. Cleared when finishing exec (success and failure alike). Makes CLONE_FS fail with -EAGAIN if set. * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread is in progress. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31Take fs_struct handling to new file (fs/fs_struct.c)Al Viro
Pure code move; two new helper functions for nfsd and daemonize (unshare_fs_struct() and daemonize_fs_struct() resp.; for now - the same code as used to be in callers). unshare_fs_struct() exported (for nfsd, as copy_fs_struct()/exit_fs() used to be), copy_fs_struct() and exit_fs() don't need exports anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31Kill unsharing fs_struct in __set_personality()Al Viro
That's a rudiment of altroot support. I.e. it should've been buried a long time ago. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>