aboutsummaryrefslogtreecommitdiff
path: root/kernel/exit.c
AgeCommit message (Collapse)Author
2008-04-29cgroups: add an owner to the mm_structBalbir Singh
Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rename mpol_free to mpol_putLee Schermerhorn
This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-25[PATCH] sanitize unshare_files/reset_files_structAl Viro
* let unshare_files() give caller the displaced files_struct * don't bother with grabbing reference only to drop it in the caller if it hadn't been shared in the first place * in that form unshare_files() is trivially implemented via unshare_fd(), so we eliminate the duplicate logics in fork.c * reset_files_struct() is not just only called for current; it will break the system if somebody ever calls it for anything else (we can't modify ->files of somebody else). Lose the task_struct * argument. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-25[PATCH] sanitize handling of shared descriptor tables in failing execve()Al Viro
* unshare_files() can fail; doing it after irreversible actions is wrong and de_thread() is certainly irreversible. * since we do it unconditionally anyway, we might as well do it in do_execve() and save ourselves the PITA in binfmt handlers, etc. * while we are at it, binfmt_som actually leaked files_struct on failure. As a side benefit, unshare_files(), put_files_struct() and reset_files_struct() become unexported. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-22[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()Al Viro
The only reason to have separated __...() for those was to keep them inlined for local users in exit.c. Since Alexey removed the inline on those, there's no reason whatsoever to keep them around; just collapse with normal variants. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-10asmlinkage_protect replaces prevent_tail_callRoland McGrath
The prevent_tail_call() macro works around the problem of the compiler clobbering argument words on the stack, which for asmlinkage functions is the caller's (user's) struct pt_regs. The tail/sibling-call optimization is not the only way that the compiler can decide to use stack argument words as scratch space, which we have to prevent. Other optimizations can do it too. Until we have new compiler support to make "asmlinkage" binding on the compiler's own use of the stack argument frame, we have work around all the manifestations of this issue that crop up. More cases seem to be prevented by also keeping the incoming argument variables live at the end of the function. This makes their original stack slots attractive places to leave those variables, so the compiler tends not clobber them for something else. It's still no guarantee, but it handles some observed cases that prevent_tail_call() did not. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-08Fix waitid si_code regressionRoland McGrath
In commit ee7c82da830ea860b1f9274f1f0cdf99f206e7c2 ("wait_task_stopped: simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short) cast when storing si_code was lost in wait_task_stopped. This leaks the in-kernel CLD_* values that do not match what userland expects. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03exit_notify: fix kill_orphaned_pgrp() usage with mt exitOleg Nesterov
1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we should do this only when the whole process exits. 2. exit_notify() uses "current" as "ignored_task", obviously wrong. Use ->group_leader instead. Test case: void hup(int sig) { printf("HUP received\n"); } void *tfunc(void *arg) { sleep(2); printf("sub-thread exited\n"); return NULL; } int main(int argc, char *argv[]) { if (!fork()) { signal(SIGHUP, hup); kill(getpid(), SIGSTOP); exit(0); } pthread_t thr; pthread_create(&thr, NULL, tfunc, NULL); sleep(1); printf("main thread exited\n"); syscall(__NR_exit, 0); return 0; } output: main thread exited HUP received Hangup With this patch the output is: main thread exited sub-thread exited HUP received Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03will_become_orphaned_pgrp: partially fix insufficient ->exit_state checkOleg Nesterov
p->exit_state != 0 doesn't mean this process is dead, it may have sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)" instead. Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process if the main thread has exited. However, the new check is not perfect either. There is a window when exit_notify() drops tasklist and before release_task(). Suppose that the last (non-leader) thread exits. This means that entire group exits, but thread_group_empty() is not true yet. As Eric pointed out, is_global_init() is wrong as well, but I did not dare to do other changes. Just for the record, has_stopped_jobs() is absolutely wrong too. But we can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues. Even with this patch ^Z doesn't play well with the dead main thread. The task is stopped correctly but do_wait(WSTOPPED) won't see it. This is another unrelated issue, will be (hopefully) fixed separately. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03introduce kill_orphaned_pgrp() helperOleg Nesterov
Factor out the common code in reparent_thread() and exit_notify(). No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14Use struct path in fs_structJan Blunck
* Use struct path in fs_struct. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kernel: remove fastcall in kernel/*Harvey Harrison
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Pidns: make full use of xxx_vnr() callsPavel Emelyanov
Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were _all_ converted to operate on the current pid namespace. After this each call like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo) one. Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where appropriate. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08pid: sys_wait... fixesEric W. Biederman
This modifies do_wait and eligible child to take a pair of enum pid_type and struct pid *pid to precisely specify what set of processes are eligible to be waited for, instead of the raw pid_t value from sys_wait4. This fixes a bug in sys_waitid where you could not wait for children in just process group 1. This fixes a pid namespace crossing case in eligible_child. Allowing us to wait for a processes in our current process group even if our current process group == 0. This allows the no child with this pid case to be optimized. This allows us to optimize the pid membership test in eligible child to be optimized. This even closes a theoretical pid wraparound race where in a threaded parent if two threads are waiting for the same child and one thread picks up the child and the pid numbers wrap around and generate another child with that same pid before the other thread is scheduled (teribly insanely unlikely) we could end up waiting on the second child with the same pid# and not discover that the specific child we were waiting for has exited. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08move the related code from exit_notify() to exit_signals()Oleg Nesterov
The previous bugfix was not optimal, we shouldn't care about group stop when we are the only thread or the group stop is in progress. In that case nothing special is needed, just set PF_EXITING and return. Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify(). So, from the performance POV the only difference is that we don't trust !signal_pending() until we take ->siglock. But this in fact fixes another ___pure___ theoretical minor race. __group_complete_signal() finds the task without PF_EXITING and chooses it as the target for signal_wake_up(). But nothing prevents this task from exiting in between without noticing the pending signal and thus unpredictably delaying the actual delivery. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08fix group stop with exit raceOleg Nesterov
do_signal_stop() counts all sub-thread and sets ->group_stop_count accordingly. Every thread should decrement ->group_stop_count and stop, the last one should notify the parent. However a sub-thread can exit before it notices the signal_pending(), or it may be somewhere in do_exit() already. In that case the group stop never finishes properly. Note: this is a minimal fix, we can add some optimizations later. Say we can return quickly if thread_group_empty(). Also, we can move some signal related code from exit_notify() to exit_signals(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08move daemonized kernel threads into the swapper's sessionOleg Nesterov
Daemonized kernel threads run in the init's session. This doesn't match the behaviour of kthread_create()'ed threads, and this is one of the 2 reasons why we need a special hack in sys_setsid(). Now that set_special_pids() was changed to use struct pid, not pid_t, we can use init_struct_pid and set 0,0 special pids. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08teach set_special_pids() to use struct pidOleg Nesterov
Change set_special_pids() to work with struct pid, not pid_t from global name space. This again speedups and imho cleanups the code, also a preparation for the next patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08wait_task_zombie: remove ->exit_state/exit_signal checks for WNOWAITOleg Nesterov
The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense. The exit_state was EXIT_ZOMBIE when the function was called, and another thread can change it to EXIT_DEAD right after the check. The second condition is not possible, detached non-traced threads were already filtered out by eligible_child(), we didn't drop tasklist since then. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08wait_task_continued/zombie: don't use task_pid_nr_ns() locklessOleg Nesterov
Surprise, the other two wait_task_*() functions also abuse the task_pid_nr_ns() function, and may cause read-after-free or report nr == 0 in wait_task_continued(). wait_task_zombie() doesn't have this problem, but it is still better to cache pid_t rather than call task_pid_nr_ns() three times on the saved pid_namespace. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08do_wait: fix security checksOleg Nesterov
Imho, the current usage of security_task_wait() is not logical. Suppose we have the single child p, and security_task_wait(p) return -EANY. In that case waitpid(-1) returns this error. Why? Isn't it better to return ECHLD? We don't really have reapable children. Now suppose that child was stolen by gdb. In that case we find this child on ->ptrace_children and set flag = 1, but we don't check that the child was denied. So, do_wait(..., WNOHANG) returns 0, this doesn't match the behaviour above. Without WNOHANG do_wait() blocks only to return the error later, when the child will be untraced. Inho, really strange. I think eligible_child() should return the error only if the child's pid was requested explicitly, otherwise we should silently ignore the tasks which were nacked by security_task_wait(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08do_wait: cleanup delay_group_leader() usageOleg Nesterov
eligible_child() == 2 means delay_group_leader(). With the previous patch this only matters for EXIT_ZOMBIE task, we can move that special check to the only place it is really needed. Also, with this patch we don't skip security_task_wait() for the group leaders in a non-empty thread group. I don't really understand the exact semantics of security_task_wait(), but imho this change is a bugfix. Also rearrange the code a bit to kill an ugly "check_continued" backdoor. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Roland McGrath <roland@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08wait_task_stopped(): remove unneeded delay_group_leader checkOleg Nesterov
wait_task_stopped() doesn't need the "delay_group_leader" parameter. If the child is not traced it must be a group leader. With or without subthreads ->group_stop_count == 0 when the whole task is stopped. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Mika Penttila <mika.penttila@kolumbus.fi> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08do_wait: factor out "retval != 0" checksOleg Nesterov
Every branch if the main "if" statement does the same code at the end. Move it down. Also, fix the indentation. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08wait_task_stopped: simplify and fix races with SIGCONT/SIGKILL/untraceOleg Nesterov
wait_task_stopped() has multiple races with SIGCONT/SIGKILL. tasklist_lock does not pin the child in TASK_TRACED/TASK_STOPPED stated, almost all info reported (including exit_code) may be wrong. In fact, the code under write_lock_irq(tasklist_lock) is not safe. The child may be PTRACE_DETACH'ed at this time by another subthread, in that case it is possible we are no longer its ->parent. Change wait_task_stopped() to take ->siglock before inspecting the task. This guarantees that the child can't resume and (for example) clear its ->exit_code, so we don't need to use xchg(&p->exit_code) and re-check. The only exception is ptrace_stop() which changes ->state and ->exit_code without ->siglock held during abort. But this can only happen if both the tracer and the tracee are dying (coredump is in progress), we don't care. With this patch wait_task_stopped() doesn't move the child to the end of the ->parent list on success. This optimization could be restored, but in that case we have to take write_lock(tasklist) and do some nasty checks. Also change the do_wait() since we don't return EAGAIN any longer. [akpm@linux-foundation.org: fix up after Willy renamed everything] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kill my_ptrace_child()Oleg Nesterov
Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED" inline and simplify the corresponding logic in do_wait: we can't find the child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace() either sets TASK_STOPPED or wakes up the tracee. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kill PT_ATTACHEDOleg Nesterov
Since the patch "Fix ptrace_attach()/ptrace_traceme()/de_thread() race" commit f5b40e363ad6041a96e3da32281d8faa191597b9 we set PT_ATTACHED and change child->parent "atomically" wrt task_list lock. This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer" which were needed to catch the "ptrace attach is in progress" case. We can also remove the flag itself since nobody else uses it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06do_wait: remove one "else if" branchOleg Nesterov
Minor cleanup. We can remove one "else if" branch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05exec: rework the group exit and fix the race with killOleg Nesterov
As Roland pointed out, we have the very old problem with exec. de_thread() sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then clears signal->flags. All signals (even fatal ones) sent in this window (which is not too small) will be lost. With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(), the new helper, should be used to detect exit_group() or exec() in progress. It can have more users, but this patch does only strictly necessary changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Robin Holt <holt@sgi.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-03Remove one useless extern declarationPierre Peiffer
The file exit.c contains one useless extern declaration of sem_exit(). Moreover, it refers to nothing. This trivial patch removes it. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-12-06exit: Use task_is_*Matthew Wilcox
Also restructure the loop in do_wait() Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2007-11-29wait_task_stopped(): pass correct exit_code to wait_noreap_copyout()Scott James Remnant
In wait_task_stopped() exit_code already contains the right value for the si_status member of siginfo, and this is simply set in the non WNOWAIT case. If you call waitid() with a stopped or traced process, you'll get the signal in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the same time, you'll get the signal << 8 | 0x7f Pass it unchanged to wait_noreap_copyout(); we would only need to shift it and add 0x7f if we were returning it in the user status field and that isn't used for any function that permits WNOWAIT. Signed-off-by: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29wait_task_stopped(): don't use task_pid_nr_ns() locklessOleg Nesterov
wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock, we can read an already freed memory. Use the cached pid_t value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Looks-good-to: Roland McGrath <roland@redhat.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-15wait_task_stopped: Check p->exit_state instead of TASK_TRACEDRoland McGrath
The original meaning of the old test (p->state > TASK_STOPPED) was "not dead", since it was before TASK_TRACED existed and before the state/exit_state split. It was a wrong correction in commit 14bf01bb0599c89fc7f426d20353b76e12555308 to make this test for TASK_TRACED instead. It should have been changed when TASK_TRACED was introducted and again when exit_state was introduced. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Kees Cook <kees@ubuntu.com> Acked-by: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline fork.c/exit.cAlexey Dobriyan
Save ~650 bytes here. add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658) function old new delta __copy_fs_struct - 202 +202 __put_fs_struct - 112 +112 __exit_fs - 58 +58 __exit_files - 58 +58 exit_files 58 2 -56 put_fs_struct 112 5 -107 exit_fs 161 2 -159 sys_unshare 774 590 -184 copy_process 4031 3840 -191 do_exit 1791 1597 -194 copy_fs_struct 202 5 -197 No difference in lmbench lat_proc tests on 2-way Opteron 246. Smaaaal degradation on UP P4 (within errors). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Use helpers to obtain task pid in printksPavel Emelyanov
The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Isolate the explicit usage of signal->pgrpPavel Emelyanov
The pgrp field is not used widely around the kernel so it is now marked as deprecated with appropriate comment. The initialization of INIT_SIGNALS is trimmed because a) they are set to 0 automatically; b) gcc cannot properly initialize two anonymous (the second one is the one with the session) unions. In this particular case to make it compile we'd have to add some field initialized right before the .pgrp. This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one (from Cedric), but for the pgrp field. Some progress report: We have to deprecate the pid, tgid, session and pgrp fields on struct task_struct and struct signal_struct. The session and pgrp are already deprecated. The tgid value is close to being such - the worst known usage in in fs/locks.c and audit code. The pid field deprecation is mainly blocked by numerous printk-s around the kernel that print the tsk->pid to log. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: changes to show virtual ids to userPavel Emelyanov
This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: destroy pid namespace on init's deathSukadev Bhattiprolu
Terminate all processes in a namespace when the reaper of the namespace is exiting. We do this by walking the pidmap of the namespace and sending SIGKILL to all processes. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: prepare proc_flust_task() to flush entries from multiple ↵Pavel Emelyanov
proc trees The first part is trivial - we just make the proc_flush_task() to operate on arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to it. The other change is more tricky: I moved the proc_flush_task() call in release_task() higher to address the following problem. When flushing task from many proc trees we need to know the set of ids (not just one pid) to find the dentries' names to flush. Thus we need to pass the task's pid to proc_flush_task() as struct pid is the only object that can provide all the pid numbers. But after __exit_signal() task has detached all his pids and this information is lost. This creates a tiny gap for proc_pid_lookup() to bring some dentries back to tree and keep them in hash (since pids are still alive before __exit_signal()) till the next shrink, but since proc_flush_task() does not provide a 100% guarantee that the dentries will be flushed, this is OK to do so. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: move exit_task_namespaces()Pavel Emelyanov
Make task release its namespaces after it has reparented all his children to child_reaper, but before it notifies its parent about its death. The reason to release namespaces after reparenting is that when task exits it may send a signal to its parent (SIGCHLD), but if the parent has already exited its namespaces there will be no way to decide what pid to dever to him - parent can be from different namespace. The reason to release namespace before notifying the parent it that when task sends a SIGCHLD to parent it can call wait() on this taks and release it. But releasing the mnt namespace implies dropping of all the mounts in the mnt namespace and NFS expects the task to have valid sighand pointer. Thanks to Oleg for pointing out some races that can apear and helping with patches and fixes. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: rework forget_original_parent()Oleg Nesterov
A pid namespace is a "view" of a particular set of tasks on the system. They work in a similar way to filesystem namespaces. A file (or a process) can be accessed in multiple namespaces, but it may have a different name in each. In a filesystem, this name might be /etc/passwd in one namespace, but /chroot/etc/passwd in another. For processes, a process may have pid 1234 in one namespace, but be pid 1 in another. This allows new pid namespaces to have basically arbitrary pids, and not have to worry about what pids exist in other namespaces. This is essential for checkpoint/restart where a restarted process's pid might collide with an existing process on the system's pid. In this particular implementation, pid namespaces have a parent-child relationship, just like processes. A process in a pid namespace may see all of the processes in the same namespace, as well as all of the processes in all of the namespaces which are children of its namespace. Processes may not, however, see others which are in their parent's namespace, but not in their own. The same goes for sibling namespaces. The know issue to be solved in the nearest future is signal handling in the namespace boundary. That is, currently the namespace's init is treated like an ordinary task that can be killed from within an namespace. Ideally, the signal handling by the namespace's init should have two sides: when signaling the init from its namespace, the init should look like a real init task, i.e. receive only those signals, that is explicitly wants to; when signaling the init from one of the parent namespaces, init should look like an ordinary task, i.e. receive any signal, only taking the general permissions into account. The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and we eventually came to almost the same implementation, which differed in some details. This set is based on Pavel's patches, but it includes comments and patches that from Sukadev. Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made valuable advises on how to make this set cleaner. This patch: We have to call exit_task_namespaces() only after the exiting task has reparented all his children and is sure that no other threads will reparent theirs for it. Why this is needed is explained in appropriate patch. This one only reworks the forget_original_parent() so that after calling this a task cannot be/become parent of any other task. We check PF_EXITING instead of ->exit_state while choosing the new parent. Note that tasklits_lock acts as a barrier, everyone who takes tasklist after us (when forget_original_parent() drops it) must see PF_EXITING. The other changes are just cleanups. They just move some code from exit_notify to forget_original_parent(). It is a bit silly to declare ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to forget_original_parent(), unlock-lock-unlock tasklist, and then use ptrace_dead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19whitespace fixes: task exit handlingDaniel Walker
Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe)Matthias Kaehlcke
kernel/exit.c: Convert list_for_each(_safe) to list_for_each_entry(_safe) in forget_original_parent(), exit_notify() and do_wait() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Make access to task's nsproxy lighterPavel Emelyanov
When someone wants to deal with some other taks's namespaces it has to lock the task and then to get the desired namespace if the one exists. This is slow on read-only paths and may be impossible in some cases. E.g. Oleg recently noticed a race between unshare() and the (sent for review in cgroups) pid namespaces - when the task notifies the parent it has to know the parent's namespace, but taking the task_lock() is impossible there - the code is under write locked tasklist lock. On the other hand switching the namespace on task (daemonize) and releasing the namespace (after the last task exit) is rather rare operation and we can sacrifice its speed to solve the issues above. The access to other task namespaces is proposed to be performed like this: rcu_read_lock(); nsproxy = task_nsproxy(tsk); if (nsproxy != NULL) { / * * work with the namespaces here * e.g. get the reference on one of them * / } / * * NULL task_nsproxy() means that this task is * almost dead (zombie) * / rcu_read_unlock(); This patch has passed the review by Eric and Oleg :) and, of course, tested. [clg@fr.ibm.com: fix unshare()] [ebiederm@xmission.com: Update get_net_ns_by_pid] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: define is_global_init() and is_container_init()Serge E. Hallyn
is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: rename child_reaper() functionSukadev Bhattiprolu
Rename the child_reaper() function to task_child_reaper() to be similar to other task_* functions and to distinguish the function from 'struct pid_namspace.child_reaper'. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: round up the APIPavel Emelianov
The set of functions process_session, task_session, process_group and task_pgrp is confusing, as the names can be mixed with each other when looking at the code for a long time. The proposals are to * equip the functions that return the integer with _nr suffix to represent that fact, * and to make all functions work with task (not process) by making the common prefix of the same name. For monotony the routines signal_session() and set_signal_session() are replaced with task_session_nr() and set_task_session(), especially since they are only used with the explicit task->signal dereference. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: make cpusets a client of cgroupsPaul Menage
Remove the filesystem support logic from the cpusets system and makes cpusets a cgroup subsystem The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get passed through to the cgroup filesystem with the appropriate options to emulate the old cpuset filesystem behaviour. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: add fork()/exit() hooksPaul Menage
This adds the necessary hooks to the fork() and exit() paths to ensure that new children inherit their parent's cgroup assignments, and that exiting processes release reference counts on their cgroups. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>