Age | Commit message (Collapse) | Author |
|
803bf5ec259941936262d10ecc84511b76a20921 ("fs/exec.c: restrict initial
stack space expansion to rlimit") attempts to limit the initial stack to
20*PAGE_SIZE. Unfortunately, in attempting ensure the stack is not
reduced in size, we ended up not changing the stack at all.
This size reduction check is not necessary as the expand_stack call does
this already.
This caused a regression in UML resulting in most guest processes being
killed.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jouni Malinen <j@w1.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.
This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
variable length argument support") and unmasked by
fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
to return errors").
This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start. This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.
To test, do:
'ulimit -s 15; ls'
before and after the patch is applied. Before it's applied, 'ls' should
be killed. After the patch is applied, 'ls' should no longer be killed.
A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.
4K pages should be fine to test with.
[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 221af7f87b9 ("Split 'flush_old_exec' into two functions") split
the function at the point of no return - ie right where there were no
more error cases to check. That made sense from a technical standpoint,
but when we then also combined it with the actual personality setting
going in between flush_old_exec() and setup_new_exec(), it needs to be a
bit more careful.
In particular, we need to make sure that we really flush the old
personality bits in the 'flush' stage, rather than later in the 'setup'
stage, since otherwise we might be flushing the _new_ personality state
that we're just setting up.
So this moves the flags and personality flushing (and 'flush_thread()',
which is the arch-specific function that generally resets lazy FP state
etc) of the old process into flush_old_exec(), so that it doesn't affect
any state that execve() is setting up for the new process environment.
This was reported by Michal Simek as breaking his Microblaze qemu
environment.
Reported-and-tested-by: Michal Simek <michal.simek@petalogix.com>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce coredump parameter data structure (struct coredump_params) to
simplify binfmt->core_dump() arguments.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Thanks to Roland who pointed out de_thread() issues.
Currently we add sub-threads to ->real_parent->children list. This buys
nothing but slows down do_wait().
With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.
Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).
This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor.
Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.
However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.
This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Mike Fulton <fultonm@ca.ibm.com>
Cc: Sean Foley <Sean_Foley@ca.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.
Also remove a now duplicated exit path and branch to out_unlock instead.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Because the binfmt is not different between threads in the same process,
it can be moved from task_struct to mm_struct. And binfmt moudle is
handled per mm_struct instead of task_struct.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get(). In this case
set_binfmt()->try_module_get() fails but since none of the callers
check the returned error, the task will run with the wrong old
->binfmt.
The proper fix should change all ->load_binary() methods, but we can
rely on fact that the caller must hold a reference to binfmt->module
and use __module_get() which never fails.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata
for a crashing process in their /proc/<pid> directory. this is racy
however, since do_coredump in the kernel doesn't wait for the user space
process to complete before it reaps the crashing process. This patch
corrects that. Allowing the kernel to wait for the user space process to
complete before cleaning up the crashing process. This is a bit tricky to
do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this,
and this is the one we've come up with. We add ourselves as a pipe reader
(to prevent premature cleanup of the pipe_inode_info), and remove
ourselves as a writer (to provide an EOF condition to the writer in user
space), then we iterate until the user space process exits (which we
detect by pipe->readers == 1, hence the > 1 check in the loop). When we
exit the loop, we restore the proper reader/writer values, then we return
and let filp_close in do_coredump clean up the pipe data properly.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce core pipe limiting sysctl.
Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.
If the pipe reader specified in core_pattern is poorly written, we can
have lots of ourstandig resources and processes in the system.
This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Change how we detect recursive dumps.
Currently we have a mechanism by which we try to compare pathnames of the
crashing process to the core_pattern path. This is broken for a dozen
reasons, and just doesn't work in any sort of robust way.
I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
set RLIMIT_CORE to zero, we don't write out core files for any process
with that particular limit set. It the core_pattern is a pipe, any
non-zero limit is translated to RLIM_INFINITY.
This allows complete dumps to be captured, but prevents infinite recursion
in the event that the core_pattern process itself crashes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.
Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.
There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.
A sample output of /proc/<pid>/task/<tid>/maps looks like:
08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.
A sample output of /proc/self/status looks like:
Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB
I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.
[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
mark. This struct is filled as a parameter to getrusage syscall.
->ru_maxrss value is set to KBs which is the way it is done in BSD
systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
which seems to be incorrect behavior. Maintainer of this util was
notified by me with the patch which corrects it and cc'ed.
To make this happen we extend struct signal_struct by two fields. The
first one is ->maxrss which we use to store rss hiwater of the task. The
second one is ->cmaxrss which we use to store highest rss hiwater of all
task childs. These values are used in k_getrusage() to actually fill
->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm
struct exists.
Note:
exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
it is intetionally behavior. *BSD getrusage have exec() inheriting.
test programs
========================================================
getrusage.c
===========
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>
#include "common.h"
#define err(str) perror(str), exit(1)
int main(int argc, char** argv)
{
int status;
printf("allocate 100MB\n");
consume(100);
printf("testcase1: fork inherit? \n");
printf(" expect: initial.self ~= child.self\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
show_rusage("fork child");
_exit(0);
}
printf("\n");
printf("testcase2: fork inherit? (cont.) \n");
printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
show_rusage("child");
_exit(0);
}
printf("\n");
printf("testcase3: fork + malloc \n");
printf(" expect: child.self ~= initial.self + 50MB\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
printf("allocate +50MB\n");
consume(50);
show_rusage("fork child");
_exit(0);
}
printf("\n");
printf("testcase4: grandchild maxrss\n");
printf(" expect: post_wait.children ~= 300MB\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
show_rusage("post_wait");
} else {
system("./child -n 0 -g 300");
_exit(0);
}
printf("\n");
printf("testcase5: zombie\n");
printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
show_rusage("initial");
if (__fork()) {
sleep(1); /* children become zombie */
show_rusage("pre_wait");
wait(&status);
show_rusage("post_wait");
} else {
system("./child -n 400");
_exit(0);
}
printf("\n");
printf("testcase6: SIG_IGN\n");
printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
show_rusage("initial");
signal(SIGCHLD, SIG_IGN);
if (__fork()) {
sleep(1); /* children become zombie */
show_rusage("after_zombie");
} else {
system("./child -n 500");
_exit(0);
}
printf("\n");
signal(SIGCHLD, SIG_DFL);
printf("testcase7: exec (without fork) \n");
printf(" expect: initial ~= exec \n");
show_rusage("initial");
execl("./child", "child", "-v", NULL);
return 0;
}
child.c
=======
#include <sys/types.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "common.h"
int main(int argc, char** argv)
{
int status;
int c;
long consume_size = 0;
long grandchild_consume_size = 0;
int show = 0;
while ((c = getopt(argc, argv, "n:g:v")) != -1) {
switch (c) {
case 'n':
consume_size = atol(optarg);
break;
case 'v':
show = 1;
break;
case 'g':
grandchild_consume_size = atol(optarg);
break;
default:
break;
}
}
if (show)
show_rusage("exec");
if (consume_size) {
printf("child alloc %ldMB\n", consume_size);
consume(consume_size);
}
if (grandchild_consume_size) {
if (fork()) {
wait(&status);
} else {
printf("grandchild alloc %ldMB\n", grandchild_consume_size);
consume(grandchild_consume_size);
exit(0);
}
}
return 0;
}
common.c
========
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>
#include "common.h"
#define err(str) perror(str), exit(1)
void show_rusage(char *prefix)
{
int err, err2;
struct rusage rusage_self;
struct rusage rusage_children;
printf("%s: ", prefix);
err = getrusage(RUSAGE_SELF, &rusage_self);
if (!err)
printf("self %ld ", rusage_self.ru_maxrss);
err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
if (!err2)
printf("children %ld ", rusage_children.ru_maxrss);
printf("\n");
}
/* Some buggy OS need this worthless CPU waste. */
void make_pagefault(void)
{
void *addr;
int size = getpagesize();
int i;
for (i=0; i<1000; i++) {
addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
if (addr == MAP_FAILED)
err("make_pagefault");
memset(addr, 0, size);
munmap(addr, size);
}
}
void consume(int mega)
{
size_t sz = mega * 1024 * 1024;
void *ptr;
ptr = malloc(sz);
memset(ptr, 0, sz);
make_pagefault();
}
pid_t __fork(void)
{
pid_t pid;
pid = fork();
make_pagefault();
return pid;
}
common.h
========
void show_rusage(char *prefix);
void make_pagefault(void);
void consume(int mega);
pid_t __fork(void);
FreeBSD result (expected result)
========================================================
allocate 100MB
testcase1: fork inherit?
expect: initial.self ~= child.self
initial: self 103492 children 0
fork child: self 103540 children 0
testcase2: fork inherit? (cont.)
expect: initial.children ~= 100MB, but child.children = 0
initial: self 103540 children 103540
child: self 103564 children 0
testcase3: fork + malloc
expect: child.self ~= initial.self + 50MB
initial: self 103564 children 103564
allocate +50MB
fork child: self 154860 children 0
testcase4: grandchild maxrss
expect: post_wait.children ~= 300MB
initial: self 103564 children 154860
grandchild alloc 300MB
post_wait: self 103564 children 308720
testcase5: zombie
expect: pre_wait ~= initial, IOW the zombie process is not accounted.
post_wait ~= 400MB, IOW wait() collect child's max_rss.
initial: self 103564 children 308720
child alloc 400MB
pre_wait: self 103564 children 308720
post_wait: self 103564 children 411312
testcase6: SIG_IGN
expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
initial: self 103564 children 411312
child alloc 500MB
after_zombie: self 103624 children 411312
testcase7: exec (without fork)
expect: initial ~= exec
initial: self 103624 children 411312
exec: self 103624 children 411312
Linux result (actual test result)
========================================================
allocate 100MB
testcase1: fork inherit?
expect: initial.self ~= child.self
initial: self 102848 children 0
fork child: self 102572 children 0
testcase2: fork inherit? (cont.)
expect: initial.children ~= 100MB, but child.children = 0
initial: self 102876 children 102644
child: self 102572 children 0
testcase3: fork + malloc
expect: child.self ~= initial.self + 50MB
initial: self 102876 children 102644
allocate +50MB
fork child: self 153804 children 0
testcase4: grandchild maxrss
expect: post_wait.children ~= 300MB
initial: self 102876 children 153864
grandchild alloc 300MB
post_wait: self 102876 children 307536
testcase5: zombie
expect: pre_wait ~= initial, IOW the zombie process is not accounted.
post_wait ~= 400MB, IOW wait() collect child's max_rss.
initial: self 102876 children 307536
child alloc 400MB
pre_wait: self 102876 children 307536
post_wait: self 102876 children 410076
testcase6: SIG_IGN
expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
initial: self 102876 children 410076
child alloc 500MB
after_zombie: self 102880 children 410076
testcase7: exec (without fork)
expect: initial ~= exec
initial: self 102880 children 410076
exec: self 102880 children 410076
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Tom Horsley reports that his debugger hangs when it tries to read
/proc/pid_of_tracee/maps, this happens since
"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
04b836cbf19e885f8366bccb2e4b0474346c02d
commit in 2.6.31.
But the root of the problem lies in the fact that do_execve() path calls
tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
another task doing PTRACE_ATTACH should not hang until it is killed or the
tracee resumes.
With this patch do_execve() does not use ->cred_guard_mutex directly and
we do not hold it throughout, instead:
- introduce prepare_bprm_creds() helper, it locks the mutex
and calls prepare_exec_creds() to initialize bprm->cred.
- install_exec_creds() drops the mutex after commit_creds(),
and thus before tracehook_report_exec()->ptrace_stop().
or, if exec fails,
free_bprm() drops this mutex when bprm->cred != NULL which
indicates install_exec_creds() was not called.
Reported-by: Tom Horsley <tom.horsley@att.net>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
vfs_read() offset is defined as loff_t, but kernel_read()
offset is only defined as unsigned long. Redefine
kernel_read() offset as loff_t.
Cc: stable@kernel.org
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
do_execve() and ptrace_attach() return -EINTR if
mutex_lock_interruptible(->cred_guard_mutex) fails.
This is not right, change the code to return ERESTARTNOINTR.
Perhaps we should also change proc_pid_attr_write().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...
|
|
Conflicts:
fs/exec.c
Removed IMA changes (the IMA checks are now performed via may_open()).
Signed-off-by: James Morris <jmorris@namei.org>
|
|
- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Merge reason: this branch was on an -rc4 base, merge it up to -rc6
to get the latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Rename cred_exec_mutex to reflect that it's a guard against foreign
intervention on a process's credential state, such as is made by ptrace(). The
attachment of a debugger to a process affects execve()'s calculation of the new
credential state - _and_ also setprocattr()'s calculation of that state.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
... and make path_lookup_open() static
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
parts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This fixes the problem introduced by commit 3bfacef412 (get rid of
special-casing the /sbin/loader on alpha): osf/1 ecoff binary segfaults
when binfmt_aout built as module. That happens because aout binary
handler gets on the top of the binfmt list due to late registration, and
kernel attempts to execute the binary without preparatory work that must
be done by binfmt_loader.
Fixed by changing the registration order of the default binfmt handlers
using list_add_tail() and introducing insert_binfmt() function which
places new handler on the top of the binfmt list. This might be generally
useful for installing arch-specific frontends for default handlers or just
for overriding them.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
the latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
write_lock(¤t->fs->lock) guarantees we can't wrongly miss
LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
instead of ->siglock to iterate over the sub-threads. We must see
all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
takes fs->lock too.
With or without this patch we can miss the freshly cloned thread
and set LSM_UNSAFE_SHARE, we don't care.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
[ Fixed lock/unlock typo - Hugh ]
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
unconditionally. This is wrong if we race with our sub-thread which
also does do_execve:
Two threads T1 and T2 and another process P, all share the same
->fs.
T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
->fs is shared, we set LSM_UNSAFE but not ->in_exec.
P exits and decrements fs->users.
T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
shared, we set fs->in_exec.
T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
return to the user-space.
T1 does clone(CLONE_FS /* without CLONE_THREAD */).
T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
another process.
Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
do_execve() to clear ->in_exec depending on res.
When do_execve() suceeds, it is safe to clear ->in_exec unconditionally.
It can be set only if we don't share ->fs with another process, and since
we already killed all sub-threads either ->in_exec == 0 or we are the
only user of this ->fs.
Also, we do not need fs->lock to clear fs->in_exec.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Similar to the mmap data stream, add one that tracks the task COMM field,
so that the userspace reporting knows what to call a task.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.127422406@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Merge reason: we have gathered quite a few conflicts, need to merge upstream
Conflicts:
arch/powerpc/kernel/Makefile
arch/x86/ia32/ia32entry.S
arch/x86/include/asm/hardirq.h
arch/x86/include/asm/unistd_32.h
arch/x86/include/asm/unistd_64.h
arch/x86/kernel/cpu/common.c
arch/x86/kernel/irq.c
arch/x86/kernel/syscall_table_32.S
arch/x86/mm/iomap_32.c
include/linux/sched.h
kernel/Makefile
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Don't pull it in sched.h; very few files actually need it and those
can include directly. sched.h itself only needs forward declaration
of struct fs_struct;
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... since we'll unshare sighand anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
* all changes of current->fs are done under task_lock and write_lock of
old fs->lock
* refcount is not atomic anymore (same protection)
* its decrements are done when removing reference from current; at the
same time we decide whether to free it.
* put_fs_struct() is gone
* new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do
execve() and only subthreads share fs_struct. Cleared when finishing exec
(success and failure alike). Makes CLONE_FS fail with -EAGAIN if set.
* check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
is in progress.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.
Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.
There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.
[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]
Reported-by: Joe Malicki <jmalicki@metacarta.com>
Narrowed-down-by: Michael Itz <mitz@metacarta.com>
Tested-by: Joe Malicki <jmalicki@metacarta.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
This patch allows LSM modules to determine whether current process is in an
execve operation or not so that they can behave differently while an execve
operation is in progress.
This patch is needed by TOMOYO. Please see another patch titled "LSM adapter
functions." for backgrounds.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Conflicts:
arch/x86/kernel/setup_percpu.c
arch/x86/mm/fault.c
drivers/acpi/processor_idle.c
kernel/irq/handle.c
|
|
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Conflicts:
fs/namei.c
Manually merged per:
diff --cc fs/namei.c
index 734f2b5,bbc15c2..0000000
--- a/fs/namei.c
+++ b/fs/namei.c
@@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
nd->flags |= LOOKUP_CONTINUE;
err = exec_permission_lite(inode);
if (err == -EAGAIN)
- err = vfs_permission(nd, MAY_EXEC);
+ err = inode_permission(nd->path.dentry->d_inode,
+ MAY_EXEC);
+ if (!err)
+ err = ima_path_check(&nd->path, MAY_EXEC);
if (err)
break;
@@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
flag &= ~O_TRUNC;
}
- error = vfs_permission(nd, acc_mode);
+ error = inode_permission(inode, acc_mode);
if (error)
return error;
+
- error = ima_path_check(&nd->path,
++ error = ima_path_check(path,
+ acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
+ if (error)
+ return error;
/*
* An append-only file must be opened in append mode for writing.
*/
Signed-off-by: James Morris <jmorris@namei.org>
|
|
This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Conflicts:
include/linux/syscalls.h
|
|
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Conflicts:
include/linux/kernel_stat.h
|
|
No one cares do_coredump()'s return value, and also it seems that it
is also not necessary. So make it void.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
do_coredump() accesses helper_argv[0] without checking helper_argv !=
NULL. This can happen if page allocation failed.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Untangle the error unwinding in this function, saving a test of local
variable `vma'.
Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|