aboutsummaryrefslogtreecommitdiff
path: root/fs/proc/base.c
AgeCommit message (Collapse)Author
2007-07-31Fix leaks on /proc/{*/sched,sched_debug,timer_list,timer_stats}Alexey Dobriyan
On every open/close one struct seq_operations leaks. Kudos to /proc/slab_allocators. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19coredump masking: add an interface for core dump filterKawai, Hidehiro
This patch adds an interface to set/reset flags which determines each memory segment should be dumped or not when a core file is generated. /proc/<pid>/coredump_filter file is provided to access the flags. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19coredump masking: reimplementation of dumpable using two flagsKawai, Hidehiro
This patch changes mm_struct.dumpable to a pair of bit flags. set_dumpable() converts three-value dumpable to two flags and stores it into lower two bits of mm_struct.flags instead of mm_struct.dumpable. get_dumpable() behaves in the opposite way. [akpm@linux-foundation.org: export set_dumpable] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17kallsyms: make KSYM_NAME_LEN include space for trailing '\0'Tejun Heo
KSYM_NAME_LEN is peculiar in that it does not include the space for the trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating buffer. This is nonsense and error-prone. Moreover, when the caller forgets that it's very likely to subtly bite back by corrupting the stack because the last position of the buffer is always cleared to zero. This patch increments KSYM_NAME_LEN by one and updates code accordingly. * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro is fixed. * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together, MODULE_NAME_LEN was treated as if it didn't include space for the trailing '\0'. Fix it. Signed-off-by: Tejun Heo <htejun@gmail.com> Acked-by: Paulo Marques <pmarques@grupopie.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16move seccomp from /proc to a prctlAndrea Arcangeli
This reduces the memory footprint and it enforces that only the current task can enable seccomp on itself (this is a requirement for a strightforward [modulo preempt ;) ] TIF_NOTSC implementation). Signed-off-by: Andrea Arcangeli <andrea@cpushare.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16/proc/*/environ: wrong placing of ptrace_may_attach() checkAlexey Dobriyan
It's a bit dopey-looking and can permit a task to cause a pagefault in an mm which it doesn't have permission to read from. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-09sched: scheduler debugging, coreIngo Molnar
scheduler debugging core: implement /proc/sched_debug and /proc/<PID>/sched files for scheduler debugging. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: update delay-accounting to use CFS's precise statsBalbir Singh
update delay-accounting to use CFS's precise stats. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-05-08smaps: only define clear_refs for CONFIG_MMUDavid Rientjes
/proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we don't have any references to clear_refs_smap() in generic procfs code. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08procfs: use simple_read_from_buffer()Akinobu Mita
Cleanup using simple_read_from_buffer() in procfs. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Remove redundant check from proc_setattr()John Johansen
notify_change() already calls security_inode_setattr() before calling iop->setattr. Signed-off-by: Tony Jones <tonyj@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: John Johansen <jjohansen@suse.de> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Fix race between cat /proc/*/wchan and rmmod et alAlexey Dobriyan
kallsyms_lookup() can go iterating over modules list unprotected which is OK for emergency situations (oops), but not OK for regular stuff like /proc/*/wchan. Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol name into caller-supplied buffer or return -ERANGE. All copying is done with module_mutex held, so... Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Simplify kallsyms_lookup()Alexey Dobriyan
Several kallsyms_lookup() pass dummy arguments but only need, say, module's name. Make kallsyms_lookup() accept NULLs where possible. Also, makes picture clearer about what interfaces are needed for all symbol resolving business. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08header cleaning: don't include smp_lock.h when not usedRandy Dunlap
Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08/proc/*/oom_score oops re badnessAlexey Dobriyan
Eternal quest to make while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done while true; do modprobe xfs; rmmod xfs; done work reliably continues and now kernel oopses in the following way: BUG: unable to handle ... at virtual address 6b6b6b6b EIP is at badness process: cat proc_oom_score proc_info_read sys_fstat64 vfs_read proc_info_read sys_read Failing code is prefetch hidden in list_for_each_entry() in badness(). badness() is reachable from two points. One is proc_oom_score, another is out_of_memory() => select_bad_process() => badness(). Second path grabs tasklist_lock, while first doesn't. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08add file position info to procMiklos Szeredi
Add support for finding out the current file position, open flags and possibly other info in the future. These new entries are added: /proc/PID/fdinfo/FD /proc/PID/task/TID/fdinfo/FD For each fd the information is provided in the following format: pos: 1234 flags: 0100002 [bunk@stusta.de: make struct proc_fdinfo_file_operations static] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08procfs: reorder struct pid_dentry to save space on 64bit archs, and constify ↵Eric Dumazet
them Change the order of fields of struct pid_entry (file fs/proc/base.c) in order to avoid a hole on 64bit archs. (8 bytes saved per object) Also change all pid_entry arrays to be const qualified, to make clear they must not be modified. Before (on x86_64) : # size fs/proc/base.o text data bss dec hex filename 15549 2192 0 17741 454d fs/proc/base.o After : # size fs/proc/base.o text data bss dec hex filename 17229 176 0 17405 43fd fs/proc/base.o Thats 336 bytes saved on kernel size on x86_64 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08proc: maps protectionKees Cook
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive information about the memory location and usage of processes. Issues: - maps should not be world-readable, especially if programs expect any kind of ASLR protection from local attackers. - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc check the maps when %n is in a *printf call, and a setuid(getuid()) process wouldn't be able to read its own maps file. (For reference see http://lkml.org/lkml/2006/1/22/150) - a system-wide toggle is needed to allow prior behavior in the case of non-root applications that depend on access to the maps contents. This change implements a check using "ptrace_may_attach" before allowing access to read the maps contents. To control this protection, the new knob /proc/sys/kernel/maps_protect has been added, with corresponding updates to the procfs documentation. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: New sysctl numbers are old hat] Signed-off-by: Kees Cook <kees@outflux.net> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Allow access to /proc/$PID/fd after setuid()Alexey Dobriyan
/proc/$PID/fd has r-x------ permissions, so if process does setuid(), it will not be able to access /proc/*/fd/. This breaks fstatat() emulation in glibc. open("foo", O_RDONLY|O_DIRECTORY) = 4 setuid32(65534) = 0 stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied) Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Kirill Korotaev <dev@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07smaps: add clear_refs file to clear referenceDavid Rientjes
Adds /proc/pid/clear_refs. When any non-zero number is written to this file, pte_mkold() and ClearPageReferenced() is called for each pte and its corresponding page, respectively, in that task's VMAs. This file is only writable by the user who owns the task. It is now possible to measure _approximately_ how much memory a task is using by clearing the reference bits with echo 1 > /proc/pid/clear_refs and checking the reference count for each VMA from the /proc/pid/smaps output at a measured time interval. For example, to observe the approximate change in memory footprint for a task, write a script that clears the references (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and extracts the size in kB. Add the sizes for each VMA together for the total referenced footprint. Moments later, repeat the process and observe the difference. For example, using an efficient Mozilla: accumulated time referenced memory ---------------- ----------------- 0 s 408 kB 1 s 408 kB 2 s 556 kB 3 s 1028 kB 4 s 872 kB 5 s 1956 kB 6 s 416 kB 7 s 1560 kB 8 s 2336 kB 9 s 1044 kB 10 s 416 kB This is a valuable tool to get an approximate measurement of the memory footprint for a task. Cc: Hugh Dickins <hugh@veritas.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> [akpm@linux-foundation.org: build fixes] [mpm@selenic.com: rename for_each_pmd] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-14[PATCH] sanitize security_getprocattr() APIAl Viro
have it return the buffer it had allocated Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20[PATCH] Missing __user in pointer referenced within copy_from_userGlauber de Oliveira Costa
Pointers to user data should be marked with a __user hint. This one is missing. Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] mark struct inode_operations const 3Arjan van de Ven
Many struct inode_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] mark struct file_operations const 6Arjan van de Ven
Many struct file_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] ifdef ->rchar, ->wchar, ->syscr, ->syscw from task_structAlexey Dobriyan
They are fat: 4x8 bytes in task_struct. They are uncoditionally updated in every fork, read, write and sendfile. They are used only if you have some "extended acct fields feature". And please, please, please, read(2) knows about bytes, not characters, why it is called "rchar"? Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-01[PATCH] procfs: Fix listing of /proc/NOT_A_TGID/taskGuillaume Chazarain
Listing /proc/PID/task were PID is not a TGID should not result in duplicated entries. [g ~]$ pidof thunderbird-bin 2751 [g ~]$ ls /proc/2751/task 2751 2770 2771 2824 2826 2834 2835 2851 2853 [g ~]$ ls /proc/2770/task 2751 2770 2771 2824 2826 2834 2835 2851 2853 2770 2771 2824 2826 2834 2835 2851 2853 [g ~]$ Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] Fix NULL ->nsproxy dereference in /proc/*/mountsAlexey Dobriyan
/proc/*/mounstats was fixed, all right, but... To reproduce: while true; do find /proc -type f 2>/dev/null | xargs cat 1>/dev/null 2>/dev/null; done BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000c printing eip: c01754df *pde = 00000000 Oops: 0000 [#28] Modules linked in: af_packet ohci_hcd e1000 ehci_hcd uhci_hcd usbcore xfs CPU: 0 EIP: 0060:[<c01754df>] Not tainted VLI EFLAGS: 00010286 (2.6.20-rc5 #1) EIP is at mounts_open+0x1c/0xac eax: 00000000 ebx: d5898ac0 ecx: d1d27b18 edx: d1d27a50 esi: e6083e10 edi: d3c87f38 ebp: d5898ac0 esp: d3c87ef0 ds: 007b es: 007b ss: 0068 Process cat (pid: 18071, ti=d3c86000 task=f7d5f070 task.ti=d3c86000) Stack: d5898ac0 e6083e10 d3c87f38 c01754c3 c0147c91 c18c52c0 d343f314 d5898ac0 00008000 d3c87f38 ffffff9c c0147e09 d5898ac0 00000000 00000000 c0147e4b 00000000 d3c87f38 d343f314 c18c52c0 c015e53e 00001000 08051000 00000101 Call Trace: [<c01754c3>] mounts_open+0x0/0xac [<c0147c91>] __dentry_open+0xa1/0x18c [<c0147e09>] nameidata_to_filp+0x31/0x3a [<c0147e4b>] do_filp_open+0x39/0x40 [<c015e53e>] seq_read+0x128/0x2aa [<c0147e8c>] do_sys_open+0x3a/0x6d [<c0147efa>] sys_open+0x1c/0x20 [<c0102b76>] sysenter_past_esp+0x5f/0x85 [<c02a0033>] unix_stream_recvmsg+0x3bf/0x4bf ======================= Code: 5d c3 89 d8 e8 06 e0 f9 ff eb bd 0f 0b eb fe 55 57 56 53 89 d5 8b 40 f0 31 d2 e8 02 c1 fa ff 89 c2 85 c0 74 5c 8b 80 48 04 00 00 <8b> 58 0c 85 db 74 02 ff 03 ff 4a 08 0f 94 c0 84 c0 75 74 85 db EIP: [<c01754df>] mounts_open+0x1c/0xac SS:ESP 0068:d3c87ef0 A race with do_exit()'s call to exit_namespaces(). Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-10[PATCH] io-accounting: report in procfsAndrew Morton
Add a simple /proc/pid/io to show the IO accounting fields. Maybe this shouldn't be merged in mainline - the preferred reporting channel is taskstats. But given the poor state of our userspace support for taskstats, this is useful for developer-testing, at least. And it improves the changes that the procps developers will wire it up into top(1). Opinions are sought. The patch also wires up the existing IO-accounting fields. It's a bit racy on 32-bit machines: if process A reads process B's /proc/pid/io while process B is updating one of those 64-bit counters, process A could see an intermediate result. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08[PATCH] fault injection: process filtering for fault-injection capabilitiesAkinobu Mita
This patch provides process filtering feature. The process filter allows failing only permitted processes by /proc/<pid>/make-it-fail Please see the example that demostrates how to inject slab allocation failures into module init/cleanup code in Documentation/fault-injection/fault-injection.txt Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08[PATCH] rename struct namespace to struct mnt_namespaceKirill Korotaev
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with other namespaces being developped for the containers : pid, uts, ipc, etc. 'namespace' variables and attributes are also renamed to 'mnt_ns' Signed-off-by: Kirill Korotaev <dev@sw.ru> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08[PATCH] proc: change uses of f_{dentry, vfsmnt} to use f_pathJosef "Jeff" Sipek
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the proc filesystem code. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] make fs/proc/base.c:proc_pid_instantiate() staticAdrian Bunk
Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] Allow user processes to raise their oom_adj valueGuillem Jover
Currently a user process cannot rise its own oom_adj value (i.e. unprotecting itself from the OOM killer). As this value is stored in the task structure it gets inherited and the unprivileged childs will be unable to rise it. The EPERM will be handled by the generic proc fs layer, as only processes with the proper caps or the owner of the process will be able to write to the file. So we allow only the processes with CAP_SYS_RESOURCE to lower the value, otherwise it will get an EACCES which seems more appropriate than EPERM. Signed-off-by: Guillem Jover <guillem.jover@nokia.com> Acked-by: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-25[PATCH] mounstats NULL pointer dereferenceVasily Tarasov
OpenVZ developers team has encountered the following problem in 2.6.19-rc6 kernel. After some seconds of running script while [[ 1 ]] do find /proc -name mountstats | xargs cat done this Oops appears: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010 printing eip: c01a6b70 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables x_tables parport_pc lp parport sunrpc af_packet thermal processor fan button battery asus_acpi ac ohci_hcd ehci_hcd usbcore i2c_nforce2 i2c_core tg3 floppy pata_amd ide_cd cdrom sata_nv libata CPU: 1 EIP: 0060:[<c01a6b70>] Not tainted VLI EFLAGS: 00010246 (2.6.19-rc6 #2) EIP is at mountstats_open+0x70/0xf0 eax: 00000000 ebx: e6247030 ecx: e62470f8 edx: 00000000 esi: 00000000 edi: c01a6b00 ebp: c33b83c0 esp: f4105eb4 ds: 007b es: 007b ss: 0068 Process cat (pid: 6044, ti=f4105000 task=f4104a70 task.ti=f4105000) Stack: c33b83c0 c04ee940 f46a4a80 c33b83c0 e4df31b4 c01a6b00 f4105000 c0169231 e4df31b4 c33b83c0 c33b83c0 f4105f20 00000003 f4105000 c0169445 f2503cf0 f7f8c4c0 00008000 c33b83c0 00000000 00008000 c0169350 f4105f20 00008000 Call Trace: [<c01a6b00>] mountstats_open+0x0/0xf0 [<c0169231>] __dentry_open+0x181/0x250 [<c0169445>] nameidata_to_filp+0x35/0x50 [<c0169350>] do_filp_open+0x50/0x60 [<c01873d6>] seq_read+0xc6/0x300 [<c0169511>] get_unused_fd+0x31/0xc0 [<c01696d3>] do_sys_open+0x63/0x110 [<c01697a7>] sys_open+0x27/0x30 [<c01030bd>] sysenter_past_esp+0x56/0x79 ======================= Code: 45 74 8b 54 24 20 89 44 24 08 8b 42 f0 31 d2 e8 47 cb f8 ff 85 c0 89 c3 74 51 8d 80 a0 04 00 00 e8 46 06 2c 00 8b 83 48 04 00 00 <8b> 78 10 85 ff 74 03 f0 ff 07 b0 01 86 83 a0 04 00 00 f0 ff 4b EIP: [<c01a6b70>] mountstats_open+0x70/0xf0 SS:ESP 0068:f4105eb4 The problem is that task->nsproxy can be equal NULL for some time during task exit. This patch fixes the BUG. Signed-off-by: Vasily Tarasov <vtaras@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20[PATCH] OOM killer meets userspace headersAlexey Dobriyan
Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17[PATCH] PROC_NUMBUF is wrongAndrew Morton
Actually, the decimal representation of a 32-bit signed number can take 12 bytes, including the \0. And then some code adds a \n as well, so let's give it 13 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] introduce get_task_pid() to fix unsafe get_pid()Oleg Nesterov
proc_pid_make_inode: ei->pid = get_pid(task_pid(task)); I think this is not safe. get_pid() can be preempted after checking "pid != NULL". Then the task exits, does detach_pid(), and RCU frees the pid. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: comment what proc_fill_cache doesEric W. Biederman
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: remove the useless SMP-safe comments from /procEric W. Biederman
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: remove trailing blank entry from pid_entry arraysEric W. Biederman
It was pointed out that since I am taking ARRAY_SIZE anyway the trailing empty entry is silly and just wastes space. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: properly compute TGID_OFFSETEric W. Biederman
The value doesn't change but this ensures I will have the proper value when other files are added to proc_base_stuff. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: Use pid_task instead of open coding itEric W. Biederman
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: Merge proc_tid_attr and proc_tgid_attrEric W. Biederman
The implementation is exactly the same and there is currently nothing to distinguish proc_tid_attr, and proc_tgid_attr. So it is pointless to have two separate implementations. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: Remove the hard coded inode numbersEric W. Biederman
The hard coded inode numbers in proc currently limit its maintainability, its flexibility, and what can be done with the rest of system. /proc limits pid-max to 32768 on 32 bit systems it limits fd-max to 32768 on all systems, and placing the pid in the inode number really gets in the way of implementing subdirectories of per process information. Ever since people started adding to the middle of the file type enumeration we haven't been maintaing the historical inode numbers, all we have really succeeded in doing is keeping the pid in the proc inode number. The pid is already available in the directory name so no information is lost removing it from the inode number. So if something in user space cares if we remove the inode number from the /proc inode it is almost certainly broken. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: Factor out an instantiate method from every lookup methodEric W. Biederman
To remove the hard coded proc inode numbers it is necessary to be able to create the proc inodes during readdir. The instantiate methods are the subset of lookup that is needed to accomplish that. This first step just splits the lookup methods into 2 functions. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: Make the generation of the self symlink table drivenEric W. Biederman
This patch generalizes the concept of files in /proc that are related to processes but live in the root directory of /proc Ideally this would reuse infrastructure from the rest of the process specific parts of proc but unfortunately security_task_to_inode must not be called on files that are not strictly per process. security_task_to_inode really needs to be reexamined as the security label can change in important places that we are not currently catching, but I'm not certain that simplifies this problem. By at least matching the structure of the rest of proc we get more idiom reuse and it becomes easier to spot problems in the way things are put together. Later things like /proc/mounts are likely to be moved into proc_base as well. If union mounts are ever supported we may be able to make /proc a union mount, and properly split it into 2 filesystems. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] namespaces: incorporate fs namespace into nsproxySerge E. Hallyn
This moves the mount namespace into the nsproxy. The mount namespace count now refers to the number of nsproxies point to it, rather than the number of tasks. As a result, the unshare_namespace() function in kernel/fork.c no longer checks whether it is being shared. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: modify proc_pident_lookup to be completely table drivenEric W. Biederman
Currently proc_pident_lookup gets the names and types from a table and then has a huge switch statement to get the inode and file operations it needs. That is silly and is becoming increasingly hard to maintain so I just put all of the information in the table. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: reorder the functions in base.cEric W. Biederman
There were enough changes in my last round of cleaning up proc I had to break up the patch series into smaller chunks, and my last chunk never got resent. This patchset gives proc dynamic inode numbers (the static inode numbers were a pain to maintain and prevent all kinds of things), and removes the horrible switch statements that had to be kept in sync with everything else. Being fully table driver takes us 90% of the way of being able to register new process specific attributes in proc. This patch: Group the functions by what they implement instead of by type of operation. As it existed base.c was quickly approaching the point where it could not be followed. No functionality or code changes asside from adding/removing forward declartions are implemented in this patch. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02[PATCH] proc: readdir race fix (take 3)Eric W. Biederman
The problem: An opendir, readdir, closedir sequence can fail to report process ids that are continually in use throughout the sequence of system calls. For this race to trigger the process that proc_pid_readdir stops at must exit before readdir is called again. This can cause ps to fail to report processes, and it is in violation of posix guarantees and normal application expectations with respect to readdir. Currently there is no way to work around this problem in user space short of providing a gargantuan buffer to user space so the directory read all happens in on system call. This patch implements the normal directory semantics for proc, that guarantee that a directory entry that is neither created nor destroyed while reading the directory entry will be returned. For directory that are either created or destroyed during the readdir you may or may not see them. Furthermore you may seek to a directory offset you have previously seen. These are the guarantee that ext[23] provides and that posix requires, and more importantly that user space expects. Plus it is a simple semantic to implement reliable service. It is just a matter of calling readdir a second time if you are wondering if something new has show up. These better semantics are implemented by scanning through the pids in numerical order and by making the file offset a pid plus a fixed offset. The pid scan happens on the pid bitmap, which when you look at it is remarkably efficient for a brute force algorithm. Given that a typical cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There are only 40 cache lines for the entire 32K pid space. A typical system will have 100 pids or more so this is actually fewer cache lines we have to look at to scan a linked list, and the worst case of having to scan the entire pid bitmap is pretty reasonable. If we need something more efficient we can go to a more efficient data structure for indexing the pids, but for now what we have should be sufficient. In addition this takes no additional locks and is actually less code than what we are doing now. Also another very subtle bug in this area has been fixed. It is possible to catch a task in the middle of de_thread where a thread is assuming the thread of it's thread group leader. This patch carefully handles that case so if we hit it we don't fail to return the pid, that is undergoing the de_thread dance. Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for providing the first fix, pointing this out and working on it. [oleg@tv-sign.ru: fix it] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jean Delvare <jdelvare@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>