aboutsummaryrefslogtreecommitdiff
path: root/kernel/sysctl.c
AgeCommit message (Collapse)Author
2007-10-13minimal build fixes for uml (fallout from x86 merge)Al Viro
a) include/asm-um/arch can't just point to include/asm-$(SUBARCH) now b) arch/{i386,x86_64}/crypto are merged now c) subarch-obj needed changes d) cpufeature_64.h should pull "cpufeature_32.h", not <asm/cpufeature_32.h> since it can be included from asm-um/cpufeature.h e) in case of uml-i386 we need CONFIG_X86_32 for make and gcc, but not for Kconfig f) sysctl.c shouldn't do vdso_enabled for uml-i386 (actually, that one should be registered from corresponding arch/*/kernel/*, with ifdef going away; that's a separate patch, though). With that and with Stephen's patch ("[PATCH net-2.6] uml: hard_header fix") we have uml allmodconfig building both on i386 and amd64. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-12[POWERPC] Implement logging of unhandled signalsOlof Johansson
Implement show_unhandled_signals sysctl + support to print when a process is killed due to unhandled signals just as i386 and x86_64 does. Default to having it off, unlike x86 that defaults on. Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-09-19sched: add /proc/sys/kernel/sched_compat_yieldIngo Molnar
add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-08-25sched: cleanup, sched_granularity -> sched_min_granularityIngo Molnar
due to adaptive granularity scheduling the role of sched_granularity has changed to "minimum granularity", so rename the variable (and the tunable) accordingly. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-08-25sched: adaptive scheduler granularityPeter Zijlstra
Instead of specifying the preemption granularity, specify the wanted latency. By fixing the granlarity to a constany the wakeup latency it a function of the number of running tasks on the rq. Invert this relation. sysctl_sched_granularity becomes a minimum for the dynamic granularity computed from the new sysctl_sched_latency. Then use this latency to do more intelligent granularity decisions: if there are fewer tasks running then we can schedule coarser. This helps performance while still always keeping the latency target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-25sched: fix CONFIG_SCHED_DEBUG dependency of lockdep sysctlsPeter Zijlstra
Make the lockdep sysctls not depend on CONFIG_SCHED_DEBUG. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-19Remove double inclusion of linux/capability.hChristian Heim
Remove the second inclusion of linux/capability.h, which has been introduced with "[PATCH] move capable() to capability.h" (commit c59ede7b78db329949d9cdcd7064e22d357560ef) Signed-off-by: Christian Heim <phreak@gentoo.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-11Fix missing numa_zonelist_order sysctlLee Schermerhorn
Misplaced #endif is hiding the numa_zonelist_order sysctl when !SECURITY. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-29ACPI: restore CONFIG_ACPI_SLEEPLen Brown
Restore the 2.6.22 CONFIG_ACPI_SLEEP build option, but now shadowing the new CONFIG_PM_SLEEP option. Signed-off-by: Len Brown <len.brown@intel.com> [ Modified to work with the PM config setup changes. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-25ACPI: Kconfig: remove CONFIG_ACPI_SLEEP from sourceLen Brown
As it was a synonym for (CONFIG_ACPI && CONFIG_X86), the ifdefs for it were more clutter than they were worth. For ia64, just add a few stubs in anticipation of future S3 or S4 support. Signed-off-by: Len Brown <len.brown@intel.com>
2007-07-22x86: i386-show-unhandled-signals-v3Masoud Asgharifard Sharbiani
This patch makes the i386 behave the same way that x86_64 does when a segfault happens. A line gets printed to the kernel log so that tools that need to check for failures can behave more uniformly between debug.show_unhandled_signals sysctl variable to 0 (or by doing echo 0 > /proc/sys/debug/exception-trace) Also, all of the lines being printed are now using printk_ratelimit() to deny the ability of DoS from a local user with a program like the following: main() { while (1) if (!fork()) *(int *)0 = 0; } This new revision also includes the fix that Andrew did which got rid of new sysctl that was added to the system in earlier versions of this. Also, 'show-unhandled-signals' sysctl has been renamed back to the old 'exception-trace' to avoid breakage of people's scripts. AK: Enabling by default for i386 will be likely controversal, but let's see what happens AK: Really folks, before complaining just fix your segfaults AK: I bet this will find a lot of silent issues Signed-off-by: Masoud Sharbiani <masouds@google.com> Signed-off-by: Andi Kleen <ak@suse.de> [ Personally, I've found the complaints useful on x86-64, so I'm all for this. That said, I wonder if we could do it more prettily.. -Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19kernel/sysctl.c: finish off the warning commentsAndrew Morton
I've been chasing these comments around this file all week. Hopefully we're straight now. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19lockstat: core infrastructurePeter Zijlstra
Introduce the core lock statistics code. Lock statistics provides lock wait-time and hold-time (as well as the count of corresponding contention and acquisitions events). Also, the first few call-sites that encounter contention are tracked. Lock wait-time is the time spent waiting on the lock. This provides insight into the locking scheme, that is, a heavily contended lock is indicative of a too coarse locking scheme. Lock hold-time is the duration the lock was held, this provides a reference for the wait-time numbers, so they can be put into perspective. 1) lock 2) ... do stuff .. unlock 3) The time between 1 and 2 is the wait-time. The time between 2 and 3 is the hold-time. The lockdep held-lock tracking code is reused, because it already collects locks into meaningful groups (classes), and because it is an existing infrastructure for lock instrumentation. Currently lockdep tracks lock acquisition with two hooks: lock() lock_acquire() _lock() ... code protected by lock ... unlock() lock_release() _unlock() We need to extend this with two more hooks, in order to measure contention. lock_contended() - used to measure contention events lock_acquired() - completion of the contention These are then placed the following way: lock() lock_acquire() if (!_try_lock()) lock_contended() _lock() lock_acquired() ... do locked stuff ... unlock() lock_release() _unlock() (Note: the try_lock() 'trick' is used to avoid instrumenting all platform dependent lock primitive implementations.) It is also possible to toggle the two lockdep features at runtime using: /proc/sys/kernel/prove_locking /proc/sys/kernel/lock_stat (esp. turning off the O(n^2) prove_locking functionaliy can help) [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: nuke unneeded ifdefs] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19coredump masking: bound suid_dumpable sysctlKawai, Hidehiro
This patch series is version 5 of the core dump masking feature, which controls which VMAs should be dumped based on their memory types and per-process flags. I adopted most of Andrew's suggestion at the previous version. He also suggested using system call instead of /proc/<pid>/ interface, I decided to use the latter continuously because adding new system call with pid argument will give a big impact on the kernel. You can access the per-process flags via /proc/<pid>/coredump_filter interface. coredump_filter represents a bitmask of memory types, and if a bit is set, VMAs of corresponding memory type are written into a core file when the process is dumped. The bitmask is inherited from the parent process when a process is created. The original purpose is to avoid longtime system slowdown when a number of processes which share a huge shared memory are dumped at the same time. To achieve this purpose, this patch series adds an ability to suppress dumping anonymous shared memory for specified processes. In this version, three other memory types are also supported. Here are the coredump_filter bits: bit 0: anonymous private memory bit 1: anonymous shared memory bit 2: file-backed private memory bit 3: file-backed shared memory The default value of coredump_filter is 0x3. This means the new core dump routine has the same behavior as conventional behavior by default. In this version, coredump_filter bits and mm.dumpable are merged into mm.flags, and it is accessed by atomic bitops. The supported core file formats are ELF and ELF-FDPIC. ELF has been tested, but ELF-FDPIC has not been built and tested because I don't have the test environment. This patch limits a value of suid_dumpable sysctl to the range of 0 to 2. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19audit: rework execve auditPeter Zijlstra
The purpose of audit_bprm() is to log the argv array to a userspace daemon at the end of the execve system call. Since user-space hasn't had time to run, this array is still in pristine state on the process' stack; so no need to copy it, we can just grab it from there. In order to minimize the damage to audit_log_*() copy each string into a temporary kernel buffer first. Currently the audit code requires that the full argument vector fits in a single packet. So currently it does clip the argv size to a (sysctl) limit, but only when execve auditing is enabled. If the audit protocol gets extended to allow for multiple packets this check can be removed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Cc: <linux-audit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: Integrate beeping flag with existing acpi_sleep flagsPavel Machek
Move "debug during resume from s2ram" into the variable we already use for real-mode flags to simplify code. It also closes nasty trap for the user in acpi_sleep_setup; order of parameters actually mattered there, acpi_sleep=s3_bios,s3_mode doing something different from acpi_sleep=s3_mode,s3_bios. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18Add common orderly_poweroff()Jeremy Fitzhardinge
Various pieces of code around the kernel want to be able to trigger an orderly poweroff. This pulls them together into a single implementation. By default the poweroff command is /sbin/poweroff, but it can be set via sysctl: kernel/poweroff_cmd. This is split at whitespace, so it can include command-line arguments. This patch replaces four other instances of invoking either "poweroff" or "shutdown -h now": two sbus drivers, and acpi thermal management. sparc64 has its own "powerd"; still need to determine whether it should be replaced by orderly_poweroff(). Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Acked-by: Len Brown <lenb@kernel.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <ak@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David S. Miller <davem@davemloft.net>
2007-07-17proper prototype for proc_nr_files()Adrian Bunk
Add a proper prototype for proc_nr_files() in include/linux/fs.h Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Allow huge page allocations to use GFP_HIGH_MOVABLEMel Gorman
Huge pages are not movable so are not allocated from ZONE_MOVABLE. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE. This patch adds a new sysctl called hugepages_treat_as_movable. When a non-zero value is written to it, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Remove duplicate comments from sysctl.cLinus Torvalds
Randy Dunlap noticed that the recent comment clarifications from Andrew had somehow gotten duplicated. Quoth Andrew: "hm, that could have been some late-night reject-fixing." Fix it up. Cc: From: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16sysctl.c: add text telling people to use CTL_UNNUMBEREDAndrew Morton
Hopefully this will help people to understand the new regime. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16vdso: print fatal signalsIngo Molnar
Add the print-fatal-signals=1 boot option and the /proc/sys/kernel/print-fatal-signals runtime switch. This feature prints some minimal information about userspace segfaults to the kernel console. This is useful to find early bootup bugs where userspace debugging is very hard. Defaults to off. [akpm@linux-foundation.org: Don't add new sysctl numbers] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16change zonelist order: zonelist order selection logicKAMEZAWA Hiroyuki
Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-11security: Protection for exploiting null dereference using mmapEric Paris
Add a new security check on mmap operations to see if the user is attempting to mmap to low area of the address space. The amount of space protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to 0, preserving existing behavior. This patch uses a new SELinux security class "memprotect." Policy already contains a number of allow rules like a_t self:process * (unconfined_t being one of them) which mean that putting this check in the process class (its best current fit) would make it useless as all user processes, which we also want to protect against, would be allowed. By taking the memprotect name of the new class it will also make it possible for us to move some of the other memory protect permissions out of 'process' and into the new class next time we bump the policy version number (which I also think is a good future idea) Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2007-07-09sched: add CFS debug sysctlsIngo Molnar
add CFS debug sysctls: only tweakable if SCHED_DEBUG is enabled. This allows for faster debugging of scheduler problems. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-05-17make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename ↵Dan Aloni
size Make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size and change it to 128, so that extensive patterns such as '/local/cores/%e-%h-%s-%t-%p.core' won't result in truncated filename generation. Signed-off-by: Dan Aloni <da-x@monatomic.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09Make vm statistics update interval configurableChristoph Lameter
Make it configurable. Code in mm makes the vm statistics intervals independent from the cache reaper use that opportunity to make it configurable. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08proc: maps protectionKees Cook
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive information about the memory location and usage of processes. Issues: - maps should not be world-readable, especially if programs expect any kind of ASLR protection from local attackers. - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc check the maps when %n is in a *printf call, and a setuid(getuid()) process wouldn't be able to read its own maps file. (For reference see http://lkml.org/lkml/2006/1/22/150) - a system-wide toggle is needed to allow prior behavior in the case of non-root applications that depend on access to the maps contents. This change implements a check using "ptrace_may_attach" before allowing access to read the maps contents. To control this protection, the new knob /proc/sys/kernel/maps_protect has been added, with corresponding updates to the procfs documentation. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: New sysctl numbers are old hat] Signed-off-by: Kees Cook <kees@outflux.net> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24Allow reading tainted flag as userBastian Blank
The commit 34f5a39899f3f3e815da64f48ddb72942d86c366 restricted reading of the tainted value. The attached patch changes this back to a write-only check and restores the read behaviour of older versions. Signed-off-by: Bastian Blank <bastian@waldi.eu.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05sysctl: Support vdso_enabled sysctl on SH.Paul Mundt
All of the logic for this was already in place, we just hadn't wired it up in the sysctl table. Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-03-01[PATCH] fix the SYSCTL=n compilationAdrian Bunk
/home/bunk/linux/kernel-2.6/linux-2.6.20-mm2/kernel/sysctl.c:1411: error: conflicting types for 'register_sysctl_table' /home/bunk/linux/kernel-2.6/linux-2.6.20-mm2/include/linux/sysctl.h:1042: error: previous declaration of 'register_sysctl_table' was here make[2]: *** [kernel/sysctl.o] Error 1 Caused by commit 0b4d414714f0d2f922d39424b0c5c82ad900a381. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: add a parent entry to ctl_table and set the parent entryEric W. Biederman
Add a parent entry into the ctl_table so you can walk the list of parents and find the entire path to a ctl_table entry. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: reimplement the sysctl proc supportEric W. Biederman
With this change the sysctl inodes can be cached and nothing needs to be done when removing a sysctl table. For a cost of 2K code we will save about 4K of static tables (when we remove de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or about half that on a 32bit arch. The speed feels about the same, even though we can now cache the sysctl dentries :( We get the core advantage that we don't need to have a 1 to 1 mapping between ctl table entries and proc files. Making it possible to have /proc/sys vary depending on the namespace you are in. The currently merged namespaces don't have an issue here but the network namespace under /proc/sys/net needs to have different directories depending on which network adapters are visible. By simply being a cache different directories being visible depending on who you are is trivial to implement. [akpm@osdl.org: fix uninitialised var] [akpm@osdl.org: fix ARM build] [bunk@stusta.de: make things static] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: allow sysctl_perm to be called from outside of sysctl.cEric W. Biederman
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: factor out sysctl_head_next from do_sysctlEric W. Biederman
The current logic to walk through the list of sysctl table headers is slightly painful and implement in a way it cannot be used by code outside sysctl.c I am in the process of implementing a version of the sysctl proc support that instead of using the proc generic non-caching monster, just uses the existing sysctl data structure as backing store for building the dcache entries and for doing directory reads. To use the existing data structures however I need a way to get at them. [akpm@osdl.org: warning fix] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: remove insert_at_head from register_sysctlEric W. Biederman
The semantic effect of insert_at_head is that it would allow new registered sysctl entries to override existing sysctl entries of the same name. Which is pain for caching and the proc interface never implemented. I have done an audit and discovered that none of the current users of register_sysctl care as (excpet for directories) they do not register duplicate sysctl entries. So this patch simply removes the support for overriding existing entries in the sys_sysctl interface since no one uses it or cares and it makes future enhancments harder. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Corey Minyard <minyard@acm.org> Cc: Neil Brown <neilb@suse.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jan Kara <jack@ucw.cz> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: remove support for directory strategy routinesEric W. Biederman
parse_table has support for calling a strategy routine when descending into a directory. To date no one has used this functionality and the /proc/sys interface has no analog to it. So no one is using this functionality kill it and make the binary sysctl code easier to follow. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: remove support for CTL_ANYEric W. Biederman
There are currently no users in the kernel for CTL_ANY and it only has effect on the binary interface which is practically unused. So this complicates sysctl lookups for no good reason so just remove it. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: create sys/fs/binfmt_misc as an ordinary sysctl entryEric W. Biederman
binfmt_misc has a mount point in the middle of the sysctl and that mount point is created as a proc_generic directory. Doing it that way gets in the way of cleaning up the sysctl proc support as it continues the existence of a horrible hack. So instead simply create the directory as an ordinary sysctl directory. At least that removes the magic special case. [akpm@osdl.org: warning fix] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: move SYSV IPC sysctls to their own fileEric W. Biederman
This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded with special cases, and by keeping all of the ipc logic to together it makes the code a little more readable. [gcoady.lk@gmail.com: build fix] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Grant Coady <gcoady.lk@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: move utsname sysctls to their own fileEric W. Biederman
This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded with special cases, and by keeping all of the utsname logic to together it makes the code a little more readable. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14[PATCH] sysctl: move init_irq_proc into init/main where it belongsEric W. Biederman
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] _proc_do_string(): fix short readsOleg Nesterov
If you try to read things like /proc/sys/kernel/osrelease with single-byte reads, you get just one byte and then EOF. This is because _proc_do_string() assumes that the caller is read()ing into a buffer which is large enough to fit the whole string in a single hit. Fix. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] sysctl warning fixAndrew Morton
kernel/sysctl.c:2816: warning: 'sysctl_ipc_data' defined but not used Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Add TAINT_USER and ability to set taint flags from userspaceTheodore Ts'o
Allow taint flags to be set from userspace by writing to /proc/sys/kernel/tainted, and add a new taint flag, TAINT_USER, to be used when userspace has potentially done something dangerous that might compromise the kernel. This will allow support personnel to ask further questions about what may have caused the user taint flag to have been set. For example, they might examine the logs of the realtime JVM to see if the Java program has used the really silly, stupid, dangerous, and completely-non-portable direct access to physical memory feature which MUST be implemented according to the Real-Time Specification for Java (RTSJ). Sigh. What were those silly people at Sun thinking? [akpm@osdl.org: build fix] [bunk@stusta.de: cleanup] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] sysctl_{,ms_}jiffies: fix oldlen semanticsAlexey Dobriyan
currently it's 1) if *oldlenp == 0, don't writeback anything 2) if *oldlenp >= table->maxlen, don't writeback more than table->maxlen bytes and rewrite *oldlenp don't look at underlying type granularity 3) if 0 < *oldlenp < table->maxlen, *cough* string sysctls don't writeback more than *oldlenp bytes. OK, that's because sizeof(char) == 1 int sysctls writeback anything in (0, table->maxlen] range Though accept integers divisible by sizeof(int) for writing. sysctl_jiffies and sysctl_ms_jiffies don't writeback anything but sizeof(int), which violates 1) and 2). So, make sysctl_jiffies and sysctl_ms_jiffies accept a) *oldlenp == 0, not doing writeback b) *oldlenp >= sizeof(int), writing one integer. -EINVAL still returned for *oldlenp == 1, 2, 3. Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] make reading /proc/sys/kernel/cap-bould not require CAP_SYS_MODULEEric Paris
Reading /proc/sys/kernel/cap-bound requires CAP_SYS_MODULE. (see proc_dointvec_bset in kernel/sysctl.c) sysctl appears to drive all over proc reading everything it can get it's hands on and is complaining when it is being denied access to read cap-bound. Clearly writing to cap-bound should be a sensitive operation but requiring CAP_SYS_MODULE to read cap-bound seems a bit to strong. I believe the information could with reasonable certainty be obtained by looking at a bunch of the output of /proc/pid/status which has very low security protection, so at best we are just getting a little obfuscation of information. Currently SELinux policy has to 'dontaudit' capability checks for CAP_SYS_MODULE for things like sysctl which just want to read cap-bound. In doing so we also as a byproduct have to hide warnings of potential exploits such as if at some time that sysctl actually tried to load a module. I wondered if anyone would have a problem opening cap-bound up to read from anyone? Acked-by: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-13[PATCH] debug: add sysrq_always_enabled boot optionIngo Molnar
Most distributions enable sysrq support but set it to 0 by default. Add a sysrq_always_enabled boot option to always-enable sysrq keys. Useful for debugging - without having to modify the disribution's config files (which might not be possible if the kernel is on a live CD, etc.). Also, while at it, clean up the sysrq interfaces. [bunk@stusta.de: make sysrq_always_enabled_setup() static] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] sysctl: remove unused "context" paramAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] sysctl: remove some OPsAlexey Dobriyan
kernel.cap-bound uses only OP_SET and OP_AND Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>