aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2007-10-20dm: export name and uuidMike Anderson
This patch adds a function to obtain a copy of a mapped device's name and uuid. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20dm io:ctl use constant struct sizeMilan Broz
Make size of dm_ioctl struct always 312 bytes on all supported architectures. This change retains compatibility with already-compiled code because it uses an embedded offset to locate the payload that follows the structure. On 64-bit architectures there is no change at all; on 32-bit we are increasing the size of dm-ioctl from 308 to 312 bytes. Currently with 32-bit userspace / 64-bit kernel on x86_64 some ioctls (including rename, message) are incorrectly rejected by the comparison against 'param + 1'. This breaks userspace lvrename and multipath 'fail_if_no_path' changes, for example. (BTW Device-mapper uses its own versioning and ignores the ioctl size bits. Only the generic ioctl compat code on mixed arches checks them, and that will continue to accept both sizes for now, but we intend to list 308 as deprecated and eventually remove it.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Guido Guenther <agx@sigxcpu.org> Cc: Kevin Corry <kevcorry@us.ibm.com> Cc: stable@kernel.org
2007-10-19Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86Linus Torvalds
* ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86: (33 commits) x86: convert cpuinfo_x86 array to a per_cpu array x86: introduce frame_pointer() and stack_pointer() x86 & generic: change to __builtin_prefetch() i386: do not BUG_ON() when MSR is unknown x86: acpi use cpu_physical_id x86: convert cpu_llc_id to be a per cpu variable x86: convert cpu_to_apicid to be a per cpu variable i386: introduce "used_vectors" bitmap which can be used to reserve vectors. x86: use raw locks during oopses x86: honor _PAGE_PSE bit on page walks i386: do cpuid_device_create() in CPU_UP_PREPARE instead of CPU_ONLINE. x86: implement missing x86_64 function smp_call_function_mask() x86: use descriptor's functions instead of inline assembly i386: consolidate show_regs and show_registers for i386 i386: make callgraph use dump_trace() on i386/x86_64 x86: enable iommu_merge by default i386: i386 add AMD64 Barcelona PMU MSR definitions to msr.h x86: Unify i386 and x86-64 early quirks x86: enable HPET on ICH3 and ICH4 x86: force enable HPET on VT8235/8237 chipsets ... Manually fix trivial conflict with task pid container helper changes in arch/x86/kernel/process_32.c
2007-10-19Merge git://git.linux-nfs.org/pub/linux/nfs-2.6Linus Torvalds
* git://git.linux-nfs.org/pub/linux/nfs-2.6: NFSv4: Fix an rpc_cred reference leakage in fs/nfs/delegation.c NFSv4: Ensure that we wait for the CLOSE request to complete NFS: Fix a race in sillyrename NFS: Fix a writeback race...
2007-10-19NFS: Fix a race in sillyrenameTrond Myklebust
lookup() and sillyrename() can race one another because the sillyrename() completion cannot take the parent directory's inode->i_mutex since the latter may be held by whoever is calling dput(). We therefore have little option but to add extra locking to ensure that nfs_lookup() and nfs_atomic_open() do not race with the sillyrename completion. If somebody has looked up the sillyrenamed file in the meantime, we just transfer the sillydelete information to the new dentry. Please refer to the bug-report at http://bugzilla.linux-nfs.org/show_bug.cgi?id=150 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-10-19Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (41 commits) ACPICA: hw: Don't carry spinlock over suspend ACPICA: hw: remove use_lock flag from acpi_hw_register_{read, write} ACPI: cpuidle: port idle timer suspend/resume workaround to cpuidle ACPI: clean up acpi_enter_sleep_state_prep Hibernation: Make sure that ACPI is enabled in acpi_hibernation_finish ACPI: suppress uninitialized var warning cpuidle: consolidate 2.6.22 cpuidle branch into one patch ACPI: thinkpad-acpi: skip blanks before the data when parsing sysfs ACPI: AC: Add sysfs interface ACPI: SBS: Add sysfs alarm ACPI: SBS: Add ACPI_PROCFS around procfs handling code. ACPI: SBS: Add support for power_supply class (and sysfs) ACPI: SBS: Make SBS reads table-driven. ACPI: SBS: Simplify data structures in SBS ACPI: SBS: Split host controller (ACPI0001) from SBS driver (ACPI0002) ACPI: EC: Add new query handler to list head. ACPI: Add acpi_bus_generate_event4() function ACPI: Battery: add sysfs alarm ACPI: Battery: Add sysfs support ACPI: Battery: Misc clean-ups, no functional changes ... Fix up conflicts in drivers/misc/thinkpad_acpi.[ch] manually
2007-10-19Linux Kernel Markers - SamplesMathieu Desnoyers
Module example showing how to use the Linux Kernel Markers. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Linux Kernel MarkersMathieu Desnoyers
The marker activation functions sits in kernel/marker.c. A hash table is used to keep track of the registered probes and armed markers, so the markers within a newly loaded module that should be active can be activated at module load time. marker_query has been removed. marker_get_first, marker_get_next and marker_release should be used as iterators on the markers. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: "Frank Ch. Eigler" <fche@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Mike Mason <mmlnx@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Hook up group scheduler with control groupsSrivatsa Vaddagiri
Enable "cgroup" (formerly containers) based fair group scheduling. This will let administrator create arbitrary groups of tasks (using "cgroup" pseudo filesystem) and control their cpu bandwidth usage. [akpm@linux-foundation.org: fix cpp condition] Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Extended crashkernel command lineBernhard Walle
This patch adds a extended crashkernel syntax that makes the value of reserved system RAM dependent on the system RAM itself: crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset] range=start-[end] For example: crashkernel=512M-2G:64M,2G-:128M The motivation comes from distributors that configure their crashkernel command line automatically with some configuration tool (YaST, you know ;)). Of course that tool knows the value of System RAM, but if the user removes RAM, then the system becomes unbootable or at least unusable and error handling is very difficult. This series implements this change for i386, x86_64, ia64, ppc64 and sh. That should be all platforms that support kdump in current mainline. I tested all platforms except sh due to the lack of a sh processor. This patch: This is the generic part of the patch. It adds a parse_crashkernel() function in kernel/kexec.c that is called by the architecture specific code that actually reserves the memory. That function takes the whole command line and looks itself for "crashkernel=" in it. If there are multiple occurrences, then the last one is taken. The advantage is that if you have a bootloader like lilo or elilo which allows you to append a command line parameter but not to remove one (like in GRUB), then you can add another crashkernel value for testing at the boot command line and this one overwrites the command line in the configuration then. Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19virtualization of sysv msg queues is incompleteKirill Korotaev
Virtualization of sysv msg queues is incomplete: msg_hdrs and msg_bytes variables visible from userspace are global. Let's make them per-namespace. Signed-off-by: Alexey Kuznetsov <alexey@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Serge Hallyn <serue@us.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19ipc: store ipcs into IDRsNadia Derbey
This patch introduces ipcs storage into IDRs. The main changes are: . This ipc_ids structure is changed: the entries array is changed into a root idr structure. . The grow_ary() routine is removed: it is not needed anymore when adding an ipc structure, since we are now using the IDR facility. . The ipc_rmid() routine interface is changed: . there is no need for this routine to return the pointer passed in as argument: it is now declared as a void . since the id is now part of the kern_ipc_perm structure, no need to have it as an argument to the routine Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19hotplug cpu: migrate a task within its cpusetCliff Wickman
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have been running on that cpu. Currently, such a task is migrated: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any cpu which is both online and among that task's cpus_allowed It is typical of a multithreaded application running on a large NUMA system to have its tasks confined to a cpuset so as to cluster them near the memory that they share. Furthermore, it is typical to explicitly place such a task on a specific cpu in that cpuset. And in that case the task's cpus_allowed includes only a single cpu. This patch would insert a preference to migrate such a task to some cpu within its cpuset (and set its cpus_allowed to its entire cpuset). With this patch, migrate the task to: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any online cpu within the task's cpuset 3) to any cpu which is both online and among that task's cpus_allowed In order to do this, move_task_off_dead_cpu() must make a call to cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will not block. (name change - per Oleg's suggestion) Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set the cpuset mutex during the whole migrate_live_tasks() and migrate_dead_tasks() procedure. [akpm@linux-foundation.org: build fix] [pj@sgi.com: Fix indentation and spacing] Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Use helpers to obtain task pid in printksPavel Emelyanov
The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Isolate the explicit usage of signal->pgrpPavel Emelyanov
The pgrp field is not used widely around the kernel so it is now marked as deprecated with appropriate comment. The initialization of INIT_SIGNALS is trimmed because a) they are set to 0 automatically; b) gcc cannot properly initialize two anonymous (the second one is the one with the session) unions. In this particular case to make it compile we'd have to add some field initialized right before the .pgrp. This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one (from Cedric), but for the pgrp field. Some progress report: We have to deprecate the pid, tgid, session and pgrp fields on struct task_struct and struct signal_struct. The session and pgrp are already deprecated. The tgid value is close to being such - the worst known usage in in fs/locks.c and audit code. The pid field deprecation is mainly blocked by numerous printk-s around the kernel that print the tsk->pid to log. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19remove BITS_TO_TYPE macroJiri Slaby
remove BITS_TO_TYPE macro I realized, that it is actually the same as DIV_ROUND_UP, use it instead. [akpm@linux-foundation.org: build fix] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19define global BIT macroJiri Slaby
define global BIT macro move all local BIT defines to the new globally define macro. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@gate.crashing.org> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Jeff Garzik <jeff@garzik.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Russell King <rmk@arm.linux.org.uk> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: "John W. Linville" <linville@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19get rid of input BIT* duplicate definesJiri Slaby
get rid of input BIT* duplicate defines use newly global defined macros for input layer. Also remove includes of input.h from non-input sources only for BIT macro definiton. Define the macro temporarily in local manner, all those local definitons will be removed further in this patchset (to not break bisecting). BIT macro will be globally defined (1<<x) Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: <dtor@mail.ru> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: <lenb@kernel.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Cc: <perex@suse.cz> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: <vernux@us.ibm.com> Cc: <malattia@linux.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19define first set of BIT* macrosJiri Slaby
define first set of BIT* macros - move BITOP_MASK and BITOP_WORD from asm-generic/bitops/atomic.h to include/linux/bitops.h and rename it to BIT_MASK and BIT_WORD - move BITS_TO_LONGS and BITS_PER_BYTE to bitops.h too and allow easily define another BITS_TO_something (e.g. in event.c) by BITS_TO_TYPE macro Remaining (and common) BIT macro will be defined after all occurences and conflicts will be sorted out in the patches. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19remove asm/bitops.h includesJiri Slaby
remove asm/bitops.h includes including asm/bitops directly may cause compile errors. don't include it and include linux/bitops instead. next patch will deny including asm header directly. Cc: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Misc: phantom, improved data passingJiri Slaby
This new version guarantees amb_bit switch in small enough intervals, so that the device won't stop working in the middle of a movement anymore. However it preserves old (openhaptics) functionality. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Fix cpusets update_cpumaskPaul Menage
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks: - collect batches of tasks under tasklist_lock and then call set_cpus_allowed() on them outside the lock (since this can sleep). - add a simple generic priority heap type to allow efficient collection of batches of tasks to be processed without duplicating or missing any tasks in subsequent batches. - make "cpus" file update a no-op if the mask hasn't changed - fix race between update_cpumask() and sched_setaffinity() by making sched_setaffinity() post-check that it's not running on any cpus outside cpuset_cpus_allowed(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19cpuset sched_load_balance flagPaul Jackson
Add a new per-cpuset flag called 'sched_load_balance'. When enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write "0" to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset. Now even if this flag is disabled for some cpuset, the kernel may still have to load balance some or all the CPUs in that cpuset, if some overlapping cpuset has its sched_load_balance flag enabled. If there are some CPUs that are not in any cpuset whose sched_load_balance flag is enabled, the kernel scheduler will not load balance tasks to those CPUs. Moreover the kernel will partition the 'sched domains' (non-overlapping sets of CPUs over which load balancing is attempted) into the finest granularity partition that it can find, while still keeping any two CPUs that are in the same shed_load_balance enabled cpuset in the same element of the partition. This serves two purposes: 1) It provides a mechanism for real time isolation of some CPUs, and 2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs. This mechanism replaces the earlier overloading of the per-cpuset flag 'cpu_exclusive', which overloading was removed in an earlier patch: cpuset-remove-sched-domain-hooks-from-cpusets See further the Documentation and comments in the code itself. [akpm@linux-foundation.org: don't be weird] Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline the task_xid_nr_ns() callsPavel Emelyanov
Since these are expanded into call to pid_nr_ns() anyway, it's OK to move the whole routine out-of-line. This is a cheap way to save ~100 bytes from vmlinux. Together with the previous two patches, it saves half-a-kilo from the vmlinux. Un-inline other (currently inlined) functions must be done with additional performance testing. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline find_pid etc set of functionsPavel Emelyanov
The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its id, depending on whic id - global or virtual - is used. The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the stack to call another function - find_pid_ns(). It turned out, that this dereference together with the push itself cause the kernel text size to grow too much. Move all these out-of-line. Together with the previous patch this saves a bit less that 400 bytes from .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Isolate some explicit usage of task->tgidPavel Emelyanov
With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: remove the struct pid unneeded fieldsPavel Emelyanov
Since we've switched from using pid->nr to pid->upids->nr some fields on struct pid are no longer needed Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline find_task_by_xxx set of functionsPavel Emelyanov
The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: destroy pid namespace on init's deathSukadev Bhattiprolu
Terminate all processes in a namespace when the reaper of the namespace is exiting. We do this by walking the pidmap of the namespace and sending SIGKILL to all processes. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: initialize the namespace's proc_mntPavel Emelyanov
The namespace's proc_mnt must be kern_mount-ed to make this pointer always valid, independently of whether the user space mounted the proc or not. This solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL to not-NULL. The initialization is done after the init's pid is created and hashed to make proc_get_sb() finr it and get for root inode. Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the superblock holds the namespace we must explicitly break this circle to destroy all the stuff. This is done after the init of the namespace dies. Running a few steps forward - when init exits it will kill all its children, so no proc_mnt will be needed after its death. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: allow cloning of new namespacePavel Emelyanov
When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then create a new struct pid for the new process. Allocate pid_t's for the new process in the new pid namespace and all ancestor pid namespaces. Make the newly cloned process the session and process group leader. Since the active pid namespace is special and expected to be the first entry in pid->upid_list, preserve the order of pid namespaces. The size of 'struct pid' is dependent on the the number of pid namespaces the process exists in, so we use multiple pid-caches'. Only one pid cache is created during system startup and this used by processes that exist only in init_pid_ns. When a process clones its pid namespace, we create additional pid caches as necessary and use the pid cache to allocate 'struct pids' for that depth. Note, that with this patch the newly created namespace won't work, since the rest of the kernel still uses global pids, but this is to be fixed soon. Init pid namespace still works. [oleg@tv-sign.ru: merge fix] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: miscellaneous preparations for pid namespacesPavel Emelyanov
* remove pid.h from pid_namespaces.h; * rework is_(cgroup|global)_init; * optimize (get|put)_pid_ns for init_pid_ns; * declare task_child_reaper to return actual reaper. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: make proc have multiple superblocks - one for each namespacePavel Emelyanov
Each pid namespace have to be visible through its own proc mount. Thus we need to have per-namespace proc trees with their own superblocks. We cannot easily show different pid namespace via one global proc tree, since each pid refers to different tasks in different namespaces. E.g. pid 1 refers to the init task in the initial namespace and to some other task when seeing from another namespace. Moreover - pid, exisintg in one namespace may not exist in the other. This approach has one move advantage is that the tasks from the init namespace can see what tasks live in another namespace by reading entries from another proc tree. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: helpers to find the task by its numerical idsPavel Emelyanov
When searching the task by numerical id on may need to find it using global pid (as it is done now in kernel) or by its virtual id, e.g. when sending a signal to a task from one namespace the sender will specify the task's virtual id and we should find the task by this value. [akpm@linux-foundation.org: fix gfs2 linkage] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: helpers to obtain pid numbersPavel Emelyanov
When showing pid to user or getting the pid numerical id for in-kernel use the value of this id may differ depending on the namespace. This set of helpers is used to get the global pid nr, the virtual (i.e. seen by task in its namespace) nr and the nr as it is seen from the specified namespace. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upidPavel Emelyanov
Each struct upid element of struct pid has to be initialized properly, i.e. its nr mst be allocated from appropriate pidmap and ns set to appropriate namespace. When allocating a new pid, we need to know the namespace this pid will live in, so the additional argument is added to alloc_pid(). On the other hand, the rest of the kernel still uses the pid->nr and pid->pid_chain fields, so these ones are still initialized, but this will be removed soon. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: add support for pid namespaces hierarchyPavel Emelyanov
Each namespace has a parent and is characterized by its "level". Level is the number of the namespace generation. E.g. init namespace has level 0, after cloning new one it will have level 1, the next one - 2 and so on and so forth. This level is not explicitly limited. True hierarchy must have some way to find each namespace's children, but it is not used in the patches, so this ability is not added (yet). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: introduce struct upidSukadev Bhattiprolu
Since task will be visible from different pid namespaces each of them have to be addressed by multiple pids. struct upid is to store the information about which id refers to which namespace. The constuciton looks like this. Each struct pid carried the reference counter and the list of tasks attached to this pid. At its end it has a variable length array of struct upid-s. Each struct upid has a numerical id (pid itself), pointer to the namespace, this ID is valid in and is hashed into a pid_hash for searching the pids. The nr and pid_chain fields are kept in struct pid for a while to make kernel still work (no patch initialize the upids yet), but it will be removed at the end of this series when we switch to upids completely. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: prepare proc_flust_task() to flush entries from multiple ↵Pavel Emelyanov
proc trees The first part is trivial - we just make the proc_flush_task() to operate on arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to it. The other change is more tricky: I moved the proc_flush_task() call in release_task() higher to address the following problem. When flushing task from many proc trees we need to know the set of ids (not just one pid) to find the dentries' names to flush. Thus we need to pass the task's pid to proc_flush_task() as struct pid is the only object that can provide all the pid numbers. But after __exit_signal() task has detached all his pids and this information is lost. This creates a tiny gap for proc_pid_lookup() to bring some dentries back to tree and keep them in hash (since pids are still alive before __exit_signal()) till the next shrink, but since proc_flush_task() does not provide a 100% guarantee that the dentries will be flushed, this is OK to do so. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: introduce MS_KERNMOUNT flagPavel Emelyanov
This flag tells the .get_sb callback that this is a kern_mount() call so that it can trust *data pointer to be valid in-kernel one. If this flag is passed from the user process, it is cleared since the *data pointer is not a valid kernel object. Running a few steps forward - this will be needed for proc to create the superblock and store a valid pid namespace on it during the namespace creation. The reason, why the namespace cannot live without proc mount is described in the appropriate patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19workqueue: debug flushing deadlocks with lockdepJohannes Berg
In the following scenario: code path 1: my_function() -> lock(L1); ...; flush_workqueue(); ... code path 2: run_workqueue() -> my_work() -> ...; lock(L1); ... you can get a deadlock when my_work() is queued or running but my_function() has acquired L1 already. This patch adds a pseudo-lock to each workqueue to make lockdep warn about this scenario. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Make access to task's nsproxy lighterPavel Emelyanov
When someone wants to deal with some other taks's namespaces it has to lock the task and then to get the desired namespace if the one exists. This is slow on read-only paths and may be impossible in some cases. E.g. Oleg recently noticed a race between unshare() and the (sent for review in cgroups) pid namespaces - when the task notifies the parent it has to know the parent's namespace, but taking the task_lock() is impossible there - the code is under write locked tasklist lock. On the other hand switching the namespace on task (daemonize) and releasing the namespace (after the last task exit) is rather rare operation and we can sacrifice its speed to solve the issues above. The access to other task namespaces is proposed to be performed like this: rcu_read_lock(); nsproxy = task_nsproxy(tsk); if (nsproxy != NULL) { / * * work with the namespaces here * e.g. get the reference on one of them * / } / * * NULL task_nsproxy() means that this task is * almost dead (zombie) * / rcu_read_unlock(); This patch has passed the review by Eric and Oleg :) and, of course, tested. [clg@fr.ibm.com: fix unshare()] [ebiederm@xmission.com: Update get_net_ns_by_pid] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: define is_global_init() and is_container_init()Serge E. Hallyn
is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: rename child_reaper() functionSukadev Bhattiprolu
Rename the child_reaper() function to task_child_reaper() to be similar to other task_* functions and to distinguish the function from 'struct pid_namspace.child_reaper'. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: define and use task_active_pid_ns() wrapperSukadev Bhattiprolu
With multiple pid namespaces, a process is known by some pid_t in every ancestor pid namespace. Every time the process forks, the child process also gets a pid_t in every ancestor pid namespace. While a process is visible in >=1 pid namespaces, it can see pid_t's in only one pid namespace. We call this pid namespace it's "active pid namespace", and it is always the youngest pid namespace in which the process is known. This patch defines and uses a wrapper to find the active pid namespace of a process. The implementation of the wrapper will be changed in when support for multiple pid namespaces are added. Changelog: 2.6.22-rc4-mm2-pidns1: - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use task_active_pid_ns() in child_reaper() since task->nsproxy can be NULL during task exit (so child_reaper() continues to use init_pid_ns). to implement child_reaper() since init_pid_ns.child_reaper to implement child_reaper() since tsk->nsproxy can be NULL during exit. 2.6.21-rc6-mm1: - Rename task_pid_ns() to task_active_pid_ns() to reflect that a process can have multiple pid namespaces. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: dynamic kmem cache allocator for pid namespacesPavel Emelianov
Add kmem_cache to pid_namespace to allocate pids from. Since both implementations expand the struct pid to carry more numerical values each namespace should have separate cache to store pids of different sizes. Each kmem cache is name "pid_<NR>", where <NR> is the number of numerical ids on the pid. Different namespaces with same level of nesting will have same caches. This patch has two FIXMEs that are to be fixed after we reach the consensus about the struct pid itself. The first one is that the namespace to free the pid from in free_pid() must be taken from pid. Now the init_pid_ns is used. The second FIXME is about the cache allocation. When we do know how long the object will be then we'll have to calculate this size in create_pid_cachep. Right now the sizeof(struct pid) value is used. [akpm@linux-foundation.org: coding-style repair] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: make get_pid_ns() return the namespace itselfPavel Emelianov
Make get_pid_ns() return the namespace itself to look like the other getters and make the code using it look nicer. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: round up the APIPavel Emelianov
The set of functions process_session, task_session, process_group and task_pgrp is confusing, as the names can be mixed with each other when looking at the code for a long time. The proposals are to * equip the functions that return the integer with _nr suffix to represent that fact, * and to make all functions work with task (not process) by making the common prefix of the same name. For monotony the routines signal_session() and set_signal_session() are replaced with task_session_nr() and set_task_session(), especially since they are only used with the explicit task->signal dereference. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19cgroups: implement namespace tracking subsystemSerge E. Hallyn
When a task enters a new namespace via a clone() or unshare(), a new cgroup is created and the task moves into it. This version names cgroups which are automatically created using cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or cloned process. (Thanks Pavel for the idea) This is safe because if the process unshares again, it will create /cgroups/(...)/node_<pid>/node_<pid> The only possibilities (AFAICT) for a -EEXIST on unshare are 1. pid wraparound 2. a process fails an unshare, then tries again. Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare() succeed. Changelog: Name cloned cgroups as "node_<pid>". [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Add cgroupstatsBalbir Singh
This patch is inspired by the discussion at http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward ported) This patch implements per cgroup statistics infrastructure and re-uses code from the taskstats interface. A new set of cgroup operations are registered with commands and attributes. It should be very easy to *extend* per cgroup statistics, by adding members to the cgroupstats structure. The current model for cgroupstats is a pull, a push model (to post statistics on interesting events), should be very easy to add. Currently user space requests for statistics by passing the cgroup file descriptor. Statistics about the state of all the tasks in the cgroup is returned to user space. TODO's/NOTE: This patch provides an infrastructure for implementing cgroup statistics. Based on the needs of each controller, we can incrementally add more statistics, event based support for notification of statistics, accumulation of taskstats into cgroup statistics in the future. Sample output # ./cgroupstats -C /cgroup/a sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0 # ./cgroupstats -C /cgroup/ sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0 If the approach looks good, I'll enhance and post the user space utility for the same Feedback, comments, test results are always welcome! [akpm@linux-foundation.org: build fix] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>