aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-03-12futex: unlock before returning -EFAULTDarren Hart
Impact: rt-mutex failure case fix futex_lock_pi can potentially return -EFAULT with the rt_mutex held. This seems like the wrong thing to do as userspace should assume -EFAULT means the lock was not taken. Even if it could figure this out, we'd be leaving the pi_state->owner in an inconsistent state. This patch unlocks the rt_mutex prior to returning -EFAULT to userspace. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075606.9856.88729.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12futex: use current->time_slack_ns for rt tasks tooDarren Hart
RT tasks should set their timer slack to 0 on their own. This patch removes the 'if (rt_task()) slack = 0;' block in futex_wait. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20090312075559.9856.28822.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12futex: add double_unlock_hb()Darren Hart
Impact: cleanup The futex code uses double_lock_hb() which locks the hb->lock's in pointer value order. There is no parallel unlock routine, and the code unlocks them in name order, ignoring pointer value. This patch adds double_unlock_hb() to refactor the duplicated code segments. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075552.9856.48021.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12futex: additional (get|put)_futex_key() fixesDarren Hart
Impact: fix races futex_requeue and futex_lock_pi still had some bad (get|put)_futex_key() usage. This patch adds the missing put_futex_keys() and corrects a goto in futex_lock_pi() to avoid a double get. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075545.9856.75152.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12futex: update futex commentaryDarren Hart
Impact: cleanup The futex_hash_bucket can be a bit confusing when first looking at the code as it is a shared queue (and futex_q isn't a queue at all, but rather an element on the queue). The mmap_sem is no longer held outside of the futex_handle_fault() routine, yet numerous comments refer to it. The fshared argument is no an integer. I left some of these comments along as they are simply removed in future patches. Some of the commentary refering to futexes by virtual page mappings was not very clear, and completely accurate (as for shared futexes both the page and the offset are used to determine the key). For the purposes of the function description, just referring to "the futex" seems sufficient. With hashed futexes we now access the page after the hash-bucket is locked, and not only after it is enqueued. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075537.9856.29954.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10kernel/user.c: fix a memory leak when freeing up non-init usernamespaces usersDhaval Giani
We were returning early in the sysfs directory cleanup function if the user belonged to a non init usernamespace. Due to this a lot of the cleanup was not done and we were left with a leak. Fix the leak. Reported-by: Serge Hallyn <serue@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Tested-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-09copy_process: fix CLONE_PARENT && parent_exec_id interactionOleg Nesterov
CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we re-use the old parent, we must also re-use ->parent_exec_id to make sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed task exits. Also, move down the "p->parent_exec_id = p->self_exec_id" thing, to place two different cases together. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-09Fix fixpoint divide exception in acct_update_integralsHeiko Carstens
Frans Pop reported the crash below when running an s390 kernel under Hercules: Kernel BUG at 000738b4 verbose debug info unavailable! fixpoint divide exception: 0009 #1! SMP Modules linked in: nfs lockd nfs_acl sunrpc ctcm fsm tape_34xx cu3088 tape ccwgroup tape_class ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod dasd_eckd_mod dasd_mod CPU: 0 Not tainted 2.6.27.19 #13 Process awk (pid: 2069, task: 0f9ed9b8, ksp: 0f4f7d18) Krnl PSW : 070c1000 800738b4 (acct_update_integrals+0x4c/0x118) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 Krnl GPRS: 00000000 000007d0 7fffffff fffff830 00000000 ffffffff 00000002 0f9ed9b8 00000000 00008ca0 00000000 0f9ed9b8 0f9edda4 8007386e 0f4f7ec8 0f4f7e98 Krnl Code: 800738aa: a71807d0 lhi %r1,2000 800738ae: 8c200001 srdl %r2,1 800738b2: 1d21 dr %r2,%r1 >800738b4: 5810d10e l %r1,270(%r13) 800738b8: 1823 lr %r2,%r3 800738ba: 4130f060 la %r3,96(%r15) 800738be: 0de1 basr %r14,%r1 800738c0: 5800f060 l %r0,96(%r15) Call Trace: ( <000000000004fdea>! blocking_notifier_call_chain+0x1e/0x2c) <0000000000038502>! do_exit+0x106/0x7c0 <0000000000038c36>! do_group_exit+0x7a/0xb4 <0000000000038c8e>! SyS_exit_group+0x1e/0x30 <0000000000021c28>! sysc_do_restart+0x12/0x16 <0000000077e7e924>! 0x77e7e924 Reason for this is that cpu time accounting usually only happens from interrupt context, but acct_update_integrals gets also called from process context with interrupts enabled. So in acct_update_integrals we may end up with the following scenario: Between reading tsk->stime/tsk->utime and tsk->acct_timexpd an interrupt happens which updates accouting values. This causes acct_timexpd to be greater than the former stime + utime. The subsequent calculation of dtime = cputime_sub(time, tsk->acct_timexpd); will be negative and the division performed by cputime_to_jiffies(dtime) will generate an exception since the result won't fit into a 32 bit register. In order to fix this just always disable interrupts while accessing any of the accounting values. Reported by: Frans Pop <elendil@planet.nl> Tested by: Frans Pop <elendil@planet.nl> Cc: stable@kernel.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-04rcu: increment quiescent state counter in ksoftirqd()Eric Dumazet
If a machine is flooded by network frames, a cpu can loop 100% of its time inside ksoftirqd() without calling schedule(). This can delay RCU grace period to insane values. Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem. Paul: "This regression was a result of the recent change from "schedule()" to "cond_resched()", which got rid of that quiescent state in the common case where a reschedule is not needed". Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-03Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: don't allow setuid to succeed if the user does not have rt bandwidth sched_rt: don't start timer when rt bandwidth disabled
2009-03-03Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Teach RCU that idle task is not quiscent state at boot
2009-03-02x86-64: seccomp: fix 32/64 syscall holeRoland McGrath
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with ljmp, and then use the "syscall" instruction to make a 64-bit system call. A 64-bit process make a 32-bit system call with int $0x80. In both these cases under CONFIG_SECCOMP=y, secure_computing() will use the wrong system call number table. The fix is simple: test TS_COMPAT instead of TIF_IA32. Here is an example exploit: /* test case for seccomp circumvention on x86-64 There are two failure modes: compile with -m64 or compile with -m32. The -m64 case is the worst one, because it does "chmod 777 ." (could be any chmod call). The -m32 case demonstrates it was able to do stat(), which can glean information but not harm anything directly. A buggy kernel will let the test do something, print, and exit 1; a fixed kernel will make it exit with SIGKILL before it does anything. */ #define _GNU_SOURCE #include <assert.h> #include <inttypes.h> #include <stdio.h> #include <linux/prctl.h> #include <sys/stat.h> #include <unistd.h> #include <asm/unistd.h> int main (int argc, char **argv) { char buf[100]; static const char dot[] = "."; long ret; unsigned st[24]; if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0) perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?"); #ifdef __x86_64__ assert ((uintptr_t) dot < (1UL << 32)); asm ("int $0x80 # %0 <- %1(%2 %3)" : "=a" (ret) : "0" (15), "b" (dot), "c" (0777)); ret = snprintf (buf, sizeof buf, "result %ld (check mode on .!)\n", ret); #elif defined __i386__ asm (".code32\n" "pushl %%cs\n" "pushl $2f\n" "ljmpl $0x33, $1f\n" ".code64\n" "1: syscall # %0 <- %1(%2 %3)\n" "lretl\n" ".code32\n" "2:" : "=a" (ret) : "0" (4), "D" (dot), "S" (&st)); if (ret == 0) ret = snprintf (buf, sizeof buf, "stat . -> st_uid=%u\n", st[7]); else ret = snprintf (buf, sizeof buf, "result %ld\n", ret); #else # error "not this one" #endif write (1, buf, ret); syscall (__NR_exit, 1); return 2; } Signed-off-by: Roland McGrath <roland@redhat.com> [ I don't know if anybody actually uses seccomp, but it's enabled in at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27Fix recursive lock in free_uid()/free_user_ns()David Howells
free_uid() and free_user_ns() are corecursive when CONFIG_USER_SCHED=n, but free_user_ns() is called from free_uid() by way of uid_hash_remove(), which requires uidhash_lock to be held. free_user_ns() then calls free_uid() to complete the destruction. Fix this by deferring the destruction of the user_namespace. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27sched: don't allow setuid to succeed if the user does not have rt bandwidthDhaval Giani
Impact: fix hung task with certain (non-default) rt-limit settings Corey Hickey reported that on using setuid to change the uid of a rt process, the process would be unkillable and not be running. This is because there was no rt runtime for that user group. Add in a check to see if a user can attach an rt task to its task group. On failure, return EINVAL, which is also returned in CONFIG_CGROUP_SCHED. Reported-by: Corey Hickey <bugfood-ml@fatooh.org> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26sched_rt: don't start timer when rt bandwidth disabledHiroshi Shimamoto
Impact: fix incorrect condition check No need to start rt bandwidth timer when rt bandwidth is disabled. If this timer starts, it may stop at sched_rt_period_timer() on the first time. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26rcu: Teach RCU that idle task is not quiscent state at bootPaul E. McKenney
This patch fixes a bug located by Vegard Nossum with the aid of kmemcheck, updated based on review comments from Nick Piggin, Ingo Molnar, and Andrew Morton. And cleans up the variable-name and function-name language. ;-) The boot CPU runs in the context of its idle thread during boot-up. During this time, idle_cpu(0) will always return nonzero, which will fool Classic and Hierarchical RCU into deciding that a large chunk of the boot-up sequence is a big long quiescent state. This in turn causes RCU to prematurely end grace periods during this time. This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks() function to ignore the idle task as a quiescent state until the system has started up the scheduler in rest_init(), introducing a new non-API function rcu_idle_now_means_idle() to inform RCU of this transition. RCU maintains an internal rcu_idle_cpu_truthful variable to track this state, which is then used by rcu_check_callback() to determine if it should believe idle_cpu(). Because this patch has the effect of disallowing RCU grace periods during long stretches of the boot-up sequence, this patch also introduces Josh Triplett's UP-only optimization that makes synchronize_rcu() be a no-op if num_online_cpus() returns 1. This allows boot-time code that calls synchronize_rcu() to proceed normally. Note, however, that RCU callbacks registered by call_rcu() will likely queue up until later in the boot sequence. Although rcuclassic and rcutree can also use this same optimization after boot completes, rcupreempt must restrict its use of this optimization to the portion of the boot sequence before the scheduler starts up, given that an rcupreempt RCU read-side critical section may be preeempted. In addition, this patch takes Nick Piggin's suggestion to make the system_state global variable be __read_mostly. Changes since v4: o Changes the name of the introduced function and variable to be less emotional. ;-) Changes since v3: o WARN_ON(nr_context_switches() > 0) to verify that RCU switches out of boot-time mode before the first context switch, as suggested by Nick Piggin. Changes since v2: o Created rcu_blocking_is_gp() internal-to-RCU API that determines whether a call to synchronize_rcu() is itself a grace period. o The definition of rcu_blocking_is_gp() for rcuclassic and rcutree checks to see if but a single CPU is online. o The definition of rcu_blocking_is_gp() for rcupreempt checks to see both if but a single CPU is online and if the system is still in early boot. This allows rcupreempt to again work correctly if running on a single CPU after booting is complete. o Added check to rcupreempt's synchronize_sched() for there being but one online CPU. Tested all three variants both SMP and !SMP, booted fine, passed a short rcutorture test on both x86 and Power. Located-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-22PM: Split up sysdev_[suspend|resume] from device_power_[down|up]Rafael J. Wysocki
Move the sysdev_suspend/resume from the callee to the callers, with no real change in semantics, so that we can rework the disabling of interrupts during suspend/hibernation. This is based on an earlier patch from Linus. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21Merge branch 'hibernate'Linus Torvalds
* hibernate: PM: Fix suspend_console and resume_console to use only one semaphore PM: Wait for console in resume PM: Fix pm_notifiers during user mode hibernation swsusp: clean up shrink_all_zones() swsusp: dont fiddle with swappiness PM: fix build for CONFIG_PM unset PM/hibernate: fix "swap breaks after hibernation failures" PM/resume: wait for device probing to finish Consolidate driver_probe_done() loops into one place
2009-02-21PM: Fix suspend_console and resume_console to use only one semaphoreArve Hjønnevåg
This fixes a race where a thread acquires the console while the console is suspended, and the console is resumed before this thread releases it. In this case, the secondary console semaphore would be left locked, and the primary semaphore would be released twice. This in turn would cause the console switch on suspend or resume to hang forever. Note that suspend_console does not actually lock the console for clients that use acquire_console_sem, it only locks it for clients that use try_acquire_console_sem. If we change suspend_console to fully lock the console, then the kernel may deadlock on suspend. One client of try_acquire_console_sem is acquire_console_semaphore_for_printk, which uses it to prevent printk from using the console while it is suspended. Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21PM: Wait for console in resumeArve Hjønnevåg
Avoids later waking up to a blinking cursor if the device woke up and returned to sleep before the console switch happened. Signed-off-by: Brian Swetland <swetland@google.com> Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21PM: Fix pm_notifiers during user mode hibernationAndrey Borzenkov
Snapshot device is opened with O_RDONLY during suspend and O_WRONLY durig resume. Make sure we also call notifiers with correct parameter telling them what we are really doing. Signed-off-by: Andrey Borzenkov <arvidjaar@mail.ru> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21PM: fix build for CONFIG_PM unsetRafael J. Wysocki
Compilation of kprobes.c with CONFIG_PM unset is broken due to some broken config dependncies. Fix that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21PM/resume: wait for device probing to finishArjan van de Ven
the resume code does not currently wait for device probing to finish. Even without async function calls this is dicey and not correct, but with async function calls during the boot sequence this is going to get hit more... This patch adds the synchronization using the newly introduced helper. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Acked-by: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-19Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: limit the number of loops the ring buffer self test can make tracing: have function trace select kallsyms tracing: disable tracing while testing ring buffer tracing/function-graph-tracer: trace the idle tasks
2009-02-18tracing: limit the number of loops the ring buffer self test can makeSteven Rostedt
Impact: prevent deadlock if ring buffer gets corrupted This patch adds a paranoid check to make sure the ring buffer consumer does not go into an infinite loop. Since the ring buffer has been set to read only, the consumer should not loop for more than the ring buffer size. A check is added to make sure the consumer does not loop more than the ring buffer size. Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18tracing: have function trace select kallsymsSteven Rostedt
Impact: fix output of function tracer to be useful The function tracer is pretty useless if KALLSYMS is not configured. Unless you are good at reading hex values, the function tracer should select the KALLSYMS configuration. Also, the dynamic function tracer will fail its self test if KALLSYMS is not selected. Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18tracing: disable tracing while testing ring bufferSteven Rostedt
Impact: fix to prevent hard lockup on self tests If one of the tracers are broken and is constantly filling the ring buffer while the test of the ring buffer is running, it will hang the box. The reason is that the test is a consumer that will not stop till the ring buffer is empty. But if the tracer is broken and is constantly producing input to the buffer, this test will never end. The result is a lockup of the box. This happened when KALLSYMS was not defined and the dynamic ftrace test constantly filled the ring buffer, because the filter failed and all functions were being traced. Something was being called that constantly filled the buffer. Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list block: fix booting from partitioned md array block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb cciss: PCI power management reset for kexec paride/pg.c: xs(): &&/|| confusion fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free block: fix bad definition of BIO_RW_SYNC bsg: Fix sense buffer bug in SG_IO
2009-02-18pm: fix build for CONFIG_PM unsetRafael J. Wysocki
Compilation of kprobes.c with CONFIG_PM unset is broken due to some broken config dependncies. Fix that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Len Brown <lenb@kernel.org> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18cgroups: fix possible use after freeLi Zefan
In cgroup_kill_sb(), root is freed before sb is detached from the list, so another sget() may find this sb and call cgroup_test_super(), which will access the root that has been freed. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18block: fix bad definition of BIO_RW_SYNCJens Axboe
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before 213d9417fec62ef4c3675621b9364a667954d4dd. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-17Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: cpu hotplug fix
2009-02-17Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: more consistently use clock vs timer
2009-02-17Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: doc: mmiotrace.txt, buffer size control change trace: mmiotrace to the tracer menu in Kconfig mmiotrace: count events lost due to not recording
2009-02-17Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: iommu: fix Intel IOMMU write-buffer flushing futex: fix reference leak Trivial conflicts fixed manually in drivers/pci/intel-iommu.c
2009-02-17tracing/function-graph-tracer: trace the idle tasksFrederic Weisbecker
When the function graph tracer is activated, it iterates over the task_list to allocate a stack to store the return addresses. But the per cpu idle tasks are not iterated by using do_each_thread / while_each_thread. So we have to iterate on them manually. This fixes somes weirdness in the traces and many losses of traces. Examples on two cpus: 0) Xorg-4287 | 2.906 us | } 0) Xorg-4287 | 3.965 us | } 0) Xorg-4287 | 5.302 us | } ------------------------------------------ 0) Xorg-4287 => <idle>-0 ------------------------------------------ 0) <idle>-0 | 2.861 us | } 0) <idle>-0 | 0.526 us | set_normalized_timespec(); 0) <idle>-0 | 7.201 us | } 0) <idle>-0 | 8.214 us | } 0) <idle>-0 | | clockevents_program_event() { 0) <idle>-0 | | lapic_next_event() { 0) <idle>-0 | 0.510 us | native_apic_mem_write(); 0) <idle>-0 | 1.546 us | } 0) <idle>-0 | 2.583 us | } 0) <idle>-0 | + 12.435 us | } 0) <idle>-0 | + 13.470 us | } 0) <idle>-0 | 0.608 us | _spin_unlock_irqrestore(); 0) <idle>-0 | + 23.270 us | } 0) <idle>-0 | + 24.336 us | } 0) <idle>-0 | + 25.417 us | } 0) <idle>-0 | 0.593 us | _spin_unlock(); 0) <idle>-0 | + 41.869 us | } 0) <idle>-0 | + 42.906 us | } 0) <idle>-0 | + 95.035 us | } 0) <idle>-0 | 0.540 us | menu_reflect(); 0) <idle>-0 | ! 100.404 us | } 0) <idle>-0 | 0.564 us | mce_idle_callback(); 0) <idle>-0 | | enter_idle() { 0) <idle>-0 | 0.526 us | mce_idle_callback(); 0) <idle>-0 | 1.757 us | } 0) <idle>-0 | | cpuidle_idle_call() { 0) <idle>-0 | | menu_select() { 0) <idle>-0 | 0.525 us | pm_qos_requirement(); 0) <idle>-0 | 0.518 us | tick_nohz_get_sleep_length(); 0) <idle>-0 | 2.621 us | } [...] 1) <idle>-0 | 0.518 us | touch_softlockup_watchdog(); 1) <idle>-0 | + 14.355 us | } 1) <idle>-0 | + 22.840 us | } 1) <idle>-0 | + 25.949 us | } 1) <idle>-0 | | handle_irq() { 1) <idle>-0 | 0.511 us | irq_to_desc(); 1) <idle>-0 | | handle_edge_irq() { 1) <idle>-0 | 0.638 us | _spin_lock(); 1) <idle>-0 | | ack_apic_edge() { 1) <idle>-0 | 0.510 us | irq_to_desc(); 1) <idle>-0 | | move_native_irq() { 1) <idle>-0 | 0.510 us | irq_to_desc(); 1) <idle>-0 | 1.532 us | } 1) <idle>-0 | 0.511 us | native_apic_mem_write(); ------------------------------------------ 1) <idle>-0 => cat-5073 ------------------------------------------ 1) cat-5073 | 3.731 us | } 1) cat-5073 | | run_local_timers() { 1) cat-5073 | 0.533 us | hrtimer_run_queues(); 1) cat-5073 | | raise_softirq() { 1) cat-5073 | | __raise_softirq_irqoff() { 1) cat-5073 | | /* nr: 1 */ 1) cat-5073 | 2.718 us | } 1) cat-5073 | 3.814 us | } Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15trace: mmiotrace to the tracer menu in KconfigPekka Paalanen
Impact: cosmetic change in Kconfig menu layout This patch was originally suggested by Peter Zijlstra, but seems it was forgotten. CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable directly under the Kernel hacking / debugging menu in the kernel configuration system. They were present only for x86 and x86_64. Other tracers that use the ftrace tracing framework are in their own sub-menu. This patch moves the mmiotrace configuration options there. Since the Kconfig file, where the tracer menu is, is not architecture specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by x86/x86_64. CONFIG_MMIOTRACE now depends on it. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15mmiotrace: count events lost due to not recordingPekka Paalanen
Impact: enhances lost events counting in mmiotrace The tracing framework, or the ring buffer facility it uses, has a switch to stop recording data. When recording is off, the trace events will be lost. The framework does not count these, so mmiotrace has to count them itself. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13User namespaces: Only put the userns when we unhash the uidSerge E. Hallyn
uids in namespaces other than init don't get a sysfs entry. For those in the init namespace, while we're waiting to remove the sysfs entry for the uid the uid is still hashed, and alloc_uid() may re-grab that uid without getting a new reference to the user_ns, which we've already put in free_user before scheduling remove_user_sysfs_dir(). Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-13timers: more consistently use clock vs timerPeter Zijlstra
While reviewing the manpages, I noticed I'd missed some clock vs timer sites. Make sure that all timer functions call cpu_timer_sample_group() and not cpu_clock_sample_group(). This ensures that we enable the process wide timer in time, and therefore pay the O(n) thread group cost from the syscall. Not doing it here, will result in the first jiffy tick after setting the timer doing this, resulting in a very expensive tick (but only once) and a delay in actually starting the timer. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12sched: cpu hotplug fixIngo Molnar
rq_attach_root() does a kfree() with the runqueue lock held. That's not a very wise move, fix it. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11cgroups: fix lockdep subclasses overflowLi Zefan
I enabled all cgroup subsystems when compiling kernel, and then: # mount -t cgroup -o net_cls xxx /mnt # mkdir /mnt/0 This showed up immediately: BUG: MAX_LOCKDEP_SUBCLASSES too low! turning off the locking correctness validator. It's caused by the cgroup hierarchy lock: for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { struct cgroup_subsys *ss = subsys[i]; if (ss->root == root) mutex_lock_nested(&ss->hierarchy_mutex, i); } Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but MAX_LOCKDEP_SUBCLASSES is 8. This patch uses different lockdep keys for different subsystems. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11mm: fix dirty_bytes/dirty_background_bytes sysctls on 64bit archesSven Wegener
We need to pass an unsigned long as the minimum, because it gets casted to an unsigned long in the sysctl handler. If we pass an int, we'll access four more bytes on 64bit arches, resulting in a random minimum value. [rientjes@google.com: fix type of `old_bytes'] Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11futex: fix reference leakPeter Zijlstra
Catalin noticed that (38d47c1b7075: futex: rely on get_user_pages() for shared futexes) caused an mm_struct leak. Some tracing with the function graph tracer quickly pointed out that futex_wait() has exit paths with unbalanced reference counts. This regression was discovered by kmemleak. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: revert recent sync wakeup changes
2009-02-11Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: fix TIMER_ABSTIME for process wide cpu timers timers: split process wide cpu clocks/timers, fix x86: clean up hpet timer reinit timers: split process wide cpu clocks/timers, remove spurious warning timers: split process wide cpu clocks/timers signal: re-add dead task accumulation stats. x86: fix hpet timer reinit for x86_64 sched: fix nohz load balancer on cpu offline
2009-02-11Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ptrace, x86: fix the usage of ptrace_fork() i8327: fix outb() parameter order x86: fix math_emu register frame access x86: math_emu info cleanup x86: include correct %gs in a.out core dump x86, vmi: put a missing paravirt_release_pmd in pgd_dtor x86: find nr_irqs_gsi with mp_ioapic_routing x86: add clflush before monitor for Intel 7400 series x86: disable intel_iommu support by default x86: don't apply __supported_pte_mask to non-present ptes x86: fix grammar in user-visible BIOS warning x86/Kconfig.cpu: make Kconfig help readable in the console x86, 64-bit: print DMI info in the oops trace
2009-02-11sched: revert recent sync wakeup changesPeter Zijlstra
Intel reported a 10% regression (mysql+sysbench) on a 16-way machine with these patches: 1596e29: sched: symmetric sync vs avg_overlap d942fb6: sched: fix sync wakeups Revert them. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Bisected-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11timers: fix TIMER_ABSTIME for process wide cpu timersPeter Zijlstra
The POSIX timer interface allows for absolute time expiry values through the TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock every time we start it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11timers: split process wide cpu clocks/timers, fixPeter Zijlstra
To decrease the chance of a missed enable, always enable the timer when we sample it, we'll always disable it when we find that there are no active timers in the jiffy tick. This fixes a flood of warnings reported by Mike Galbraith. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>