aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2007-10-10[GFS2] Fix quota do_list operation hangAbhijith Das
This is the filesystem part of the patches to fix this bz. There are additional userland patches (gfs2_quota, libgfs2) for the complete solution. This patch adds a new field qu_ll_next to the gfs2_quota structure. This field allows us to create linked lists of quotas in the ondisk quota inode. Instead of scanning through the entire sparse quota file for valid quotas, we can now simply walk through the user and group quota linked lists to perform the do_list operation. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2007-10-08Merge branch 'master' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [IPv6]: Fix ICMPv6 redirect handling with target multicast address [PKT_SCHED] cls_u32: error code isn't been propogated properly [ROSE]: Fix rose.ko oops on unload [TCP]: Fix fastpath_cnt_hint when GSO skb is partially ACKed
2007-10-08mm: set_page_dirty_balance() vs ->page_mkwrite()Peter Zijlstra
All the current page_mkwrite() implementations also set the page dirty. Which results in the set_page_dirty_balance() call to _not_ call balance, because the page is already found dirty. This allows us to dirty a _lot_ of pages without ever hitting balance_dirty_pages(). Not good (tm). Force a balance call if ->page_mkwrite() was successful. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-07[ROSE]: Fix rose.ko oops on unloadAlexey Dobriyan
Commit a3d384029aa304f8f3f5355d35f0ae274454f7cd aka "[AX.25]: Fix unchecked rose_add_loopback_neigh uses" transformed rose_loopback_neigh var into statically allocated one. However, on unload it will be kfree's which can't work. Steps to reproduce: modprobe rose rmmod rose BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008 printing eip: c014c664 *pde = 00000000 Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC Modules linked in: rose ax25 fan ufs loop usbhid rtc snd_intel8x0 snd_ac97_codec ehci_hcd ac97_bus uhci_hcd thermal usbcore button processor evdev sr_mod cdrom CPU: 0 EIP: 0060:[<c014c664>] Not tainted VLI EFLAGS: 00210086 (2.6.23-rc9 #3) EIP is at kfree+0x48/0xa1 eax: 00000556 ebx: c1734aa0 ecx: f6a5e000 edx: f7082000 esi: 00000000 edi: f9a55d20 ebp: 00200287 esp: f6a5ef28 ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 Process rmmod (pid: 1823, ti=f6a5e000 task=f7082000 task.ti=f6a5e000) Stack: f9a55d20 f9a5200c 00000000 00000000 00000000 f6a5e000 f9a5200c f9a55a00 00000000 bf818cf0 f9a51f3f f9a55a00 00000000 c0132c60 65736f72 00000000 f69f9630 f69f9528 c014244a f6a4e900 00200246 f7082000 c01025e6 00000000 Call Trace: [<f9a5200c>] rose_rt_free+0x1d/0x49 [rose] [<f9a5200c>] rose_rt_free+0x1d/0x49 [rose] [<f9a51f3f>] rose_exit+0x4c/0xd5 [rose] [<c0132c60>] sys_delete_module+0x15e/0x186 [<c014244a>] remove_vma+0x40/0x45 [<c01025e6>] sysenter_past_esp+0x8f/0x99 [<c012bacf>] trace_hardirqs_on+0x118/0x13b [<c01025b6>] sysenter_past_esp+0x5f/0x99 ======================= Code: 05 03 1d 80 db 5b c0 8b 03 25 00 40 02 00 3d 00 40 02 00 75 03 8b 5b 0c 8b 73 10 8b 44 24 18 89 44 24 04 9c 5d fa e8 77 df fd ff <8b> 56 08 89 f8 e8 84 f4 fd ff e8 bd 32 06 00 3b 5c 86 60 75 0f EIP: [<c014c664>] kfree+0x48/0xa1 SS:ESP 0068:f6a5ef28 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-07Don't do load-average calculations at even 5-second intervalsLinus Torvalds
It turns out that there are a few other five-second timers in the kernel, and if the timers get in sync, the load-average can get artificially inflated by events that just happen to coincide. So just offset the load average calculation it by a timer tick. Noticed by Anders Boström, for whom the coincidence started triggering on one of his machines with the JBD jiffies rounding code (JBD is one of the subsystems that also end up using a 5-second timer by default). Tested-by: Anders Boström <anders@bostrom.dyndns.org> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-05Remove unnecessary cast in prefetch()Serge Belyshev
It is ok to call prefetch() function with NULL argument, as specifically commented in include/linux/prefetch.h. But in standard C, it is invalid to dereference NULL pointer (see C99 standard 6.5.3.2 paragraph 4 and note #84). prefetch() has a memory reference for its argument. Newer gcc versions (4.3 and above) will use that to conclude that "x" argument is non-null and thus wreaking havok everywhere prefetch() was inlined. Fixed by removing cast and changing asm constraint. [ It seems in theory gcc 4.2 could miscompile this too; although no cases known. In 2.6.24 we should probably switch to __builtin_prefetch() instead, but this is a simpler fix for now. -- AK ] Signed-off-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-03Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linusLinus Torvalds
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: [MIPS] Terminally fix local_{dec,sub}_if_positive [MIPS] Type proof reimplementation of cmpxchg. [MIPS] pg-r4k.c: Fix a typo in an R4600 v2 erratum workaround
2007-10-04Blackfin arch: fix PORT_J BUG for BF537/6 EMAC driver reported by Kalle ↵Michael Hennerich
Pokki <kalle.pokki@iki.fi> Cc: Kalle Pokki <kalle.pokki@iki.fi> Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Bryan Wu <bryan.wu@analog.com>
2007-10-04Blackfin arch: gpio pinmux and resource allocation API required by BF537 on ↵Michael Hennerich
chip ethernet mac driver Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Bryan Wu <bryan.wu@analog.com>
2007-10-03[MIPS] Terminally fix local_{dec,sub}_if_positiveRalf Baechle
They contain 64-bit instructions so wouldn't work on 32-bit kernels or 32-bit hardware. Since there are no users, blow them away. They probably were only ever created because there are atomic_sub_if_positive and atomic_dec_if_positive which exist only for sake of semaphores. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-10-03[MIPS] Type proof reimplementation of cmpxchg.Ralf Baechle
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-10-01Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linusLinus Torvalds
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: [MIPS] vmlinux.lds.S: Handle note sections [MIPS] Fix value of O_TRUNC
2007-10-01[MIPS] Fix value of O_TRUNCRalf Baechle
A "cleanup" almost two years ago deleted the old definition from <asm/fcntl.h>, so asm-generic/fcntl.h defaulted it to the the same value as FASYNC ... which happened to be the wrong thing. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-09-29i386: remove bogus comment about memory barrierNick Piggin
The comment being removed by this patch is incorrect and misleading. In the following situation: 1. load ... 2. store 1 -> X 3. wmb 4. rmb 5. load a <- Y 6. store ... 4 will only ensure ordering of 1 with 5. 3 will only ensure ordering of 2 with 6. Further, a CPU with strictly in-order stores will still only provide that 2 and 6 are ordered (effectively, it is the same as a weakly ordered CPU with wmb after every store). In all cases, 5 may still be executed before 2 is visible to other CPUs! The additional piece of the puzzle that mb() provides is the store/load ordering, which fundamentally cannot be achieved with any combination of rmb()s and wmb()s. This can be an unexpected result if one expected any sort of global ordering guarantee to barriers (eg. that the barriers themselves are sequentially consistent with other types of barriers). However sfence or lfence barriers need only provide an ordering partial ordering of memory operations -- Consider that wmb may be implemented as nothing more than inserting a special barrier entry in the store queue, or, in the case of x86, it can be a noop as the store queue is in order. And an rmb may be implemented as a directive to prevent subsequent loads only so long as their are no previous outstanding loads (while there could be stores still in store queues). I can actually see the occasional load/store being reordered around lfence on my core2. That doesn't prove my above assertions, but it does show the comment is wrong (unless my program is -- can send it out by request). So: mb() and smp_mb() always have and always will require a full mfence or lock prefixed instruction on x86. And we should remove this comment. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Paul McKenney <paulmck@us.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-28Merge branch 'master' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [TCP]: Fix MD5 signature handling on big-endian. [NET]: Zero length write() on socket should not simply return 0.
2007-09-28[TCP]: Fix MD5 signature handling on big-endian.David S. Miller
Based upon a report and initial patch by Peter Lieven. tcp4_md5sig_key and tcp6_md5sig_key need to start with the exact same members as tcp_md5sig_key. Because they are both cast to that type by tcp_v{4,6}_md5_do_lookup(). Unfortunately tcp{4,6}_md5sig_key use a u16 for the key length instead of a u8, which is what tcp_md5sig_key uses. This just so happens to work by accident on little-endian, but on big-endian it doesn't. Instead of casting, just place tcp_md5sig_key as the first member of the address-family specific structures, adjust the access sites, and kill off the ugly casts. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-27[MIPS] Fix CONFIG_BUILD_ELF64 kernels with symbols in CKSEG0.Ralf Baechle
The __pa() for those did assume that all symbols have XKPHYS values and the math fails for any other address range. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-09-26Revert "[PATCH] x86-64: fix x86_64-mm-sched-clock-share"Linus Torvalds
This reverts commit 184c44d2049c4db7ef6ec65794546954da2c6a0e. As noted by Dave Jones: "Linus, please revert the above cset. It doesn't seem to be necessary (it was added to fix a miscompile in 'make allnoconfig' which doesn't seem to be repeatable with it reverted) and actively breaks the ARM SA1100 framebuffer driver." Requested-by: Dave Jones <davej@redhat.com> Cc: Russell King <rmk+lkml@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-26Revert "x86-64: Disable local APIC timer use on AMD systems with C1E"Linus Torvalds
This reverts commit e66485d747505e9d960b864fc6c37f8b2afafaf0, since Rafael Wysocki noticed that the change only works for his in -mm, not in mainline (and that both "noapictimer" _and_ "apicmaintimer" are broken on his hardware, but that's apparently not a regression, just a symptom of the same issue that causes the automatic apic timer disable to not work). It turns out that it really doesn't work correctly on x86-64, since x86-64 doesn't use the generic clock events for timers yet. Thanks to Rafal for testing, and here's the ugly details on x86-64 as per Thomas: "I just looked into the code and the logic vs. noapictimer on SMP is completely broken. On i386 the noapictimer option not only disables the local APIC timer, it also registers the CPUs for broadcasting via IPI on SMP systems. The x86-64 code uses the broadcast only when the local apic timer is active, i.e. "noapictimer" is not on the command line. This defeats the whole purpose of "noapictimer". It should be there to make boxen work, where the local APIC timer actually has a hardware problem, e.g. the nx6325. The current implementation of x86_64 only fixes the ACPI c-states related problem where the APIC timer stops in C3(2), nothing else. On nx6325 and other AMD X2 equipped systems which have the C1E enabled we run into the following: PIT keeps jiffies (and the system) running, but the local APIC timer interrupts can get out of sync due to this C1E effect. I don't think this is a critical problem, but it is wrong nevertheless. I think it's safe to revert the C1E patch and postpone the fix to the clock events conversion." On further reflection, Thomas noted: "It's even worse than I thought on the first check: "noapictimer" on the command line of an SMP box prevents _ONLY_ the boot CPU apic timer from being used. But the secondary CPU is still unconditionally setting up the APIC timer and uses the non calibrated variable calibration_result, which is of course 0, to setup the APIC timer. Wreckage guaranteed." so we'll just have to wait for the x86 merge to hopefully fix this up for x86-64. Tested-and-requested-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-26x86-64: Disable local APIC timer use on AMD systems with C1EThomas Gleixner
commit 3556ddfa9284a86a59a9b78fe5894430f6ab4eef titled [PATCH] x86-64: Disable local APIC timer use on AMD systems with C1E solves a problem with AMD dual core laptops e.g. HP nx6325 (Turion 64 X2) with C1E enabled: When both cores go into idle at the same time, then the system switches into C1E state, which is basically the same as C3. This stops the local apic timer. This was debugged right after the dyntick merge on i386 and despite the patch title it fixes only the 32 bit path. x86_64 is still missing this fix. It seems that mainline is not really affected by this issue, as the PIT is running and keeps jiffies incrementing, but that's just waiting for trouble. -mm suffers from this problem due to the x86_64 high resolution timer patches. This is a quick and dirty port of the i386 code to x86_64. I spent quite a time with Rafael to debug the -mm / hrt wreckage until someone pointed us to this. I really had forgotten that we debugged this half a year ago already. Sigh, is it just me or is there something yelling arch/x86 into my ear? Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-26fix sctp_del_bind_addr() last argument typeAl Viro
It gets pointer to fastcall function, expects a pointer to normal one and calls the sucker. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-26Merge branch 'master' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [PPP_MPPE]: Don't put InterimKey on the stack SCTP : Add paramters validity check for ASCONF chunk SCTP: Discard OOTB packetes with bundled INIT early. SCTP: Clean up OOTB handling and fix infinite loop processing SCTP: Explicitely discard OOTB chunks SCTP: Send ABORT chunk with correct tag in response to INIT ACK SCTP: Validate buffer room when processing sequential chunks [PATCH] mac80211: fix initialisation when built-in [PATCH] net/mac80211/wme.c: fix sparse warning [PATCH] cfg80211: fix initialisation if built-in [PATCH] net/wireless/sysfs.c: Shut up build warning
2007-09-25SCTP : Add paramters validity check for ASCONF chunkWei Yongjun
If ADDIP is enabled, when an ASCONF chunk is received with ASCONF paramter length set to zero, this will cause infinite loop. By the way, if an malformed ASCONF chunk is received, will cause processing to access memory without verifying. This is because of not check the validity of parameters in ASCONF chunk. This patch fixed this. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-09-25SCTP: Clean up OOTB handling and fix infinite loop processingVlad Yasevich
While processing OOTB chunks as well as chunks with an invalid length of 0, it was possible to SCTP to get wedged inside an infinite loop because we didn't catch the condition correctly, or didn't mark the packet for discard correctly. This work is based on original findings and work by Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-09-25ACPI: CONFIG_ACPI_SLEEP=n power off regression in 2.6.23-rc8 (NOT in rc7)Alexey Starikovskiy
Signed-off-by: Alexey Starikovskiy <astarikovskiy@suse.de> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>
2007-09-24[MIPS] SMTC: Make ack_bad_irq() safe with no IM backstop.Ralf Baechle
Issue reported and original patch by Kevin Kissel, cleaner (imho) implementation by me. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-09-22ACPI: disable lower idle C-states across suspend/resumeThomas Gleixner
device_suspend() calls ACPI suspend functions, which seems to have undesired side effects on lower idle C-states. It took me some time to realize that especially the VAIO BIOSes (both Andrews jinxed UP and my elfstruck SMP one) show this effect. I'm quite sure that other bug reports against suspend/resume about turning the system into a brick have the same root cause. After fishing in the dark for quite some time, I realized that removing the ACPI processor module before suspend (this removes the lower C-state functionality) made the problem disappear. Interestingly enough the propability of having a bricked box is influenced by various factors (interrupts, size of the ram image, ...). Even adding a bunch of printks in the wrong places made the problem go away. The previous periodic tick implementation simply pampered over the problem, which explains why the dyntick / clockevents changes made this more prominent. We avoid complex functionality during the boot process and we have to do the same during suspend/resume. It is a similar scenario and equaly fragile. Add suspend / resume functions to the ACPI processor code and disable the lower idle C-states across suspend/resume. Fall back to the default idle implementation (halt) instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-23Blackfin arch: add some missing syscallBryan Wu
When compiling the Blackfin kernel, checksyscalls.pl will report lots of missing syscalls warnings. This patch will add some missing syscalls which make sense on Blackfin arch After appling this patch, toolchain should be rebuilt. Then recompiling the kernel with the new toolchain. Signed-off-by: Bryan Wu <bryan.wu@analog.com>
2007-10-03Binfmt_flat: Add minimum support for the Blackfin relocationsBernd Schmidt
Add minimum support for the Blackfin relocations, since we don't have enough space in each reloc. The idea is to store a value with one relocation so that subsequent ones can access it. Actually, this patch is required for Blackfin. Currently if BINFMT_FLAT is enabled, git-tree kernel will fail to compile. Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com> Signed-off-by: Bryan Wu <bryan.wu@analog.com> Cc: David McCullough <davidm@snapgear.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Miles Bader <miles.bader@necel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-09-21Revert "x86_64: Quicklist support for x86_64"Linus Torvalds
This reverts commit 34feb2c83beb3bdf13535a36770f7e50b47ef299. Suresh Siddha points out that this one breaks the fundamental requirement that you cannot free page table pages before the TLB caches are flushed. The quicklists do not give the same kinds of guarantees that the mmu_gather structure does, at least not in NUMA configurations. Requested-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Andi Kleen <ak@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-20signalfd simplificationDavide Libenzi
This simplifies signalfd code, by avoiding it to remain attached to the sighand during its lifetime. In this way, the signalfd remain attached to the sighand only during poll(2) (and select and epoll) and read(2). This also allows to remove all the custom "tsk == current" checks in kernel/signal.c, since dequeue_signal() will only be called by "current". I think this is also what Ben was suggesting time ago. The external effect of this, is that a thread can extract only its own private signals and the group ones. I think this is an acceptable behaviour, in that those are the signals the thread would be able to fetch w/out signalfd. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19sched: add /proc/sys/kernel/sched_compat_yieldIngo Molnar
add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-09-19Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linusLinus Torvalds
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: [MIPS] cpu-bugs64.c: GCC 3.3 constraint workaround [MIPS] DEC: Initialise ioasic_ssr_lock
2007-09-19Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: [POWERPC] Fix timekeeping on PowerPC 601 [POWERPC] Don't expose clock vDSO functions when CPU has no timebase [POWERPC] spusched: Fix null pointer dereference in find_victim
2007-09-19[MIPS] cpu-bugs64.c: GCC 3.3 constraint workaroundMaciej W. Rozycki
Add a workaround to address warnings generated on the "n" constraint by GCC 3.3 and below. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-09-19Fix NUMA Memory Policy Reference CountingLee Schermerhorn
This patch proposes fixes to the reference counting of memory policy in the page allocation paths and in show_numa_map(). Extracted from my "Memory Policy Cleanups and Enhancements" series as stand-alone. Shared policy lookup [shmem] has always added a reference to the policy, but this was never unrefed after page allocation or after formatting the numa map data. Default system policy should not require additional ref counting, nor should the current task's task policy. However, show_numa_map() calls get_vma_policy() to examine what may be [likely is] another task's policy. The latter case needs protection against freeing of the policy. This patch adds a reference count to a mempolicy returned by get_vma_policy() when the policy is a vma policy or another task's mempolicy. Again, shared policy is already reference counted on lookup. A matching "unref" [__mpol_free()] is performed in alloc_page_vma() for shared and vma policies, and in show_numa_map() for shared and another task's mempolicy. We can call __mpol_free() directly, saving an admittedly inexpensive inline NULL test, because we know we have a non-NULL policy. Handling policy ref counts for hugepages is a bit trickier. huge_zonelist() returns a zone list that might come from a shared or vma 'BIND policy. In this case, we should hold the reference until after the huge page allocation in dequeue_hugepage(). The patch modifies huge_zonelist() to return a pointer to the mempolicy if it needs to be unref'd after allocation. Kernel Build [16cpu, 32GB, ia64] - average of 10 runs: w/o patch w/ refcount patch Avg Std Devn Avg Std Devn Real: 100.59 0.38 100.63 0.43 User: 1209.60 0.37 1209.91 0.31 System: 81.52 0.42 81.64 0.34 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19Fix user namespace exiting OOPsPavel Emelyanov
It turned out, that the user namespace is released during the do_exit() in exit_task_namespaces(), but the struct user_struct is released only during the put_task_struct(), i.e. MUCH later. On debug kernels with poisoned slabs this will cause the oops in uid_hash_remove() because the head of the chain, which resides inside the struct user_namespace, will be already freed and poisoned. Since the uid hash itself is required only when someone can search it, i.e. when the namespace is alive, we can safely unhash all the user_struct-s from it during the namespace exiting. The subsequent free_uid() will complete the user_struct destruction. For example simple program #include <sched.h> char stack[2 * 1024 * 1024]; int f(void *foo) { return 0; } int main(void) { clone(f, stack + 1 * 1024 * 1024, 0x10000000, 0); return 0; } run on kernel with CONFIG_USER_NS turned on will oops the kernel immediately. This was spotted during OpenVZ kernel testing. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19Convert uid hash to hlistPavel Emelyanov
Surprisingly, but (spotted by Alexey Dobriyan) the uid hash still uses list_heads, thus occupying twice as much place as it could. Convert it to hlist_heads. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19[POWERPC] Fix timekeeping on PowerPC 601Benjamin Herrenschmidt
Recent changes to the timekeeping code broke support for the PowerPC 601 processor which doesn't have the usual timebase facility but a slightly different thing called (yuck) the RTC. This fixes it, boot tested on an old 601 based PowerMac 7200. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-09-16Merge branch 'master' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6: [SPARC64]: Warn user if cpu is ignored. [SPARC64]: Fix lockdep, particularly on SMP. [SPARC64]: Update defconfig.
2007-09-16Merge branch 'master' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [VLAN]: Fix net_device leak. [PPP] generic: Fix receive path data clobbering & non-linear handling [PPP] generic: Call skb_cow_head before scribbling over skb [NET] skbuff: Add skb_cow_head [BRIDGE]: Kill clone argument to br_flood_* [PPP] pppoe: Fill in header directly in __pppoe_xmit [PPP] pppoe: Fix data clobbering in __pppoe_xmit and return value [PPP] pppoe: Fix skb_unshare_check call position [SCTP]: Convert bind_addr_list locking to RCU [SCTP]: Add RCU synchronization around sctp_localaddr_list [PKT_SCHED]: sch_cbq.c: Shut up uninitialized variable warning [PKTGEN]: srcmac fix [IPV6]: Fix source address selection. [IPV4]: Just increment OutDatagrams once per a datagram. [IPV6]: Just increment OutDatagrams once per a datagram. [IPV6]: Fix unbalanced socket reference with MSG_CONFIRM. [NET_SCHED] protect action config/dump from irqs [NET]: Fix two issues wrt. SO_BINDTODEVICE.
2007-09-16Fix non-ISA link error in drivers/scsi/advansys.cMatthew Wilcox
When CONFIG_ISA is disabled, the isa_driver support will not be compiled in. Define stubs so that we don't get link-time errors. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-16[NET] skbuff: Add skb_cow_headHerbert Xu
This patch adds an optimised version of skb_cow that avoids the copy if the header can be modified even if the rest of the payload is cloned. This can be used in encapsulating paths where we only need to modify the header. As it is, this can be used in PPPOE and bridging. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-16[SCTP]: Convert bind_addr_list locking to RCUVlad Yasevich
Since the sctp_sockaddr_entry is now RCU enabled as part of the patch to synchronize sctp_localaddr_list, it makes sense to change all handling of these entries to RCU. This includes the sctp_bind_addrs structure and it's list of bound addresses. This list is currently protected by an external rw_lock and that looks like an overkill. There are only 2 writers to the list: bind()/bindx() calls, and BH processing of ASCONF-ACK chunks. These are already seriealized via the socket lock, so they will not step on each other. These are also relatively rare, so we should be good with RCU. The readers are varied and they are easily converted to RCU. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Sridhar Samdurala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-16[SCTP]: Add RCU synchronization around sctp_localaddr_listVlad Yasevich
sctp_localaddr_list is modified dynamically via NETDEV_UP and NETDEV_DOWN events, but there is not synchronization between writer (even handler) and readers. As a result, the readers can access an entry that has been freed and crash the sytem. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Sridhar Samdurala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-16[SPARC64]: Fix lockdep, particularly on SMP.David S. Miller
As noted by Al Viro, when we try to call prom_set_trap_table() in the SMP trampoline code we try to take the PROM call spinlock which doesn't work because the current thread pointer isn't valid yet and lockdep depends upon that being correct. Furthermore, we cannot set the current thread pointer register because it can't be properly dereferenced until we return from prom_set_trap_table(). Kernel TLB misses only work after that call. So do the PROM call to set the trap table directly instead of going through the OBP library C code, and thus avoid the lock altogether. These calls are guarenteed to be serialized fully. Since there are now no calls to the prom_set_trap_table{_sun4v}() library functions, they can be deleted. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-14Merge git://git.linux-xtensa.org/kernel/xtensa-feedLinus Torvalds
* git://git.linux-xtensa.org/kernel/xtensa-feed: [patch 1/2] Xtensa: enable arbitary tty speed setting ioctls [patch 2/2] xtensa console.c: remove duplicate #include [XTENSA] Add support for cache-aliasing [XTENSA] Add kernel module support [XTENSA] Add support for executable/non-executable feature in the mmu [XTENSA] Use the generic version of get_order [XTENSA] Initialize semaphore_wake_lock [XTENSA] Add typecast macro for constants [XTENSA] Fix timer instabilities. [XTENSA] Fix fadvise64_64 [XTENSA] Remove extraneous include statement [XTENSA] Move string-io functions to io.c from pci.c [XTENSA] Move pre-initialized structures to init_task.c [XTENSA] Add freestanding option to CFLAGS [XTENSA] Add getpgrp system-call to unistd.h [XTENSA] add missing system calls [XTENSA] fix wrong usage of __init and __initdata in traps.c
2007-09-14Merge branch 'for-linus' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6 * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6: Blackfin arch: fix some bugs in lib/string.h functions found by our string testing modules Blackfin arch: fix the aliased write macros Blackfin arch: Update/Fix PM support add new pm_ops valid
2007-09-14Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linusLinus Torvalds
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: [MIPS] 20Kc: Disable use of WAIT instruction. [MIPS] Workaround for 4Kc machine check exception [MIPS] Malta: Fix off by one bug in interrupt handler. [MIPS] No ide_default_io_base() if PCI IDE was not found [MIPS] Add #include <linux/profile.h> to arch/mips/kernel/time.c [MIPS] N32 needs to use compat_sys_futimesat [MIPS] rtlx: Fix build error. [MIPS] rtlx: fix int vs. long bug.
2007-09-14[MIPS] No ide_default_io_base() if PCI IDE was not foundAtsushi Nemoto
Revert b5438582090406e2ccb4169d9b2df7c9939ae42b and add no_pci_devices() check to avoid crash due to early calling of pci_get_class(). Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>