aboutsummaryrefslogtreecommitdiff
path: root/arch/x86_64
AgeCommit message (Collapse)Author
2006-04-22[PATCH] x86_64: Fix a race in the free_iommu pathMike Waychison
We do this by removing a micro-optimization that tries to avoid grabbing the iommu_bitmap_lock spinlock and using a bus-locked operation. This still races with other simultaneous alloc_iommu or free_iommu(size > 1) which both use bus-unlocked operations. The end result of this race is eventually ending up with an iommu_gart_bitmap that has bits errornously set all over, making large contiguous iommu space allocations fail with 'PCI-DMA: Out of IOMMU space'. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-22[PATCH] x86_64: Pass -32 to the assembler when compiling the 32bit vsyscall ↵Andi Kleen
pages This quietens warnings and actually fixes a bug. The unwind tables would come out wrong without -32, causing pthread cancellation during them to crash in the gcc runtime. The problem seems to only happen with newer binutils (it doesn't happen with 2.16.91.0.2 but happens wit 2.16.91.0.5) Thanks to David Altobelli <david.altobelli@hp.com> and Brian Baker <Brian.B@hp.com> for test case and initial analysis. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-22[PATCH] x86_64: sparsemem does not need node_mem_mapAndy Whitcroft
Seems we are trying to init the node_mem_map when we don't need to, for example when SPARSEMEM is enabled. This causes the error below during compilation. Use CONFIG_FLAT_NODE_MEM_MAP to gate allocation and init. arch/x86_64/mm/numa.c: In function `setup_node_zones': arch/x86_64/mm/numa.c:191: error: structure has no member named `node_mem_map' Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20[PATCH] i386/x86-64: Fix x87 information leak between processesAndi Kleen
AMD K7/K8 CPUs only save/restore the FOP/FIP/FDP x87 registers in FXSAVE when an exception is pending. This means the value leak through context switches and allow processes to observe some x87 instruction state of other processes. This was actually documented by AMD, but nobody recognized it as being different from Intel before. The fix first adds an optimization: instead of unconditionally calling FNCLEX after each FXSAVE test if ES is pending and skip it when not needed. Then do a x87 load from a kernel variable to clear FOP/FIP/FDP. This means other processes always will only see a constant value defined by the kernel in their FP state. I took some pain to make sure to chose a variable that's already in L1 during context switch to make the overhead of this low. Also alternative() is used to patch away the new code on CPUs who don't need it. Patch for both i386/x86-64. The problem was discovered originally by Jan Beulich. Richard Brunner provided the basic code for the workarounds, with contribution from Jan. This is CVE-2006-1056 Cc: richard.brunner@amd.com Cc: jbeulich@novell.com Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19[PATCH] Switch Kprobes inline functions to __kprobes for x86_64Prasanna S Panchamukhi
Andrew Morton pointed out that compiler might not inline the functions marked for inline in kprobes. There-by allowing the insertion of probes on these kprobes routines, which might cause recursion. This patch removes all such inline and adds them to kprobes section there by disallowing probes on all such routines. Some of the routines can even still be inlined, since these routines gets executed after the kprobes had done necessay setup for reentrancy. Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-18[PATCH] x86_64: Add tee and sync_file_rangeAndi Kleen
tee was already there for some reason for native 64bit, but sys_sync_file_range was missing. Also add it to the compat layer. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-18[PATCH] x86_64: x86_64 add crashdump trigger pointsVivek Goyal
o Start booting into the capture kernel after an Oops if system is in a unrecoverable state. System will boot into the capture kernel, if one is pre-loaded by the user, and capture the kernel core dump. o One of the following conditions should be true to trigger the booting of capture kernel. - panic_on_oops is set. - pid of current thread is 0 - pid of current thread is 1 - Oops happened inside interrupt context. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-18[PATCH] x86_64: Update defconfigAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14[PATCH] DMI: move dmi_scan.c from arch/i386 to drivers/firmware/Bjorn Helgaas
dmi_scan.c is arch-independent and is used by i386, x86_64, and ia64. Currently all three arches compile it from arch/i386, which means that ia64 and x86_64 depend on things in arch/i386 that they wouldn't otherwise care about. This is simply "mv arch/i386/kernel/dmi_scan.c drivers/firmware/" (removing trailing whitespace) and the associated Makefile changes. All three architectures already set CONFIG_DMI in their top-level Kconfig files. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andrey Panin <pazke@orbita1.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-11[PATCH] x86_64: Fix embarassing typo in mmconfig bus checkAndi Kleen
Surprising that it still worked at all with this - yes it was tested. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] i386/x86-64: Remove checks for value == NULL in PCI config space accessAndi Kleen
Nobody should pass NULL here. Could in theory make it a BUG, but the NULL pointer oops will do as well. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] x86_64: Remove check for canonical RIPAndi Kleen
As pointed out by Linus it is useless now because entry.S should handle it correctly in all cases. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] vesafb: Fix incorrect logo colors in x86_64Antonino A. Daplas
Bugzilla Bug 6299: A pixel size of 8 bits produces wrong logo colors in x86_64. The driver has 2 methods for setting the color map, using the protected mode interface provided by the video BIOS and directly writing to the VGA registers. The former is not supported in x86_64 and the latter is enabled only in i386. Fix by enabling the latter method in x86_64 only if supported by the BIOS. If both methods are unsupported, change the visual of vesafb to STATIC_PSEUDOCOLOR. Signed-off-by: Antonino Daplas <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] No arch-specific strpbrk implementationsKyle McMartin
While cleaning up parisc_ksyms.c earlier, I noticed that strpbrk wasn't being exported from lib/string.c. Investigating further, I noticed a changeset that removed its export and added it to _ksyms.c on a few more architectures. The justification was that "other arches do it." I think this is wrong, since no architecture currently defines __HAVE_ARCH_STRPBRK, there's no reason for any of them to be exporting it themselves. Therefore, consolidate the export to lib/string.c. Signed-off-by: Kyle McMartin <kyle@parisc-linux.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] Configurable NODES_SHIFTYasunori Goto
Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for each arch. Its definition is sometimes configurable. Indeed, ia64 defines 5 NODES_SHIFT values in the current git tree. But it looks a bit messy. SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has been changeable by config. Suitable node's number may be changed in the future even if it is other architecture. So, I wrote configurable node's number. This patch set defines just default value for each arch which needs multi nodes except ia64. But, it is easy to change to configurable if necessary. On ia64 the number of nodes can be already configured in generic ia64 and SN2 config. But, NODES_SHIFT is defined for DIG64 and HP'S machine too. So, I changed it so that all platforms can be configured via CONFIG_NODES_SHIFT. It would be simpler. See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2 Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jack Steiner <steiner@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Update 32-bit system call tableAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Eliminate IA32_NR_syscalls defineAndi Kleen
Or rather compute it based on the table length automatically. This also has the intended side effect of not warning for new system calls anymore. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: fix CONFIG_REORDERSam Ravnborg
Fix CONFIG_REORDER. The value of cflags-y was assined to CFLAGS before cflags-y was assigned the value used for CONFIG_REORDER. Use cflags-y for all CFLAGS options in the Makefile to avoid this happening again. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Plug GS leak in arch_prctl()John Blackwood
In linux-2.6.16, we have noticed a problem where the gs base value returned from an arch_prtcl(ARCH_GET_GS, ...) call will be incorrect if: - the current/calling task has NOT set its own gs base yet to a non-zero value, - some other task that ran on the same processor previously set their own gs base to a non-zero value. In this situation, the ARCH_GET_GS code will read and return the MSR_KERNEL_GS_BASE msr register. However, since the __switch_to() code does NOT load/zero the MSR_KERNEL_GS_BASE register when the task that is switched IN has a zero next->gs value, the caller of arch_prctl(ARCH_GET_GS, ...) will get back the value of some previous tasks's gs base value instead of 0. Change the arch_prctl() ARCH_GET_GS code to only read and return the MSR_KERNEL_GS_BASE msr register if the 'gs' register of the calling task is non-zero. Side note: Since in addition to using arch_prctl(ARCH_SET_GS, ...), a task can also setup a gs base value by using modify_ldt() and write an index value into 'gs' from user space, the patch below reads 'gs' instead of using thread.gs, since in the modify_ldt() case, the thread.gs value will be 0, and incorrect value would be returned (the task->thread.gs value). When the user has not set its own gs base value and the 'gs' register is zero, then the MSR_KERNEL_GS_BASE register will not be read and a value of zero will be returned by reading and returning 'task->thread.gs'. The first patch shown below is an attempt at implementing this approach. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Fix drift with HPET timer enabledJordan Hargrave
If the HPET timer is enabled, the clock can drift by ~3 seconds a day. This is due to the HPET timer not being initialized with the correct setting (still using PIT count). If HZ changes, this drift can become even more pronounced. HPET patch initializes tick_nsec with correct tick_nsec settings for HPET timer. Vojtech comments: "It's not entirely correct (it assumes the HPET ticks totally exactly), but it's significantly better than assuming the PIT error there." Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] i386/x86-64: Return defined error value for bad PCI config space ↵Andi Kleen
accesses Mostly to get better handling when a extended config space access has to fallback to Type1. Cc: gregkh@suse.de Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] i386/x86_64: Check if MCFG works for the first 16 bussesAndi Kleen
Previously only the first bus would be checked against Type 1. Why 16? Checking all would need too much memory and we can assume that systems with more than 16 busses have better than average quality BIOS. This is an additional defense against bad MCFG tables. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Fixup read_mostly section on internode cache line size for vSMPRavikiran G Thirumalai
Fixup the read mostly section to start at internode cacheline boundary. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Don't return error for HPET initialization in initcallAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Don't export strlen twiceAndi Kleen
Fix WARNING: vmlinux: 'strlen' exported twice. Previous export was in vmlinux Reported by Mats Johannesson Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: When user could have changed RIP always force IRETAndi Kleen
Intel EM64T CPUs handle uncanonical return addresses differently from AMD CPUs. The exception is reported in the SYSRET, not the next instruction. This leads to the kernel exception handler running on the user stack with the wrong GS because the kernel didn't expect exceptions on this instruction. This version of the patch has the teething problems that plagued an earlier version fixed. This is CVE-2006-0744 Thanks to Ernie Petrides and Asit B. Mallick for analysis and initial patches. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Don't run NMI watchdog during machine checksAndi Kleen
Machine checks can stall the machine for a long time and it's not good to trigger the nmi watchdog during that. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Clear APIC feature bit when local APIC is disabledAndi Kleen
Needed for other checks later in ACPI. Pointed out by Len Brown Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Fix compilation with CONFIG_PCI=n / allnoconfigAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] i386/x86-64: Check that MCFG points to an e820 reserved areaArjan van de Ven
This patch introduces a user for the e820_all_mapped function: There have been several machines that don't have a working MMCONFIG, often because of a buggy MCFG table in the ACPI bios. This patch adds a simple sanity check that detects a whole bunch of these cases, and when it detects it, linux now boots rather than crash-and-burns. The accuracy of this detection can in principle be improved if there was a "is this entire range in e820 with THIS attribute", but no such function exist and the complexity needed for this is not really worth it; this simple check already catches most cases anyway. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Introduce e820_all_mappedArjan van de Ven
Introduce a e820_all_mapped() function which checks if the entire range <start,end> is mapped with type. This is done by moving the local start variable to the end of each known-good region; if at the end of the function the start address is still before end, there must be a part that's not of the correct type; otherwise it's a good region. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Rename e820_mapped to e820_any_mappedArjan van de Ven
Rename e820_mapped to e820_any_mapped since it tests if any part of the range is mapped according to the type. Later steps will introduce e820_all_mapped which will check if the entire range is mapped with the type. Both have their merit. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Handle empty PXMs that only contain hotplug memoryAndi Kleen
The node setup code would try to allocate the node metadata in the node itself, but that fails if there is no memory in there. This can happen with memory hotplug when the hotplug area defines an so far empty node. Now use bootmem to try to allocate the mem_map in other nodes. And if it fails don't panic, but just ignore the node. To make this work I added a new __alloc_bootmem_nopanic function that does what its name implies. TBD should try to use nearby nodes here. Currently we just use any. It's hard to do it better because bootmem doesn't have proper fallback lists yet. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Reserve SRAT hotadd memory on x86-64Andi Kleen
From: Keith Mannthey, Andi Kleen Implement memory hotadd without sparsemem. The memory in the SRAT hotadd area is just preserved instead and can be activated later. There are a few restrictions: - Only one continuous hotadd area allowed per node The main problem is dealing with the many buggy SRAT tables that are out there. The strategy here is to reject anything suspicious. Originally from Keith Mannthey, with several hacks and changes by AK and also contributions from Andrew Morton [ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>: 1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n. Rebuilding zonelist is necessary when the system has just memory < 4G at boot, and hot add memory > 4G. because x86_64 has DMA32, ZONE_NORAML is not included into zonelist at boot time if system doesn't have memory >4G at boot. [AK: should just force the higher zones at boot time when SRAT tells us] 2) zone and node's spanned_pages and present_pages are not incremented. They should be. For example, our server (ia64/Fujitsu PrimeQuest) can equip memory from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have possible 1T +memory. (Microsoft requires "write all possible memory in SRAT") When we reserve memmap for possible 1T memory, Linux will not work well in +minimum 4G configuraion ;) [AK: needs limiting to 5-10% of max memory] ] Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Support memory hotadd without sparsememAndi Kleen
Memory hotadd doesn't need SPARSEMEM, but can be handled by just preallocating mem_maps. This only needs some untangling of ifdefs to enable the necessary code even without SPARSEMEM. Originally from Keith Mannthey, hacked by AK. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Clean up execve pathAndi Kleen
Just call IRET always, no need for any special cases. Needed for the next bug fix. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Update defconfigAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] unexport get_wchanAdrian Bunk
The only user of get_wchan is the proc fs - and proc can't be built modular. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Don't pass boot parameters to argv_init[]OGAWA Hirofumi
The boot cmdline is parsed in parse_early_param() and parse_args(,unknown_bootoption). And __setup() is used in obsolete_checksetup(). start_kernel() -> parse_args() -> unknown_bootoption() -> obsolete_checksetup() If __setup()'s callback (->setup_func()) returns 1 in obsolete_checksetup(), obsolete_checksetup() thinks a parameter was handled. If ->setup_func() returns 0, obsolete_checksetup() tries other ->setup_func(). If all ->setup_func() that matched a parameter returns 0, a parameter is seted to argv_init[]. Then, when runing /sbin/init or init=app, argv_init[] is passed to the app. If the app doesn't ignore those arguments, it will warning and exit. This patch fixes a wrong usage of it, however fixes obvious one only. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Mark unwind info for signal trampolines in vDSOsJakub Jelinek
Mark unwind info for signal trampolines using the new S augmentation flag introduced in: http://gcc.gnu.org/PR26208. GCC 4.2 (or patched earlier GCC) will be able to special case unwinding through frames right above signal trampolines. As the augmentations start with z flag and S is at the very end of the augmentation string, older GCCs will just skip the S flag as unknown (that's why an augmentation flag was chosen over say a new CFA opcode). Signed-off-by: Jakub Jelinek <jakub@redhat.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] RTC: Remove RTC UIP synchronization on x86_64Matt Mackall
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Andi Kleen <ak@muc.de> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] Notifier chain update: API changesAlan Stern
The kernel's implementation of notifier chains is unsafe. There is no protection against entries being added to or removed from a chain while the chain is in use. The issues were discussed in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2 We noticed that notifier chains in the kernel fall into two basic usage classes: "Blocking" chains are always called from a process context and the callout routines are allowed to sleep; "Atomic" chains can be called from an atomic context and the callout routines are not allowed to sleep. We decided to codify this distinction and make it part of the API. Therefore this set of patches introduces three new, parallel APIs: one for blocking notifiers, one for atomic notifiers, and one for "raw" notifiers (which is really just the old API under a new name). New kinds of data structures are used for the heads of the chains, and new routines are defined for registration, unregistration, and calling a chain. The three APIs are explained in include/linux/notifier.h and their implementation is in kernel/sys.c. With atomic and blocking chains, the implementation guarantees that the chain links will not be corrupted and that chain callers will not get messed up by entries being added or removed. For raw chains the implementation provides no guarantees at all; users of this API must provide their own protections. (The idea was that situations may come up where the assumptions of the atomic and blocking APIs are not appropriate, so it should be possible for users to handle these things in their own way.) There are some limitations, which should not be too hard to live with. For atomic/blocking chains, registration and unregistration must always be done in a process context since the chain is protected by a mutex/rwsem. Also, a callout routine for a non-raw chain must not try to register or unregister entries on its own chain. (This did happen in a couple of places and the code had to be changed to avoid it.) Since atomic chains may be called from within an NMI handler, they cannot use spinlocks for synchronization. Instead we use RCU. The overhead falls almost entirely in the unregister routine, which is okay since unregistration is much less frequent that calling a chain. Here is the list of chains that we adjusted and their classifications. None of them use the raw API, so for the moment it is only a placeholder. ATOMIC CHAINS ------------- arch/i386/kernel/traps.c: i386die_chain arch/ia64/kernel/traps.c: ia64die_chain arch/powerpc/kernel/traps.c: powerpc_die_chain arch/sparc64/kernel/traps.c: sparc64die_chain arch/x86_64/kernel/traps.c: die_chain drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list kernel/panic.c: panic_notifier_list kernel/profile.c: task_free_notifier net/bluetooth/hci_core.c: hci_notifier net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain net/ipv6/addrconf.c: inet6addr_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain net/netlink/af_netlink.c: netlink_chain BLOCKING CHAINS --------------- arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain arch/s390/kernel/process.c: idle_chain arch/x86_64/kernel/process.c idle_notifier drivers/base/memory.c: memory_chain drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list drivers/macintosh/adb.c: adb_client_list drivers/macintosh/via-pmu.c sleep_notifier_list drivers/macintosh/via-pmu68k.c sleep_notifier_list drivers/macintosh/windfarm_core.c wf_client_list drivers/usb/core/notify.c usb_notifier_list drivers/video/fbmem.c fb_notifier_list kernel/cpu.c cpu_chain kernel/module.c module_notify_list kernel/profile.c munmap_notifier kernel/profile.c task_exit_notifier kernel/sys.c reboot_notifier_list net/core/dev.c netdev_chain net/decnet/dn_dev.c: dnaddr_chain net/ipv4/devinet.c: inetaddr_chain It's possible that some of these classifications are wrong. If they are, please let us know or submit a patch to fix them. Note that any chain that gets called very frequently should be atomic, because the rwsem read-locking used for blocking chains is very likely to incur cache misses on SMP systems. (However, if the chain's callout routines may sleep then the chain cannot be atomic.) The patch set was written by Alan Stern and Chandra Seetharaman, incorporating material written by Keith Owens and suggestions from Paul McKenney and Andrew Morton. [jes@sgi.com: restructure the notifier chain initialization macros] Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] lightweight robust futexes: x86_64Ingo Molnar
x86_64: add the futex_atomic_cmpxchg_inuser() assembly implementation, and wire up the new syscalls. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ulrich Drepper <drepper@redhat.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] for_each_online_pgdat: renaming for_each_pgdatKAMEZAWA Hiroyuki
Replace for_each_pgdat() with for_each_online_pgdat(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] unify pfn_to_page: x86_64 pfn_to_pageKAMEZAWA Hiroyuki
x86_64 can use generic funcs. For DISCONTIGMEM, CONFIG_OUT_OF_LINE_PFN_TO_PAGE is selected. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] sched: new sched domain for representing multi-coreSiddha, Suresh B
Add a new sched domain for representing multi-core with shared caches between cores. Consider a dual package system, each package containing two cores and with last level cache shared between cores with in a package. If there are two runnable processes, with this appended patch those two processes will be scheduled on different packages. On such systems, with this patch we have observed 8% perf improvement with specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2 users). This new domain will come into play only on multi-core systems with shared caches. On other systems, this sched domain will be removed by domain degeneration code. This new domain can be also used for implementing power savings policy (see OLS 2005 CMP kernel scheduler paper for more details.. I will post another patch for power savings policy soon) Most of the arch/* file changes are for cpu_coregroup_map() implementation. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] bitops: x86_64: use generic bitopsAkinobu Mita
- remove sched_find_first_bit() - remove generic_hweight{64,32,16,8}() - remove ext2_{set,clear,test,find_first_zero,find_next_zero}_bit() - remove minix_{test,set,test_and_clear,test,find_first_zero}_bit() Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] kprobes: fix broken fault handling for x86_64Prasanna S Panchamukhi
Provide proper kprobes fault handling, if a user-specified pre/post handlers tries to access user address space, through copy_from_user(), get_user() etc. The user-specified fault handler gets called only if the fault occurs while executing user-specified handlers. In such a case user-specified handler is allowed to fix it first, later if the user-specifed fault handler does not fix it, we try to fix it by calling fix_exception(). The user-specified handler will not be called if the fault happens when single stepping the original instruction, instead we reset the current probe and allow the system page fault handler to fix it up. Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] kprobe handler: discard user space trapbibo,mao
Currently kprobe handler traps only happen in kernel space, so function kprobe_exceptions_notify should skip traps which happen in user space. This patch modifies this, and it is based on 2.6.16-rc4. Signed-off-by: bibo mao <bibo.mao@intel.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> Cc: <hiramatu@sdl.hitachi.co.jp> Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] kretprobe instance recycled by parent processbibo mao
When kretprobe probes the schedule() function, if the probed process exits then schedule() will never return, so some kretprobe instances will never be recycled. In this patch the parent process will recycle retprobe instances of the probed function and there will be no memory leak of kretprobe instances. Signed-off-by: bibo mao <bibo.mao@intel.com> Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>