aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2009-02-24Merge branch 'x86/defconfig' into x86/mce2H. Peter Anvin
Conflicts (resolved): arch/x86/configs/x86_64_defconfig Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce: enable machine checks in 32-bit defconfigH. Peter Anvin
Impact: defconfig change Enable MCE in the 32-bit defconfig. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce: enable machine checks in 64-bit defconfigAndi Kleen
Impact: defconfig change Enable MCE in the 64-bit defconfig. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-24x86, mce, cmci: recheck CMCI banks after APIC has been enabled on CPU #0Andi Kleen
Impact: Fix marginal race condition One the first CPU the machine checks are enabled early before the local APIC is enabled. This could in theory lead to some lost CMCI events very early during boot because CMCIs cannot be delivered with disabled LAPIC. The poller also doesn't recover from this because it doesn't check CMCI banks. Add an explicit CMCI banks check after the LAPIC is enabled. This is only done for CPU #0, the other CPUs only initialize machine checks after the LAPIC is on. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: disable CMCI on rebootingAndi Kleen
Impact: Avoids confusing other OSes. Disable the CMCI vector on reboot to avoid confusing other OS. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: remove incorrect __cpuinit/__cpuexit annotationsH. Peter Anvin
Impact: Bug fix on UP The MCE code is reinitialized from resume, so we can't use __cpuinit/__cpuexit for most of the code. Remove those annotations for anything downstream of mce_init(). Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: add CMCI supportAndi Kleen
Impact: Major new feature Intel CMCI (Corrected Machine Check Interrupt) is a new feature on Nehalem CPUs. It allows the CPU to trigger interrupts on corrected events, which allows faster reaction to them instead of with the traditional polling timer. Also use CMCI to discover shared banks. Machine check banks can be shared by CPU threads or even cores. Using the CMCI enable bit it is possible to detect the fact that another CPU already saw a specific bank. Use this to assign shared banks only to one CPU to avoid reporting duplicated events. On CPU hot unplug bank sharing is re discovered. This is done using a thread that cycles through all the CPUs. To avoid races between the poller and CMCI we only poll for banks that are not CMCI capable and only check CMCI owned banks on a interrupt. The shared banks ownership information is currently only used for CMCI interrupts, not polled banks. The sharing discovery code follows the algorithm recommended in the IA32 SDM Vol3a 14.5.2.1 The CMCI interrupt handler just calls the machine check poller to pick up the machine check event that caused the interrupt. I decided not to implement a separate threshold event like the AMD version has, because the threshold is always one currently and adding another event didn't seem to add any value. Some code inspired by Yunhong Jiang's Xen implementation, which was in term inspired by a earlier CMCI implementation by me. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: define MSR names and fields for new CMCI registersAndi Kleen
Impact: New register definitions only CMCI means support for raising an interrupt on a corrected machine check event instead of having to poll for it. It's a new feature in Intel Nehalem CPUs available on some machine check banks. For details see the IA32 SDM Vol3a 14.5 Define the registers for it as a preparation for further patches. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: use polled banks bitmap in machine check pollerAndi Kleen
Define a per cpu bitmap that contains the banks polled by the machine check poller. This is needed for the CMCI code in the next patches to be able to disable polling on specific banks. The bank by default contains all banks, so there is no behaviour change. Only future code will remove some banks from the polling set. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce: replace machine check events logged interval with ratelimitAndi Kleen
Impact: behavior change, use common code Use a standard leaky bucket ratelimit for the machine check warning print interval instead of waiting every check_interval. Also decrease the limit to twice per minute. This interacts better with threshold interrupts because they can happen more often than check_interval. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: avoid potential reentry of threshold interruptAndi Kleen
Impact: minor bugfix The threshold handler on AMD (and soon on Intel) could be theoretically reentered by the hardware. This could lead to corrupted events because the machine check poll code assumes it is not reentered. Move the APIC ACK to the end of the interrupt handler to let the hardware avoid that. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: factor out threshold interrupt handlerAndi Kleen
Impact: cleanup; preparation for feature The mce_amd_64 code has an own private MC threshold vector with an own interrupt handler. Since Intel needs a similar handler it makes sense to share the vector because both can not be active at the same time. I factored the common APIC handler code into a separate file which can be used by both the Intel or AMD MC code. This is needed for the next patch which adds an Intel specific CMCI handler. This patch should be a nop for AMD, it just moves some code around. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24x86, mce, cmci: export MAX_NR_BANKSAndi Kleen
Impact: Cleanup (code movement) Move MAX_NR_BANKS into mce.h because it's needed there for followup patches. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-23Merge branch 'x86/urgent' into x86/mce2H. Peter Anvin
2009-02-23x86, mce: remove invalid __cpuinit/__cpuexit annotationsH. Peter Anvin
Impact: Bug fix when CPU hotplug is disabled Correct the following broken __cpuinit/__cpuexit annotations: - mce_cpu_features() is called from mce_resume(), and so cannot be __cpuinit. - mce_disable_cpu() and mce_reenable_cpu() are called from mce_cpu_callback(), and so cannot be __cpuexit(). Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-22PM: Split up sysdev_[suspend|resume] from device_power_[down|up]Rafael J. Wysocki
Move the sysdev_suspend/resume from the callee to the callers, with no real change in semantics, so that we can rework the disabling of interrupts during suspend/hibernation. This is based on an earlier patch from Linus. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-22x86: Add IRQF_TIMER to legacy x86 timer interrupt descriptorsLinus Torvalds
Right now nobody cares, but the suspend/resume code will eventually want to suspend device interrupts without suspending the timer, and will depend on this flag to know. The modern x86 timer infrastructure uses the local APIC timers and never shows up as a device interrupt at all, so it isn't affected and doesn't need any of this. Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21x86_64: Fix S3 fail pathJiri Slaby
As acpi_enter_sleep_state can fail, take this into account in do_suspend_lowlevel and don't return to the do_suspend_lowlevel's caller. This would break (currently) fpu status and preempt count. Technically, this means use `call' instead of `jmp' and `jmp' to the `resume_point' after the `call' (i.e. if acpi_enter_sleep_state returns=fails). `resume_point' will handle the restore of fpu and preempt count gracefully. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>
2009-02-21x86_64: acpi/wakeup_64 cleanupJiri Slaby
- remove %ds re-set, it's already set in wakeup_long64 - remove double labels and alignment (ENTRY already adds both) - use meaningful resume point labelname - skip alignment while jumping from wakeup_long64 to the resume point - remove .size, .type and unused labels [v2] - added ENDPROCs Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>
2009-02-20x86, mce: remove incorrect __cpuinit for mce_cpu_features()H. Peter Anvin
Impact: Bug fix on UP Checkin 6ec68bff3c81e776a455f6aca95c8c5f1d630198: x86, mce: reinitialize per cpu features on resume introduced a call to mce_cpu_features() in the resume path, in order for the MCE machinery to get properly reinitialized after a resume. However, this function (and its successors) was flagged __cpuinit, which becomes __init on UP configurations (on SMP suspend/resume requires CPU hotplug and so this would not be seen.) Remove the offending __cpuinit annotations for mce_cpu_features() and its successor functions. Cc: Andi Kleen <ak@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-20x86: use the right protections for split-up pagetablesIngo Molnar
Steven Rostedt found a bug in where in his modified kernel ftrace was unable to modify the kernel text, due to the PMD itself having been marked read-only as well in split_large_page(). The fix, suggested by Linus, is to not try to 'clone' the reference protection of a huge-page, but to use the standard (and permissive) page protection bits of KERNPG_TABLE. The 'cloning' makes sense for the ptes but it's a confused and incorrect concept at the page table level - because the pagetable entry is a set of all ptes and hence cannot 'clone' any single protection attribute - the ptes can be any mixture of protections. With the permissive KERNPG_TABLE, even if the pte protections get changed after this point (due to ftrace doing code-patching or other similar activities like kprobes), the resulting combined protections will still be correct and the pte's restrictive (or permissive) protections will control it. Also update the comment. This bug was there for a long time but has not caused visible problems before as it needs a rather large read-only area to trigger. Steve possibly hacked his kernel with some really large arrays or so. Anyway, the bug is definitely worth fixing. [ Huang Ying also experienced problems in this area when writing the EFI code, but the real bug in split_large_page() was not realized back then. ] Reported-by: Steven Rostedt <rostedt@goodmis.org> Reported-by: Huang Ying <ying.huang@intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-20x86, vmi: TSC going backwards check in vmi clocksourceAlok N Kataria
Impact: fix time warps under vmware Similar to the check for TSC going backwards in the TSC clocksource, we also need this check for VMI clocksource. Signed-off-by: Alok N Kataria <akataria@vmware.com> Cc: Zachary Amsden <zach@vmware.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org
2009-02-19x86, mce: use %ll instead of %L for 64-bit numbersH. Peter Anvin
Impact: Cleanup The standard spelling of a printf pattern for long long is "ll", not "L", which is for long double. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19x86, mce: separate correct machine check poller and fatal exception handlerAndi Kleen
Impact: cleanup, performance enhancement The machine check poller is diverging more and more from the fatal exception handler. Instead of adding more special cases separate the code paths completely. The corrected poll path is actually quite simple, and this doesn't result in much code duplication. This makes both handlers much easier to read and results in cleaner code flow. The exception handler now only needs to care about uncorrected errors, which also simplifies the handling of multiple errors. The corrected poller also now always runs in standard interrupt context and does not need to do anything special to handle NMI context. Minor behaviour changes: - MCG status is now not cleared on polling. - Only the banks which had corrected errors get cleared on polling - The exception handler only clears banks with errors now v2: Forward port to new patch order. Add "uc" argument. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19x86, mce: factor out duplicated struct mce setup into one functionAndi Kleen
Impact: cleanup This merely factors out duplicated code to set up the initial struct mce state into a single function. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19x86, mce: implement dynamic machine check banks supportAndi Kleen
Impact: cleanup; making code future proof; memory saving on small systems This patch replaces the hardcoded max number of machine check banks with dynamic allocation depending on what the CPU reports. The sysfs data structures and the banks array are dynamically allocated. There is still a hard bank limit (128) because the mcelog protocol uses banks >= 128 as pseudo banks to escape other events. But we expect that 128 banks is beyond any reasonable CPU for now. This supersedes an earlier patch by Venki, but it solves the problem more completely by making the limit fully dynamic (up to the 128 boundary). This saves some memory on machines with less than 6 banks because they won't need sysdevs for unused ones and also allows to use sysfs to control these banks on possible future CPUs with more than 6 banks. This is an updated patch addressing Venki's comments. I also added in another patch from Thomas which fixed the error allocation path (that patch was previously separated) Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19x86, mce: enable machine checks in 64-bit defconfigAndi Kleen
Impact: Low priority fix The 32-bit defconfig already had it enabled. And it's a pretty fundamental feature, so better enable it on 64 bits too. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown x86, mce: use force_sig_info to kill process in machine check x86, mce: reinitialize per cpu features on resume x86, rcu: fix strange load average and ksoftirqd behavior
2009-02-18mm: clean up for early_pfn_to_nid()KAMEZAWA Hiroyuki
What's happening is that the assertion in mm/page_alloc.c:move_freepages() is triggering: BUG_ON(page_zone(start_page) != page_zone(end_page)); Once I knew this is what was happening, I added some annotations: if (unlikely(page_zone(start_page) != page_zone(end_page))) { printk(KERN_ERR "move_freepages: Bogus zones: " "start_page[%p] end_page[%p] zone[%p]\n", start_page, end_page, zone); printk(KERN_ERR "move_freepages: " "start_zone[%p] end_zone[%p]\n", page_zone(start_page), page_zone(end_page)); printk(KERN_ERR "move_freepages: " "start_pfn[0x%lx] end_pfn[0x%lx]\n", page_to_pfn(start_page), page_to_pfn(end_page)); printk(KERN_ERR "move_freepages: " "start_nid[%d] end_nid[%d]\n", page_to_nid(start_page), page_to_nid(end_page)); ... And here's what I got: move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00] move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00] move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff] move_freepages: start_nid[1] end_nid[0] My memory layout on this box is: [ 0.000000] Zone PFN ranges: [ 0.000000] Normal 0x00000000 -> 0x0081ff5d [ 0.000000] Movable zone start PFN for each node [ 0.000000] early_node_map[8] active PFN ranges [ 0.000000] 0: 0x00000000 -> 0x00020000 [ 0.000000] 1: 0x00800000 -> 0x0081f7ff [ 0.000000] 1: 0x0081f800 -> 0x0081fe50 [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8 [ 0.000000] 1: 0x0081feda -> 0x0081fedb [ 0.000000] 1: 0x0081fedd -> 0x0081fee5 [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51 [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d So it's a block move in that 0x81f600-->0x81f7ff region which triggers the problem. This patch: Declaration of early_pfn_to_nid() is scattered over per-arch include files, and it seems it's complicated to know when the declaration is used. I think it makes fix-for-memmap-init not easy. This patch moves all declaration to include/linux/mm.h After this, if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use static definition in include/linux/mm.h else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use generic definition in mm/page_alloc.c else -> per-arch back end function will be called. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Miller <davem@davemlloft.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-17x86, mce: fix a race condition in mce_read()Huang Ying
Impact: bugfix Considering the situation as follow: before: mcelog.next == 1, mcelog.entry[0].finished = 1 +-------------------------------------------------------------------------- R W1 W2 W3 read mcelog.next (1) mcelog.next++ (2) (working on entry 1, finished == 0) mcelog.next = 0 mcelog.next++ (1) (working on entry 0) mcelog.next++ (2) (working on entry 1) <----------------- race ----------------> (done on entry 1, finished = 1) (done on entry 1, finished = 1) To fix the race condition, a cmpxchg loop is added to mce_read() to ensure no new MCE record can be added between mcelog.next reading and mcelog.next = 0. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: disable machine checks on offlined CPUsAndi Kleen
Impact: Lower priority bug fix Offlined CPUs could still get machine checks, but the machine check handler cannot handle them properly, leading to an unconditional crash. Disable machine checks on CPUs that are going down. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: don't set up mce sysdev devices with mce=offAndi Kleen
Impact: bug fix, in this case the resume handler shouldn't run which avoids incorrectly reenabling machine checks on resume When MCEs are completely disabled on the command line don't set up the sysdev devices for them either. Includes a comment fix from Thomas Gleixner. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: switch machine check polling to per CPU timerAndi Kleen
Impact: Higher priority bug fix The machine check poller runs a single timer and then broadcasted an IPI to all CPUs to check them. This leads to unnecessary synchronization between CPUs. The original CPU running the timer has to wait potentially a long time for all other CPUs answering. This is also real time unfriendly and in general inefficient. This was especially a problem on systems with a lot of events where the poller run with a higher frequency after processing some events. There could be more and more CPU time wasted with this, to the point of significantly slowing down machines. The machine check polling is actually fully independent per CPU, so there's no reason to not just do this all with per CPU timers. This patch implements that. Also switch the poller also to use standard timers instead of work queues. It was using work queues to be able to execute a user program on a event, but mce_notify_user() handles this case now with a separate callback. So instead always run the poll code in in a standard per CPU timer, which means that in the common case of not having to execute a trigger there will be less overhead. This allows to clean up the initialization significantly, because standard timers are already up when machine checks get init'ed. No multiple initialization functions. Thanks to Thomas Gleixner for some help. Cc: thockin@google.com v2: Use del_timer_sync() on cpu shutdown and don't try to handle migrated timers. v3: Add WARN_ON for timer running on unexpected CPU Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: always use separate work queue to run triggerAndi Kleen
Impact: Needed for bug fix in next patch This relaxes the requirement that mce_notify_user has to run in process context. Useful for future changes, but also leads to cleaner behaviour now. Now instead mce_notify_user can be called directly from interrupt (but not NMI) context. The work queue only uses a single global work struct, which can be done safely because it is always free to reuse before the trigger function is executed. This way no events can be lost. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: don't disable machine checks during code patchingAndi Kleen
Impact: low priority bug fix This removes part of a a patch I added myself some time ago. After some consideration the patch was a bad idea. In particular it stopped machine check exceptions during code patching. To quote the comment: * MCEs only happen when something got corrupted and in this * case we must do something about the corruption. * Ignoring it is worse than a unlikely patching race. * Also machine checks tend to be broadcast and if one CPU * goes into machine check the others follow quickly, so we don't * expect a machine check to cause undue problems during to code * patching. So undo the machine check related parts of 8f4e956b313dcccbc7be6f10808952345e3b638c NMIs are still disabled. This only removes code, the only additions are a new comment. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: disable machine checks on suspendAndi Kleen
Impact: Bug fix During suspend it is not reliable to process machine check exceptions, because CPUs disappear but can still get machine check broadcasts. Also the system is slightly more likely to machine check them, but the handler is typically not a position to handle them in a meaningfull way. So disable them during suspend and enable them during resume. Also make sure they are always disabled on hot-unplugged CPUs. This new code assumes that suspend always hotunplugs all non BP CPUs. v2: Remove the WARN_ONs Thomas objected to. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdownAndi Kleen
Impact: Bugfix The ifdef for the apic clear on shutdown for the 64bit intel thermal vector was incorrect and never triggered. Fix that. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: use force_sig_info to kill process in machine checkAndi Kleen
Impact: bug fix (with tolerant == 3) do_exit cannot be called directly from the exception handler because it can sleep and the exception handler runs on the exception stack. Use force_sig() instead. Based on a earlier patch by Ying Huang who debugged the problem. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17x86, mce: reinitialize per cpu features on resumeAndi Kleen
Impact: Bug fix This fixes a long standing bug in the machine check code. On resume the boot CPU wouldn't get its vendor specific state like thermal handling reinitialized. This means the boot cpu wouldn't ever get any thermal events reported again. Call the respective initialization functions on resume v2: Remove ancient init because they don't have a resume device anyways. Pointed out by Thomas Gleixner. v3: Now fix the Subject too to reflect v2 change Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: doc: mmiotrace.txt, buffer size control change trace: mmiotrace to the tracer menu in Kconfig mmiotrace: count events lost due to not recording
2009-02-17Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, vm86: fix preemption bug x86, olpc: fix model detection without OFW x86, hpet: fix for LS21 + HPET = boot hang x86: CPA avoid repeated lazy mmu flush x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem x86/cpa: make sure cpa is safe to call in lazy mmu mode x86, ptrace, mm: fix double-free on race
2009-02-17Merge branch 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
* 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Flush volatile msrs before emulating rdmsr KVM: Fix assigned devices circular locking dependency KVM: x86: fix LAPIC pending count calculation KVM: Fix INTx for device assignment KVM: MMU: Map device MMIO as UC in EPT KVM: x86: disable kvmclock on non constant TSC hosts KVM: PIT: fix i8254 pending count read KVM: Fix racy in kvm_free_assigned_irq KVM: Add kvm_arch_sync_events to sync with asynchronize events KVM: mmu_notifiers release method KVM: Avoid using CONFIG_ in userspace visible headers KVM: ia64: fix fp fault/trap handler
2009-02-17x86, rcu: fix strange load average and ksoftirqd behaviorPaul E. McKenney
Damien Wyart reported high ksoftirqd CPU usage (20%) on an otherwise idle system. The function-graph trace Damien provided: > 799.521187 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.521371 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.521555 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.521738 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.521934 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.522068 | 1) ksoftir-2324 | | rcu_check_callbacks() { > 799.522208 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.522392 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.522575 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.522759 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.522956 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.523074 | 1) ksoftir-2324 | | rcu_check_callbacks() { > 799.523214 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.523397 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.523579 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.523762 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.523960 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.524079 | 1) ksoftir-2324 | | rcu_check_callbacks() { > 799.524220 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.524403 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.524587 | 1) <idle>-0 | | rcu_check_callbacks() { > 799.524770 | 1) <idle>-0 | | rcu_check_callbacks() { > [ . . . ] Shows rcu_check_callbacks() being invoked way too often. It should be called once per jiffy, and here it is called no less than 22 times in about 3.5 milliseconds, meaning one call every 160 microseconds or so. Why do we need to call rcu_pending() and rcu_check_callbacks() from the idle loop of 32-bit x86, especially given that no other architecture does this? The following patch removes the call to rcu_pending() and rcu_check_callbacks() from the x86 32-bit idle loop in order to reduce the softirq load on idle systems. Reported-by: Damien Wyart <damien.wyart@free.fr> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-16cpumask: fix powernow-k8: partial revert of ↵Rusty Russell
2fdf66b491ac706657946442789ec644cc317e1a Impact: fix powernow-k8 when acpi=off (or other error). There was a spurious change introduced into powernow-k8 in this patch: so that we try to "restore" the cpus_allowed we never saved. We revert that file. See lkml "[PATCH] x86/powernow: fix cpus_allowed brokage when acpi=off" from Yinghai for the bug report. Cc: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu>
2009-02-15trace: mmiotrace to the tracer menu in KconfigPekka Paalanen
Impact: cosmetic change in Kconfig menu layout This patch was originally suggested by Peter Zijlstra, but seems it was forgotten. CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable directly under the Kernel hacking / debugging menu in the kernel configuration system. They were present only for x86 and x86_64. Other tracers that use the ftrace tracing framework are in their own sub-menu. This patch moves the mmiotrace configuration options there. Since the Kconfig file, where the tracer menu is, is not architecture specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by x86/x86_64. CONFIG_MMIOTRACE now depends on it. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15x86, vm86: fix preemption bugThomas Gleixner
Commit 3d2a71a596bd9c761c8487a2178e95f8a61da083 ("x86, traps: converge do_debug handlers") changed the preemption disable logic of do_debug() so vm86_handle_trap() is called with preemption disabled resulting in: BUG: sleeping function called from invalid context at include/linux/kernel.h:155 in_atomic(): 1, irqs_disabled(): 0, pid: 3005, name: dosemu.bin Pid: 3005, comm: dosemu.bin Tainted: G W 2.6.29-rc1 #51 Call Trace: [<c050d669>] copy_to_user+0x33/0x108 [<c04181f4>] save_v86_state+0x65/0x149 [<c0418531>] handle_vm86_trap+0x20/0x8f [<c064e345>] do_debug+0x15b/0x1a4 [<c064df1f>] debug_stack_correct+0x27/0x2c [<c040365b>] sysenter_do_call+0x12/0x2f BUG: scheduling while atomic: dosemu.bin/3005/0x10000001 Restore the original calling convention and reenable preemption before calling handle_vm86_trap(). Reported-by: Michal Suchanek <hramrach@centrum.cz> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15KVM: VMX: Flush volatile msrs before emulating rdmsrAvi Kivity
Some msrs (notable MSR_KERNEL_GS_BASE) are held in the processor registers and need to be flushed to the vcpu struture before they can be read. This fixes cygwin longjmp() failure on Windows x64. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15KVM: x86: fix LAPIC pending count calculationMarcelo Tosatti
Simplify LAPIC TMCCT calculation by using hrtimer provided function to query remaining time until expiration. Fixes host hang with nested ESX. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15KVM: MMU: Map device MMIO as UC in EPTSheng Yang
Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15KVM: x86: disable kvmclock on non constant TSC hostsMarcelo Tosatti
This is better. Currently, this code path is posing us big troubles, and we won't have a decent patch in time. So, temporarily disable it. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>