aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2008-05-08semaphore: fixIngo Molnar
Yanmin Zhang reported: | Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th | regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, | and Itanium Montecito. Bisect located the patch below: | | 64ac24e738823161693bf791f87adc802cf529ff is first bad commit | commit 64ac24e738823161693bf791f87adc802cf529ff | Author: Matthew Wilcox <matthew@wil.cx> | Date: Fri Mar 7 21:55:58 2008 -0500 | | Generic semaphore implementation | | After I manually reverted the patch against 2.6.26-rc1 while fixing | lots of conflicts/errors, aim7 regression became less than 2%. i reproduced the AIM7 workload and can confirm Yanmin's findings that -.26-rc1 regresses over .25 - by over 67% here. Looking at the workload i found and fixed what i believe to be the real bug causing the AIM7 regression: it was inefficient wakeup / scheduling / locking behavior of the new generic semaphore code, causing suboptimal performance. The problem comes from the following code. The new semaphore code does this on down(): spin_lock_irqsave(&sem->lock, flags); if (likely(sem->count > 0)) sem->count--; else __down(sem); spin_unlock_irqrestore(&sem->lock, flags); and this on up(): spin_lock_irqsave(&sem->lock, flags); if (likely(list_empty(&sem->wait_list))) sem->count++; else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); where __up() does: list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); and where __down() does this in essence: list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; waiter.up = 0; for (;;) { [...] spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); if (waiter.up) return 0; } the fastpath looks good and obvious, but note the following property of the contended path: if there's a task on the ->wait_list, the up() of the current owner will "pass over" ownership to that waiting task, in a wake-one manner, via the waiter->up flag and by removing the waiter from the wait list. That is all and fine in principle, but as implemented in kernel/semaphore.c it also creates a nasty, hidden source of contention! The contention comes from the following property of the new semaphore code: the new owner owns the semaphore exclusively, even if it is not running yet. So if the old owner, even if just a few instructions later, does a down() [lock_kernel()] again, it will be blocked and will have to wait on the new owner to eventually be scheduled (possibly on another CPU)! Or if another task gets to lock_kernel() sooner than the "new owner" scheduled, it will be blocked unnecessarily and for a very long time when there are 2000 tasks running. I.e. the implementation of the new semaphores code does wake-one and lock ownership in a very restrictive way - it does not allow opportunistic re-locking of the lock at all and keeps the scheduler from picking task order intelligently. This kind of scheduling, with 2000 AIM7 processes running, creates awful cross-scheduling between those 2000 tasks, causes reduced parallelism, a throttled runqueue length and a lot of idle time. With increasing number of CPUs it causes an exponentially worse behavior in AIM7, as the chance for a newly woken new-owner task to actually run anytime soon is less and less likely. Note that it takes just a tiny bit of contention for the 'new-semaphore catastrophy' to happen: the wakeup latencies get added to whatever small contention there is, and quickly snowball out of control! I believe Yanmin's findings and numbers support this analysis too. The best fix for this problem is to use the same scheduling logic that the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and wanted because we do not want to over-schedule), but also allow opportunistic locking of the lock even if a wakee is already "in flight". The patch below implements this new logic. With this patch applied the AIM7 regression is largely fixed on my quad testbox: # v2.6.25 vanilla: .................. Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 56096.4 91 207.5 789.7 0.4675 2000 55894.4 94 208.2 792.7 0.4658 # v2.6.26-rc1-166-gc0a1811 vanilla: ................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 33230.6 83 350.3 784.5 0.2769 2000 31778.1 86 366.3 783.6 0.2648 # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup: ............................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55707.1 92 209.0 795.6 0.4642 2000 55704.4 96 209.0 796.0 0.4642 i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 performance levels and have zero idle time during the test, as expected. Btw., interactivity also improved dramatically with the fix - for example console-switching became almost instantaneous during this workload (which after all is running 2000 tasks at once!), without the patch it was stuck for a minute at times. There's another nice side-effect of this speedup patch, the new generic semaphore code got even smaller: text data bss dec hex filename 1241 0 0 1241 4d9 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after (because the waiter.up complication got removed.) Longer-term we should look into using the mutex code for the generic semaphore code as well - but i's not easy due to legacies and it's outside of the scope of v2.6.26 and outside the scope of this patch as well. Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08x86: cleanup PAT cpu validationThomas Gleixner
Move the scattered checks for PAT support to a single function. Its moved to addon_cpuid_features.c as this file is shared between 32 and 64 bit. Remove the manipulation of the PAT feature bit and just disable PAT in the PAT layer, based on the PAT bit provided by the CPU and the current CPU version/model white list. Change the boot CPU check so it works on Voyager somewhere in the future as well :) Also panic, when a secondary has PAT disabled but the primary one has alrady switched to PAT. We have no way to undo that. The white list is kept for now to ensure that we can rely on known to work CPU types and concentrate on the software induced problems instead of fighthing CPU erratas and subtle wreckage caused by not yet verified CPUs. Once the PAT code has stabilized enough, we can remove the white list and open the can of worms. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08x86: geode: define geode_has_vsa2() even if CONFIG_MGEODE_LX is not setAndres Salomon
We want drivers to be able to use geode_has_vsa2 without having to worry about what model geode is being compiled for. This patch ensures that geode_has_vsa2 is always defined. Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan.crouse@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08x86: GEODE: cache results from geode_has_vsa2() and uninlineAndres Salomon
This moves geode_has_vsa2 into a .c file, caches the result we get from the VSA virtual registers, and causes the function to no longer be inline. [akpm@linux-foundation.org: cleanup] Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan.crouse@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08x86: revert geode config dependencyThomas Gleixner
commit e26a28d190304d910ee49b81cbfe6d9241f56e86 x86: olpc build fix was a fix to a patch that was withdrawn/delayed and then erroneously commited to x86.git. Revert it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-08Revert "relay: fix splice problem"Jens Axboe
This reverts commit c3270e577c18b3d0e984c3371493205a4807db9d.
2008-05-08[ALSA] soc at91 minor bug fixesPatrik Sevallius
Found these two bugs while browsing through the code. The first one is a cut-n-paste bug, instead of disabling the clock when request_irq() fails, it enabled it once more. The second one fixes a debug printout, AT91_SSC_IER is write only, AT91_SSC_IMR is readable (the printed string actually says imr). Frank Mandarino was busy so he asked me to send these to this list. /Patrik Signed-off-by: Patrik Sevallius <patrik.sevallius@enea.com> Acked-by: Frank Mandarino <fmandarino@endrelia.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2008-05-08[ALSA] soc - at91-pcm - Fix line wrappingMark Brown
There's more checkpatch stuff to fix in the driver, this just fixes the minimum required for the following patch to be clean. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2008-05-08[ARM] lubbock: fix compilationAlexey Dobriyan
arch/arm/mach-pxa/lubbock.c:399: error: expected '}' before ';' token Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2008-05-08net: Added ASSERT_RTNL() to dev_open() and dev_close().Ben Hutchings
dev_open() and dev_close() must be called holding the RTNL, since they call device functions and netdevice notifiers that are promised the RTNL. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08can: Fix can_send() handling on dev_queue_xmit() failuresOliver Hartkopp
The tx packet counting and the local loopback of CAN frames should only happen in the case that the CAN frame has been enqueued to the netdevice tx queue successfully. Thanks to Andre Naujoks <nautsch@gmail.com> for reporting this issue. Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: Urs Thuermann <urs@isnogud.escape.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08Merge branch 'upstream-davem' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
2008-05-08netns: Fix arbitrary net_device-s corruptions on net_ns stop.Pavel Emelyanov
When a net namespace is destroyed, some devices (those, not killed on ns stop explicitly) are moved back to init_net. The problem, is that this net_ns change has one point of failure - the __dev_alloc_name() may be called if a name collision occurs (and this is easy to trigger). This allocator performs a likely-to-fail GFP_ATOMIC allocation to find a suitable number. Other possible conditions that may cause error (for device being ns local or not registered) are always false in this case. So, when this call fails, the device is unregistered. But this is *not* the right thing to do, since after this the device may be released (and kfree-ed) improperly. E. g. bridges require more actions (sysfs update, timer disarming, etc.), some other devices want to remove their private areas from lists, etc. I. e. arbitrary use-after-free cases may occur. The proposed fix is the following: since the only reason for the dev_change_net_namespace to fail is the name generation, we may give it a unique fall-back name w/o %d-s in it - the dev<ifindex> one, since ifindexes are still unique. So make this change, raise the failure-case printk loglevel to EMERG and replace the unregister_netdevice call with BUG(). [ Use snprintf() -DaveM ] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08netfilter: Kconfig: default DCCP/SCTP conntrack support to the protocol ↵Patrick McHardy
config values When conntrack and DCCP/SCTP protocols are enabled, chances are good that people also want DCCP/SCTP conntrack and NAT support. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08netfilter: nf_conntrack_sip: restrict RTP expect flushing on error to last ↵Patrick McHardy
request Some Inovaphone PBXs exhibit very stange behaviour: when dialing for example "123", the device sends INVITE requests for "1", "12" and "123" back to back. The first requests will elicit error responses from the receiver, causing the SIP helper to flush the RTP expectations even though we might still see a positive response. Note the sequence number of the last INVITE request that contained a media description and only flush the expectations when receiving a negative response for that sequence number. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08macvlan: Fix memleak on device removal/crash on module removalPatrick McHardy
As noticed by Ben Greear, macvlan crashes the kernel when unloading the module. The reason is that it tries to clean up the macvlan_port pointer on the macvlan device itself instead of the underlying device. A non-NULL pointer is taken as indication that the macvlan_handle_frame_hook is valid, when receiving the next packet on the underlying device it tries to call the NULL hook and crashes. Clean up the macvlan_port on the correct device to fix this. Signed-off-by; Patrick McHardy <kaber@trash.net> Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08net/ipv4: correct RFC 1122 section reference in commentJ.H.M. Dassen (Ray)
RFC 1122 does not have a section 3.1.2.2. The requirement to silently discard datagrams with a bad checksum is in section 3.2.1.2 instead. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611 Signed-off-by: J.H.M. Dassen (Ray) <jdassen@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08tcp FRTO: SACK variant is errorneously used with NewRenoIlpo Järvinen
Note: there's actually another bug in FRTO's SACK variant, which is the causing failure in NewReno case because of the error that's fixed here. I'll fix the SACK case separately (it's a separate bug really, though related, but in order to fix that I need to audit tp->snd_nxt usage a bit). There were two places where SACK variant of FRTO is getting incorrectly used even if SACK wasn't negotiated by the TCP flow. This leads to incorrect setting of frto_highmark with NewReno if a previous recovery was interrupted by another RTO. An eventual fallback to conventional recovery then incorrectly considers one or couple of segments as forward transmissions though they weren't, which then are not LOST marked during fallback making them "non-retransmittable" until the next RTO. In a bad case, those segments are really lost and are the only one left in the window. Thus TCP needs another RTO to continue. The next FRTO, however, could again repeat the same events making the progress of the TCP flow extremely slow. In order for these events to occur at all, FRTO must occur again in FRTOs step 3 while the key segments must be lost as well, which is not too likely in practice. It seems to most frequently with some small devices such as network printers that *seem* to accept TCP segments only in-order. In cases were key segments weren't lost, things get automatically resolved because those wrongly marked segments don't need to be retransmitted in order to continue. I found a reproducer after digging up relevant reports (few reports in total, none at netdev or lkml I know of), some cases seemed to indicate middlebox issues which seems now to be a false assumption some people had made. Bugzilla #10063 _might_ be related. Damon L. Chesser <damon@damtek.com> had a reproducable case and was kind enough to tcpdump it for me. With the tcpdump log it was quite trivial to figure out. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08[POWERPC] spufs: lockdep annotations for spufs_dir_closeChristoph Hellwig
We need to acquire the parent i_mutex with I_MUTEX_PARENT to keep lockdep happy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
2008-05-08[POWERPC] spufs: don't requeue victim contex in find_victim if it's not in ↵Christoph Hellwig
spu_run We should not requeue the victim context in find_victim if the owner is not in spu_run. It's first not needed because leaving the context on the spu is an optimization and second is harmful because it means the owner could re-enter spu_run when the context is on the runqueue and trip the BUG_ON in __spu_update_sched_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
2008-05-08sh: Stub in cpu_to_node() and friends for NUMA build.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: intc register modify fixMagnus Damm
Make sure register modifications stay atomic. Fixes processors with shared priority register masking. Dual bitmap masking is unaffected. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: no high level trigger on some sh3 cpusMagnus Damm
The processor models sh7706, sh7707 and sh7709 don't support high level trigger sense configuration. And the intc code looks like crap these days so what's the difference. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: clean up sh7710 and sh7720 intc tablesMagnus Damm
Clean up the intc tables by removing unneeded #ifdefs. The vector list is what selects which interrupt sources that should be added, having unsupported bitfields listed is ok as long as the vector is excluded from the list. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: add interrupt ack code to sh3Magnus Damm
This patch adds interrupt acknowledge code for external interrupt sources on sh3 processors. Only really required for edge triggered interrupts, but we ack regardless of sense configuration. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: unify external irq pin code for sh3Magnus Damm
This patch unifies the sh3 external irq pin code. It buys us some savings with reduced code redundancy, but the main feature with this change is irq sense selection support for all sh3 processors. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh-sci: avoid writing to nonexistent registersMagnus Damm
Only write to hardware in SCI_OUT() if the register size is valid. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh-sci: sh7722 lacks scsptr registersMagnus Damm
The sh7722 serial ports all lack SCSPTR registers, so mark them as nonexistent in the register table. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh-sci: improve sh7722 supportMagnus Damm
Improve sh7722 support for SCIF1 and SCIF2 and separate code from sh7366 implementation. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: reset hardware from early printkMagnus Damm
Reset the transmitter and receiver when setting up early printk. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: drain and wait for early printkMagnus Damm
Drain by waiting for all characters to be sent, and make sure to wait a little bit after setting up the baud rate. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: use sci_out() for early printkMagnus Damm
Use sci_out() instead of ctrl_outw() for early printk setup code. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: add memory resources to /proc/iomemMagnus Damm
Add physical memory resources such as System RAM, Kernel code/data/bss and reserved crash dump area to /proc/iomem. Same strategy as on x86. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: add kernel bss resourceMagnus Damm
Do like everyone else and have a struct resource for kernel bss. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: fix sh7705 interrupt vector typoMagnus Damm
Fix sh7705 interrupt sources for vectors 0xc80 and 0xca0. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: update smc91x platform data for se7722Magnus Damm
Select smc91x bus width using platform data for se7722 now when the smc91x header file is in place. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: update smc91x platform data for MigoRMagnus Damm
Select smc91x bus width and irg flags using platform data for MigoR now when the smc91x header file is in place. Signed-off-by: Magnus Damm <damm@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: remove -traditional.Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Sam Ravnborg <sam@ravnborg.org> CC: linux-sh@vger.kernel.org Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08rtc: rtc-sh: Fixup for 64-bit resources.Paul Mundt
ioremap() and friends get the size information right, so force everything to go through there. Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: r7780rp: Kill off unneded ifdefs for irq setup.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: rts7751r2d: Kill off unneeded ifdefs.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: intc_sh5 depends on cayman board for IRQ priority table.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08input: i8042: sh64 IRQ definitions depend on cayman board.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: Enable use of the clk fwk on SH-5.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh64: export onchip_remap/unmap() too.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh64: Some symbol exports to make the allmodconfig happier.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08mtd: solutionengine flash map depends on solution engine mach group.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: remove the broken SH_MPC1211 supportAdrian Bunk
SH_MPC1211 has been marked as BROKEN for some time. Unless someone is working on reviving it now, I'd therefore suggest this patch to remove it. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh64: Setup I/D-TLB defaults in SH-5 probe path.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-05-08sh: kexec and kdump depend on SUPERH32.Paul Mundt
Signed-off-by: Paul Mundt <lethal@linux-sh.org>