aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2007-07-18xen: event channelsJeremy Fitzhardinge
Xen implements interrupts in terms of event channels. Each guest domain gets 1024 event channels which can be used for a variety of purposes, such as Xen timer events, inter-domain events, inter-processor events (IPI) or for real hardware IRQs. Within the kernel, we map the event channels to IRQs, and implement the whole interrupt handling using a Xen irq_chip. Rather than setting NR_IRQ to 1024 under PARAVIRT in order to accomodate Xen, we create a dynamic mapping between event channels and IRQs. Ideally, Linux will eventually move towards dynamically allocating per-irq structures, and we can use a 1:1 mapping between event channels and irqs. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric W. Biederman <ebiederm@xmission.com>
2007-07-18xen: Core Xen implementationJeremy Fitzhardinge
This patch is a rollup of all the core pieces of the Xen implementation, including: - booting and setup - pagetable setup - privileged instructions - segmentation - interrupt flags - upcalls - multicall batching BOOTING AND SETUP The vmlinux image is decorated with ELF notes which tell the Xen domain builder what the kernel's requirements are; the domain builder then constructs the address space accordingly and starts the kernel. Xen has its own entrypoint for the kernel (contained in an ELF note). The ELF notes are set up by xen-head.S, which is included into head.S. In principle it could be linked separately, but it seems to provoke lots of binutils bugs. Because the domain builder starts the kernel in a fairly sane state (32-bit protected mode, paging enabled, flat segments set up), there's not a lot of setup needed before starting the kernel proper. The main steps are: 1. Install the Xen paravirt_ops, which is simply a matter of a structure assignment. 2. Set init_mm to use the Xen-supplied pagetables (analogous to the head.S generated pagetables in a native boot). 3. Reserve address space for Xen, since it takes a chunk at the top of the address space for its own use. 4. Call start_kernel() PAGETABLE SETUP Once we hit the main kernel boot sequence, it will end up calling back via paravirt_ops to set up various pieces of Xen specific state. One of the critical things which requires a bit of extra care is the construction of the initial init_mm pagetable. Because Xen places tight constraints on pagetables (an active pagetable must always be valid, and must always be mapped read-only to the guest domain), we need to be careful when constructing the new pagetable to keep these constraints in mind. It turns out that the easiest way to do this is use the initial Xen-provided pagetable as a template, and then just insert new mappings for memory where a mapping doesn't already exist. This means that during pagetable setup, it uses a special version of xen_set_pte which ignores any attempt to remap a read-only page as read-write (since Xen will map its own initial pagetable as RO), but lets other changes to the ptes happen, so that things like NX are set properly. PRIVILEGED INSTRUCTIONS AND SEGMENTATION When the kernel runs under Xen, it runs in ring 1 rather than ring 0. This means that it is more privileged than user-mode in ring 3, but it still can't run privileged instructions directly. Non-performance critical instructions are dealt with by taking a privilege exception and trapping into the hypervisor and emulating the instruction, but more performance-critical instructions have their own specific paravirt_ops. In many cases we can avoid having to do any hypercalls for these instructions, or the Xen implementation is quite different from the normal native version. The privileged instructions fall into the broad classes of: Segmentation: setting up the GDT and the GDT entries, LDT, TLS and so on. Xen doesn't allow the GDT to be directly modified; all GDT updates are done via hypercalls where the new entries can be validated. This is important because Xen uses segment limits to prevent the guest kernel from damaging the hypervisor itself. Traps and exceptions: Xen uses a special format for trap entrypoints, so when the kernel wants to set an IDT entry, it needs to be converted to the form Xen expects. Xen sets int 0x80 up specially so that the trap goes straight from userspace into the guest kernel without going via the hypervisor. sysenter isn't supported. Kernel stack: The esp0 entry is extracted from the tss and provided to Xen. TLB operations: the various TLB calls are mapped into corresponding Xen hypercalls. Control registers: all the control registers are privileged. The most important is cr3, which points to the base of the current pagetable, and we handle it specially. Another instruction we treat specially is CPUID, even though its not privileged. We want to control what CPU features are visible to the rest of the kernel, and so CPUID ends up going into a paravirt_op. Xen implements this mainly to disable the ACPI and APIC subsystems. INTERRUPT FLAGS Xen maintains its own separate flag for masking events, which is contained within the per-cpu vcpu_info structure. Because the guest kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely ignored (and must be, because even if a guest domain disables interrupts for itself, it can't disable them overall). (A note on terminology: "events" and interrupts are effectively synonymous. However, rather than using an "enable flag", Xen uses a "mask flag", which blocks event delivery when it is non-zero.) There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which are implemented to manage the Xen event mask state. The only thing worth noting is that when events are unmasked, we need to explicitly see if there's a pending event and call into the hypervisor to make sure it gets delivered. UPCALLS Xen needs a couple of upcall (or callback) functions to be implemented by each guest. One is the event upcalls, which is how events (interrupts, effectively) are delivered to the guests. The other is the failsafe callback, which is used to report errors in either reloading a segment register, or caused by iret. These are implemented in i386/kernel/entry.S so they can jump into the normal iret_exc path when necessary. MULTICALL BATCHING Xen provides a multicall mechanism, which allows multiple hypercalls to be issued at once in order to mitigate the cost of trapping into the hypervisor. This is particularly useful for context switches, since the 4-5 hypercalls they would normally need (reload cr3, update TLS, maybe update LDT) can be reduced to one. This patch implements a generic batching mechanism for hypercalls, which gets used in many places in the Xen code. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Ian Pratt <ian.pratt@xensource.com> Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Cc: Adrian Bunk <bunk@stusta.de>
2007-07-18xen: Add Xen interface header filesJeremy Fitzhardinge
Add Xen interface header files. These are taken fairly directly from the Xen tree, but somewhat rearranged to suit the kernel's conventions. Define macros and inline functions for doing hypercalls into the hypervisor. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18Add a sched_clock paravirt_opJeremy Fitzhardinge
The tsc-based get_scheduled_cycles interface is not a good match for Xen's runstate accounting, which reports everything in nanoseconds. This patch replaces this interface with a sched_clock interface, which matches both Xen and VMI's requirements. In order to do this, we: 1. replace get_scheduled_cycles with sched_clock 2. hoist cycles_2_ns into a common header 3. update vmi accordingly One thing to note: because sched_clock is implemented as a weak function in kernel/sched.c, we must define a real function in order to override this weak binding. This means the usual paravirt_ops technique of using an inline function won't work in this case. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Zachary Amsden <zach@vmware.com> Cc: Dan Hecht <dhecht@vmware.com> Cc: john stultz <johnstul@us.ibm.com>
2007-07-18paravirt: helper to disable all IO spaceJeremy Fitzhardinge
In a virtual environment, device drivers such as legacy IDE will waste quite a lot of time probing for their devices which will never appear. This helper function allows a paravirt implementation to lay claim to the whole iomem and ioport space, thereby disabling all device drivers trying to claim IO resources. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
2007-07-18Allocate and free vmalloc areasJeremy Fitzhardinge
Allocate/release a chunk of vmalloc address space: alloc_vm_area reserves a chunk of address space, and makes sure all the pagetables are constructed for that address range - but no pages. free_vm_area releases the address space range. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: "Jan Beulich" <JBeulich@novell.com> Cc: "Andi Kleen" <ak@muc.de>
2007-07-18paravirt: make siblingmap functions visibleJeremy Fitzhardinge
Paravirt implementations need to set the sibling map on new cpus. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-07-18paravirt: unstatic smp_store_cpu_infoJeremy Fitzhardinge
Paravirt implementations need to store cpu info when bringing up cpus. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-07-18paravirt: unstatic leave_mmJeremy Fitzhardinge
Make globally leave_mm visible, specifically so that Xen can use it to shoot-down lazy uses of cr3. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18paravirt: increase IRQ limitJeremy Fitzhardinge
When running with CONFIG_PARAVIRT, we may want lots of IRQs even if there's no IO APIC. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
2007-07-18paravirt: add a hook for once the allocator is readyJeremy Fitzhardinge
Add a hook so that the paravirt backend knows when the allocator is ready. This is useful for the obvious reason that the allocator is available, but the other side-effect of having the bootmem allocator available is that each page now has an associated "struct page". Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-07-18paravirt: add an "mm" argument to alloc_ptJeremy Fitzhardinge
It's useful to know which mm is allocating a pagetable. Xen uses this to determine whether the pagetable being added to is pinned or not. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-07-18use elfnote.h to generate vsyscall notes.Jeremy Fitzhardinge
Use existing elfnote.h to generate vsyscall notes, rather than doing it locally. Changes elfnote.h a bit to suit, since this is the first asm user, and it wasn't quite right. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.com>
2007-07-18usermodehelper: Tidy up waitingJeremy Fitzhardinge
Rather than using a tri-state integer for the wait flag in call_usermodehelper_exec, define a proper enum, and use that. I've preserved the integer values so that any callers I've missed should still work OK. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com>
2007-07-18Add common orderly_poweroff()Jeremy Fitzhardinge
Various pieces of code around the kernel want to be able to trigger an orderly poweroff. This pulls them together into a single implementation. By default the poweroff command is /sbin/poweroff, but it can be set via sysctl: kernel/poweroff_cmd. This is split at whitespace, so it can include command-line arguments. This patch replaces four other instances of invoking either "poweroff" or "shutdown -h now": two sbus drivers, and acpi thermal management. sparc64 has its own "powerd"; still need to determine whether it should be replaced by orderly_poweroff(). Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Acked-by: Len Brown <lenb@kernel.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <ak@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David S. Miller <davem@davemloft.net>
2007-07-18usermodehelper: split setup from executionJeremy Fitzhardinge
Rather than having hundreds of variations of call_usermodehelper for various pieces of usermode state which could be set up, split the info allocation and initialization from the actual process execution. This means the general pattern becomes: info = call_usermodehelper_setup(path, argv, envp); /* basic state */ call_usermodehelper_<SET EXTRA STATE>(info, stuff...); /* extra state */ call_usermodehelper_exec(info, wait); /* run process and free info */ This patch introduces wrappers for all the existing calling styles for call_usermodehelper_*, but folds their implementations into one. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: David Howells <dhowells@redhat.com> Cc: Bj?rn Steinbrink <B.Steinbrink@gmx.de> Cc: Randy Dunlap <randy.dunlap@oracle.com>
2007-07-18add argv_split()Jeremy Fitzhardinge
argv_split() is a helper function which takes a string, splits it at whitespace, and returns a NULL-terminated argv vector. This is deliberately simple - it does no quote processing of any kind. [ Seems to me that this is something which is already being done in the kernel, but I couldn't find any other implementations, either to steal or replace. Keep an eye out. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com>
2007-07-18add kstrndupJeremy Fitzhardinge
Add a kstrndup function, modelled on strndup. Like strndup this returns a string copied into its own allocated memory, but it copies no more than the specified number of bytes from the source. Remove private strndup() from irda code. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@mandriva.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Panagiotis Issaris <takis@issaris.org> Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
2007-07-18zs: move to the serial subsystemMaciej W. Rozycki
This is a reimplementation of the zs driver for the serial subsystem. Any resemblance to the old driver is purely coincidential. ;-) I do hope I got the handling of modem lines right -- better do not tackle me about the issue unless you feel too good... Any users of the old driver: please note the numbers of the serial lines have now been swapped, i.e. ttyS0 <-> ttyS1 and ttyS2 <-> ttyS3. It has to do with the modem lines mentioned above; basically the port A in a given chip has to be initialised before the port B if you want to use the latter as the serial console (which is usually the case), as operations on modem lines of the serial line associated with the port B access both ports (see the comment at the top of the driver for the details of wiring used). Please update your scripts. This is also the reason each SCC now requests an IRQ once only (as seen in "/proc/interrupts") -- the handler takes care of both ports at once as the line associated with the port B has to take status update interrupts from both ports (and yet the line of the port A takes its own for itself too). The old driver never got it right... Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18serial: add early_serial_setup() back to header fileYinghai Lu
early_serial_setup was removed from serial.h, but forgot to put in serial_8250.h Signed-off-by: Yinghai Lu <yinghai.lu@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18UBI: kill homegrown endian macrosChristoph Hellwig
Kill UBI's homegrown endianess handling and replace it with the standard kernel endianess handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2007-07-18ext4: Remove 65000 subdirectory limitAndreas Dilger
This patch adds support to ext4 for allowing more than 65000 subdirectories. Currently the maximum number of subdirectories is capped at 32000. If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. A later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory with >65000 subdirs. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields Kalpak Shah
We need to make sure that existing ext3 filesystems can also avail the new fields that have been added to the ext4 inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand the inode by max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question about whether users should be able to set s_*_extra_isize smaller than the known fields or not. This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if its fails we try to expand by s_min_extra_isize bytes. This is done by changing the i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs then 1 or more EAs are shifted to the external EA block. In the worst case when even the external EA block does not have enough space we inform the user that some EA would need to be deleted or s_min_extra_isize would have to be reduced. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18ext4: Add nanosecond timestampsKalpak Shah
This patch adds nanosecond timestamps for ext4. This involves adding *time_extra fields to the ext4_inode to extend the timestamps to 64-bits. Creation time is also added by this patch. These extended fields will fit into an inode if the filesystem was formatted with large inodes (-I 256 or larger) and there are currently no EAs consuming all of the available space. For new inodes we always reserve enough space for the kernel's known extended fields, but for inodes created with an old kernel this might not have been the case. So this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature flag(ro-compat so that older kernels can't create inodes with a smaller extra_isize). which indicates if the fields fitting inside s_min_extra_isize are available or not. If the expansion of inodes if unsuccessful then this feature will be disabled. This feature is only enabled if requested by the sysadmin. None of the extended inode fields is critical for correct filesystem operation. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18jbd2: Move jbd2-debug file to debugfsJose R. Santos
The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but it incorrectly used create_proc_entry() instead of the sysctl routines, and no proc entry was ever created. Instead of fixing this we might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd2/jbd2-debug. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUGJose R. Santos
When the JBD code was forked to create the new JBD2 code base, the references to CONFIG_JBD_DEBUG where never changed to CONFIG_JBD2_DEBUG. This patch fixes that. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18ext4: copy i_flags to inode flags on writeJan Kara
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext4-specific i_flags. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem. (This is a forward port patch from ext3) Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18Change on-disk format to support 2^15 uninitialized extentsAmit Arora
This change was suggested by Andreas Dilger. This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. Signed-off-by: Amit Arora <aarora@in.ibm.com>
2007-07-18[NETFILTER]: ipt_iprange.h must #include <linux/types.h>Adrian Bunk
ipt_iprange.h must #include <linux/types.h> since it uses __be32. This patch fixes kernel Bugzilla #7604. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[NET]: move dev_mc_discard from dev_mcast.c to dev.cDenis Cheng
Because this function is only called by unregister_netdevice, this moving could make this non-global function static, and also remove its declaration in netdevice.h; Any further, function __dev_addr_discard is also just called by dev_mc_discard and dev_unicast_discard, keeping this two functions both in one c file could make __dev_addr_discard also static and remove its declaration in netdevice.h; Futhermore, the sequential call to dev_unicast_discard and then dev_mc_discard in unregister_netdevice have a similar mechanism that: (netif_tx_lock_bh / __dev_addr_discard / netif_tx_unlock_bh), they should merged into one to eliminate duplicates in acquiring and releasing the dev->_xmit_lock, this would be done in my following patch. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[XFRM]: Fix crash introduced by struct dst_entry reorderingPatrick McHardy
XFRM expects xfrm_dst->u.next to be same pointer as dst->next, which was broken by the dst_entry reordering in commit 1e19e02c~, causing an oops in xfrm_bundle_ok when walking the bundle upwards. Kill xfrm_dst->u.next and change the only user to use dst->next instead. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[TCP]: remove unused argument to cong_avoid opStephen Hemminger
None of the existing TCP congestion controls use the rtt value pased in the ca_ops->cong_avoid interface. Which is lucky because seq_rtt could have been -1 when handling a duplicate ack. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[SPARC]: Mark sparc and sparc64 as not having virt_to_busStephen Rothwell
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[SPARC64]: Massively simplify VIO device layer and support hot add/remove.David S. Miller
Create and destroy VIO devices in response to MD update events. These run synchronously inside of the MD update mutex so the VIO layer doesn't need to do internal locking of any sort. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[SPARC64]: Add basic infrastructure for MD add/remove notification.David S. Miller
And add dummy handlers for the VIO device layer. These will be filled in with real code after the vdc, vnet, and ds drivers are reworked to have simpler dependencies on the VIO device tree. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-17write support for preallocated blocksAmit Arora
This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Signed-off-by: Amit Arora <aarora@in.ibm.com>
2007-07-17fallocate support in ext4Amit Arora
This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. Signed-off-by: Amit Arora <aarora@in.ibm.com>
2007-07-17sys_fallocate() implementation on i386, x86_64 and powerpcAmit Arora
fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. ToDos: 1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take. 2. Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: Amit Arora <aarora@in.ibm.com>
2007-07-17slob: Kill off duplicate kzalloc() definition.Paul Mundt
With the slab zeroing allocations cleanups Christoph stubbed in a generic kzalloc(), which was missed on SLOB. Follow the SLAB/SLUB changes and kill off the __kzalloc() wrapper that SLOB was using. Reported-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Merge branch 'bsg' of git://git.kernel.dk/data/git/linux-2.6-blockLinus Torvalds
* 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block: bsg: fix missing space in version print Don't define empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG bsg: Kconfig updates bsg: minor cleanup bsg: device hash table cleanup bsg: fix initialization error handling bugs bsg: mark FUJITA Tomonori as bsg maintainer bsg: convert to dynamic major bsg: address various review comments
2007-07-17smp_call_function_single() should be a macro on UPAl Viro
... or we end up with header include order problems from hell. E.g. on m68k this is 100% fatal - local_irq_enable() there wants preempt_count(), which wants task_struct fields, which we won't have when we are in smp.h pulled from sched.h. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid checkSatyam Sharma
Introduce is_owner_or_cap() macro in fs.h, and convert over relevant users to it. This is done because we want to avoid bugs in the future where we check for only effective fsuid of the current task against a file's owning uid, without simultaneously checking for CAP_FOWNER as well, thus violating its semantics. [ XFS uses special macros and structures, and in general looked ... untouchable, so we leave it alone -- but it has been looked over. ] The (current->fsuid != inode->i_uid) check in generic_permission() and exec_permission_lite() is left alone, because those operations are covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits) KVM: Use CPU_DYING for disabling virtualization KVM: Tune hotplug/suspend IPIs KVM: Keep track of which cpus have virtualization enabled SMP: Allow smp_call_function_single() to current cpu i386: Allow smp_call_function_single() to current cpu x86_64: Allow smp_call_function_single() to current cpu HOTPLUG: Adapt thermal throttle to CPU_DYING HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING HOTPLUG: Add CPU_DYING notifier KVM: Clean up #includes KVM: Remove kvmfs in favor of the anonymous inodes source KVM: SVM: Reliably detect if SVM was disabled by BIOS KVM: VMX: Remove unnecessary code in vmx_tlb_flush() KVM: MMU: Fix Wrong tlb flush order KVM: VMX: Reinitialize the real-mode tss when entering real mode KVM: Avoid useless memory write when possible KVM: Fix x86 emulator writeback KVM: Add support for in-kernel pio handlers KVM: VMX: Fix interrupt checking on lightweight exit KVM: Adds support for in-kernel mmio handlers ...
2007-07-17Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] Clean away some code inside some non-existent CONFIG ifdefs [IA64] ar.itc access must really be after xtime_lock.sequence has been read [IA64] correctly count CPU objects in the ia64/sn hwperf interface [IA64] arbitary speed tty ioctl support [IA64] use machvec=dig on hpzx1 platforms
2007-07-17saner typechecking in generic unaligned.hAl Viro
Verify that types would match for assignment (under sizeof, so we are safe from side effects or any code actually getting generated), then explicitly cast everywhere to the fixed-sized types. Kills a bunch of bogus warnings about constants being truncated (gcc, sparse), finds a pile of endianness problems hidden by old noise (sparse). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17alpha termios.h hadn't been updatedAl Viro
... fortunately, termios and ktermios there are identical, so no run-time breakage happened. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17md: change bitmap_unplug and others to void functionsNeilBrown
bitmap_unplug only ever returns 0, so it may as well be void. Two callers try to print a message if it returns non-zero, but that message is already printed by bitmap_file_kick. write_page returns an error which is not consistently checked. It always causes BITMAP_WRITE_ERROR to be set on an error, and that can more conveniently be checked. When the return of write_page is checked, an error causes bitmap_file_kick to be called - so move that call into write_page - and protect against recursive calls into bitmap_file_kick. bitmap_update_sb returns an error that is never checked. So make these 'void' and be consistent about checking the bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17md: improve the is_mddev_idle test fixNeilBrown
Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing we want to do with them is a signed comparison, and fix up the comment which had become quite wrong. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17OMAP: add TI OMAP1610 accelerator entry.Imre Deak
Signed-off-by: Trilok Soni <soni.trilok@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17fbdev: Add fb_append_extra_logo()Geert Uytterhoeven
Add fb_append_extra_logo(), to append extra lines of logos below the standard Linux logo. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-By: James Simmons <jsimmons@infradead.org> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>