aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2005-07-05Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
2005-07-05[IPV4]: Add LC-Trie implementation notesRobert Olsson
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Never TSO defer under periods of congestion.David S. Miller
Congestion window recover after loss depends upon the fact that if we have a full MSS sized frame at the head of the send queue, we will send it. TSO deferral can defeat the ACK clocking necessary to exit cleanly from recovery. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[PKT_SCHED]: Blackhole queueing disciplineThomas Graf
Useful in combination with classful qdiscs to drop or temporary disable certain flows, e.g. one could block specific ds flows with dsmark. Unlike the noop qdisc it can be controlled by the user and statistic accounting is done. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Move to new TSO segmenting scheme.David S. Miller
Make TSO segment transmit size decisions at send time not earlier. The basic scheme is that we try to build as large a TSO frame as possible when pulling in the user data, but the size of the TSO frame output to the card is determined at transmit time. This is guided by tp->xmit_size_goal. It is always set to a multiple of MSS and tells sendmsg/sendpage how large an SKB to try and build. Later, tcp_write_xmit() and tcp_push_one() chop up the packet if necessary and conditions warrant. These routines can also decide to "defer" in order to wait for more ACKs to arrive and thus allow larger TSO frames to be emitted. A general observation is that TSO elongates the pipe, thus requiring a larger congestion window and larger buffering especially at the sender side. Therefore, it is important that applications 1) get a large enough socket send buffer (this is accomplished by our dynamic send buffer expansion code) 2) do large enough writes. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Break out send buffer expansion test.David S. Miller
This makes it easier to understand, and allows easier tweaking of the heuristic later on. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Do not call tcp_tso_acked() if no work to do.David S. Miller
In tcp_clean_rtx_queue(), if the TSO packet is not even partially acked, do not waste time calling tcp_tso_acked(). Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Kill bogus comment above tcp_tso_acked().David S. Miller
Everything stated there is out of data. tcp_trim_skb() does adjust the available socket send buffer space and skb->truesize now. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Fix send-side cpu utiliziation regression.David S. Miller
Only put user data purely to pages when doing TSO. The extra page allocations cause two problems: 1) Add the overhead of the page allocations themselves. 2) Make us do small user copies when we get to the end of the TCP socket cache page. It is still beneficial to purely use pages for TSO, so we will do it for that case. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Eliminate redundant computations in tcp_write_xmit().David S. Miller
tcp_snd_test() is run for every packet output by a single call to tcp_write_xmit(), but this is not necessary. For one, the congestion window space needs to only be calculated one time, then used throughout the duration of the loop. This cleanup also makes experimenting with different TSO packetization schemes much easier. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Break out tcp_snd_test() into it's constituent parts.David S. Miller
tcp_snd_test() does several different things, use inline functions to express this more clearly. 1) It initializes the TSO count of SKB, if necessary. 2) It performs the Nagle test. 3) It makes sure the congestion window is adhered to. 4) It makes sure SKB fits into the send window. This cleanup also sets things up so that things like the available packets in the congestion window does not need to be calculated multiple times by packet sending loops such as tcp_write_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Fix __tcp_push_pending_frames() 'nonagle' handling.David S. Miller
'nonagle' should be passed to the tcp_snd_test() function as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the tail of the write_queue. This is because Nagle does not apply to such frames since we cannot possibly tack more data onto them. However, while doing this __tcp_push_pending_frames() makes all of the packets in the write_queue use this modified 'nonagle' value. Fix the bug and simplify this function by just calling tcp_write_xmit() directly if sk_send_head is non-NULL. As a result, we can now make tcp_data_snd_check() just call tcp_push_pending_frames() instead of the specialized __tcp_data_snd_check(). Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Fix redundant calculations of tcp_current_mss()David S. Miller
tcp_write_xmit() uses tcp_current_mss(), but some of it's callers, namely __tcp_push_pending_frames(), already has this value available already. While we're here, fix the "cur_mss" argument to be "unsigned int" instead of plain "unsigned". Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: tcp_write_xmit() tabbing cleanupDavid S. Miller
Put the main basic block of work at the top-level of tabbing, and mark the TCP_CLOSE test with unlikely(). Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Kill extra cwnd validate in __tcp_push_pending_frames().David S. Miller
The tcp_cwnd_validate() function should only be invoked if we actually send some frames, yet __tcp_push_pending_frames() will always invoke it. tcp_write_xmit() does the call for us, so the call here can simply be removed. Also, tcp_write_xmit() can be marked static. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Add missing skb_header_release() call to tcp_fragment().David S. Miller
When we add any new packet to the TCP socket write queue, we must call skb_header_release() on it in order for the TSO sharing checks in the drivers to work. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Move __tcp_data_snd_check into tcp_output.cDavid S. Miller
It reimplements portions of tcp_snd_check(), so it we move it to tcp_output.c we can consolidate it's logic much easier in a later change. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Move send test logic out of net/tcp.hDavid S. Miller
This just moves the code into tcp_output.c, no code logic changes are made by this patch. Using this as a baseline, we can begin to untangle the mess of comparisons for the Nagle test et al. We will also be able to reduce all of the redundant computation that occurs when outputting data packets. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Fix quick-ack decrementing with TSO.David S. Miller
On each packet output, we call tcp_dec_quickack_mode() if the ACK flag is set. It drops tp->ack.quick until it hits zero, at which time we deflate the ATO value. When doing TSO, we are emitting multiple packets with ACK set, so we should decrement tp->ack.quick that many segments. Note that, unlike this case, tcp_enter_cwr() should not take the tcp_skb_pcount(skb) into consideration. That function, one time, readjusts tp->snd_cwnd and moves into TCP_CA_CWR state. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.David S. Miller
The ideal and most optimal layout for an SKB when doing scatter-gather is to put all the headers at skb->data, and all the user data in the page array. This makes SKB splitting and combining extremely simple, especially before a packet goes onto the wire the first time. So, when sk_stream_alloc_pskb() is given a zero size, make sure there is no skb_tailroom(). This is achieved by applying SKB_DATA_ALIGN() to the header length used here. Next, make select_size() in TCP output segmentation use a length of zero when NETIF_F_SG is true on the outgoing interface. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Remove __ARGS from include/net/slhc_vj.hAlexey Dobriyan
I suspect "#define __ARGS(x) ()" was deprecated before I was born. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: improve readability of dev_set_promiscuity() in net/core/dev.cDavid Chau
A trivial patch to improve the readability of dev_set_promiscuity() in net/core/dev.c. New code does exactly the same thing as original code. Signed-off-by: David Chau <ddcc@mit.edu> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[SHAPER]: Switch to spinlocks.Christoph Hellwig
Dave, you were right and the sleeping locks in shaper were broken. Markus Kanet noticed this and also tested the patch below that switches locking to spinlocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV4]: More broken memory allocation fixes for fib_trieRobert Olsson
Below a patch to preallocate memory when doing resize of trie (inflate halve) If preallocations fails it just skips the resize of this tnode for this time. The oops we got when killing bgpd (with full routing) is now gone. Patrick memory patch is also used. Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[DECNET]: Fix memset overflow on 64bit archs while dumping decnet routing rulesThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV4]: Bug fix in rt_check_expire()Eric Dumazet
- rt_check_expire() fixes (an overflow occured if size of the hash was >= 65536) reminder of the bugfix: The rt_check_expire() has a serious problem on machines with large route caches, and a standard HZ value of 1000. With default values, ie ip_rt_gc_interval = 60*HZ = 60000 ; the loop count : for (t = ip_rt_gc_interval << rt_hash_log; t >= 0; overflows (t is a 31 bit value) as soon rt_hash_log is >= 16 (65536 slots in route cache hash table). In this case, rt_check_expire() does nothing at all Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV4]: Use the fancy alloc_large_system_hash() function for route hash tableEric Dumazet
- rt hash table allocated using alloc_large_system_hash() function. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Hashed spinlocks in net/ipv4/route.cEric Dumazet
- Locking abstraction - Spinlocks moved out of rt hash table : Less memory (50%) used by rt hash table. it's a win even on UP. - Sizing of spinlocks table depends on NR_CPUS Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV4]: Handle large allocations in fib_triePatrick McHardy
Inflating a node a couple of times makes it exceed the 128k kmalloc limit. Use __get_free_pages for allocations > PAGE_SIZE, as in fib_hash. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TG3]: Update driver version and reldate.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[TG3]: support for ethtool -CMichael Chan
Add support for ethtool -C with verification of user parameters. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV6]: Makes IPv6 rcv registration happen last during initialisation.Herbert Xu
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[IPV4]: Fix crash in ip_rcv while booting related to netconsoleHerbert Xu
Makes IPv4 ip_rcv registration happen last in af_inet. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[SKGE]: Fix build on big-endianDavid S. Miller
Missing PCI_REV_DESC define. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05Merge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6Linus Torvalds
2005-07-05[PKT_SCHED]: Report rate estimator configuration errors during qdisc allocationThomas Graf
Current behaviour is to not report an error if a rate estimator is created together with a qdisc and the configuration of the rate estimator is bogus. This leads to unexpected behaviour because the user is not notified. New behaviour is to report the error and let the whole qdisc creation operation fail so the user is able to fix his mistake. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[PKT_SCHED]: Cleanup qdisc creation and alignment macrosThomas Graf
Adds qdisc_alloc() to share code between qdisc_create() and qdisc_create_dflt(). Hides the qdisc alignment behind macros and makes use of them. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[PKT_SCHED]: Move sch_generic.c prototypes to correct header fileThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Reduce size of sk_buff by 4 bytesThomas Graf
Reduce local_df to a bit field and ip_summed to a 2 bits field thus saving 13 bits. Move bit fields, packet type, and protocol into the spare area between the priority and the destructor. Saves 4 bytes on both, 32bit and 64bit architectures. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Remove unused security member in sk_buffThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: net/core/filter.c: make len cover the entire packetPatrick McHardy
As suggested by Herbert Xu: Since we don't require anything to be in the linear packet range anymore make len cover the entire packet. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Consolidate common code in net/core/filter.cPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Remove redundant code in net/core/filter.cPatrick McHardy
skb_header_pointer handles linear and non-linear data, no need to handle linear data again. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05[NET]: Fix signedness issues in net/core/filter.cPatrick McHardy
This is the code to load packet data into a register: k = fentry->k; if (k < 0) { ... } else { u32 _tmp, *p; p = skb_header_pointer(skb, k, 4, &_tmp); if (p != NULL) { A = ntohl(*p); continue; } } skb_header_pointer checks if the requested data is within the linear area: int hlen = skb_headlen(skb); if (offset + len <= hlen) return skb->data + offset; When offset is within [INT_MAX-len+1..INT_MAX] the addition will result in a negative number which is <= hlen. I couldn't trigger a crash on my AMD64 with 2GB of memory, but a coworker tried on his x86 machine and it crashed immediately. This patch fixes the check in skb_header_pointer to handle large positive offsets similar to skb_copy_bits. Invalid data can still be accessed using negative offsets (also similar to skb_copy_bits), anyone using negative offsets needs to verify them himself. Thanks to Thomas Vögtle <thomas.voegtle@coreworks.de> for verifying the problem by crashing his machine and providing me with an Oops. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6Linus Torvalds
2005-07-04[SPARC64]: Fix IRQ retry interval timer value on sparc64 PCI controllers.David S. Miller
Use '5' instead of 'infinity'. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-04[SPARC64]: Small Schizo PCI controller programming tweaks.David S. Miller
Use macro instead of magic value for Tomatillo discard- timeout interrupt enable register bit. Leave OBP programming PTO value unless Tomatillo and version >= 0x2. If no-bus-parking property is present, explicitly clear PCICTRL_PARK bit. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-04[SPARC64]: Do proper DMA IRQ syncing on TomatilloDavid S. Miller
This was the main impetus behind adding the PCI IRQ shim. In order to properly order DMA writes wrt. interrupts, you have to write to a PCI controller register, then poll for that bit clearing. There is one bit for each interrupt source, and setting this register bit tells Tomatillo to drain all pending DMA from that device. Furthermore, Tomatillo's with revision less than 4 require us to do a block store due to some memory transaction ordering issues it has on JBUS. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-04[SPARC64]: Add support for IRQ pre-handlers.David S. Miller
This allows a PCI controller to shim into IRQ delivery so that DMA queues can be drained, if necessary. If some bus specific code needs to run before an IRQ handler is invoked, the bus driver simply needs to setup the function pointer in bucket->irq_info->pre_handler and the two args bucket->irq_info->pre_handler_arg[12]. The Schizo PCI driver is converted over to use a pre-handler for the DMA write-sync processing it needs when a device is behind a PCI->PCI bus deeper than the top-level APB bridges. While we're here, clean up all of the action allocation and handling. Now, we allocate the irqaction as part of the bucket->irq_info area. There is an array of 4 irqaction (for PCI irq sharing) and a bitmask saying which entries are active. The bucket->irq_info is allocated at build_irq() time, not at request_irq() time. This simplifies request_irq() and free_irq() tremendously. The SMP dynamic IRQ retargetting code got removed in this change too. It was disabled for a few months now, and we can resurrect it in the future if we want. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-04[SPARC]: bpp: remove sleep_on usageChristoph Hellwig
Use schedule_timeout() instead of sleep_on + timer. Totally untested due to lack of hardware, but the changes are rather trivial. Signed-off-by: David S. Miller <davem@davemloft.net>