aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2009-01-29ipv4: fix infinite retry loop in IP-ConfigBenjamin Zores
Signed-off-by: Benjamin Zores <benjamin.zores@alcatel-lucent.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-26tcp: Fix length tcp_splice_data_recv passes to skb_splice_bits.Dimitris Michailidis
tcp_splice_data_recv has two lengths to consider: the len parameter it gets from tcp_read_sock, which specifies the amount of data in the skb, and rd_desc->count, which is the amount of data the splice caller still wants. Currently it passes just the latter to skb_splice_bits, which then splices min(rd_desc->count, skb->len - offset) bytes. Most of the time this is fine, except when the skb contains urgent data. In that case len goes only up to the urgent byte and is less than skb->len - offset. By ignoring len tcp_splice_data_recv may a) splice data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len. Now, tcp_read_sock doesn't handle used > len and leaves the socket in a bad state (both sk_receive_queue and copied_seq are bad at that point) resulting in duplicated data and corruption. Fix by passing min(rd_desc->count, len) to skb_splice_bits. Signed-off-by: Dimitris Michailidis <dm@chelsio.com> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-26udp: optimize bind(0) if many ports are in useEric Dumazet
commit 9088c5609584684149f3fb5b065aa7f18dcb03ff (udp: Improve port randomization) introduced a regression for UDP bind() syscall to null port (getting a random port) in case lot of ports are already in use. This is because we do about 28000 scans of very long chains (220 sockets per chain), with many spin_lock_bh()/spin_unlock_bh() calls. Fix this using a bitmap (64 bytes for current value of UDP_HTABLE_SIZE) so that we scan chains at most once. Instead of 250 ms per bind() call, we get after patch a time of 2.9 ms Based on a report from Vitaly Mayatskikh Reported-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-14gso: Ensure that the packet is long enoughHerbert Xu
When we get a GSO packet from an untrusted source, we need to ensure that it is sufficiently long so that we don't end up crashing. Based on discovery and patch by Ian Campbell. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-13tcp: splice as many packets as possible at onceWilly Tarreau
As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not optimal. It processes at most one segment per call. This results in low performance and very high overhead due to syscall rate when splicing from interfaces which do not support LRO. Willy provided a patch inside tcp_splice_read(), but a better fix is to let tcp_read_sock() process as many segments as possible, so that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less often. With this change, splice() behaves like tcp_recvmsg(), being able to consume many skbs in one system call. With typical 1460 bytes of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return 16*1460 = 23360 bytes. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-12netfilter 06/09: nf_conntrack: fix ICMP/ICMPv6 timeout sysctls on big-endianPatrick McHardy
An old bug crept back into the ICMP/ICMPv6 conntrack protocols: the timeout values are defined as unsigned longs, the sysctl's maxsize is set to sizeof(unsigned int). Use unsigned int for the timeout values as in the other conntrack protocols. Reported-by: Jean-Mickael Guerin <jean-mickael.guerin@6wind.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-12netfilter 01/09: remove "happy cracking" messagePatrick McHardy
Don't spam logs for locally generated short packets. these can only be generated by root. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-09Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (22 commits) ioat: fix self test for multi-channel case dmaengine: bump initcall level to arch_initcall dmaengine: advertise all channels on a device to dma_filter_fn dmaengine: use idr for registering dma device numbers dmaengine: add a release for dma class devices and dependent infrastructure ioat: do not perform removal actions at shutdown iop-adma: enable module removal iop-adma: kill debug BUG_ON iop-adma: let devm do its job, don't duplicate free dmaengine: kill enum dma_state_client dmaengine: remove 'bigref' infrastructure dmaengine: kill struct dma_client and supporting infrastructure dmaengine: replace dma_async_client_register with dmaengine_get atmel-mci: convert to dma_request_channel and down-level dma_slave dmatest: convert to dma_request_channel dmaengine: introduce dma_request_channel and private channels net_dma: convert to dma_find_channel dmaengine: provide a common 'issue_pending_all' implementation dmaengine: centralize channel allocation, introduce dma_find_channel dmaengine: up-level reference counting to the module level ...
2009-01-08Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
2009-01-08tcp6: Add GRO supportHerbert Xu
This patch adds GRO support for TCP over IPv6. The code is exactly the same as the IPv4 version except for the pseudo-header checksum computation. Note that I've removed the unused tcphdr argument from tcp_v6_check rather than invent a bogus value for GRO. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-07Merge branch 'next' into for-linusJames Morris
2009-01-06net_dma: convert to dma_find_channelDan Williams
Use the general-purpose channel allocation provided by dmaengine. Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-01-06dmaengine: up-level reference counting to the module levelDan Williams
Simply, if a client wants any dmaengine channel then prevent all dmaengine modules from being removed. Once the clients are done re-enable module removal. Why?, beyond reducing complication: 1/ Tracking reference counts per-transaction in an efficient manner, as is currently done, requires a complicated scheme to avoid cache-line bouncing effects. 2/ Per-transaction ref-counting gives the false impression that a dma-driver can be gracefully removed ahead of its user (net, md, or dma-slave) 3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but if such an engine were built one day we still would not need to notify clients of remove events. The driver can simply return NULL to a ->prep() request, something that is much easier for a client to handle. Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-01-05tcp: Kill extraneous SPLICE_F_NONBLOCK checks.David S. Miller
In splice TCP receive, the SPLICE_F_NONBLOCK flag is used to compute the "timeo" value. So checking it again inside of the main receive loop to trigger -EAGAIN processing is entirely unnecessary. Noticed by Jarek P. and Lennert Buytenhek. Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-05tcp: don't mask EOF and socket errors on nonblocking splice receiveLennert Buytenhek
Currently, setting SPLICE_F_NONBLOCK on splice from a TCP socket results in masking of EOF (RDHUP) and error conditions on the socket by an -EAGAIN return. Move the NONBLOCK check in tcp_splice_read() to be after the EOF and error checks to fix this. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-04gro: Use gso_size to store MSSHerbert Xu
In order to allow GRO packets without frag_list at all, we need to store the MSS in the packet itself. The obvious place is gso_size. The only thing to watch out for is if the packet ends up not being GRO then we need to clear gso_size before pushing the packet into the stack. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-31netlabel: Update kernel configuration APIPaul Moore
Update the NetLabel kernel API to expose the new features added in kernel releases 2.6.25 and 2.6.28: the static/fallback label functionality and network address based selectors. Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-12-29net: Fix percpu counters deadlockHerbert Xu
When we converted the protocol atomic counters such as the orphan count and the total socket count deadlocks were introduced due to the mismatch in BH status of the spots that used the percpu counter operations. Based on the diagnosis and patch by Peter Zijlstra, this patch fixes these issues by disabling BH where we may be in process context. Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-29cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits: netRusty Russell
In future all cpumask ops will only be valid (in general) for bit numbers < nr_cpu_ids. So use that instead of NR_CPUS in iterators and other comparisons. This is always safe: no cpu number can be >= nr_cpu_ids, and nr_cpu_ids is initialized to NR_CPUS at boot. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits) net: Allow dependancies of FDDI & Tokenring to be modular. igb: Fix build warning when DCA is disabled. net: Fix warning fallout from recent NAPI interface changes. gro: Fix potential use after free sfc: If AN is enabled, always read speed/duplex from the AN advertising bits sfc: When disabling the NIC, close the device rather than unregistering it sfc: SFT9001: Add cable diagnostics sfc: Add support for multiple PHY self-tests sfc: Merge top-level functions for self-tests sfc: Clean up PHY mode management in loopback self-test sfc: Fix unreliable link detection in some loopback modes sfc: Generate unique names for per-NIC workqueues 802.3ad: use standard ethhdr instead of ad_header 802.3ad: generalize out mac address initializer 802.3ad: initialize ports LACPDU from const initializer 802.3ad: remove typedef around ad_system 802.3ad: turn ports is_individual into a bool 802.3ad: turn ports is_enabled into a bool 802.3ad: make ntt bool ixgbe: Fix set_ringparam in ixgbe to use the same memory pools. ... Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due to the conversion to %pI (in this networking merge) and the addition of doing IPv6 addresses (from the earlier merge of CIFS).
2008-12-26ipsec: Remove useless ret variableHerbert Xu
This patch removes a useless ret variable from the IPv4 ESP/UDP decapsulation code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25tcp: Always set urgent pointer if it's beyond snd_nxtHerbert Xu
Our TCP stack does not set the urgent flag if the urgent pointer does not fit in 16 bits, i.e., if it is more than 64K from the sequence number of a packet. This behaviour is different from the BSDs, and clearly contradicts the purpose of urgent mode, which is to send the notification (though not necessarily the associated data) as soon as possible. Our current behaviour may in fact delay the urgent notification indefinitely if the receiver window does not open up. Simply matching BSD however may break legacy applications which incorrectly rely on the out-of-band delivery of urgent data, and conversely the in-band delivery of non-urgent data. Alexey Kuznetsov suggested a safe solution of following BSD only if the urgent pointer itself has not yet been transmitted. This way we guarantee that when the remote end sees the packet with non-urgent data marked as urgent due to wrap-around we would have advanced the urgent pointer beyond, either to the actual urgent data or to an as-yet untransmitted packet. The only potential downside is that applications on the remote end may see multiple SIGURG notifications. However, this would occur anyway with other TCP stacks. More importantly, the outcome of such a duplicate notification is likely to be harmless since the signal itself does not carry any information other than the fact that we're in urgent mode. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25netns: igmp: make /proc/net/{igmp,mcfilter} per netnsAlexey Dobriyan
This patch makes the followinf proc entries per-netns: /proc/net/igmp /proc/net/mcfilter Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25netns: igmp: allow IPPROTO_IGMP sockets in netnsAlexey Dobriyan
Looks like everything is already ready. Required for ebtables(8) for one thing. Also, required for ipmr per-netns (coming soon). (Benjamin) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25Merge branch 'next' into for-linusJames Morris
2008-12-18tcp: Stop scaring users with "treason uncloaked!"Matt Mackall
The original message was unhelpful and extremely alarming to our poor users, despite its charm. Make it less frightening. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16ipmr: merge common codeIlpo Järvinen
Also removes redundant skb->len < x check which can't be true once pskb_may_pull(skb, x) succeeded. $ diff-funcs pim_rcv ipmr.c ipmr.c pim_rcv_v1 --- ipmr.c:pim_rcv() +++ ipmr.c:pim_rcv_v1() @@ -1,22 +1,27 @@ -static int pim_rcv(struct sk_buff * skb) +int pim_rcv_v1(struct sk_buff * skb) { - struct pimreghdr *pim; + struct igmphdr *pim; struct iphdr *encap; struct net_device *reg_dev = NULL; if (!pskb_may_pull(skb, sizeof(*pim) + sizeof(*encap))) goto drop; - pim = (struct pimreghdr *)skb_transport_header(skb); - if (pim->type != ((PIM_VERSION<<4)|(PIM_REGISTER)) || - (pim->flags&PIM_NULL_REGISTER) || - (ip_compute_csum((void *)pim, sizeof(*pim)) != 0 && - csum_fold(skb_checksum(skb, 0, skb->len, 0)))) + pim = igmp_hdr(skb); + + if (!mroute_do_pim || + skb->len < sizeof(*pim) + sizeof(*encap) || + pim->group != PIM_V1_VERSION || pim->code != PIM_V1_REGISTER) goto drop; - /* check if the inner packet is destined to mcast group */ encap = (struct iphdr *)(skb_transport_header(skb) + - sizeof(struct pimreghdr)); + sizeof(struct igmphdr)); + /* + Check that: + a. packet is really destinted to a multicast group + b. packet is not a NULL-REGISTER + c. packet is not truncated + */ if (!ipv4_is_multicast(encap->daddr) || encap->tot_len == 0 || ntohs(encap->tot_len) + sizeof(*pim) > skb->len) @@ -40,9 +45,9 @@ skb->ip_summed = 0; skb->pkt_type = PACKET_HOST; dst_release(skb->dst); + skb->dst = NULL; reg_dev->stats.rx_bytes += skb->len; reg_dev->stats.rx_packets++; - skb->dst = NULL; nf_reset(skb); netif_rx(skb); dev_put(reg_dev); $ codiff net/ipv4/ipmr.o.old net/ipv4/ipmr.o.new net/ipv4/ipmr.c: pim_rcv_v1 | -283 pim_rcv | -284 2 functions changed, 567 bytes removed net/ipv4/ipmr.c: __pim_rcv | +307 1 function changed, 307 bytes added net/ipv4/ipmr.o.new: 3 functions changed, 307 bytes added, 567 bytes removed, diff: -260 (Tested on x86_64). It seems that pimlen arg could be left out as well and eq-sizedness of structs trapped with BUILD_BUG_ON but I don't think that's more than a cosmetic flaw since there aren't that many args anyway. Compile tested. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15tcp: Add GRO supportHerbert Xu
This patch adds the TCP-specific portion of GRO. The criterion for merging is extremely strict (the TCP header must match exactly apart from the checksum) so as to allow refragmentation. Otherwise this is pretty much identical to LRO, except that we support the merging of ECN packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15ipv4: Add GRO infrastructureHerbert Xu
This patch adds GRO support for IPv4. The criteria for merging is more stringent than LRO, in particular, we require all fields in the IP header to be identical except for the length, ID and checksum. In addition, the ID must form an arithmetic sequence with a difference of one. The ID requirement might seem overly strict, however, most hardware TSO solutions already obey this rule. Linux itself also obeys this whether GSO is in use or not. In future we could relax this rule by storing the IDs (or rather making sure that we don't drop them when pulling the aggregate skb's tail). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/e1000e/ich8lan.c
2008-12-15netfilter: update rwlock initialization for nat_tableSteven Rostedt
The commit e099a173573ce1ba171092aee7bb3c72ea686e59 (netfilter: netns nat: per-netns NAT table) renamed the nat_table from __nat_table to nat_table without updating the __RW_LOCK_UNLOCKED(__nat_table.lock). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-14icsk: join error paths using gotoIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-09tcp: tcp_vegas cong avoid fix Doug Leith
This patch addresses a book-keeping issue in tcp_vegas.c. At present tcp_vegas does separate book-keeping of cwnd based on packet sequence numbers. A mismatch can develop between this book-keeping and tp->snd_cwnd due, for example, to delayed acks acking multiple packets. When vegas transitions to reno operation (e.g. following loss), then this mismatch leads to incorrect behaviour (akin to a cwnd backoff). This seems mostly to affect operation at low cwnds where delayed acking can lead to a significant fraction of cwnd being covered by a single ack, leading to the book-keeping mismatch. This patch modifies the congestion avoidance update to avoid the need for separate book-keeping while leaving vegas congestion avoidance functionally unchanged. A secondary advantage of this modification is that the use of fixed-point (via V_PARAM_SHIFT) and 64 bit arithmetic is no longer necessary, simplifying the code. Some example test measurements with the patched code (confirming no functional change in the congestion avoidance algorithm) can be seen at: http://www.hamilton.ie/doug/vegaspatch/ Signed-off-by: Doug Leith <doug.leith@nuim.ie> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: fix tso_should_defer in 64bitIlpo Järvinen
Since jiffies is unsigned long, the types get expanded into that and after long enough time the difference will therefore always be > 1 (and that probably happens near boot as well as iirc the first jiffies wrap is scheduler close after boot to find out problems related to that early). This was originally noted by Bill Fink in Dec'07 but nobody never ended fixing it. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: use tcp_write_xmit also in tcp_push_oneIlpo Järvinen
tcp_minshall_update is not significant difference since it only checks for not full-sized skb which is BUG'ed on the push_one path anyway. tcp_snd_test is tcp_nagle_test+tcp_cwnd_test+tcp_snd_wnd_test, just the order changed slightly. net/ipv4/tcp_output.c: tcp_snd_test | -89 tcp_mss_split_point | -91 tcp_may_send_now | +53 tcp_cwnd_validate | -98 tso_fragment | -239 __tcp_push_pending_frames | -1340 tcp_push_one | -146 7 functions changed, 53 bytes added, 2003 bytes removed, diff: -1950 net/ipv4/tcp_output.c: tcp_write_xmit | +1772 1 function changed, 1772 bytes added, diff: +1772 tcp_output.o.new: 8 functions changed, 1825 bytes added, 2003 bytes removed, diff: -178 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-core.c drivers/net/wireless/iwlwifi/iwl-sta.c
2008-12-05tcp: move some parts from tcp_write_xmitIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: share code through function, not through copy-paste. :-)Ilpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: drop tcp_bound_rto, merge content of it tcp_set_rtoIlpo Järvinen
Both are called by the same sites. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: no need to pass prev skb around, reduces arg pressureIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: introduce struct tcp_sacktag_state to reduce arg pressureIlpo Järvinen
There are just too many args to some sacktag functions. This idea was first proposed by David S. Miller around a year ago, and the current situation is much worse that what it was back then. tcp_sacktag_one can be made a bit simpler by returning the new sacked (it can be achieved with a single variable though the previous code "caching" sacked into a local variable and therefore it is not exactly equal but the results will be the same). codiff on x86_64 tcp_sacktag_one | -15 tcp_shifted_skb | -50 tcp_match_skb_to_sack | -1 tcp_sacktag_walk | -64 tcp_sacktag_write_queue | -59 tcp_urg | +1 tcp_event_data_recv | -1 7 functions changed, 1 bytes added, 190 bytes removed, diff: -189 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: make mtu probe failure to not break gso'ed skbs unnecessarilyIlpo Järvinen
I noticed that since skb->len has nothing to do with actual segment length with gso, we need to figure it out separately, reuse a function from the recent shifting stuff (generalize it). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: Fix thinko making the not-shiftable to cover S|R as wellIlpo Järvinen
S|R won't result in S if just SACK is received. DSACK is another story (but it is covered correctly already). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05tcp: force mss equality with the next skb too.Ilpo Järvinen
Also make if-goto forest nicer looking. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-04tcp: tcp_vegas ssthresh bug fixDoug Leith
This patch fixes a bug in tcp_vegas.c. At the moment this code leaves ssthresh untouched. However, this means that the vegas congestion control algorithm is effectively unable to reduce cwnd below the ssthresh value (if the vegas update lowers the cwnd below ssthresh, then slow start is activated to raise it back up). One example where this matters is when during slow start cwnd overshoots the link capacity and a flow then exits slow start with ssthresh set to a value above where congestion avoidance would like to adjust it. Signed-off-by: Doug Leith <doug.leith@nuim.ie> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03net: /proc/net/ip_mr_cache, display Iif as a signed shortBenjamin Thery
Today, iproute2 fails to show multicast forwarding unresolved cache entries while scanning /proc/net/ip_mr_cache. Indeed, it expects to see -1 in 'Iif' column to identify unresolved entries but the kernel outputs 65535. It's a signed/unsigned issue: 'Iif', the source interface, is retrieved from member mfc_parent in struct mfc_cache. mfc_parent is a vifi_t: unsigned short, but is displayed in ipmr_mfc_seq_show() as "%-3d", signed integer. In unresolevd entries, the 65535 value (0xFFFF) comes from this define: #define ALL_VIFS ((vifi_t)(-1)) That may explains why the guy who added support for this in iproute2 thought a -1 should be expected. I don't know if this must be fixed in kernel or in iproute2. Who is right? What is the correct API? How was it designed originally? I let you decide if it should goes in the kernel or be fixed in iproute2. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03net: fix /proc/net/ip_mr_cache display - V2Benjamin Thery
/proc/net/ip_mr_cache and /proc/net/ip6_mr_cache displays garbage when showing unresolved mfc_cache entries. [root@qemu tests]# cat /proc/net/ip_mr_cache Group Origin Iif Pkts Bytes Wrong Oifs 014C00EF 010014AC 1 10 10050 0 2:1 3:1 024C00EF 010014AC 65535 514 2 -559067475 The first line is correct. It is a resolved cache entry, 10 packets used it... The second line represents an unresolved entry, and the columns Pkts(4th), Bytes(5th) and Wrong(6th) just show garbage. In struct mfc_cache, there's an union to store data for resolved and unresolved cases. And what ipmr_mfc_seq_show() is printing in these columns for the unresolved entries is some bytes from mfc_cache.mfc_un.res. Bad. (eg. In our case -559067475 is in fact 0xdead4ead which is the spinlock magic from mfc_cache.mfc_un.unres.unresolved.lock.magic). This patch replaces the garbage data written in these columns for the unresolved entries by '0' (zeros) which is more correct. This change doesn't break the ABI. Also, mfc->mfc_un.res.pkt, mfc->mfc_un.res.bytes, mfc->mfc_un.res.wrong_if are unsigned long. It applies on top of net-next-2.6. The patch for net-2.6 is slightly different because of the NIP6_FMT to %pI6 conversion that was made in the seq_printf. Changelog: ========== V2: * Instead of breaking the ABI by suppressing the columns that have no meaning for unresolved entries, fill them with 0 values. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-04Merge branch 'master' into nextJames Morris
Conflicts: fs/nfsd/nfs4recover.c Manually fixed above to use new creds API functions, e.g. nfs4_save_creds(). Signed-off-by: James Morris <jmorris@namei.org>
2008-12-03tcp: make urg+gso work for real this timeIlpo Järvinen
I should have noticed this earlier... :-) The previous solution to URG+GSO/TSO will cause SACK block tcp_fragment to do zig-zig patterns, or even worse, a steep downward slope into packet counting because each skb pcount would be truncated to pcount of 2 and then the following fragments of the later portion would restore the window again. Basically this reverts "tcp: Do not use TSO/GSO when there is urgent data" (33cf71cee1). It also removes some unnecessary code from tcp_current_mss that didn't work as intented either (could be that something was changed down the road, or it might have been broken since the dawn of time) because it only works once urg is already written while this bug shows up starting from ~64k before the urg point. The retransmissions already are split to mss sized chunks, so only new data sending paths need splitting in case they have a segment otherwise suitable for gso/tso. The actually check can be improved to be more narrow but since this is late -rc already, I'll postpone thinking the more fine-grained things. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-01net: percpu_counter_inc() should not be called in BH-disabled sectionEric Dumazet
Based upon a lockdep report by Alexey Dobriyan. I checked all per_cpu_counter_xxx() usages in network tree, and I think all call sites are BH enabled except one in inet_csk_listen_stop(). commit dd24c00191d5e4a1ae896aafe33c6b8095ab4bd1 (net: Use a percpu_counter for orphan_count) replaced atomic_t orphan_count to a percpu_counter. atomic_inc()/atomic_dec() can be called from any context, while percpu_counter_xxx() should be called from a consistent state. For orphan_count, this context can be the BH-enabled one. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>