aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2008-03-20[TCP]: Fix shrinking windows with window scalingPatrick McHardy
When selecting a new window, tcp_select_window() tries not to shrink the offered window by using the maximum of the remaining offered window size and the newly calculated window size. The newly calculated window size is always a multiple of the window scaling factor, the remaining window size however might not be since it depends on rcv_wup/rcv_nxt. This means we're effectively shrinking the window when scaling it down. The dump below shows the problem (scaling factor 2^7): - Window size of 557 (71296) is advertised, up to 3111907257: IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...> - New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes below the last end: IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...> The number 40 results from downscaling the remaining window: 3111907257 - 3111841425 = 65832 65832 / 2^7 = 514 65832 % 2^7 = 40 If the sender uses up the entire window before it is shrunk, this can have chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq() will notice that the window has been shrunk since tcp_wnd_end() is before tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number. This will fail the receivers checks in tcp_sequence() however since it is before it's tp->rcv_wup, making it respond with a dupack. If both sides are in this condition, this leads to a constant flood of ACKs until the connection times out. Make sure the window is never shrunk by aligning the remaining window to the window scaling factor. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-20[NETFILTER]: ipt_recent: sanity check hit countDaniel Hokka Zakrisson
If a rule using ipt_recent is created with a hit count greater than ip_pkt_list_tot, the rule will never match as it cannot keep track of enough timestamps. This patch makes ipt_recent refuse to create such rules. With ip_pkt_list_tot's default value of 20, the following can be used to reproduce the problem. nc -u -l 0.0.0.0 1234 & for i in `seq 1 100`; do echo $i | nc -w 1 -u 127.0.0.1 1234; done This limits it to 20 packets: iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \ --rsource iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \ 60 --hitcount 20 --name test --rsource -j DROP While this is unlimited: iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \ --rsource iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \ 60 --hitcount 21 --name test --rsource -j DROP With the patch the second rule-set will throw an EINVAL. Reported-by: Sean Kennedy <skennedy@vcn.com> Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-17[IPV4]: esp_output() misannotationsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-17[NET] endianness noise: INADDR_ANYAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-11[TCP]: Prevent sending past receiver window with TSO (at last skb)Ilpo Järvinen
With TSO it was possible to send past the receiver window when the skb to be sent was the last in the write queue while the receiver window is the limiting factor. One can notice that there's a loophole in the tcp_mss_split_point that lacked a receiver window check for the tcp_write_queue_tail() if also cwnd was smaller than the full skb. Noticed by Thomas Gleixner <tglx@linutronix.de> in form of "Treason uncloaked! Peer ... shrinks window .... Repaired." messages (the peer didn't actually shrink its window as the message suggests, we had just sent something past it without a permission to do so). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04[IPCONFIG]: The kernel gets no IP from some DHCP serversStephen Hemminger
From: Stephen Hemminger <shemminger@linux-foundation.org> Based upon a patch by Marcel Wappler: This patch fixes a DHCP issue of the kernel: some DHCP servers (i.e. in the Linksys WRT54Gv5) are very strict about the contents of the DHCPDISCOVER packet they receive from clients. Table 5 in RFC2131 page 36 requests the fields 'ciaddr' and 'siaddr' MUST be set to '0'. These DHCP servers ignore Linux kernel's DHCP discovery packets with these two fields set to '255.255.255.255' (in contrast to popular DHCP clients, such as 'dhclient' or 'udhcpc'). This leads to a not booting system. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04[ESP]: Add select on AUTHENCHerbert Xu
Now the ESP uses the AEAD interface even for algorithms which are not combined mode, we need to select CONFIG_CRYPTO_AUTHENC as otherwise only combined mode algorithms will work. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03[TCP]: Must count fack_count also when skippingIlpo Järvinen
It makes fackets_out to grow too slowly compared with the real write queue. This shouldn't cause those BUG_TRAP(packets <= tp->packets_out) to trigger but how knows how such inconsistent fackets_out affects here and there around TCP when everything is nowadays assuming accurate fackets_out. So lets see if this silences them all. Reported by Guillaume Chazarain <guichaz@gmail.com>. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-28[TCP]: BIC web page link is corrected.Sangtae Ha
Signed-off-by: Sangtae Ha <sha2@ncsu.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-28[IPV4]: Use proc_create() to setup ->proc_fops firstWang Chen
Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to main tree. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-28[IPCOMP]: Disable BH on output when using shared tfmHerbert Xu
Because we use shared tfm objects in order to conserve memory, (each tfm requires 128K of vmalloc memory), BH needs to be turned off on output as that can occur in process context. Previously this was done implicitly by the xfrm output code. That was lost when it became lockless. So we need to add the BH disabling to IPComp directly. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-26[INET]: Don't create tunnels with '%' in name.Pavel Emelyanov
Four tunnel drivers (ip_gre, ipip, ip6_tunnel and sit) can receive a pre-defined name for a device from the userspace. Since these drivers call the register_netdevice() (rtnl_lock, is held), which does _not_ generate the device's name, this name may contain a '%' character. Not sure how bad is this to have a device with a '%' in its name, but all the other places either use the register_netdev(), which call the dev_alloc_name(), or explicitly call the dev_alloc_name() before registering, i.e. do not allow for such names. This had to be prior to the commit 34cc7b, but I forgot to number the patches and this one got lost, sorry. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-26[IPV4]: Reset scope when changing addressBjorn Mork
This bug did bite at least one user, who did have to resort to rebooting the system after an "ifconfig eth0 127.0.0.1" typo. Deleting the address and adding a new is a less intrusive workaround. But I still beleive this is a bug that should be fixed. Some way or another. Another possibility would be to remove the scope mangling based on address. This will always be incomplete (are 127/8 the only address space with host scope requirements?) We set the scope to RT_SCOPE_HOST if an IPv4 interface is configured with a loopback address (127/8). The scope is never reset, and will remain set to RT_SCOPE_HOST after changing the address. This patch resets the scope if the address is changed again, to restore normal functionality. Signed-off-by: Bjorn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-23[IP_TUNNEL]: Don't limit the number of tunnels with generic name explicitly.Pavel Emelyanov
Use the added dev_alloc_name() call to create tunnel device name, rather than iterate in a hand-made loop with an artificial limit. Thanks Patrick for noticing this. [ The way this works is, when the device is actually registered, the generic code noticed the '%' in the name and invokes dev_alloc_name() to fully resolve the name. -DaveM ] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-19[NETFILTER]: Fix incorrect use of skb_make_writableJoonwoo Park
http://bugzilla.kernel.org/show_bug.cgi?id=9920 The function skb_make_writable returns true or false. Signed-off-by: Joonwoo Park <joonwpark81@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-19[NETFILTER]: {ip,ip6,nfnetlink}_queue: fix SKB_LINEAR_ASSERT when mangling ↵Patrick McHardy
packet data As reported by Tomas Simonaitis <tomas.simonaitis@gmail.com>, inserting new data in skbs queued over {ip,ip6,nfnetlink}_queue triggers a SKB_LINEAR_ASSERT in skb_put(). Going back through the git history, it seems this bug is present since at least 2.6.12-rc2, probably even since the removal of skb_linearize() for netfilter. Linearize non-linear skbs through skb_copy_expand() when enlarging them. Tested by Thomas, fixes bugzilla #9933. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-19ipv4/fib_hash.c: fix NULL dereferenceAdrian Bunk
Unless I miss a guaranteed relation between between "f" and "new_fa->fa_info" this patch is required for fixing a NULL dereference introduced by commit a6501e080c318f8d4467679d17807f42b3a33cd5 ("[IPV4] FIB_HASH: Reduce memory needs and speedup lookups") and spotted by the Coverity checker. Eric Dumazet says: Hum, you are right, kmem_cache_free() doesnt allow a NULL object, like kfree() does. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-17[TCP]: Fix tcp_v4_send_synack() commentKris Katterjohn
Signed-off-by: Kris Katterjohn <katterjohn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-17[IPV4]: fix alignment of IP-Config outputUwe Kleine-Koenig
Make the indented lines aligned in the output (not in the code). Signed-off-by: Uwe Kleine-Koenig <Uwe.Kleine-Koenig@digi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-17Revert "[NDISC]: Fix race in generic address resolution"David S. Miller
This reverts commit 69cc64d8d92bf852f933e90c888dfff083bd4fc9. It causes recursive locking in IPV6 because unlike other neighbour layer clients, it even needs neighbour cache entries to send neighbour soliciation messages :-( We'll have to find another way to fix this race. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-13[INET]: Unexport inet_listen_wlockAdrian Bunk
This patch removes the no longer used EXPORT_SYMBOL(inet_listen_wlock). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-13[INET]: Unexport __inet_hash_connectAdrian Bunk
This patch removes the unused EXPORT_SYMBOL_GPL(__inet_hash_connect). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12[IPSEC]: Fix bogus usage of u64 on input sequence numberHerbert Xu
Al Viro spotted a bogus use of u64 on the input sequence number which is big-endian. This patch fixes it by giving the input sequence number its own member in the xfrm_skb_cb structure. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12[NDISC]: Fix race in generic address resolutionDavid S. Miller
Frank Blaschka provided the bug report and the initial suggested fix for this bug. He also validated this version of this fix. The problem is that the access to neigh->arp_queue is inconsistent, we grab references when dropping the lock lock to call neigh->ops->solicit() but this does not prevent other threads of control from trying to send out that packet at the same time causing corruptions because both code paths believe they have exclusive access to the skb. The best option seems to be to hold the write lock on neigh->lock during the ->solicit() call. I looked at all of the ndisc_ops implementations and this seems workable. The only case that needs special care is the IPV4 ARP implementation of arp_solicit(). It wants to take neigh->lock as a reader to protect the header entry in neigh->ha during the emission of the soliciation. We can simply remove the read lock calls to take care of that since holding the lock as a writer at the caller providers a superset of the protection afforded by the existing read locking. The rest of the ->solicit() implementations don't care whether the neigh is locked or not. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12fib_trie: /proc/net/route performance improvementStephen Hemminger
Use key/offset caching to change /proc/net/route (use by iputils route) from O(n^2) to O(n). This improves performance from 30sec with 160,000 routes to 1sec. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12fib_trie: handle empty treeStephen Hemminger
This fixes possible problems when trie_firstleaf() returns NULL to trie_leafindex(). Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12[IPV4]: Remove IP_TOS setting privilege checks.David S. Miller
Various RFCs have all sorts of things to say about the CS field of the DSCP value. In particular they try to make the distinction between values that should be used by "user applications" and things like routing daemons. This seems to have influenced the CAP_NET_ADMIN check which exists for IP_TOS socket option settings, but in fact it has an off-by-one error so it wasn't allowing CS5 which is meant for "user applications" as well. Further adding to the inconsistency and brokenness here, IPV6 does not validate the DSCP values specified for the IPV6_TCLASS socket option. The real actual uses of these TOS values are system specific in the final analysis, and these RFC recommendations are just that, "a recommendation". In fact the standards very purposefully use "SHOULD" and "SHOULD NOT" when describing how these values can be used. In the final analysis the only clean way to provide consistency here is to remove the CAP_NET_ADMIN check. The alternatives just don't work out: 1) If we add the CAP_NET_ADMIN check to ipv6, this can break existing setups. 2) If we just fix the off-by-one error in the class comparison in IPV4, certain DSCP values can be used in IPV6 but not IPV4 by default. So people will just ask for a sysctl asking to override that. I checked several other freely available kernel trees and they do not make any privilege checks in this area like we do. For the BSD stacks, this goes back all the way to Stevens Volume 2 and beyond. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-09[IGMP]: Optimize kfree_skb in igmp_rcv.Denis V. Lunev
Merge error paths inside igmp_rcv. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-07[IPV4]: route: fix crash ip_route_inputPatrick McHardy
ip_route_me_harder() may call ip_route_input() with skbs that don't have skb->dev set for skbs rerouted in LOCAL_OUT and TCP resets generated by the REJECT target, resulting in a crash when dereferencing skb->dev->nd_net. Since ip_route_input() has an input device argument, it seems correct to use that one anyway. Bug introduced in b5921910a1 (Routing cache virtualization). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-07[NETFILTER]: nf_conntrack: fix ct_extend ->move operationPatrick McHardy
The ->move operation has two bugs: - It is called with the same extension as source and destination, so it doesn't update the new extension. - The address of the old extension is calculated incorrectly, instead of (void *)ct->ext + ct->ext->offset[i] it uses ct->ext + ct->ext->offset[i]. Fixes a crash on x86_64 reported by Chuck Ebbert <cebbert@redhat.com> and Thomas Woerner <twoerner@redhat.com>. Tested-by: Thomas Woerner <twoerner@redhat.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-05ipvs: Make wrr "no available servers" error message rate-limitedSven Wegener
No available servers is more an error message than something informational. It should also be rate-limited, else we're going to flood our logs on a busy director, if all real servers are out of order with a weight of zero. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits) [PKT_SCHED]: vlan tag match [NET]: Add if_addrlabel.h to sanitized headers. [NET] rtnetlink.c: remove no longer used functions [ICMP]: Restore pskb_pull calls in receive function [INET]: Fix accidentally broken inet(6)_hash_connect's port offset calculations. [NET]: Remove further references to net-modules.txt bluetooth rfcomm tty: destroy before tty_close() bluetooth: blacklist another Broadcom BCM2035 device drivers/bluetooth/btsdio.c: fix double-free drivers/bluetooth/bpa10x.c: fix memleak bluetooth: uninlining bluetooth: hidp_process_hid_control remove unnecessary parameter dealing tun: impossible to deassert IFF_ONE_QUEUE or IFF_NO_PI hamradio: fix dmascc section mismatch [SCTP]: Fix kernel panic while received AUTH chunk with BAD shared key identifier [SCTP]: Fix kernel panic while received AUTH chunk while enabled auth [IPV4]: Formatting fix for /proc/net/fib_trie. [IPV6]: Fix sysctl compilation error. [NET_SCHED]: Add #ifdef CONFIG_NET_EMATCH in net/sched/cls_flow.c (latest git broken build) [IPV4]: Fix compile error building without CONFIG_FS_PROC ...
2008-02-05NetLabel: introduce a new kernel configuration API for NetLabelPaul Moore
Add a new set of configuration functions to the NetLabel/LSM API so that LSMs can perform their own configuration of the NetLabel subsystem without relying on assistance from userspace. Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05[ICMP]: Restore pskb_pull calls in receive functionHerbert Xu
Somewhere along the development of my ICMP relookup patch the header length check went AWOL on the non-IPsec path. This patch restores the check. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-05[INET]: Fix accidentally broken inet(6)_hash_connect's port offset calculations.Pavel Emelyanov
The port offset calculations depend on the protocol family, but, as Adrian noticed, I broke this logic with the commit 5ee31fc1ecdcbc234c8c56dcacef87c8e09909d8 [INET]: Consolidate inet(6)_hash_connect. Return this logic back, by passing the port offset directly into the consolidated function. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Noticed-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-05[IPV4]: Formatting fix for /proc/net/fib_trie.Denis V. Lunev
The line in the /proc/net/fib_trie for route with TOS specified - has extra \n at the end - does not have a space after route scope like below. |-- 1.1.1.1 /32 universe UNICASTtos =1 Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-05[IPSEC] xfrm4_beet_input(): fix an if()Adrian Bunk
A bug every C programmer makes at some point in time... Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03[SOCK] proto: Add hashinfo member to struct protoArnaldo Carvalho de Melo
This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NETNS]: Lookup in FIB semantic hashes taking into account the namespace.Denis V. Lunev
The namespace is not available in the fib_sync_down_addr, add it as a parameter. Looking up a device by the pointer to it is OK. Looking up using a result from fib_trie/fib_hash table lookup is also safe. No need to fix that at all. So, just fix lookup by address and insertion to the hash table path. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NETNS]: Add a namespace mark to fib_info.Denis V. Lunev
This is required to make fib_info lookups namespace aware. In the other case initial namespace devices are marked as dead in the local routing table during other namespace stop. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4]: fib_sync_down rework.Denis V. Lunev
fib_sync_down can be called with an address and with a device. In reality it is called either with address OR with a device. The codepath inside is completely different, so lets separate it into two calls for these two cases. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NETNS]: Process interface address manipulation routines in the namespace.Denis V. Lunev
The namespace is available when required except rtm_to_ifaddr. Add namespace argument to it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.Denis V. Lunev
Remove error code assignment inside brackets on failure. The code looks better if the error is assigned before condition check. Also, the compiler treats this better. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4]: Fix memory leak on error path during FIB initialization.Denis V. Lunev
net->ipv4.fib_table_hash is not freed when fib4_rules_init failed. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[TCP]: Unexport sysctl_tcp_tso_win_divisorAdrian Bunk
This patch removes the no longer used EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4]: Make struct ipv4_devconf static.Adrian Bunk
struct ipv4_devconf can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4] route cache: Introduce rt_genid for smooth cache invalidationEric Dumazet
Current ip route cache implementation is not suited to large caches. We can consume a lot of CPU when cache must be invalidated, since we currently need to evict all cache entries, and this eviction is sometimes asynchronous. min_delay & max_delay can somewhat control this asynchronism behavior, but whole thing is a kludge, regularly triggering infamous soft lockup messages. When entries are still in use, this also consumes a lot of ram, filling dst_garbage.list. A better scheme is to use a generation identifier on each entry, so that cache invalidation can be performed by changing the table identifier, without having to scan all entries. No more delayed flushing, no more stalling when secret_interval expires. Invalidated entries will then be freed at GC time (controled by ip_rt_gc_timeout or stress), or when an invalidated entry is found in a chain when an insert is done. Thus we keep a normal equilibrium. This patch : - renames rt_hash_rnd to rt_genid (and makes it an atomic_t) - Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit) - Checks entry->rt_genid at appropriate places :
2008-01-31[TCP]: Fix a bug in strategy_allowed_congestion_controlShan Wei
In strategy_allowed_congestion_control of the 2.6.24 kernel, when sysctl_string return 1 on success,it should call tcp_set_allowed_congestion_control to set the allowed congestion control.But, it don't. the sysctl_string return 1 on success, otherwise return negative, never return 0.The patch fix the problem. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[IPV4] fib_trie: rescan if key is lost during dumpStephen Hemminger
Normally during a dump the key of the last dumped entry is used for continuation, but since lock is dropped it might be lost. In that case fallback to the old counter based N^2 behaviour. This means the dump will end up skipping some routes which matches what FIB_HASH does. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NETNS]: Udp sockets per-net lookup.Pavel Emelyanov
Add the net parameter to udp_get_port family of calls and udp_lookup one and use it to filter sockets. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>