aboutsummaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2005-11-29[IPV6]: make two functions staticAdrian Bunk
This patch makes two needlessly global functions static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-29[NET]: Add const markers to various variables.Arjan van de Ven
the patch below marks various variables const in net/; the goal is to move them to the .rodata section so that they can't false-share cachelines with things that get written to, as well as potentially helping gcc a bit with optimisations. (these were found using a gcc patch to warn about such variables) Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-28[IPV6]: Implement appropriate dummy rule 4 in ipv6_dev_get_saddr().YOSHIFUJI Hideaki
Ensure to update hiscore.rule in dummy rule 4 in ipv6_dev_get_saddr(). Pointed out by Yan Zheng <yanzheng@21cn.com>. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-20Merge git://git.skbuff.net/gitroot/yoshfuji/linux-2.6.14+advapi-fix/David S. Miller
2005-11-20[IPV6]: Acquire addrconf_hash_lock for read in addrconf_verify(...)Yan Zheng
addrconf_verify(...) only traverse address hash table when addrconf_hash_lock is held for writing, and it may hold addrconf_hash_lock for a long time. So I think it's better to acquire addrconf_hash_lock for reading instead of writing Signed-off-by: Yan Zheng <yanzheng@21cn.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-20[IPV6]: Fix sending extension headers before and including routing header.YOSHIFUJI Hideaki
Based on suggestion from Masahide Nakamura <nakam@linux-ipv6.org>. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2005-11-20[IPV6]: Fix calculation of AH length during filling ancillary data.Ville Nuorvala
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2005-11-20[IPV6]: Fix memory management error during setting up new advapi sockopts.YOSHIFUJI Hideaki
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2005-11-17[IPV6]: Fib dump really needs GFP_ATOMIC.David S. Miller
Revert: 8225ccbaf01b459cf1e462047a51b2851e756bc1 Based upon a report by Yan Zheng. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-16[IPV4,IPV6]: replace handmade list with hlist in IPv{4,6} reassemblyYasuyuki Kozakai
Both of ipq and frag_queue have *next and **prev, and they can be replaced with hlist. Thanks Arnaldo Carvalho de Melo for the suggestion. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[IPV6]: Fixes sparse warning in ipv6/ipv6_sockglue.cLuiz Capitulino
The patch below fixes the following sparse warning: net/ipv6/ipv6_sockglue.c:291:13: warning: Using plain integer as NULL pointer Signed-off-by: Luiz Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[IPV6]: small fix for ipv6_dev_get_saddr(...)Yan Zheng
The "score.rule++" doesn't make any sense for me. According to codes above, I think it should be "hiscore.rule++;" . Signed-off-by: Yan Zheng<yanzheng@21cn.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[NETFILTER] fix leak of fragment queue at unloading nf_conntrack_ipv6Yasuyuki Kozakai
This patch makes nf_conntrack_ipv6 free all IPv6 fragment queues at module unloading time. Also introduce a BUG_ON if we ever again have leaks in the memory accounting. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[NETFILTER] nf_conntrack: fix possibility of infinite loop while evicting ↵Yasuyuki Kozakai
nf_ct_frag6_queue This synchronizes nf_ct_reasm with ipv6 reassembly, and fixes a possibility of an infinite loop if CPUs evict and create nf_ct_frag6_queue in parallel. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[NETFILTER]: fix type of sysctl variables in nf_conntrack_ipv6Yasuyuki Kozakai
These variables should be unsigned. This fixes sysctl handler for nf_ct_frag6_{low,high}_thresh. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14[NETFILTER]: cleanup IPv6 Netfilter KconfigYasuyuki Kozakai
This removes linux 2.4 configs in comments as TODO lists. And this also move the entry of nf_conntrack to top like IPv4 Netfilter Kconfig. Based on original patch by Krzysztof Piotr Oledzki <ole@ans.pl>. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-12[IPV6]: Fix unnecessary GFP_ATOMIC allocation in fib6 dumpThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-12[IPV6]: Fix rtnetlink dump infinite loopHerbert Xu
The recent change to netlink dump "done" callback handling broke IPv6 which played dirty tricks with the "done" callback. This causes an infinite loop during a dump. The following patch fixes it. This bug was reported by Jeff Garzik. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-11[IPV6]: Fix inet6_init missing unregister.David S. Miller
Based mostly upon a patch from Olaf Kirch <okir@suse.de> When initialization fails in inet6_init(), we should unregister the PF_INET6 socket ops. Also, check sock_register()'s return value for errors. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10[NET]: Detect hardware rx checksum faults correctlyHerbert Xu
Here is the patch that introduces the generic skb_checksum_complete which also checks for hardware RX checksum faults. If that happens, it'll call netdev_rx_csum_fault which currently prints out a stack trace with the device name. In future it can turn off RX checksum. I've converted every spot under net/ that does RX checksum checks to use skb_checksum_complete or __skb_checksum_complete with the exceptions of: * Those places where checksums are done bit by bit. These will call netdev_rx_csum_fault directly. * The following have not been completely checked/converted: ipmr ip_vs netfilter dccp This patch is based on patches and suggestions from Stephen Hemminger and David S. Miller. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10[NETLINK]: Make netlink_callback->done() optionalThomas Graf
Most netlink families make no use of the done() callback, making it optional gets rid of all unnecessary dummy implementations. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09[NETFILTER]: Add nf_conntrack subsystem.Yasuyuki Kozakai
The existing connection tracking subsystem in netfilter can only handle ipv4. There were basically two choices present to add connection tracking support for ipv6. We could either duplicate all of the ipv4 connection tracking code into an ipv6 counterpart, or (the choice taken by these patches) we could design a generic layer that could handle both ipv4 and ipv6 and thus requiring only one sub-protocol (TCP, UDP, etc.) connection tracking helper module to be written. In fact nf_conntrack is capable of working with any layer 3 protocol. The existing ipv4 specific conntrack code could also not deal with the pecularities of doing connection tracking on ipv6, which is also cured here. For example, these issues include: 1) ICMPv6 handling, which is used for neighbour discovery in ipv6 thus some messages such as these should not participate in connection tracking since effectively they are like ARP messages 2) fragmentation must be handled differently in ipv6, because the simplistic "defrag, connection track and NAT, refrag" (which the existing ipv4 connection tracking does) approach simply isn't feasible in ipv6 3) ipv6 extension header parsing must occur at the correct spots before and after connection tracking decisions, and there were no provisions for this in the existing connection tracking design 4) ipv6 has no need for stateful NAT The ipv4 specific conntrack layer is kept around, until all of the ipv4 specific conntrack helpers are ported over to nf_conntrack and it is feature complete. Once that occurs, the old conntrack stuff will get placed into the feature-removal-schedule and we will fully kill it off 6 months later. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-09[IPV6]: ip6ip6_lock is not unlocked in error path.Ken-ichirou MATSUZAWA
From: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09[IPV6]: Fix fallout from CONFIG_IPV6_PRIVACYPeter Chubb
Trying to build today's 2.6.14+git snapshot gives undefined references to use_tempaddr Looks like an ifdef got left out. Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08[NET]: kfree cleanupJesper Juhl
From: Jesper Juhl <jesper.juhl@gmail.com> This is the net/ part of the big kfree cleanup patch. Remove pointless checks for NULL prior to calling kfree() in net/. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br> Acked-by: Marcel Holtmann <marcel@holtmann.org> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Andrew Morton <akpm@osdl.org>
2005-11-08[IPV6]: RFC3484 compliant source address selectionYOSHIFUJI Hideaki
Choose more appropriate source address; e.g. - outgoing interface - non-deprecated - scope - matching label Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08[IPV6]: Make ipv6_addr_type() more generic so that we can use it for source ↵YOSHIFUJI Hideaki
address selection. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08[IPV6]: Put addr_diff() into common header for future use.YOSHIFUJI Hideaki
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-05[TCP/DCCP]: Randomize port selectionStephen Hemminger
This patch randomizes the port selected on bind() for connections to help with possible security attacks. It should also be faster in most cases because there is no need for a global lock. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-03[IPV6]: inet6_ifinfo_notify should use RTM_DELLINK in addrconf_ifdownYan Zheng
Signed-off-by: Yan Zheng <yanzheng@21cn.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-02[MCAST]: ip[6]_mc_add_src should be called when number of sources is zeroYan Zheng
And filter mode is exclude. Further explanation by David Stevens: Multicast source filters aren't widely used yet, and that's really the only feature that's affected if an application actually exercises this bug, as far as I can tell. An ordinary filter-less multicast join should still work, and only forwarded multicast traffic making use of filters and doing empty-source filters with the MSFILTER ioctl would be at risk of not getting multicast traffic forwarded to them because the reports generated would not be based on the correct counts. Signed-off-by: Yan Zheng <yanzheng@21cn.com Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31[MCAST] IPv6: Check packet size when process MulticastYan Zheng
Signed-off-by: Yan Zheng <yanzheng@21cn.com Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31[IPV6]: Fix behavior of ip6_route_input() for link local addressYan Zheng
I find that linux will reply echo request destined to an address which belongs to an interface other than the one from which the request received. This behavior doesn't make sense for link local address. YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> said: Please note that sender does need to setup neighbor entry by hand to reproduce this bug. (Link-local address on eth1 is not visible on eth0, from the point of view of neighbor discovery in IPv6.) +--------+ +--------+ | sender | | router | +---+----+ +-+----+-+ |eth0 eth0| |eth1 -----+----------------------+- -+-------------- Signed-off-by: Yan Zheng <yanzheng@21cn.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Andrew Morton <akpm@osdl.org> (forwarded) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31[NETFILTER]: Add "revision" support to arp_tables and ip6_tablesHarald Welte
Like ip_tables already has it for some time, this adds support for having multiple revisions for each match/target. We steal one byte from the name in order to accomodate a 8 bit version number. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-30[PATCH] Use sg_set_buf/sg_init_one where applicableDavid Hardeman
This patch uses sg_set_buf/sg_init_one in some places where it was duplicated. Signed-off-by: David Hardeman <david@2gen.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2005-10-28[MCAST] IPv6: Fix algorithm to compute Querier's Query IntervalYan Zheng
5.1.3. Maximum Response Code The Maximum Response Code field specifies the maximum time allowed before sending a responding Report. The actual time allowed, called the Maximum Response Delay, is represented in units of milliseconds, and is derived from the Maximum Response Code as follows: If Maximum Response Code < 32768, Maximum Response Delay = Maximum Response Code If Maximum Response Code >=32768, Maximum Response Code represents a floating-point value as follows: 0 1 2 3 4 5 6 7 8 9 A B C D E F +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1| exp | mant | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Maximum Response Delay = (mant | 0x1000) << (exp+3) 5.1.9. QQIC (Querier's Query Interval Code) The Querier's Query Interval Code field specifies the [Query Interval] used by the Querier. The actual interval, called the Querier's Query Interval (QQI), is represented in units of seconds, and is derived from the Querier's Query Interval Code as follows: If QQIC < 128, QQI = QQIC If QQIC >= 128, QQIC represents a floating-point value as follows: 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ |1| exp | mant | +-+-+-+-+-+-+-+-+ QQI = (mant | 0x10) << (exp + 3) -- rfc3810 #define MLDV2_QQIC(value) MLDV2_EXP(0x80, 4, 3, value) #define MLDV2_MRC(value) MLDV2_EXP(0x8000, 12, 3, value) Above macro are defined in mcast.c. but 1 << 4 == 0x10 and 1 << 12 == 0x1000. So the result computed by original Macro is larger. Signed-off-by: Yan Zheng <yanzheng@21cn.com> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-28[IPv4/IPv6]: UFO Scatter-gather approachAnanda Raju
Attached is kernel patch for UDP Fragmentation Offload (UFO) feature. 1. This patch incorporate the review comments by Jeff Garzik. 2. Renamed USO as UFO (UDP Fragmentation Offload) 3. udp sendfile support with UFO This patches uses scatter-gather feature of skb to generate large UDP datagram. Below is a "how-to" on changes required in network device driver to use the UFO interface. UDP Fragmentation Offload (UFO) Interface: ------------------------------------------- UFO is a feature wherein the Linux kernel network stack will offload the IP fragmentation functionality of large UDP datagram to hardware. This will reduce the overhead of stack in fragmenting the large UDP datagram to MTU sized packets 1) Drivers indicate their capability of UFO using dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG NETIF_F_HW_CSUM is required for UFO over ipv6. 2) UFO packet will be submitted for transmission using driver xmit routine. UFO packet will have a non-zero value for "skb_shinfo(skb)->ufo_size" skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP fragment going out of the adapter after IP fragmentation by hardware. skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[] contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW indicating that hardware has to do checksum calculation. Hardware should compute the UDP checksum of complete datagram and also ip header checksum of each fragmented IP packet. For IPV6 the UFO provides the fragment identification-id in skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating IPv6 fragments. Signed-off-by: Ananda Raju <ananda.raju@neterion.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-25[NET]: Wider use of for_each_*cpu()John Hawkes
In 'net' change the explicit use of for-loops and NR_CPUS into the general for_each_cpu() or for_each_online_cpu() constructs, as appropriate. This widens the scope of potential future optimizations of the general constructs, as well as takes advantage of the existing optimizations of first_cpu() and next_cpu(), which is advantageous when the true CPU count is much smaller than NR_CPUS. Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-25[IPV6]: Fix refcnt of struct ip6_flowlabelYan Zheng
Signed-off-by: Yan Zheng <yanzheng@21cn.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-15[NETFILTER]: Fix ip6_table.c build with NETFILTER_DEBUG enabled.Andrew Morton
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13[NETFILTER]: Fix OOPSes on machines with discontiguous cpu numbering.David S. Miller
Original patch by Harald Welte, with feedback from Herbert Xu and testing by Sébastien Bernard. EBTABLES, ARP tables, and IP/IP6 tables all assume that cpus are numbered linearly. That is not necessarily true. This patch fixes that up by calculating the largest possible cpu number, and allocating enough per-cpu structure space given that. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10[IPSEC] Fix block size/MTU bugs in ESPHerbert Xu
This patch fixes the following bugs in ESP: * Fix transport mode MTU overestimate. This means that the inner MTU is smaller than it needs be. Worse yet, given an input MTU which is a multiple of 4 it will always produce an estimate which is not a multiple of 4. For example, given a standard ESP/3DES/MD5 transform and an MTU of 1500, the resulting MTU for transport mode is 1462 when it should be 1464. The reason for this is because IP header lengths are always a multiple of 4 for IPv4 and 8 for IPv6. * Ensure that the block size is at least 4. This is required by RFC2406 and corresponds to what the esp_output function does. At the moment this only affects crypto_null as its block size is 1. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10[IPSEC]: Use ALIGN macro in ESPHerbert Xu
This patch uses the macro ALIGN in all the applicable spots for ESP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-05[IPV6]: Fix NS handing for proxy/anycast addressYOSHIFUJI Hideaki
Timer set up by pneigh_enqueue() ended up calling ndisc_rcv() via pndisc_redo(), which clears LOCALLY_ENQUEUED flag in NEIGH_CB(skb) and NS was queued again. Let's call ndisc_recv_ns() directly to avoid the loop. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-05[MCAST] ipv6: Fix address size in grec_sizeYan Zheng
Signed-Off-By: Yan Zheng <yanzheng@21cn.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-04[IPV6]: Fix infinite loop in udp_v6_get_port().YOSHIFUJI Hideaki
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03[IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnlHerbert Xu
The following patch renames __in_dev_get() to __in_dev_get_rtnl() and introduces __in_dev_get_rcu() to cover the second case. 1) RCU with refcnt should use in_dev_get(). 2) RCU without refcnt should use __in_dev_get_rcu(). 3) All others must hold RTNL and use __in_dev_get_rtnl(). There is one exception in net/ipv4/route.c which is in fact a pre-existing race condition. I've marked it as such so that we remember to fix it. This patch is based on suggestions and prior work by Suzanne Wood and Paul McKenney. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03[IPV6]: Fix leak added by udp connect dst caching fix.David S. Miller
Based upon a patch from Mitsuru KANDA <mk@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03[IPV6]: Fix ipv6 fragment ID selection at slow pathYan Zheng
Signed-Off-By: Yan Zheng <yanzheng@21cn.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03[INET]: speedup inet (tcp/dccp) lookupsEric Dumazet
Arnaldo and I agreed it could be applied now, because I have other pending patches depending on this one (Thank you Arnaldo) (The other important patch moves skc_refcnt in a separate cache line, so that the SMP/NUMA performance doesnt suffer from cache line ping pongs) 1) First some performance data : -------------------------------- tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established() The most time critical code is : sk_for_each(sk, node, &head->chain) { if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif)) goto hit; /* You sunk my battleship! */ } The sk_for_each() does use prefetch() hints but only the begining of "struct sock" is prefetched. As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far away from the begining of "struct sock", it has to bring into CPU cache cold cache line. Each iteration has to use at least 2 cache lines. This can be problematic if some chains are very long. 2) The goal ----------- The idea I had is to change things so that INET_MATCH() may return FALSE in 99% of cases only using the data already in the CPU cache, using one cache line per iteration. 3) Description of the patch --------------------------- Adds a new 'unsigned int skc_hash' field in 'struct sock_common', filling a 32 bits hole on 64 bits platform. struct sock_common { unsigned short skc_family; volatile unsigned char skc_state; unsigned char skc_reuse; int skc_bound_dev_if; struct hlist_node skc_node; struct hlist_node skc_bind_node; atomic_t skc_refcnt; + unsigned int skc_hash; struct proto *skc_prot; }; Store in this 32 bits field the full hash, not masked by (ehash_size - 1) Using this full hash as the first comparison done in INET_MATCH permits us immediatly skip the element without touching a second cache line in case of a miss. Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to sk_hash and tw_hash) already contains the slot number if we mask with (ehash_size - 1) File include/net/inet_hashtables.h 64 bits platforms : #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \ ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) 32bits platforms: #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) && \ (inet_sk(__sk)->daddr == (__saddr)) && \ (inet_sk(__sk)->rcv_saddr == (__daddr)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - Adds a prefetch(head->chain.first) in __inet_lookup_established()/__tcp_v4_check_established() and __inet6_lookup_established()/__tcp_v6_check_established() and __dccp_v4_check_established() to bring into cache the first element of the list, before the {read|write}_lock(&head->lock); Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>