path: root/net/ipv4
Age | Commit message | Author
2005-06-21[IPV4]: Add LC-Trie FIB lookup algorithm.Robert Olsson
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20[NETLINK]: fib_lookup() via netlinkRobert Olsson
Below is a more generic patch to do fib_lookup via netlink. For others we should say that we discussed this as a way to verify route selection. It's also possible there are other uses for this. In short, the first half of struct fib_result_nl is filled in by the caller, and the netlink call fills in the other half and returns it. In case anyone is interested, there is a corresponding user app to compare the full routing table; this was used to test the implementation of the LC-trie. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20[IPSEC]: Add XFRM_STATE_NOPMTUDISC flagHerbert Xu
This patch adds the flag XFRM_STATE_NOPMTUDISC for xfrm states. It is similar to the nopmtudisc on IPIP/GRE tunnels. It only has an effect on IPv4 tunnel mode states. For these states, it will ensure that the DF flag is always cleared. This is primarily useful to work around ICMP blackholes. In future this flag could also allow a larger MTU to be set within the tunnel just like IPIP/GRE tunnels. This could be useful for short haul tunnels where temporary fragmentation outside the tunnel is desired over smaller fragments inside the tunnel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20[IPSEC]: Add xfrm_init_stateHerbert Xu
This patch adds xfrm_init_state which is simply a wrapper that calls xfrm_get_type and subsequently x->type->init_state. It also gets rid of the unused args argument. Abstracting it out allows us to add common initialisation code, e.g., to set family-specific flags. The add_time setting in xfrm_user.c was deleted because it's already set by xfrm_state_alloc. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[TCP]: Fix sysctl_tcp_low_latencyDavid S. Miller
When enabled, this should disable UCOPY prequeue'ing altogether, but it does not due to a missing test. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[IPV4]: [4/4] signed vs unsigned cleanup in net/ipv4/raw.cJesper Juhl
This patch changes the type of the third parameter 'length' of the raw_send_hdrinc() function from 'int' to 'size_t'. This makes sense since this function is only ever called from one location, and the value passed as the third parameter in that location is itself of type size_t, so this makes the receiving function's parameter type match. Also, inside raw_send_hdrinc() the 'length' variable is used in comparisons with unsigned values and passed as parameter to functions expecting unsigned values (it's used in a single comparison with a signed value, but that one can never actually be negative so the patch also casts that one to size_t to stop gcc worrying, and it is passed in a single instance to memcpy_fromiovecend() which expects a signed int, but as far as I can see that's not a problem since the value of 'length' shouldn't ever exceed the value of a signed int). Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[IPV4]: [3/4] signed vs unsigned cleanup in net/ipv4/raw.cJesper Juhl
This patch changes the type of the local variable 'i' in raw_probe_proto_opt() from 'int' to 'unsigned int'. The only use of 'i' in this function is as a counter in a for() loop and subsequent index into the msg->msg_iov[] array. Since 'i' is compared in a loop to the unsigned variable msg->msg_iovlen gcc -W generates this warning : net/ipv4/raw.c:340: warning: comparison between signed and unsigned Changing 'i' to unsigned silences this warning and is safe since the array index can never be negative anyway, so unsigned int is the logical type to use for 'i' and also enables a larger msg_iov[] array (but I don't know if that will ever matter). Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
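The loop-index change described above can be illustrated with a minimal userspace sketch. The names `struct seg` and `total_len` are hypothetical stand-ins for the msg_iov[] walk, not the code in net/ipv4/raw.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of an iovec-style array. */
struct seg {
    size_t len;
};

/* Sum the lengths of nsegs segments. The counter 'i' has the same
 * signedness as 'nsegs', so the loop condition compares unsigned with
 * unsigned and does not trigger a signed/unsigned comparison warning. */
size_t total_len(const struct seg *segs, unsigned int nsegs)
{
    size_t total = 0;
    unsigned int i;

    assert(segs != NULL || nsegs == 0);
    for (i = 0; i < nsegs; i++)
        total += segs[i].len;
    return total;
}
```

Declaring the counter with the same signedness as the bound is the whole change; the loop body is unaffected.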
2005-06-18[IPV4]: [2/4] signed vs unsigned cleanup in net/ipv4/raw.cJesper Juhl
This patch gets rid of the following gcc -W warning in net/ipv4/raw.c : net/ipv4/raw.c:387: warning: comparison of unsigned expression < 0 is always false Since 'len' is of type size_t it is unsigned and can thus never be <0, and since this is obvious from the function declaration just a few lines above I think it's ok to remove the pointless check for len<0. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[IPV4]: [1/4] signed vs unsigned cleanup in net/ipv4/raw.cJesper Juhl
This patch silences these two gcc -W warnings in net/ipv4/raw.c : net/ipv4/raw.c:517: warning: signed and unsigned type in conditional expression net/ipv4/raw.c:613: warning: signed and unsigned type in conditional expression It doesn't change the behaviour of the code, simply writes the conditional expression with plain 'if()' syntax instead of '? :' , but since this breaks it into separate statements gcc no longer complains about having both a signed and unsigned value in the same conditional expression. Signed-off-by: David S. Miller <davem@davemloft.net>
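As a rough illustration of the rewrite described above, the following sketch replaces a mixed-sign `? :` expression with separate statements. The function `send_result` and its parameters are hypothetical and do not reproduce the actual lines in raw.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return a negative error code if there is one,
 * otherwise the number of bytes handled. */
long send_result(int err, size_t len)
{
    long ret;

    assert(err <= 0);
    /* Written as `err ? err : len`, the two arms would have types int
     * and size_t in a single conditional expression. Splitting the arms
     * into separate assignments keeps each conversion explicit. */
    if (err)
        ret = err;
    else
        ret = (long)len;
    return ret;
}
```

The behaviour is the same as the ternary form for the values involved; only the way the expression is written changes.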
2005-06-18[IPV4/IPV6]: Replace spin_lock_irq with spin_lock_bhHerbert Xu
In light of my recent patch to net/ipv4/udp.c that replaced the spin_lock_irq calls on the receive queue lock with spin_lock_bh, here is a similar patch for all other occurrences of spin_lock_irq on receive/error queue locks in IPv4 and IPv6. In these stacks, we know that they can only be entered from user or softirq context. Therefore it's safe to disable BH only. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NETLINK]: Set correct pid for ioctl originating netlink eventsJamal Hadi Salim
This patch ensures that netlink events created as a result of programs using ioctls (such as ifconfig, route etc) contain the correct PID of those programs. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NETLINK]: Correctly set NLM_F_MULTI without checking the pidJamal Hadi Salim
This patch rectifies some rtnetlink message builders that derive the flags from the pid. It is now explicit like the other cases which get it right. Also fixes half a dozen dumpers which did not set NLM_F_MULTI at all. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NET]: Move sysctl_max_syn_backlog into request_sock.cDavid S. Miller
This fixes the CONFIG_INET=n build failure noticed by Andrew Morton. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NET] rename struct tcp_listen_opt to struct listen_sockArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NET] Generalise tcp_listen_optArnaldo Carvalho de Melo
This chunks out the accept_queue and tcp_listen_opt code and moves them to net/core/request_sock.c and include/net/request_sock.h, to make it useful for other transport protocols, DCCP being the first one to use it. Next patches will rename tcp_listen_opt to accept_sock and remove the inline tcp functions that just call a reqsk_queue_ function. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NET] Rename open_request to request_sockArnaldo Carvalho de Melo
Ok, this one just renames some stuff to have a better namespace and to dissociate it from TCP:
  struct open_request -> struct request_sock
  tcp_openreq_alloc -> reqsk_alloc
  tcp_openreq_free -> reqsk_free
  tcp_openreq_fastfree -> __reqsk_free
With this most of the infrastructure closely resembles a struct sock methods subset. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18[NET] Generalise TCP's struct open_request minisock infrastructureArnaldo Carvalho de Melo
Kept this first changeset minimal, without changing existing names to ease peer review. Basically tcp_openreq_alloc now receives the or_calltable, which in turn has two new members:
  ->slab, which replaces tcp_openreq_cachep
  ->obj_size, to inform the size of the openreq descendant for a specific protocol
The protocol specific fields in struct open_request were moved to a class hierarchy, with the things that are common to all connection oriented PF_INET protocols in struct inet_request_sock, the TCP ones in tcp_request_sock, that is an inet_request_sock, that is an open_request. I.e. this uses the same approach used for the struct sock class hierarchy, with sk_prot indicating if the protocol wants to use the open_request infrastructure by filling in sk_prot->rsk_prot with an or_calltable. Results? Performance is improved and TCP v4 now uses only 64 bytes per open request minisock, down from 96 without this patch :-) Next changeset will rename some of the structs, fields and functions mentioned above, struct or_calltable is way unclear, better name it struct request_sock_ops, s/struct open_request/struct request_sock/g, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
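The struct-embedding "class hierarchy" and the per-protocol obj_size described above might look roughly like this in userspace C. Every name here (`req_ops`, `req_base`, `inet_req`, `tcp_req`, `req_alloc`) is illustrative only, calloc() stands in for the slab cache, and the field lists are invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* Per-protocol operations table; obj_size tells the generic allocator
 * how large the protocol-specific descendant is. */
struct req_ops {
    size_t obj_size;
};

/* Generic base object (plays the role of the open request). */
struct req_base {
    const struct req_ops *ops;
};

/* Fields assumed common to connection-oriented PF_INET protocols. */
struct inet_req {
    struct req_base base;   /* must be first so pointers convert safely */
    unsigned int loc_addr;
    unsigned int rmt_addr;
};

/* TCP-specific descendant: "is an" inet_req, which "is a" req_base. */
struct tcp_req {
    struct inet_req inet;   /* must be first */
    unsigned int rcv_isn;
    unsigned int snt_isn;
};

static const struct req_ops tcp_req_ops = {
    .obj_size = sizeof(struct tcp_req),
};

/* Generic allocator: sized by the ops table, not by a hard-coded type. */
struct req_base *req_alloc(const struct req_ops *ops)
{
    struct req_base *req;

    assert(ops != NULL && ops->obj_size >= sizeof(struct req_base));
    req = calloc(1, ops->obj_size);
    if (req)
        req->ops = ops;
    return req;
}

/* Downcast helper, valid because the base is the first member. */
struct tcp_req *tcp_rq(struct req_base *req)
{
    return (struct tcp_req *)req;
}
```

Because the allocator only consults `obj_size`, another protocol can reuse it by supplying its own ops table and descendant struct.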
2005-06-15[NETFILTER]: ipt_recent: last_pkts is an array of "unsigned long" not "u_int32_t"David S. Miller
This fixes various crashes on 64-bit when using this module. Based upon a patch by Juergen Kreileder <jk@blackdown.de>. Signed-off-by: David S. Miller <davem@davemloft.net> ACKed-by: Patrick McHardy <kaber@trash.net>
2005-06-13[NETFILTER]: Advance seq-file position in exp_next_seq()Patrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13[IPV4]: Sysctl configurable icmp error source address.J. Simonetti
This patch allows you to change the source address of icmp error messages. It applies cleanly to 2.6.11.11 and retains the default behaviour. In the old (default) behaviour icmp error messages are sent with the ip of the exiting interface. In the new behaviour (when the sysctl variable is toggled on), the message is sent with the ip of the interface that received the packet that caused the icmp error. This is the behaviour network administrators will expect from a router. It makes debugging complicated network layouts much easier. Also, all 'vendor routers' I know of have the latter behaviour. Signed-off-by: David S. Miller <davem@davemloft.net>
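The two behaviours described above reduce to a choice between two addresses. A minimal sketch, assuming a hypothetical helper `icmp_err_saddr` and a boolean flag standing in for the sysctl variable (this is not the kernel's code path):

```c
#include <assert.h>
#include <stdint.h>

/* Pick the source address for an ICMP error.
 * use_inbound == 0: old (default) behaviour, address of the interface
 *                   the error leaves through.
 * use_inbound != 0: new behaviour, address of the interface that
 *                   received the offending packet. */
uint32_t icmp_err_saddr(int use_inbound,
                        uint32_t inbound_if_addr,
                        uint32_t outgoing_if_addr)
{
    assert(inbound_if_addr != 0 && outgoing_if_addr != 0);
    return use_inbound ? inbound_if_addr : outgoing_if_addr;
}
```

With the flag off the result matches the old behaviour, which is why the default is preserved.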
2005-06-13[SCTP] Add support for ip_nonlocal_bind sysctl & IP_FREEBIND socket optionNeil Horman
Signed-off-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13[IPV4]: Multipath modules need a license to prevent kernel tainting.Randy Dunlap
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13[TCP]: Adjust TCP mem order check to new alloc_large_system_hashAndi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-02[IPVS]: remove net/ipv4/ipvs/ip_vs_proto_icmp.cAdrian Bunk
ip_vs_proto_icmp.c was never finished. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-31[IPSEC]: Fix esp_decap_data size verification in esp4.Edgar E Iglesias
Signed-off-by: Edgar E Iglesias <edgar@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-30[IPV4]: Fix BUG() in 2.6.x, udp_poll(), fragments + CONFIG_HIGHMEMHerbert Xu
Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
> Reconstructed forward trace:
>
>   net/ipv4/udp.c:1334 spin_lock_irq()
>   net/ipv4/udp.c:1336 udp_checksum_complete()
>   net/core/skbuff.c:1069 skb_shinfo(skb)->nr_frags > 1
>   net/core/skbuff.c:1086 kunmap_skb_frag()
>   net/core/skbuff.h:1087 local_bh_enable()
>   kernel/softirq.c:0140 WARN_ON(irqs_disabled());
The receive queue lock is never taken in IRQs (and should never be) so we can simply substitute bh for irq. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-30[NETFILTER]: Fix deadlock with ip_queue and tcp local input path.Harald Welte
When we have ip_queue being used from LOCAL_IN, then we end up with a situation where the verdicts coming back from userspace traverse the TCP input path from syscall context. While this seems to work most of the time, there's an ugly deadlock: syscall context is interrupted by the timer interrupt. When the timer interrupt leaves, the timer softirq gets scheduled and calls tcp_delack_timer() and the like. They themselves do bh_lock_sock(sk), which is already held from somewhere else -> boom. I've now tested the suggested solution by Patrick McHardy and Herbert Xu to simply use local_bh_{en,dis}able(). Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-29[IPV4]: Kill MULTIPATHHOLDROUTE flag.Pravin B. Shelar
It cannot work properly, so just ignore it in drr and rr multipath algorithms just like the random multipath algorithm does. Suggested by Herbert Xu. Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-29[IPV4]: Primary and secondary addressesHarald Welte
Add an option to make secondary IP addresses get promoted when primary IP addresses are removed from the device. It defaults to off to preserve existing behavior. Signed-off-by: Harald Welte <laforge@gnumonks.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-23[TCP]: Fix stretch ACK performance killer when doing ucopy.David S. Miller
When we are doing ucopy, we try to defer the ACK generation to cleanup_rbuf(). This works most of the time very well, but if the ucopy prequeue is large, this ACKing behavior kills performance. With TSO, it is possible to fill the prequeue so large that by the time the ACK is sent and gets back to the sender, most of the window has emptied of data and performance suffers significantly. This behavior does help in some cases, so we should think about re-enabling this trick in the future, using some kind of limit in order to avoid the bug case. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-19[NETFILTER]: Do not be clever about SKB ownership in ip_ct_gather_frags().David S. Miller
Just do an skb_orphan() and be done with it. Based upon discussions with Herbert Xu on netdev. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-19[IP_VS]: Remove extra __ip_vs_conn_put() for incoming ICMP.Julian Anastasov
Remove extra __ip_vs_conn_put for incoming ICMP in direct routing mode. Mark de Vries reports that IPVS connections are not leaked anymore. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-18[IPV4/IPV6] Ensure all frag_list members have NULL skHerbert Xu
Having frag_list members which hold wmem of an sk leads to nightmares with partially cloned frag skb's. The reason is that once you unleash a skb with a frag_list that has individual sk ownerships into the stack you can never undo those ownerships safely as they may have been cloned by things like netfilter. Since we have to undo them in order to make skb_linearize happy this approach leads to a dead-end. So let's go the other way and make this an invariant: For any skb on a frag_list, skb->sk must be NULL. That is, the socket ownership always belongs to the head skb. It turns out that the implementation is actually pretty simple. The above invariant is actually violated in the following patch for a short duration inside ip_fragment. This is OK because the offending frag_list member is either destroyed at the end of the slow path without being sent anywhere, or it is detached from the frag_list before being sent. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
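The invariant stated above ("for any skb on a frag_list, skb->sk must be NULL") can be expressed as a simple check over a linked list. The structs below are hypothetical stubs, not struct sk_buff or struct sock, and the clearing helper only nulls the pointer; it deliberately ignores the write-memory accounting a real implementation would have to handle:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a socket and a packet buffer. */
struct sock_stub {
    int id;
};

struct skb_stub {
    struct sock_stub *sk;        /* owning socket, or NULL */
    struct skb_stub *next;       /* next fragment on the same list */
    struct skb_stub *frag_list;  /* first fragment (head buffer only) */
};

/* Return 1 if no buffer on head's frag_list owns a socket, else 0.
 * The head itself may own one: ownership belongs to the head. */
int frag_list_invariant_holds(const struct skb_stub *head)
{
    const struct skb_stub *frag;

    assert(head != NULL);
    for (frag = head->frag_list; frag; frag = frag->next)
        if (frag->sk != NULL)
            return 0;
    return 1;
}

/* Establish the invariant by dropping ownership from every fragment. */
void frag_list_clear_owners(struct skb_stub *head)
{
    struct skb_stub *frag;

    assert(head != NULL);
    for (frag = head->frag_list; frag; frag = frag->next)
        frag->sk = NULL;
}
```

A check of this shape is the kind of thing one might put behind a debug assertion before handing a buffer to the rest of the stack.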
2005-05-05[PATCH] update Ross Biro bouncing email addressJesper Juhl
Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05[IPV4]: multipath_wrandom.c GPF fixesPatrick McHardy
multipath_wrandom needs to use GFP_ATOMIC. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[IPSEC]: Store idev entriesHerbert Xu
I found a bug that stopped IPsec/IPv6 from working. About a month ago IPv6 started using rt6i_idev->dev on the cached socket dst entries. If the cached socket dst entry is IPsec, then rt6i_idev will be NULL. Since we want to look at the rt6i_idev of the original route in this case, the easiest fix is to store rt6i_idev in the IPsec dst entry just as we do for a number of other IPv6 route attributes. Unfortunately this means that we need some new code to handle the references to rt6i_idev. That's why this patch is bigger than it would otherwise be. I've also done the same thing for IPv4 since it is conceivable that once these idev attributes start getting used for accounting, we probably need to dereference them for IPv4 IPsec entries too. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[NETFILTER]: Drop conntrack reference in ip_dev_loopback_xmit()Patrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[NETLINK]: Synchronous message processing.Herbert Xu
Let's recap the problem. The current asynchronous netlink kernel message processing is vulnerable to these attacks:
1) Hit and run: Attacker sends one or more messages and then exits before they're processed. This may confuse/disable the next netlink user that gets the netlink address of the attacker since it may receive the responses to the attacker's messages. Proposed solutions: a) Synchronous processing. b) Stream mode socket. c) Restrict/prohibit binding.
2) Starvation: Because various netlink rcv functions were written to not return until all messages have been processed on a socket, it is possible for these functions to execute for an arbitrarily long period of time. If this is successfully exploited it could also be used to hold rtnl forever. Proposed solutions: a) Synchronous processing. b) Stream mode socket.
Firstly let's cross off solution c). It only solves the first problem and it has user-visible impacts. In particular, it'll break user space applications that expect to bind or communicate with specific netlink addresses (pid's). So we're left with a choice of synchronous processing versus SOCK_STREAM for netlink. For the moment I'm sticking with the synchronous approach as suggested by Alexey since it's simpler and I'd rather spend my time working on other things. However, it does have a number of deficiencies compared to the stream mode solution:
1) User-space to user-space netlink communication is still vulnerable.
2) Inefficient use of resources. This is especially true for rtnetlink since the lock is shared with other users such as networking drivers. The latter could hold the rtnl while communicating with hardware which causes the rtnetlink user to wait when it could be doing other things.
3) It is still possible to DoS all netlink users by flooding the kernel netlink receive queue. The attacker simply fills the receive socket with a single netlink message that fills up the entire queue. The attacker then continues to call sendmsg with the same message in a loop.
Point 3) can be countered by retransmissions in user-space code, however it is pretty messy. In light of these problems (in particular, point 3), we should implement stream mode netlink at some point. In the meantime, here is a patch that implements synchronous processing. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[TCP]: Optimize check in port-allocation code.Folkert van Heusden
Signed-off-by: Folkert van Heusden <folkert@vanheusden.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[RTNETLINK] Cleanup rtnetlink_link tablesThomas Graf
Converts remaining rtnetlink_link tables to use C99 designated initializers to make grepping a little bit easier. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
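For readers unfamiliar with the construct, here is a small sketch of a handler table written with C99 designated initializers. The message types, handlers, and `struct link_entry` are made up for illustration and are not the rtnetlink definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message types indexing the table. */
enum { MSG_NEWROUTE, MSG_DELROUTE, MSG_GETROUTE, MSG_MAX };

typedef int (*msg_handler)(int arg);

struct link_entry {
    msg_handler doit;
    msg_handler dumpit;
};

static int route_new(int arg)  { return arg + 1; }
static int route_get(int arg)  { return arg; }
static int route_dump(int arg) { return arg * 2; }

/* Each slot names its index and its fields explicitly, so searching for
 * a message type or handler name lands on the relevant line. Slots and
 * fields that are not mentioned are zero-initialized (NULL handlers). */
static const struct link_entry link_table[MSG_MAX] = {
    [MSG_NEWROUTE] = { .doit = route_new },
    [MSG_GETROUTE] = { .doit = route_get, .dumpit = route_dump },
};

/* Call the doit handler for a type, or return -1 if none is set. */
int dispatch(int type, int arg)
{
    assert(type >= 0 && type < MSG_MAX);
    if (link_table[type].doit == NULL)
        return -1;
    return link_table[type].doit(arg);
}
```

Compared with a positional initializer, entries can be listed in any order and gaps need no filler rows.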
2005-05-03[NETFILTER]: Don't checksum CHECKSUM_UNNECESSARY skbs in TCP connection trackingPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03[NETFILTER]: Missing owner-field initialization in iptable_rawPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-28[NET]: /proc/net/stat/* header cleanupOlaf Rempel
Signed-off-by: Olaf Rempel <razzor@kopf-tisch.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-28[IPV4]: Incorrect permissions on route flush sysctlDave Jones
This has been brought up before (http://lkml.org/lkml/2000/1/21/116) but didn't seem to get resolved. This morning someone filed a bugzilla about it breaking sysctl(8). Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25[NET]: kill gratuitous includes of major.hAl Viro
A lot of places in there are including major.h for no reason whatsoever. Removed. And yes, it still builds. The history of that stuff is often amusing. E.g. for net/core/sock.c the story looks so, as far as I've been able to reconstruct it: we used to need major.h in net/socket.c circa 1.1.early. In 1.1.13 that need had disappeared, along with register_chrdev(SOCKET_MAJOR, "socket", &net_fops) in sock_init(). Include had not. When 1.2 -> 1.3 reorg of net/* had moved a lot of stuff from net/socket.c to net/core/sock.c, this crap had followed... Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25[TCP]: Trivial tcp_data_queue() cleanupJames Morris
This patch removes a superfluous initialization from tcp_data_queue(). Signed-off-by: James Morris <jmorris@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25[NETFILTER]: Drop conntrack reference when packet leaves IPPatrick McHardy
In the event a raw socket is created for sending purposes only, the creator never bothers to check the socket's receive queue. But we continue to add skbs to its queue until it fills up. Unfortunately, if ip_conntrack is loaded on the box, each skb we add to the queue potentially holds a reference to a conntrack. If the user attempts to unload ip_conntrack, we will spin around forever since the queued skbs are pinned. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25[NETFILTER]: Fix truncated sequence numbers in FTP helperYasuyuki KOZAKAI
Signed-off-by: Yasuyuki KOZAKAI <yasuyuki.kozkaai@toshiba.co.jp> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-24[TCP]: skb pcount with MTU discoveryDavid S. Miller
The problem is that when doing MTU discovery, the too-large segments in the write queue will be calculated as having a pcount of >1. When tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test when pcount > cwnd. The segments are eventually transmitted one at a time by keepalive, but this can take a long time. This patch checks if TSO is enabled when setting pcount. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
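The effect of checking for TSO when setting pcount can be sketched as follows. The function name `seg_pcount` and its parameters are assumptions made for this example and are not the kernel's helper:

```c
#include <assert.h>

/* Number of packets a queued segment counts as.
 * Without TSO every segment counts as one packet, even if it is larger
 * than the current MSS (as can happen after the path MTU shrinks).
 * With TSO, an oversized segment counts as ceil(len / mss) packets. */
unsigned int seg_pcount(unsigned int len, unsigned int mss, int tso_enabled)
{
    assert(mss > 0);
    if (!tso_enabled || len <= mss)
        return 1;
    return (len + mss - 1) / mss;
}
```

With a pcount of 1 for non-TSO senders, a too-large segment can no longer fail the congestion-window test merely because its computed pcount exceeds cwnd.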
2005-04-24[NETFILTER]: Ignore PSH on SYN/ACK in TCP connection trackingPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>