aboutsummaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2007-10-17[IPSEC]: Rename mode to outer_mode and add inner_modeHerbert Xu
This patch adds a new field to xfrm states called inner_mode. The existing mode object is renamed to outer_mode. This is the first part of an attempt to fix inter-family transforms. As it is we always use the outer family when determining which mode to use. As a result we may end up shoving IPv4 packets into netfilter6 and vice versa. What we really want is to use the inner family for the first part of outbound processing and the outer family for the second part. For inbound processing we'd use the opposite pairing. I've also added a check to prevent silly combinations such as transport mode with inter-family transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Fix the race between sk_filter_(de|at)tach and sk_clone()Pavel Emelyanov
The proposed fix is to delay the reference counter decrement until the quiescent state pass. This will give sk_clone() a chance to get the reference on the cloned filter. Regular sk_filter_uncharge can happen from the sk_free() only and there's no need in delaying the put - the socket is dead anyway and is to be release itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Cleanup the error path in sk_attach_filterPavel Emelyanov
The sk_filter_uncharge is called for error handling and for releasing the former filter, but this will have to be done in a bit different manner, so cleanup the error path a bit. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Move the filter releasing into a separate callPavel Emelyanov
This is done merely as a preparation for the fix. The sk_filter_uncharge() unaccounts the filter memory and calls the sk_filter_release(), which in turn decrements the refcount anf frees the filter. The latter function will be required separately. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Introduce the sk_detach_filter() callPavel Emelyanov
Filter is attached in a separate function, so do the same for filter detaching. This also removes one variable sock_setsockopt(). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[NEIGH]: Ensure that pneigh_lookup is protected with RTNLPavel Emelyanov
The pnigh_lookup is used to lookup proxy entries and to create them in case lookup failed. However, the "creation" code does not perform the re-lookup after GFP_KERNEL allocation. This is done because the code is expected to be protected with the RTNL lock, so add the assertion (mainly to address future questions from new network developers like me :) ). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[NET]: Avoid copying TCP packets unnecessarilyHerbert Xu
TCP packets all have writable heads, that is, even though it's cloned, it is writable up to the end of the TCP header. This patch makes skb_checksum_help aware of this fact by using skb_clone_writable and avoiding a copy for TCP. I've also modified the BUG_ON tests to be unsigned. The only case where this makes a difference is if csum_start points to a location before skb->data. Since skb->data should always include the header where the checksum field is (and all currently callers adhere to that), this change is safe and may uncover bugs later. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[NET]: Fix csum_start update in pskb_expand_headHerbert Xu
I got confused by the dual nature of the off variable in the function pskb_expand_head. The csum_start offset should use nhead instead of off which can change depending on whether we are using offsets or pointers. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[NET]: Avoid unnecessary cloning for ingress filteringHerbert Xu
As it is we always invoke pt_prev before ing_filter, even if there are no ingress filters attached. This can cause unnecessary cloning in pt_prev. This patch changes it so that we only invoke pt_prev if there are ingress filters attached. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[SKBUFF]: Add skb_morphHerbert Xu
This patch creates a new function skb_morph that's just like skb_clone except that it lets user provide the spare skb that will be overwritten by the one that's to be cloned. This will be used by IP fragment reassembly so that we get back the same skb that went in last (rather than the head skb that we get now which requires us to carry around double pointers all over the place). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15[SKBUFF]: Merge common code between copy_skb_header and skb_cloneHerbert Xu
This patch creates a new function __copy_skb_header to merge the common code between copy_skb_header and skb_clone. Having two functions which are largely the same is a source of wasted labour as well as confusion. In fact the tc_verd stuff is almost certainly a bug since it's treated differently in skb_clone compared to the callers of copy_skb_header (skb_copy/pskb_copy/skb_copy_expand). I've kept that difference in tact with a comment added asking for clarification. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-13net core: fix kernel-doc for new function parametersRandy Dunlap
Fix networking code kernel-doc for newly added parameters. Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev' Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n' Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net' Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-12Driver core: change add_uevent_var to use a structKay Sievers
This changes the uevent buffer functions to use a struct instead of a long list of parameters. It does no longer require the caller to do the proper buffer termination and size accounting, which is currently wrong in some places. It fixes a known bug where parts of the uevent environment are overwritten because of wrong index calculations. Many thanks to Mathieu Desnoyers for finding bugs and improving the error handling. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-10[NET]: make netlink user -> kernel interface synchroniousDenis V. Lunev
This patch make processing netlink user -> kernel messages synchronious. This change was inspired by the talk with Alexey Kuznetsov about current netlink messages processing. He says that he was badly wrong when introduced asynchronious user -> kernel communication. The call netlink_unicast is the only path to send message to the kernel netlink socket. But, unfortunately, it is also used to send data to the user. Before this change the user message has been attached to the socket queue and sk->sk_data_ready was called. The process has been blocked until all pending messages were processed. The bad thing is that this processing may occur in the arbitrary process context. This patch changes nlk->data_ready callback to get 1 skb and force packet processing right in the netlink_unicast. Kernel -> user path in netlink_unicast remains untouched. EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock drop, but the process remains in the cycle until the message will be fully processed. So, there is no need to use this kludges now. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: rtnl_unlock cleanupsDenis V. Lunev
There is no need to process outstanding netlink user->kernel packets during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv anymore. Normal code path is the following: netlink_sendmsg netlink_unicast netlink_sendskb skb_queue_tail netlink_data_ready rtnetlink_rcv mutex_lock(&rtnl_mutex); netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg); mutex_unlock(&rtnl_mutex); So, it is possible, that packets can be present in the rtnl->sk_receive_queue during rtnl_unlock, but there is no need to process them at that moment as rtnetlink_rcv for that packet is pending. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Remove double dev->flags checking when calling dev_close()Pavel Emelyanov
The unregister_netdevice() and dev_change_net_namespace() both check for dev->flags to be IFF_UP before calling the dev_close(), but the dev_close() checks for IFF_UP itself, so remove those unneeded checks. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETNS]: Don't memset() netns to zero manuallyPavel Emelyanov
The newly created net namespace is set to 0 with memset() in setup_net(). The setup_net() is also called for the init_net_ns(), which is zeroed naturally as a global var. So remove this memset and allocate new nets with the kmem_cache_zalloc(). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETNS]: Move some code into __init section when CONFIG_NET_NS=nPavel Emelyanov
With the net namespaces many code leaved the __init section, thus making the kernel occupy more memory than it did before. Since we have a config option that prohibits the namespace creation, the functions that initialize/finalize some netns stuff are simply not needed and can be freed after the boot. Currently, this is almost not noticeable, since few calls are no longer in __init, but when the namespaces will be merged it will be possible to free more code. I propose to use the __net_init, __net_exit and __net_initdata "attributes" for functions/variables that are not used if the CONFIG_NET_NS is not set to save more space in memory. The exiting functions cannot just reside in the __exit section, as noticed by David, since the init section will have references on it and the compilation will fail due to modpost checks. These references can exist, since the init namespace never dies and the exit callbacks are never called. So I introduce the __exit_refok attribute just like it is already done with the __init_refok. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: split dev_ifsioc() according to lockingJeff Garzik
This always bugged me: dev_ioctl() called dev_ifsioc() either inside read_lock(dev_base_lock) or rtnl_lock(), depending on the ioctl being executed. This change moves the ioctls executed inside dev_base_lock to a new function, dev_ifsioc_locked(). Now the locking context is completely clear to the reader. Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: sparse warning fixesStephen Hemminger
Fix a bunch of sparse warnings. Mostly about 0 used as NULL pointer, and shadowed variable declarations. One notable case was that hash size should have been unsigned. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETNS]: Simplify the network namespace list locking rules.Eric W. Biederman
Denis V. Lunev <den@sw.ru> noticed that the locking rules for the network namespace list are over complicated and broken. In particular the current register_netdev_notifier currently does not take any lock making the for_each_net iteration racy with network namespace creation and destruction. Oops. The fact that we need to use for_each_net in rtnl_unlock() when the rtnetlink support becomes per network namespace makes designing the proper locking tricky. In addition we need to be able to call rtnl_lock() and rtnl_unlock() when we have the net_mutex held. After thinking about it and looking at the alternatives carefully it looks like the simplest and most maintainable solution is to remove net_list_mutex altogether, and to use the rtnl_mutex instead. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Move hardware header operations out of netdevice.Stephen Hemminger
Since hardware header operations are part of the protocol class not the device instance, make them into a separate object and save memory. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Wrap netdevice hardware header creation.Stephen Hemminger
Add inline for common usage of hardware header creation, and fix bug in IPV6 mcast where the assumption about negative return is an errno. Negative return from hard_header means not enough space was available,(ie -N bytes). Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make the loopback device per network namespace.Eric W. Biederman
This patch makes loopback_dev per network namespace. Adding code to create a different loopback device for each network namespace and adding the code to free a loopback device when a network namespace exits. This patch modifies all users the loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Add network namespace clone & unshare support.Eric W. Biederman
This patch allows you to create a new network namespace using sys_clone, or sys_unshare. As the network namespace is still experimental and under development clone and unshare support is only made available when CONFIG_NET_NS is selected at compile time. As this patch introduces network namespace support into code paths that exist when the CONFIG_NET is not selected there are a few additions made to net_namespace.h to allow a few more functions to be used when the networking stack is not compiled in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Fix running without sysfsEric W. Biederman
When sysfs support is compiled out the kernel still keeps and maintains the kobject tree. So it is not safe to skip our kobject reference counting or to avoid becoming members of the kobject tree. It is safe to not add the networking specific sysfs attributes. This patch removes the sysfs special cases from net/core/dev.c renames functions from netdev_sysfs_xxxx to netdev_kobject_xxxx and always compiles in net-sysfs.c net-sysfs.c is modified with a CONFIG_SYSFS guard around the parts that are actually sysfs specific. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Dynamically allocate the loopback device, part 1.Daniel Lezcano
This patch replaces all occurences to the static variable loopback_dev to a pointer loopback_dev. That provides the mindless, trivial, uninteressting change part for the dynamic allocation for the loopback. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-By: Kirill Korotaev <dev@sw.ru> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Introduce and use print_mac() and DECLARE_MAC_BUF()Joe Perches
This is nicer than the MAC_FMT stuff. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETNS]: Cleanup list walking in setup_net and cleanup_netPavel Emelyanov
I proposed introducing a list_for_each_entry_continue_reverse macro to be used in setup_net() when unrolling the failed ->init callback. Here is the macro and some more cleanup in the setup_net() itself to remove one variable from the stack :) The same thing is for the cleanup_net() - the existing list_for_each_entry_reverse() is used. Minor, but the code looks nicer. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[SKBUFF]: Fix up csum_start when head room changesHerbert Xu
Thanks for noticing the bug where csum_start is not updated when the head room changes. This patch fixes that. It also moves the csum/ip_summed copying into copy_skb_header so that skb_copy_expand gets it too. I've checked its callers and no one should be upset by this. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETLINK]: Avoid pointer in netlink_run_queueHerbert Xu
I was looking at Patrick's fix to inet_diag and it occured to me that we're using a pointer argument to return values unnecessarily in netlink_run_queue. Changing it to return the value will allow the compiler to generate better code since the value won't have to be memory-backed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[IPV4/IPV6/DECNET]: Small cleanup for fib rules.Denis V. Lunev
This patch slightly cleanups FIB rules framework. rules_list as a pointer on struct fib_rules_ops is useless. It is always assigned with a static per/subsystem list in IPv4, IPv6 and DecNet. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Cleanup calling netdev notifiers.Pavel Emelyanov
The call_netdev_notifiers routine can successfully be used in the net/core_dev.c itself. This will save 6 lines of code and 62 ;) bytes of .text section. 62 is rather small, but I have one more patch saving ~30 bytes from netns code (sent to Eric), so altogether they can save some more noticeable amount. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NETNS]: Consolidate hashes creation in netdev_init()Pavel Emelyanov
The dev_name_hash and the dev_index_hash are now booth kmalloc-ed (and each element is properly initialized as usually) so I think it's worth consolidating this code making it look nicer (and saving 28 bytes of .text section ;) ) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Fix the prototype of call_netdevice_notifiers.Eric W. Biederman
This replaces the void * parameter with a struct net_device * which is what is actually required. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: migrate HARD_TX_LOCK to header fileJamal Hadi Salim
HARD_TX_LOCK micro is a nice aggregation that could be used in other spots. move it to netdevice.h Also makes sure the previously superflous cpu arguement is used. Thanks to DaveM for the suggestions. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[ETHTOOL] Provide default behaviors for a few ethtool sub-ioctlsJeff Garzik
For the operations get-tx-csum get-sg get-tso get-ufo the default ethtool_op_xxx behavior is fine for all drivers, so we permit op==NULL to imply the default behavior. This provides a more uniform behavior across all drivers, eliminating ethtool(8) "ioctl not supported" errors on older drivers that had not been updated for the latest sub-ioctls. The ethtool_op_xxx() functions are left exported, in case anyone wishes to call them directly from a driver-private implementation -- a not-uncommon case. Should an ethtool_op_xxx() helper remain unused for a while, except by net/core/ethtool.c, we can un-export it at a later date. [ Resolved conflicts with set/get value ethtool patch... -DaveM ] Signed-off-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Fix race when opening a proc file while a network namespace is exiting.Eric W. Biederman
The problem: proc_net files remember which network namespace the are against but do not remember hold a reference count (as that would pin the network namespace). So we currently have a small window where the reference count on a network namespace may be incremented when opening a /proc file when it has already gone to zero. To fix this introduce maybe_get_net and get_proc_net. maybe_get_net increments the network namespace reference count only if it is greater then zero, ensuring we don't increment a reference count after it has gone to zero. get_proc_net handles all of the magic to go from a proc inode to the network namespace instance and call maybe_get_net on it. PROC_NET the old accessor is removed so that we don't get confused and use the wrong helper function. Then I fix up the callers to use get_proc_net and handle the case case where get_proc_net returns NULL. In that case I return -ENXIO because effectively the network namespace has already gone away so the files we are trying to access don't exist anymore. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Add a might_sleep() to dev_close().David S. Miller
Requested by Johannes Berg. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[PATCH] NET : convert IP route cache garbage collection from softirq ↵Eric Dumazet
processing to a workqueue When the periodic IP route cache flush is done (every 600 seconds on default configuration), some hosts suffer a lot and eventually trigger the "soft lockup" message. dst_run_gc() is doing a scan of a possibly huge list of dst_entries, eventually freeing some (less than 1%) of them, while holding the dst_lock spinlock for the whole scan. Then it rearms a timer to redo the full thing 1/10 s later... The slowdown can last one minute or so, depending on how active are the tcp sessions. This second version of the patch converts the processing from a softirq based one to a workqueue. Even if the list of entries in garbage_list is huge, host is still responsive to softirqs and can make progress. Instead of resetting gc timer to 0.1 second if one entry was freed in a gc run, we do this if more than 10% of entries were freed. Before patch : Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0! Aug 16 06:21:37 SRV1 kernel: Aug 16 06:21:37 SRV1 kernel: Call Trace: Aug 16 06:21:37 SRV1 kernel: <IRQ> [<ffffffff802286f0>] wake_up_process+0x10/0x20 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80251e09>] softlockup_tick+0xe9/0x110 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803cd380>] dst_run_gc+0x0/0x140 Aug 16 06:21:37 SRV1 kernel: [<ffffffff802376f3>] run_local_timers+0x13/0x20 Aug 16 06:21:37 SRV1 kernel: [<ffffffff802379c7>] update_process_times+0x57/0x90 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80216034>] smp_local_timer_interrupt+0x34/0x60 Aug 16 06:21:37 SRV1 kernel: [<ffffffff802165cc>] smp_apic_timer_interrupt+0x5c/0x80 Aug 16 06:21:37 SRV1 kernel: [<ffffffff8020a816>] apic_timer_interrupt+0x66/0x70 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803cd3d3>] dst_run_gc+0x53/0x140 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803cd3c6>] dst_run_gc+0x46/0x140 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80237148>] run_timer_softirq+0x148/0x1c0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff8023340c>] __do_softirq+0x6c/0xe0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff8020ad6c>] call_softirq+0x1c/0x30 Aug 16 06:21:37 SRV1 kernel: <EOI> [<ffffffff8020cb34>] do_softirq+0x34/0x90 Aug 16 06:21:37 SRV1 kernel: [<ffffffff802331cf>] local_bh_enable_ip+0x3f/0x60 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80422913>] _spin_unlock_bh+0x13/0x20 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803dfde8>] rt_garbage_collect+0x1d8/0x320 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803cd4dd>] dst_alloc+0x1d/0xa0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803e1433>] __ip_route_output_key+0x573/0x800 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803c02e2>] sock_common_recvmsg+0x32/0x50 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803e16dc>] ip_route_output_flow+0x1c/0x60 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80400160>] tcp_v4_connect+0x150/0x610 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803ebf07>] inet_bind_bucket_create+0x17/0x60 Aug 16 06:21:37 SRV1 kernel: [<ffffffff8040cd16>] inet_stream_connect+0xa6/0x2c0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80422981>] _spin_lock_bh+0x11/0x30 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803c0bdf>] lock_sock_nested+0xcf/0xe0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80422981>] _spin_lock_bh+0x11/0x30 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803be551>] sys_connect+0x71/0xa0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803eee3f>] tcp_setsockopt+0x1f/0x30 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803c030f>] sock_common_setsockopt+0xf/0x20 Aug 16 06:21:37 SRV1 kernel: [<ffffffff803be4bd>] sys_setsockopt+0x9d/0xc0 Aug 16 06:21:37 SRV1 kernel: [<ffffffff8028881e>] sys_ioctl+0x5e/0x80 Aug 16 06:21:37 SRV1 kernel: [<ffffffff80209c4e>] system_call+0x7e/0x83 After patch : (RT_CACHE_DEBUG set to 2 to get following traces) dst_total: 75469 delayed: 74109 work_perf: 141 expires: 150 elapsed: 8092 us dst_total: 78725 delayed: 73366 work_perf: 743 expires: 400 elapsed: 8542 us dst_total: 86126 delayed: 71844 work_perf: 1522 expires: 775 elapsed: 8849 us dst_total: 100173 delayed: 68791 work_perf: 3053 expires: 1256 elapsed: 9748 us dst_total: 121798 delayed: 64711 work_perf: 4080 expires: 1997 elapsed: 10146 us dst_total: 154522 delayed: 58316 work_perf: 6395 expires: 25 elapsed: 11402 us dst_total: 154957 delayed: 58252 work_perf: 64 expires: 150 elapsed: 6148 us dst_total: 157377 delayed: 57843 work_perf: 409 expires: 400 elapsed: 6350 us dst_total: 163745 delayed: 56679 work_perf: 1164 expires: 775 elapsed: 7051 us dst_total: 176577 delayed: 53965 work_perf: 2714 expires: 1389 elapsed: 8120 us dst_total: 198993 delayed: 49627 work_perf: 4338 expires: 1997 elapsed: 8909 us dst_total: 226638 delayed: 46865 work_perf: 2762 expires: 2748 elapsed: 7351 us I successfully reduced the IP route cache of many hosts by a four factor thanks to this patch. Previously, I had to disable "ip route flush cache" to avoid crashes. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: #if 0 out net_alloc() for now.David S. Miller
We will undo this once it is actually used. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: netlink support for moving devices between network namespaces.Eric W. Biederman
The simplest thing to implement is moving network devices between namespaces. However with the same attribute IFLA_NET_NS_PID we can easily implement creating devices in the destination network namespace as well. However that is a little bit trickier so this patch sticks to what is simple and easy. A pid is used to identify a process that happens to be a member of the network namespace we want to move the network device to. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Implement network device movement between namespacesEric W. Biederman
This patch introduces NETIF_F_NETNS_LOCAL a flag to indicate a network device is local to a single network namespace and should never be moved. Useful for pseudo devices that we need an instance in each network namespace (like the loopback device) and for any device we find that cannot handle multiple network namespaces so we may trap them in the initial network namespace. This patch introduces the function dev_change_net_namespace a function used to move a network device from one network namespace to another. To the network device nothing special appears to happen, to the components of the network stack it appears as if the network device was unregistered in the network namespace it is in, and a new device was registered in the network namespace the device was moved to. This patch sets up a namespace device destructor that upon the exit of a network namespace moves all of the movable network devices to the initial network namespace so they are not lost. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Factor out __dev_alloc_name from dev_alloc_nameEric W. Biederman
When forcibly changing the network namespace of a device I need something that can generate a name for the device in the new namespace without overwriting the old name. __dev_alloc_name provides me that functionality. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make the device list and device lookups per namespace.Eric W. Biederman
This patch makes most of the generic device layer network namespace safe. This patch makes dev_base_head a network namespace variable, and then it picks up a few associated variables. The functions: dev_getbyhwaddr dev_getfirsthwbytype dev_get_by_flags dev_get_by_name __dev_get_by_name dev_get_by_index __dev_get_by_index dev_ioctl dev_ethtool dev_load wireless_process_ioctl were modified to take a network namespace argument, and deal with it. vlan_ioctl_set and brioctl_set were modified so their hooks will receive a network namespace argument. So basically anthing in the core of the network stack that was affected to by the change of dev_base was modified to handle multiple network namespaces. The rest of the network stack was simply modified to explicitly use &init_net the initial network namespace. This can be fixed when those components of the network stack are modified to handle multiple network namespaces. For now the ifindex generator is left global. Fundametally ifindex numbers are per namespace, or else we will have corner case problems with migration when we get that far. At the same time there are assumptions in the network stack that the ifindex of a network device won't change. Making the ifindex number global seems a good compromise until the network stack can cope with ifindex changes when you change namespaces, and the like. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Support multiple network namespaces with netlinkEric W. Biederman
Each netlink socket will live in exactly one network namespace, this includes the controlling kernel sockets. This patch updates all of the existing netlink protocols to only support the initial network namespace. Request by clients in other namespaces will get -ECONREFUSED. As they would if the kernel did not have the support for that netlink protocol compiled in. As each netlink protocol is updated to be multiple network namespace safe it can register multiple kernel sockets to acquire a presence in the rest of the network namespaces. The implementation in af_netlink is a simple filter implementation at hash table insertion and hash table look up time. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make device event notification network namespace safeEric W. Biederman
Every user of the network device notifiers is either a protocol stack or a pseudo device. If a protocol stack that does not have support for multiple network namespaces receives an event for a device that is not in the initial network namespace it quite possibly can get confused and do the wrong thing. To avoid problems until all of the protocol stacks are converted this patch modifies all netdev event handlers to ignore events on devices that are not in the initial network namespace. As the rest of the code is made network namespace aware these checks can be removed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Initialize the network namespace of network devices.Eric W. Biederman
Except for carefully selected pseudo devices all network interfaces should start out in the initial network namespace. Ultimately it will be register_netdev that examines what dev->nd_net is set to and places a device in a network namespace. This patch modifies alloc_netdev to initialize the network namespace a device is in with the initial network namespace. This gets it right for the vast majority of devices so their drivers need not be modified and for those few pseudo devices that need something different they can change this parameter before calling register_netdevice. The network namespace parameter on a network device is not reference counted as the devices are inside of a network namespace and cannot remain in that namespace past the lifetime of the network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make socket creation namespace safe.Eric W. Biederman
This patch passes in the namespace a new socket should be created in and has the socket code do the appropriate reference counting. By virtue of this all socket create methods are touched. In addition the socket create methods are modified so that they will fail if you attempt to create a socket in a non-default network namespace. Failing if we attempt to create a socket outside of the default network namespace ensures that as we incrementally make the network stack network namespace aware we will not export functionality that someone has not audited and made certain is network namespace safe. Allowing us to partially enable network namespaces before all of the exotic protocols are supported. Any protocol layers I have missed will fail to compile because I now pass an extra parameter into the socket creation code. [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make /proc/net per network namespaceEric W. Biederman
This patch makes /proc/net per network namespace. It modifies the global variables proc_net and proc_net_stat to be per network namespace. The proc_net file helpers are modified to take a network namespace argument, and all of their callers are fixed to pass &init_net for that argument. This ensures that all of the /proc/net files are only visible and usable in the initial network namespace until the code behind them has been updated to be handle multiple network namespaces. Making /proc/net per namespace is necessary as at least some files in /proc/net depend upon the set of network devices which is per network namespace, and even more files in /proc/net have contents that are relevant to a single network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>