aboutsummaryrefslogtreecommitdiff
path: root/include/net/sock.h
AgeCommit message (Collapse)Author
2008-10-07net: wrap sk->sk_backlog_rcv()Peter Zijlstra
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the generic sk_backlog_rcv behaviour. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07inet: Don't lookup the socket if there's a socket attached to the skbKOVACS Krisztian
Use the socket cached in the skb if it's present. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-28net: more #ifdef CONFIG_COMPATAlexey Dobriyan
All users of struct proto::compat_[gs]etsockopt and struct inet_connection_sock_af_ops::compat_[gs]etsockopt are under #ifdef already, so use it in structure definition too. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16sock: add net to prot->enter_memory_pressure callbackPavel Emelyanov
The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't have where to get the net from. I decided to add a sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-17net: Add sk_set_socket() helper.David S. Miller
In order to more easily grep for all things that set sk->sk_socket, add sk_set_socket() helper inline function. Suggested (although only half-seriously) by Evgeniy Polyakov. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-17udp: sk_drops handlingEric Dumazet
In commits 33c732c36169d7022ad7d6eb474b0c9be43a2dc1 ([IPV4]: Add raw drops counter) and a92aa318b4b369091fd80433c80e62838db8bc1c ([IPV6]: Add raw drops counter), Wang Chen added raw drops counter for /proc/net/raw & /proc/net/raw6 This patch adds this capability to UDP sockets too (/proc/net/udp & /proc/net/udp6). This means that 'RcvbufErrors' errors found in /proc/net/snmp can be also be examined for each udp socket. # grep Udp: /proc/net/snmp Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 23971006 75 899420 16390693 146348 0 # cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt --- uid timeout inode ref pointer drops 75: 00000000:02CB 00000000:0000 07 00000000:00000000 00:00000000 00000000 --- 0 0 2358 2 ffff81082a538c80 0 111: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 --- 0 0 2286 2 ffff81042dd35c80 146348 In this example, only port 111 (0x006F) was flooded by messages that user program could not read fast enough. 146348 messages were lost. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-17net: Kill SOCK_SLEEP_PRE and SOCK_SLEEP_POST, no users.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-14net: change proto destroy method to return voidBrian Haley
Change struct proto destroy function pointer to return void. Noticed by Al Viro. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16[NETNS]: Add netns refcnt debug for kernel sockets.Denis V. Lunev
Protocol control sockets and netlink kernel sockets should not prevent the namespace stop request. They are initialized and disposed in a special way by sk_change_net/sk_release_kernel. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10socket: sk_filter deinlineStephen Hemminger
The sk_filter function is too big to be inlined. This saves 2296 bytes of text on allyesconfig. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-31[SOCK][NETNS]: Add a struct net argument to sock_prot_inuse_add and _get.Pavel Emelyanov
This counter is about to become per-proto-and-per-net, so we'll need two arguments to determine which cell in this "table" to work with. All the places, but proc already pass proper net to it - proc will be tuned a bit later. Some indentation with spaces in proc files is done to keep the file coding style consistent. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28[SOCK]: Drop inuse pcounter from struct proto (v2).Pavel Emelyanov
An uppercut - do not use the pcounter on struct proto. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28[SOCK]: Drop per-proto inuse init and fre functions (v2).Pavel Emelyanov
Constructive part of the set is finished here. We have to remove the pcounter, so start with its init and free functions. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28[SOCK]: Introduce a percpu inuse counters array (v2).Pavel Emelyanov
And redirect sock_prot_inuse_add and _get to use one. As far as the dereferences are concerned. Before the patch we made 1 dereference to proto->inuse.add call, the call itself and then called the __get_cpu_var() on a static variable. After the patch we make a direct call, then one dereference to proto->inuse_idx and then the same __get_cpu_var() on a still static variable. So this patch doesn't seem to produce performance penalty on SMP. This is not per-net yet, but I will deliberately make NET_NS=y case separated from NET_NS=n one, since it'll cost us one-or-two more dereferences to get the struct net and the inuse counter. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28[SOCK]: Enumerate struct proto-s to facilitate percpu inuse accounting (v2).Pavel Emelyanov
The inuse counters are going to become a per-cpu array. Introduce an index for this array on the struct proto. To handle the case of proto register-unregister-register loop the bitmap is used. All its bits manipulations are protected with proto_list_lock and a sanity check for the bitmap being exhausted is also added. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26[NETNS]: Compilation warnings under CONFIG_NET_NS.Denis V. Lunev
Recent commits from YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> have been introduced a several compilation warnings 'assignment discards qualifiers from pointer target type' due to extra const modifier in the inline call parameters of {dev|sock|twsk}_net_set. Drop it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.YOSHIFUJI Hideaki
Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-22[RAW]: Add raw_hashinfo member on struct proto.Pavel Emelyanov
Sorry for the patch sequence confusion :| but I found that the similar thing can be done for raw sockets easily too late. Expand the proto.h union with the raw_hashinfo member and use it in raw_prot and rawv6_prot. This allows to drop the protocol specific versions of hash and unhash callbacks. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22[SOCK]: Add udp_hash member to struct proto.Pavel Emelyanov
Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4 and -v6 protocols. The result is not that exciting, but it removes some levels of indirection in udpxxx_get_port and saves some space in code and text. The first step is to union existing hashinfo and new udp_hash on the struct proto and give a name to this union, since future initialization of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc warning about inability to initialize anonymous member this way. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21socket: SOCK_DEBUG type checkingStephen Hemminger
Use the inline trick (same as pr_debug) to get checking of debug statements even if no code is generated. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21[NET]: Add per-connection option to set max TSO frame sizePeter P Waskiewicz Jr
Update: My mailer ate one of Jarek's feedback mails... Fixed the parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the whitespace issue due to a patch import botch. Changed the types from u32 to unsigned int to be more consistent with other variables in the area. Also brought the patch up to the latest net-2.6.26 tree. Update: Made gso_max_size container 32 bits, not 16. Moved the location of gso_max_size within netdev to be less hotpath. Made more consistent names between the sock and netdev layers, and added a define for the max GSO size. Update: Respun for net-2.6.26 tree. Update: changed max_gso_frame_size and sk_gso_max_size from signed to unsigned - thanks Stephen! This patch adds the ability for device drivers to control the size of the TSO frames being sent to them, per TCP connection. By setting the netdevice's gso_max_size value, the socket layer will set the GSO frame size based on that value. This will propogate into the TCP layer, and send TSO's of that size to the hardware. This can be desirable to help tune the bursty nature of TSO on a per-adapter basis, where one may have 1 GbE and 10 GbE devices coexisting in a system, one running multiqueue and the other not, etc. This can also be desirable for devices that cannot support full 64 KB TSO's, but still want to benefit from some level of segmentation offloading. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29[NET]: Make netlink_kernel_release publically available as sk_release_kernel.Denis V. Lunev
This staff will be needed for non-netlink kernel sockets, which should also not pin a namespace like tcp_socket and icmp_socket. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-18net: fix kernel-doc warnings in header filesRandy Dunlap
Add missing structure kernel-doc descriptions to sock.h & skbuff.h to fix kernel-doc warnings. (I think that Stephen H. sent a similar patch, but I can't find it. I just want to kill the warnings, with either patch.) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03[SOCK] proto: Add hashinfo member to struct protoArnaldo Carvalho de Melo
This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NET]: Introducing socket mark socket option.Laszlo Attila Toth
A userspace program may wish to set the mark for each packets its send without using the netfilter MARK target. Changing the mark can be used for mark based routing without netfilter or for packet filtering. It requires CAP_NET_ADMIN capability. Signed-off-by: Laszlo Attila Toth <panther@balabit.hu> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer().David S. Miller
Otherwise we beat heavily on the global tcp_memory atomics when all of the sockets in the system are slowly sending perioding packet clumps. Noticed and suggested by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: prot_inuse cleanups and optimizationsEric Dumazet
1) Cleanups (all functions are prefixed by sock_prot_inuse) sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_inuse() -> sock_prot_inuse_get() New functions : sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use. 2) if CONFIG_PROC_FS=n, we can zap 'inuse' member from "struct proto", since nobody wants to read the inuse value. This saves 1372 bytes on i386/SMP and some cpu cycles. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET] CORE: Introducing new memory accounting interface.Hideo Aoki
This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do same thing. Renaming: sk_stream_free_skb() -> sk_wmem_free_skb() __sk_stream_mem_reclaim() -> __sk_mem_reclaim() sk_stream_mem_reclaim() -> sk_mem_reclaim() sk_stream_mem_schedule -> __sk_mem_schedule() sk_stream_pages() -> sk_mem_pages() sk_stream_rmem_schedule() -> sk_rmem_schedule() sk_stream_wmem_schedule() -> sk_wmem_schedule() sk_charge_skb() -> sk_mem_charge() Removeing sk_stream_rfree(): consolidates into sock_rfree() sk_stream_set_owner_r(): consolidates into skb_set_owner_r() sk_stream_mem_schedule() The following functions are added. sk_has_account(): check if the protocol supports accounting sk_mem_uncharge(): do the opposite of sk_mem_charge() In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, present memory accounting call is renamed to new accounting call. Finally we replace present memory accounting calls with new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[SOCK] Avoid divides in sk_stream_pages() and __sk_stream_mem_reclaim()Eric Dumazet
sk_forward_alloc being signed, we should take care of divides by SK_STREAM_MEM_QUANTUM we do in sk_stream_pages() and __sk_stream_mem_reclaim() This patchs introduces SK_STREAM_MEM_QUANTUM_SHIFT, defined as ilog2(SK_STREAM_MEM_QUANTUM), to be able to use right shifts instead of plain divides. This should help compiler to choose right shifts instead of expensive divides (as seen with CONFIG_CC_OPTIMIZE_FOR_SIZE=y on x86) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[SOCK] Avoid integer divides where not necessary in include/net/sock.hEric Dumazet
Because sk_wmem_queued, sk_sndbuf are signed, a divide per two may force compiler to use an integer divide. We can instead use a right shift. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Remove FASTCALL macroHarvey Harrison
X86_32 was the last user of the FASTCALL macro, now that it uses regparm(3) by default, this macro expands to nothing. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Isolate the net/core/ sysctl tablePavel Emelyanov
Using ctl paths we can put all the stuff, related to net/core/ sysctl table, into one file and remove all the references on it. As a good side effect this hides the "core_table" name from the global scope :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: netns compilation speedupDenis V. Lunev
This patch speedups compilation when net_namespace.h is changed. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Eliminate unused argument from sk_stream_alloc_pskbPavel Emelyanov
The 3rd argument is always zero (according to grep :) Eliminate it and merge the function with sk_stream_alloc_skb. This saves 44 more bytes, and together with the previous patch we have: add/remove: 1/0 grow/shrink: 0/8 up/down: 183/-751 (-568) function old new delta sk_stream_alloc_skb - 183 +183 ip_rt_init 529 525 -4 arp_ignore 112 107 -5 __inet_lookup_listener 284 274 -10 tcp_sendmsg 2583 2481 -102 tcp_sendpage 1449 1300 -149 tso_fragment 417 258 -159 tcp_fragment 1149 988 -161 __tcp_push_pending_frames 1998 1837 -161 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Uninline the sk_stream_alloc_pskbPavel Emelyanov
This function seems too big for inlining. Indeed, it saves half-a-kilo when uninlined: add/remove: 1/0 grow/shrink: 0/7 up/down: 195/-719 (-524) function old new delta sk_stream_alloc_pskb - 195 +195 ip_rt_init 529 525 -4 __inet_lookup_listener 284 274 -10 tcp_sendmsg 2583 2486 -97 tcp_sendpage 1449 1305 -144 tso_fragment 417 267 -150 tcp_fragment 1149 992 -157 __tcp_push_pending_frames 1998 1841 -157 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET] proto: Use pcounters for the inuse fieldArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Move sock_valbool_flag to socket.cPavel Emelyanov
The sock_valbool_flag() helper is used in setsockopt to set or reset some flag on the sock. This helper is required in the net/socket.c only, so move it there. Besides, patch two places in sys_setsockopt() that repeat this helper functionality manually. Since this is not a bugfix, but a trivial cleanup, I prepared this patch against net-2.6.25, but it also applies (with a single offset) to the latest net-2.6. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPV4]: Add raw drops counter.Wang Chen
Add raw drops counter for IPv4 in /proc/net/raw . Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08[SOCK]: Adds a rcu_dereference() in sk_filterEric Dumazet
It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU protection for sk_filter(), without a rcu_dereference() Either we need a rcu_dereference(), either a comment should explain why we dont need it. I vote for the former. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-18[TCP]: Fix TCP header misalignmentHerbert Xu
Indeed my previous change to alloc_pskb has made it possible for the TCP header to be misaligned iff the MTU is not a multiple of 4 (and less than a page). So I suspect the optimised IPsec MTU calculation is giving you just such an MTU :) This patch fixes it by changing alloc_pskb to make sure that the size is at least 32-bit aligned. This does not cause the problem fixed by the previous patch because max_header is always 32-bit aligned which means that in the SG/NOTSO case this will be a no-op. I thought about putting this in the callers but all the current callers are from TCP. If and when we get a non-TCP caller we can always create a TCP wrapper for this function and move the alignment over there. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14[TCP]: Fix size calculation in sk_stream_alloc_pskbHerbert Xu
We round up the header size in sk_stream_alloc_pskb so that TSO packets get zero tail room. Unfortunately this rounding up is not coordinated with the select_size() function used by TCP to calculate the second parameter of sk_stream_alloc_pskb. As a result, we may allocate more than a page of data in the non-TSO case when exactly one page is desired. In fact, rounding up the head room is detrimental in the non-TSO case because it makes memory that would otherwise be available to the payload head room. TSO doesn't need this either, all it wants is the guarantee that there is no tail room. So this patch fixes this by adjusting the skb_reserve call so that exactly the requested amount (which all callers have calculated in a precise way) is made available as tail room. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07[NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA ↵Eric Dumazet
way. "struct proto" currently uses an array stats[NR_CPUS] to track change on 'inuse' sockets per protocol. If NR_CPUS is big, this means we use a big memory area for this. Moreover, all this memory area is located on a single node on NUMA machines, increasing memory pressure on the boot node. In this patch, I tried to : - Keep a fast !CONFIG_SMP implementation - Keep a fast CONFIG_SMP implementation for often used protocols (tcp,udp,raw,...) - Introduce a NUMA efficient implementation Some helper macros are defined in include/net/sock.h These macros take into account CONFIG_SMP If a "struct proto" is declared without using DEFINE_PROTO_INUSE / REF_PROTO_INUSE macros, it will automatically use a default implementation, using a dynamically allocated percpu zone. This default implementation will be NUMA efficient, but might use 32/64 bytes per possible cpu because of current alloc_percpu() implementation. However it still should be better than previous implementation based on stats[NR_CPUS] field. When a "struct proto" is changed to use the new macros, we use a single static "int" percpu variable, lowering the memory and cpu costs, still preserving NUMA efficiency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01[NET]: Forget the zero_it argument of sk_alloc()Pavel Emelyanov
Finally, the zero_it argument can be completely removed from the callers and from the function prototype. Besides, fix the checkpatch.pl warnings about using the assignments inside if-s. This patch is rather big, and it is a part of the previous one. I splitted it wishing to make the patches more readable. Hope this particular split helped. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01[NET]: Move the sock_copy() from the headerPavel Emelyanov
The sock_copy() call is not used outside the sock.c file, so just move it into a sock.c Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Fix the race between sk_filter_(de|at)tach and sk_clone()Pavel Emelyanov
The proposed fix is to delay the reference counter decrement until the quiescent state pass. This will give sk_clone() a chance to get the reference on the cloned filter. Regular sk_filter_uncharge can happen from the sk_free() only and there's no need in delaying the put - the socket is dead anyway and is to be release itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[NET]: Move the filter releasing into a separate callPavel Emelyanov
This is done merely as a preparation for the fix. The sk_filter_uncharge() unaccounts the filter memory and calls the sk_filter_release(), which in turn decrements the refcount anf frees the filter. The latter function will be required separately. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: sparse warning fixesStephen Hemminger
Fix a bunch of sparse warnings. Mostly about 0 used as NULL pointer, and shadowed variable declarations. One notable case was that hash size should have been unsigned. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make socket creation namespace safe.Eric W. Biederman
This patch passes in the namespace a new socket should be created in and has the socket code do the appropriate reference counting. By virtue of this all socket create methods are touched. In addition the socket create methods are modified so that they will fail if you attempt to create a socket in a non-default network namespace. Failing if we attempt to create a socket outside of the default network namespace ensures that as we incrementally make the network stack network namespace aware we will not export functionality that someone has not audited and made certain is network namespace safe. Allowing us to partially enable network namespaces before all of the exotic protocols are supported. Any protocol layers I have missed will fail to compile because I now pass an extra parameter into the socket creation code. [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Add a network namespace parameter to struct sockEric W. Biederman
Sockets need to get a reference to their network namespace, or possibly a simple hold if someone registers on the network namespace notifier and will free the sockets when the namespace is going to be destroyed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Change type of owner in sock_lock_t to int, renameJohn Heffner
The type of owner in sock_lock_t is currently (struct sock_iocb *), presumably for historical reasons. It is never used as this type, only tested as NULL or set to (void *)1. For clarity, this changes it to type int, and renames to owned, to avoid any possible type casting errors. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>