diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/operstates.txt | 161 | ||||
-rw-r--r-- | Documentation/networking/xfrm_sync.txt | 166 |
2 files changed, 327 insertions, 0 deletions
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt new file mode 100644 index 00000000000..4a21d9bb836 --- /dev/null +++ b/Documentation/networking/operstates.txt @@ -0,0 +1,161 @@ + +1. Introduction + +Linux distinguishes between administrative and operational state of an +interface. Admininstrative state is the result of "ip link set dev +<dev> up or down" and reflects whether the administrator wants to use +the device for traffic. + +However, an interface is not usable just because the admin enabled it +- ethernet requires to be plugged into the switch and, depending on +a site's networking policy and configuration, an 802.1X authentication +to be performed before user data can be transferred. Operational state +shows the ability of an interface to transmit this user data. + +Thanks to 802.1X, userspace must be granted the possibility to +influence operational state. To accommodate this, operational state is +split into two parts: Two flags that can be set by the driver only, and +a RFC2863 compatible state that is derived from these flags, a policy, +and changeable from userspace under certain rules. + + +2. Querying from userspace + +Both admin and operational state can be queried via the netlink +operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK +to be notified of updates. This is important for setting from userspace. + +These values contain interface state: + +ifinfomsg::if_flags & IFF_UP: + Interface is admin up +ifinfomsg::if_flags & IFF_RUNNING: + Interface is in RFC2863 operational state UP or UNKNOWN. This is for + backward compatibility, routing daemons, dhcp clients can use this + flag to determine whether they should use the interface. +ifinfomsg::if_flags & IFF_LOWER_UP: + Driver has signaled netif_carrier_on() +ifinfomsg::if_flags & IFF_DORMANT: + Driver has signaled netif_dormant_on() + +These interface flags can also be queried without netlink using the +SIOCGIFFLAGS ioctl. + +TLV IFLA_OPERSTATE + +contains RFC2863 state of the interface in numeric representation: + +IF_OPER_UNKNOWN (0): + Interface is in unknown state, neither driver nor userspace has set + operational state. Interface must be considered for user data as + setting operational state has not been implemented in every driver. +IF_OPER_NOTPRESENT (1): + Unused in current kernel (notpresent interfaces normally disappear), + just a numerical placeholder. +IF_OPER_DOWN (2): + Interface is unable to transfer data on L1, f.e. ethernet is not + plugged or interface is ADMIN down. +IF_OPER_LOWERLAYERDOWN (3): + Interfaces stacked on an interface that is IF_OPER_DOWN show this + state (f.e. VLAN). +IF_OPER_TESTING (4): + Unused in current kernel. +IF_OPER_DORMANT (5): + Interface is L1 up, but waiting for an external event, f.e. for a + protocol to establish. (802.1X) +IF_OPER_UP (6): + Interface is operational up and can be used. + +This TLV can also be queried via sysfs. + +TLV IFLA_LINKMODE + +contains link policy. This is needed for userspace interaction +described below. + +This TLV can also be queried via sysfs. + + +3. Kernel driver API + +Kernel drivers have access to two flags that map to IFF_LOWER_UP and +IFF_DORMANT. These flags can be set from everywhere, even from +interrupts. It is guaranteed that only the driver has write access, +however, if different layers of the driver manipulate the same flag, +the driver has to provide the synchronisation needed. + +__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: + +The driver uses netif_carrier_on() to clear and netif_carrier_off() to +set this flag. On netif_carrier_off(), the scheduler stops sending +packets. The name 'carrier' and the inversion are historical, think of +it as lower layer. + +netif_carrier_ok() can be used to query that bit. + +__LINK_STATE_DORMANT, maps to IFF_DORMANT: + +Set by the driver to express that the device cannot yet be used +because some driver controlled protocol establishment has to +complete. Corresponding functions are netif_dormant_on() to set the +flag, netif_dormant_off() to clear it and netif_dormant() to query. + +On device allocation, networking core sets the flags equivalent to +netif_carrier_ok() and !netif_dormant(). + + +Whenever the driver CHANGES one of these flags, a workqueue event is +scheduled to translate the flag combination to IFLA_OPERSTATE as +follows: + +!netif_carrier_ok(): + IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN + otherwise. Kernel can recognise stacked interfaces because their + ifindex != iflink. + +netif_carrier_ok() && netif_dormant(): + IF_OPER_DORMANT + +netif_carrier_ok() && !netif_dormant(): + IF_OPER_UP if userspace interaction is disabled. Otherwise + IF_OPER_DORMANT with the possibility for userspace to initiate the + IF_OPER_UP transition afterwards. + + +4. Setting from userspace + +Applications have to use the netlink interface to influence the +RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 +via RTM_SETLINK instructs the kernel that an interface should go to +IF_OPER_DORMANT instead of IF_OPER_UP when the combination +netif_carrier_ok() && !netif_dormant() is set by the +driver. Afterwards, the userspace application can set IFLA_OPERSTATE +to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set +netif_carrier_off() or netif_dormant_on(). Changes made by userspace +are multicasted on the netlink group RTMGRP_LINK. + +So basically a 802.1X supplicant interacts with the kernel like this: + +-subscribe to RTMGRP_LINK +-set IFLA_LINKMODE to 1 via RTM_SETLINK +-query RTM_GETLINK once to get initial state +-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until + netlink multicast signals this state +-do 802.1X, eventually abort if flags go down again +-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication + succeeds, IF_OPER_DORMANT otherwise +-see how operstate and IFF_RUNNING is echoed via netlink multicast +-set interface back to IF_OPER_DORMANT if 802.1X reauthentication + fails +-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag + +if supplicant goes down, bring back IFLA_LINKMODE to 0 and +IFLA_OPERSTATE to a sane value. + +A routing daemon or dhcp client just needs to care for IFF_RUNNING or +waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before +considering the interface / querying a DHCP address. + + +For technical questions and/or comments please e-mail to Stefan Rompf +(stefan at loplof.de). diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt new file mode 100644 index 00000000000..8be626f7c0b --- /dev/null +++ b/Documentation/networking/xfrm_sync.txt @@ -0,0 +1,166 @@ + +The sync patches work is based on initial patches from +Krisztian <hidden@balabit.hu> and others and additional patches +from Jamal <hadi@cyberus.ca>. + +The end goal for syncing is to be able to insert attributes + generate +events so that the an SA can be safely moved from one machine to another +for HA purposes. +The idea is to synchronize the SA so that the takeover machine can do +the processing of the SA as accurate as possible if it has access to it. + +We already have the ability to generate SA add/del/upd events. +These patches add ability to sync and have accurate lifetime byte (to +ensure proper decay of SAs) and replay counters to avoid replay attacks +with as minimal loss at failover time. +This way a backup stays as closely uptodate as an active member. + +Because the above items change for every packet the SA receives, +it is possible for a lot of the events to be generated. +For this reason, we also add a nagle-like algorithm to restrict +the events. i.e we are going to set thresholds to say "let me +know if the replay sequence threshold is reached or 10 secs have passed" +These thresholds are set system-wide via sysctls or can be updated +per SA. + +The identified items that need to be synchronized are: +- the lifetime byte counter +note that: lifetime time limit is not important if you assume the failover +machine is known ahead of time since the decay of the time countdown +is not driven by packet arrival. +- the replay sequence for both inbound and outbound + +1) Message Structure +---------------------- + +nlmsghdr:aevent_id:optional-TLVs. + +The netlink message types are: + +XFRM_MSG_NEWAE and XFRM_MSG_GETAE. + +A XFRM_MSG_GETAE does not have TLVs. +A XFRM_MSG_NEWAE will have at least two TLVs (as is +discussed further below). + +aevent_id structure looks like: + + struct xfrm_aevent_id { + struct xfrm_usersa_id sa_id; + __u32 flags; + }; + +xfrm_usersa_id in this message layout identifies the SA. + +flags are used to indicate different things. The possible +flags are: + XFRM_AE_RTHR=1, /* replay threshold*/ + XFRM_AE_RVAL=2, /* replay value */ + XFRM_AE_LVAL=4, /* lifetime value */ + XFRM_AE_ETHR=8, /* expiry timer threshold */ + XFRM_AE_CR=16, /* Event cause is replay update */ + XFRM_AE_CE=32, /* Event cause is timer expiry */ + XFRM_AE_CU=64, /* Event cause is policy update */ + +How these flags are used is dependent on the direction of the +message (kernel<->user) as well the cause (config, query or event). +This is described below in the different messages. + +The pid will be set appropriately in netlink to recognize direction +(0 to the kernel and pid = processid that created the event +when going from kernel to user space) + +A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS +to get notified of these events. + +2) TLVS reflect the different parameters: +----------------------------------------- + +a) byte value (XFRMA_LTIME_VAL) +This TLV carries the running/current counter for byte lifetime since +last event. + +b)replay value (XFRMA_REPLAY_VAL) +This TLV carries the running/current counter for replay sequence since +last event. + +c)replay threshold (XFRMA_REPLAY_THRESH) +This TLV carries the threshold being used by the kernel to trigger events +when the replay sequence is exceeded. + +d) expiry timer (XFRMA_ETIMER_THRESH) +This is a timer value in milliseconds which is used as the nagle +value to rate limit the events. + +3) Default configurations for the parameters: +---------------------------------------------- + +By default these events should be turned off unless there is +at least one listener registered to listen to the multicast +group XFRMNLGRP_AEVENTS. + +Programs installing SAs will need to specify the two thresholds, however, +in order to not change existing applications such as racoon +we also provide default threshold values for these different parameters +in case they are not specified. + +the two sysctls/proc entries are: +a) /proc/sys/net/core/sysctl_xfrm_aevent_etime +used to provide default values for the XFRMA_ETIMER_THRESH in incremental +units of time of 100ms. The default is 10 (1 second) + +b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth +used to provide default values for XFRMA_REPLAY_THRESH parameter +in incremental packet count. The default is two packets. + +4) Message types +---------------- + +a) XFRM_MSG_GETAE issued by user-->kernel. +XFRM_MSG_GETAE does not carry any TLVs. +The response is a XFRM_MSG_NEWAE which is formatted based on what +XFRM_MSG_GETAE queried for. +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved +*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved + +b) XFRM_MSG_NEWAE is issued by either user space to configure +or kernel to announce events or respond to a XFRM_MSG_GETAE. + +i) user --> kernel to configure a specific SA. +any of the values or threshold parameters can be updated by passing the +appropriate TLV. +A response is issued back to the sender in user space to indicate success +or failure. +In the case of success, additionally an event with +XFRM_MSG_NEWAE is also issued to any listeners as described in iii). + +ii) kernel->user direction as a response to XFRM_MSG_GETAE +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +The threshold TLVs will be included if explicitly requested in +the XFRM_MSG_GETAE message. + +iii) kernel->user to report as event if someone sets any values or +thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). +In such a case XFRM_AE_CU flag is set to inform the user that +the change happened as a result of an update. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +iv) kernel->user to report event when replay threshold or a timeout +is exceeded. +In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout +happened) is set to inform the user what happened. +Note the two flags are mutually exclusive. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +Exceptions to threshold settings +-------------------------------- + +If you have an SA that is getting hit by traffic in bursts such that +there is a period where the timer threshold expires with no packets +seen, then an odd behavior is seen as follows: +The first packet arrival after a timer expiry will trigger a timeout +aevent; i.e we dont wait for a timeout period or a packet threshold +to be reached. This is done for simplicity and efficiency reasons. + +-JHS |