| TCP protocol |
| ============ |
| |
| Last updated: 3 June 2017 |
| |
| Contents |
| ======== |
| |
| - Congestion control |
| - How the new TCP output machine [nyi] works |
| |
| Congestion control |
| ================== |
| |
| The following variables are used in the tcp_sock for congestion control: |
|   snd_cwnd        The size of the congestion window. |
|   snd_ssthresh    Slow start threshold. We are in slow start if |
|                   snd_cwnd is less than this. |
|   snd_cwnd_cnt    A counter used to slow down the rate of increase |
|                   once we exceed the slow start threshold. |
|   snd_cwnd_clamp  The maximum size that snd_cwnd can grow to. |
|   snd_cwnd_stamp  Timestamp of when the congestion window was last |
|                   validated. |
|   snd_cwnd_used   A high-water mark for how much of the congestion |
|                   window is in use. It is used to adjust snd_cwnd down |
|                   when the link is limited by the application rather |
|                   than the network. |
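
The interplay of snd_cwnd, snd_ssthresh, snd_cwnd_cnt and snd_cwnd_clamp can
be sketched with a small userspace model. This is a simplified Reno-style
illustration, not the kernel's code: struct cwnd_state, in_slow_start() and
on_ack() are hypothetical names, and the window is counted in packets as an
assumption of the model.

```c
#include <stdint.h>

/* Userspace model of the congestion fields listed above (hypothetical). */
struct cwnd_state {
    uint32_t snd_cwnd;       /* congestion window, in packets (assumed) */
    uint32_t snd_ssthresh;   /* slow start threshold */
    uint32_t snd_cwnd_cnt;   /* ACKs counted since the last increase */
    uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/* We are in slow start while snd_cwnd is below snd_ssthresh. */
int in_slow_start(const struct cwnd_state *s)
{
    return s->snd_cwnd < s->snd_ssthresh;
}

/* Account for one newly acknowledged packet. */
void on_ack(struct cwnd_state *s)
{
    if (in_slow_start(s)) {
        /* Slow start: grow by one packet per acknowledged packet. */
        s->snd_cwnd++;
    } else if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
        /*
         * Past the threshold: snd_cwnd_cnt slows growth down to one
         * packet per full window of acknowledgements.
         */
        s->snd_cwnd_cnt = 0;
        s->snd_cwnd++;
    }

    /* snd_cwnd never grows beyond snd_cwnd_clamp. */
    if (s->snd_cwnd > s->snd_cwnd_clamp)
        s->snd_cwnd = s->snd_cwnd_clamp;
}
```

Starting from a window of 2 with a threshold of 4, the first two ACKs each
add a packet; after that, four ACKs are needed for the next increase.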
| |
| As of 2.6.13, Linux supports pluggable congestion control algorithms. |
| A congestion control mechanism is registered through functions in |
| tcp_cong.c. The functions used by the congestion control mechanism are |
| registered by passing a tcp_congestion_ops struct to |
| tcp_register_congestion_control. At a minimum, the congestion control |
| mechanism must provide a valid name and must implement either the ssthresh, |
| cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook. |
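
The minimum requirement stated above can be modelled in userspace as follows.
This is only a sketch of the rule as this document words it, not the code in
tcp_cong.c: struct cc_ops_model, cc_ops_validate() and the demo_* hooks are
hypothetical, the hook signatures are simplified, and returning -EINVAL on
failure is an assumption of the model.

```c
#include <errno.h>
#include <stddef.h>

struct sock; /* opaque here; only pointers to it are passed around */

/* Reduced stand-in for tcp_congestion_ops with simplified signatures. */
struct cc_ops_model {
    const char *name;
    unsigned int (*ssthresh)(struct sock *sk);
    void (*cong_avoid)(struct sock *sk, unsigned int ack, unsigned int acked);
    unsigned int (*undo_cwnd)(struct sock *sk);
    void (*cong_control)(struct sock *sk);
};

/* Returns 0 if the ops meet the minimum described above, -EINVAL if not. */
int cc_ops_validate(const struct cc_ops_model *ca)
{
    int classic = ca->ssthresh && ca->cong_avoid && ca->undo_cwnd;

    if (!ca->name || ca->name[0] == '\0')
        return -EINVAL;
    if (!classic && !ca->cong_control)
        return -EINVAL;
    return 0;
}

/* Placeholder hooks for a made-up "demo" algorithm. */
unsigned int demo_ssthresh(struct sock *sk) { (void)sk; return 2; }
void demo_cong_avoid(struct sock *sk, unsigned int ack, unsigned int acked)
{
    (void)sk; (void)ack; (void)acked;
}
unsigned int demo_undo_cwnd(struct sock *sk) { (void)sk; return 10; }
void demo_cong_control(struct sock *sk) { (void)sk; }

/* A mechanism that takes the three-hook route. */
const struct cc_ops_model demo_classic = {
    .name       = "demo",
    .ssthresh   = demo_ssthresh,
    .cong_avoid = demo_cong_avoid,
    .undo_cwnd  = demo_undo_cwnd,
};
```

In this model a mechanism that sets only cong_control also passes, while one
that provides ssthresh alone is rejected.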
| |
| Private data for a congestion control mechanism is stored in tp->ca_priv. |
| tcp_ca(tp) returns a pointer to this space. The space is preallocated, so |
| it is important to check that your private data fits in it. Alternatively, |
| the data can be allocated elsewhere and a pointer to it stored here. |
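
One way to make the size check hard to forget is a compile-time assertion.
The sketch below is a userspace model under stated assumptions:
CA_PRIV_SIZE_MODEL is a made-up size (the real one comes from the kernel
headers), and struct tp_model, struct demo_ca and demo_ca_of() are
hypothetical stand-ins for the socket, a module's private state and the
tcp_ca(tp) accessor.

```c
#include <assert.h> /* static_assert (C11) */
#include <stdint.h>

/* Made-up size of the preallocated area; not the kernel's value. */
#define CA_PRIV_SIZE_MODEL (8 * sizeof(uint64_t))

/* Stand-in for the socket structure that carries the private area. */
struct tp_model {
    uint64_t ca_priv[CA_PRIV_SIZE_MODEL / sizeof(uint64_t)];
};

/* Private state of a hypothetical congestion control module. */
struct demo_ca {
    uint32_t min_rtt_us;
    uint32_t ack_count;
};

/* Fail the build, rather than corrupt memory, if the state grows too big. */
static_assert(sizeof(struct demo_ca) <= CA_PRIV_SIZE_MODEL,
              "demo_ca does not fit in the preallocated private area");

/* Modelled after tcp_ca(tp): view the private area as the module's state. */
struct demo_ca *demo_ca_of(struct tp_model *tp)
{
    return (struct demo_ca *)tp->ca_priv;
}
```

If struct demo_ca ever outgrew the area, the build would stop at the
static_assert. At that point the alternative mentioned above applies:
allocate the state elsewhere and keep only a pointer in the private area.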
| |
| There are currently three kinds of congestion control algorithms. The |
| simplest ones are derived from TCP Reno (highspeed, scalable) and just |
| provide an alternative congestion window calculation. More complex |
| ones, like BIC, look at other events to provide better heuristics. |
| There are also round-trip-time-based algorithms like Vegas and |
| Westwood+. |
| |
| Good TCP congestion control is a complex problem because the algorithm |
| needs to maintain both fairness and performance. Please review current |
| research and RFCs before developing new modules. |
| |
| The default congestion control mechanism is chosen based on the |
| DEFAULT_TCP_CONG Kconfig parameter. To override that default at run time, |
| set the sysctl net.ipv4.tcp_congestion_control. The module will be |
| autoloaded if needed and you will get the requested algorithm. If you ask |
| for an unknown congestion control method, the sysctl attempt will fail. |
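
The sysctl is also visible as a file under /proc/sys, so a program can check
which algorithm is currently the default. The helper below is a minimal
sketch: read_cc_name() and TCP_CC_SYSCTL_PATH are names invented for this
example, and the path is passed in as a parameter so the function can be
tried against any file.

```c
#include <stdio.h>
#include <string.h>

/* procfs view of the net.ipv4.tcp_congestion_control sysctl. */
#define TCP_CC_SYSCTL_PATH "/proc/sys/net/ipv4/tcp_congestion_control"

/*
 * Read the algorithm name stored in 'path' into 'buf' (at most len - 1
 * characters), without the trailing newline.
 * Returns 0 on success and -1 if the file cannot be opened or read.
 */
int read_cc_name(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;

    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Cut the string at the first newline, if there is one. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would typically pass TCP_CC_SYSCTL_PATH and a small buffer, then
print or compare the returned name.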
| |
| If you remove a TCP congestion control module, then you will get the next |
| available one. Since reno cannot be built as a module, and cannot be |
| removed, it will always be available. |
| |
| How the new TCP output machine [nyi] works |
| ========================================== |
| |
| Data is kept on a single queue. The skb->users flag tells us whether a |
| frame has already been queued. To add a frame, we append it to the end. |
| Ack processing walks down the list from the start. |
| |
| We keep a set of control flags: |
| |
|   sk->tcp_pend_event |
| |
|     TCP_PEND_ACK        Ack needed |
|     TCP_ACK_NOW         Needed now |
|     TCP_WINDOW          Window update check |
|     TCP_WINZERO         Zero probing |
| |
|   sk->transmit_queue    First frame on the transmission queue |
|   sk->transmit_new      First new frame pointer |
|   sk->transmit_end      Where to add frames |
| |
|   sk->tcp_last_tx_ack   Last ack seen |
|   sk->tcp_dup_ack       Dup ack count for fast retransmit |
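
The single-queue idea and the three transmit pointers can be sketched as a
small linked-list model. Since this design is marked [nyi], everything here
is an interpretation rather than kernel code: struct frame, struct txq_model
and the txq_* functions are hypothetical, and the meaning given to each
pointer in the comments is an assumption based on the short descriptions
above.

```c
#include <stddef.h>
#include <stdint.h>

/* One queued frame (hypothetical stand-in for an skb). */
struct frame {
    uint32_t end_seq;   /* sequence number just past this frame's data */
    struct frame *next;
};

/* The three queue pointers, with their assumed meaning. */
struct txq_model {
    struct frame *transmit_queue; /* oldest frame not yet acknowledged */
    struct frame *transmit_new;   /* first frame not yet sent */
    struct frame *transmit_end;   /* last frame; new ones go after it */
};

/* To add a frame we put it on the end of the single queue. */
void txq_add(struct txq_model *q, struct frame *f)
{
    f->next = NULL;
    if (q->transmit_end)
        q->transmit_end->next = f;
    else
        q->transmit_queue = f;
    if (!q->transmit_new)
        q->transmit_new = f;
    q->transmit_end = f;
}

/* Hand out the first unsent frame, or NULL if everything was sent. */
struct frame *txq_send_next(struct txq_model *q)
{
    struct frame *f = q->transmit_new;

    if (f)
        q->transmit_new = f->next;
    return f;
}

/*
 * An ack walks the list from the start and drops every sent frame that
 * 'ack' fully covers. Returns the number of frames dropped.
 */
int txq_ack(struct txq_model *q, uint32_t ack)
{
    int dropped = 0;

    while (q->transmit_queue &&
           q->transmit_queue != q->transmit_new &&
           (int32_t)(ack - q->transmit_queue->end_seq) >= 0) {
        q->transmit_queue = q->transmit_queue->next;
        dropped++;
    }
    if (!q->transmit_queue)
        q->transmit_end = NULL;
    return dropped;
}
```

The signed difference in txq_ack() keeps the comparison valid when sequence
numbers wrap around. Frames that have not been sent yet are never dropped,
whatever the ack value.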
| |
| |
| Frames are queued for output by tcp_write. We do our best to send the |
| frames off immediately; otherwise we queue them and compute the body |
| checksum during the copy. |
| |
| When a write is done we try to clear any pending events and piggyback them. |
| If the window is full we queue full-sized frames. On the first timeout in |
| zero window we split this. |
| |
| On a timer we walk the retransmit list to send any retransmits, update the |
| backoff timers, etc. A change of route table stamp causes a change of header |
| and recompute. We add any new TCP-level headers and refinish the checksum |
| before sending. |
| |