TCP Implementation in Linux: A Brief Tutorial
Translation: 内核小王子
Original authors: Helali Bhuiyan, Mark McGinley, Tao Li, Malathi Veeraraghavan, University of Virginia
Original article: TCP Implementation in Linux: A Brief Tutorial
A. Introduction
This document provides a brief overview of how TCP is implemented in Linux. It is not meant to be comprehensive, nor do we assert that it is without inaccuracies.
B. TCP implementation in Linux
Figures 1 and 2 show the internals of the TCP implementation in the Linux kernel. Fig. 1 shows the path taken by a new packet from the wire to a user application. The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, the NIC invokes the DMA engine to place the packet into kernel memory via empty sk_buffs stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, the packet data remains in the same kernel memory, avoiding any extra memory copies.
Once a packet is successfully received, the NIC raises an interrupt to the CPU, which processes each incoming packet and passes it to the IP layer. The IP layer performs its processing on each packet, and passes it up to the TCP layer if it is a TCP packet. The TCP process is then scheduled to handle received packets. Each packet in TCP goes through a series of complex processing steps. The TCP state machine is updated, and finally the packet is stored inside the TCP recv buffer.
A critical parameter for tuning TCP is the size of the recv buffer at the receiver. The number of packets a TCP sender is able to have outstanding (unacknowledged) is the minimum of the congestion window (cwnd) and the receiver's advertised window (rwnd). The maximum size of the receiver's advertised window is the TCP recv buffer size. Hence, if the size of the recv buffer is smaller than the bandwidth-delay product (BDP) of the end-to-end path, the achievable throughput will be low. On the other hand, a large recv buffer allows a correspondingly large number of packets to remain outstanding, possibly exceeding the number of packets an end-to-end path can sustain. The size of the recv buffer can be set by modifying the /proc/sys/net/ipv4/tcp_rmem variable. It takes three values: min, default, and max. The min value defines the minimum receive buffer size, which applies even when the operating system is under hard memory pressure. The default is the default size of the receive buffer, which is used together with the TCP window scaling factor to calculate the actual advertised window. The max defines the maximum size of the receive buffer.
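The relationship between the recv buffer and the BDP can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original tutorial: the function names are hypothetical, and the link speed, round-trip time, and sample tcp_rmem string are illustrative values rather than measurements or kernel defaults.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product of a path, in bytes."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most window_bytes can be
    outstanding during one round-trip time."""
    return window_bytes * 8 / rtt_s


def parse_tcp_rmem(text: str) -> dict:
    """Split the three whitespace-separated tcp_rmem values
    into min, default, and max (all in bytes)."""
    minimum, default, maximum = (int(value) for value in text.split())
    return {"min": minimum, "default": default, "max": maximum}


if __name__ == "__main__":
    # Assumed path: 1 Gbit/s bottleneck with a 100 ms round-trip time.
    print("BDP (bytes):", bdp_bytes(1e9, 0.1))
    # A 64 KiB window on the same path caps throughput far below 1 Gbit/s.
    print("Throughput cap (bit/s):", window_limited_throughput_bps(65536, 0.1))
    # Illustrative string in the same three-value format as tcp_rmem.
    print(parse_tcp_rmem("4096 87380 6291456"))
```

On a Linux host, the string passed to parse_tcp_rmem could be read from /proc/sys/net/ipv4/tcp_rmem. Comparing its max value against the BDP shows whether the recv buffer can cover the path.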
Also at the receiver, the parameter netdev_max_backlog dictates the maximum number of packets queued at a device that are waiting to be processed by the TCP receiving process. A newly received packet is discarded if adding it to the queue would cause the queue to exceed netdev_max_backlog.
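The drop rule described above is ordinary tail drop on a bounded queue. The sketch below models it in user space. The Backlog class and its method names are hypothetical and do not correspond to kernel code; only the limit check mirrors the text.

```python
from collections import deque


class Backlog:
    """Toy model of a per-device input queue bounded by netdev_max_backlog."""

    def __init__(self, max_backlog: int) -> None:
        self.max_backlog = max_backlog
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, packet) -> bool:
        """Queue a packet, or discard it if the queue would exceed the limit."""
        if len(self.queue) + 1 > self.max_backlog:
            self.dropped += 1
            return False
        self.queue.append(packet)
        return True

    def dequeue(self):
        """Hand the oldest queued packet to the upper layer, if any."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    backlog = Backlog(max_backlog=3)
    accepted = [backlog.enqueue(n) for n in range(5)]
    print(accepted, "dropped:", backlog.dropped)
```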
On the sender, as shown in Fig. 2, a user application writes data into the TCP send buffer by calling the write() system call. Like the TCP recv buffer, the send buffer is a crucial parameter for achieving maximum throughput. The maximum size of the congestion window is related to the amount of send buffer space allocated to the TCP socket. The send buffer holds all outstanding packets (for potential retransmission) as well as all data queued to be transmitted. Therefore, the congestion window can never grow larger than the send buffer can accommodate. If the send buffer is too small, the congestion window will not fully open, limiting the throughput. On the other hand, a large send buffer allows the congestion window to grow to a large value. If not constrained by the TCP recv buffer, the number of outstanding packets will also grow as the congestion window grows, causing packet loss if the end-to-end path cannot hold the large number of outstanding packets. The size of the send buffer can be set by modifying the /proc/sys/net/ipv4/tcp_wmem variable, which also takes three values: min, default, and max.
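The three limits on outstanding data described so far (congestion window, advertised window, and send buffer) combine as a simple minimum. The sketch below is a simplification: the function name is hypothetical, windows are counted in packets, and the 1460-byte segment size is an assumed value used only to convert buffer bytes into packets.

```python
def max_outstanding_packets(cwnd: int, rwnd: int,
                            send_buffer_bytes: int, mss: int = 1460) -> int:
    """Largest number of unacknowledged packets the sender can keep in flight.

    cwnd and rwnd are in packets. The send buffer must hold every outstanding
    packet for possible retransmission, so it acts as a third limit.
    mss is an assumed segment size, not a value taken from the text.
    """
    buffer_packets = send_buffer_bytes // mss
    return min(cwnd, rwnd, buffer_packets)


if __name__ == "__main__":
    # Small send buffer: the buffer, not cwnd or rwnd, is the limit.
    print(max_outstanding_packets(cwnd=100, rwnd=80, send_buffer_bytes=16384))
    # Large send buffer: the receiver's advertised window is the limit.
    print(max_outstanding_packets(cwnd=100, rwnd=80, send_buffer_bytes=4 * 1024 * 1024))
```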
The analogue to the receiver's netdev_max_backlog is the sender's txqueuelen. The TCP layer builds data packets when data is available in the send buffer, and builds ACK packets in response to received data packets. Each packet is pushed down to the IP layer for transmission. The IP layer enqueues each packet in an output queue (qdisc) associated with the NIC. The size of the qdisc can be modified by assigning a value to the txqueuelen variable associated with each NIC device. If the output queue is full, the attempt to enqueue a packet generates a local-congestion event, which is propagated upward to the TCP layer. The TCP congestion-control algorithm then enters the Congestion Window Reduced (CWR) state, and reduces the congestion window by one every other ACK (known as rate halving). After a packet is successfully queued inside the output queue, the packet descriptor (sk_buff) is placed in the output ring buffer tx_ring. When packets are available inside the ring buffer, the device driver invokes the NIC DMA engine to transmit packets onto the wire.
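Rate halving, as described above, can be sketched as a loop over incoming ACKs. This is a toy model and not the kernel's implementation: the function name is hypothetical, and stopping once the window reaches half of its starting value is an assumption inferred from the name "rate halving".

```python
def cwr_rate_halving(cwnd: int, acks: int) -> int:
    """Return the congestion window after processing `acks` ACKs in CWR state.

    The window shrinks by one packet on every second ACK. Assumption: the
    reduction stops at half the starting window (never below one packet).
    """
    target = max(cwnd // 2, 1)
    for ack_number in range(1, acks + 1):
        if ack_number % 2 == 0 and cwnd > target:
            cwnd -= 1
    return cwnd


if __name__ == "__main__":
    for acks in (4, 10, 100):
        print(acks, "ACKs ->", cwr_rate_halving(10, acks))
```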
While the above parameters dictate the flow-control profile of a connection, the congestion-control behavior can also have a large impact on the throughput. TCP uses one of several congestion control algorithms to match its sending rate with the bottleneck-link rate. Over a connectionless network, a large number of TCP flows and other types of traffic share the same bottleneck link. As the number of flows sharing the bottleneck link changes, the available bandwidth for a given TCP flow varies. Packets get lost when the sending rate of a TCP flow is higher than the available bandwidth. In a circuit, on the other hand, packets are not lost due to competition with other flows, because bandwidth is reserved. However, when a fast sender is connected to a circuit with a lower rate, packets can still get lost due to buffer overflow at the switch.
When a TCP connection is set up, a TCP sender uses ACK packets as a 'clock', known as ACK-clocking, to inject new packets into the network [1]. Since TCP receivers cannot send ACK packets faster than the bottleneck-link rate, a TCP sender's transmission rate while under ACK-clocking is matched with the bottleneck-link rate. In order to start the ACK-clock, a TCP sender uses the slow-start mechanism. During the slow-start phase, for each ACK packet received, a TCP sender transmits two data packets back-to-back. Since ACK packets are coming at the bottleneck-link rate, the sender is essentially transmitting data twice as fast as the bottleneck link can sustain. The slow-start phase ends when the size of the congestion window grows beyond ssthresh. In many congestion control algorithms, such as BIC [2], the initial slow start threshold (ssthresh) can be adjusted, as can other factors such as the maximum increment, to make BIC more or less aggressive. However, like changing the buffers via sysctl, these are system-wide changes which could adversely affect other ongoing and future connections. A TCP sender is allowed to have outstanding a number of packets equal to the minimum of the congestion window and the receiver's advertised window. Therefore, the number of outstanding packets is doubled in each round-trip time, unless bounded by the receiver's advertised window. As packets are forwarded at the bottleneck-link rate, doubling the number of outstanding packets in each round-trip time will also double the buffer occupancy inside the bottleneck switch. Eventually, there will be packet losses inside the bottleneck switch once the buffer overflows.
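The doubling behavior of slow start is easy to trace numerically. The sketch below is an idealized model with hypothetical names: it counts windows in packets, assumes no loss, and treats each loop iteration as one round-trip time.

```python
def slow_start_outstanding(initial_cwnd: int, ssthresh: int, rwnd: int) -> list:
    """Outstanding packets per round-trip time during slow start.

    cwnd doubles every round trip and slow start ends once cwnd grows
    beyond ssthresh. The number of packets actually outstanding is capped
    by the receiver's advertised window (rwnd).
    """
    if initial_cwnd < 1:
        raise ValueError("initial_cwnd must be at least 1 packet")
    cwnd = initial_cwnd
    trace = []
    while cwnd <= ssthresh:
        trace.append(min(cwnd, rwnd))
        cwnd *= 2
    return trace


if __name__ == "__main__":
    # Not bounded by the receiver: outstanding packets double each round trip.
    print(slow_start_outstanding(initial_cwnd=1, ssthresh=16, rwnd=100))
    # Bounded by a small advertised window of 6 packets.
    print(slow_start_outstanding(initial_cwnd=1, ssthresh=16, rwnd=6))
```

The first trace doubles on every round trip, while the second flattens as soon as the advertised window becomes the smaller of the two limits.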
After packet loss occurs, a TCP sender enters the congestion avoidance phase. During congestion avoidance, the congestion window is increased by one packet in each round-trip time. As ACK packets are coming at the bottleneck-link rate, the congestion window keeps growing, as does the number of outstanding packets. Therefore, packets will get lost again once the number of outstanding packets grows larger than the buffer size in the bottleneck switch plus the number of packets on the wire.
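The loss condition in congestion avoidance can be turned into a small calculation of how long the window grows before the next loss. This sketch uses hypothetical names and illustrative numbers, and it assumes the receiver's advertised window is not the limiting factor.

```python
def rtts_until_loss(cwnd: int, switch_buffer_pkts: int, wire_pkts: int) -> int:
    """Round trips of additive increase before packets are lost again.

    The window grows by one packet per round-trip time. Loss occurs once
    the outstanding packets exceed what the bottleneck switch buffer plus
    the wire can hold.
    """
    capacity = switch_buffer_pkts + wire_pkts
    rtts = 0
    while cwnd <= capacity:
        cwnd += 1
        rtts += 1
    return rtts


if __name__ == "__main__":
    # Illustrative values: 20-packet switch buffer, 30 packets on the wire.
    print(rtts_until_loss(cwnd=10, switch_buffer_pkts=20, wire_pkts=30))
```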
There are many other parameters that are relevant to the operation of TCP in Linux, and each is at least briefly explained in the documentation included in the distribution (Documentation/networking/ip-sysctl.txt). An example of a configurable parameter in the TCP implementation is the RFC2861 congestion window restart function. RFC2861 proposes restarting the congestion window if the sender is idle for a period of time (one RTO). The purpose is to ensure that the congestion window reflects the current state of the network. If the connection has been idle, the congestion window may reflect an obsolete view of the network and so is reset. This behavior can be disabled using the sysctl tcp_slow_start_after_idle but, again, this change affects all connections system-wide.
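The effect of the restart-after-idle setting can be sketched as a single decision. This is a deliberately simplified model of the behavior described above, not the exact procedure from RFC2861 or the kernel. The function name and the restart_cwnd parameter are hypothetical, and the boolean flag stands in for the tcp_slow_start_after_idle setting.

```python
def cwnd_after_idle(cwnd: int, idle_s: float, rto_s: float,
                    restart_cwnd: int, slow_start_after_idle: bool = True) -> int:
    """Congestion window to use when sending resumes after an idle period.

    Simplification: if the connection has been idle for at least one RTO and
    the restart behavior is enabled, fall back to restart_cwnd (a hypothetical
    restart window). Otherwise keep the old, possibly stale, window.
    """
    if slow_start_after_idle and idle_s >= rto_s:
        return min(cwnd, restart_cwnd)
    return cwnd


if __name__ == "__main__":
    print(cwnd_after_idle(100, idle_s=2.0, rto_s=1.0, restart_cwnd=10))
    print(cwnd_after_idle(100, idle_s=0.5, rto_s=1.0, restart_cwnd=10))
    print(cwnd_after_idle(100, idle_s=2.0, rto_s=1.0, restart_cwnd=10,
                          slow_start_after_idle=False))
```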