• TCP Congestion State Transitions


    The Linux TCP sender is governed by a state machine that determines the sender's actions when
    acknowledgements arrive.

    The states are as follows:

    enum tcp_ca_state {
            TCP_CA_Open = 0,
    #define TCPF_CA_Open (1<<TCP_CA_Open)
    
            TCP_CA_Disorder = 1,
    #define TCPF_CA_Disorder (1<<TCP_CA_Disorder)
    
            TCP_CA_CWR = 2,
    #define TCPF_CA_CWR (1<<TCP_CA_CWR)
    
            TCP_CA_Recovery = 3,
    #define TCPF_CA_Recovery (1<<TCP_CA_Recovery)
    
            TCP_CA_Loss = 4
    #define TCPF_CA_Loss (1<<TCP_CA_Loss)
    };

    Open

    This is the normal state, in which the TCP sender follows the fast execution path, optimized for
    the common case of processing incoming acknowledgements.
    When an acknowledgement arrives, the sender increases the congestion window according to
    either slow start or congestion avoidance, depending on whether the congestion window is
    smaller or larger than the slow start threshold, respectively.

    This is the initial state, and also the normal one.
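As a rough model of the Open-state window growth described above, the per-ACK update can be sketched as follows. The struct and function names here are illustrative, not the kernel's actual helpers; window sizes are counted in segments.

```c
#include <stdint.h>

/* Simplified model of the Open-state congestion window update:
 * slow start below ssthresh, congestion avoidance at or above it. */
struct cwnd_state {
    uint32_t snd_cwnd;      /* congestion window, in segments */
    uint32_t snd_ssthresh;  /* slow start threshold */
    uint32_t snd_cwnd_cnt;  /* ACK counter used in congestion avoidance */
};

/* Called once per incoming ACK while the sender is in TCP_CA_Open. */
static void open_state_ack(struct cwnd_state *s)
{
    if (s->snd_cwnd < s->snd_ssthresh) {
        /* Slow start: one extra segment per ACK,
         * doubling the window roughly once per RTT. */
        s->snd_cwnd++;
    } else {
        /* Congestion avoidance: one extra segment per full window
         * of ACKs, i.e. roughly one segment per RTT. */
        if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
            s->snd_cwnd_cnt = 0;
            s->snd_cwnd++;
        }
    }
}
```

Starting from cwnd = 2 with ssthresh = 4, the first two ACKs grow the window by one segment each; once cwnd reaches ssthresh, growth slows to one segment per window of ACKs.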

    Disorder

    When the sender detects duplicate ACKs or selective acknowledgements, it moves to the Disorder
    state. In this state the congestion window is not adjusted, but each incoming packet triggers
    transmission of a new segment. The TCP sender therefore follows the packet conservation
    principle, which states that a new packet is not sent out until an old packet has left the network.

    The congestion window is held constant; the number of packets in the network is conserved.
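The packet conservation behaviour in Disorder can be sketched as below. This is a toy model with made-up names, not kernel code: each incoming (duplicate) ACK signals that one packet has left the network, which permits at most one new transmission while cwnd itself stays untouched.

```c
#include <stdint.h>

/* Illustrative model of packet conservation in the Disorder state. */
struct disorder_state {
    uint32_t snd_cwnd;   /* not adjusted while in Disorder */
    uint32_t in_flight;  /* packets currently in the network */
};

/* Process one incoming ACK; returns 1 if a new segment was sent. */
static int disorder_ack(struct disorder_state *s)
{
    /* The ACK tells us one packet has left the network... */
    s->in_flight--;
    /* ...so the window now allows exactly one new transmission,
     * keeping the number of packets in flight constant. */
    if (s->in_flight < s->snd_cwnd) {
        s->in_flight++;
        return 1;
    }
    return 0;
}
```

With a full window in flight, every ACK releases exactly one new segment, so in_flight never changes; if the window had previously been reduced below in_flight, the ACK releases nothing.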

    CWR

    The TCP sender may receive a congestion notification through Explicit Congestion Notification,
    an ICMP source quench, or from a local device. On receiving a congestion notification, the Linux
    sender does not reduce the congestion window at once, but by one segment for every second
    incoming ACK until the window size is halved. While the sender is in the process of reducing the
    congestion window and has no outstanding retransmissions, it is in the CWR (Congestion
    Window Reduced) state. The CWR state can be interrupted by the Recovery or Loss state.

    The congestion window is being reduced, and there are no outstanding retransmissions.
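The rate-halving behaviour described above can be sketched as follows. This is a simplified model with illustrative names, not the kernel implementation: one segment is removed from the window for every second incoming ACK, until the window reaches half its value at the time congestion was signalled.

```c
#include <stdint.h>

/* Illustrative model of CWR rate halving. */
struct cwr_state {
    uint32_t snd_cwnd;  /* current congestion window, in segments */
    uint32_t target;    /* half of the window when congestion was signalled */
    uint32_t acks;      /* ACKs seen since entering CWR */
};

/* Enter CWR: remember the halving target, but do not cut cwnd at once. */
static void cwr_enter(struct cwr_state *s, uint32_t cwnd_at_notification)
{
    s->snd_cwnd = cwnd_at_notification;
    s->target = cwnd_at_notification / 2;
    s->acks = 0;
}

/* Process one incoming ACK while in CWR. */
static void cwr_ack(struct cwr_state *s)
{
    /* Every second ACK shrinks the window by one segment,
     * stopping once the window has been halved. */
    if (++s->acks % 2 == 0 && s->snd_cwnd > s->target)
        s->snd_cwnd--;
}
```

Starting from a window of 10 segments, ten ACKs bring the window down to 5, after which further ACKs leave it unchanged.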

    struct tcp_sock {
            ...
    
            u32 bytes_acked; /* Appropriate Byte Counting */
            u32 prior_ssthresh; /* ssthresh saved at recovery start */
            u32 undo_marker; /* tracking retrans started here */
            u32 high_seq; /* snd_nxt at onset of congestion */
            u32 snd_cwnd_stamp; /* timestamp of the last snd_cwnd update */
            u8 ecn_flags; /* ECN status bits */
    
            ...
    };
    
    struct inet_connection_sock {
            ...
    
            __u8 icsk_ca_state;
            __u8 icsk_retransmits;
            const struct tcp_congestion_ops *icsk_ca_ops;
    
            ...
    };
    
     /* Set slow start threshold and cwnd not falling to slow start */
    void tcp_enter_cwr(struct sock *sk, const int set_ssthresh)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            const struct inet_connection_sock *icsk = inet_csk(sk);
            tp->prior_ssthresh = 0;
            tp->bytes_acked = 0;
    
            if (icsk->icsk_ca_state < TCP_CA_CWR) { /* only reachable from Open and Disorder */
                    tp->undo_marker = 0;
                    if (set_ssthresh)
                         tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* reset the slow start threshold */
                    tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1U);
                    tp->snd_cwnd_cnt = 0;
                    tp->high_seq = tp->snd_nxt;
                    tp->snd_cwnd_stamp = tcp_time_stamp;
    
                    TCP_ECN_queue_cwr(tp);
                    tcp_set_ca_state(sk, TCP_CA_CWR); /* switch to the CWR state */
            }
    }
    
    #define TCP_ECN_OK 1
    #define TCP_ECN_QUEUE_CWR 2
    #define TCP_ECN_DEMAND_CWR 4
    
    static inline void TCP_ECN_queue_cwr(struct tcp_sock *tp)
    {
            if (tp->ecn_flags & TCP_ECN_OK)
                tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
    } 
    
    static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            if (icsk->icsk_ca_ops->set_state)
                   icsk->icsk_ca_ops->set_state(sk, ca_state);
            icsk->icsk_ca_state = ca_state;
    }
    
    

     
    Recovery

    After a sufficient number of successive duplicate ACKs arrive at the sender, it retransmits the
    first unacknowledged segment and enters the Recovery state. By default, the threshold for
    entering Recovery is three successive duplicate ACKs, the value recommended by the TCP
    congestion control specification. During the Recovery state, the congestion window size is
    reduced by one segment for every second incoming acknowledgement, as in the CWR state. The
    window reduction ends when the congestion window size equals ssthresh, i.e. half of the
    window size at the time Recovery was entered. The congestion window is not increased during
    the Recovery state, and the sender either retransmits the segments marked lost or makes
    forward transmissions of new data according to the packet conservation principle. The sender
    stays in the Recovery state until all of the segments that were outstanding when it entered
    Recovery are successfully acknowledged, after which it returns to the Open state. A
    retransmission timeout can also interrupt the Recovery state.
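The Recovery exit condition can be sketched with the kernel's sequence-number comparison idiom (signed 32-bit subtraction, as used by the after()/before() helpers). recovery_complete() is a hypothetical name for illustration; the kernel tracks this via snd_una reaching high_seq, the snd_nxt recorded at the onset of Recovery.

```c
#include <stdbool.h>
#include <stdint.h>

/* All segments outstanding at the onset of Recovery are acknowledged
 * once the cumulative ACK point (snd_una) reaches high_seq.
 * Signed subtraction makes the test robust to 32-bit sequence-number
 * wraparound, the same trick the kernel's before()/after() use. */
static bool recovery_complete(uint32_t snd_una, uint32_t high_seq)
{
    return (int32_t)(snd_una - high_seq) >= 0;
}
```

Note that the comparison stays correct even when the sequence space wraps, e.g. a small snd_una against a high_seq near 2^32.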

    Loss

    When an RTO expires, the sender enters the Loss state. All outstanding segments are marked
    lost, and the congestion window is set to one segment, hence the sender starts increasing the
    congestion window using the slow start algorithm. A major difference between the Loss and
    Recovery states is that in the Loss state the congestion window is increased after the sender
    has reset it to one segment, but in the Recovery state the congestion window size can only be
    reduced. The Loss state cannot be interrupted by any other state, thus the sender exits to the
    Open state only after all data outstanding when the Loss state began have successfully been
    acknowledged. For example, fast retransmit cannot be triggered during the Loss state, which
    is in conformance with the NewReno specification.

    /* Enter Loss state. If "how" is not zero, forget all SACK information and 
     * reset tags completely, otherwise preserve SACKs. If receiver dropped its 
     * ofo queue, we will know this due to reneging detection.
     */
    void tcp_enter_loss(struct sock *sk, int how)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);
            struct tcp_sock *tp = tcp_sk(sk);
            struct sk_buff *skb;
    
            /* Reduce ssthresh if it has not yet been made inside this window,
             * i.e. on first entry to the Loss state.
             */
            if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq ||
                (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
                    /* save the current threshold so the cwnd reduction can be undone later */
                    tp->prior_ssthresh = tcp_current_ssthresh(sk);
                    /* reduce the slow start threshold */
                    tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
                    /* report the CA_EVENT_LOSS event to the congestion control algorithm */
                    tcp_ca_event(sk, CA_EVENT_LOSS);
            }
    
            tp->snd_cwnd = 1; /* collapse the congestion window to one segment */
            tp->snd_cwnd_cnt = 0;
            tp->snd_cwnd_stamp = tcp_time_stamp;
            tp->bytes_acked = 0;
            tcp_clear_retrans_partial(tp); /* zero the retransmission counters */
    
            if (tcp_is_reno(tp))
                    tcp_reset_reno_sack(tp); /* zero sacked_out */
    
            if (!how) { /* keep the SACK tags */
                    tp->undo_marker = tp->snd_una; /* so the cwnd reduction can be undone later */
            } else { /* discard the SACK tags */
                    tp->sacked_out = 0;
                    tp->fackets_out = 0;
            }
            tcp_clear_all_retrans_hints(tp);
    
            tcp_for_write_queue(skb, sk) { /* walk the sk->sk_write_queue send queue */
                    if (skb == tcp_send_head(sk)) /* i.e. from snd.una up to snd.nxt */
                        break;
    
                    if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
                         tp->undo_marker = 0;
    
                    /* clear the retransmitted and lost tags */
                    TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS) | TCPCB_SACKED_ACKED;
    
                    if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) || how) {
                        TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED; /* clear the SACK tag */
                        TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; /* mark the segment as lost */
                        tp->lost_out += tcp_skb_pcount(skb); /* count the lost segments */
                        tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
                    }
            }
    
            tcp_verify_left_out(tp); /* warn if left_out > packets_out */
            tp->reordering = min_t(unsigned int, tp->reordering, sysctl_tcp_reordering);
            tcp_set_ca_state(sk, TCP_CA_Loss);
            tp->high_seq = tp->snd_nxt;
            TCP_ECN_queue_cwr(tp); /* signal that the sender has entered a congested state */
    
            /* Abort F-RTO algorithm if one is in progress */
            tp->frto_counter = 0;
    }
    #define tcp_for_write_queue(skb, sk)        \
            skb_queue_walk(&(sk)->sk_write_queue, skb)
    
    #define skb_queue_walk(queue, skb)        \
            for (skb = (queue)->next;                                        \
                 prefetch(skb->next), (skb != (struct sk_buff *)(queue));    \
                 skb = skb->next)
    
    /* Due to TSO, an SKB can be composed of multiple actual packets.
     * To keep these tracked properly, we use this.
     */
    static inline int tcp_skb_pcount(const struct sk_buff *skb)
    {
            return skb_shinfo(skb)->gso_segs;
    }
    
    struct sock {
            ...
    
            struct sk_buff_head sk_write_queue; /* head of the send queue */
            struct sk_buff *sk_send_head; /* front of unsent data, i.e. snd_nxt */
    
            ...
    };
    
    /* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd. 
     * The exception is rate halving phase, when cwnd is decreasing towards ssthresh.
     */
    static inline __u32 tcp_current_ssthresh(const struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            if ((1 << inet_csk(sk)->icsk_ca_state) & (TCPF_CA_CWR | TCPF_CA_Recovery))
                    return tp->snd_ssthresh;  /* cwnd is still shrinking in CWR and Recovery */
            else /* otherwise raise ssthresh towards cwnd */
                    return max(tp->snd_ssthresh, ((tp->snd_cwnd >> 1) + (tp->snd_cwnd >> 2)));
    }
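The shift expression in tcp_current_ssthresh computes three quarters of the congestion window without a division: (cwnd >> 1) + (cwnd >> 2) = cwnd/2 + cwnd/4, so ssthresh is raised half-way from its old value towards cwnd. A standalone sketch of just that arithmetic:

```c
#include <stdint.h>

/* Three quarters of cwnd via shifts, as in tcp_current_ssthresh:
 * cwnd/2 + cwnd/4 (integer division, rounding each term down). */
static uint32_t three_quarters(uint32_t cwnd)
{
    return (cwnd >> 1) + (cwnd >> 2);
}
```

For example, a window of 100 segments yields 75; because each shift truncates, odd values round down slightly (10 gives 5 + 2 = 7 rather than 7.5).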
    
    static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);
            if (icsk->icsk_ca_ops->cwnd_event)
                    icsk->icsk_ca_ops->cwnd_event(sk, event);
    }
    
  • Original post: https://www.cnblogs.com/aiwz/p/6333394.html