Thinking further: digging into the root cause
Building on the packetdrill TCP three-way-handshake script, we construct simulated server-side scenarios to continue studying and testing TCP slow start.
Base script
# cat tcp_tcp_slow_start_000.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
#
Slow start
TCP slow start is the initial mechanism of TCP congestion control. Its core purpose is to probe the available network bandwidth quickly yet cautiously after a connection is established or a retransmission timeout occurs. The sender maintains a congestion window (cwnd) to bound the amount of unacknowledged data; during slow start, every acknowledgment (ACK) received grows cwnd by one maximum segment size (MSS). The amount of data that can be sent within one round-trip time (RTT) therefore roughly doubles each round, giving exponential window growth until the slow-start threshold (ssthresh) is reached or packet loss is detected, after which the connection moves into the linearly growing congestion-avoidance phase.
In the Linux implementation, slow start is triggered in three typical scenarios: at connection establishment (default cwnd=10, ssthresh set to a huge value), which guarantees the connection enters slow start unless overridden by routing-table configuration; on an RTO retransmission timeout, where cwnd is reset to 1 and ssthresh is recomputed by the congestion control algorithm before slow start restarts; and when the connection has been idle for longer than an RTO, where the congestion window is reinitialized and slow start probes the network state again. Together these three mechanisms let TCP adapt to the available network capacity in each scenario.
Basic test
Staying with the connection-establishment scenario: in the earlier experiments in this series, every data segment sent was exactly MSS-sized. A natural question is whether cwnd growth depends on the size of the data, i.e. whether segments must be a full MSS.
To answer it, we can simply shrink the data segments and observe. Based on the tcp_slow_start_1_003.pkt script, change the 1000-byte writes to 100 bytes.
# cat tcp_slow_start_1_004.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1000,nop,nop,sackOK>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 %{print (tcpi_snd_cwnd, tcpi_snd_ssthresh)}%

+0.01 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 %{print (tcpi_snd_cwnd)}%

+0.01 <. 1:1(0) ack 101 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 201 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 301 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 401 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 501 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 601 win 10000
+0 %{print (tcpi_snd_cwnd)}%
#
After running the script, the cwnd value changes somewhat as the ACKs arrive: on ACK 101 it increases by 1 to 11, on ACK 201 it increases by 1 again to 12, but for ACKs 301 through 601 it no longer changes, holding at 12.
This matches the result of tcp_slow_start_1_003.pkt, showing that cwnd growth is unrelated to the size of the data sent, and unrelated to the MSS.
# packetdrill tcp_slow_start_1_004.pkt
10 2147483647
10
11
12
12
12
12
12
#

# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
21:16:39.127760 tun0  In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [S], seq 0, win 10000, options [mss 1000,nop,nop,sackOK], length 0
21:16:39.127825 tun0  Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [S.], seq 923791363, ack 1, win 64240, options [mss 1460,nop,nop,sackOK], length 0
21:16:39.138164 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 1, win 10000, length 0
21:16:39.158721 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
21:16:39.158786 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 101:201, ack 1, win 64240, length 100: HTTP
21:16:39.158835 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 201:301, ack 1, win 64240, length 100: HTTP
21:16:39.158888 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 301:401, ack 1, win 64240, length 100: HTTP
21:16:39.158943 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 401:501, ack 1, win 64240, length 100: HTTP
21:16:39.158990 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [P.], seq 501:601, ack 1, win 64240, length 100: HTTP
21:16:39.169249 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 101, win 10000, length 0
21:16:39.169318 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 201, win 10000, length 0
21:16:39.169363 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 301, win 10000, length 0
21:16:39.169398 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 401, win 10000, length 0
21:16:39.169434 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 501, win 10000, length 0
21:16:39.169479 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [.], ack 601, win 10000, length 0
21:16:39.183199 ?     Out IP 192.168.74.165.8080 > 192.0.2.1.52837: Flags [F.], seq 601, ack 1, win 64240, length 0
21:16:39.183226 ?     In  IP 192.0.2.1.52837 > 192.168.74.165.8080: Flags [R.], seq 1, ack 601, win 10000, length 0
#
During the experiments there was a small detour: I forgot to disable the Nagle algorithm, which nearly skewed the conclusion. The following script leaves the Nagle algorithm enabled (the default).
# cat tcp_slow_start_1_005.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1000,nop,nop,sackOK>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 %{print (tcpi_snd_cwnd, tcpi_snd_ssthresh)}%

+0.01 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 write(4,...,100) = 100
+0 %{print (tcpi_snd_cwnd)}%

+0.01 <. 1:1(0) ack 101 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 201 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 301 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 401 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 501 win 10000
+0 %{print (tcpi_snd_cwnd)}%
+0 <. 1:1(0) ack 601 win 10000
+0 %{print (tcpi_snd_cwnd)}%
#
After running this script, the cwnd value does not change at all as the ACKs arrive; it stays at 10 throughout, with no increase whatsoever.
# packetdrill tcp_slow_start_1_005.pkt
10 2147483647
10
10
10
10
10
10
10
#

# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
21:26:56.792587 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [S], seq 0, win 10000, options [mss 1000,nop,nop,sackOK], length 0
21:26:56.792616 tun0  Out IP 192.168.173.236.8080 > 192.0.2.1.50429: Flags [S.], seq 753209366, ack 1, win 64240, options [mss 1460,nop,nop,sackOK], length 0
21:26:56.802707 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 1, win 10000, length 0
21:26:56.822865 tun0  Out IP 192.168.173.236.8080 > 192.0.2.1.50429: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
21:26:56.833053 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 101, win 10000, length 0
21:26:56.833138 tun0  Out IP 192.168.173.236.8080 > 192.0.2.1.50429: Flags [P.], seq 101:601, ack 1, win 64240, length 500: HTTP
21:26:56.833200 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 201, win 10000, length 0
21:26:56.833216 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 301, win 10000, length 0
21:26:56.833232 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 401, win 10000, length 0
21:26:56.833246 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 501, win 10000, length 0
21:26:56.833258 tun0  In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [.], ack 601, win 10000, length 0
21:26:56.920009 ?     Out IP 192.168.173.236.8080 > 192.0.2.1.50429: Flags [F.], seq 601, ack 1, win 64240, length 0
21:26:56.920144 ?     In  IP 192.0.2.1.50429 > 192.168.173.236.8080: Flags [R.], seq 1, ack 601, win 10000, length 0
#
Why does this happen? The tcpdump capture alone reveals the cause: under the Nagle algorithm, the data from writes 2-6 (seq 101:601) cannot be sent immediately. It may only go out, merged into a single segment, once the first small segment that was sent (seq 1:101) has been ACKed.
As the earlier experiment analysis showed, when the first ACK arrives, tcp_is_cwnd_limited() is in the slow-start branch and evaluates return tcp_snd_cwnd(tp) < 2 * tp->max_packets_out. Since 10 is not less than 2 * 1, the result is false, so the if (!tcp_is_cwnd_limited(sk)) check in cubictcp_cong_avoid() succeeds and the function returns immediately, leaving cwnd at 10. The second through sixth ACKs are then handled just like the first, so cwnd never grows.
Combining the earlier observations: the statement that "every ACK received grows cwnd by one MSS" is really a macro-level description. In the current Linux implementation with the cubic congestion control algorithm, what matters most is the comparison tcp_snd_cwnd(tp) < 2 * tp->max_packets_out.
/* We follow the spirit of RFC2861 to validate cwnd but implement a more
 * flexible approach. The RFC suggests cwnd should not be raised unless
 * it was fully used previously. And that's exactly what we do in
 * congestion avoidance mode. But in slow start we allow cwnd to grow
 * as long as the application has used half the cwnd.
 * Example :
 *    cwnd is 10 (IW10), but application sends 9 frames.
 *    We allow cwnd to reach 18 when all frames are ACKed.
 * This check is safe because it's as aggressive as slow start which already
 * risks 100% overshoot. The advantage is that we discourage application to
 * either send more filler packets or data to artificially blow up the cwnd
 * usage, and allow application-limited process to probe bw more aggressively.
 */
static inline bool tcp_is_cwnd_limited(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	if (tp->is_cwnd_limited)
		return true;

	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
	if (tcp_in_slow_start(tp))
		return tcp_snd_cwnd(tp) < 2 * tp->max_packets_out;

	return false;
}
For example, in the comment's scenario, cwnd is 10 and the application sends 9 frames; once all frames are ACKed, cwnd is allowed to grow to 18. In our experiment, cwnd is 10 and the application sends 6 frames; the observed behavior is that cwnd is allowed to grow to 12 once all frames are ACKed.