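Knowing When to Play Dumb
Starting from a packetdrill TCP three-way handshake script, this article constructs a simulated server-side scenario to study and test how the receiver avoids Silly Window Syndrome.
Base script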
# cat tcp_sws_000.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
#
Silly Window Syndrome
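Silly Window Syndrome (SWS) occurs when the two ends of a connection (sender and receiver) work together inefficiently, so that the network carries a large number of TCP segments with tiny payloads (as small as 1 byte). A vivid analogy: picture a huge factory (the sender) and a huge warehouse (the receiver) linked by a truck (a TCP segment) that carries only one part (1 byte of data) per trip. The truck itself is large (40 bytes of IP and TCP headers), yet its cargo is tiny. Transport efficiency is therefore extremely low: most of the fuel and resources go into moving the truck rather than the cargo. In short, the root cause of SWS is that the application reads or writes small amounts of data frequently, while TCP responds to each of those operations too eagerly.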
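To put a number on the truck analogy, here is a rough calculation. It assumes the 40-byte figure above means a 20-byte IPv4 header plus a 20-byte TCP header with no options; `payload_efficiency` is a name made up for this illustration.

```python
HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def payload_efficiency(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Return the fraction of bytes on the wire that are application payload."""
    return payload_bytes / (payload_bytes + header_bytes)


if __name__ == "__main__":
    # Compare a 1-byte "silly" segment with a full-sized 1460-byte segment.
    for size in (1, 100, 1460):
        print(f"{size:>5} B payload -> {payload_efficiency(size):.1%} of the packet is data")
```

With a 1-byte payload only 1 byte in 41 is useful data, while a full 1460-byte segment is more than 97% payload.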
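The receiver avoids the problem with Clark's algorithm, whose core idea is to never advertise a small window. When replying with an ACK, if the available receive buffer is too small, it advertises a zero window outright. When the application reads data, a new window size is computed from the free buffer only once the free receive buffer exceeds a certain threshold, and a window update is sent immediately only if that new window size is at least twice the current receive window. The sender avoids the problem with the Nagle algorithm, whose core idea is to coalesce small writes: while previously sent data is still unacknowledged, the sender buffers subsequent small writes and sends them only once they add up to a larger segment or an ACK arrives, reducing the number of small packets on the network. The two work together, one by suppressing the trigger and the other by deliberately delaying, to keep TCP transfer efficient and keep the network from being flooded with small packets.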
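The sender-side rule described above can be sketched as a single decision function. This is a simplified textbook model, not the Linux implementation: it ignores TCP_NODELAY, TCP_CORK, FIN handling and the send window, and the function and parameter names are made up for this illustration.

```python
def nagle_can_send_now(segment_len: int, mss: int, unacked_bytes: int) -> bool:
    """Simplified Nagle decision for one pending chunk of data.

    A full-sized segment is always sent. A small segment is sent only when
    nothing is in flight; otherwise it waits to be merged with later writes
    or released by the next ACK.
    """
    if segment_len >= mss:
        return True
    return unacked_bytes == 0


if __name__ == "__main__":
    mss = 1460
    print(nagle_can_send_now(1460, mss, unacked_bytes=1000))  # full segment
    print(nagle_can_send_now(1, mss, unacked_bytes=1000))     # small, data in flight
    print(nagle_can_send_now(1, mss, unacked_bytes=0))        # small, idle connection
```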
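Basic tests
The sender-side Nagle algorithm was covered in two earlier articles and is not repeated here; this article only studies, through experiments, how the receiver avoids Silly Window Syndrome.
First comes the basic scenario. The script below sets the receive buffer through SO_RCVBUF.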
# cat tcp_sws_001.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000
+0 `sleep 1`
#
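On Linux the kernel doubles the value passed to SO_RCVBUF to leave room for bookkeeping overhead (see socket(7)), so the buffer actually in effect is not the 3000 bytes requested. A minimal sketch for checking what the kernel stored, run outside packetdrill; the helper name `effective_rcvbuf` is made up here, and the value returned depends on the operating system and its sysctl limits.

```python
import socket


def effective_rcvbuf(requested: int) -> int:
    """Set SO_RCVBUF on a fresh TCP socket and return the value read back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # Same request as the packetdrill script; print what the kernel reports.
    print(effective_rcvbuf(3000))
```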
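The tcpdump capture is shown below. After the first data segment arrives, the server replies with an ACK right away (a Quick ACK). After the second data segment arrives, the server's ACK becomes a Delayed ACK, sent a little over 40 ms later.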
# packetdrill tcp_sws_001.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:57:58.192829 tun0 In IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
11:57:58.192936 tun0 Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [S.], seq 1285251059, ack 1, win 2920, options [mss 1460], length 0
11:57:58.203116 tun0 In IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [.], ack 1, win 10000, length 0
11:57:58.213220 tun0 In IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
11:57:58.213249 tun0 Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 1461, win 1460, length 0
11:57:58.213262 tun0 In IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
11:57:58.256181 tun0 Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 2921, win 0, length 0
11:57:59.216422 ? Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [R.], seq 1, ack 2921, win 2920, length 0
11:57:59.216449 ? In IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
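The Quick ACK and Delayed ACK gaps can be read straight off the tcpdump timestamps. The small helper below (the name `gap_ms` is made up for this illustration) computes the delay between each incoming data segment and the ACK that covers it, using the timestamps from the tcp_sws_001.pkt capture.

```python
from datetime import datetime


def gap_ms(earlier: str, later: str) -> float:
    """Return the gap in milliseconds between two tcpdump HH:MM:SS.ffffff stamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() * 1000.0


if __name__ == "__main__":
    # First data segment and its ACK, then second data segment and its ACK.
    first = gap_ms("11:57:58.213220", "11:57:58.213249")
    second = gap_ms("11:57:58.213262", "11:57:58.256181")
    print(f"first ACK after {first:.3f} ms, second ACK after {second:.3f} ms")
```

The first ACK trails its segment by about 0.03 ms and the second by about 42.9 ms, which matches the Quick ACK and Delayed ACK reading of the capture.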
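The next experiments vary how much data the application reads and compare the results. The script is modified as follows, starting with a read of 1460 bytes.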
# cat tcp_sws_002.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000
+0.01 read(4,...,1460) = 1460
+0 `sleep 1`
#
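The tcpdump capture is shown below. After the first data segment the server again replies with a Quick ACK, and the ACK for the second data segment becomes a Delayed ACK. 10 ms later the application reads 1460 bytes; at that point the kernel decides an ACK is needed and sends it with win 1460, meaning the receive window has been updated to 1460.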
# packetdrill tcp_sws_002.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:53:14.992640 tun0 In IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
11:53:14.992715 tun0 Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [S.], seq 3858090574, ack 1, win 2920, options [mss 1460], length 0
11:53:15.002951 tun0 In IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [.], ack 1, win 10000, length 0
11:53:15.013220 tun0 In IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
11:53:15.013303 tun0 Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [.], ack 1461, win 1460, length 0
11:53:15.013334 tun0 In IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
11:53:15.023573 tun0 Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [.], ack 2921, win 1460, length 0
11:53:16.052373 ? Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [R.], seq 1, ack 2921, win 2920, length 0
11:53:16.052406 ? In IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
The script is then modified as follows, changing the read from 1460 bytes to 1459 bytes.
# cat tcp_sws_003.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000
+0.01 read(4,...,1459) = 1459
+0 `sleep 1`
#
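The tcpdump capture is shown below. After the first data segment the server replies with a Quick ACK, and the ACK for the second data segment becomes a Delayed ACK. 10 ms later the application reads 1459 bytes; this time the condition for sending an ACK is not met, so no window update is triggered. The ACK still goes out as a Delayed ACK after a little over 40 ms, and it carries win 0: the receive window is advertised as 0 even though the application has read 1459 bytes and freed that much space.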
# packetdrill tcp_sws_003.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:01:05.644565 tun0 In IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
12:01:05.644642 tun0 Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [S.], seq 1090302568, ack 1, win 2920, options [mss 1460], length 0
12:01:05.654805 tun0 In IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [.], ack 1, win 10000, length 0
12:01:05.664970 tun0 In IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
12:01:05.664998 tun0 Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [.], ack 1461, win 1460, length 0
12:01:05.665013 tun0 In IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
12:01:05.708229 tun0 Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [.], ack 2921, win 0, length 0
12:01:06.688384 ? Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [R.], seq 1, ack 2921, win 2920, length 0
12:01:06.688413 ? In IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
The different outcomes of the two read scenarios above come down to tcp_cleanup_rbuf() and __tcp_select_window(), and in particular to the value computed by new_window = __tcp_select_window(sk).
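- read 1460 scenario
In __tcp_select_window(), because tp->rx_opt.rcv_wscale is 0, window = tp->rcv_wnd, which is 1460;
In tcp_cleanup_rbuf(), new_window is 1460, which satisfies new_window && new_window >= 2 * rcv_window_now (rcv_window_now is 0 here, since the second segment used up the 1460-byte window advertised in the previous ACK). So time_to_ack = true and tcp_send_ack() is called to send an ACK. The Window value in that ACK is also determined by tcp_select_window(), that is, 1460.
- read 1459 scenario
The read differs by only 1 byte, but in __tcp_select_window() free_space is noticeably lower: although it is not below 1/16 of the total receive buffer, it is below mss, so the function returns 0 directly;
In tcp_cleanup_rbuf(), new_window is 0, which does not satisfy new_window && new_window >= 2 * rcv_window_now. So time_to_ack = false, tcp_send_ack() is not called, and the Delayed ACK stays pending. The ACK is finally sent when the delayed-ACK timer expires, and the Window value in it is again determined by tcp_select_window(), that is, 0.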
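The window-update branch of tcp_cleanup_rbuf() discussed in the two scenarios can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel code: it covers only the second `if` block of the function (it ignores the inet_csk_ack_scheduled() branch and the RCV_SHUTDOWN check), the state values (rcv_wup = 1461, rcv_wnd = 1460, rcv_nxt = 2921) are inferred from the captures, and window_clamp = 2920 is an assumed value chosen only so the "is it worth computing" check passes.

```python
def tcp_receive_window(rcv_wup: int, rcv_wnd: int, rcv_nxt: int) -> int:
    """Window still open to the peer: last advertised right edge minus rcv_nxt."""
    return max(rcv_wup + rcv_wnd - rcv_nxt, 0)


def window_update_due(copied: int, rcv_window_now: int,
                      window_clamp: int, new_window: int) -> bool:
    """Model of the 'raised significantly' test after an application read."""
    if copied <= 0:
        return False
    # The kernel skips the window computation when doubling cannot fit the clamp.
    if 2 * rcv_window_now > window_clamp:
        return False
    # Send an ACK only for a non-zero window at least twice the current one.
    return new_window > 0 and new_window >= 2 * rcv_window_now


if __name__ == "__main__":
    # State after both 1460-byte segments arrived and only the first was ACKed.
    now = tcp_receive_window(rcv_wup=1461, rcv_wnd=1460, rcv_nxt=2921)
    print("rcv_window_now:", now)
    # new_window values are the ones reported in the analysis: 1460 and 0.
    print("read 1460:", window_update_due(1460, now, window_clamp=2920, new_window=1460))
    print("read 1459:", window_update_due(1459, now, window_clamp=2920, new_window=0))
```

In this model the only thing separating the two scenarios is whether new_window is non-zero, because rcv_window_now is 0 in both.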
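The scenarios above show how different read sizes lead to different behavior: one triggers an immediate window-update ACK, the other stays with the Delayed ACK. The next scenario continues from the case where a small read follows the Delayed ACK, and simulates an ACK triggered by a TCP ZeroWindow Probe, which is in effect a TCP ZeroWindow Probe ACK.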
# cat tcp_sws_004.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000
+0.1 read(4,...,1000) = 1000
+0.01 < . 2920:2920(0) ack 1 win 10000
+0 `sleep 1`
#
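The tcpdump capture is shown below. After the first data segment the server replies with a Quick ACK, and the ACK for the second data segment becomes a Delayed ACK, sent a little over 40 ms later. The server then reads 1000 bytes; as in the previous experiment, the condition is not met and no window-update ACK is triggered.
After that, the client sends a TCP ZeroWindow Probe, which makes the server respond with an ACK whose Window value is 0. This step has nothing to do with tcp_cleanup_rbuf(): the ACK is sent in reply to the probe. Sending that ACK still goes through tcp_select_window() and __tcp_select_window(), which again decide based on free_space: although it is not below 1/16 of the total receive buffer, it is below mss, so 0 is returned.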
# packetdrill tcp_sws_004.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:06:10.232572 tun0 In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
16:06:10.232600 tun0 Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [S.], seq 2494482824, ack 1, win 2920, options [mss 1460], length 0
16:06:10.242747 tun0 In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [.], ack 1, win 10000, length 0
16:06:10.252826 tun0 In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
16:06:10.252850 tun0 Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 1461, win 1460, length 0
16:06:10.252870 tun0 In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
16:06:10.296221 tun0 Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 2921, win 0, length 0
16:06:10.362940 tun0 In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [.], ack 1, win 10000, length 0
16:06:10.362958 tun0 Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 2921, win 0, length 0
16:06:11.367787 ? Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [R.], seq 1, ack 2921, win 2920, length 0
16:06:11.367817 ? In IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [R.], seq 2920, ack 1, win 10000, length 0
#
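The zero-window answer to the probe follows from the small-free-space check in __tcp_select_window() described above. The sketch below is a heavily simplified model of that check for the rcv_wscale = 0 case, written from the description in this article rather than copied from the kernel. The function name `select_window` is made up, and every free_space and allowed_space number is hypothetical: the real values depend on skb overhead and sysctl settings and are not visible in the capture.

```python
def select_window(free_space: int, allowed_space: int, mss: int, rcv_wnd: int) -> int:
    """Simplified receiver-side SWS avoidance for a connection without window scaling."""
    # Too little room: advertise zero rather than a tiny window.
    if free_space < (allowed_space >> 4) or free_space < mss:
        return 0
    window = rcv_wnd
    # Move the window only if it no longer fits or lags by a full MSS,
    # and then only in whole-MSS steps.
    if window <= free_space - mss or window > free_space:
        window = (free_space // mss) * mss
    return window


if __name__ == "__main__":
    mss = 1460
    allowed = 2920  # hypothetical total usable receive space
    # Hypothetical free space below one MSS, as after the 1000-byte read.
    print(select_window(free_space=1000, allowed_space=allowed, mss=mss, rcv_wnd=0))
    # Hypothetical free space of exactly one MSS, as after the 1460-byte read.
    print(select_window(free_space=1460, allowed_space=allowed, mss=mss, rcv_wnd=1460))
```

The point of the model is that the answer does not depend on who asked: a read, a delayed-ACK timer or a zero-window probe all end up with win 0 while free space stays under one MSS.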
Code reference
/* Clean up the receive buffer for full frames taken by the user,
 * then send an ACK if necessary.  COPIED is the number of bytes
 * tcp_recvmsg has given to the user so far, it speeds up the
 * calculation of whether or not we must ACK for the sake of
 * a window update.
 */
void tcp_cleanup_rbuf(struct sock *sk, int copied)
{
    struct tcp_sock *tp = tcp_sk(sk);
    bool time_to_ack = false;

    struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);

    WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
         "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
         tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

    if (inet_csk_ack_scheduled(sk)) {
        const struct inet_connection_sock *icsk = inet_csk(sk);

        if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
            tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
            /*
             * If this read emptied read buffer, we send ACK, if
             * connection is not bidirectional, user drained
             * receive buffer and there was a small segment
             * in queue.
             */
            (copied > 0 &&
             ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
              ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
               !inet_csk_in_pingpong_mode(sk))) &&
              !atomic_read(&sk->sk_rmem_alloc)))
            time_to_ack = true;
    }

    /* We send an ACK if we can now advertise a non-zero window
     * which has been raised "significantly".
     *
     * Even if window raised up to infinity, do not send window open ACK
     * in states, where we will not receive more. It is useless.
     */
    if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
        __u32 rcv_window_now = tcp_receive_window(tp);

        /* Optimize, __tcp_select_window() is not cheap. */
        if (2*rcv_window_now <= tp->window_clamp) {
            __u32 new_window = __tcp_select_window(sk);

            /* Send ACK now, if this read freed lots of space
             * in our buffer. Certainly, new_window is new window.
             * We can advertise it now, if it is not less than current one.
             * "Lots" means "at least twice" here.
             */
            if (new_window && new_window >= 2 * rcv_window_now)
                time_to_ack = true;
        }
    }
    if (time_to_ack)
        tcp_send_ack(sk);
}