WAN Latency and Its Effect on TCP

Network Latency and Performance

Network latency is the time it takes for a packet to get from one end of the circuit to the other. We measure latency as RTT (round trip time) in milliseconds.

Latency is largely determined by the physical distance between the two endpoints of a communication, plus any queuing and processing delay along the path.

By default, many systems are tuned for high-speed, low-latency LAN environments and suffer from poor performance when connecting across high-latency links. This is usually most noticeable with large data transfers.

Fortunately, TCP can be tuned on the host systems to improve large data transfer performance over high latency links.

BDP and TCP Buffers

BDP (Bandwidth Delay Product) measures the amount of data that would "fill the pipe"; it is the buffer space required at the sender and receiver to obtain maximum throughput on the TCP connection over the path.

BDP (bytes) = [bandwidth in bits per second] * RTT (in seconds) / 8

The TCP window is the amount of data that can be sent before the receiver is required to send an acknowledgment. The window size is linked to the send and receive buffers.

With a window size smaller than the BDP, time slots are inefficiently used and the pipe is never filled, reducing throughput.

The BDPs of two example circuits are:

DOCSIS1: (22 Mbps * 0.009 s) / 8 = 0.02 MB
DSL1:    (16 Mbps * 0.051 s) / 8 = 0.10 MB
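
As a quick sanity check, the same arithmetic can be run through bc, in the same style as the sample script at the end of this page; the circuit names and values below are simply the two examples above.

#!/bin/bash
# BDP sketch: bandwidth in Mbps and RTT in ms give the BDP in megabytes
for circuit in "DOCSIS1 22 9" "DSL1 16 51"; do
    set -- $circuit
    bdp=$( echo "scale=3; $2 * ( $3 / 1000 ) / 8" | bc )
    echo "$1 BDP = $bdp MB"
done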

Tuning TCP – RFC1323

RFC 1323 - TCP Extensions for High Performance

TCP window scaling beyond 64KB – the TCP header has a 16-bit field to report the receive window to the sender. Therefore, the largest window that can be used is 65,535 bytes. The “Window Scale” option defines an implicit scale factor, which is used to multiply the window size in the header. With this option, the largest window size is 1,073,741,823 bytes, or 1GB.

Timestamps – the Timestamps option measures the RTT of nearly every segment, including retransmissions, so that TCP can more efficiently handle duplicate packets, holes in the TCP window, wrapped sequence numbers, and other packet flow issues.
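
On Linux, both RFC 1323 features can be checked and, if needed, enabled through sysctl. A minimal sketch (run as root; add the settings to /etc/sysctl.conf to make them persistent):

# 1 = enabled
sysctl net.ipv4.tcp_window_scaling
sysctl net.ipv4.tcp_timestamps
# Enable them for the running kernel
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_timestamps=1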

Tuning TCP – Buffer sizes

While RFC1323 will allow large windows to be negotiated between hosts, a small initial (default) window size will result in a saw-tooth effect on overall throughput. This is because network congestion will cause TCP to “slow-start”, where it resets the window to the default size and slowly increases it again.

By increasing the default buffer size closer to the BDP of the path, you “slow start” at a more efficient rate.

The maximum TCP window size can be 1GB with RFC1323 enabled. However, for most environments it is recommended to use a fraction of the BDP; if the window size gets too large, the sender can overrun buffers on the receiver. This can cause out-of-order and lost packets, resulting in decreased performance.

It is also important to remember that buffers consume system memory. A higher than optimal size will result in wasted memory and can potentially cause resource contention.

Throughput can never exceed the window size divided by the RTT.
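
On Linux, the socket buffer limits live in a handful of sysctls. A minimal sketch of checking and raising them; the 16 MB maximum is only an illustrative value, and the 262144-byte default matches the "commonly suggested" buffer used in the formula recap below. Size these against your own BDP:

# Per-socket TCP buffer limits: min, default, max (bytes)
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
# Raise the maximums and set a larger default (example values only)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 262144 16777216"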

Tuning TCP – SACK

RFC2018 SACK (Selective Acknowledgments)

With SACK enabled, when a packet or series of packets is dropped, the receiver can inform the sender of exactly which data has been received and where the holes in the data are. The sender can then selectively retransmit the missing data without needing to retransmit blocks of data that have already been received successfully.
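
On Linux, SACK (and the related D-SACK extension) is normally enabled by default; a quick check with sysctl:

# 1 = enabled
sysctl net.ipv4.tcp_sack
sysctl net.ipv4.tcp_dsack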

Tuning TCP – PMTU

RFC1191 PMTU (Path MTU Discovery)

Historically, UNIX systems used the lesser of 576 bytes and the first-hop MTU as the PMTU.

PMTU discovery works by setting the Don't Fragment (DF) bit in the IP header. Initially, packets are sent using the first-hop MTU. When a router along the path has a smaller next-hop MTU, it drops the packet and sends back an ICMP “fragmentation needed” message. The host then lowers the PMTU and resends the packet.

Periodically the host will increase the PMTU and set the DF bit to perform a new discovery.
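
To watch path MTU discovery in action on Linux, tracepath reports the discovered PMTU hop by hop, and ping can send DF-marked probes of a chosen size. A small sketch (the host name is only a placeholder):

# Report the path MTU to a remote host
tracepath download.example.org
# Send a full 1500-byte frame (1472-byte payload + 28 bytes of IP/ICMP
# headers) with DF set; replies reporting "Frag needed" indicate a
# smaller MTU along the path
ping -c 3 -M do -s 1472 download.example.org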

How much throughput can I get?

With all of these options turned on, and buffer sizes adjusted for optimum performance, a considerable increase from default values can be realized. Note that protocol overhead and network congestion may keep real-world throughput below the theoretical maximum.

The formula for determining the window-limited maximum throughput is ((buffer_size * 8) / RTT). So for our 22M circuit with 9ms latency, 2MB committed to the send/recv TCP buffers gives a window-limited ceiling of roughly 1.8 Gbps, far above what the circuit itself can carry, so the window is no longer the limiting factor. Again, this is a theoretical number and does not account for protocol overhead, link congestion or overhead from stateful network devices.
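
The arithmetic, in the same bc style as the sample script at the end of this page (2MB taken here as 2,000,000 bytes):

# Window-limited ceiling in Mbps: buffer (bytes) * 8 / RTT (s) / 10^6
echo "scale=1; 2000000 * 8 / 0.009 / 1000000" | bc
# prints 1777.7, i.e. roughly 1.8 Gbps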

Since 2MB buffers are potentially excessive for busy servers and may limit the number of active connections, a more realistic value such as 256-512KB should be used. While this may result in marginally slower throughput, the memory trade-off could be critical.

Effect of MTU over the WAN

The performance of TCP over wide area networks has been extensively studied and modeled. One paper by Matt Mathis et al. explains how TCP throughput has an upper bound based on the following parameters:

MSS = MTU – (Header Size: Typically 40, 52, or 64)
Throughput <= ~0.7 * MSS / (rtt * sqrt(packet_loss))

The model predicts the bandwidth of a sustained TCP connection subjected to light to moderate packet losses, such as loss caused by network congestion. It assumes that TCP avoids retransmission timeouts and always has sufficient receiver window and sender data.

Example: Round Trip Time (rtt) to my local mirror is about 12 msec, and let's say packet loss is 0.1% (0.001). With an MTU of 1500 bytes (MSS of 1460), TCP throughput will have an upper bound of about 21 Mbps. And no, that is not a window size limitation, but rather one based on TCP's ability to detect and recover from congestion (loss). With 9000 byte frames, TCP throughput could reach about 130 Mbps.
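
Those figures fall straight out of the formula with bc (bc -l is used for sqrt, matching the sample script below); the MSS values of 1460 and 8960 assume a 40-byte header:

# 1500-byte MTU (MSS 1460), 12 ms RTT, 0.1% loss -> Mbps
echo "scale=6; 0.7 * 1460 * 8 / ( 0.012 * sqrt(0.001) ) /1000/1000" | bc -l
# -> about 21.5
# 9000-byte MTU (MSS 8960), same path -> Mbps
echo "scale=6; 0.7 * 8960 * 8 / ( 0.012 * sqrt(0.001) ) /1000/1000" | bc -l
# -> about 132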

Formula Recap

BDP (bytes) = [bandwidth in bits per second] * RTT (in seconds) / 8

MSS = MTU – [Header_Size] (40, 52 or 64)

Max throughput is the smallest value given by these formulas:

1) ~98% * [Bandwidth]
2) [Buffer_Size] * 8 / RTT
3) ~0.7 * MSS / (RTT * sqrt(packet_loss))

The examples below use the following buffer sizes and MTUs:

2a) 17kB buffer  = [ 17520 * 8 / RTT ]  (Windows default size)
2b) 262kB buffer = [ 262144 * 8 / RTT ] (Commonly suggested value)
2c) 4M buffer    = [ 4194304 * 8 / RTT ] (Linux)
3a) MTU = 1500B = [ 0.7*(1500-52)*8/(RTT*(0.001**0.5)) ]
3b) MTU = 9000B = [ 0.7*(9000-52)*8/(RTT*(0.001**0.5)) ]

Examples – Putting it all together

Example 1:  DOCSIS1 = 22 Mbps / RTT = 9ms (0.009s) / Loss = 0.1%

1)  0.98 * 22 = 22 Mbps
2a) 17kB buffer: 15.6 Mbps
2b) 262kB buffer: 233 Mbps
2c) 4M buffer: 3.73 Gbps
3a) default MTU: 28.5 Mbps
3b) 9k MTU: 176 Mbps

Max throughput =  15.6 Mbps (default) or 22 Mbps (fully tuned)

Example 2:  DSL1 = 30 Mbps / RTT = 51ms (0.051s) / Loss = 0.1%

1)  0.98 * 30 = 29 Mbps
2a) 17kB buffer: 2.75 Mbps
2b) 262kB buffer: 41.1 Mbps
2c) 4M buffer: 658 Mbps
3a) default MTU: 5.03 Mbps
3b) 9k MTU: 31.1 Mbps

Max throughput =  2.75 Mbps (default) or 29 Mbps (fully tuned)

Example 3: UMTS/HSPA (3G) = 3.6 Mbps / RTT = 133ms (0.133s) / Loss = 0.2%

1)  0.98 * 3.6 = 3.5 Mbps
2a) 17kB buffer: 1.05 Mbps
2b) 262kB buffer: 15.8 Mbps
2c) 4M buffer: 252 Mbps
3a) default MTU: 1.36 Mbps
3b) 9k MTU: 8.42 Mbps

Max throughput = 1.05 Mbps (default) or 3.5 Mbps (fully tuned)

Sample Script

#!/bin/bash
#
if [ $# -ne 3 ]; then
echo "Usage: $0 [BW Mbps] [RTT ms] [% loss]";
exit 1
fi
#
BANDWIDTH=$1                                # link bandwidth in Mbps
RTT=$( echo "scale=6;  $2 / 1000" |bc )     # RTT given in ms, converted to seconds
LOSS=$( echo "scale=6;  $3 / 100" |bc )     # loss given in percent, converted to a fraction
#
# simplemath: evaluate a floating-point comparison with awk;
# returns success (exit 0) when the expression is true
simplemath () {
    echo "" | awk 'END { exit ( !( '"$1"')); }'
}
#
SCALE=4  # bc output precision; adjust if required
# 1) 98% of the raw link bandwidth
test1=$( echo "scale=$SCALE; 0.98 * $BANDWIDTH" |bc )
MAX1=$test1; MAX2=$test1
# 2a-2c) window-limited throughput (buffer * 8 / RTT) for 17kB, 262kB and 4MB buffers
test2a=$( echo "scale=$SCALE; 17520 * 8 / $RTT /1000/1000" |bc )
simplemath "$MAX1 > $test2a" && MAX1=$test2a
test2b=$( echo "scale=$SCALE;  262144 * 8 / $RTT /1000/1000" |bc )
test2c=$( echo "scale=$SCALE;  4194304 * 8 / $RTT /1000/1000" |bc )
simplemath "$MAX2 > $test2c" && MAX2=$test2c
# 3a/3b) loss-limited (Mathis) throughput for 1500- and 9000-byte MTUs
test3a=$( echo "scale=$SCALE; 0.7 * ( 1500-52 ) * 8 / ( $RTT * sqrt($LOSS) ) /1000/1000" |bc -l)
simplemath "$MAX1 > $test3a" && MAX1=$test3a
test3b=$( echo "scale=$SCALE; 0.7 * ( 9000-52 ) * 8 / ( $RTT * sqrt($LOSS) ) /1000/1000" |bc -l)
simplemath "$MAX2 > $test3b" && MAX2=$test3b
#
echo "1) $test1 Mbps"  
echo "2a) 17kB buffer: $test2a Mbps"
echo "2b) 262kB buffer: $test2b Mbps"
echo "2c) 4M buffer: $test2c Mbps"
echo "3a) default MTU: $test3a Mbps"
echo "3b) 9k MTU: $test3b Mbps"
#
echo ""
echo "Max throughput = $MAX1 Mbps (default) or $MAX2 Mbps (fully tuned)"
#EOF
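
Saved as, say, tcp-throughput.sh (the file name is arbitrary), the script reproduces the figures in the examples above, up to rounding. Example 1, for instance:

chmod +x tcp-throughput.sh
./tcp-throughput.sh 22 9 0.1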

References

Widely used and accepted TCP tuning guide http://www.psc.edu/networking/projects/tcptune/

Large TCP windows and timestamps RFC http://www.faqs.org/rfcs/rfc1323.html

PMTU (Path MTU Discovery) RFC http://www.faqs.org/rfcs/rfc1191.html

SACK (Selective Acknowledgments) RFC http://www.faqs.org/rfcs/rfc2018.html

The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm http://www.psc.edu/networking/papers/model_abstract.html