TCP -- Either Fast or Efficient

Noah Davids

TCP/IP can be a very inefficient protocol. The TCP and IP headers require a minimum of 40 bytes, so packets carrying only a few bytes of data (for example, the keystrokes of a telnet or rlogin session, or a small bank transaction) have extremely high overhead. A TCP acknowledgement packet may contain no data at all, making the entire packet overhead. Ethernet imposes an additional 14-byte header and a 4-byte trailer, increasing the overhead to 58 bytes. Because the minimum-sized Ethernet frame is 64 bytes, there can also be up to an additional 6 bytes of padding. To add insult to injury, Ethernet also requires an 8-byte preamble and a gap between frames that is equivalent to 12 bytes. Thus, sending 1 to 6 bytes of data effectively requires an 84-byte Ethernet frame.
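To make the arithmetic concrete, here is the byte accounting for a frame carrying 1 to 6 bytes of data (the 46-byte minimum Ethernet payload is what forces the padding):

preamble              8 bytes
Ethernet header      14 bytes
IP header            20 bytes
TCP header           20 bytes
data plus padding     6 bytes  (to reach the 46-byte minimum payload)
Ethernet trailer      4 bytes
interframe gap       12 bytes
total                84 bytes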

TCP uses two strategies to reduce this overhead, the Nagle algorithm and delayed acknowledgment. Both these strategies reduce overhead by reducing the number of packets with only a few bytes of data. However, nothing is free, and the cost of reduced overhead is, in this instance, reduced performance. In this article, I will describe these two strategies, explain how they impact performance, and describe what you can do about them.

The Nagle Algorithm

The Nagle algorithm is documented in RFC 896 ("Congestion Control in IP/TCP Internetworks") and clarified in section 4.2.3.4, "When to Send Data", of RFC 1122 ("Requirements for Internet Hosts -- Communication Layers"). Basically, the RFC says that as long as there is any unacknowledged data, TCP should buffer additional data (i.e., not send it) until it has a full maximum segment size (MSS) worth of bytes to send. For example, if the MSS is 1460 bytes and my application writes 10-byte blocks, the TCP stack transmits the first 10 bytes immediately. However, the TCP stack holds subsequent 10-byte writes until it has accumulated 1460 bytes or until the acknowledgment of the first 10 bytes arrives.
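In rough terms, here is the sender-side decision the RFC describes, written as a C-style sketch (this is an illustration of the rule, not actual stack code, and the names are made up):

/* Sketch of the Nagle send decision, not actual stack code. */
if (bytes_buffered >= mss || unacked_bytes == 0)
   send_segment ();    /* a full segment, or the connection is idle */
else
   keep_buffering ();  /* wait for an ACK or a full MSS of data */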

The MSS can be sent as a TCP option in the initial packet that sets up the TCP connection. It tells the other host the maximum number of bytes the sender will accept in one packet. If the sender does not send an MSS, the receiver should assume a value of 536. However, not all stacks do this. Some stacks assume the local MTU minus 40 for local connections and 536 for non-local connections. The MTU, or maximum transmission unit, is the maximum amount of data that the interface can send in one frame. For Ethernet, the MTU is 1500 (1518 if you count the Ethernet header and trailer), so the MSS is 1460 (1500 minus the IP and TCP header sizes). A local connection means that the receiver has an interface connected to the same network as the sender. Other stacks attempt to determine the MTU for the connection dynamically: they send packets of decreasing size until they get a response, then calculate the MSS as MTU minus 40.

As long as the sending application generates enough data to create packets of MSS bytes, the Nagle algorithm does not significantly reduce performance. But what if the application does not generate enough data? Traces 1 and 2 were generated by a client application that sends 500 bytes of data to a server. The server does not respond until all the data is received (the server's response is not shown in the traces). In trace 1, the client generates 10 bytes of data every 10 ms, taking 490 ms to generate all of the data. In trace 2, the client generates 10 bytes of data as fast as it can, taking about 9 ms. The traces were taken on the server system listening to port 1234 (see Figures 1 and 2).
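For reference, a minimal sketch of a client like the one used for trace 1 might look like the following (the server address, the use of usleep() for the 10-ms pacing, and the omission of error checking are my assumptions, not the original test code):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main (void)
   {
   struct sockaddr_in server;
   char chunk[10];
   int i, sd;

   memset (chunk, 'x', sizeof (chunk));
   memset (&server, 0, sizeof (server));
   server.sin_family = AF_INET;
   server.sin_port = htons (1234);                   /* server's port */
   server.sin_addr.s_addr = inet_addr ("10.1.1.1");  /* hypothetical server */

   sd = socket (AF_INET, SOCK_STREAM, 0);
   connect (sd, (struct sockaddr *) &server, sizeof (server));

   for (i = 0; i < 50; i++)    /* 50 writes of 10 bytes = 500 bytes */
      {
      write (sd, chunk, sizeof (chunk));
      usleep (10000);          /* 10-ms pacing; remove for the trace 2 behavior */
      }
   close (sd);
   return 0;
   }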

In trace 1, there are 5 data packets from the client, each with 78 bytes of overhead (8-byte Ethernet preamble, 14-byte Ethernet header, 20-byte IP header, 20-byte TCP header, 4-byte Ethernet trailer, and 12-byte interframe gap), for 500 bytes of data: a ratio of (5 * 78)/500, or 78%. If we add in the 5 acknowledgement packets from the server (each with an additional 6 bytes to pad the Ethernet frame out to the minimum length), we get ((5 * 78) + (5 * 84))/500, or 162%. The delay is also significant. Remember, it took 490 ms for the application to generate and write the data, but it took (748 - 94) = 654 ms to send it, a delay of (654 - 490)/490 = 33%. As you can see from trace 2, generating the data faster leads to fewer packets, because all the data is available by the time the Nagle algorithm allows the second data packet to be sent. The client-side overhead is only (2 * 78)/500, or 31%. Adding in the server's ACKs gives us ((2 * 78) + (2 * 84))/500, or about 65%. However, the delay in actually sending the data is significantly larger: (728 - 527) = 201 ms, a delay of (201 - 9)/9 = 2133%.

To address the problem of unacceptable delay, the application programmer can use the TCP_NODELAY socket option to turn the Nagle algorithm on or off. By default, Nagle is usually on; to turn it off, set TCP_NODELAY to 1. The following code fragment turns the Nagle algorithm off:

/* Requires <stdio.h>, <stdlib.h>, <errno.h>, <sys/socket.h>,
   <netinet/in.h>, and <netinet/tcp.h>; sd is a connected TCP socket. */
int nodelay = 1;
int err;

err = setsockopt (sd, IPPROTO_TCP, TCP_NODELAY,
                  &nodelay, sizeof (nodelay));
if (err < 0)
   {
   printf ("Error setting TCP_NODELAY - errno = %d\n", errno);
   exit (errno);
   }

Trace 3 (Figure 3) was generated with the same application as trace 1, but with Nagle turned off. Similarly, trace 4 (Figure 4) was generated with the same application as trace 2.

As seen in the figures, the overhead in terms of number of bytes is significantly larger in these two traces. In both cases, there are 50 data packets. Because trace 3 takes longer to send, there are also more acknowledgment frames. Note, however, that there is no time delay.

Obviously, the decision to turn off the Nagle algorithm must be made after taking into account the available bandwidth of the network between the client and server. On a lightly used LAN, turning off Nagle might be very reasonable. On a heavily used LAN, maybe not. Or perhaps the solution is to turn off Nagle but also reduce bandwidth usage by subnetting the network or moving to a switched environment. On a WAN or a VPN over the Internet, you may need to do some research to determine whether turning off Nagle is reasonable. Remember that if a packet is dropped due to router congestion (a risk that using Nagle significantly decreases), the application must wait for the TCP retransmit timer, which typically degrades performance far more than the delay imposed by the Nagle algorithm. Also remember that an application may be used in many environments, both on a lightly used LAN and on a highly congested WAN, so if the application does include code to turn off the Nagle algorithm, it's best to make that a configurable option.

There are some less drastic alternatives. Instead of the client making 50 writes of 10 bytes each, it could make 1 write of 500 bytes. With only 1 write, there is no unacknowledged data to wait on, so the entire message goes out immediately in a single packet. This example of 50 10-byte writes is extreme, but I have seen applications make 2 or 3 writes of 50 or 100 bytes each instead of 1 write containing all the data.
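A minimal sketch of this buffering approach follows; the message pieces (hdr and body) and the 500-byte size are hypothetical, and the point is simply that the application assembles the whole message before calling write():

#include <string.h>
#include <unistd.h>

/* Assemble the pieces of a message into one buffer and hand the
   whole thing to TCP in a single write. The pieces are hypothetical. */
static void send_message (int sd, const char *hdr, size_t hlen,
                          const char *body, size_t blen)
   {
   char msg[500];
   size_t off = 0;

   if (hlen + blen > sizeof (msg))
      return;                 /* message too big for this sketch */
   memcpy (msg + off, hdr, hlen);
   off += hlen;
   memcpy (msg + off, body, blen);
   off += blen;
   write (sd, msg, off);      /* one write, so one immediate packet
                                 on an otherwise idle connection */
   }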

What about adding padding so that all messages are MSS bytes in length? Obviously, adding more than 78 bytes of padding would be worse than not using Nagle; but if the messages were already close to MSS in size, adding a few bytes seems reasonable. Unfortunately, this strategy is complicated because the MSS that a server advertises may change depending on the OS (and OS version) of the server. It may also change according to whether the client and server are on the same subnet. Also, the application cannot determine the MSS, at least not by querying the TCP stack. So, although this sounds reasonable, I don't think it's practical.

Nagle for the Sys Admin

Only the application developers can implement the above strategies. If the developers do not provide a way to turn off the Nagle algorithm or buffer their own writes, what can a systems administrator do?

On Solaris, there is a kernel parameter named tcp_naglim_def. This is the default number of bytes to buffer. Each connection has its own copy of this value, which is set to the minimum of the MSS for the connection and the default value. When the application sets the TCP_NODELAY socket option, it changes the connection's copy of this value to 1.

Changing the value of tcp_naglim_def to 1 will have the same effect (on connections established after the change) as if each application set the TCP_NODELAY option. In trace 5 (Figure 5), I have changed the default to 30, a compromise that reduces the number of packets from 53 to 21 but does not lengthen the sending time.

The current value of tcp_naglim_def can be displayed with the command:

ndd -get /dev/tcp tcp_naglim_def

and can be changed with the command:

ndd -set /dev/tcp tcp_naglim_def 30

HP-UX 11.x has the same tuning parameter and uses the same commands to display and change it. However, according to the HP-UX man page, changing the value is not supported.

As stated above, changing the tcp_naglim_def value will affect all TCP applications. While it might be appropriate to turn Nagle off for one application, it might not be appropriate for another. In fact, it might create so much congestion that the performance of the application you are trying to improve drops, or the performance of other applications deteriorates to unacceptable levels. This is not something to do without understanding your entire network (both local and remote) as well as the applications running on your system. A fair amount of experimentation will probably be needed as well.

Delayed Acknowledgement

So, is there something the systems administrator can do on the receiving side? There is, which leads me to the second strategy that TCP uses to reduce overhead -- the delayed acknowledgement or delayed ACK.

As shown in the sample traces, ACKs are not sent after every packet is received. In some cases, the ACKs can be delayed by a significant amount. RFC 813 ("Window and Acknowledgment Strategy in TCP") first discusses delayed ACKs, and section 4.2.3.2, "When to Send an ACK Segment", of RFC 1122 clarifies them. Briefly, the RFC states that there should be an ACK for at least every second full-sized segment (i.e., a segment with MSS bytes of data in it); other than that, ACKs can be delayed for up to 500 ms. In the traces above, the ACKs are delayed for 200 ms. Delaying ACKs gives the application time to read the data and send a reply. The acknowledgement can then be piggy-backed on the same packet as the reply, and the overhead of sending a packet with just the acknowledgement is eliminated. This only works if the receiver has a reply; if not, the delayed ACK timer controls when the ACKs are sent.
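Written in the same C-style sketch as the Nagle fragment above (again an illustration of the rule with made-up names, not actual stack code), the receiver-side decision looks something like this:

/* Sketch of the delayed ACK decision, not actual stack code. */
if (unacked_full_segments >= 2)
   send_ack_now ();              /* ACK at least every second full segment */
else if (reply_is_ready)
   send_reply_with_ack ();       /* piggy-back the ACK on the reply */
else
   start_delayed_ack_timer ();   /* ACK when the timer (200 ms here) fires */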

Reducing the delayed ACK timer speeds things up by reducing the time that the Nagle algorithm must wait for an acknowledgment. Traces 6 and 7 were generated with the same clients as traces 1 and 2; however, the delayed ACK timer was changed from 200 ms to 50 ms (see Figures 6 and 7). There are more ACK packets from the server, but the overhead is certainly less than running without the Nagle algorithm. See Table 1 for a summary of overhead and transmission delay for each trace.

Unfortunately, not all TCP stacks allow the systems administrator to change the delayed ACK timer, and, like tcp_naglim_def, changing it affects all TCP connections. So again, understand your network and test the effects on all your applications.

On Solaris systems, the timer value is held in the kernel parameter named tcp_deferred_ack_interval. The current value can be retrieved with the command:

ndd -get /dev/tcp tcp_deferred_ack_interval

It can be changed with the command:

ndd -set /dev/tcp tcp_deferred_ack_interval 60

where "60" is the new value in milliseconds. On Solaris 8, I think that the default value is 100; in earlier releases, it was 50.

HP-UX 11.0 has the same kernel parameter, and as with tcp_naglim_def, changing this value is not supported. The default value is 50, so there may not be as great a need to change it. (See the sidebar for information about Windows 2000.)

Conclusion

In summary, the Nagle algorithm and delayed ACKs are used by TCP to reduce overhead and congestion. They have the unfortunate effect of delaying transmissions across the network. A well-thought-out application may be able to reduce overhead and congestion without incurring any delays by buffering its own writes to the network. If the data flow is such that this is not possible, the application can turn the Nagle algorithm off for its own connection. Application programmers should make this a configurable option so that systems administrators can choose between the delay imposed by the Nagle algorithm and the network congestion caused by turning the algorithm off. If an application does not provide such an option, some TCP stacks, including those of Solaris and HP-UX, can turn Nagle off by default for all applications. Those stacks also allow the delayed ACK timer to be changed; shortening it reduces the delaying effects of the Nagle algorithm.

Noah Davids works in the Customer Assistance Center of Stratus Computer Inc. He specializes in the diagnosis and correction of LAN problems. He can be reached at: Noah_Davids@stratus.com.