Either Fast or Efficient
TCP/IP can be a very inefficient protocol. The TCP and IP headers
require a minimum of 40 bytes. Packets with only a few bytes of
data, for example a telnet or rlogin session or a
bank transaction, have extremely high overhead. A TCP acknowledgement
packet may contain no data at all, making the entire packet overhead.
Ethernet additionally imposes a 14-byte header and a 4-byte trailer
for an increased overhead of 58 bytes. Because the minimum-sized
Ethernet frame is 64 bytes, there can also be up to an additional
6 bytes of overhead. To add insult to injury, Ethernet also requires
an 8-byte preamble and a gap between each frame that is equivalent
to 12 bytes. Thus, sending 1 to 6 bytes of data effectively requires
an 84-byte Ethernet frame.
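As a rough illustration, the arithmetic above can be written as a small C helper. The constant and function names are assumptions made for this sketch, and it assumes IP and TCP headers without options:

```c
#include <stddef.h>

/* Sizes in bytes, taken from the figures quoted in the text. */
enum {
    TCP_IP_HEADERS = 40,  /* 20-byte IP header + 20-byte TCP header, no options */
    ETH_HEADER     = 14,
    ETH_TRAILER    = 4,
    ETH_MIN_FRAME  = 64,  /* header + payload + trailer */
    ETH_PREAMBLE   = 8,
    ETH_INTERFRAME = 12   /* gap between frames, expressed in bytes */
};

/* Bytes of wire time used by one TCP segment carrying `payload` data bytes. */
size_t wire_bytes(size_t payload)
{
    size_t frame = ETH_HEADER + TCP_IP_HEADERS + payload + ETH_TRAILER;

    if (frame < ETH_MIN_FRAME)
        frame = ETH_MIN_FRAME;          /* short frames are padded */
    return ETH_PREAMBLE + frame + ETH_INTERFRAME;
}

/* Everything on the wire that is not application data. */
size_t wire_overhead(size_t payload)
{
    return wire_bytes(payload) - payload;
}
```

With these definitions, any payload from 0 to 6 bytes costs 84 bytes on the wire, and a 10-byte payload carries 78 bytes of overhead.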
TCP uses two strategies to reduce this overhead: the Nagle algorithm
and delayed acknowledgment. Both strategies reduce overhead
by reducing the number of packets with only a few bytes of data.
However, nothing is free, and the cost of reduced overhead is, in
this instance, reduced performance. In this article, I will describe
these two strategies, explain how they impact performance, and describe
what you can do about them.
The Nagle Algorithm
The Nagle algorithm is documented in RFC 896 ("Congestion
Control in IP/TCP Internetworks") and clarified in section
4.2.3.4 "When to Send Data" of RFC 1122 ("Requirements
for Internet Hosts -- Communication Layers"). Basically,
the RFC says that, as long as there is any unacknowledged data, TCP
should buffer new data (i.e., not send it) until it can send a packet
containing the maximum segment size (MSS) number of bytes. For example,
if the MSS is 1460 bytes and my application writes 10-byte blocks,
then when the application writes the first 10 bytes, the TCP stack
transmits them immediately. However, the TCP stack holds subsequent
10-byte writes until it has accumulated a total of 1460 bytes or until
the acknowledgment for the first 10 bytes arrives.
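The rule can be sketched as a single test. This is a simplified model, not code from any real TCP stack: the function and parameter names are hypothetical, and it ignores the receive window and every other condition a real stack checks before sending.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified Nagle test: may the sender transmit now?
 *   queued  - bytes the application has written that TCP has not yet sent
 *   unacked - bytes sent earlier that have not been acknowledged
 *   mss     - maximum segment size for the connection
 */
bool nagle_may_send(size_t queued, size_t unacked, size_t mss)
{
    if (queued == 0)
        return false;        /* nothing to send */
    if (queued >= mss)
        return true;         /* a full-sized segment is ready */
    return unacked == 0;     /* small segment: only if nothing is in flight */
}
```

In the 10-byte example, the first write passes the test because nothing is in flight; later writes fail it until either 1460 bytes are queued or the outstanding data is acknowledged.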
The MSS can be sent as a TCP option in the initial packet that
sets up the TCP connection. It tells the other host what the sender
will accept as the maximum number of bytes in one packet. If the
sender does not send an MSS, then the receiver should assume a value
of 536. However, not all stacks do this. Some stacks assume the
local MTU minus 40 for local connections and 536 for non-local connections.
The MTU is the maximum transmission unit or the maximum amount of
data that the interface can send in 1 frame. For Ethernet, the MTU
is 1500 (1518 if you count the Ethernet header and trailer), so
the MSS is 1460 (1500 minus IP and TCP headers sizes). A local connection
means that the receiver has an interface connected to the same network
as the sender. Other stacks attempt to dynamically determine the
MTU for the connection. They send packets of decreasing size until
they get a response, then calculate the MSS as MTU minus 40.
As long as the sending application generates enough data to create
packets of MSS bytes, the Nagle algorithm does not significantly
reduce performance. But, what if the application does not generate
enough data? Traces 1 and 2 were generated by a client application
that sends 500 bytes of data to a server. The server does not respond
until all the data is received (the server's response is not
shown in the trace). In trace 1, the client generates 10 bytes of
data every 10 ms, taking 490 ms to generate all of the data. In
the second trace the client generates 10 bytes of data as fast as
it can, taking about 9 ms. The traces were taken on the server system
listening to port 1234 (see Figures 1 and 2).
In trace 1, there are 5 data packets from the client, each with
78 bytes of overhead (8-byte Ethernet preamble, 14 Ethernet header,
20 IP header, 20 TCP header, 4 Ethernet trailer, and 12 interframe
gap) for 500 bytes of data, a ratio of (5 * 78)/500, or 78%. If we
add in the 5 acknowledgement packets (with an additional 6 bytes
to pad out the Ethernet frame to the minimum length) from the server,
we get ((5 * 78) + (5 * 84)) / 500 or 162%. The delay is also significant.
Remember, it took 490 ms for the application to generate and
write the data; but it took (748 - 94) = 654 ms to send the data,
a delay of (654 - 490)/490 = 33%. As you can see from trace 2, generating
the data faster leads to fewer packets because all the data is available
by the time that the Nagle algorithm allows the second data packet
to be sent. The client side overhead is only (2 * 78)/500, or 31%.
Adding in the server's ACKs gives us ((2 * 78)+(2 * 84))/500,
or about 65%. However, the delay in actually sending the data is
significantly larger: (728 - 527) = 201 ms, a delay of (201 - 9)/9,
or about 2,133%.
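The overhead and delay percentages for traces 1 and 2 can be reproduced with two small functions. This is only a worked version of the arithmetic in the text; the names are assumptions made for this sketch, and the 78- and 84-byte figures are the per-packet overheads quoted above.

```c
/* Per-packet overhead figures used in the text, in bytes. */
#define DATA_PKT_OVERHEAD 78   /* preamble + Ethernet + IP + TCP + trailer + gap */
#define ACK_PKT_OVERHEAD  84   /* the same plus 6 bytes of padding */

/* Overhead as a percentage of the application data carried. */
double overhead_percent(int data_pkts, int ack_pkts, int data_bytes)
{
    int overhead = data_pkts * DATA_PKT_OVERHEAD + ack_pkts * ACK_PKT_OVERHEAD;

    return 100.0 * overhead / data_bytes;
}

/* Extra time spent sending, as a percentage of the time spent generating. */
double delay_percent(double send_ms, double generate_ms)
{
    return 100.0 * (send_ms - generate_ms) / generate_ms;
}
```

For trace 1, overhead_percent(5, 5, 500) gives 162 and delay_percent(654, 490) gives about 33; for trace 2, overhead_percent(2, 2, 500) gives about 65.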
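To address the problem of unacceptable delay, the application
programmer can use the TCP_NODELAY socket option to turn Nagle on or
off. By default, Nagle is usually on; to turn it off, set the TCP_NODELAY
socket option to 1. The following code fragment will turn the Nagle
algorithm off:
/* needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h>, <stdio.h>, <errno.h> */
/* sd is a TCP socket; nodelay and err are ints */
nodelay = 1;    /* non-zero disables the Nagle algorithm */
err = setsockopt (sd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof (nodelay));
if (err < 0)
    printf ("Error setting nodelay - errno = %d\n", errno);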
Trace 3 (Figure 3) was generated with the same application as trace
1 but with Nagle turned off. Similarly, trace 4 (Figure 4) was generated
with the same application as trace 2.
As seen in the figures, the overhead in terms of number of bytes
is significantly larger in these two traces. In both cases, there
are 50 data packets. Because trace 3 takes longer to send, there
are also more acknowledgment frames. Note, however, that there is
no time delay.
Obviously, the decision to turn off the Nagle algorithm must be
made after taking into account the available bandwidth of the network
between the client and server. On a lightly used LAN, turning off
Nagle might be very reasonable. On a heavily used LAN, maybe not.
Or, perhaps the solution should be to turn off Nagle but also to
reduce the bandwidth usage by subnetting the network or moving to
a switched environment. On a WAN or a VPN over the Internet, you
may need some research to determine whether turning off Nagle is
reasonable. Remember that if a packet is dropped due to router congestion
(which using Nagle can significantly decrease), the application
must wait for the TCP retransmit timer. This will typically degrade
performance much more than the delay imposed by using Nagle. Also
remember that an application may be used in many environments. It
could be used both on a lightly used LAN and a highly congested
WAN, so if the application does include code to turn off the Nagle
algorithm, it's best to make that a configurable option.
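One way to make the setting configurable is to read it from the environment at startup. The sketch below is only one possible approach: APP_TCP_NODELAY is a hypothetical variable name invented for this example, and a real application might use a configuration file or a command-line flag instead.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Decide from the environment whether Nagle should be disabled.
 * APP_TCP_NODELAY is a hypothetical name chosen for this sketch.
 */
int nodelay_requested(void)
{
    const char *v = getenv("APP_TCP_NODELAY");

    return v != NULL && strcmp(v, "1") == 0;
}

/* Apply the setting to a TCP socket; returns 0 on success, -1 on error. */
int configure_nagle(int sd)
{
    int nodelay = nodelay_requested() ? 1 : 0;

    return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));
}
```

The application would call configure_nagle() once for each TCP socket it creates, so the administrator can choose the behavior per installation without rebuilding the program.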
There are some less drastic alternatives. Instead of the client
making 50 writes of 10 bytes each, it could make 1 write of 500
bytes. With only 1 write, the entire packet will go immediately.
This example of 50 10-byte writes is extreme, but I have seen applications
make 2 or 3 writes of 50 or 100 bytes each, instead of 1 write containing
all the data.
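A minimal sketch of this kind of application-level buffering follows. The structure and function names are hypothetical, and the 1460-byte buffer size is an assumption (roughly one Ethernet-sized segment); the point is only that many small records reach the TCP stack through a single write().

```c
#include <string.h>
#include <unistd.h>

#define OUTBUF_SIZE 1460   /* assumption: about one Ethernet-sized segment */

struct outbuf {
    char   data[OUTBUF_SIZE];
    size_t used;
};

/* Hand everything collected so far to the kernel in a single write(). */
ssize_t outbuf_flush(struct outbuf *b, int fd)
{
    ssize_t n;

    if (b->used == 0)
        return 0;
    n = write(fd, b->data, b->used);
    if (n > 0) {
        /* Keep any bytes that a short write left behind. */
        memmove(b->data, b->data + n, b->used - (size_t)n);
        b->used -= (size_t)n;
    }
    return n;
}

/* Append one small record, flushing first if it would not fit. */
int outbuf_append(struct outbuf *b, int fd, const void *rec, size_t len)
{
    if (len > OUTBUF_SIZE)
        return -1;                      /* too large for this sketch */
    if (b->used + len > OUTBUF_SIZE && outbuf_flush(b, fd) < 0)
        return -1;
    if (b->used + len > OUTBUF_SIZE)
        return -1;                      /* short write left too little room */
    memcpy(b->data + b->used, rec, len);
    b->used += len;
    return 0;
}
```

The client in the example would call outbuf_append() 50 times with 10 bytes each and then outbuf_flush() once, so the stack sees one 500-byte write.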
What about adding padding so that all the messages are MSS bytes
in length? Obviously, adding more than 78 bytes would be worse than
not using Nagle; but, if the messages were already close to MSS
in size, adding a few bytes seems reasonable. Unfortunately, this
strategy is complicated because the MSS that a server advertises
may change depending on the OS (and OS version) of the server. It
may also change according to whether the client and server are on
the same subnet. Also, the application cannot determine the MSS,
at least not by querying the TCP stack. So, although this sounds
reasonable, I don't think it's practical.
Nagle for the Sys Admin
Only the application developers can implement the above strategies.
If the developers do not provide a way to turn off the Nagle algorithm
or buffer their own writes, what can a systems administrator do?
On Solaris, there is a kernel parameter named tcp_naglim_def.
This is the default number of bytes to buffer. Each connection has
its own copy of this value, which is set to the minimum of the MSS
for the connection and the default value. When the application sets
the TCP_NODELAY socket option, it changes the connection's
copy of this value to 1.
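The description above can be modeled in a few lines. This is a sketch of my reading of the text, not Solaris kernel code: the function names are hypothetical, and the send test assumes that a write is held back only when it is smaller than the limit and earlier data is still unacknowledged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-connection Nagle limit, as described in the text. */
size_t connection_naglim(size_t mss, size_t naglim_def, bool nodelay)
{
    if (nodelay)
        return 1;                       /* TCP_NODELAY sets the limit to 1 */
    return mss < naglim_def ? mss : naglim_def;
}

/*
 * Assumed send test: data smaller than the limit waits while
 * earlier data is unacknowledged.
 */
bool naglim_may_send(size_t queued, size_t unacked, size_t naglim)
{
    if (queued == 0)
        return false;
    return queued >= naglim || unacked == 0;
}
```

Under this model a limit of 1 lets every write go out immediately, which matches the effect of TCP_NODELAY, while a limit of 30 lets three 10-byte writes go out together without waiting for an acknowledgment.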
Changing the value of tcp_naglim_def to 1 will have the
same effect (on connections established after the change) as if
each application set the TCP_NODELAY option. In trace 5 (Figure
5), I have changed the default to 30, a compromise that reduces
the number of packets from 53 to 21 but does not lengthen the sending
time.
The current value of tcp_naglim_def can be displayed with the command:
ndd -get /dev/tcp tcp_naglim_def
and can be changed with the command:
ndd -set /dev/tcp tcp_naglim_def 30
HP-UX 11.x has the same tuning parameter and uses the same commands
to display and change it. However, according to the HP-UX man page,
changing the value is not supported.
As stated above, changing the tcp_naglim_def value will
affect all TCP applications. Note that while it might be appropriate
to turn Nagle off for one application, it might not be appropriate
for another. In fact, it might create so much congestion that the
performance of the application you are trying to improve may drop
or performance of other applications may deteriorate to unacceptable
levels. This is not something to do without understanding your entire
network (both local and remote) as well as the applications running
on your system. A fair amount of experimentation will probably be
needed as well.
So, is there something the systems administrator can do on the
receiving side? There is, which leads me to the second strategy
that TCP uses to reduce overhead -- the delayed acknowledgement
or delayed ACK.
As shown in the sample traces, ACKs are not sent after every packet
is received. In some cases, the ACKs can be delayed a significant
amount. RFC 813 ("Window and Acknowledgment Strategy in TCP")
first discusses delayed ACKs, and section 4.2.3.2 "When to
Send an ACK Segment" in RFC 1122 clarifies them. Briefly, the
RFC states that there should be an ACK for every second full-sized
segment (i.e., a segment with MSS bytes of data in it). Other than
that, ACKs can be delayed for up to 500 ms. In the above example,
the ACKs are delayed for 200 ms. Delayed ACKs give the application
time to read the data and send a reply. The acknowledgement can
then be piggy-backed in the same packet as the reply, and the overhead
of sending a packet with just the acknowledgement is eliminated.
This only works if the receiver has a reply. If not, the delayed
ACK timer controls when the ACKs are sent.
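The receiver's choice can be sketched as a small decision function. This is a simplified model of the rules just described, not code from any real stack; the enum, function, and parameter names are hypothetical.

```c
#include <stdbool.h>

enum ack_action {
    ACK_NONE,       /* keep waiting, or nothing to acknowledge */
    ACK_PIGGYBACK,  /* carry the ACK in the outgoing reply */
    ACK_NOW         /* send a packet containing only the ACK */
};

/*
 *   data_unacked      - received data has not been acknowledged yet
 *   full_segs_unacked - full-sized segments received since the last ACK
 *   reply_ready       - the application has written a reply
 *   waited_ms         - time since the oldest unacknowledged data arrived
 *   timer_ms          - delayed ACK timer (200 ms in the traces above)
 */
enum ack_action ack_decision(bool data_unacked, unsigned full_segs_unacked,
                             bool reply_ready, unsigned waited_ms,
                             unsigned timer_ms)
{
    if (!data_unacked)
        return ACK_NONE;
    if (reply_ready)
        return ACK_PIGGYBACK;
    if (full_segs_unacked >= 2)
        return ACK_NOW;             /* every second full-sized segment */
    if (waited_ms >= timer_ms)
        return ACK_NOW;             /* delayed ACK timer expired */
    return ACK_NONE;
}
```

With small segments and no reply, as in the traces, only the timer branch ever fires, which is why the sender's Nagle wait tracks the delayed ACK timer so closely.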
Reducing the delayed ACK timer speeds things up by reducing the
time that the Nagle algorithm must wait for an acknowledgment. Traces
6 and 7 were generated with the same clients as traces 1 and 2; however,
the delayed ACK timer was changed from 200 ms to 50 ms (see Figures
6 and 7). There are more ACK packets from the server, but the overhead
is certainly less than running without the Nagle algorithm. See
Table 1 for a summary of overhead and transmission delay for each
trace.
Unfortunately, not all TCP stacks allow the systems administrator
to change the delayed ACK timer and, like tcp_naglim_def,
changing it affects all TCP connections. So again, understand your
network and test the effects on all your applications.
On Solaris systems, the timer value is held in the kernel parameter
named tcp_deferred_ack_interval. The current value can be
retrieved with the command:
ndd -get /dev/tcp tcp_deferred_ack_interval
It can be changed with the command:
ndd -set /dev/tcp tcp_deferred_ack_interval 60
where "60" is the new value in milliseconds. On Solaris
8, I think that the default value is 100; in earlier releases, it
HP-UX 11.0 has the same kernel parameter, and as with tcp_naglim_def,
changing this value is not supported. The default value is 50 so
there may not be as great a need to change it. (See sidebar for
information about Windows 2000.)
In summary, the Nagle algorithm and delayed ACKs are used by TCP
to reduce overhead and congestion. This has the unfortunate effect
of delaying transmissions across the network. A well-thought-out
application may be able to reduce overhead and congestion without
incurring any delays by buffering its own writes to the network.
However, if the data flow is such that this is not possible, the
application can turn the Nagle algorithm off for its own connection.
Application programmers should make this an option so that systems
administrators can choose between the delay imposed by the Nagle
algorithm and the network congestion caused by turning off the algorithm.
If an application does not have an option for turning Nagle off,
then some TCP stacks, including those of Solaris and HP-UX, have
an option for turning it off by default for all applications. Some
TCP stacks, including those of Solaris and HP-UX, also have an option for
changing the delayed ACK timer. By shortening the delayed ACK timer,
the delaying effects of the Nagle algorithm can be reduced.
Noah Davids works in the Customer Assistance Center of Stratus
Computer Inc. He specializes in the diagnosis and correction of
LAN problems. He can be reached at: Noah_Davids@stratus.com.