Which OS is Fastest for High-Performance Network Applications?
Jeffrey B. Rothman and John Buckman
In this article, we compare Linux, Solaris (for Intel), FreeBSD,
and Windows 2000 to determine which operating system (OS) runs high-performance
network applications the fastest. We describe which software designs to look
for from your network software vendor, explain how each design yields different
performance characteristics, and determine which OS platform is best suited to
each common network programming design. We then present our OS benchmarks, with
both simulated and real-world tests, and evaluate the results.
We found that the software application's architecture determines
speed results much more than the operating system on which it runs.
Our benchmarks demonstrate a 12x performance difference between
process-based and asynchronous task architectures. Significantly,
we found up to a 75% overall performance difference between OSes
when using the most efficient asynchronous architecture. By our metrics, Linux
was the best-performing operating system, running 35% faster than second-place
Solaris, followed by Windows 2000 and, finally, FreeBSD.
Background
At Lyris Technologies, we write high-performance, cross-platform,
email-based server applications. Better application performance
is a competitive advantage, so we spend a great deal of time tuning
all aspects of an application's performance profile (software,
hardware, and operating system). Our customers frequently ask us
which operating system is best for running our software. Or, if
they have already chosen an OS, they ask how to make their system
run our applications faster. Additionally, we run a hosting (outsourcing)
division and want to reduce our hardware cost while providing the
best performance for our hosting customers.
Most Internet applications follow these steps:
1. Accept an incoming TCP/IP connection or create a connection
to another machine.
2. Once connected, exchange various text-based commands via TCP/IP.
3. These commands cause various activities to happen, such as
disk reading (e.g., viewing a Web page), disk writing (e.g., queuing
a received email message), or calling external functionality (e.g.,
mail filtering, reverse DNS lookup).
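As a minimal sketch of steps 2 and 3 (our own illustration in C++, with hypothetical command names), a handler for one already-accepted connection might look like this:

#include <string>
#include <unistd.h>    // read(), write(), close()

// Read text commands from an already-accepted socket (step 2) and
// dispatch each one to the appropriate activity (step 3).
void serve_connection(int sock)
{
    char buf[512];
    for (;;) {
        ssize_t n = read(sock, buf, sizeof(buf) - 1);  // wait for the next command
        if (n <= 0)
            break;                                     // peer closed or error
        buf[n] = '\0';
        std::string cmd(buf);
        if (cmd.compare(0, 3, "GET") == 0) {
            // e.g., read a page from disk and write it back to the client
        } else if (cmd.compare(0, 4, "MAIL") == 0) {
            // e.g., queue the incoming message to disk
        } else {
            const char reply[] = "500 unknown command\r\n";
            write(sock, reply, sizeof(reply) - 1);
        }
    }
    close(sock);
}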
In general, the performance goals for a network application are to:
1. Accomplish many concurrent tasks as quickly as possible.
2. Efficiently cope with a great deal of waiting (caused by TCP/IP
slowness, or for the other end to send the next command).
3. Perform TCP/IP operations efficiently.
The most effective way to maximize network application performance
is for the application's software designer to choose an architecture
that addresses the three performance criteria above. The two significant
variables are the task architecture and the TCP/IP call architecture.
Task Architecture
In the area of task architecture, there are three main techniques:
- One-process-per-task (process-oriented) -- Many copies
of the program are run with each copy handling one task at a time.
Sometimes a new process is created for each new task (e.g., inetd,
Sendmail); in other designs, processes are re-used (e.g., Apache).
This architecture yields good performance at low loads.
Medium loads can also be handled, if the process image is small
(e.g., qmail), if application-specific efficiency improvements
are implemented, or if the application genre does not create too
many simultaneous tasks. Multiple CPUs are efficiently used if
process caching is used, and if the total number of processes
is kept low (i.e., low-to-medium load). This technique works on
all operating systems; however, UNIX is significantly more efficient
than Windows at implementing it. (Windows lacks the fork()
system call, and process creation there is slow enough that few Windows
applications use this technique.) A minimal sketch of this model
appears after this list.
- One-thread-per-task (multi-threaded) -- One copy of the
program is run with a separate thread of execution inside the
process handling each task. Multi-threaded applications perform
very well at low to medium loads. Higher loads cause decreasing
(but usually still acceptable) performance; however, extremely
high loads can cause your multi-threaded application to death-spiral.
Multi-threaded applications typically scale to between 500 and
1000 concurrent tasks, which is acceptable in many situations.
Each new task uses a new thread, which consumes less memory and
less CPU power than a new process would. Few open source projects
use multi-threading because only the most popular UNIX variants
are stable under heavy multi-threading loads. Performance with
multiple CPUs can be worse than on one CPU, because semaphore
locks are much more costly on multiprocessor machines. (Examples
of multithreaded software are Netscape Web server and Apache on
Windows.)
- One-thread-many-tasks (asynchronous) -- One copy of the
program is run with a set number of threads (typically, one thread
per type of task), and each thread handles a large number of tasks
using a technique called asynchronous (or non-blocking) TCP/IP.
Because most programs are not required to handle high loads and
because asynchronous programming is difficult, few programs support
this architecture. Asynchronous programs scale well to multiple
CPU machines, because they typically use long running threads
operating independently of each other. They require few cross-CPU
locks, so each thread can be permanently and effectively assigned
to a CPU. (Example: the DNS BIND daemon.)
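As a rough illustration of the one-process-per-task model (our own minimal sketch, not code from inetd, Sendmail, or Apache, and using an arbitrary example port), the parent process below does nothing but accept connections and fork a child to serve each one with blocking calls. Replacing fork() with a call such as pthread_create() gives the one-thread-per-task variant.

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

// One-process-per-task: the parent only accepts; each child handles
// exactly one connection using simple blocking calls.
int main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(2525);             // arbitrary example port
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, 128);
    signal(SIGCHLD, SIG_IGN);                // let the kernel reap finished children
    for (;;) {
        int conn = accept(listener, 0, 0);   // blocks until a client connects
        if (conn < 0)
            continue;
        if (fork() == 0) {                   // child: serve this one task
            close(listener);
            const char greeting[] = "220 ready\r\n";
            write(conn, greeting, sizeof(greeting) - 1);
            // ... exchange commands with blocking read()/write() ...
            close(conn);
            _exit(0);
        }
        close(conn);                         // parent: hand off and keep accepting
    }
}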
TCP/IP Call Architecture
The second major performance variable is the TCP/IP call architecture.
On an operating system level, there are multiple ways to accomplish
the same network operation. A tradeoff exists between TCP/IP speed
versus programming effort (faster techniques are more work for the
programmer). Additionally, some faster techniques are not available
on all platforms; higher performance requirements may limit the
platform choice.
Blocking TCP/IP Call
A blocking TCP/IP call waits for the requested operation to complete,
then acts immediately on the result. With small numbers of tasks,
this results in immediate reaction to events as they occur. With
large numbers of tasks, the operating system incurs significant
context-switching overhead, and overall efficiency is poor. Blocking
(synchronous) TCP/IP calls yield very short latencies under low
loads, and are ideal for an application such as a low-load Web server,
where page-response time should be very fast, and the load is never
very high. However, if a process-oriented architecture is used and
a new process is created for each new connection (e.g., inetd),
then the latency improvements from blocking TCP/IP are negated by
the significant overhead of running a new process.
Non-Blocking TCP/IP Call
A non-blocking (asynchronous) TCP/IP call initiates an operation,
then continues with other activities. When the operation completes
or an event occurs, the application is notified and then reacts.
There is more programming work involved with this two-step process
and sometimes a small amount of time is needed to react to the new
event (increasing latency). This non-blocking technique yields much
better performance under medium-to-high loads, and can survive abusively
high loads, but latency may be slightly longer than with blocking
TCP/IP calls.
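As a simplified sketch of this approach (our own illustration, not MailEngine's actual code), the loop below uses poll() to watch many non-blocking sockets from a single thread and reacts only to the descriptors that report events:

#include <poll.h>
#include <unistd.h>

// Simplified asynchronous event loop: one thread services every connection.
// The caller has already set each descriptor non-blocking with
// fcntl(fd, F_SETFL, O_NONBLOCK) and filled in the events it cares about.
void event_loop(struct pollfd* fds, int nfds)
{
    for (;;) {
        int ready = poll(fds, nfds, 1000);    // wait up to 1 second for events
        if (ready <= 0)
            continue;                         // timeout or interrupted; try again
        for (int i = 0; i < nfds; i++) {
            if (fds[i].revents & POLLIN) {    // data arrived; read() will not block
                char buf[512];
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                // ... feed the data into this connection's state machine ...
                (void)n;
            }
            if (fds[i].revents & POLLOUT) {
                // ... the socket can accept more outgoing data without blocking ...
            }
        }
    }
}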
Each of the three task-handling architectures matches up with
a particular TCP/IP system call model. Process-oriented and multi-threaded
programs tend to use blocking TCP/IP calls, as this is the simplest
way to program and handles the low loads that are the most common case.
However, an application that uses the asynchronous task architecture
must use non-blocking TCP/IP operations to handle multiple tasks:
blocking TCP/IP is not an option. Therefore, if you find a network
application that uses the highly scalable asynchronous task architecture,
you also benefit from that application using the most scalable TCP/IP
call architecture (non-blocking).
Real-World Test
To evaluate the performance of various operating systems and network
applications, we created three different tests: real-world, disk
I/O, and task architecture comparison. The operating systems we
examined were Linux (Red Hat 7.0, kernel 2.2.16-22), Solaris 2.8
for Intel, FreeBSD 4.2, and Windows 2000 Server. The operating systems
were the latest version available from a commercial distribution
and were not recompiled (i.e., everything was tested right out of
the box). We installed all operating systems on identical 4-GB SCSI-3
drives (IBM model DCAS-34330), and ran the tests on the same machine
(ASUS P3B motherboard, Intel Pentium III 550-MHz processor, 384-MB
SDRAM, Adaptec 2940UW SCSI controller, ATI Rage Pro 3D video card,
Intel EtherExpress Pro 10/100 Ethernet card).
As a real-world test, we measured how quickly email could be sent
using our MailEngine software. MailEngine is an email delivery server
that ships on all the tested platforms (plus Solaris for Sparc) and
uses an asynchronous architecture (with non-blocking TCP/IP via the
poll() system call). So that email was not actually
delivered to our 200,000-member test list, we ran MailEngine in
test mode. In this mode, MailEngine performs all the steps of sending
mail, but sends the RSET command instead of the DATA
command at the last moment. The SMTP connection is then closed with
QUIT, and no email is delivered to the recipient. Our workload consisted
of a single message being delivered to 200,000 distinct email addresses
spread across 9113 domains. Because the same message was queued
in memory for every recipient, disk I/O was not a significant factor.
We slowly raised the number of simultaneous connections to see how
the increased load altered performance.
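To make the test mode concrete, an approximate SMTP exchange (our reconstruction, with hypothetical host and address names) looks like the following; a normal delivery would send DATA and the message body where MailEngine instead sends RSET:

S: 220 mx.example.com ESMTP
C: HELO lists.example.net
S: 250 mx.example.com
C: MAIL FROM:<list@example.net>
S: 250 OK
C: RCPT TO:<member@example.com>
S: 250 OK
C: RSET        (a normal delivery would send DATA here)
S: 250 OK
C: QUIT
S: 221 Bye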
Figure 1 ("Operating system comparison") shows the test-mode
email delivery speed for MailEngine over a range of simultaneous
connections for each OS. Linux is the clear speed winner, roughly
35% faster than Solaris, the runner-up. Overall performance increased
as connections were added, with only marginal additional speed beyond
1500 connections. FreeBSD's performance decreased somewhat above 1500
connections.
On the UNIX-style operating systems, it was necessary to tweak
the kernel slightly to allow the use of so many connections in one
process. Despite kernel tweaking, FreeBSD gave us resource-shortage
warnings and failed to run when loaded with more than 2500 connections.
File System Test
Many network applications also require the ability to queue information
on disk for later processing (e.g., Sendmail's mail queue)
or to handle overflow situations. To measure file system efficiency
in a typical situation, we wrote a C++ program that creates,
writes, and reads back 10,000 files in a single directory, one file
at a time. To measure efficiency across a range of file sizes, we
increased the file size from 4 KB to 128 KB.
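Our benchmark source is not reproduced here, but a simplified sketch along these lines (assuming a pre-created testdir directory) captures what it measures:

#include <cstdio>
#include <vector>

// Create, write, and read back 10,000 files of a given size in a single
// directory, one file at a time.
void file_test(int file_kb)
{
    std::vector<char> block(file_kb * 1024, 'x');   // dummy file contents
    char name[64];
    for (int i = 0; i < 10000; i++) {
        std::sprintf(name, "testdir/file%05d", i);
        std::FILE* f = std::fopen(name, "wb");      // create and write the file
        if (!f)
            return;
        std::fwrite(&block[0], 1, block.size(), f);
        std::fclose(f);
        f = std::fopen(name, "rb");                 // immediately read it back
        if (!f)
            return;
        std::fread(&block[0], 1, block.size(), f);
        std::fclose(f);
    }
}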
Figure 2 ("Time to Create, Write and Read 10,000 Files")
displays the file system test results. Linux and Windows speeds
were almost identical, significantly faster than the other two:
6x faster than FreeBSD, 10x faster than Solaris. The file system
for each OS was: Linux - EXT2, Solaris - UFS, Windows 2000 - NTFS,
FreeBSD - UFS. Other file systems would undoubtedly yield different
performance results. If your software application depends heavily
on disk I/O, we recommend using Linux or Windows, or else investigating
alternative file systems on FreeBSD or Solaris.
Application Architecture Test
Finally, we evaluated how different network application architectures
performed on each operating system. We wrote a simple C++ server
program that responded to incoming connections with the message
"450 too busy", using one of three architectures to handle
sending the response message. The three architectures our program
tested were: (1) a process-based architecture, with a new process
executed to handle each connection; (2) a multi-threaded architecture,
with a thread assigned to each connection; and (3) an asynchronous
architecture, with all connections answered using non-blocking TCP/IP.
A separate C++ program, running on a different machine (under Linux),
attempted to connect to our simple server program as quickly as possible,
slowly increasing the simultaneous connection load and counting the
successfully received response messages. The multiple charts (12
test runs) were too much to present in this article, so we instead
charted the average for each task architecture to show general performance
differences.
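The client side is conceptually simple. A rough sketch of a single probe (our own illustration; the server address and port are supplied by the caller, and the real test keeps many such connections open simultaneously while ramping up the load):

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Connect to the test server, wait for its "450 too busy" response,
// and report whether a complete response was received.
int probe_server(const char* server_ip, int port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, server_ip, &addr.sin_addr);
    if (connect(sock, (sockaddr*)&addr, sizeof(addr)) < 0) {
        close(sock);
        return 0;                                   // connection refused or dropped
    }
    char buf[64];
    ssize_t n = read(sock, buf, sizeof(buf) - 1);   // wait for the response message
    close(sock);
    return (n >= 3 && std::strncmp(buf, "450", 3) == 0);
}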
Figure 3 ("Average Throughput per Network Architecture")
shows the performance of each type of task architecture, averaged
across all OS platforms. Although performance varied considerably
between platforms, that variation was far smaller than the effect of
the architecture choice. The slowest
network application architecture is the process-based architecture,
which can handle only about 5% of the connections of the asynchronous
method. The asynchronous method can handle about 35% more load than
the thread-based method at 1000 simultaneous connections. The trend
lines show that the multi-threaded versus asynchronous performance
gap widens as load increases.
Kernel Tweaks for High Performance
In their default configurations, the UNIX-style operating systems
we tested do not support the large numbers of simultaneous TCP/IP
connections that multi-threaded and asynchronous applications require.
This limitation drastically restricts application performance,
and can incorrectly dissuade a systems administrator from using
these kinds of high-performance architectures. Fortunately, these
limitations are easily overcome with a few kernel tweaks. On UNIX,
each TCP/IP connection uses a file descriptor, so you must increase
the total number of descriptors available to the operating system,
and also increase the maximum number of descriptors each process
is allowed to use. All UNIX-style operating systems have a ulimit
shell command (in sh and bash), which allows more open file
descriptors for commands started in that shell once the appropriate
kernel tweak has been made. We suggest ulimit -n 8192. Here are our
recommended kernel tweaks:
On Linux: echo 65536 > /proc/sys/fs/file-max changes
the number of system-wide file descriptors.
On FreeBSD: Append the following to /etc/sysctl.conf (or apply them
at run time with sysctl -w):
kern.maxfiles=65536
kern.maxfilesperproc=32768
On Solaris: Add the following to /etc/system and reboot:
set rlim_fd_max=0x8000
set rlim_fd_cur=0x8000
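As a complement to ulimit, an application can also raise its own per-process descriptor limit at startup with setrlimit(), subject to the hard limit and the system-wide maximums set above. A minimal sketch:

#include <sys/resource.h>

// Raise this process's open-file-descriptor limit to 8192, or as close
// to it as the hard limit allows.
void raise_fd_limit()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return;
    rl.rlim_cur = 8192;
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;       // cannot exceed the hard limit
    setrlimit(RLIMIT_NOFILE, &rl);
}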
Summary
Our real-world test showed a 75% performance gap between the
best and worst performing operating systems, with Linux enjoying
a 35% lead over runner-up Solaris. More significantly, asynchronous
applications were on average 12x faster than process-based applications,
and 35% faster than multi-threaded applications. If disk I/O accounts
for a significant portion of your application's run time, those disk
I/O tasks will run up to 10x faster on Linux and Windows 2000 than on
Solaris, and about 6x faster than on FreeBSD.
If you are evaluating a network software application and final
performance is important to you, software architecture should be
a vital evaluation criterion (i.e., you should show a preference
for multi-threaded or asynchronous architectures).
Jeffrey Rothman is the Manager of Technical Support and head
System Administrator at Lyris, and holds a Ph.D. in Computer Science
from U.C. Berkeley on the topic of high-performance memory architectures
for multiprocessor systems. John Buckman is the CEO/Founder of Lyris,
and the original software programmer behind their three products:
ListManager, MailShield, and MailEngine.