Which OS is Fastest for High-Performance Network Applications?
Jeffrey B. Rothman and John Buckman
In this article, we compare Linux, Solaris (for Intel), FreeBSD,
and Windows 2000 to determine which operating system (OS) runs high-performance
network applications the fastest. We describe which software designs to look
for from your network software vendor, explain how each design yields different
performance characteristics, and determine which OS platform is best suited to
each common network programming design. We then present our OS benchmarks, with
both simulated and real-world tests, and evaluate the results.
We found that the software application's architecture determines
speed results much more than the operating system on which it runs.
Our benchmarks demonstrate a 12x performance difference between
process-based and asynchronous task architectures. Significantly,
we found up to a 75% overall performance difference between OSes
when using the most efficient asynchronous architecture. By our metrics, Linux
was the best-performing operating system, running 35% faster than second-place
Solaris, followed by Windows 2000 and, finally, FreeBSD.
Background
At Lyris Technologies, we write high-performance, cross-platform,
email-based server applications. Better application performance
is a competitive advantage, so we spend a great deal of time tuning
all aspects of an application's performance profile (software,
hardware, and operating system). Our customers frequently ask us
which operating system is best for running our software. Or, if
they have already chosen an OS, they ask how to make their system
run our applications faster. Additionally, we run a hosting (outsourcing)
division and want to reduce our hardware cost while providing the
best performance for our hosting customers.
Most Internet applications follow these steps:
1. Accept an incoming TCP/IP connection or create a connection
to another machine.
2. Once connected, exchange various text-based commands via TCP/IP.
3. These commands cause various activities to happen, such as
disk reading (e.g., viewing a Web page), disk writing (e.g., queuing
a received email message), or calling external functionality (e.g.,
mail filtering, reverse DNS lookup).
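As a minimal sketch of steps 2 and 3 (our own illustration in C++, with hypothetical command names), a handler for one already-accepted connection might look like this:

#include <string>
#include <unistd.h>    // read(), write(), close()

// Read text commands from an already-accepted socket (step 2) and
// dispatch each one to the appropriate activity (step 3).
void serve_connection(int sock)
{
    char buf[512];
    for (;;) {
        ssize_t n = read(sock, buf, sizeof(buf) - 1);  // wait for the next command
        if (n <= 0)
            break;                                     // peer closed or error
        buf[n] = '\0';
        std::string cmd(buf);
        if (cmd.compare(0, 3, "GET") == 0) {
            // e.g., read a page from disk and write it back to the client
        } else if (cmd.compare(0, 4, "MAIL") == 0) {
            // e.g., queue the incoming message to disk
        } else {
            const char reply[] = "500 unknown command\r\n";
            write(sock, reply, sizeof(reply) - 1);
        }
    }
    close(sock);
}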
In general, the performance goals for a network application are to:
1. Accomplish many concurrent tasks as quickly as possible.
2. Efficiently cope with a great deal of waiting (caused by TCP/IP
slowness, or for the other end to send the next command).
3. Perform TCP/IP operations efficiently.
The most effective way to maximize network application performance
is for the application's software designer to choose an architecture
that addresses the three performance criteria above. The two significant
variables are the task architecture and the TCP/IP call architecture.
Task Architecture
In the area of task architecture, there are three main techniques:
- One-process-per-task (process-oriented) -- Many copies
of the program are run with each copy handling one task at a time.
Sometimes a new process is created for each new task (e.g., inetd,
Sendmail); in other designs, processes are re-used (e.g., Apache).
This architecture yields good performance at low loads.
Medium loads can also be handled, if the process image is small
(e.g., qmail), if application-specific efficiency improvements
are implemented, or if the application genre does not create too
many simultaneous tasks. Multiple CPUs are efficiently used if
process caching is used, and if the total number of processes
is kept low (i.e., low-to-medium load). This technique works on
all operating systems; however, UNIX is significantly more efficient
than Windows at implementing it. (Windows lacks the fork()
system call, and process creation there is slow enough that few Windows
applications use this technique.) A minimal sketch of this model
appears after this list.
- One-thread-per-task (multi-threaded) -- One copy of the
program is run with a separate thread of execution inside the
process handling each task. Multi-threaded applications perform
very well at low to medium loads. Higher loads cause decreasing
(but usually still acceptable) performance; however, extremely
high loads can cause your multi-threaded application to death-spiral.
Multi-threaded applications typically scale to between 500 and
1000 concurrent tasks, which is acceptable in many situations.
Each new task uses a new thread, which consumes less memory and
less CPU power than a new process would. Few open source projects
use multi-threading because only the most popular UNIX variants
are stable under heavy multi-threading loads. Performance with
multiple CPUs can be worse than on one CPU, because semaphore
locks are much more costly on multiprocessor machines. (Examples
of multithreaded software are Netscape Web server and Apache on
Windows.)
- One-thread-many-tasks (asynchronous) -- One copy of the
program is run with a set number of threads (typically, one thread
per type of task), and each thread handles a large number of tasks
using a technique called asynchronous (or non-blocking) TCP/IP.
Because most programs are not required to handle high loads and
because asynchronous programming is difficult, few programs support
this architecture. Asynchronous programs scale well to multiple
CPU machines, because they typically use long running threads
operating independently of each other. They require few cross-CPU
locks, so each thread can be permanently and effectively assigned
to a CPU. (Example: the DNS BIND daemon.)
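As a rough illustration of the one-process-per-task model (our own minimal sketch, not code from inetd, Sendmail, or Apache, and using an arbitrary example port), the parent process below does nothing but accept connections and fork a child to serve each one with blocking calls. Replacing fork() with a call such as pthread_create() gives the one-thread-per-task variant.

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

// One-process-per-task: the parent only accepts; each child handles
// exactly one connection using simple blocking calls.
int main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(2525);             // arbitrary example port
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, 128);
    signal(SIGCHLD, SIG_IGN);                // let the kernel reap finished children
    for (;;) {
        int conn = accept(listener, 0, 0);   // blocks until a client connects
        if (conn < 0)
            continue;
        if (fork() == 0) {                   // child: serve this one task
            close(listener);
            const char greeting[] = "220 ready\r\n";
            write(conn, greeting, sizeof(greeting) - 1);
            // ... exchange commands with blocking read()/write() ...
            close(conn);
            _exit(0);
        }
        close(conn);                         // parent: hand off and keep accepting
    }
}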
TCP/IP Call Architecture
The second major performance variable is the TCP/IP call architecture.
On an operating system level, there are multiple ways to accomplish
the same network operation. A tradeoff exists between TCP/IP speed
versus programming effort (faster techniques are more work for the
programmer). Additionally, some faster techniques are not available
on all platforms; higher performance requirements may limit the
platform choice.
Blocking TCP/IP Call
A blocking TCP/IP call waits for the requested operation to complete,
then acts immediately on the result. With small numbers of tasks,
this results in immediate reaction to events as they occur. With
large numbers of tasks, the operating system incurs significant
context-switching overhead, and overall efficiency is poor. Blocking
(synchronous) TCP/IP calls yield very short latencies under low
loads, and are ideal for an application such as a low-load Web server,
where page-response time should be very fast, and the load is never
very high. However, if a process-oriented architecture is used and
a new process is created for each new connection (e.g., inetd),
then the latency improvements from blocking TCP/IP are negated by
the significant overhead of running a new process.
Non-Blocking TCP/IP Call
A non-blocking (asynchronous) TCP/IP call initiates an operation,
then continues with other activities. When the operation completes
or an event occurs, the application is notified and then reacts.
There is more programming work involved with this two-step process
and sometimes a small amount of time is needed to react to the new
event (increasing latency). This non-blocking technique yields much
better performance under medium-to-high loads, and can survive abusively
high loads, but latency may be slightly longer than with blocking
TCP/IP calls.
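As a simplified sketch of this approach (our own illustration, not MailEngine's actual code), the loop below uses poll() to watch many non-blocking sockets from a single thread and reacts only to the descriptors that report events:

#include <poll.h>
#include <unistd.h>

// Simplified asynchronous event loop: one thread services every connection.
// The caller has already set each descriptor non-blocking with
// fcntl(fd, F_SETFL, O_NONBLOCK) and filled in the events it cares about.
void event_loop(struct pollfd* fds, int nfds)
{
    for (;;) {
        int ready = poll(fds, nfds, 1000);    // wait up to 1 second for events
        if (ready <= 0)
            continue;                         // timeout or interrupted; try again
        for (int i = 0; i < nfds; i++) {
            if (fds[i].revents & POLLIN) {    // data arrived; read() will not block
                char buf[512];
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                // ... feed the data into this connection's state machine ...
                (void)n;
            }
            if (fds[i].revents & POLLOUT) {
                // ... the socket can accept more outgoing data without blocking ...
            }
        }
    }
}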
Each of the three task-handling architectures matches up with
a particular TCP/IP system call model. Process-oriented and multi-threaded
programs tend to use blocking TCP/IP calls, as this is the simplest
way to program and handles the low loads that are the most common case.
However, an application that uses the asynchronous task architecture
must use non-blocking TCP/IP operations to handle multiple tasks:
blocking TCP/IP is not an option. Therefore, if you find a network
application that uses the highly scalable asynchronous task architecture,
you also benefit from that application using the most scalable TCP/IP
call architecture (non-blocking).
Real-World Test
To evaluate the performance of various operating systems and network
applications, we created three different tests: real-world, disk
I/O, and task architecture comparison. The operating systems we
examined were Linux (Red Hat 7.0, kernel 2.2.16-22), Solaris 2.8
for Intel, FreeBSD 4.2, and Windows 2000 Server. The operating systems
were the latest version available from a commercial distribution
and were not recompiled (i.e., everything was tested right out of
the box). We installed all operating systems on identical 4-GB SCSI-3
drives (IBM model DCAS-34330), and ran the tests on the same machine
(ASUS P3B motherboard, Intel Pentium III 550-MHz processor, 384-MB
SDRAM, Adaptec 2940UW SCSI controller, ATI Rage Pro 3D video card,
Intel EtherExpress Pro 10/100 Ethernet card).
As a real-world test, we measured how quickly email could be sent
using our MailEngine software. MailEngine is an email delivery server
that ships on all the tested platforms (plus Solaris for Sparc) and
uses an asynchronous architecture (with non-blocking TCP/IP via the
poll() system call). So that email was not actually
delivered to our 200,000-member test list, we ran MailEngine in
test mode. In this mode, MailEngine performs all the steps of sending
mail, but sends the RSET command instead of the DATA
command at the last moment. The SMTP connection is then closed with
QUIT, and no email is delivered to the recipient. Our workload consisted
of a single message being delivered to 200,000 distinct email addresses
spread across 9113 domains. Because the same message was queued
in memory for every recipient, disk I/O was not a significant factor.
We slowly raised the number of simultaneous connections to see how
the increased load altered performance.
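To make the test mode concrete, an approximate SMTP exchange (our reconstruction, with hypothetical host and address names) looks like the following; a normal delivery would send DATA and the message body where MailEngine instead sends RSET:

S: 220 mx.example.com ESMTP
C: HELO lists.example.net
S: 250 mx.example.com
C: MAIL FROM:<list@example.net>
S: 250 OK
C: RCPT TO:<member@example.com>
S: 250 OK
C: RSET        (a normal delivery would send DATA here)
S: 250 OK
C: QUIT
S: 221 Bye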
Figure 1 ("Operating system comparison") shows the test-mode
email delivery speed for MailEngine over a range of simultaneous
connections for each OS. Linux is the clear speed winner, roughly
35% faster than Solaris, the runner-up. Overall performance increased
as connections were added, with only marginal additional speed beyond
1500 connections. FreeBSD's performance decreased somewhat above 1500
connections.
On the UNIX-style operating systems, it was necessary to tweak
the kernel slightly to allow the use of so many connections in one
process. Despite kernel tweaking, FreeBSD gave us resource-shortage
warnings and failed to run when loaded with more than 2500 connections.
File System Test
Many network applications also require the ability to queue information
on disk for later processing (e.g., Sendmail's mail queue)
or to handle overflow situations. To measure file system efficiency
in a typical situation, we wrote a C++ program that creates,
writes, and reads back 10,000 files in a single directory, one file
at a time. To measure efficiency across a range of file sizes, we
increased the file size from 4 KB to 128 KB.
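Our benchmark source is not reproduced here, but a simplified sketch along these lines (assuming a pre-created testdir directory) captures what it measures:

#include <cstdio>
#include <vector>

// Create, write, and read back 10,000 files of a given size in a single
// directory, one file at a time.
void file_test(int file_kb)
{
    std::vector<char> block(file_kb * 1024, 'x');   // dummy file contents
    char name[64];
    for (int i = 0; i < 10000; i++) {
        std::sprintf(name, "testdir/file%05d", i);
        std::FILE* f = std::fopen(name, "wb");      // create and write the file
        if (!f)
            return;
        std::fwrite(&block[0], 1, block.size(), f);
        std::fclose(f);
        f = std::fopen(name, "rb");                 // immediately read it back
        if (!f)
            return;
        std::fread(&block[0], 1, block.size(), f);
        std::fclose(f);
    }
}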
Figure 2 ("Time to Create, Write and Read 10,000 Files")
displays the file system test results. Linux and Windows speeds
were almost identical, significantly faster than the other two:
6x faster than FreeBSD, 10x faster than Solaris. The file system
for each OS was: Linux - EXT2, Solaris - UFS, Windows 2000 - NTFS,
FreeBSD - UFS. Other file systems would undoubtedly yield different
performance results. If your software application depends heavily
on disk I/O, we recommend using Linux or Windows, or else investigating
alternative file systems on FreeBSD or Solaris.
Application Architecture Test
Finally, we evaluated how different network application architectures
performed on each operating system. We wrote a simple C++ server
program that responded to incoming connections with the message
"450 too busy", using one of three architectures to handle
sending the response message. The three architectures our program
tested were: (1) a process-based architecture, with a new process
executed to handle each connection; (2) a multi-threaded architecture,
with a thread assigned to each connection; and (3) an asynchronous
architecture, with all connections answered using non-blocking TCP/IP.
A separate C++ program, running on a different machine (under Linux),
attempted to connect to our simple server program as quickly as possible,
slowly increasing the simultaneous connection load and counting the
successfully received response messages. The multiple charts (12
test runs) were too much to present in this article, so we instead
charted the average for each task architecture to show general performance
differences.
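The client side is conceptually simple. A rough sketch of a single probe (our own illustration; the server address and port are supplied by the caller, and the real test keeps many such connections open simultaneously while ramping up the load):

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Connect to the test server, wait for its "450 too busy" response,
// and report whether a complete response was received.
int probe_server(const char* server_ip, int port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, server_ip, &addr.sin_addr);
    if (connect(sock, (sockaddr*)&addr, sizeof(addr)) < 0) {
        close(sock);
        return 0;                                   // connection refused or dropped
    }
    char buf[64];
    ssize_t n = read(sock, buf, sizeof(buf) - 1);   // wait for the response message
    close(sock);
    return (n >= 3 && std::strncmp(buf, "450", 3) == 0);
}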
Figure 3 ("Average Throughput per Network Architecture")
shows the performance of each type of task architecture, averaged
across all OS platforms. Although performance varied considerably
between platforms, that variation was far smaller than the effect of
the architecture choice. The slowest
network application architecture is the process-based architecture,
which can handle only about 5% of the connections of the asynchronous
method. The asynchronous method can handle about 35% more load than
the thread-based method at 1000 simultaneous connections. The trend
lines show that the multi-threaded versus asynchronous performance
gap widens as load increases.
Kernel Tweaks for High Performance
In their default configurations, the UNIX-style operating systems
we tested do not support the large numbers of simultaneous TCP/IP
connections that multi-threaded and asynchronous applications require.
This limitation drastically restricts application performance,
and can incorrectly dissuade a systems administrator from using
these kinds of high-performance architectures. Fortunately, these
limitations are easily overcome with a few kernel tweaks. On UNIX,
each TCP/IP connection uses a file descriptor, so you must increase
the total number of descriptors available to the operating system,
and also increase the maximum number of descriptors each process
is allowed to use. All UNIX-style operating systems have a ulimit
shell command (in sh and bash), which allows more open file
descriptors for commands started in that shell once the appropriate
kernel tweak has been made. We suggest ulimit -n 8192. Here are our
recommended kernel tweaks:
On Linux: echo 65536 > /proc/sys/fs/file-max changes
the number of system-wide file descriptors.
On FreeBSD: Append the following to /etc/sysctl.conf (or apply them
at run time with sysctl -w):
kern.maxfiles=65536
kern.maxfilesperproc=32768
On Solaris: Add the following to /etc/system and reboot:
set rlim_fd_max=0x8000
set rlim_fd_cur=0x8000
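As a complement to ulimit, an application can also raise its own per-process descriptor limit at startup with setrlimit(), subject to the hard limit and the system-wide maximums set above. A minimal sketch:

#include <sys/resource.h>

// Raise this process's open-file-descriptor limit to 8192, or as close
// to it as the hard limit allows.
void raise_fd_limit()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return;
    rl.rlim_cur = 8192;
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;       // cannot exceed the hard limit
    setrlimit(RLIMIT_NOFILE, &rl);
}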
Summary
Our real-world test showed a 75% performance gap between the
best and worst performing operating systems, with Linux enjoying
a 35% lead over runner-up Solaris. More significantly, asynchronous
applications were on average 12x faster than process-based applications,
and 35% faster than multi-threaded applications. If disk I/O accounts
for a significant portion of your application's run time, those disk
I/O tasks will run up to 10x faster on Linux and Windows 2000 than on
Solaris, and about 6x faster than on FreeBSD.
If you are evaluating a network software application and final
performance is important to you, software architecture should be
a vital evaluation criterion (i.e., you should show a preference
for multi-threaded or asynchronous architectures).
Jeffrey Rothman is the Manager of Technical Support and head
System Administrator at Lyris, and holds a Ph.D. in Computer Science
from U.C. Berkeley on the topic of high-performance memory architectures
for multiprocessor systems. John Buckman is the CEO/Founder of Lyris,
and the original software programmer behind their three products:
ListManager, MailShield, and MailEngine.