Performance Tuning: A Case Study

Mark E. Dawson, Jr.

I've encountered many types of interview questions from employers in the IT field. Questions have ranged from troubleshooting, architecture, and performance tuning to operating systems, database technologies, and kernel internals. Although most of the questions were par for the course, one question really stuck with me and led me to write this article.

I was asked about my preference between capacity planning and design, or performance tuning of already existing systems. I responded that I prefer 55% of performance tuning in existing systems, 45% of architecting new systems. Almost any engineer can follow a recipe or HOW-TO doc to build a recommended configuration from scratch, but it takes in-depth knowledge of a variety of technologies and how each interoperates to tune and troubleshoot what someone else has constructed. Of course, this answer helped "seal the deal" after the interview, but I will attempt to demonstrate the truthfulness of my answer via a case study of a performance-tuning scenario I recently encountered. The situation involved an Oracle database crash where the database had been configured to use asynchronous I/O. Although diagnosing a database crash may seem more like a troubleshooting issue, a deeper understanding of running Oracle with asynchronous I/O (AIO) enabled will show that it is really a performance issue.

The Problem

Early one morning, one of the DBAs in the group sent an email about an Oracle instance I'll refer to as "prodDB":

FYI: The prodDB database went down at 1:40am with a database writer (DBWR) error. I brought the database up and everything looks fine. This error has happened before and is typically due to the DBWR process not being able to write to a file on the NetApp. The exact error is ORA-27062: skgfospo: Could not find pending async I/Os.

Of course, we couldn't tolerate such instability in our production database for long.

The Environment

Our environment consisted of Oracle 8.0.5 through 8.1.5 databases running on several Solaris 2.6 E3000-E4500 servers. Network Appliance (NetApp) F760s running OnTap 5.3.5R2 served as the filestore for the databases, which mount the NetApps via NFS over UDP (rsize/wsize=32kB). These servers were networked with the NetApp backend storage via Cisco Cat6505 switches. All the NetApps used GigE connections to the switch, while half of the servers used 100-Mb connections and the other half GigE. The server on which prodDB runs was attached at 100 Mb. Each Oracle database wrote using filesystem asynchronous I/O (AIO). ProdDB was an Oracle 8.1.5 database.

The Initial Prognosis

My initial guess was that the database instability issues were due to using AIO in our Oracle 8.1.5 instance. This idea stemmed from reading white papers stating that Oracle wasn't even supposed to be able to run AIO against anything but raw datafiles with Oracle 8.1.5 through 8.1.6 on Solaris. This appeared to be a little-known fact, and caused some confusion since our environment was actually running with this configuration. So, did Sun and Oracle mean that we shouldn't use filesystem AIO on filesystems, or that the Oracle software wouldn't allow its use in 8.1.5 and 8.1.6? The docs I've read indicated the latter.

Further discussion with Oracle indicated that the information provided in these white papers from Sun and Oracle were only partially correct. The missing piece of information (that neither Sun nor Oracle included in their documents) is that only Solaris 2.6 with kernel revision 105181-17 and below is incapable of utilizing filesystem AIO with Oracle 8.1.5 and 8.1.6. We were well above that kernel rev, so there was nothing to prevent use of AIO in our environment.

Asynchronous I/O Explained

Why was AIO even considered as an issue? In case you're unaware of what AIO is and how it works, I'll briefly define it and explain why we decided to use it for our databases.

Normally, the write() system call in UNIX will return without actually writing the data to disk due to filesystem buffering, which greatly aids write performance. However, if a developer wanted to guarantee that data reaches the disk before a write() call returns, he could execute the system call with the O_SYNC (or O_DSYNC) flag to bypass the default buffering scheme. When this flag is used, a write will take longer, but the data is guaranteed to be on disk after a successful return from the write() system call. This also causes the calling application to wait idly while the call executes, preventing the application from doing any other work until the system call returns. Oracle is one of those applications that requires a guarantee that all writes are completed before continuing.

Modern Unices provide an asynchronous method of reading and writing. By means of this, Oracle can issue a write syscall with the O_DSYNC flag, but continue processing while the call executes in the background. This alleviates the need to wait idly on the data to reach the disk before the application can proceed with other work. Since AIO greatly enhances application performance in I/O-intensive workloads, Oracle provides an option for DBAs to enable it on operating systems supporting this feature. Because of the performance benefits that AIO afforded our environment, we implemented it across all our databases.

There are a couple of ways for an application to know when an AIO call has completed: by catching the SIGIO signal, or by calling aiowait(). Oracle makes use of the latter. In addition, a timer is placed on the execution of "aiowait", such that if this time is exceeded with no response from outstanding AIO writes, the process will die to avoid any further data corruption. The database then crashes, and the alert log will show the ORA-27062 message mentioned above.

The Usual Suspects

After identifying AIO as the issue, I formulated a checklist of possible culprits:

1. RDBMS: Could there be an Oracle bug? Is Oracle misconfigured?

2. Operating system: Does a kernel or NFS patch need to be applied? Is there I/O contention on the server, perhaps caused by other applications running on the database machine?

3. NFS issue: Could NFS be misconfigured, or could the NetApp be overloaded with NFS requests?

4. Network: Could there be faulty cables, NICs, switch ports, etc.?

5. NetApp issues: What about an OnTap bug, overloaded machine, poorly configured machine, etc.?

The problem could be one, or a combination, of these factors.

The Initial Analysis

Going down my checklist, I ruled out item #1, as Oracle's MetaLink Web site only listed one AIO bug that occurred when databases resided on an unpatched version of Veritas Volume Manager.

I also ruled out item #2 because the server in question was only running Oracle databases. Thus there was little likelihood of resource contention from other applications. The kernel and NFS patches were also up-to-date at the time.

The /var/adm/messages file on Solaris revealed numerous NFS server still not responding messages around the time of the database crash. iostat with the -xPnc options (see Solaris iostat man pages or Adrian Cockroft's Sun Performance Tuning book for more on these options) showed that average service time and average wait time for the NFS-mounted Oracle datafiles were often more than 60-70 ms, sometimes far more than this! This led me to the handy nfsstat command, which, as you can probably gather from the command name, reveals statistics on NFS/RPC client and server operation.

On the client, the most important stats would come from running nfsstat with the -rc args, which delivers info on the RPC calls (-r) on a client (-c). The areas of interest were the calls, retrans, and badxids columns under the "Connectionless" heading (since we're using NFS over UDP, the connectionless transport mechanism).

Interestingly, the retrans/calls percentage was high (more than 2%), while the badxids column was at 0. Each unique NFS request has a transaction ID (xid) assigned to it that the NFS server returns in its response. If an NFS client receives a response with an xid for which there is no outstanding request, the badxids column is incremented. Such events occur commonly when an NFS server is overloaded, thus causing a client to retransmit its request after a response timeout is exceeded until it receives a response. Furthermore, the outmatched NFS server may send late responses to each request (the original packet and the retransmitted packets). The client accepts the first response, rejects the duplicate replies, and increments the badxids columns for each such duplicate response (a request NFS packet retains the same xid across all retransmissions of the same request).

Remember that our client machine had no badxids, but a high percentage of retransmissions. This told me that the NetApp could keep up, but packets were getting dropped somewhere along the way. Thus, I ruled out item #3 on the checklist.

As previously described, the network consisted of the Sun Enterprise server attached to a Cat6505 switch via 100 Mb, with the backend NetApp attached via GigE. The 100-Mb card at the server end was a Sun hme card, for which a patch existed that remedied a packet-dropping behavior of the card (as noticed from netstat -in). However, application of the patch only very negligibly helped the NFS timeout/retransmit issues. Another thing that came to mind was the infamous autonegotiation woes that can occur between NICs and Cisco switch ports, where port speed and duplex settings between the NIC and port suddenly become out of sync during operation. However, every switch port had its settings forced for every server machine. Nevertheless, checklist item #4 still needed closer investigation.

The Network Examined

On the Cat6505 switch, executing show port mac PORTNUM against the 100-Mb port attached to the server showed a number of "Out-Discards". This result has to do with the buffers assigned to each port on a switch. GigE ports have 448 KB of transmit buffer space, with 64 KB of receive space, while the 100-Mb ports have 112 KB of transmit space and 16 KB of receive space. Frames dropped from the transmit buffer increment the "Out-Discards", while "In-Discards" are frames dropped from the receive end. My hypothesis was that the amount of traffic being pushed from the GigE end of the connection from the NetApp was overrunning the port buffers at the 100-Mb end of connection from the Sun server. Using NFS over UDP exacerbated the problem, as a dropped packet requires retransmission of the entire datagram rather than just the lost packet in the case of TCP. With 32-KB-sized NFS packets, you can see how important retransmits can become in a lousy network with a 1500-byte MTU size (that's approximately 22 packets in one 32-KB NFS request).

IEEE 802.3x defines flow control for full-duplex Ethernet LANs, implemented by means of pause packets sent at the MAC layer (Layer 2) to signal to the other end to stop sending packets for a period. Unfortunately, neither Sun 100-Mb hme NICs nor the Cat6505 100-Mb ports support flow control. Only Sun GigE adapters and Cat6505 GigE ports support them, though Fast Ethernet in general is capable of doing so.

The NetApp Examined

Checking the CPU utilization and its NFS-specific load handling via sysstat and then nfsstat -d confirmed that the NetApp was not overloaded. However, the GigE card was dropping packets from its receive queue, as shown from executing ifstat -a. So now we had dropped packets at the switch and the NetApp end. Flow control was enabled, but not all cards communicating with it made use of flow-control.

Item #5 proved true, in part, due to some initial configuration problems with the NetApps. The F760 had a disk controller located in a slot with a higher interrupt priority than the GigE card, which would interfere with the GigE's ability to push packets onto the bus during disk operation, causing it to drop packets from its hardware receive queue as it filled with incoming packets. Addressing this issue would require downtime for a number of databases, as each F760 supported multiple databases.

For more info on NetApp commands, go to http://now.netapp.com. If you have a NetApp, go to http://hostname_of_netapp/na_admin/ to read the man pages off the filer itself.

The Solution

My recommendation was to make sure all database machines are networked to the NetApps via GigE <--> GigE with full flow control. This would enable all endpoints on the connection -- the NetApp GigE, the switch ports, and the Sun GigE -- to use the 802.3x flow control capability to alleviate receive queue and port transmit buffer overruns. Lastly, all NetApp machines with an unsupported slot configuration (detailed via sysconfig -c) would have the slot assignments rearranged during a maintenance window to place GigE cards in the higher numbered slots over disk controllers.

Interim solutions, such as turning off AIO in favor of using multiple I/O slaves (which has the unfortunate side-effect of serializing writes to redo log members) and using NFS over TCP (with the additional per-packet and packet type overhead) were used until the real solutions could be put in place. In the meantime, I configured a fault-tolerant Sun E4500 with a fully patched OS and GigE connection on which to migrate prodDB. If you are not familiar with running databases on NetApp storage, the ease of such migrations is unmatched when compared to traditional direct-attached storage. All that was necessary was to bring the database down, mount the NetApp volumes onto the destination machine using the same mount points as the source machine, and bring the database back up.

The End Result

Several databases were migrated to the new E4500, and have not exhibited any of the crashes or timeout symptoms of the past. The cleaner network with GigE flow control from end-to-end provides not only an ORA-27062 free environment, but also much improved I/O performance. For example, the time to export prodDB has dropped to 30 minutes from a previous duration of as long as 4 hours. Retransmissions are down to .01% from an initial 2% reading. iostat shows that wait times and service times are often in the 5-14 ms range, which is well below the 30-ms rule-of-thumb that I generally use. Life is good.

Later, after seeing how much better the databases were performing, I removed unnecessary dependency on the network from the database operation. This included such things as using local file locking mechanisms instead of using rpc.lockd and avoiding the unnecessary close-to-open mechanism used by NFS to ensure data integrity among NFS clients. Both of these are implemented by using the -llock and -nocto NFS mount options, respectively. Since no two NFS clients in our environment shared Oracle datafiles (nor can they until Oracle Parallel Server is certified on NetApp), these NFS facilities are unnecessary. As a result, database performance has improved even more.

While I thought of testing using 8-KB NFS rsize/wsize for the database mount points since the Oracle blocksize was 8 KB, I decided to just leave well enough alone. I'd had enough adventure for one week at a new position.

Conclusion

My answer to the interview question mentioned at the outset had the foremost goal of impressing the employer and clinching the job. However, the gist of my response reveals what I, and many other performance enthusiasts, have long ago come to appreciate -- the joy of performance tuning.

This case study illustrates the challenges posed by performance tuning in comparison to architecture in most cases, especially when the systems being tuned were configured by unavailable third parties. As an engineer, I thoroughly enjoy designing systems to meet and beat the workload demands of a business. But the challenge and the technical skill involved in performance tuning bring out the brainteaser enthusiast in me, enriching me far more than most architectural tasks. I hope that this article will arouse more interest in this specialized area of systems administration.

Mark El Toro Dawson, Jr. is Senior UNIX/Oracle Engineer at Encyclopaedia Britannica, Inc. When not spoiling his wonderful, supportive, and deserving wife, he can be found on hilltops extolling the virtues of AIX, in dark corners hacking kernels, or at technical forums explaining Oracle internals. He can be contacted at: medawsonjr@yahoo.com.