Tuning: A Case Study
Mark E. Dawson, Jr.
I've encountered many types of interview questions from employers
in the IT field. Questions have ranged from troubleshooting, architecture,
and performance tuning to operating systems, database technologies,
and kernel internals. Although most of the questions were par for
the course, one question really stuck with me and led me to write
I was asked about my preference between capacity planning and
design, or performance tuning of already existing systems. I responded
that I prefer 55% of performance tuning in existing systems, 45%
of architecting new systems. Almost any engineer can follow a recipe
or HOW-TO doc to build a recommended configuration from scratch,
but it takes in-depth knowledge of a variety of technologies and
how each interoperates to tune and troubleshoot what someone else
has constructed. Of course, this answer helped "seal the deal"
after the interview, but I will attempt to demonstrate the truthfulness
of my answer via a case study of a performance-tuning scenario I
recently encountered. The situation involved an Oracle database
crash where the database had been configured to use asynchronous
I/O. Although diagnosing a database crash may seem more like a troubleshooting
issue, a deeper understanding of running Oracle with asynchronous
I/O (AIO) enabled will show that it is really a performance issue.
Early one morning, one of the DBAs in the group sent an email
about an Oracle instance I'll refer to as "prodDB":
FYI: The prodDB database went down at 1:40am with a database writer
(DBWR) error. I brought the database up and everything looks fine.
This error has happened before and is typically due to the DBWR
process not being able to write to a file on the NetApp. The exact
error is ORA-27062: skgfospo: Could not find pending async I/Os.
Of course, we couldn't tolerate such instability in our production
database for long.
Our environment consisted of Oracle 8.0.5 through 8.1.5 databases
running on several Solaris 2.6 E3000-E4500 servers. Network Appliance
(NetApp) F760s running OnTap 5.3.5R2 served as the filestore for
the databases, which mount the NetApps via NFS over UDP (rsize/wsize=32kB).
These servers were networked with the NetApp backend storage via
Cisco Cat6505 switches. All the NetApps used GigE connections to
the switch, while half of the servers used 100-Mb connections and
the other half GigE. The server on which prodDB runs was attached
at 100 Mb. Each Oracle database wrote using filesystem asynchronous
I/O (AIO). ProdDB was an Oracle 8.1.5 database.
The Initial Prognosis
My initial guess was that the database instability issues were
due to using AIO in our Oracle 8.1.5 instance. This idea stemmed
from reading white papers stating that Oracle wasn't even supposed
to be able to run AIO against anything but raw datafiles with Oracle
8.1.5 through 8.1.6 on Solaris. This appeared to be a little-known
fact, and caused some confusion since our environment was actually
running with this configuration. So, did Sun and Oracle mean that
we shouldn't use filesystem AIO on filesystems, or that
the Oracle software wouldn't allow its use in 8.1.5
and 8.1.6? The docs I've read indicated the latter.
Further discussion with Oracle indicated that the information
provided in these white papers from Sun and Oracle were only partially
correct. The missing piece of information (that neither Sun nor
Oracle included in their documents) is that only Solaris 2.6 with
kernel revision 105181-17 and below is incapable of utilizing filesystem
AIO with Oracle 8.1.5 and 8.1.6. We were well above that kernel
rev, so there was nothing to prevent use of AIO in our environment.
Asynchronous I/O Explained
Why was AIO even considered as an issue? In case you're unaware
of what AIO is and how it works, I'll briefly define it and
explain why we decided to use it for our databases.
Normally, the write() system call in UNIX will return without
actually writing the data to disk due to filesystem buffering, which
greatly aids write performance. However, if a developer wanted to
guarantee that data reaches the disk before a write() call
returns, he could execute the system call with the O_SYNC
(or O_DSYNC) flag to bypass the default buffering scheme.
When this flag is used, a write will take longer, but the data is
guaranteed to be on disk after a successful return from the write()
system call. This also causes the calling application to wait idly
while the call executes, preventing the application from doing any
other work until the system call returns. Oracle is one of those
applications that requires a guarantee that all writes are completed
Modern Unices provide an asynchronous method of reading and writing.
By means of this, Oracle can issue a write syscall with the
O_DSYNC flag, but continue processing while the call executes
in the background. This alleviates the need to wait idly on the
data to reach the disk before the application can proceed with other
work. Since AIO greatly enhances application performance in I/O-intensive
workloads, Oracle provides an option for DBAs to enable it on operating
systems supporting this feature. Because of the performance benefits
that AIO afforded our environment, we implemented it across all
There are a couple of ways for an application to know when an
AIO call has completed: by catching the SIGIO signal, or
by calling aiowait(). Oracle makes use of the latter. In
addition, a timer is placed on the execution of "aiowait",
such that if this time is exceeded with no response from outstanding
AIO writes, the process will die to avoid any further data corruption.
The database then crashes, and the alert log will show the ORA-27062
message mentioned above.
The Usual Suspects
After identifying AIO as the issue, I formulated a checklist of
1. RDBMS: Could there be an Oracle bug? Is Oracle misconfigured?
2. Operating system: Does a kernel or NFS patch need to be applied?
Is there I/O contention on the server, perhaps caused by other applications
running on the database machine?
3. NFS issue: Could NFS be misconfigured, or could the NetApp
be overloaded with NFS requests?
4. Network: Could there be faulty cables, NICs, switch ports,
5. NetApp issues: What about an OnTap bug, overloaded machine,
poorly configured machine, etc.?
The problem could be one, or a combination, of these factors.
The Initial Analysis
Going down my checklist, I ruled out item #1, as Oracle's
MetaLink Web site only listed one AIO bug that occurred when databases
resided on an unpatched version of Veritas Volume Manager.
I also ruled out item #2 because the server in question was only
running Oracle databases. Thus there was little likelihood of resource
contention from other applications. The kernel and NFS patches were
also up-to-date at the time.
The /var/adm/messages file on Solaris revealed numerous
NFS server still not responding messages around the time
of the database crash. iostat with the -xPnc options
(see Solaris iostat man pages or Adrian Cockroft's Sun
Performance Tuning book for more on these options) showed that
average service time and average wait time for the NFS-mounted Oracle
datafiles were often more than 60-70 ms, sometimes far more than
this! This led me to the handy nfsstat command, which, as
you can probably gather from the command name, reveals statistics
on NFS/RPC client and server operation.
On the client, the most important stats would come from running
nfsstat with the -rc args, which delivers info on
the RPC calls (-r) on a client (-c). The areas of
interest were the calls, retrans, and badxids
columns under the "Connectionless" heading (since we're
using NFS over UDP, the connectionless transport mechanism).
Interestingly, the retrans/calls percentage was
high (more than 2%), while the badxids column was at 0.
Each unique NFS request has a transaction ID (xid) assigned
to it that the NFS server returns in its response. If an NFS client
receives a response with an xid for which there is no outstanding
request, the badxids column is incremented. Such events occur
commonly when an NFS server is overloaded, thus causing a client
to retransmit its request after a response timeout is exceeded until
it receives a response. Furthermore, the outmatched NFS server may
send late responses to each request (the original packet and the
retransmitted packets). The client accepts the first response, rejects
the duplicate replies, and increments the badxids columns
for each such duplicate response (a request NFS packet retains the
same xid across all retransmissions of the same request).
Remember that our client machine had no badxids, but a
high percentage of retransmissions. This told me that the NetApp
could keep up, but packets were getting dropped somewhere along
the way. Thus, I ruled out item #3 on the checklist.
As previously described, the network consisted of the Sun Enterprise
server attached to a Cat6505 switch via 100 Mb, with the backend
NetApp attached via GigE. The 100-Mb card at the server end was
a Sun hme card, for which a patch existed that remedied a packet-dropping
behavior of the card (as noticed from netstat -in). However,
application of the patch only very negligibly helped the NFS timeout/retransmit
issues. Another thing that came to mind was the infamous autonegotiation
woes that can occur between NICs and Cisco switch ports, where port
speed and duplex settings between the NIC and port suddenly become
out of sync during operation. However, every switch port had its
settings forced for every server machine. Nevertheless, checklist
item #4 still needed closer investigation.
The Network Examined
On the Cat6505 switch, executing show port mac PORTNUM
against the 100-Mb port attached to the server showed a number of
"Out-Discards". This result has to do with the buffers
assigned to each port on a switch. GigE ports have 448 KB of transmit
buffer space, with 64 KB of receive space, while the 100-Mb ports
have 112 KB of transmit space and 16 KB of receive space. Frames
dropped from the transmit buffer increment the "Out-Discards",
while "In-Discards" are frames dropped from the receive
end. My hypothesis was that the amount of traffic being pushed from
the GigE end of the connection from the NetApp was overrunning the
port buffers at the 100-Mb end of connection from the Sun server.
Using NFS over UDP exacerbated the problem, as a dropped packet
requires retransmission of the entire datagram rather than just
the lost packet in the case of TCP. With 32-KB-sized NFS packets,
you can see how important retransmits can become in a lousy network
with a 1500-byte MTU size (that's approximately 22 packets
in one 32-KB NFS request).
IEEE 802.3x defines flow control for full-duplex Ethernet LANs,
implemented by means of pause packets sent at the MAC layer (Layer
2) to signal to the other end to stop sending packets for a period.
Unfortunately, neither Sun 100-Mb hme NICs nor the Cat6505 100-Mb
ports support flow control. Only Sun GigE adapters and Cat6505 GigE
ports support them, though Fast Ethernet in general is capable of
The NetApp Examined
Checking the CPU utilization and its NFS-specific load handling
via sysstat and then nfsstat -d confirmed that the
NetApp was not overloaded. However, the GigE card was dropping packets
from its receive queue, as shown from executing ifstat -a.
So now we had dropped packets at the switch and the NetApp
end. Flow control was enabled, but not all cards communicating with
it made use of flow-control.
Item #5 proved true, in part, due to some initial configuration
problems with the NetApps. The F760 had a disk controller located
in a slot with a higher interrupt priority than the GigE card, which
would interfere with the GigE's ability to push packets onto
the bus during disk operation, causing it to drop packets from its
hardware receive queue as it filled with incoming packets. Addressing
this issue would require downtime for a number of databases, as
each F760 supported multiple databases.
For more info on NetApp commands, go to http://now.netapp.com.
If you have a NetApp, go to http://hostname_of_netapp/na_admin/
to read the man pages off the filer itself.
My recommendation was to make sure all database machines are networked
to the NetApps via GigE <--> GigE with full flow control.
This would enable all endpoints on the connection -- the NetApp
GigE, the switch ports, and the Sun GigE -- to use the 802.3x
flow control capability to alleviate receive queue and port transmit
buffer overruns. Lastly, all NetApp machines with an unsupported
slot configuration (detailed via sysconfig -c) would have
the slot assignments rearranged during a maintenance window to place
GigE cards in the higher numbered slots over disk controllers.
Interim solutions, such as turning off AIO in favor of using multiple
I/O slaves (which has the unfortunate side-effect of serializing
writes to redo log members) and using NFS over TCP (with the additional
per-packet and packet type overhead) were used until the real solutions
could be put in place. In the meantime, I configured a fault-tolerant
Sun E4500 with a fully patched OS and GigE connection on which to
migrate prodDB. If you are not familiar with running databases on
NetApp storage, the ease of such migrations is unmatched when compared
to traditional direct-attached storage. All that was necessary was
to bring the database down, mount the NetApp volumes onto the destination
machine using the same mount points as the source machine, and bring
the database back up.
The End Result
Several databases were migrated to the new E4500, and have not
exhibited any of the crashes or timeout symptoms of the past. The
cleaner network with GigE flow control from end-to-end provides
not only an ORA-27062 free environment, but also much improved I/O
performance. For example, the time to export prodDB has dropped
to 30 minutes from a previous duration of as long as 4 hours. Retransmissions
are down to .01% from an initial 2% reading. iostat shows
that wait times and service times are often in the 5-14 ms range,
which is well below the 30-ms rule-of-thumb that I generally use.
Life is good.
Later, after seeing how much better the databases were performing,
I removed unnecessary dependency on the network from the database
operation. This included such things as using local file locking
mechanisms instead of using rpc.lockd and avoiding the unnecessary
close-to-open mechanism used by NFS to ensure data integrity among
NFS clients. Both of these are implemented by using the -llock
and -nocto NFS mount options, respectively. Since no two
NFS clients in our environment shared Oracle datafiles (nor can
they until Oracle Parallel Server is certified on NetApp), these
NFS facilities are unnecessary. As a result, database performance
has improved even more.
While I thought of testing using 8-KB NFS rsize/wsize
for the database mount points since the Oracle blocksize was 8 KB,
I decided to just leave well enough alone. I'd had enough adventure
for one week at a new position.
My answer to the interview question mentioned at the outset had
the foremost goal of impressing the employer and clinching the job.
However, the gist of my response reveals what I, and many other
performance enthusiasts, have long ago come to appreciate --
the joy of performance tuning.
This case study illustrates the challenges posed by performance
tuning in comparison to architecture in most cases, especially when
the systems being tuned were configured by unavailable third parties.
As an engineer, I thoroughly enjoy designing systems to meet and
beat the workload demands of a business. But the challenge and the
technical skill involved in performance tuning bring out the brainteaser
enthusiast in me, enriching me far more than most architectural
tasks. I hope that this article will arouse more interest in this
specialized area of systems administration.
Mark El Toro Dawson, Jr. is Senior UNIX/Oracle Engineer at
Encyclopaedia Britannica, Inc. When not spoiling his wonderful,
supportive, and deserving wife, he can be found on hilltops extolling
the virtues of AIX, in dark corners hacking kernels, or at technical
forums explaining Oracle internals. He can be contacted at: firstname.lastname@example.org.