I/O Considerations for Database Performance
The biggest issue when designing a system to run a database (Oracle, Sybase, Informix, DB2, etc.) is I/O. Once the system is in production, tuning I/O for optimal database performance is a daunting task. Fixing I/O issues usually requires "touching" the database, which is time consuming, costly (due to database downtime), and risky (due to the potential for losing data). It's easy to throw multiple fast CPUs on a server and load it up with RAM if you detect CPU or memory bottlenecks. Correcting an I/O bottleneck is usually much harder. Even a "small" error in configuration can require a large amount of work to recover from. It's important to get it right before implementation. Two central issues to consider in designing a database are avoiding the creation of "hot spots" resulting from I/O contention, and protecting the data from disk failures. This article discusses the aspects of I/O that should be considered. It is slanted mainly toward UNIX, RAID, and databases, but most of the concepts can be applied to other systems as well.
Filesystems vs. Raw Partitions
A few years ago, there was an argument against using file systems to house your database data. I/O to raw partitions used to be much faster than I/O to a file system. This difference in speed outweighed the benefits of using file systems. In recent years, UNIX file systems have become more and more efficient to the point where the difference in performance between raw and file system I/O is only an issue for the most heavily taxed databases. On properly configured systems, it is now recommended that database data be stored in file systems to take advantage of the file systems' features (most importantly, backups).
UNIX provides a slew of commands useful for managing a standard UNIX file system (UFS). Commands like ls, cp, mv, tar, etc. are priceless when dealing with files. Also, file systems can grow according to the needs of the application. Standard and third-party backup solutions are also critical to reliably back up data. Using UFS for Oracle is not a clear-cut decision, however. There are a number of drawbacks to using UFS for Oracle files. System crashes frequently damage files stored in a UFS and lead to long boot times while the file system checks its integrity. Also, redundant caching (data is stored in both the Oracle buffers and the UFS data cache) and redundant copying (data is moved from Oracle to UFS to disk) are not very efficient.
Journaled file systems (JFS) were developed to improve the reliability of the UFS. While retaining all of the UFS user commands, a JFS also resists damage caused by system crashes. As a result, systems running a JFS boot very quickly after a system failure, with fewer damaged files. Unfortunately, journaled file systems are still plagued by the inefficiencies of redundant-copy and redundant-caching operations.
Using UNIX raw devices (or character special files) can provide some performance improvements over UFS and JFS implementations. For example, raw I/O allows Oracle to write directly from the shared global area to disk without incurring the overhead of I/O buffering associated with file systems. Also, only raw devices support asynchronous I/O, which can add a nice boost in performance. The downside to raw I/O is that it is difficult to administer a database stored on raw partitions. Since you can't manipulate files stored on a raw device the way you can manipulate files in file systems, it is harder to add files to the database, load-balance, and perform backup/recovery functions. Also, raw I/O has been shown to provide performance gains in only a small percentage of production sites. For most implementations, with the proper configuration of disks (using RAID), cache, or solid state disks, Oracle will perform as well using file systems as raw devices.
A Word about Database Layout
The "old" way of building a database was to create one file system or raw partition per disk. The DBA would then monitor the database using UNIX commands, like sar, and internal database functions to detect any "hot spots." Once the overutilized disks were identified, the DBA would juggle the data across the drives to try to balance the load. With a predictable, well-behaved database, and an excellent DBA/SA team, this could actually happen. However, systems and performance usually fell somewhat short of the optimum.
This method led to the discovery that the best way to improve I/O performance, using standard disks, is to group the disks together and spread the I/O across them. An easy, efficient way to group disks together is to use some sort of architecture that improves performance and protects the data from disk failures.
RAID (Redundant Array of Independent Disks) is important to implement when designing a database. Besides performance increases, most RAID levels can also provide a higher level of availability than JBOD (Just a Bunch Of Disks) configurations. The data-striping aspect of RAID implies multiple disks. There is an inherent performance advantage in using a large number of smaller disks rather than a few large disks: the increased number of actuators provides multiple simultaneous accesses to data. This setup also makes it easier to distribute data across all members for dynamic load balancing. I will briefly discuss the most common RAID levels used for OLTP implementations.
RAID Level 0 - RAID level 0 is disk striping (see Figure 1), in which data is spread, or striped, across multiple disk drives. RAID level 0 is technically not RAID, because it does not provide data protection in the event of a disk or media failure (i.e., it's not redundant). It does, however, deliver higher performance compared to an equal number of independent disks.
RAID Level 1 - RAID level 1 is disk mirroring (see Figure 2). Each disk has a mirrored partner, and all data is replicated, or redundant, on each disk. Write performance is about the same as for a single disk; however, read performance is increased as the data can be read from the primary or mirror disks, whichever is faster.
RAID Level 5 - RAID level 5 uses parity to protect the data (see Figure 3). The parity is striped across all of the drives in the volume. Read performance is substantially better than for a single disk, or a parallel access array, because there is independent access to each disk. Write performance is poor due to the overhead of parity processing. RAID level 5 performance is scalable, as more disks provide more independent access. In the case of a disk failure, data from the lost drive can be computed from the parity (using the XOR function) stored on the other drives in the RAID level 5 volume.
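The XOR arithmetic behind RAID level 5 can be sketched in a few lines. This is an illustrative model, not a controller implementation: the parity block is the XOR of the corresponding data blocks, so any one lost block can be recomputed from the survivors.

```python
def parity(blocks):
    """XOR a list of equal-sized data blocks into one parity block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Blocks on three data disks, parity stored on a fourth drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Simulate losing disk 2: rebuild its block from parity and the rest.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same routine both generates the parity and reconstructs a missing block, which is why a RAID level 5 volume survives exactly one disk failure.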
RAID Level 10 - RAID level 10 (also known as RAID 0+1 or RAID 1+0) is a striped mirror (see Figure 4). It provides the high read performance of RAID level 1 (mirroring) with the speed of RAID level 0 (striping).
Figure 5 shows how RAID Level 0+1 holds a clear performance lead over the other RAID levels. This chart was provided by ECCS Inc.
Plaid - With the maturity of software-based RAID (e.g., Veritas Volume Manager for Solaris and Logical Volume Manager for HP) a new, unofficial RAID level has cropped up: Plaid. Plaid is striped stripes (see Figure 6). To create the Plaid RAID level, you use software striping across multiple hardware-striped volumes. For example, let's say that you have three RAID level 0+1 volumes (eight 4 GB disks per volume for a total of 16 GB of usable storage per volume) defined on the hardware RAID device. The operating system will see each volume as a 16 GB disk. Using your system's Volume Manager software, group the three 16 GB "disks" into a 48 GB volume. When you create your file systems, create them as striped across the three "disks". You now have "horizontal stripes" created with the hardware RAID and "vertical stripes" created with the Volume Manager software. Hence the name "plaid". There is no better way to squeeze higher performance from your file system.
Note that software striping can also be used across volumes created with other RAID levels (e.g., create striped file systems across multiple RAID level 5 volumes). It is not recommended to create striped file systems across unlike volumes. If you were to create a file system across a RAID level 5 volume and a RAID level 0+1 volume then your performance on the fastest volume (RAID level 0+1) would be impacted by the slowest volume (RAID level 5 in this case). Similarly, all disks in a volume (or group of volumes) should be of the same type and speed. Adding one slower disk to the volume can negatively impact the performance of very fast disks in a volume.
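The two-level plaid layout can be modeled as a simple address calculation. This is a toy sketch with assumed stripe-unit sizes (the real units are set in the Volume Manager and the RAID firmware): the software stripe picks the hardware volume, and the hardware stripe picks the disk within that volume.

```python
SW_VOLS = 3   # software stripe across three hardware volumes
HW_DISKS = 4  # each hardware volume stripes across four disks
SW_UNIT = 8   # blocks per software stripe unit (assumed)
HW_UNIT = 2   # blocks per hardware stripe unit (assumed)

def plaid_location(block):
    """Map a logical block to (hardware volume, disk within volume)."""
    vol = (block // SW_UNIT) % SW_VOLS                 # "vertical" stripe
    # Offset of this block within the chosen hardware volume:
    within = (block % SW_UNIT) + (block // (SW_UNIT * SW_VOLS)) * SW_UNIT
    disk = (within // HW_UNIT) % HW_DISKS              # "horizontal" stripe
    return vol, disk
```

Walking consecutive blocks through plaid_location() shows the I/O fanning out across both the volumes and the disks inside each volume, which is exactly the point of the plaid arrangement.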
Once you've decided on a RAID level, take advantage of another aspect of a RAID box's redundancy. A RAID device is usually connected to a UNIX system with two SCSI cables through two SCSI cards. When creating the Volume Manager volumes, use alternate hardware paths to balance the load while accessing the disks. For example, let's say that you have four volumes created on the RAID box and that the UNIX system is connected to the RAID box with two SCSI cables (c2t4 and c3t6). Volume 1 would show up as a "disk" addressed as c2t4d0 and c3t6d0; volume 2 would show up as a "disk" addressed as c2t4d1 and c3t6d1, etc. When creating the Volume Manager volume, add the first "disk" by using c2t4d0, the second "disk" by using c3t6d1, the third "disk" by using c2t4d2, and the fourth "disk" by using c3t6d3. When creating your file systems, be sure to create them as striped. You have now guaranteed that all I/O to this file system will round-robin across the two SCSI cards, balancing the load.
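The alternating-path assignment described above is just a round-robin over the two controllers. A small sketch, using the example SCSI paths (c2t4 and c3t6) and four volumes from the text:

```python
controllers = ["c2t4", "c3t6"]

def alternate_paths(num_volumes):
    """Pick one access path per volume, round-robin across controllers."""
    return [f"{controllers[v % len(controllers)]}d{v}"
            for v in range(num_volumes)]

paths = alternate_paths(4)
# -> ['c2t4d0', 'c3t6d1', 'c2t4d2', 'c3t6d3']
```

With the file system striped across these four "disks", every stripe-width I/O touches both SCSI cards, so neither card becomes the bottleneck.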
Another feature a RAID device should have is redundant controllers. With redundant controllers, if one controller fails, the second takes over the load (and the failed controller can be replaced while the system is "hot").
There are two ways to implement redundant controllers: active-passive and active-active. The active-passive configuration is when one controller is active and handles all of the workload. The passive controller monitors the active controller for failures and takes over the workload in a failure situation. The passive controller also keeps a mirrored copy of the active controller's cache. This assures that no data in cache is lost when the passive controller takes over.
The active-active configuration is when both controllers are active at the same time. They split the workload and also maintain mirrored copies of each other's cache. As you can imagine, active-active configurations are more complex to develop and therefore (usually) more expensive. Vendors claim that throughput is greater with an active-active configuration. Take a careful look at their numbers. Performance increases are usually only noticed on more highly taxed systems.
The next step toward optimal performance is implementing cache. With caching, data being read and written is staged first in the RAID system's memory and then committed to disk. This allows subsequent queries of that data to be processed very quickly. The better RAID boxes allow you to turn caching on or off for each volume, as well as to use different cache types per volume. There are three basic caching types: write-back cache, write-through cache, and no cache.
With write-back cache, a write is acknowledged as completed as soon as the data is stored in the RAID cache. The RAID controller, sometime later, commits the write from cache to disk. Reads also use the cache when write-back cache is configured. Although no cache scheme can guarantee a performance increase on reads (due to the potentially random nature of reads), write-back cache guarantees an increase in write throughput, provided you are not overwhelming the cache with writes larger than the cache itself (if that is the case, consider solid state disks).
Write-through cache means that writes are not stored in cache. Instead, all writes are acknowledged only after they are committed to disk. Reads are still stored in cache. As one might expect, there is no performance gain on writes with write-through cache. Reads have all of the normal performance benefits of using cache.
"No cache" means exactly what it says. Neither writes nor reads utilize cache. Volumes that are made up of solid state disks (SSDs) should be configured with no cache. There is no performance gain in turning on cache, since solid state disks are made up of RAM. Configuring cache on volumes made of SSDs will take the cache away from the volumes made of conventional disks and negatively impact their performance.
By configuring cache for each volume, you can give cache to high-impact volumes and turn it off for the volumes that don't necessarily need the performance. Another feature to look for is cache that can be configured on the fly. This allows you to experiment with different cache configurations to determine optimal settings for performance. You may also want to configure the cache differently based on what the systems will be doing. During the day, you may want the cache mode set one way for the high-OLTP processing, while at night you may want it configured to handle the batch processing.
Just as disks in a RAID system are redundant, you should also look for RAID boxes that have redundant (mirrored) cache. Without mirrored cache, if the active controller were to crash, all of the non-committed writes in the cache would be lost. With mirrored cache, those writes are completed when the redundant controller comes on-line. This is true with both active-active and active-passive configurations.
When configuring a volume on a RAID box, you may also want to customize the stripe depth. The stripe depth defines how much data is written to one disk before writing to the next disk. For example, assume that three disks are grouped into a RAID level 0 volume with a stripe depth of 3. If you are writing 18 blocks of data, then three blocks are written to disk 1, three blocks to disk 2, three blocks to disk 3, three blocks to disk 1, three blocks to disk 2, and the last three blocks to disk 3. In this example, the first nine blocks are written as three blocks simultaneously to each of the three disks, and the second nine blocks are written in the same manner. This equals 18 blocks written to three disks in the same amount of time it takes to perform a write of six blocks to one disk. If you are reading and writing small amounts of data (OLTP), then you would choose a small stripe depth. Each read and write should access as many drives as possible to increase performance. The more drives you have working for you, the faster the read or write will occur, because the drives will be sharing the load. Figure 7 shows how 18 blocks of data would be spread across three disks striped with a depth of 3.
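The 18-block example above can be modeled in a few lines: with a stripe depth of 3, blocks land on the disks in groups of three, wrapping around after the last disk.

```python
def stripe_layout(num_blocks, num_disks, depth):
    """Return, for each disk, the list of block numbers written to it."""
    disks = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disk = (block // depth) % num_disks
        disks[disk].append(block)
    return disks

layout = stripe_layout(18, 3, 3)
# Disk 1 gets blocks 0-2 and 9-11, disk 2 gets 3-5 and 12-14,
# disk 3 gets 6-8 and 15-17 -- six blocks each, written in parallel.
```

Each disk ends up with six of the eighteen blocks, which is why the write finishes in roughly the time a single disk would need for six blocks.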
Solid State Disks
For the ultimate in I/O performance, use solid state disks. SSDs are the fastest storage technology available, both in terms of access time and transfer rate. They have sub 50-microsecond access times allowing I/O throughput rates of up to 8,000 I/Os per second. Compare this to today's fastest 10,000 rpm SCSI drives with 8.19 millisecond average access time and about 125 I/Os per second (see Figure 8). This data was provided by ECCS Inc.
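The rough arithmetic behind those figures: if each I/O costs one average access, the I/O rate is the reciprocal of the access time. A quick sketch (note that the article's quoted SSD rate of 8,000 I/Os per second is well below this idealized ceiling, presumably reflecting controller and bus overhead):

```python
def iops(access_time_seconds):
    """Idealized I/O rate: one I/O per average access time."""
    return 1.0 / access_time_seconds

scsi_10k = iops(8.19e-3)  # ~122 I/Os per second for an 8.19 ms disk
ssd_ceiling = iops(50e-6) # 20,000 I/Os per second for a 50 us access
```

Even against this conservative comparison, the SSD comes out one to two orders of magnitude ahead of the fastest mechanical drives.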
An SSD drive is a disk drive that uses semiconductor storage (DRAMs) instead of magnetic platters as the data storage media. Since they are memory and have no moving parts, solid state disks do not have any of the delays associated with the mechanical components of magnetic disks (seek time, rotational delay, latency, etc.) leading to their very high performance. SSDs use the same SCSI hardware and are accessed as if they were mechanical disks using normal SCSI commands.
SSDs can easily be installed in many RAID storage devices. A solid state disk can replace any conventional disk in the system. SSDs can be used as individual disks or in any of the supported RAID configurations. To the end-user and more importantly to the operating system, a solid state disk looks and acts like any other SCSI disk. The fact that the disk is solid state is totally transparent to the operating system. The only difference to the OS is that solid state disks are about 100 times faster than magnetic disks.
The greatest improvement in performance will be seen in small, random I/O requests. There is no waiting for the disk heads to move to the next piece of data. Performance gains start to diminish as I/O requests grow larger and more sequential. Larger, sequential I/Os spend more time transferring the data than they do locating the data on the disk (and locating data is where solid state technology thrives).
As you can tell, I'm an avid proponent of RAID. I think any administrator who designs a system that holds critical data and who chooses not to implement a RAID solution is putting their job on the line.
RAID (except for RAID level 0) provides the necessary redundancy to protect critical data and allows an administrator to squeeze all the possible performance from disks by striping data (except in RAID level 1), providing cache, and balancing the load across I/O channels. Combine RAID with the use of file systems (allowing a DBA to move "hot" database files with ease) and the task of I/O tuning a database becomes much easier than in the past.
About the Author
Jim McKinstry is a technical analyst, specializing in UNIX, for Sprint Paranet, a vendor-independent supplier of network services for distributed computing environments. This article contains excerpts from a whitepaper Jim wrote for ECCS Inc that is currently being presented at Oracle conferences around the country.