RAID Performance Issues
Ed Lehr and Chris Schultz
Most people consider data security the primary motivation for employing RAID storage, forgetting that RAID can also have performance benefits. Choosing the correct RAID level and configuration for your array, however, is essential in obtaining the best performance. This article explores various issues that impact RAID choice and performance, as well as techniques for tuning your RAID arrays to the performance requirements of different applications.
RAID Levels and Performance
A number of off-the-shelf disks properly configured as RAID 0+1 (data striping plus mirroring) will give the best performance and still protect from a disk failure, however, the cost may be prohibitive. The two most popular levels of RAID for Very Large Databases (VLDB) are RAID 3 and RAID 5. RAID 3 stores the data across several drives and stores all parity information on a single drive. For random write operations, this single drive can be a bottleneck. However, for very large sequential write applications, such as streaming imagery, RAID 3 excels. RAID 5 splits the parity data across drives. For example, the parity information for different data blocks stored on drive one in a five-drive array is stored across drives two through five. The parity for drive two data is striped across drives one, three, four, and five, and so forth. This configuration tends to work best when the host server executes a great deal of random I/O operations.
Whether you choose RAID 3 or RAID 5 depends on your application. When you need to make a random update to a file in a database, RAID 5 will perform better because there is less chance that multiple parity drives will be bottlenecked. Striped parity is also important when you realize that every time you change a block in a parity-based RAID scheme, the program must read the old block, read the old parity, then calculate the new parity and write both the new parity and data back out to the drive. The parity data must be accessed twice for each I/O operation; striping this data across multiple drives makes this function more efficient.
RAID 3 is most applicable when you need to stream a lot of data to and from the array in a sequential manner. As data is read from a RAID 3 array, it looks like a simple stripe of drives to the controller. This provides a slight advantage over RAID 5. The controller algorithms are also much simpler to implement. As data is streamed back into the array, the stripe is filled up, and then parity is calculated. Since the array knows that it just filled the data drives with enough information, it doesn't need to read any old data blocks or parity values. It just calculates and writes the information on the fly. Random writes nullify this advantage.
The parity calculation that allows a RAID array to survive a drive failure compromises performance by requiring additional I/O operations. This is most noticeable in software implementations of RAID. The host computer is required to do the additional I/O operations and calculate parity values. Not only does this make for slow I/O, but it also degrades overall system performance.
A fast RAID array should require very little of the host computer. However, the actual device can be fairly complicated. Additionally, other hardware is added to make the RAID device even faster. With intelligent use of cache, multiple internal SCSI buses, and striping schemes, a good RAID implementation can move data faster (and more safely) than a similar set of striped drives.
The critical component of a fast RAID implementation is a good write-back cache. This memory is located on the I/O controller in the array and allows an immediate commit on a write once data is safely written into the cache, instead of waiting for the drive. This gives the I/O controller some flexibility in timing the actual writes. In a simple scheme (and with enough cache) the data could be written during periods of idle time, when the host does not need to access the array. However, in a VLDB environment with many users, such times may be few and far between. Another alternative is using a write-gathering style of cache. The array controller optimally arranges the writes in its cache based on the number of drives in the RAID stripe. When it flushes the cache, it maximizes the use of each I/O operation internally in the array.
We have found the "write penalty" that has long been associated with RAID 5 has been ameliorated by improved caching and sorting algorithms. The write performance of a vendor's RAID 5 should certainly be benchmarked using your application, however.
RAID Tuning Strategies
Obvious though it may seem, the tuning process begins before the PO is written. RAID technologies, vendor implementations, application requirements, availability requirements, and budgets are part of the RAID tuning matrix. If you can influence these variables now, that's great. But even if the purchase has been made, you should understand how past decisions will impact your tuning effort.
Four Steps to Optimal Performance
The basic steps are: categorize, segregate, assign, and optimize. You should categorize your I/O patterns, segregate them where possible, assign the segments to appropriate RAID levels, then optimize the array. To give concrete examples in the process, we assume you are tuning for a Very Large Database (VLDB) managed by a relational database engine. It's a common commercial application, and it will require us to examine all four I/O patterns. Whether or not you are tuning for a database, you will see how I/O patterns, RAID implementation, concurrency, I/O size, stripe depth, and array size all impact performance tuning for any application.
First, determine the level of concurrent accesses to the files. High levels of concurrent accesses will create a random I/O pattern even if the individual accesses are sequential. With non-database applications, knowledge of the application and usage patterns, combined with sar data can quickly answer this question as well as whether the dominant access is to read or write. You may need to interview your users about how they use the application and other files to gain some insight into the nature of the accesses. Where you cannot determine the I/O patterns by interviews or by the nature of the application, you may have to separate the files on different devices and analyze sar stats to determine whether concurrency is high or low (avque and r+w/s), or whether I/Os are predominantly reads or writes (w/s and r+w/s). The goal is to correlate the usage pattern to individual files in the application.
A relational database may exhibit sequential I/O patterns in the data (non-redo or control) segments, but this is unusual. Sometimes DBAs, and even SAs, do not consider the effect of concurrency when categorizing I/O patterns and inappropriately assume parts of the database are accessed sequentially when the pattern really is random. Your DBA may insist that a table stored on its own drive(s) is primarily accessed by "full table scans" and thereby assume a sequential read pattern dominates. If more than a few users are accessing the table, or if some form of parallel query decomposition is used, the access pattern will be random.
Databases often have a staging area outside of the database for the loading, unloading, or hygiene of data. Such staging areas usually have low concurrent access, with a single process performing sequential reads or writes.
Segregation of the files on different arrays, whether for a relational database or some other application, should be as follows:
1. Group all randomly accessed data files together on the same array. Your DBA may balk at grouping index data files and table data files on the same array. Encourage the DBA to run benchmark queries against the database with tables and indexes separated on different arrays and again when stored on a single large array. Be sure the benchmark queries simulate realistic loads. The results will depend on the efficiency of the algorithms implemented by your RAID vendor. Some algorithms will benefit from the randomness and increased frequency of accesses, others will not. The DBA should also run benchmarks that use rollback segments with the rollback segments segregated and again when assimilated on one large array with tables and indexes. Much depends on the nature of the database application and your RAID implementation. The point is that conventional wisdom may no longer be correct. Always test previous assumptions when implementing a new RAID product.
2. Place sequentially accessed files with low concurrent operations on their own array with no other files. If the files are small, salvage the wasted space by placing the files in their own small partition. Use a second partition for static storage that will be accessed very infrequently or only during non-prime hours. Redo logs should be treated this way. Staging areas should be segregated from the database files on separate arrays. Determine whether input and output occur in the staging area concurrently. If so, create separate arrays for input and output.
3. Place sort segments on their own array on raw devices. Raw devices can be easily be used here because management of sort segments (backup, recovery, etc.) is usually not an issue.
4. Some I/O operations in the database take place synchronously and may halt database operations until the I/O is complete (archiving) and will require special planning and configuration. The automatic archiving of redo logs is the prime example. If this function is turned on, the archive destination should be configured for sequential access on its own array.
You should work with your DBA to determine the I/O pattern, reliability, and availability requirements for each database segment. From a performance perspective for any application, the best RAID level is RAID 0. From a performance and reliability perspective, RAID 0+1 is optimal. When cost is thrown in the mix, RAID 3 or RAID 5 is the optimal compromise for data segments. Because most databases support concurrent access by a number of users or processes where the I/O pattern is random access, RAID 5 is the choice most often seen in commercial VLDBs for data segments.
Usually database sort segments are placed on raw RAID 0, because they do not hold data or require backup but do require top performance. A failure of a drive here would interrupt service for a short period, but would not require the recovery of data. If even short interruptions are unacceptable, place sort segments on 0+1. Redo logs (online and archived) should be placed on separate arrays at RAID 0+1. If write performance is critical and your RAID 5 does not write efficiently, you will have to use RAID 0+1 for data segments.
You may have heard of the "write penalty," or overhead, long associated with RAID 5. Here's news. We found two vendors whose RAID 5 outperforms RAID 0 in writes using small amounts of controller cache (16 MB) in common benchmarks such as "bonnie" and "diskbench." These results have since been confirmed with other shops performing similar tests. These vendors have invested substantial effort in write-gathering and write-through caching algorithms.
Most vendors though, have not produced these kinds of results and will insist that you purchase inordinate amounts of cache to cover for inferior algorithms. Cache may improve performance in OLTP environments to some extent, but it is an expensive solution. Cache is of really very little benefit in DSS environments.
We emphasize vendors' algorithms because the brute force approach using cache on the controller is very inefficient, requiring huge amounts to be effective. You will do better to put more memory for the application on the server than to spend it on the drive controller cache. This is because of the "skimming" effect that is seen in cached I/O chains. One cache "skims" the hottest blocks from the cache beneath it. What happens is that caching algorithms at each cache level keep the most frequently requested blocks in their buffers, avoiding the request to caches lower in the chain. In a typical database application, there will be three or four independent caches above the drive controllers (cpu cache, app cache, file system buffers). This means that the controller cache, at the end of the chain, will get the smallest number of requests for the most frequently accessed blocks. This effect is most noticeable in OLTP environments.
In decision support systems and others requiring sequential scans of large tables, the cache chain becomes nothing more than a long pipe sucking data no faster than the drives can supply it. Our tests and others have shown that 64 MB of cache on the RAID controller performs no better than 16 MB in VLDB environments.
In both OLTP and DSS environments, the vendor's algorithm is most significant. We have found efficient read-ahead caching algorithms to produce hit ratios greater than 80% in 16 MB of controller cache when doing full table scans on 20 GB tables. Again, it is the vendor's algorithm that makes the difference, not the amount of cache.
Primary Tuning Parameters
There are four primary tuning variables: stripe depth, I/O size, array size, and concurrency (I/O rate). You can worry less about these on RAID 3 or 5 where your vendor software intelligently manages queuing. But, if you are using a less sophisticated product or are tuning RAID 0, you will need to attend to these parameters. The goal is to adjust these parameters to minimize queuing and wait times. The average queue, wait times, and service times on an array are determined by the performance of the drives, array size, concurrency, and the depth of striping in relation to the I/O size (deep or shallow). For example, if a 64K I/O is spread across a four-drive array in 16K chunks, then only one 64K I/O can be serviced at a time. However, two 64K I/Os can be serviced simultaneously if the stripe depth is 32K. The "deeper" stripe depth of 32K enables higher concurrent operations and decreases wait time, but at the expense of service time because now only two drives service the I/O instead of four. There are ways to adjust the I/O size and even the concurrency on an array, but usually not as easily as stripe depth.
The simplest and most expensive approach is to set stripe depth equal to I/O size and increase the array size until the capacity of the drives equals the required I/O rate for the application. In practice, we are often limited in the number of drives we can place in an array and must optimize the other parameters for that size. The methods we use to do this are too involved for this article, but we can get you started with a general rule of thumb. After establishing concurrency and I/O size for a given array, determine gross stripe depth (i.e., shallow or deep). If concurrency is high, that is, at least 25% higher than the number of drives in the array, then the stripe depth should be deep - at least a multiple of the I/O size. If the concurrency is lower than the number of drives in the array, then the stripe depth should be shallow - an even fraction of the I/O size.
As you have seen, RAID is quite tunable. In arrays managed by intelligent controllers, vendors' algorithms will usually do a better job of managing average waits and average service times. In this case you will be wise to make your arrays as large as you can. Most vendors will still let you configure stripe depth. If so, use the ideas above to pick an appropriate depth. You may find that you have less control over stripe depth with some RAID vendor implementations. This is not necessarily bad. It may indicate that their algorithms are optimized for a limited range of stripe depths. You should benchmark RAID 0 arrays under both your logical volume manager (LVM) and your vendor's software to better determine where to place your RAID 0 and 0+1 arrays. The LVM will always cost you some CPU cycles as will any software implemented RAID. However, where we have an efficient LVM, we prefer to use it for RAID 0, increasing space under intelligent controllers for RAID 5, where our RAID vendor's software excels.
About the Author
Ed Lehr is a Principal Consultant with Oracle Corporation in Richmond, Virginia. He has experience in the airlines, oil and gas, records management, and banking industries. When he is not architecting and tuning databases, he spends his time flying sailplanes.
Chris Schultz is a Systems Engineer for Silicon Graphics, Inc. Chris holds a degree in Chemistry, which he somehow parlayed into a job at a bank before migrating to SGI. From time to time, you can still catch him tweaking 30-year-old FORTRAN code for friends.
At the time this article was written, Ed and Chris managed one of the world's largest UNIX/Oracle DSS systems, in terms of concurrent usage, for Signet Bank in Richmond, Virginia.