Real-World RAID Implementation
Some of my more challenging and interesting assignments have been analyzing UNIX systems to identify and correct performance bottlenecks. UNIX performance generally comes down to a few factors: memory, CPU, network, kernel, and disk I/O. Provided that a system is not physically maxed out, solving a CPU or memory bottleneck can be as easy as adding more CPUs or RAM. Network problems can be solved by doing something as simple as upgrading a few network interface cards or as involved as a complete network redesign. Tweaking a UNIX kernel is becoming a lost art with the advent of "dynamic" kernels. Disk I/O bottlenecks can be harder to find and more difficult to solve than the others. This article describes a RAID solution that was successfully implemented to remove existing I/O bottlenecks and allow for future growth. The experience of this case may be a useful blueprint for analyzing other RAID-related problems.
I was in a situation where two UNIX systems were sharing an older RAID box running at RAID level 5 and RAID level 1. (For a review of RAID levels, see the sidebar "RAID Overview.") The RAID had a split-backplane configuration, with one half containing ten 2-GB drives and the other half containing ten 1-GB drives. Each UNIX box had two SCSI adapters, one for each backplane in the RAID box. The RAID box had two controllers per backplane; one controller was always active, while the other was passive for failover purposes. This configuration was poor for several reasons. First, the technology was simply old: the unit had no cache and few RAID levels to choose from, and the drives and controllers were 4- to 5-year-old technology, much slower than those available today. Second, we had no automatic recovery if we lost a SCSI controller on a UNIX box; a UNIX SCSI controller failure meant downtime until we could have it replaced. Finally, the configuration was maxed out for space. We could not add any more capacity without upgrading the controllers (which were no longer being manufactured), and even then, the controllers were so old that they didn't support the 4-GB drives necessary to reach the capacity we wanted.
When I inherited this system, I was told that there were some performance issues. After analyzing sar reports and watching the output of top, I discovered that we were low on memory, needed more CPU power, and had some major I/O problems. (For more information on detecting and correcting UNIX performance issues, see System Performance Tuning by Mike Loukides, O'Reilly & Associates, Inc.) Solving memory and CPU problems can be very easy. If you determine that you are low on memory, add more. (Memory problems can also be addressed by tweaking the buffers in the kernel; however, this tends to just move the problem from memory to I/O.) If you find that you don't have enough CPU horsepower, add more (or bigger) CPUs. Of course, after I corrected my CPU and memory bottlenecks, the I/O bottleneck became much worse, since a more powerful system is able to request more data faster. Solving I/O problems is also a much harder task: try tweaking the I/O buffer sizes, reorganizing the data on the disks (more heavily used filesystems/raw partitions should be on faster drives or striped volumes), upgrading the hard drives and controllers, adding cache, and so on. The poor performance of my RAID, combined with the inability to upgrade it for speed and capacity, led me to the decision that the RAID would have to be replaced.
On the old RAID unit, I/O performance was so bad that sar -d reports would show some volumes 100% busy for hours. (The new RAID barely even registers with sar.) sar, the System Activity Reporter, is part of every UNIX system and was the only tool available to me for this purpose; you may be able to find a third-party application (GlancePlus is a decent one) to help determine your system bottlenecks. In short, I detected my I/O problem by analyzing sar data that was collected every 5 minutes, 7x24. The fields to look at in sar -d are %busy, avque, r+w/s, blks/s, avwait, and avserv. It's hard to say what these numbers should be for any given system; numbers that are "bad" on one system may be "OK" on another. You need to look at your own numbers and decide what is normal and what is unusual for your system. In my case, 100% busy was a pretty obvious problem.
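For illustration, here is the kind of quick interactive check this involves. The sar invocation is standard, but the output line below is invented, and the exact column layout varies from one UNIX flavor to another:

    # sample disk activity every 300 seconds, 12 times (one hour)
    sar -d 300 12

    # a saturated volume looks something like this (values are made up):
    # device  %busy  avque  r+w/s  blks/s  avwait  avserv
    # sd2       100   12.5    180    2880    85.0    48.3

A %busy figure pinned at 100 for long stretches, together with a deep queue (avque) and high wait times (avwait), is the signature of a disk bottleneck. Replacing a RAID system is a far more complicated task than replacing memory or CPUs. The steps I took were: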
- Size and purchase the new RAID. (I had to get growth estimates from my DBA and management. The vendor was also able to help me size the cache I needed.)
- Add faster SCSI controllers on the UNIX boxes to support the new RAID system.
- Determine RAID levels, volume size/layout, etc., for the new RAID. (The vendor can help with this.)
- Connect/configure and test the new RAID.
- Move filesystems from the old RAID to the new RAID and monitor for a week.
- Move filesystems from the non-RAID drives to the new RAID and monitor for a week.
- Move database data from the old RAID unit to the new one. (This involved reconfiguring the database, since it addresses raw partitions by their hardware addresses, and the new RAID's addresses are different from the old ones. A sketch of both kinds of moves follows this list.)
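The moves themselves used nothing exotic. The commands below are a generic SVR4-style sketch, not a transcript of my system; every device name and mount point is hypothetical, and your platform's filesystem tools may differ:

    # move a filesystem to the new RAID (device names are examples)
    newfs /dev/rdsk/c2t0d0s4            # or mkfs, depending on platform
    mount /dev/dsk/c2t0d0s4 /mnt/new
    cd /u01                             # filesystem on the old RAID
    find . -print | cpio -pdm /mnt/new  # copy, preserving modes and times
    # verify the copy, then update /etc/vfstab and remount at the old path

    # move a raw database partition (database shut down first)
    dd if=/dev/rdsk/c1t2d0s5 of=/dev/rdsk/c2t1d0s5 bs=64k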
Sizing Your New System
When choosing a new RAID system, you should consider the following factors.
How much storage do you need? Take into account filesystem data, database growth, etc. Interview your DBAs and management to determine how much space they think they will need. Find out how many new user accounts will be added (mail servers can be real pigs). Don't forget to include space for any new sys admin tools you may buy (capacity-planning tools need to store lots of performance history). Also consider moving any data that is not already protected by RAID onto the new system.
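A back-of-the-envelope worksheet helps keep the interviews honest. Every figure below is a made-up example; plug in your own numbers:

    # rough sizing worksheet (Bourne shell; all values hypothetical)
    FS_DATA=10        # GB of existing filesystem data
    DB_NOW=12         # GB of database data today
    DB_GROWTH=8       # GB of projected growth (from the DBAs)
    TOOLS=2           # GB for new admin/capacity-planning tools
    UNPROTECTED=4     # GB moving off unprotected internal drives
    USABLE=`expr $FS_DATA + $DB_NOW + $DB_GROWTH + $TOOLS + $UNPROTECTED`
    PHYSICAL=`expr $USABLE \* 2`    # RAID level 10 mirrors everything
    echo "usable: $USABLE GB, physical (RAID 10): $PHYSICAL GB"

Remember that the RAID level you choose (next item) determines how much physical disk you must buy per usable gigabyte.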
What RAID level do you want to run? One DBA told me that he "needed" to run RAID 5. Databases don't care what RAID level you use; the database vendor had told him about 5 years earlier that RAID level 5 was the best, so that's what he wanted. With the lower cost of hard drives, RAID level 10, although still more expensive than RAID level 5, may be affordable and is much more fault tolerant. Your vendor should be able to help you decide what you need.
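The cost/fault-tolerance trade-off is easy to see with a little arithmetic. Using ten 4.2-GB drives as an example (the drive size matches my configuration; the rest is generic RAID math):

    RAID 5:  ten drives, one drive's worth of parity
             usable = 9 x 4.2 GB = 37.8 GB; survives exactly 1 failure
    RAID 10: five mirrored pairs, striped
             usable = 5 x 4.2 GB = 21.0 GB; survives up to 5 failures,
             provided no mirrored pair loses both of its drives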
How much cache do you need? Work with your vendor on this one. I have no generic way to calculate the amount of cache needed. Each system is different. Your vendor should be able to tell you what you need based on your applications and your user/transaction-per-second load. Large amounts of cache allow for more I/O to be buffered in the high-speed cache. A system with small amounts of cache installed may spend more time waiting for the cache to flush during peak I/O periods.
After evaluating many RAID solutions, I decided on a Synchronix system from ECCS. With this device, I was able to replace our ten 1-GB and ten 2-GB drives with twenty 4.2-GB drives, leaving room for another ten drives in the future. Currently, a Synchronix can hold thirty 18-GB hard drives. The Synchronix unit has everything I think is important in a RAID system (cache, redundant parts, hot-swappable parts, multiple concurrent RAID levels) and is expandable and easy to configure. It also has two very important features: application transparent failover (ATF) and array-based failover.
Array-based failover is when the RAID unit itself detects errors within a controller and automatically switches to the other controller. The alternatives are no failover and host-based failover. There is no failover if you have only one controller, because there is no alternate controller to switch to. Many vendors will sell you a single-controller configuration, but I would not recommend it (why spend money on redundant drives only to lose access to them when a controller dies?).
Host-based failover is when device drivers running on the host detect a failure of the RAID controller; the host then coordinates the changeover of the controllers. Array-based failover has two advantages over host-based: it is faster (milliseconds versus seconds, or even up to a minute), and it requires no device drivers, making it platform independent. For a more thorough discussion of RAID controller failover strategies, see http://www
ATF provides protection against host-side failures. ATF runs on the UNIX side and sits between the OS and the SCSI adapters. It detects when a SCSI controller fails and automatically re-routes future I/O to another SCSI controller. This functionality is priceless when dealing with 7x24 mission-critical systems. ATF also provides load balancing: it will automatically round-robin I/O between two UNIX SCSI controllers, eliminating all but the worst-case I/O bottlenecks.
The Synchronix system has two controllers, one active and one passive. This was all that was available at the time and was more than adequate for our needs. In fact, when using fast controllers, active/active solutions are only marginally faster than active/passive, due to SCSI bus contention and the overhead involved in keeping the caches in sync. Active/passive configurations are also more reliable in failover situations, because the passive controller does nothing but monitor the active one; when both controllers are active, each must monitor the other in addition to transferring data. Array-based failover is also more complicated to implement, and therefore more costly, with active/active controllers. Most vendors charge a premium for active/active solutions that yield a minimal performance increase.
Each Synchronix controller can hold up to 244 MB of cache. The cache can be configured as write-back (cache reads and writes), write-through (cache reads only), or no cache, and each logical volume defined on the RAID unit can have its cache configured differently. In a two-controller configuration, each controller must have the same amount of cache so that the cache can be mirrored. The cache is mirrored to facilitate a failover: if it were not, a failure of the active controller could mean the loss of up to 244 MB of data. Mirrored cache is a lot like mirrored hard drives; the passive controller contains an exact copy of everything in the cache on the active controller. The Synchronix does not use battery-backed cache. Instead, you can plug the unit into an optional intelligent UPS and communicate with it through serial ports on the UPS and the RAID. When the UPS detects an outage, it sends a signal out on the serial port; when the Synchronix sees the signal, it flushes the pending writes in the cache and converts all write-back cache volumes to write-through. This protects the system from losing any data and is much cheaper than battery-backed cache as you get into larger cache sizes. Based on the vendor's recommendation, I configured my system with 64 MB, which was plenty for my immediate needs and short-term growth expectations.
The new RAID box was configured with three volumes: a 40-GB RAID level 10 volume (20 GB usable), a 32-GB RAID level 10 volume (16 GB usable), and an 8-GB RAID level 1 volume (4 GB usable). From the UNIX side, each volume looks like a single drive; using standard UNIX utilities, I was able to slice these "drives" and build filesystems or use the raw slices directly. I used the 40-GB volume for all of the filesystems that were on the old RAID and moved most of the filesystems from internal drives onto it, too. The 32-GB and 8-GB volumes were used as raw partitions for our database and transaction monitor. With better UNIX tools (the platform I used was a raw SVR4 system with few third-party tools available), I could have come up with a better configuration: perhaps a series of RAID level 10 volumes on the RAID side, used as drives striped together with a UNIX utility, taking advantage of hardware and software striping at once. Still, the configuration I used was more than adequate for my needs. In fact, heavy stress tests showed that the Synchronix wasn't being taxed at all (again, sar was my analysis tool); I couldn't overload the Synchronix with my UNIX machines. We never even performed stress tests on the old RAID configuration, because normal production was enough to swamp the unit.
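As a sketch of what that layered approach could look like on a platform that does have a software volume manager (this example uses Solstice DiskSuite-style commands, which my raw SVR4 box lacked; all device names are hypothetical):

    # stripe across two logical "drives" exported by the RAID unit:
    # one stripe of two slices with a 64-KB interlace
    metainit d10 1 2 c1t0d0s2 c2t0d0s2 -i 64k
    newfs /dev/md/rdsk/d10
    mount /dev/md/dsk/d10 /u01

Each slice here is itself a hardware RAID level 10 volume, so the software stripe simply spreads the load across more controllers and spindles.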
I previously discussed cache on the RAID controllers. Cache is used to mask the inherent slowness of hard drives by buffering I/Os in RAM. I configured my systems to run with write-back cache on the 32-GB and 8-GB volumes (the database raw partitions) and no cache on the 40-GB volume. Write-back cache acknowledges a write to the OS as soon as the data is in the cache; write-through cache and no cache acknowledge a write only after it is written to disk. Thus write-back cache improves performance for both reads and writes, while write-through cache improves performance for reads only. Usually, write-back cache is chosen when write performance outweighs data-integrity concerns. In my case, my UNIX systems have an internal battery, and when they detect a power outage, they start an orderly shutdown. Combined with the Synchronix intelligent UPS connection and the mirrored cache, there is virtually no chance of losing data.
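A crude way to see the write-back difference for yourself, assuming you have a scratch slice whose contents you can safely destroy (the device name below is hypothetical):

    # time 10,000 8-KB writes to a raw slice -- WARNING: this
    # overwrites the slice, so point it at scratch space only
    time dd if=/dev/zero of=/dev/rdsk/c2t1d0s6 bs=8k count=10000

Run it once with write-back cache enabled on the volume and once with it disabled. With write-back, the array acknowledges each write from cache, so the elapsed time should drop sharply.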
The 40-GB user volume was so fast that I didn't need cache turned on for it (cache is configurable per volume). This volume contained all of the filesystems from the old RAID as well as some filesystems that had been on internal drives, and it still outperformed the old RAID box running at RAID level 5. Part of this performance increase is due to the faster hardware, and part is due to RAID level 10 itself. RAID level 10 performance increases as mirrored pairs are added to the volume. The increase is not linear, and there is a cutoff: basically, the more spindles your data is spread across, the better, until you eventually overwhelm the SCSI bus and see performance flatten or drop slightly. Your vendor should be able to help you determine an optimal configuration for your system.
Since I didn't use cache on the 40-GB volume, the entire 64 MB of cache was available for the database volumes, and that is where I saw the biggest performance gains. The DBAs run a database reorganization on weekends that took more than 7 hours with the old RAID; with the new RAID, it took as little as 3 hours. Our nightly batch processing also finishes much faster as a result of this upgrade.
Features to Look For
Look for reliability features. The RAID solution I implemented had numerous features that increase the reliability of the system. A RAID level 10 volume can survive the loss of up to half of its drives (remember, RAID level 10 volumes are made up of striped, mirrored pairs), provided no pair loses both of its members; RAID level 5 allows for the loss of only one drive. One drive could also be configured as a "hot spare," meaning that if a drive in any of the volumes fails, the hot spare automatically becomes part of that volume and is rebuilt to replace the failed drive.
Mirrored cache across redundant controllers protects against data loss due to a controller or cache failure. The use of an intelligent UPS protects against data loss in case of a power failure. ECCS's ATF allows for automatic failover on the UNIX side if one of the SCSI controllers dies. Also included is the standard ability to hot-swap redundant fans, power supplies, hard drives, and RAID controllers.
Also look for an easy-to-use GUI for configuring the array. Dial-up access to the GUI was another feature we liked in the chosen system, because it allowed us to connect to the management interface remotely from either Windows-based PCs or UNIX systems running X Windows. In addition to the GUI, a complete command-line interface and an API for writing scripts are important: they let you script various management functions and drive them from cron.
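For example, a nightly status report might be driven by a crontab entry like the one below. The raidstat command is invented for this illustration; substitute whatever CLI your vendor actually ships:

    # mail a nightly RAID status report to root at 2:00 a.m.
    # ("raidstat" is a hypothetical vendor CLI)
    0 2 * * * /opt/raid/bin/raidstat -a 2>&1 | mailx -s "nightly RAID status" root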
When choosing a RAID system to hold mission critical data, look for at least the basics: redundant parts, hot-swappable parts, cache, ease of configuration, and expandability. For a more reliable solution, look for ATF, RAID level 10, and array-based failover. All these features combined will allow you to build a very fast, reliable system.
When attacking a UNIX performance problem, be patient. If you don't have a history of how the system should act, start collecting it now and watch whether your numbers are getting worse. I ran sar, through cron, for weeks to collect data before analyzing it; looking at a single point in time is rarely useful. Third-party tools let you collect data over time and graph the results, so you can easily see problems that already exist or problems that are coming, and some will even predict how your system will perform under expected growth. Take advantage of these tools if they are available on your platform. I was stuck with sar, which can be very confusing; make use of a good reference book, and don't forget to look to your vendor. You pay a lot of money for support, and it is reasonable to expect them to help analyze the numbers.
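Setting up that kind of continuous collection with the stock System V tools is straightforward. The paths below are the usual SVR4/Solaris locations for the sar collection scripts; verify them on your own system:

    # crontab entries for user sys: sample every 5 minutes, 7x24,
    # and write a daily summary report just before midnight
    0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/lib/sa/sa1
    59 23 * * * /usr/lib/sa/sa2 -A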
About the Author
Jim is a Technical Analyst for Sprint Paranet specializing in UNIX. Jim has also worked at IBM, EDS, and Rite Aid. He can be reached at email@example.com.