Cover V07, I06


Linux Software RAID - High Performance through Low-Cost Alternatives

Derek Vadala

Users and system administrators alike frequently find themselves in need of increased storage capacity as well as better drive I/O performance. Unfortunately, high-performance drive arrays require an enormous up-front investment and are costly to maintain. Replacing drives in these arrays often costs as much as purchasing several high-end drives from resellers. At the same time, administrators realize that single-drive solutions provide neither adequate performance upgrades nor the high storage capacities necessary to handle large data sets effectively. When most of us encounter these situations, we are usually left with only one option: purchase a high-performance server with RAID capability, while somehow justifying the expense to management. There is, however, another alternative.

Since release 2.0, the Linux kernel has supported a stock implementation of two RAID levels. Linear concatenation quickly combines multiple drives into one logical drive, providing increased storage capacity. RAID0, also known as striping, is likewise a stock part of the Linux kernel; it interleaves multiple drives, spreading them across one or more controllers, which yields added storage capacity as well as a tremendous increase in I/O throughput. While neither of these RAID implementations supports data redundancy, an effort to provide this feature to Linux users began with the release of kernel version 2.0.30. This effort has successfully implemented RAID1 (mirroring) as well as RAID4/5. In addition to the kernel patches, a set of tools for managing metadevices has been developed, making the implementation of RAID on any Linux system both cost-effective and simple.

The machine used as an example in this article has three Adaptec 2940 Ultra Wide SCSI controllers. Connected to each controller are two 4.5-GB Seagate Barracuda Wide SCSI drives. Two of these drives operate independently but share a common controller; they hold the root file system and the applications that access the metadevice. The remaining four drives are spread across the other two controllers and are striped together using RAID0. The server is mainly responsible for Usenet news but doubles as a Web server and development machine, and it has performed extraordinarily well. Similar configurations using non-RAID technology on Silicon Graphics and Sun Microsystems machines seriously underperformed it and cost more than three times as much. Comparable RAID solutions from those vendors were not even attempted, as they would have cost more than ten times the eventual Linux solution, which came in under $7000.

Before you begin hacking the kernel and building metadevices, I would like to impart one piece of advice: configure and test your drives for normal operation before attempting anything involving metadevices. Although the RAID implementations for Linux are rather easy to use, they do require a competent understanding of the kernel and of Linux file systems and drive handling. If you can properly set up your system without RAID, you will have a much easier time troubleshooting problems that arise during its implementation. Details such as the SCSI BIOS settings found on most newer SCSI cards, as well as bus scanning order and its effect on Linux drive detection, can quickly foil your attempts. It is generally a good idea to disable, at the lowest possible hardware level, any features you do not need for proper operation.

The Kernel

The first step in creating a Linux RAID is to retrieve the raid145 kernel patch as well as the RAID Tools package, which is used to administer your RAID drives. Although the patch only adds support for RAID levels 1, 4, and 5, you will need its header files to compile the RAID Tools even if you plan to use RAID0 or linear mode. The patch can be obtained from:

ftp://ftp.kernel.org/pub/linux/daemons/raid/beta/ \
raid145-0.36.3-2.0.30.gz

and the tools package is located at:

ftp://ftp.kernel.org/pub/linux/daemons/raid/ \
raidtools-0.41.tar.gz

Once you have obtained and unpacked the packages, move the raid145-0.36.3-2.0.30 patch file into the root directory of your kernel source. This is usually /usr/src/linux, and it is a good idea to comply with that naming convention, because the tools package will look there by default for the include files it needs.
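For example, assuming both files were downloaded to /usr/local/src (the directory is only an illustration; adjust the paths to wherever you saved them), the unpacking steps might look like this:

jaded:/usr/local/src# gunzip raid145-0.36.3-2.0.30.gz
jaded:/usr/local/src# tar xzf raidtools-0.41.tar.gz
jaded:/usr/local/src# mv raid145-0.36.3-2.0.30 /usr/src/linux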

You can apply the raid145 patch to your kernel by entering the /usr/src/linux directory and typing:

jaded:/usr/src/linux# patch -p1 < raid145-0.36.3-2.0.30

When the patch applies successfully, your kernel will contain support for the additional RAID types. Execute a make config to begin the kernel configuration. I use RAID0 (striping) in my examples because I feel it has the broadest appeal. Although RAID0 does not provide the redundancy that some of you will require, it does provide both an increase in storage capacity and an increase in I/O throughput. Thus, it will have the widest practical use.

jaded:/usr/src/linux# make config
Multiple devices driver support (CONFIG_BLK_DEV_MD) [Y/n/?] y
Linear (append) mode (CONFIG_MD_LINEAR) [y/m/N/?] n
RAID-0 (striping) mode (CONFIG_MD_STRIPED) [Y/m/n/?] y
RAID-1 (mirroring) mode (CONFIG_MD_MIRRORING) [N/y/m/?] (NEW) n
RAID-4/RAID-5 mode (CONFIG_MD_RAID5) [N/y/m/?] (NEW) n

If you want a RAID type other than the one I have selected, merely choose it instead of RAID0. Continue the kernel configuration, making sure that you include support for any drives and controller cards that have been added to your system for use in the RAID. After the configuration is complete, simply execute a make zImage, rebuild LILO, and reboot. When the system reboots, the boot message md driver 0.35 MAX_MD_DEV=4, MAX_REAL=8 indicates that you have successfully installed the kernel with metadevice support. Once the system finishes booting, look in /proc/mdstat to confirm that metadevices and the desired RAID types are available.

jaded:~# cat /proc/mdstat
Personalities : [2 raid0]
read_ahead not set
md0 : inactive
md1 : inactive
md2 : inactive
md3 : inactive

Since we have not yet created metadevices or populated them with drives, each available device should show an inactive, empty entry. There are four metadevices available to work with. If you have compiled the kernel with multiple RAID personalities, you will see an entry for each of them in the first line of /proc/mdstat.

Creating a Metadevice

You should not encounter any problems while building the RAID Tools package unless you have patched a kernel that is not located in /usr/src/linux. If your kernel is elsewhere, you will need to let the RAID Tools configuration script know where to find it; the task is trivial and well documented. Once the tools are compiled, you can use the mdcreate command to create your metadevice and add an entry for it to /etc/mdtab.
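Building the tools typically amounts to running the configuration script and compiling, roughly as follows (the source directory shown is only an example, and the exact make targets may vary slightly between raidtools versions):

jaded:/usr/local/src/raidtools-0.41# ./configure
jaded:/usr/local/src/raidtools-0.41# make
jaded:/usr/local/src/raidtools-0.41# make install

With the tools installed, mdcreate can be used as shown below.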

jaded:/etc# mdcreate raid0 /dev/md0 /dev/sda1 /dev/sdc1 \
        /dev/sdb1 /dev/sdd1
jaded:/etc# cat /etc/mdtab
# mdtab entry for /dev/md0
/dev/md0    raid0,8k,0,c12926bf    /dev/sda1 /dev/sdc1 \
        /dev/sdb1 /dev/sdd1

This creates a RAID0 device with 8 K chunks (the default) on /dev/md0, containing the SCSI drives sda1, sdc1, sdb1, and sdd1. You can tune the chunk size using the -cN switch to either mdcreate or mdadd if you are using RAID0 or RAID1. Although 8 K is recommended, you might want to experiment with this option if you are not satisfied with performance. Make sure to spread as many drives across as many controllers as possible; this will provide the maximum performance increase. It is also important to alternate between controllers when listing the drives, because switching between two controllers provides more I/O throughput than cycling sequentially through the drives on one. If you get an error message complaining /dev/md0: No such device, you can easily rectify the problem using the MAKEDEV utility included with Linux. The problem is that Linux does not yet know about metadevices, and you will have to add them to the /dev directory manually.

jaded:/dev# MAKEDEV -v md
create md0      b 9 0 root:disk 660
create md1      b 9 1 root:disk 660
create md2      b 9 2 root:disk 660
create md3      b 9 3 root:disk 660

Order is extremely important. If you change the order of the drives in your metadevice, you will be unable to access its data. The order can be affected by an erroneous /etc/mdtab file or by misconfigured hardware. Take careful note of how your hardware is set up, because the BIOS in most Intel-based computers can sometimes cause drive naming problems for Linux. Adding a SCSI controller card or changing the order of cards on the system board could ultimately mean that your metadevice adds drives in the wrong order, resulting in inaccessible data.

If all went well when you used mdcreate to generate /etc/mdtab, you only need to execute mdadd -a to add the drives to your metadevice. The -a option reads the configuration in /etc/mdtab and uses it as arguments to mdadd. Looking in /proc/mdstat will now show that the metadevice is available but inactive.

jaded:/etc# mdadd -a
jaded:/etc# cat /proc/mdstat
Personalities : [2 raid0]
read_ahead not set
md0 : inactive sda1 sdc1 sdb1 sdd1 17776576 blocks

The mdrun program can be used to start your metadevice once you have initialized it using the mdadd program.

jaded:/# mdrun -a
jaded:/# cat /proc/mdstat
Personalities : [2 raid0]
read_ahead 120 sectors
md0 : active raid0 sda1 sdc1 sdb1 sdd1 17776576 blocks 8k chunks

The -a flag operates in the same capacity as when used with the mdadd program. In fact, if you are short on time, you can use mdadd -ar, which adds drives to your metadevice based on /etc/mdtab and then executes mdrun -a. Once the metadevice is active and contains the desired drives in the correct order, you can use mke2fs to build a file system on it. The metadevice will now behave like any other drive. Once the file system is built, you can mount the metadevice at any mount point as you would any normal drive.
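For instance, building an ext2 file system on the metadevice and creating a mount point might look like this (the /raid mount point is simply the one used in the examples that follow):

jaded:/# mke2fs /dev/md0
jaded:/# mkdir /raid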

jaded:/# mount -t ext2 /dev/md0 /raid

Additionally, you can add an entry for your metadevice to /etc/fstab as you would with any normal drive.

jaded:/# cat /etc/fstab
/dev/md0        /raid        ext2    defaults        1       2

Finally, add entries to the appropriate scripts in /etc/rc.d so that your metadevice is started at boot. I prefer to put mdadd -ar in /etc/rc.d/rc.S just after swapon is executed, which starts the metadevice in time for it to be handled properly by fsck and mount. It is also a good idea to add mdstop -a to /etc/rc.d/rc.0 right after umount is called; this ensures that your metadevice is halted cleanly during reboot and shutdown.
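As a rough sketch, the additions might look like the following fragments (the swapon and umount lines are only placeholders for wherever your distribution makes those calls, and the /sbin paths assume that is where the RAID Tools were installed):

# Fragment of /etc/rc.d/rc.S
/sbin/swapon -a
/sbin/mdadd -ar        # start metadevices before fsck and mount run

# Fragment of /etc/rc.d/rc.0
umount -a
/sbin/mdstop -a        # stop metadevices cleanly at shutdown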

Notes About RAID1 and RAID4/5

Now that you have successfully configured a RAID0 device, you might find yourself in need of a redundant metadevice. Redundancy is supported only by the additional RAID levels provided by the kernel patch. All of the steps described so far are still necessary for implementing a redundant RAID, but one additional step is required first: you must use the mkraid command to initialize a RAID device of the type you desire. To do this, build a configuration file for the RAID type you prefer. The configuration for a RAID1 device might look like:

jaded:/root# cat /etc/raid1.conf
raiddev                 /dev/md0
raid-level              1
nr-raid-disks           2
nr-spare-disks          0
device                  /dev/hda1
raid-disk               0
device                  /dev/hdb1
raid-disk               1

The configuration is fairly straightforward with the exception of nr-raid-disks and nr-spare-disks. The former is the number of disks that will actively be used in the array. The latter is the number of standby disks held in reserve to take over if an active drive fails. Remember that a redundant RAID will not provide additional space; it uses the extra drives to satisfy its redundancy requirement. Thus, it is fine to leave nr-spare-disks set to 0 in most simple configurations, although the paranoid administrator may wish to increase either value if hardware is available. RAID4/5 is implemented in a similar fashion and also requires a configuration file for use with mkraid; examples of these files are provided with the RAID Tools distribution. Once you have built a configuration file and run mkraid, you can follow the same steps used in creating a RAID0 device. The ckraid program is also provided as a mechanism for checking the integrity of these additional RAID implementations.
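For example, initializing and then checking the RAID1 device described above might look something like this (a sketch only; consult the documentation included with your raidtools version for the exact mkraid and ckraid invocations):

jaded:/root# mkraid /etc/raid1.conf
jaded:/root# ckraid /etc/raid1.conf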

Performance

While the benefits of concatenating hard drives for additional space are apparent, the increase in I/O performance gained by implementing a RAID0 system is not always as clear. Remember that you are spreading your I/O bottleneck across more than one drive, and in our test case, across more than one controller. The best way to measure performance of these RAID systems is to merely load up your favorite benchmark tool and get cracking. I prefer using both bonnie and iozone for different but effective comparisons. Try experimenting with different file sizes. Remember that sheer overall speed is not as important as speed where it counts. If your system needs to write hundreds of small files per second but doesn't really need to write exceedingly large files, then writing 100-MB files with bonnie won't really provide you with an accurate picture of how your system will perform. You should also remember that linear mode and RAID1 may actually hinder I/O throughput, because data needs to be written either more than once or across many physical segments.
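As an illustration, once the metadevice is mounted on /raid, a quick bonnie run might look like the following (the flags and the 100-MB size are only examples; adjust them to match your expected workload):

jaded:/# bonnie -d /raid -s 100 -m jaded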

Conclusion

Although a hardware RAID will inevitably provide superior performance and complete data redundancy, the price of these proprietary systems makes that level of reliability almost unattainable for many users and smaller companies. Linux software RAID provides a comparable level of performance at unbeatable prices. Even for larger shops, Linux software RAID is a viable alternative for critical applications and systems, because the lack of complete end-to-end support is outweighed by the fact that one could purchase multiple Linux RAID configurations for less than half the cost of a name-brand solution. Linux RAID has worked increasingly well for us. We are able to distribute and receive Usenet news at remarkable speeds, and the added storage capacity makes it significantly easier to provide long-term caching of news for countless users.

If you want to implement a Linux RAID, I highly recommend investigating the enormous number of resources available on the Internet, without which my first attempt at this would have been excruciatingly painful. http://linas.org/linux/raid.html provides an excellent list of RAID resources for Linux and maintains the Linux RAID HOWTO. A mailing list has also been active for some time; send mail to majordomo@vger.rutgers.edu with "subscribe linux-raid" in the message body.

About the Author

Derek works as a UNIX Systems Administrator at the University of San Francisco, where he manages a gaggle of Sun Workstations, a couple of Linux boxes (with software RAIDs), and one stubborn DEC Ultrix machine. He occasionally does tech support for the extraordinarily nice people in the Math and Computer Science departments who are engaged in an ongoing struggle with their SGI servers. You can email him at derek@usfca.edu.