
Homebrew High Availability: Booting Linux from a RAID-1 Device

Drew Smith

Recently, a colleague told me about a trick his company uses to make Windows NT remote administration easier. His firm provides professional services for many small server rooms around town, and the trick involved mirrored IDE hard disks in removable drive bays -- by mirroring the primary disk, you provide an easier backout path when doing upgrades. When doing service work, he always removes the second hard disk, providing an up-to-the-minute backup in case, for example, the latest Microsoft "Service Pack" does more harm than good.

My first reaction was to say, "Of course, you can do that in Linux!" However, it brought to mind a few questions -- most importantly, just how would you go about doing it? I mulled it over and eventually decided to explore it on my own. I found that, with a few LILO and mkinitrd tricks, it is possible to boot Linux from a software RAID-1 device, and this little hack can even give your Linux Web server a performance boost. In addition to protecting you against a single drive failure, a RAID-1 configuration gives your IDE or SCSI bus a break, providing two different paths from which to read the data. Of course, write operations will be slower because the data must be written to both drives, but in many situations (most commonly Web servers, where read performance is the top priority), slower writes are not a drawback.

This is a project for anyone with a Linux machine. You'll need a solid understanding of hard-disk partitioning and the Linux command line, but it's surprisingly easy and fun to do.

Getting Started

The machine you choose to work with probably shouldn't be a production server. As with any project that involves repartitioning, you run the risk of losing data. The machine should have two identical hard disks; the process will work happily with different-sized disks, but I've decided to stay on the safe side. Disks of any size will do, and they can be either SCSI or IDE. For this example, I used two 20-GB Quantum Fireball drives, installed as the master drive on each of the machine's two IDE buses. The machine itself is a VA Linux 2130 rackmountable server with a single 650-MHz Pentium III, running a stock installation of RedHat 6.2.

The safest and easiest approach is to start with a completely fresh machine, installing the operating system as part of the process, but it doesn't necessarily have to be this way. (As of this writing, RedHat 7.0 has been released, but I don't have a machine around to re-test with at the moment.) If you're starting from scratch, you'll end up with a cleaner system if you install the operating system to the hard disk on the second IDE bus, /dev/hdc. You'll probably want to use the original fdisk rather than DiskDruid or a similar tool; making a program easier to use often means removing functionality, and DiskDruid is a prime example -- it wouldn't let me create /dev/hdc1. Your configuration should look similar to this:

/dev/hdc1 - 20M, mounted on /boot
/dev/hdc2 - <most of the drive>, mounted on /
/dev/hdc3 - 120M, for swap-space
Give /dev/hdc1 a boot flag, as it will be the booting partition when you next reboot. First, however, install the rest of the OS. Normally, security considerations would call for more than just these three partitions: separate partitions for /tmp and /var are always a good idea, preventing an attacker from filling your root partition through a denial-of-service attack on your logfiles, and without quotas, a malicious user can do the same unless /home is a separate partition. This article will only cover a basic example, but you can take it to whatever lengths you feel appropriate, as sketched below.
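
For instance, a more defensive layout along those lines might look like the following. The sizes here are only illustrative, and /dev/hdc4 would be the extended partition holding the logical ones:

/dev/hdc1 - 20M, mounted on /boot
/dev/hdc2 - 4GB, mounted on /
/dev/hdc3 - 120M, for swap-space
/dev/hdc5 - 2GB, mounted on /var
/dev/hdc6 - 1GB, mounted on /tmp
/dev/hdc7 - <the rest>, mounted on /home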

Install the OS, and be certain to create a bootdisk. This step should also be done if you're not installing from scratch; see man mkbootdisk for more information on how to do this step. Because we are installing from scratch, write the boot information to the superblock of /dev/hdc, rather than the master boot record (MBR). This may or may not allow you to boot the system after the install is complete, but we've got a bootdisk and are far from done anyway.
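
If you're working from an existing system instead, building the bootdisk looks something like this (substitute your own kernel version, as reported by uname -r):

[root@tester /etc]# mkbootdisk --device /dev/fd0 2.2.14-5.0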

Finish installing and reboot. If it doesn't come up, use the bootdisk, and log in as root.

Creating the RAID Devices

Now that we've got a running system, it's time to tackle that second drive we've put in (the primary drive in the machine). We'll configure it as a RAID device with two drives involved, declaring /dev/hdc to be a mirror of /dev/hda. Here's the really clever part -- we'll declare /dev/hdc as "failed" until we move the operating system off of it and onto the new RAID device. Then simply add /dev/hdc to the RAID as a replacement for the failed disk, and allow it to rebuild.

Note that Linux handles software RAID with the "md" driver, which stands for "multiple devices". This driver can control storage devices in several different fashions -- RAID-0 through RAID-5, or even combinations of two or more types. Drives are allocated into a storage array, and when the array is of RAID type 1 or higher, the drives work together to provide redundancy. Should one drive fail, the RAID subsystem will mark that drive as "failed" and stop sending requests to it.
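
Before going any further, it's worth confirming that your kernel has the RAID-1 personality available. On a stock RedHat 6.2 kernel, where the RAID personalities are built as modules, something like this should do it, with raid1 appearing on the Personalities line:

[root@tester /etc]# modprobe raid1
[root@tester /etc]# cat /proc/mdstat
Personalities : [raid1]
    <stuff>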

Partition /dev/hda exactly the same as /dev/hdc. As root, fdisk -l /dev/hdc provides a listing of the partitions on that drive, which we'll then match. However, we won't use the same partition types as before; instead, set the main data partitions as type fd, or "Linux raid autodetect". In my case, I set up /dev/hda like this:

DEVICE     BOOT   START   END   BLOCKS     ID  SYSTEM
/dev/hda1  *      1       3     24066      fd  Linux raid autodetect
/dev/hda2         4       2484  19928632+  fd  Linux raid autodetect
/dev/hda3         2485    2498  112455     82  Linux swap
Build these partitions, but don't create filesystems on them yet. We must first declare the RAID device to the system, using a configuration file in /etc/ called a "raidtab". Listing 1 shows my copy of the raidtab file. The format is fairly straightforward, but larger configurations (e.g., separate partitions for /var, /tmp, etc.) can become confusing pretty quickly. There's also a manpage dedicated to this file.
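
If you don't have Listing 1 in front of you, a minimal raidtab matching the partitioning above would look something like this -- note the failed-disk directive, which is what keeps the kernel away from /dev/hdc's partitions until we're ready for them:

raiddev /dev/md0
    raid-level            1
    nr-raid-disks         2
    nr-spare-disks        0
    persistent-superblock 1
    chunk-size            4
    device                /dev/hda2
    raid-disk             0
    device                /dev/hdc2
    failed-disk           1

raiddev /dev/md1
    raid-level            1
    nr-raid-disks         2
    nr-spare-disks        0
    persistent-superblock 1
    chunk-size            4
    device                /dev/hda1
    raid-disk             0
    device                /dev/hdc1
    failed-disk           1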

After you've written a /etc/raidtab file, it's time to create the actual RAID devices, which is accomplished with mkraid, a program from the "raidtools" package. In an uncommon show of caution, this program will not let you create the new RAID devices without adding an -f switch to "force" it: because the partitions involved are of type "Linux raid autodetect", mkraid assumes they may already be part of another RAID device. You do want to force it, however, and the extra warnings are humorous. Go ahead and create the first RAID device:

[root@tester /etc]# mkraid -f /dev/md0
Linux uses the special /proc filesystem to provide interesting statistics about running processes and the kernel, and the md driver is no exception. A special file called /proc/mdstat will show you the current status of any md devices in the system; cat /proc/mdstat will show some information on your newly created RAID device. View that file, then create the second device.

[root@tester /etc]# mkraid -f /dev/md1
Check the /proc/mdstat file again. You should now see both devices, each with one disk marked as failed:

[root@tester /etc]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md1 : active raid1 hda1[0] 24000 blocks [2/1] [U_]
md0 : active raid1 hda2[0] 19928512 blocks [2/1] [U_]
unused devices: <none>
/dev/md1 will hold the boot information, and /dev/md0 will be the root partition. You'll need to create filesystems on the new devices before you can use them:

[root@tester /etc]# mke2fs /dev/md0
    <stuff>
[root@tester /etc]# mke2fs /dev/md1
    <stuff>
With that, two new filesystems are ready to be mounted.

Now What?

You now have a booting system and two working RAID devices, but how do you switch the operating system over to the new devices and make it boot? Actually, only the "making it boot" part is difficult; some tricks with LILO and mkinitrd will help. For now, move the filesystem across with cp. First, mount the RAID device on an arbitrary directory:

[root@tester /etc]# mkdir -p /feh
[root@tester /etc]# mount /dev/md0 /feh
You should preserve all the permissions, datestamps, and so on, so use the -a switch, which makes cp treat the operation as an archive:

[root@tester /etc]# cp -a /bin /feh
[root@tester /etc]# cp -a /dev /feh
[root@tester /etc]# cp -a /etc /feh
[root@tester /etc]# cp -a /home /feh
[root@tester /etc]# cp -a /lib /feh
[root@tester /etc]# cp -a /root /feh
[root@tester /etc]# cp -a /sbin /feh
[root@tester /etc]# cp -a /tmp /feh
[root@tester /etc]# cp -a /usr /feh
[root@tester /etc]# cp -a /var /feh
Notice, however, that from a stock RedHat 6.2 system, I've omitted the /boot, /mnt, /opt, and /proc directories. We'll create those by hand, including the floppy mountpoint that the fstab expects:

[root@tester /etc]# mkdir -p /feh/boot
[root@tester /etc]# mkdir -p /feh/mnt/floppy
[root@tester /etc]# mkdir -p /feh/opt
[root@tester /etc]# mkdir -p /feh/proc
The "lost+found" directory should have already been created by mke2fs. Mount the other RAID device under /feh/boot, and copy all of the boot files into it:

[root@tester /etc]# mount /dev/md1 /feh/boot
[root@tester /etc]# cp -a /boot /feh
Now make the changes needed for the system to mount the new filesystems correctly after reboot: edit the /feh/etc/fstab file to change the / and /boot entries. The current /etc/fstab file looks like this:

/dev/hdc2       /               ext2    defaults        1       1
/dev/hdc1       /boot           ext2    defaults        1       2
/dev/fd0        /mnt/floppy     auto    noauto,owner    0       0
none            /proc           proc    defaults        0       0
none            /dev/pts        devpts  gid=5,mode=620  0       0
/dev/hdc3       swap            swap    defaults        0       0
We'll only be changing it slightly, pointing it to the new devices:

/dev/md0        /               ext2    defaults        1       1              
/dev/md1        /boot           ext2    defaults        1       2              
/dev/fd0        /mnt/floppy     auto    noauto,owner    0       0
none            /proc           proc    defaults        0       0              
none            /dev/pts        devpts  gid=5,mode=620  0       0
/dev/hdc3       swap            swap    defaults        0       0               
Nothing else in this file needs to change. We're almost ready to reboot and try to boot into the new RAID-1 Linux machine for the first time, but we'll definitely need a bootdisk to start. Grab another blank floppy and build one:

[root@tester /etc]# mkbootdisk --mkinitrdargs "--preload raid1" 2.2.14-5.0

Notice the --mkinitrdargs switch and the value after it. The mkinitrd command is an extremely powerful tool for booting machines with special requirements. At its simplest, an initrd is an "initial ramdisk", which contains modules to be loaded before anything else. For example, imagine you're trying to boot a machine with a non-standard SCSI controller: you will have serious problems booting if the kernel itself sits on a drive behind that controller! An initial ramdisk holding the module needed to talk to that controller solves the problem, and using it obviates the need for a boot floppy. In this case, we'll add the module required to drive RAID-1 devices to this initrd, and build a boot floppy accordingly. If your drives are SCSI, you may want to add another preload statement to load the module for your SCSI controller.
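
For example, on a machine with an Adaptec host adapter, the bootdisk command might become the following (aic7xxx here is just an illustration; use whichever module drives your controller):

[root@tester /etc]# mkbootdisk --mkinitrdargs "--preload aic7xxx --preload raid1" 2.2.14-5.0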

Reboot the machine from the new floppy.

First Boot with RAID-1

As the machine begins to boot from the floppy disk, a BOOT: prompt will be displayed for about ten seconds. We want to type an argument here. Don't worry about running out of time; once you start typing, it will wait for you to finish.

BOOT: linux root=/dev/md0
This should bring up your machine with the RAID devices as your boot and root disks! Log in and type df to see something like this:

[root@tester /etc]# df
Filesystem    1k-blocks    Used      Available  Use%   Mounted on
/dev/md0      19615648     415012    18204212   2%     /
/dev/md1      23239        2442      19597      11%    /boot
[root@tester /etc]#
If you don't see output like this, something's wrong. Is it your bootdisk? What are the error messages? Did your system almost come up? How far did the boot get before stopping? Often, the best step here is to return to your former setup (i.e., remove the floppy and reboot), repartition the RAID disk, and try again.

If all is well, the next step is to make LILO boot from these new devices, and to make certain that the machine will boot from either drive.

Making It All Bootable

This section assumes you've got a recent backup of your system. With that out of the way, repartition /dev/hdc and add its partitions to the new RAID devices as replacements for the "failed" disks. Using fdisk, open /dev/hdc and use the "t" command to change the partition type of the first two partitions from 83 (Linux) to fd (Linux raid autodetect). If you're absolutely certain that you can afford to lose whatever data is still on the drive, use the "w" command to write the changes to disk.
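
The session looks something like this (messages abbreviated):

[root@tester /etc]# fdisk /dev/hdc

Command (m for help): t
Partition number (1-4): 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): t
Partition number (1-4): 2
Hex code (type L to list codes): fd
Changed system type of partition 2 to fd (Linux raid autodetect)

Command (m for help): w
The partition table has been altered!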

Adding the new partitions to the RAID device shows just how easy it is to work with RAID under Linux:

[root@tester /etc]# raidhotadd /dev/md0 /dev/hdc2
       <stuff>
[root@tester /etc]# raidhotadd /dev/md1 /dev/hdc1
       <stuff>
Don't worry about the <stuff>, unless there's something extremely alarming in the messages. If something goes horribly pear-shaped, delete all the partitions on /dev/hdc and try again. If that fails, you're hooped -- bring out the backups and start over.
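
Assuming the commands succeeded, the md driver starts rebuilding the mirrors in the background, and you can watch its progress in /proc/mdstat. The exact format varies with kernel version, but expect something along these lines, with [2/1] [U_] becoming [2/2] [UU] on each device as its rebuild completes:

[root@tester /etc]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md1 : active raid1 hdc1[2] hda1[0] 24000 blocks [2/1] [U_] resync=DELAYED
md0 : active raid1 hdc2[2] hda2[0] 19928512 blocks [2/1] [U_] resync=12% finish=18.4min
unused devices: <none>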

At this point, we need a proper system-wide initrd image, containing all the modules needed to boot the system into RAID-1. This is like the bootdisk we made earlier, but it will be written to the hard disk.

[root@tester /]# mkinitrd /boot/initrd-2.2.14-5.0.img \
> --preload raid1 2.2.14-5.0
It's important that you specify your kernel version here. In my case, the stock RedHat 6.2 kernel is 2.2.14-5.0.

As a final step, use LILO to make the system bootable. We're going to be a bit sneaky here and use two slightly different LILO configuration files, one for each drive. This way, either drive can fail and the machine will still boot.

Listing 2 shows a working lilo.conf.hda configuration file. Note that I specified the disk as /dev/md0. Also note that the sectors, heads, and cylinders are included (unlike in a standard LILO configuration). These numbers can be obtained with fdisk -l /dev/hd<x> and are extremely important here.
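
If you're reconstructing the file by hand rather than working from Listing 2, its general shape is something like the sketch below. The geometry numbers are from my drives and the kernel paths from a stock RedHat 6.2 install, so substitute your own:

disk=/dev/md0
    bios=0x80
    sectors=63
    heads=255
    cylinders=2498
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
image=/boot/vmlinuz-2.2.14-5.0
    label=LinuxRAID
    root=/dev/md0
    initrd=/boot/initrd-2.2.14-5.0.img
    read-only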

In the second file (Listing 3), I changed only one parameter -- the boot= flag. After you've written these two files, run LILO with the -C flag to specify which configuration file to use:

[root@tester /etc]# lilo -C /etc/lilo.conf.hda
Added LinuxRAID *
[root@tester /etc]# lilo -C /etc/lilo.conf.hdc
Warning: /dev/hdc is not on the first disk
Added LinuxRAID *
Reboot. If all went well, you're booting into a mirrored, high-availability (well, higher availability) Linux machine!

Conclusion

The benefits of this configuration should be obvious: the procedure is reasonably simple, and the resulting machine is far more resilient than before. The hard disk, often the only moving part in a Linux system (with the exception of cooling devices), is usually the first component to fail. Adding monitoring and paging capabilities is fairly simple (although beyond the scope of this article), and for deployments to remote locations, an hour or so of work could save you the trouble of getting out of bed to fix a downed system.

Acknowledgements

Thanks to Peter Lincoln for putting the idea into my head and to Linas Vepstas for writing the HOWTO that pointed me in the right direction. Also, thanks to my girlfriend Erin for putting up with my near-constant geeking.

Drew Smith lives in a house full of geeks in East Vancouver, has blue hair, and works as the UNIX Network Administrator for a stock trust company. When not geeking, he makes live electronic music for raves. He can be reached via the geek-house Web site, at: http://eastvan.bc.ca.