Fast and Painless System-Disk Recovery on SGI Systems
Dan Garrett
Keeping computer systems running smoothly requires a lot of effort, as most systems administrators know. A good backup plan (that is also implemented) allows us to recover data and most applications when a disk fails. However, the failure of the system disk requires more effort to restore since the operating system (and all the minor changes in /etc) are lost. Re-installing and configuring the OS from CD can be a time-consuming process that we do not always have the time for when the system disk belongs to a critical production computer. Planning both the backup and recovery in advance can ensure a smoother recovery. This article details a backup and recovery strategy that makes the recovery virtually painless, without incurring too much cost in the planning and backup stages.
In our case, we have six magnetic resonance spectrometers used 24x7 for scientific research, each of which is controlled by a separate SGI O2. The system disk contains the Irix 6.5.x OS and specialized software, Drivers, and parameter files for controlling the magnetic resonance spectrometer (which varies slightly from one SGI O2 to another). Over time, the parameter files on each SGI O2 change, as new experiments are implemented requiring regular (at least weekly) backups be performed. Additionally, the backup can not occur while an experiment is in progress, since the backup may interfere with data collection.
Creating the Backup
There are a wide variety of backup products (both free and commercial) that are available. Sys Admin has also published several good articles on backup and recovery, including Backups with Standard UNIX Commands (September/October 1993), Fast Backup and Restore Scripts for DAT Drives (September/October 1993), Simplifying Site-Wide Backups (July 1996), Backups for Large UNIX Installations (November 1997), Backups and Disaster Recovery (November 1998), and Network Backup Configuration: A More Complex Scenario (April 1999).
When deciding which backup tool/product to use, keep in mind the following:
1. It should be simple to use and as automated as possible.
2. Backed-up data must be easy to locate.
3. Backup and recovery of critical systems should be well tested.
Since we were already using Veritas Netbackup software in conjunction with an ATL 7100 DLT tape library (2 DLT 4000 drives and 96 cartridges) to perform automated scheduled backups for our 40 UNIX workstations, it was the obvious choice for manually backing up the 6 SGI O2 workstations. All we needed was a short shell script (see Listing 1) that users could easily invoke to initiate the backup when the magnetic resonance spectrometer was being filled with liquid nitrogen. Liquid nitrogen fills are required for the instrument and occur about once per week. As long as users remember to issue the backup command during nitrogen fills, we are guaranteed weekly backups. The script keeps a log in the file dateLog.txt of who did backups and when they were performed. This file then gets emailed to me and the facility manager of the instruments.
Note that even though we invoke #! /bin/sh at the top, this is not a Bourne shell. SGI decided in Irix 6.4 to switch the default shell from the Bourne shell to the Korn shell. Either the script must be run by root, or it must have the SUID bit set for the owner (with root being the owner). Since the goal is to have any user run this script during liquid nitrogen fills, and it is less of a security risk to have an SUID script than to give the root password to every user, we chose the SUID script option.
There are two steps that are required to have an SUID script on an SGI. The first step is to edit the file /var/sysgen/stune and add:
nosuidshells = 0
It takes two reboots before this becomes effective since the first reboot builds a new kernel, and the second reboot uses the new kernel, which will allow SUID scripts. The second step is to set the SUID bit on the backup script:
cd /usr/bin
chmod u+s do_backup
The script is very straightforward and sets six variables at the top, which are used throughout. The variable list defines the local file that contains a list of the files and directories that are to be included in the backup. The variable dateLog defines the local file, which records relevant information about the backup (see sample email of dateLog file below). Depending on the value of the variable doProgLog, a progress log can be written to $progLog, but this will significantly slow the backup. The Veritas Netbackup command is bpbackup and takes four arguments (without progLog), which are:
-c {name of class}
-s {name of schedule}
-w and -f {file list}
The {name of class} and {name of schedule} arguments are self-evident, and must match a defined class and schedule at the Netbackup server. The -w causes the bpbackup command to wait until the backup completes before returning control to the shell. This allows the log file to show how long the backup took to complete. The backup takes 23 minutes to record 1.8 GB over a 100- Mb/s Ethernet connection to the ATL 7100 DLT tape library. This rate, 4.9 GB/hr, is limited primarily by the speed of writing to the DLT 4000 drives that have a native (i.e., uncompressed) transfer of 5.4 GB/hr. For comparison, a DLT 7000 has a native transfer rate of 18 GB/hr. A sample email (sanitized of computer and user names) of the dateLog file is shown below:
From user-15 Mon Jul 17 12:33 EDT 2000
Date: Mon, 17 Jul 2000 12:33:41 -0400 (EDT)
From: Any User <user-15>
To: dgarrett@nih.gov
Subject: Manual Full on spect-3
Full on spect-3 by user-1 [Tue Jun 27, 2000] 14:38:37 -- 14:39:08
Full on spect-3 by user-21 [Mon Jul 3, 2000] 11:28:32 -- 11:50:33
Full on spect-3 by user-1 [Mon Jul 10, 2000] 12:17:14 -- 12:38:52
Full on spect-3 by user-15 [Mon Jul 17, 2000] 12:10:30 -- 12:33:41
This backup plan provides for a very user-friendly mechanism for users to manually backup the instrument on a regular basis by entering the single command do_backup. Although we use Veritas Netbackup, it should be simple to replace the /usr/openv/netbackup/bin/bpbackup in the script with dump, tar, or any command-line driven backup command, and have a similar backup. When using standalone tape drives, you should remind the user from the script (add this just after testing for $doBackup = "y") to check that the proper tape has been inserted in the drive as follows:
doCheckTape="n"
echo -n "Has the Backup tape been inserted for $hn (y/n)? "
read doCheckTape
if [[ $doCheckTape != "y" ]]; then
exit 1
fi
For standalone tape drives, you also need to manually track when the tapes were created and where they are stored. This is not difficult, but failing to do so will cost you time when you need the tapes. One of the benefits of using commercial hardware/software is that all the tape handling and tracking is done automatically, so that neither the user nor the administrator has to worry about the tapes.
Creating the Boot Server
Booting from the network is no more difficult than booting from a local disk drive. The difficult part is setting up a boot server that has all the required files for the architecture being booting. SGI has a licensed product called RoboInst that makes the process of creating a boot server and using it for systems administration a little easier. The RoboInst client package is installed by default on SGI computers, and a license is only required to activate the server part (i.e., boot server). The RoboInst package has the ability to perform initial software installs, software upgrades, restores from tape, and patch installs on several clients simultaneously, hence the name RoboInst. This article will focus primarily on system recovery using restore from tapes. Once configured, RoboInst is an ideal way to manage several SGIs and keep them up to date with respect to OS and patches.
The six steps below will configure an SGI computer to be the RoboInst server and, as such, it will be the Boot Server, Configure Server, Log Server, and Software Distribution Server. RoboInst has the flexibility to allow separate computers to perform each function, but it is much easier for all the functions to be performed by a single computer whenever possible. There are two obvious situations when one computer can not perform all the functions. First, the Boot Server and the client SGI being booted must not be separated by a router unless BOOTP forwarding is enabled at the router, which normally blocks BOOTP. Unless you have complete control over the router and its configuration, you must have the Boot Server and client on the same side of the router. The second situation is when the software being distributed (not applicable to system recovery described in this article) will not fit on the available disk space of the Boot Server. In this case, another SGI machine with enough disk space for the software can be used as the Software Distribution Server. Creating the Boot Server described below is the most difficult part in the entire process, but is well worth the effort since the same server that is used to restore a system disk can later be configured as a Software Distribution Server.
The six steps that should be done as root on the RoboInst server to configure an SGI to be the Boot Server are:
1. Verify/install RoboInst server package.
2. Create a dummy user to run tftp daemon.
3. Edit /etc/inetd.conf, /etc/ethers, and ~bootme/.rhosts files.
4. Create RoboInst boot directory and install miniroot boot files.
5. Create RoboInst restore directory and install restore files.
6. Run the configuration server program RoboInst_config.
These six steps are described below, assuming that the client machine that is being backed up and restored has the following attributes:
Model: SGI O2 (R5K)
Name: spect-3
IP: 10.7.7.7
eaddr: 08:00:69:77:77:77 (obtained by nvram | grep eaddr)
Netmask: 0xfffffe00
First check to see if RoboInst software is installed via the command:
versions -a RoboInst
The expected output should look like the following:
I = Installed, R = Removed
Name Date Description
I RoboInst 12/15/1999 RoboInst Tools for \
Automatic Installations 1.2
I RoboInst.man 12/15/1999 RoboInst Man Pages 1.2
I RoboInst.man.man 12/15/1999 RoboInst Man Pages 1.2
I RoboInst.man.relnotes 12/15/1999 RoboInst Release Notes 1.2
I RoboInst.sw 12/15/1999 RoboInst Tools for \
Automatic Installations 1.2
I RoboInst.sw.client 12/15/1999 RoboInst Client Tools for \
Automatic Installations 1.2
I RoboInst.sw.examples 12/15/1999 RoboInst Examples 1.2
I RoboInst.sw.server 12/15/1999 RoboInst Server Tools for
Automatic Installations 1.2
Note if there is not an I at the beginning of the line with RoboInst.sw.server, then the server package has not been installed. The RoboInst package can be found on the most recent Overlay CDROM from SGI and should be installed on the the RoboInst server.
Next, create a dummy user on the RoboInst server for tftp access over the network. Files from the RoboInst server are requested by the client and transferred to the client using tftp (trivial file transfer protocol). In the default configuration, the guest owns the tftp daemon, but by creating an arbitrary user with a locked account (i.e., no one can log into it), it will be more difficult for a hacker to break into your system via tftp. Be creative for the user name, but do not use the example name shown here. Assume a user called bootme has been created for the purpose of remotely calling tftp from the clients.
Edit the files /etc/inetd.conf, /etc/ethers, and ~bootme/.rhosts. In the file /etc/inetd.conf, modify the line beginning with tftp so that it looks like the following:
tftp dgram udp wait bootme /usr/etc/tftpd \
tftpd -s /usr/local/boot
Confirm that the bootp service is available (i.e., not commented out) and looks like:
bootp dgram udp wait root \
/usr/etc/dhcp_bootp dhcp_bootp -o \
/etc/config/dhcp_bootp.options
Instruct inetd to reread its configuration file:
/etc/killall -HUP inetd
When tftp gets started as a result of a client request, it is the user bootme that will be running the tftp daemon, not the user guest. It would be very unwise from a security standpoint to have root run tftp.
In the file, /etc/ethers, add in one line for each client as follows:
08:00:69:77:77:77 spect-3.nih.gov \
# Spectrometer #3 at NIH
The machine name can be either the name of the client machine or its IP address. If the name is used, make sure that the RoboInst server can translate the name to an IP address using either local files (i.e., /etc/hosts), NIS or DNS. If the RoboInst server can not translate the name to an IP address, it will be unable to assign the client machine with an IP address and the client will not be able to boot over the network.
Create and edit the file ~bootme/.rhosts and add a line for each client as follows:
spect-3.nih.gov root
Note that this allows root on spect-3.nih.gov to remotely start shells on the RoboInst server as the user bootme, without any authentication. This is why I suggest creating a dummy account that no one except you will know. The owner of this file should be bootme, and the permissions as shown below:
-rw------- 1 bootme guest 1118 Jul 12 13:11 \
/usr/people/bootme/.rhosts
Create RoboInst boot directory and install miniroot boot files. Insert a current Overlay CDROM that contains installation tools into the CDROM drive. If the mediad is running, then the CDROM should automatically mount at /CDROM. Otherwise, you should manually mount it at /CDROM:
mkdir -p /usr/local/boot/miniroot
cd /usr/local/boot
cp /CDROM/dist/sa sa
cd /usr/local/boot/miniroot
cp -p /CDROM/dist/miniroot/*
The file sa is the standalone shell referred to as SASH in the SGI documentation. The sa file has support for booting both the 32-bit ARCS and 64-bit SGI hardware. The files in /usr/local/boot/miniroot are the miniroot kernel files for each SGI architecture.
Create RoboInst restore directories and install restore files:
mkdir -p /usr/local/boot/RoboInst/ \
restore_spectrometer/StandardFiles
There are 11 files that need to be in the StandardFiles directory that can easily be obtained from the client SGI. These 11 files are not part of the normal RoboInst environment when a client is booted over the network, but are required in order to perform the restore from tape, and can be easily obtained from your client SGI, which has already been configured with networking and has backup software installed:
cd / usr/local/boot/RoboInst/ \
restore_spectrometer/StandardFiles
Get the standard SGI Irix files that are missing during RoboInst boot:
rcp -p spect-3:/usr/etc/inetd inetd
rcp -p spect-3:/usr/lib/libmdbm.so libmdbm.so
rcp -p spect-3:/usr/bin/tar tar
rcp -p spect-3:/usr/bsd/zcat zcat
rcp -p spect-3:/usr/bsd/more more
rcp -p spect-3:/etc/inetd.conf inetd.conf
rcp -p spect-3:/etc/services services
Create 32-bit (sashARCS) and 64-bit (sash64) standalone shells from /usr/local/boot/sa file, which are needed to apply the boot block to the system disk as the final three lines in the mrconfig file:
/etc/mkboottape -f /usr/local/boot/sa -x sashARCS
/etc/mkboottape -f /usr/local/boot/sa -x sash64
Get Veritas Netbackup client software in the form of a compressed tar image. If you are using dump or some other package, you need to get those files instead of netbackup. If you use /sbin/dump for your backups, you will also need to get /usr/libdisk.so, which is a dynamic shared library that /sbin/dump requires. Use the command elfdump -Dl /sbin/dump to determine what DSO (Dynamic Shared Object) libraries a given executable, such as /sbin/dump, requires:
rsh spect-3 'cd /usr/openv ; \
tar cf - netbackup | compress \
-c > /tmp/netbackup.tar.Z'
rcp spect-3:/tmp/netbackup.tar.Z netbackup.tar.Z
rsh spect-3 rm /tmp/netbackup.tar.Z
rcp -p spect-3:/usr/openv/netbackup/ \
bp.conf bp.conf.sed
Edit the file bp.conf.sed so that the line starting with CLIENT_NAME is this:
CLIENT_NAME = HOSTNAME
Do not replace HOSTNAME with the hostname of your client. The purpose of the file bp.conf.sed is to be used as input to a sed script that will replace the text string HOSTNAME with the actual host name of the client. This allows the same set of files to perform restores for any hostname:
cd /usr/local/boot/RoboInst/restore_spectrometer
The files hosts, mrconfig, and mypreinst, are three text files that need to be manually added into the restore_spectrometer directory. The hosts file needs to contain the hostname and IP for any machine that will be accessed during miniroot. The two essential entries are the RoboInst server and the backup server, as shown in the example hosts file:
10.7.7.77 tape-server.nih.gov tape-server
10.7.7.177 RoboInst-server.nih.gov RoboInst-server
The mrconfig file (see Listing 2) is the master miniroot configuration file and contains the instructions for the miniroot session. The mrconfig file is not a shell script, but can call a shell script such as mypreinst.
The mrconfig file can be divided into five sections. In the first section, the loghost is defined by its IP address. In the second section, the system disk is initialized, thereby destroying any data that may be on it. The hosts file containing the RoboInst server and tape server are appended to the default miniroot hosts file, which contains only the local computer. The fourth section calls the shell script mypreinst described below, which sets the system up for networking and performs the restore. The command #preinst sh is commented out, but when uncommented, it allows you to get a UNIX shell at that point in the recovery for debugging. The fifth and final section uses the dvhtool to install the boot block on the system disk.
The mypreinst file (see Listing 3), as mentioned above, does all the work in setting up the standalone system so that it can communicate over the network and restore the system disk from tape using Netbackup. The mypreinst script can be divided into three sections. The first section installs the backup software from the compressed tar image created above. The second section installs standard services and network tools normally not available during miniroot. The third and final section restores the system disk from tape. If you plan on using dump or tar to a local tape drive, then section 1 can be omitted and section 2 will install only those commands that you need, such as dump and tar. The inetd, inetd.conf, and services will not be needed if you are restoring from a local tape.
Run the configuration server program roboinst_config:
cd /usr/local/boot/RoboInst/restore_spectrometer
roboinst_config -y -c
For additional information on setting up RoboInst, see the man pages for RoboInst and RoboInst_config. The SGI online documentation tool insight also has useful advice for setting up RoboInst. After starting /usr/sbin/insight, search for RoboInst and look for the Section Automating Installations with RoboInst. And finally, if the RobInst server package is installed, then there are examples located at /usr/share/src/RoboInst.
Restoring from Backup
After replacing the defective system disk with a new disk, power up the SGI to the PROM Monitor (i.e., click on Stop for Maintenance, and then click on Enter Command Monitor). There are four environment variables that need to be set in the PROM before initiating the restore:
setenv -p netmask 0xfffffe00
setenv -p hostname spect-3
setenv -p netaddr 10.7.7.7
setenv -p mrconfig bootme@boot-server: \
/usr/local/boot/RoboInst/restore_spectrometer
The -p option only works for newer SGIs, such as O2s and Octanes, and requests that the environment variable be persistent so that it survives a reboot.
The following command at the PROM Monitor will instruct the SGI O2 to boot over the network and automatically restore from tape using the instructions in the mrconfig file from the directory defined by the mrconfig environment variable:
bootp()boot-server:/usr/local/boot/sa(sashARCS) \
mrmode=custom disksetup=true
For an Octane, use the following command:
bootp()boot-server:/usr/local/boot/sa(sash64) \
mrmode=custom disksetup=true
It takes approximately 25 minutes from the time the bootp() command is issued until the system automatically reboots and is ready to use.
This backup/recovery plan has been successfully tested several times on our site prior to implementing on the production computers. However, the first test on a production computer failed to boot after what appeared to be a successful restore to a new drive. The original mrconfig file did not include the dvhtool command, and thus did not add a boot block to the disk.
All of the successful tests involved disks that already had a boot block and therefore did not show a problem. Although we have not had a system disk failure on the production computers since this backup/recovery plan has been in use, I feel very confident that I will be able to painlessly restore the computer to the state of its last backup.
About the Author
Dan Garrett works as a research scientist, software developer, and systems administrator for the National Institute of Diabetes, Digestive and Kidney Diseases at the National Institutes of Health. He enjoys biking to work, playing with his active 2-year-old son, Joel, and reading books with his wife Gloria on quiet evenings (after Joel goes to sleep). He can be contacted at: dgarrett@nih.gov.
|