apr2002.tar

Multi-Platform Backups

Robin Wakefield

When I started my previous administration position, I realized that the company's backup strategy needed a major overhaul. The main application servers were being successfully backed up with Omniback, but many users had their own powerful design workstations that also required nightly backups. A number of these workstations had DAT drives attached, and the machine that hosted each drive ran a nightly script to back up itself and a number of other servers. Whenever a new workstation was added to the network, someone had to decide which machine would back it up and change that machine's local script accordingly. There was very little logging, so a few days might pass before anyone discovered that a drive or tape was failing, and data could be lost until the hardware or tape problem was corrected.

The Solution

To ensure the backups could be managed more efficiently, I wrote a suite of scripts together with a single configuration file, all maintained centrally on a master server. Figure 1 illustrates a schematic of the system, showing a subset of the servers involved. Each server that has a tape drive attached is configured to run the Run_Dump script (Listing 1) from cron. This script is source controlled on the master server and distributed to all tape-host servers when changes are made. (All Sys Admin magazine listings are available for download at: http://www.sysadminmag.com/code/.)

The Run_Dump script is driven from a configuration file called Local.Table. This file contains the list of servers and filesystems to back up, and the day of the week on which to attempt a full backup of each filesystem. The script provides the following features:

Tape label generation -- Each tape is labeled, based on a four-week cycle. This is stored as a header file on the tape.
Usage count -- Overused tapes can be flagged/discarded.
Wrong tape inserted -- If the wrong tape is inserted (i.e., the tape belongs to another tape-host), this can be flagged.
Retry of full backup -- If the full backup fails on the day that it is attempted, it is retried the next day.
Once-a-month archive -- The script marks the backup as an "archive" backup every four weeks.
Logging -- Logs are maintained allowing for quick retrieval of any data from a specific day.
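
The label and retry features can be illustrated with a short sketch. This is not the Run_Dump code from Listing 1; it is a minimal Python illustration that assumes labels of the form host-DayN (as seen in the sample summary later in this article, with "+" replacing "-" for archive tapes). How the week within the four-week cycle is chosen is an assumption here (taken from the ISO week number), and the names cycle_week, tape_label, and dump_level are hypothetical.

```python
import datetime

CYCLE_WEEKS = 4  # the four-week tape cycle described in the article


def cycle_week(day: datetime.date) -> int:
    """Week within the cycle, 1..4. Assumption: derived from the ISO week number."""
    return (day.isocalendar()[1] - 1) % CYCLE_WEEKS + 1


def tape_label(tape_host: str, day: datetime.date, archive: bool = False) -> str:
    """Build a label such as 'caeda1-Tue2'; archive tapes use '+' as the separator."""
    separator = "+" if archive else "-"
    # %a gives the abbreviated weekday name (Mon, Tue, ...) in the default C locale.
    return f"{tape_host}{separator}{day.strftime('%a')}{cycle_week(day)}"


def dump_level(day: datetime.date, full_day: str, full_pending: bool) -> int:
    """Return 0 (full dump) on the scheduled day, or when an earlier full dump
    failed and is still pending; otherwise return 1 (incremental)."""
    if day.strftime("%a") == full_day or full_pending:
        return 0
    return 1
```

Calling dump_level with full_pending set to True on the day after a failed full backup gives level 0 again, which is the retry behavior listed above.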

The Local.Table file is generated from a Master.Table that is centrally maintained on the master server. Whenever the backup requirements change (e.g., a new workstation is added, or a new filesystem is created), this Master.Table is updated. A script, push_table, is then used to distribute the Local.Table file to the appropriate server that will perform the backup (see Listing 2).

The master server contains the Master.Table file:

cabb19   /dev/rmt/3hcn   cabb19   /              Mon
.        .               .        /local_user2   .
.        .               caic40   /              .
.        .               .        /data          .
.        .               casw19   /              Tue
.        .               casw76   /              .
.        .               cabb25   /              Wed
.        .               caeda13  /              .
.        .               .        /var           .
.        .               .        /usr           .
.        .               .        /tmp           .
.        .               .        /opt           .
.        .               .        /home          .
.        .               .        /local_user    .
.        .               capcb2   /              Thu
.        .               .        /local_user2   .
cabb20   /dev/rmt/3hcn   cabb20   /              Mon
.        .               .        /local_user2   .
.        .               caic23   /              Tue
cabb24   /dev/rmt/3hcn   cabb24   /              Mon
.        .               .        /local_user    .
.        .               cabb14   /              Tue
cadoc1   /dev/rmt/3hcn   cabb2    /              Wed
.        .               camek10  /              .
.        .               .        /local_user    .
.        .               casw42   /              Thu
.        .               .        /local_user2   .
.        .               carf42   /              .
cadoc6   /dev/rmt/3hcn   cadoc6   /              Mon
.        .               .        /local_user    .
.        .               cabb35   /              .
.        .               cadoc3   /              Tue

(A "." stands for a field repeated from the line above, to aid readability.) Each tab-separated field is described below:

1. Tape host server

2. Device name of drive on this host server

3. Server to back up

4. Filesystem on this server

5. Day of the week to attempt a full backup
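
To make the "." convention concrete, the following Python sketch expands a Master.Table into complete rows and groups them by tape host, which is roughly the information each Local.Table needs. It is an illustration only, not the push_table code from Listing 2; the function names and the shape of the per-host result are assumptions, since the Local.Table layout is not shown here.

```python
from collections import defaultdict

# A few rows copied from the Master.Table above.
MASTER_TABLE = """\
cabb19   /dev/rmt/3hcn   cabb19   /              Mon
.        .               .        /local_user2   .
.        .               caic40   /              .
.        .               casw19   /              Tue
cabb20   /dev/rmt/3hcn   cabb20   /              Mon
.        .               caic23   /              Tue
"""


def expand_master_table(text):
    """Yield complete 5-field rows, replacing each '.' with the value above it."""
    previous = [None] * 5
    for line in text.splitlines():
        fields = line.split()  # accepts tabs or spaces between fields
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        row = [old if new == "." else new for new, old in zip(fields, previous)]
        if None in row:
            raise ValueError(f"'.' used before any value was given: {line!r}")
        previous = row
        yield tuple(row)


def local_tables(text):
    """Group expanded rows by tape host (field 1), one list of rows per host."""
    tables = defaultdict(list)
    for tape_host, device, server, filesystem, full_day in expand_master_table(text):
        tables[tape_host].append((device, server, filesystem, full_day))
    return dict(tables)
```

With the sample rows, local_tables(MASTER_TABLE) has two keys, cabb19 and cabb20, and every "." has been replaced by the value inherited from the row above.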

Listing 2 shows the distribution script, push_table. This script distributes not only the list of servers and filesystems, but also the main backup directory structure and executables, should they not exist on the target system. The switch settings for this script are as follows:

-h host -- Only distribute to this host.

-co -- Only check that the backup directories exist in the target host(s).

-nc -- Don't check that the backup directories exist in the target host(s).

-nf -- Don't copy the executables across.

-f file -- Only copy this file across.

The default is to build a new backup directory structure if it doesn't exist, copy the executables across, and build the Local.Table file for each tape host.
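
The way these switches combine can be sketched in Python with argparse. This is not the push_table script itself: the step names are hypothetical, and the handling of switch combinations (for example, that -f skips the executables and the table build, or that -co and -nc are mutually exclusive) is an assumption made for the sake of the example.

```python
import argparse


def build_parser():
    """Parser for the push_table switches listed above."""
    # add_help=False frees -h, which here means "host" rather than "help".
    parser = argparse.ArgumentParser(prog="push_table", add_help=False)
    parser.add_argument("-h", dest="host", metavar="host")
    check = parser.add_mutually_exclusive_group()
    check.add_argument("-co", dest="check_only", action="store_true")
    check.add_argument("-nc", dest="no_check", action="store_true")
    parser.add_argument("-nf", dest="no_files", action="store_true")
    parser.add_argument("-f", dest="only_file", metavar="file")
    return parser


def plan(args):
    """Translate parsed switches into an ordered list of (hypothetical) steps."""
    if args.check_only:
        return ["check_dirs"]
    steps = []
    if not args.no_check:
        steps.append("build_dirs_if_missing")
    if args.only_file:
        # Assumption: -f copies just this one file and nothing else.
        steps.append(f"copy:{args.only_file}")
        return steps
    if not args.no_files:
        steps.append("copy_executables")
    steps.append("build_local_table")
    return steps
```

With no switches, plan returns the three default steps described above; adding -nf drops the copy of the executables but still rebuilds the Local.Table file.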

As previously noted, the main backup script, Run_Dump, is put into cron on all the tape-host servers. Also note that you will need to enable access across the network via .rhosts for interaction between the various servers. You must decide whether the somewhat limited .rhosts security structure is appropriate for your own environment. See the "About Run_Dump" file included with the listings for explanation of the script.

Note that the get_mtstatus program (Listing 3) is a small piece of C code to determine the status of a tape drive.

If any failures have occurred, the remove_old_history script (Listing 4) retains the detail log files; otherwise, it removes them. The summary file for a typical backup may look like this:

                             Tape     Duration
  DATE    TIME  MB dumped L  Label    (mins) STATUS  HOST   File System
========= ===== ========= = ========== ============ ======== ===========
17Dec2000 21:01    9.347 1 caeda1-Tue2   1.3 OK     caeda1   /
17Dec2000 21:05   36.493 1 caeda1-Tue2   3.7 OK     caeda1   /local_user5
17Dec2000 21:07    0.528 1 caeda1-Tue2   1.8 OK     caeda1   /local_user2
17Dec2000 21:08    2.361 1 caeda1-Tue2   1.6 OK     caeda1   /local_user3
17Dec2000 21:10    9.833 1 caeda1-Tue2   1.8 OK     caeda1   /local_user
17Dec2000 21:11    0.264 1 caeda1-Tue2   1.2 OK     caeda1   /local_user4
17Dec2000 21:13    0.000 0 caeda1-Tue2   0.0 FAILED cabb29   /
17Dec2000 22:59  456.586 0 caeda1-Tue2 105.8 OK     cabb7    /local_user
18Dec2000 00:45  393.630 0 caeda1-Tue2 105.8 OK     camek18  /
18Dec2000 00:46    0.074 1 caeda1-Tue2   1.0 OK     casw80   /local_user2
18Dec2000 01:04   55.136 1 caeda1-Tue2  17.5 OK     casw69   /
18Dec2000 01:06    3.486 1 caeda1-Tue2   2.3 OK     casw80   /
18Dec2000 01:09    6.576 1 caeda1-Tue2   2.8 OK     camek1   /
18Dec2000 01:11    0.468 1 caeda1-Tue2   1.4 OK     cadoc9   /local_user2
18Dec2000 01:13    6.598 1 caeda1-Tue2   2.7 OK     carf34   /
18Dec2000 01:33   53.327 1 caeda1-Tue2  19.3 OK     cadsp1   /
18Dec2000 01:34    0.943 1 caeda1-Tue2   1.2 OK     cadoc13  /
18Dec2000 01:37    2.874 1 caeda1-Tue2   2.4 OK     casw21   /local_user
18Dec2000 01:39    2.695 1 caeda1-Tue2   2.5 OK     camek10  /
18Dec2000 01:40    1.304 1 caeda1-Tue2   1.2 OK     caeda12  /local_user2
18Dec2000 01:45   14.970 1 caeda1-Tue2   4.8 OK     carf18   /local_user
18Dec2000 01:50   12.709 1 caeda1-Tue2   4.5 OK     casw83   /
18Dec2000 01:51    0.046 1 caeda1-Tue2   0.7 OK     catec01  /tmp
18Dec2000 02:09    1.316 1 caeda1-Tue2  16.9 OK     casw58   /
18Dec2000 02:21   61.359 1 caeda1-Tue2  10.8 OK     camek25  /local_user

Total Backed up = 1132.9 Mbytes

An archive summary file would display a "+" as the separation character between the tape-host and the daily-cycle (e.g., caeda1+Tue2).
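
Because the summary is plain whitespace-separated text, it is easy to post-process. The Python sketch below parses data rows in the format shown above, totals the megabytes dumped, and reports failures, which is the condition under which the detail logs are kept. It is an illustration rather than part of the listings; DumpRecord and the function names are hypothetical.

```python
from typing import NamedTuple


class DumpRecord(NamedTuple):
    date: str
    time: str
    mbytes: float
    level: int
    label: str
    minutes: float
    status: str
    host: str
    filesystem: str


def parse_summary(text):
    """Return a DumpRecord for each data row; headers and totals are skipped."""
    records = []
    for line in text.splitlines():
        f = line.split()
        # Data rows have exactly nine fields and start with a date like 17Dec2000.
        if len(f) != 9 or not f[0][:2].isdigit():
            continue
        records.append(DumpRecord(f[0], f[1], float(f[2]), int(f[3]), f[4],
                                  float(f[5]), f[6], f[7], f[8]))
    return records


def total_mbytes(records):
    """Sum of the 'MB dumped' column."""
    return sum(r.mbytes for r in records)


def failures(records):
    """Rows whose STATUS is anything other than OK."""
    return [r for r in records if r.status != "OK"]


def is_archive(record):
    """Archive tapes carry '+' between tape host and daily cycle in the label."""
    return "+" in record.label


SAMPLE = """\
  DATE    TIME  MB dumped L  Label    (mins) STATUS  HOST   File System
17Dec2000 21:11    0.264 1 caeda1-Tue2   1.2 OK     caeda1   /local_user4
17Dec2000 21:13    0.000 0 caeda1-Tue2   0.0 FAILED cabb29   /
17Dec2000 22:59  456.586 0 caeda1-Tue2 105.8 OK     cabb7    /local_user
"""
```

For the three sample rows, failures(parse_summary(SAMPLE)) contains only the cabb29 entry, so a cleanup step built on this would keep that night's detail logs.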

Restores

To determine which tape a particular filesystem is on for restore purposes, you can simply specify what you are looking for and let UNIX do the rest. The get_tape script (Listing 5) can be used. If, for example, you want to look for all dumps of camek25:/local_user for the past year, type:

./get_tape "2000.*camek25.*/local_user"

It will return:

  17: 17Oct2000 22:16  122.199 1 caeda1-Thu1  27.5 OK  camek25  /local_user
  21: 07Nov2000 01:15  130.942 1 caeda1-Wed4  36.8 OK  camek25  /local_user
   6: 18Nov2000 23:53 1510.740 0 caeda1+Mon2 160.3 OK  camek25  /local_user
  24: 18Dec2000 02:21   61.359 1 caeda1-Tue2  10.8 OK  camek25  /local_user

The sort command in the script orders the output by date. The initial number in the output indicates the dump file number on tape to specify with the s argument of the restore command. This may be incorrect in certain circumstances due to failed backups, but the THIS_IS_... flag file previously indicated will help identify your relative position on the tape.
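
The search-and-sort idea can be sketched in Python as follows. This is not the get_tape script from Listing 5. It assumes each summary is available as a list of data rows, and it takes the position of a row within its summary as the file number, which is a simplification: as noted above, failed backups can make that number wrong. The name find_dumps is hypothetical.

```python
import re
from datetime import datetime


def dump_time(line):
    """Parse the leading 'DDMonYYYY HH:MM' fields of a summary row."""
    date_field, time_field = line.split()[:2]
    return datetime.strptime(f"{date_field} {time_field}", "%d%b%Y %H:%M")


def find_dumps(pattern, summaries):
    """Return (position, line) pairs matching the regex, oldest dump first.

    summaries is a list of summaries, each a list of data rows. position is
    the 1-based row number within its own summary (assumed to approximate
    the dump file number on that tape).
    """
    regex = re.compile(pattern)
    hits = []
    for rows in summaries:
        for position, line in enumerate(rows, start=1):
            if regex.search(line):
                hits.append((position, line))
    # Sort on the parsed date: a plain text sort would put 07Nov before 17Oct.
    return sorted(hits, key=lambda hit: dump_time(hit[1]))


SUMMARIES = [
    ["18Dec2000 01:51    0.046 1 caeda1-Tue2   0.7 OK     catec01  /tmp",
     "18Dec2000 02:21   61.359 1 caeda1-Tue2  10.8 OK     camek25  /local_user"],
    ["07Nov2000 01:15  130.942 1 caeda1-Wed4  36.8 OK  camek25  /local_user"],
    ["17Oct2000 22:16  122.199 1 caeda1-Thu1  27.5 OK  camek25  /local_user"],
]
```

Searching SUMMARIES with the same pattern as the get_tape example returns the October, November, and December dumps in that order, each paired with its row position.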

Conclusion

You may notice from the variety of methods employed within the various scripts that this solution has evolved over the years. Nevertheless, I have successfully ported it to a number of the companies for which I've worked. At one site, defective tape drives were such a problem (with a "Call To Fix" of three days) that I put a wrapper around the main script to randomly decide which tape drive to use to back up all the systems. Although this might sound very messy, it actually worked well and meant that if one system were not backed up one night, it would probably be picked up the next night using a different drive.

I hope that this article has shown how, with a little work up front, multi-platform backups can become fully automated. We all hope that backup tapes rarely need to be called upon, but there is nothing more satisfying than a user phoning up, asking for a file to be restored, and having that file quickly back in place.

Robin Wakefield studied Mechanical Engineering at City University, London. His first job was as a Stress Officer for Thorn EMI. He then moved into CAD/CAE Systems Administration before becoming Development Systems Manager at Cray Communications and UNIX Team Leader at Nokia Mobile Phones. Robin currently works for Perot Systems, where he is in the Messaging Engineering team at UBS Warburg, and he is a regular contributor to HP's ITRC Forum. He can be reached via email at: robin.wakefield@ubsw.com or eranuwak@aol.com.