Multi-Platform
Backups
Robin Wakefield
When I started my previous administration position, I realized
that the company's backup strategy needed a major overhaul.
The main application servers were being successfully backed up with
Omniback, but many users had their own powerful design workstations
that also required nightly backups. A number of these workstations
had DAT drives attached, and the machine that hosted each drive
ran a nightly script to back up itself and a number of other servers.
As new workstations were added to the network, it was necessary
to determine which machine would perform the backup of the new server,
and the local script changed accordingly. There was very little
logging, so a few days might pass before discovering that a drive
or tape was failing; thus, there was the possibility of losing data
until the hardware or tape problem was corrected.
The Solution
To ensure the backups could be managed more efficiently, I wrote
a suite of scripts together with a single configuration file, all
maintained centrally on a master server. Figure 1 illustrates a
schematic of the system, showing a subset of the servers involved.
Each server that has a tape drive attached is configured to run
the Run_Dump script (Listing 1) from cron. This script is
source controlled on the master server and distributed to all tape-host
servers when changes are made. (All Sys Admin magazine listings
are available for download at: http://www.sysadminmag.com/code/.)
The Run_Dump script is driven from a configuration file
called Local.Tape. This file contains the list of servers
and filesystems to back up, and what day of the week to attempt
a full backup of each filesystem. The script contains the following
features:
- Tape label generation -- Each tape is labeled, based on
a four-week cycle. This is stored as a header file on the tape.
- Usage count -- Overused tapes can be flagged/discarded.
- Wrong tape inserted -- If the wrong tape is inserted (i.e.,
the tape belongs to another tape-host), this can be flagged.
- Retry of full backup -- If the full backup fails on the
day that it is attempted, it is retried the next day.
- Once-a-month archive -- The script marks the backup as
an "archive" backup every four weeks.
- Logging -- Logs are maintained allowing for quick retrieval
of any data from a specific day.
The Local.Table file is generated from a Master.Table
that is centrally maintained on the master server. Whenever the
backup requirements change (e.g., a new workstation is added, or
a new filesystem is created), this Master.Table is updated.
A script, push_table, is then used to distribute the Local.Table
file to the appropriate server that will perform the backup (see
Listing 2).
The master server contains the Master.Table file:
cabb19 /dev/rmt/3hcn cabb19 / Mon
. . . /local_user2 .
. . caic40 / .
. . . /data .
. . casw19 / Tue
. . casw76 / .
. . cabb25 / Wed
. . caeda13 / .
. . . /var .
. . . /usr .
. . . /tmp .
. . . /opt .
. . . /home .
. . . /local_user .
. . capcb2 / Thu
. . . /local_user2 .
cabb20 /dev/rmt/3hcn cabb20 / Mon
. . . /local_user2 .
. . caic23 / Tue
cabb24 /dev/rmt/3hcn cabb24 / Mon
. . . /local_user .
. . cabb14 / Tue
cadoc1 /dev/rmt/3hcn cabb2 / Wed
. . camek10 / .
. . . /local_user .
. . casw42 / Thu
. . . /local_user2 .
. . carf42 / .
cadoc6 /dev/rmt/3hcn cadoc6 / Mon
. . . /local_user .
. . cabb35 / .
. . cadoc3 / Tue
(A "." is used for a repeated field to aid readability).
Each tab-separated field is described below:
1. Tape host server
2. Device name of drive on this host server
3. Server to back up
4. Filesystem on this server
5. Day of the week to attempt a full backup
push_table shows the distribution script. This script is
used not only to distribute the list of server/filesystems, but
also the main backup directory structure and executables should
they not exist on the target system. The switch settings for this
script are as follows:
-h host -- Only distribute to this host.
-co -- Only check that the backup directories exist
in the target host(s).
-nc -- Don't check that the backup directories
exist in the target host(s).
-nf -- Don't copy the executables across.
-f file -- Only copy this file across.
The default is to build a new backup directory structure if it
doesn't exist, copy the executables across, and build the Local.Table
file for each tape host.
As previously noted, the main backup script, Run_Dump,
is put into cron on all the tape-host servers. Also note that you
will need to enable access across the network via .rhosts
for interaction between the various servers. You must decide whether
the somewhat limited .rhosts security structure is appropriate
for your own environment. See the "About Run_Dump" file
included with the listings for explanation of the script.
Note that the get_mtstatus program (Listing 3) is a small
piece of C code to determine the status of a tape drive.
If any failures have occurred, the remove_old_history script
(Listing 4) will retain the detail log files, else they are removed.
The summary file for a typical backup may look like this:
Tape Duration
DATE TIME MB dumped L Label (mins) STATUS HOST File System
========= ===== ========= = ========== ============ ======== ===========
17Dec2000 21:01 9.347 1 caeda1-Tue2 1.3 OK caeda1 /
17Dec2000 21:05 36.493 1 caeda1-Tue2 3.7 OK caeda1 /local_user5
17Dec2000 21:07 0.528 1 caeda1-Tue2 1.8 OK caeda1 /local_user2
17Dec2000 21:08 2.361 1 caeda1-Tue2 1.6 OK caeda1 /local_user3
17Dec2000 21:10 9.833 1 caeda1-Tue2 1.8 OK caeda1 /local_user
17Dec2000 21:11 0.264 1 caeda1-Tue2 1.2 OK caeda1 /local_user4
17Dec2000 21:13 0.000 0 caeda1-Tue2 0.0 FAILED cabb29 /
17Dec2000 22:59 456.586 0 caeda1-Tue2 105.8 OK cabb7 /local_user
18Dec2000 00:45 393.630 0 caeda1-Tue2 105.8 OK camek18 /
18Dec2000 00:46 0.074 1 caeda1-Tue2 1.0 OK casw80 /local_user2
18Dec2000 01:04 55.136 1 caeda1-Tue2 17.5 OK casw69 /
18Dec2000 01:06 3.486 1 caeda1-Tue2 2.3 OK casw80 /
18Dec2000 01:09 6.576 1 caeda1-Tue2 2.8 OK camek1 /
18Dec2000 01:11 0.468 1 caeda1-Tue2 1.4 OK cadoc9 /local_user2
18Dec2000 01:13 6.598 1 caeda1-Tue2 2.7 OK carf34 /
18Dec2000 01:33 53.327 1 caeda1-Tue2 19.3 OK cadsp1 /
18Dec2000 01:34 0.943 1 caeda1-Tue2 1.2 OK cadoc13 /
18Dec2000 01:37 2.874 1 caeda1-Tue2 2.4 OK casw21 /local_user
18Dec2000 01:39 2.695 1 caeda1-Tue2 2.5 OK camek10 /
18Dec2000 01:40 1.304 1 caeda1-Tue2 1.2 OK caeda12 /local_user2
18Dec2000 01:45 14.970 1 caeda1-Tue2 4.8 OK carf18 /local_user
18Dec2000 01:50 12.709 1 caeda1-Tue2 4.5 OK casw83 /
18Dec2000 01:51 0.046 1 caeda1-Tue2 0.7 OK catec01 /tmp
18Dec2000 02:09 1.316 1 caeda1-Tue 16.9 OK casw58 /
18Dec2000 02:21 61.359 1 caeda1-Tue2 10.8 OK camek25 /local_user
Total Backed up = 1132.9 Mbytes
An archive summary file would display a "+" as the
separation character between the tape-host and the daily-cycle (e.g.,
caeda1+Tue2).
Restores
To determine which tape a particular filesystem is on for restore
purposes, you can simply specify what you are looking for and let
UNIX do the rest. The get_tape script (Listing 5) can be
used. If, for example, you want to look for all dumps of camek25:/local_user
for the past year, type:
./get_tape "2000.*camek25.*/local_user"
It will return:
17: 17Oct2000 22:16 122.199 1 caeda1-Thu1 27.5 OK camek25 /local_user
21: 07Nov2000 01:15 130.942 1 caeda1-Wed4 36.8 OK camek25 /local_user
6: 18Nov2000 23:53 1510.740 0 caeda1+Mon2 160.3 OK camek25 /local_user
24: 18Dec2000 02:21 61.359 1 caeda1-Tue2 10.8 OK camek25 /local_user
The sort command in the script orders the output by date. The
initial number in the output indicates the dump file number on tape
to specify with the s argument of the restore command.
This may be incorrect in certain circumstances due to failed backups,
but the THIS_IS_... flag file previously indicated will help
identify your relative position on the tape.
Conclusion
You may notice that this solution has evolved over the years because
of the variety of methods employed within the various scripts. However,
I have successfully ported this method within a number of the companies
for which I've worked. At one site, they had such a problem
with defective tape drives (with a "Call To Fix" of three
days) that I put a wrapper around the main script to randomly decide
which tape drive to use to back up all the systems. Although this
might sound very messy, it actually worked well and meant that if
one system were not backed up one night, it would probably be picked
up the next night using a different drive.
I hope that this article has shown how, with a little work up
front, multi-platform backups can become fully automated. We all
hope that backup tapes rarely need to be called upon, but there
is nothing more satisfying than a user phoning up, asking for a
file to be restored, and having that file quickly back in place.
Robin Wakefield studied Mechanical Engineering at City University,
London, his first job was a Stress Officer for Thorn EMI. He then
moved into CAD/CAE Systems Administration before becoming Development
Systems Manager at Cray Communications and UNIX Team Leader at Nokia
Mobile Phones. Robin currently works for Perot Systems, where he
is in the Messaging Engineering team at UBS Warburg, and he is a
regular contributor to HP's ITRC Forum. He can be reached via
email at: robin.wakefield@ubsw.com or eranuwak@aol.com.
|