Disaster
Recovery with Tivoli Storage Manager
Primitivo Cervantes
Tivoli Storage Manager (TSM) from Tivoli is an enterprise-wide
backup and recovery software package that is frequently used by
small and large companies to back up their critical data. The TSM
software runs on a server connected to the client's network,
and the other systems' data are backed up to this server through
the client's network. The data can also be restored to any
system through the network. Because the TSM server is a critical
resource, a process is needed to ensure the TSM server and any other
system's data are restored in the event of a disaster. In this
article, I will discuss how to use the Disaster Recovery Module
(DRM) of TSM to recover the TSM server and its database to restore
other systems' data. I will also discuss how off-site tapes
are created and managed in preparation for such a scenario.
A typical TSM server scenario will include the server hardware,
a tape library, two or more tape drives within the tape library,
the TSM software, a network connection, and other systems that need
to be backed up (TSM clients). The TSM server hardware and the TSM
clients can vary greatly (e.g., HP, Sun, IBM, etc.). The storage
devices can also vary (e.g., IBM tape libraries, STK silos, DLT
tape libraries, etc.).
For purposes of this article, I will describe the TSM and DRM
functions using the example of an IBM RISC System/6000 model H80
as the server system and an IBM 3494 tape library with two IBM 3590
tape drives as the storage devices. The IBM 3590 tape drives are
connected to the IBM H80 using a SCSI connection. The client network
is a 100-Mb Ethernet network. This TSM server will have a hostname
of "tsmserver1". I will also include an AIX system (IBM
RISC System/6000 model F80) as a TSM client. This TSM client will
have a hostname of "aixclient1".
Function of DRM
DRM serves two particular functions within TSM. It provides a
way of managing the TSM off-site tapes. It also creates a DRM report
that can be used to recreate the TSM server in the event of a disaster.
This DRM report is really a set of scripts and logs in a single
text file. This text file can be separated out to scripts, and these
scripts can be executed to recreate the TSM server.
Some Basic Terminology
A TSM "client" refers to a system that can back up to
the TSM "server". This TSM client system may be a file
server or SQL database server or other type of server. This TSM
client has a "nodename" associated with it in TSM. When
a client backs up data to the TSM server, the TSM server will store
it in tape in our example. These tapes belong to a "primary
storage pool". These primary storage pool tapes are duplicated
to another set of tapes to create the off-site tapes. These off-site
tapes belong to a "copy storage pool".
Configuring Base DRM Parameters
For our purposes, I will configure the base DRM parameters with
the following values:
Recovery Plan Prefix=/var/adm/tsm
Primary Storage Pools=BKUP_TAPEPOOL
Copy Storage Pools=OFFSITE_TAPE
DB Backup Series Expiration Days=14 Day(s)
Recovery Plan File Expiration Days=14 Day(s)
There are other fields, but those can be ignored for our purposes.
The fields have the following purposes:
Recovery Plan Prefix -- The recovery plan is the DRM
report. By default, the DRM report has the format <DATE>.<TIME>
where the <DATE> is the date the DRM report was generated
and <TIME> is the time the DRM report was generated.
For a DRM report that was generated on May 23, 2001 10:40:30 P.M.,
the DRM report name is 20010523.204030. With the value that
we will insert in this field, /var/adm/tsm, this same report
will have a filename of /var/adm/tsm.20010523.204030.
Primary Storage Pools -- The primary storage pool contains
the original tapes to be copied. In this case, we will call that
BKUP_TAPEPOOL.
Copy Storage Pools -- The copy storage pool contains
the copies of the BKUP_TAPEPOOL and, therefore, off-site
tapes. We will call this OFFSITE_TAPE.
DB Backup Series Expiration Days -- This parameter
determines how many days TSM will keep the TSM database backup information.
We will set this to 14 days.
Recovery Plan File Expiration Days -- This parameter
determines how many days TSM will keep the DRM reports. We will
set this to 14 days.
To set these fields, run these commands from the TSM administrative
client shell. To start the TSM administrative client, run dsmadmc
and enter the TSM administrative ID and password. You'll notice
that most of the commands are case insensitive, even though the
AIX operating system itself is case sensitive. There are exceptions,
such as when you are specifying file locations (as when specifying
the drmplanprefix). To set the recovery plan prefix:
set drmplanprefix /var/adm/tsm
To set the primary storage pool:
set drmprimstgpool bkup_tapepool
To set the copy storage pool:
set drmcopystgpool offsite_tape
To set the TSM db backup series expiration days:
set drmdbbackupexpiredays 14
To set the recovery plan file expiration days:
set drmrpfexpiredays 14
Defining Machines to DRM
When you define a "machine" to DRM, DRM will include that
system in the DRM report. DRM does not automatically include all of
its clients in the DRM report. You must manually input them into DRM.
This is not usually a problem but can be annoying if you have many
clients to include in DRM. In this example, we will include the TSM
server "tsmserver1" and the TSM client "aixclient1".
There are several machine fields that can be customized, and we
will focus on the "name", "description", and
"adsmserver". The "name" is the machine name,
often the TSM nodename. The "description" comprises a
sentence about the system and perhaps its function. The "adsmserver"
field tells DRM whether the system is the TSM server. TSM used to
be called ADSM, and there are still references to this in many places.
To define the machines to DRM:
Define machine tsmserver1 description="The actual TSM server" adsmserver=yes
Define machine aixclient1 description="The AIX TSM client" adsmserver=no
After we create the machine definition, we need to associate this
machine with its TSM nodename. Since a single system can have several
TSM nodenames, this is how DRM keeps track of the actual system. To
define the machines associates to DRM:
Define machnodeassociation tsmserver1 tsmserver1
Define machnodeassociation aixclient1 aixclient1
Defining Recovery Media to DRM
In general, TSM is used to back up critical data but is not used to
install the systems themselves. In the event of a disaster, the system
is first rebuilt and then TSM is used to restore the data. There are
ways to use TSM to install the actual system, and this is called a
"bare-metal restore". I will not focus on that type of restore
because this procedure differs with different types of systems. I
usually recommend that a single system be set up as an install server
for AIX, another for Sun, etc., so that the number of systems needing
actual bootable backups can be minimized.
When you define "recovery media" to DRM, you are telling
DRM the location of the system-bootable media so that DRM can include
that information in the DRM report. To define recovery media to
DRM:
Define recoverymedia aix_433_boot_cds volumenames=cds1-4 \
description="AIX 4.3.3 install CD's" location="top shelf of \
the 3rd floor cabinet" type=boot
After creating the recovery media definition, we must associate this
with the TSM machines. To define this association:
Define recmedmachassociation aix_433_boot_cds tsmserver1
Define recmedmachassociation aix_433_boot_cds aixclient1
What to Do Before a Disaster
To restore the TSM server and hence any data that it contains,
you will need at least three things: the TSM database backup tape,
the DRM report, and the off-site tapes that contain the TSM client
data. To prepare for a disaster, daily TSM activities must include
the following steps in order:
- Backup the TSM client data.
- Create the off-site tapes.
- Backup the TSM database.
- Create the DRM report.
- Perform an AIX-bootable backup (mksysb) to include the
DRM report.
- Send the off-site tapes, the TSM database backup tape, and
the TSM server mksysb tape to an off-site vault.
- Send additional recovery media (such as AIX installation CDs)
to the vault.
Backing up the TSM client data can differ depending on the type
of data. For example, a TSM agent can be used to back up an Oracle
database while it is online. For regular AIX files:
/usr/tivoli/tsm/client/ba/bin/dsmc inc
To create the off-site tapes, we back up the primary storage pool
tapes to the copy storage pool tapes. We do this with the TSM administrative
client, dsmadmc. After starting the dsmadmc shell:
backup stgpool bkup_tapepool offsite_tape
The DRM report in our case will be in the /var/adm directory
and will have a format of tsm.<DATE>.<TIME>, as
mentioned previously. To create the DRM report:
prepare
To recover the TSM server, we will need the AIX install CDs, the TSM
install CDs, and the DRM report. The DRM report contains scripts to
recreate the TSM database, logs, options files, etc. We can combine
these into one by creating a bootable install tape (AIX mksysb).
All we need in this case is the AIX mksysb tape, which will
contain everything else. As we will see later, this lets us ignore
many of the scripts in the DRM report, saving a lot of time and effort.
To create the AIX mksysb:
/usr/bin/mksysb -i /dev/rmt0
where /dev/rmt0 is the tape device to back up to.
After performing all of the above actions, the AIX mksysb tape,
the TSM database backup tape, and all of the off-site data tapes
must be sent to the off-site vault for safe keeping. Additionally,
keep a copy of the installation media for restoring specific clients,
such as AIX installation CDs, HP installation CDs, etc. This is
done in case other ways of rebuilding the TSM clients fail and we
have to rebuild the system.
Using DRM to Manage Off-Site Tapes
Without DRM, there are basically only a couple of states for the
tapes, where access is equal to readwrite or offsite
depending on whether it is in the tape library of out of the tape
library. DRM adds more states that are useful for managing off-site
tapes. Here are the more important states in the order that they
would be used:
Mountable -- The tape is in the tape library and can
be mounted into a tape drive for reading and writing.
Courier -- The tape has been ejected from the tape
library and is in transit to the off-site vault.
Vault -- The tape is in the off-site vault.
VaultRetrieve -- The data on the tape has expired or
been moved to another tape so it no longer contains any data. It
can be retrieved from the vault and used again. It is important
to note that the TSM server automatically sets the tape to this
state. It cannot manually be set to this state. The only way to
manually set a tape to this state is to move the data from the tape
or to delete the data contained in the tape.
CourierRetrieve -- The tape has been recalled from
the vault and is in transit back to the office.
OnsiteRetrieve -- The courier has returned the tape
to the office. Note that as soon as the tape is move to onsiteretrieve,
TSM removes it from its database. As far as TSM is concerned, the
tape no longer exists. This is because as the tape is returned to
the office, it can actually be used on other systems and does not
necessarily have to be returned to this particular TSM server.
All of the following commands are performed in the TSM administrative
client shell, dsmadmc. In order for a tape to be mountable,
it is inserted into the tape library and checked into TSM. In the
following command, IBM3494 is the name of the tape library
as defined to TSM. To check in the tapes:
Checkin libv IBM3494 status=scratch search=yes checklabel=yes devtype=3590
After creating the off-site tapes, those tapes need to be ejected
from the library and marked as courier. The following command
will do it in one step:
Move drmedia * stgpool=offsite_tape remove=yes tostate=courier
Once the tapes have arrived at the vault, they need to be changed
to vault status:
Move drmedia * wherestate=courier tostate=vault
Again, when the data in the off-site tapes expires and the tape contains
no more data, TSM automatically changes the tape status to vaultretrieve.
In order to find out which tapes are available to be retrieved from
the vault and reused:
Query drmedia * wherestate=vaultretrieve
After contacting the vault with a list of tapes that need to be retrieved
from the vault, we can change the status of the tapes to courierretrieve:
Move drmedia * wherestate=vaultretrieve tostate=courierretrieve
Once the courier has delivered the tapes to the office from the vault,
we can change the status of the tapes to onsiteretrieve:
Move drmedia * wherestate=courierretrieve tostate=onsiteretrieve
After the tapes arrive back at the office and are changed to onsiteretrieve
status, TSM will remove the information about the tape from its database.
It can then be checked into the server again, and the cycle repeats.
Automating the Management of Off-Site Tapes
It's important to note that all of the tape management commands
can be combined into scripts so that most of the process can be
automated. It all depends on your requirements. Some clients like
to have total control and accountability, so they require their
operators to perform everything manually. Other clients automate
the process as much as possible. It all depends on your requirements
and on your confidence that the courier will deliver all of the
off-site tapes to the vault.
Format of the DRM Report
To understand how to use the DRM report to recover the TSM server,
we need to understand its format. The DRM report is really a single
text file that contains entire scripts, logs, and configuration
files, etc. all in the same text file. The format is best described
using an example:
begin PLANFILE.DESCRIPTION
Recovery Plan for Server SERVER1
Created by DRM PREPARE on 09/15/00 13:34:47
DRM PLANPREFIX /usr/local/scripts/log/prepare_doc
Storage Management Server for AIX-RS/6000 - Version 3, Release 7, Level 3.0
end PLANFILE.DESCRIPTION
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin PLANFILE.TABLE.OF.CONTENTS
PLANFILE.DESCRIPTION
PLANFILE.TABLE.OF.CONTENTS
...other file contents...
end PLANFILE.TABLE.OF.CONTENTS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
...additional scripts...
As you can see, the DRM report is broken down into sections. Each
section is a file, log, or script. The section begins with the key
word "begin" followed by the file name, in the first
file PLANFILE.DESCRIPTION. A blank line follows this, and the
contents of the file are then included. After the file contents, the
section ends with three lines. The first ending line starts with the
key word "end" followed by the file name, a blank
line, and a line with "*-*-" repeated across the
line.
The names of the files are important because they are used within
the scripts that DRM creates in the DRM report. The scripts within
the DRM report expects those names to be there along with the "DRM
plan prefix" we had configured before. For example, the
file above PLANFILE.DESCRIPTION would be appended to the
DRM plan prefix we configured of /var/adm/tsm to get the
filename /var/adm/tsm.PLANFILE.DESCRIPTION.
I've written a small Perl script to extract the files and
break them into the appropriate filenames. This is a fairly simple
script so I won't go through it here. The code for this article
is available for download at the Sys Admin Web site. This
script will take the DRM report and separate it into the appropriate
filenames using the tsm as a DRM plan prefix. Simply cd
to the /var/adm directory, and run the following command:
cat <DRM REPORT> | perl prepare_extract_files.pl
where "<DRM REPORT>" is the name of the report.
Those files are broken down in your current directory. In addition
to the above Perl script, I have included an actual DRM report so
you can see how this script extracts files. The Perl script has a
filename of tsm_prepare_extract_files.pl, and the DRM report
has the name tsm.20000915.130005.
What to Do in the Event of a Disaster
Now that we have all of this information, what happens if we lose
the computer room and have to rebuild everything at another location?
The assumption here is that we have another site where we can recover
systems. Also, we will assume that the recovery site has identical
hardware. Because we have taken time to prepare for this event,
the recovery process will be remarkably simple. We will proceed
with the recovery with the following steps:
- Retrieve the off-site tapes, including the TSM database backup
tapes and the AIX mksysb tape.
- Rebuild the TSM server using the AIX mksysb tape.
- Insert all of the off-site tapes into the tape library.
- Break out the latest DRM report into scripts and files using
the Perl script mentioned above.
- Use the DRM scripts to rebuild the TSM server database.
- Install the TSM clients.
- Rebuild the TSM clients using TSM.
It is generally best when retrieving off-site tapes to retrieve
all of them. Depending on where the vault is, it will probably be
unfeasible to keep calling the vault to retrieve tapes during a
recovery process. That is because assuming that a major disaster
has occurred, it will be quite urgent to get these systems up and
running.
Once the tapes are at the recovery location, the first step is
to rebuild the TSM server. Using the AIX mksysb tape and
using the regular AIX mksysb installation procedures, reinstall
the AIX operating system on the TSM server. During this time, the
tapes could be inserted into the tape library.
After the server has been rebuilt, it will not be in a state that
we can use to start TSM. We will have to rebuild the TSM database
first. This is where the DRM report and its scripts come into play.
Break out the DRM report into scripts using the Perl script mentioned
above. The script that we are interested in will be called /var/adm/tsm.RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE.
When you run this script, it will use several of the other scripts
and files to:
- Rebuild the TSM server configuration files.
- Rebuild the TSM volume history and device configuration files.
- Create and format and initialize the TSM database and TSM log
files.
- Recreate the TSM database from the TSM database backup tape.
- Start the TSM server.
- Up until this point, you could use this procedure to rebuild
a TSM server in which all of the tapes are available including
all of the primary storage pool tapes.
In case of an actual disaster, where the primary storage pool
tapes are not available, you will also be prompted to do the
following:
- Register the TSM server licenses.
- Tell TSM the copy storage pool tapes are available.
- Mark the primary storage pool tapes as "destroyed".
That's it! One script does it all. All of the preparation
makes this possible. DRM does a great job of creating this script
and doing the hard work for you. This would normally be the
hard part since the TSM server rebuild is a prerequisite to
rebuilding all of the TSM clients' data.
After rebuilding the TSM server, just install the clients
using the installation media or bootable install tapes. Again,
my recommendation is to create an install server for each type
of system, rebuild the install server first, and then use that
to rebuild the individual client operating systems. Once the
client operating systems are operational, we can use TSM to
restore all of the data using the normal TSM restore procedures.
Summary
I have introduced the DRM component of TSM and shown how to
configure it. Also, I have introduced how DRM manages tapes
and discussed what to do to prepare for a disaster. Additionally,
I have talked about the DRM report and how to use that to rebuild
the TSM server. Once this TSM server is rebuilt, we can use
it to restore all of the TSM client's data.
Primitivo Cervantes is an IT Specialist who has worked
as a consultant for the past nine years. He has been in the
computer/systems industry for fifteen years and has specialized
in high-availability and disaster-recovery systems for the past
seven years
|