Cover V10, I09

sep2001.tar


Disaster Recovery with Tivoli Storage Manager

Primitivo Cervantes

Tivoli Storage Manager (TSM) from Tivoli is an enterprise-wide backup and recovery software package that is frequently used by small and large companies to back up their critical data. The TSM software runs on a server connected to the client's network, and the other systems' data are backed up to this server through that network. The data can also be restored to any system through the network. Because the TSM server is a critical resource, a process is needed to ensure that the TSM server, and with it every other system's data, can be restored in the event of a disaster. In this article, I will discuss how to use the Disaster Recovery Manager (DRM) of TSM to recover the TSM server and its database so that other systems' data can then be restored. I will also discuss how off-site tapes are created and managed in preparation for such a scenario.

A typical TSM server scenario will include the server hardware, a tape library, two or more tape drives within the tape library, the TSM software, a network connection, and other systems that need to be backed up (TSM clients). The TSM server hardware and the TSM clients can vary greatly (e.g., HP, Sun, IBM, etc.). The storage devices can also vary (e.g., IBM tape libraries, STK silos, DLT tape libraries, etc.).

For purposes of this article, I will describe the TSM and DRM functions using the example of an IBM RISC System/6000 model H80 as the server system and an IBM 3494 tape library with two IBM 3590 tape drives as the storage devices. The IBM 3590 tape drives are connected to the IBM H80 using a SCSI connection. The client network is a 100-Mb Ethernet network. This TSM server will have a hostname of "tsmserver1". I will also include an AIX system (IBM RISC System/6000 model F80) as a TSM client. This TSM client will have a hostname of "aixclient1".

Function of DRM

DRM serves two particular functions within TSM. It provides a way of managing the TSM off-site tapes. It also creates a DRM report that can be used to recreate the TSM server in the event of a disaster. This DRM report is really a set of scripts and logs in a single text file. This text file can be separated out to scripts, and these scripts can be executed to recreate the TSM server.

Some Basic Terminology

A TSM "client" refers to a system that can back up to the TSM "server". This TSM client system may be a file server, an SQL database server, or another type of server. Each TSM client has a "nodename" associated with it in TSM. When a client backs up data to the TSM server, the server stores the data on tape (in our example). These tapes belong to a "primary storage pool". The primary storage pool tapes are duplicated to another set of tapes to create the off-site tapes, which belong to a "copy storage pool".

Configuring Base DRM Parameters

For our purposes, I will configure the base DRM parameters with the following values:

Recovery Plan Prefix=/var/adm/tsm
Primary Storage Pools=BKUP_TAPEPOOL
Copy Storage Pools=OFFSITE_TAPE
DB Backup Series Expiration Days=14 Day(s)
Recovery Plan File Expiration Days=14 Day(s)
There are other fields, but those can be ignored for our purposes. The fields have the following purposes:

Recovery Plan Prefix -- The recovery plan is the DRM report. By default, the DRM report name has the format <DATE>.<TIME>, where <DATE> is the date the DRM report was generated and <TIME> is the time (on a 24-hour clock) it was generated. For a DRM report that was generated on May 23, 2001 at 8:40:30 P.M., the DRM report name is 20010523.204030. With the value that we will insert in this field, /var/adm/tsm, this same report will have a filename of /var/adm/tsm.20010523.204030.

Primary Storage Pools -- The primary storage pool contains the original tapes to be copied. In this case, we will call that BKUP_TAPEPOOL.

Copy Storage Pools -- The copy storage pool contains the copies of the BKUP_TAPEPOOL and, therefore, off-site tapes. We will call this OFFSITE_TAPE.

DB Backup Series Expiration Days -- This parameter determines how many days TSM will keep the TSM database backup information. We will set this to 14 days.

Recovery Plan File Expiration Days -- This parameter determines how many days TSM will keep the DRM reports. We will set this to 14 days.

To set these fields, run these commands from the TSM administrative client shell. To start the TSM administrative client, run dsmadmc and enter the TSM administrative ID and password. You'll notice that most of the commands are case insensitive, even though the AIX operating system itself is case sensitive. There are exceptions, such as when you are specifying file locations (as when specifying the drmplanprefix). To set the recovery plan prefix:

set drmplanprefix /var/adm/tsm
To set the primary storage pool:

set drmprimstgpool bkup_tapepool
To set the copy storage pool:

set drmcopystgpool offsite_tape
To set the TSM db backup series expiration days:

set drmdbbackupexpiredays 14
To set the recovery plan file expiration days:

set drmrpfexpiredays 14
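The five settings can also be kept in one place and replayed. The following is a minimal sketch, not part of the original procedure: a shell function (the name drm_base_settings and the macro file path are made up for illustration) prints the same set commands, so they can be saved to a file and run from within dsmadmc using its macro command. The query drmstatus command can then be used to display the values in effect.

```shell
#!/bin/sh
# Print the DRM base settings used in this article as administrative commands.
# Optional arguments: plan prefix, primary pool, copy pool, expiration days.
# The defaults are the example values from the article.
drm_base_settings() {
    prefix=${1:-/var/adm/tsm}
    primary=${2:-bkup_tapepool}
    copy=${3:-offsite_tape}
    days=${4:-14}
    printf 'set drmplanprefix %s\n' "$prefix"
    printf 'set drmprimstgpool %s\n' "$primary"
    printf 'set drmcopystgpool %s\n' "$copy"
    printf 'set drmdbbackupexpiredays %s\n' "$days"
    printf 'set drmrpfexpiredays %s\n' "$days"
}

# Example use (hypothetical file name, not run here):
#   drm_base_settings > /tmp/drm_settings.mac
# then, inside the dsmadmc shell:
#   macro /tmp/drm_settings.mac
#   query drmstatus
```

Keeping the values in a function like this makes it easier to apply identical settings to a rebuilt or second TSM server.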

Defining Machines to DRM

When you define a "machine" to DRM, DRM will include that system in the DRM report. DRM does not automatically include all of its clients in the DRM report. You must manually input them into DRM. This is not usually a problem but can be annoying if you have many clients to include in DRM. In this example, we will include the TSM server "tsmserver1" and the TSM client "aixclient1".

There are several machine fields that can be customized, and we will focus on the "name", "description", and "adsmserver". The "name" is the machine name, often the TSM nodename. The "description" comprises a sentence about the system and perhaps its function. The "adsmserver" field tells DRM whether the system is the TSM server. TSM used to be called ADSM, and there are still references to this in many places. To define the machines to DRM:

Define machine tsmserver1 description="The actual TSM server" adsmserver=yes
Define machine aixclient1 description="The AIX TSM client" adsmserver=no
After we create the machine definition, we need to associate this machine with its TSM nodename. Since a single system can have several TSM nodenames, this is how DRM keeps track of the actual system. To define the machine-to-node associations to DRM:

Define machnodeassociation tsmserver1 tsmserver1
Define machnodeassociation aixclient1 aixclient1
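Because every client has to be entered by hand, sites with many clients may prefer to generate these commands. The sketch below is one possible approach, under the assumption that each machine name is the same as its TSM nodename; the function name gen_machine_defs is hypothetical, and the optional description field is left out for brevity.

```shell
#!/bin/sh
# Read TSM node names (one per line) on stdin and print the DRM commands
# that define each node as a machine and associate it with its nodename.
# $1 = the node that is the TSM server itself (it gets adsmserver=yes).
# Assumption: machine name and nodename are identical.
gen_machine_defs() {
    server=$1
    while IFS= read -r node; do
        [ -n "$node" ] || continue          # skip blank lines
        if [ "$node" = "$server" ]; then
            flag=yes
        else
            flag=no
        fi
        printf 'define machine %s adsmserver=%s\n' "$node" "$flag"
        printf 'define machnodeassociation %s %s\n' "$node" "$node"
    done
}

# Example use (not run here):
#   printf 'tsmserver1\naixclient1\n' | gen_machine_defs tsmserver1 > machines.mac
```

The generated file can be reviewed and then run from the dsmadmc shell with the macro command.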

Defining Recovery Media to DRM

In general, TSM is used to back up critical data but is not used to install the systems themselves. In the event of a disaster, the system is first rebuilt and then TSM is used to restore the data. There are ways to use TSM to install the actual system, and this is called a "bare-metal restore". I will not focus on that type of restore because this procedure differs with different types of systems. I usually recommend that a single system be set up as an install server for AIX, another for Sun, etc., so that the number of systems needing actual bootable backups can be minimized.

When you define "recovery media" to DRM, you are telling DRM the location of the system-bootable media so that DRM can include that information in the DRM report. To define recovery media to DRM:

Define recoverymedia aix_433_boot_cds volumenames=cds1-4 \
  description="AIX 4.3.3 install CDs" \
  location="top shelf of the 3rd floor cabinet" type=boot
After creating the recovery media definition, we must associate this with the TSM machines. To define this association:

Define recmedmachassociation aix_433_boot_cds tsmserver1
Define recmedmachassociation aix_433_boot_cds aixclient1

What to Do Before a Disaster

To restore the TSM server and hence any data that it contains, you will need at least three things: the TSM database backup tape, the DRM report, and the off-site tapes that contain the TSM client data. To prepare for a disaster, daily TSM activities must include the following steps in order:

  • Back up the TSM client data.
  • Create the off-site tapes.
  • Back up the TSM database.
  • Create the DRM report.
  • Perform an AIX-bootable backup (mksysb) to include the DRM report.
  • Send the off-site tapes, the TSM database backup tape, and the TSM server mksysb tape to an off-site vault.
  • Send additional recovery media (such as AIX installation CDs) to the vault.

Backing up the TSM client data can differ depending on the type of data. For example, a TSM agent can be used to back up an Oracle database while it is online. For regular AIX files:

/usr/tivoli/tsm/client/ba/bin/dsmc inc
To create the off-site tapes, we back up the primary storage pool tapes to the copy storage pool tapes. We do this with the TSM administrative client, dsmadmc. After starting the dsmadmc shell:

backup stgpool bkup_tapepool offsite_tape
The DRM report in our case will be in the /var/adm directory and will have a format of tsm.<DATE>.<TIME>, as mentioned previously. To create the DRM report:

prepare
To recover the TSM server, we will need the AIX install CDs, the TSM install CDs, and the DRM report. The DRM report contains scripts to recreate the TSM database, logs, options files, etc. We can combine these into one by creating a bootable install tape (AIX mksysb). All we need in this case is the AIX mksysb tape, which will contain everything else. As we will see later, this lets us ignore many of the scripts in the DRM report, saving a lot of time and effort. To create the AIX mksysb:

/usr/bin/mksysb -i /dev/rmt0
where /dev/rmt0 is the tape device to back up to.

After performing all of the above actions, the AIX mksysb tape, the TSM database backup tape, and all of the off-site data tapes must be sent to the off-site vault for safekeeping. Additionally, keep a copy of the installation media for restoring specific clients, such as AIX installation CDs, HP installation CDs, etc. This is done in case other ways of rebuilding the TSM clients fail and we have to rebuild a system from scratch.
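The server-side steps above can be strung together in a script. What follows is only a sketch under several assumptions that are not from the article: the administrator ID and password, the device class name 3590class used for the database backup, and the function names are placeholders, and wait=yes is added so that each administrative command finishes before the next one starts. The script defaults to a dry run that only prints what it would do, and it assumes the client backups have already completed.

```shell
#!/bin/sh
# Daily disaster-recovery preparation on the TSM server, in the article's order.
# DRY_RUN=1 (the default) only prints each step; set DRY_RUN=0 to execute.
# ASSUMPTIONS: TSM_ID, TSM_PW and the device class "3590class" are placeholders.
DRY_RUN=${DRY_RUN:-1}
TSM_ID=${TSM_ID:-admin}
TSM_PW=${TSM_PW:-secret}
DEVCLASS=${DEVCLASS:-3590class}

# Run (or, in a dry run, print) an operating system command.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run (or print) one TSM administrative command through dsmadmc.
admin() {
    if [ "$DRY_RUN" = 1 ]; then
        printf 'dsmadmc: %s\n' "$1"
    else
        dsmadmc -id="$TSM_ID" -password="$TSM_PW" "$1"
    fi
}

# Each step runs only if the previous one succeeded.
server_dr_prep() {
    # Create the off-site copies of the primary storage pool.
    admin "backup stgpool bkup_tapepool offsite_tape wait=yes" &&
    # Back up the TSM database (the device class name is an assumption).
    admin "backup db devclass=$DEVCLASS type=full wait=yes" &&
    # Create the DRM report under the configured plan prefix.
    admin "prepare wait=yes" &&
    # Bootable AIX backup, which also captures the DRM report on disk.
    run /usr/bin/mksysb -i /dev/rmt0
}
```

Calling server_dr_prep with DRY_RUN left at 1 prints the four steps in order, which is a convenient way to review the sequence before letting it run unattended.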

Using DRM to Manage Off-Site Tapes

Without DRM, a tape has basically only two states: its access is either readwrite or offsite, depending on whether it is in the tape library or out of the tape library. DRM adds more states that are useful for managing off-site tapes. Here are the more important states in the order that they would be used:

Mountable -- The tape is in the tape library and can be mounted into a tape drive for reading and writing.

Courier -- The tape has been ejected from the tape library and is in transit to the off-site vault.

Vault -- The tape is in the off-site vault.

VaultRetrieve -- The data on the tape has expired or been moved to another tape, so the tape no longer contains any data. It can be retrieved from the vault and used again. It is important to note that the TSM server sets the tape to this state automatically; an administrator cannot set it directly. The only way to force a tape into this state is to move the data off the tape or to delete the data it contains.

CourierRetrieve -- The tape has been recalled from the vault and is in transit back to the office.

OnsiteRetrieve -- The courier has returned the tape to the office. Note that as soon as the tape is moved to onsiteretrieve, TSM removes it from its database. As far as TSM is concerned, the tape no longer exists. This is because once the tape is back at the office, it can be used on other systems and does not necessarily have to be returned to this particular TSM server.

All of the following commands are performed in the TSM administrative client shell, dsmadmc. In order for a tape to be mountable, it is inserted into the tape library and checked into TSM. In the following command, IBM3494 is the name of the tape library as defined to TSM. To check in the tapes:

Checkin libv IBM3494 status=scratch search=yes checklabel=yes devtype=3590
After creating the off-site tapes, those tapes need to be ejected from the library and marked as courier. The following command will do it in one step:

Move drmedia * copystgpool=offsite_tape remove=yes tostate=courier
Once the tapes have arrived at the vault, they need to be changed to vault status:

Move drmedia * wherestate=courier tostate=vault
Again, when the data in the off-site tapes expires and the tape contains no more data, TSM automatically changes the tape status to vaultretrieve. In order to find out which tapes are available to be retrieved from the vault and reused:

Query drmedia * wherestate=vaultretrieve
After contacting the vault with a list of tapes that need to be retrieved from the vault, we can change the status of the tapes to courierretrieve:

Move drmedia * wherestate=vaultretrieve tostate=courierretrieve
Once the courier has delivered the tapes to the office from the vault, we can change the status of the tapes to onsiteretrieve:

Move drmedia * wherestate=courierretrieve tostate=onsiteretrieve
After the tapes arrive back at the office and are changed to onsiteretrieve status, TSM will remove the information about the tape from its database. It can then be checked into the server again, and the cycle repeats.
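The cycle of states can be summarized in a few lines of shell. This is an illustrative sketch only; the function names next_drm_state and move_cmd are made up, and state names are expected in lowercase. It encodes the order described above and refuses to build a command for the vault-to-vaultretrieve step, since TSM makes that change itself.

```shell
#!/bin/sh
# Print the state that follows a given DRM media state in the off-site cycle.
# Returns non-zero for onsiteretrieve (end of the cycle) or an unknown state.
next_drm_state() {
    case $1 in
        mountable)        echo courier ;;
        courier)          echo vault ;;
        vault)            echo vaultretrieve ;;   # set by TSM, not by the operator
        vaultretrieve)    echo courierretrieve ;;
        courierretrieve)  echo onsiteretrieve ;;
        *)                return 1 ;;
    esac
}

# Print the "move drmedia" command that advances all tapes out of state $1.
# Fails for "vault" because only TSM can move a tape to vaultretrieve.
move_cmd() {
    from=$1
    if [ "$from" = vault ]; then
        return 1
    fi
    to=$(next_drm_state "$from") || return 1
    printf 'move drmedia * wherestate=%s tostate=%s\n' "$from" "$to"
}
```

For example, move_cmd courier prints the command used above when the tapes arrive at the vault.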

Automating the Management of Off-Site Tapes

It's important to note that all of the tape management commands can be combined into scripts so that most of the process can be automated. Some clients like to have total control and accountability, so they require their operators to perform everything manually. Other clients automate the process as much as possible. It all depends on your requirements and on your confidence that the courier will deliver all of the off-site tapes to the vault.

Format of the DRM Report

To understand how to use the DRM report to recover the TSM server, we need to understand its format. The DRM report is a single text file that contains complete scripts, logs, and configuration files, one after another. The format is best described using an example:

begin PLANFILE.DESCRIPTION

Recovery Plan for Server SERVER1
Created by DRM PREPARE on 09/15/00 13:34:47
DRM PLANPREFIX /usr/local/scripts/log/prepare_doc
Storage Management Server for AIX-RS/6000 - Version 3, Release 7, Level 3.0

end PLANFILE.DESCRIPTION

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

begin PLANFILE.TABLE.OF.CONTENTS

PLANFILE.DESCRIPTION
PLANFILE.TABLE.OF.CONTENTS

...other file contents...

end PLANFILE.TABLE.OF.CONTENTS

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
...additional scripts...
As you can see, the DRM report is broken down into sections. Each section is a file, log, or script. A section begins with the key word "begin" followed by the file name; in the first section, this is PLANFILE.DESCRIPTION. A blank line follows, and then the contents of the file. After the file contents, the section closes with three lines: a line with the key word "end" followed by the file name, a blank line, and a line with "*-*-" repeated across the line.

The names of the files are important because they are used within the scripts that DRM creates in the DRM report. The scripts within the DRM report expect files with those names to exist, prefixed with the "DRM plan prefix" we configured earlier. For example, the section name PLANFILE.DESCRIPTION above is appended to our DRM plan prefix of /var/adm/tsm to give the filename /var/adm/tsm.PLANFILE.DESCRIPTION.

I've written a small Perl script to extract the files and write them out under the appropriate filenames. This is a fairly simple script so I won't go through it here. The code for this article is available for download at the Sys Admin Web site. This script takes the DRM report and separates it into the appropriate files using tsm as the DRM plan prefix. Simply cd to the /var/adm directory, and run the following command:

cat <DRM REPORT> | perl tsm_prepare_extract_files.pl
where "<DRM REPORT>" is the name of the report. The extracted files are written to your current directory. In addition to the above Perl script, I have included an actual DRM report so you can see how this script extracts files. The Perl script has a filename of tsm_prepare_extract_files.pl, and the DRM report has the name tsm.20000915.130005.
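For readers who want to see the splitting logic itself, here is a rough approximation in shell and awk. It is not the Perl script mentioned above, only a sketch of the same idea. The function name split_drm_report is hypothetical, and the sketch assumes that the section name is the second word on each "begin" and "end" line and that only the single blank line after "begin" should be dropped.

```shell
#!/bin/sh
# Split a DRM report (read from stdin) into one file per section.
# $1 = DRM plan prefix; section NAME is written to "<prefix>.NAME".
# Assumption: the section name is the second word on the begin/end lines.
split_drm_report() {
    awk -v prefix="$1" '
        # Start of a section: open the output file, drop the next blank line.
        $1 == "begin" && out == "" {
            name = $2
            out = prefix "." name
            skip = 1
            printf "" > out
            next
        }
        # Matching end marker: close the current output file.
        $1 == "end" && $2 == name && out != "" {
            close(out)
            out = ""
            next
        }
        # Inside a section: copy lines, except the one blank line after begin.
        out != "" {
            if (skip && $0 == "") {
                skip = 0
                next
            }
            skip = 0
            print > out
        }
    '
}

# Example use (not run here):
#   split_drm_report /var/adm/tsm < /var/adm/tsm.20010523.204030
```

Lines between sections, such as the "*-*-" separators, are ignored because no output file is open at that point.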

What to Do in the Event of a Disaster

Now that we have all of this information, what happens if we lose the computer room and have to rebuild everything at another location? The assumption here is that we have another site where we can recover systems. Also, we will assume that the recovery site has identical hardware. Because we have taken time to prepare for this event, the recovery process will be remarkably simple. We will proceed with the recovery with the following steps:

  • Retrieve the off-site tapes, including the TSM database backup tapes and the AIX mksysb tape.
  • Rebuild the TSM server using the AIX mksysb tape.
  • Insert all of the off-site tapes into the tape library.
  • Break out the latest DRM report into scripts and files using the Perl script mentioned above.
  • Use the DRM scripts to rebuild the TSM server database.
  • Install the TSM clients.
  • Rebuild the TSM clients using TSM.

It is generally best when retrieving off-site tapes to retrieve all of them. Depending on where the vault is, it will probably be infeasible to keep calling the vault for more tapes during the recovery process. After a major disaster, getting these systems up and running will be urgent.

Once the tapes are at the recovery location, the first step is to rebuild the TSM server. Using the AIX mksysb tape and using the regular AIX mksysb installation procedures, reinstall the AIX operating system on the TSM server. During this time, the tapes could be inserted into the tape library.

After the server has been rebuilt, it will not be in a state that we can use to start TSM. We will have to rebuild the TSM database first. This is where the DRM report and its scripts come into play. Break out the DRM report into scripts using the Perl script mentioned above. The script that we are interested in will be called /var/adm/tsm.RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE. When you run this script, it will use several of the other scripts and files to:

  • Rebuild the TSM server configuration files.

  • Rebuild the TSM volume history and device configuration files.

  • Create, format, and initialize the TSM database and TSM log files.

  • Recreate the TSM database from the TSM database backup tape.

  • Start the TSM server.

Up until this point, you could use this procedure to rebuild a TSM server in which all of the tapes are available, including all of the primary storage pool tapes.

In case of an actual disaster, where the primary storage pool tapes are not available, you will also be prompted to do the following:

  • Register the TSM server licenses.

  • Tell TSM the copy storage pool tapes are available.

  • Mark the primary storage pool tapes as "destroyed".

That's it! One script does it all. All of the preparation makes this possible. DRM does a great job of creating this script and doing the hard work for you. This would normally be the hard part since the TSM server rebuild is a prerequisite to rebuilding all of the TSM clients' data.
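One small detail in the recovery steps is choosing the latest DRM report when several are on disk. The sketch below shows one way to do that, relying on the <DATE>.<TIME> naming described earlier; the function name latest_drm_report is hypothetical.

```shell
#!/bin/sh
# Print the newest DRM report for a plan prefix such as /var/adm/tsm.
# Report names end in <DATE>.<TIME> with fixed-width digits (8 and 6),
# so a plain lexical sort places the most recent report last.
# Prints nothing if no report is found.
latest_drm_report() {
    ls "$1".[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9] \
        2>/dev/null | sort | tail -n 1
}

# Example use (not run here):
#   report=$(latest_drm_report /var/adm/tsm)
#   echo "Latest DRM report: $report"
```

Files already extracted from a report, such as /var/adm/tsm.PLANFILE.DESCRIPTION, do not match the pattern and are therefore never mistaken for a report.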

After rebuilding the TSM server, just install the clients using the installation media or bootable install tapes. Again, my recommendation is to create an install server for each type of system, rebuild the install server first, and then use that to rebuild the individual client operating systems. Once the client operating systems are operational, we can use TSM to restore all of the data using the normal TSM restore procedures.

Summary

I have introduced the DRM component of TSM and shown how to configure it. Also, I have introduced how DRM manages tapes and discussed what to do to prepare for a disaster. Additionally, I have talked about the DRM report and how to use that to rebuild the TSM server. Once this TSM server is rebuilt, we can use it to restore all of the TSM clients' data.

Primitivo Cervantes is an IT Specialist who has worked as a consultant for the past nine years. He has been in the computer/systems industry for fifteen years and has specialized in high-availability and disaster-recovery systems for the past seven years.