Backups and Disaster Recovery
Computers have become an integral part of today's technological society, and, now more than ever, the protection and continued precise operation of those computers is essential to the functioning of even basic everyday services. When a site's computers fail to perform as expected, restoring the computing services to their former efficiency and integrity is a top priority. To facilitate the timely recovery of a system, you must analyze your site-specific risks and needs, implement a tailored solution, and test all of the components to ensure that the solution is as infallible as possible.
There are enough aspects to risk management and disaster recovery to warrant a book, so I will focus on one particular topic: data recovery from backups. Whether the disaster results from a breach in security or hardware or software failure, there is almost always a need to revert to some previously known state, making restoration from removable media an integral part of disaster recovery. To ensure that your data recovery goes as smoothly as possible, you can implement a number of preventive measures. To address as many flavors of UNIX and backup hardware and software as possible, this article provides a fairly generic overview of how to plan for quick restoration of data after a disaster.
The first step toward ensuring a speedy recovery is a well-planned backup scheme. The following are some ideas to consider when formulating plans for your site. Every site has differing needs and available resources, making no two disaster recovery plans alike. At the very least, a good plan includes finding the appropriate backup hardware and software for your needs, planning incremental and archive backups, cataloging, storing, and rotating tapes, making sure that clear instructions are available for those who will need to restore the machine, and testing it all before a true emergency occurs.
Choosing Storage Hardware
Picking the type of hardware and media for your site's backups can be one of the easiest steps if you have a general idea of your site's projected growth. Pick tape hardware that is compatible with your server hardware, supported by your backup software, and (funds allowing) commensurate with your storage capacity needs for the next couple of years. Common choices include 4mm DAT, 8mm (Exabyte or AIT), and DLT (2000, 4000, or 7000) drives. Additionally, the venerable QIC format now has high-capacity versions in the form of MLR2 and MLR3. You can also choose your drives with or without stacker capability. Make sure that you have quick access to at least one compatible replacement drive in case of hardware failure. Keeping an extra drive handy also allows for restoring more than one machine in parallel if necessary.
Choosing Backup Software
Choosing backup software can often be the most difficult planning decision because there are so many software packages from which to choose, including growing your own. For instance, a site with only a few machines may write their own backup utilities and run them by hand once a week, while a large 24x7 database-driven site may purchase an expensive "backup solution" that includes a number of automated features, "helper" modules for their databases, and a customer support contract. Often the larger a site becomes, the more important it is to automate critical procedures such as backups. There are many in-between solutions utilizing both freeware and commercial products. The trick is finding the one that best suits your site.
Finding the appropriate backup software for your site usually comes down to an issue of features versus cost. A few issues to consider when choosing or writing backup utilities are: compatibility/multiple platform support for heterogeneous networks, scalability, security, commercial support, cost, availability of source code, support for popular database (Oracle, Sybase, etc.) backup interfaces, tape and file cataloging, tape rotation scheduling, backup level scheduling, and resilience/recovery after a failed backup.
Commercial products often have more automation, dedicated support, and hooks for things such as live database backups, but they can also be costly. Such products are usually provided in binary-only format, increasing the possibility of coding insecurities and incompatibility with non-standard or uncommon systems. Conversely, with freeware and shareware, public scrutiny of the source code often makes for less buggy software in the final revision because many people run alpha and beta software and contribute ports, suggestions, and fixes. Although people usually attempt to help freeware and shareware users through mailing lists and newsgroups, technical support can't be guaranteed if you encounter a problem. Your proficiency in coding and debugging may well be a big factor in whether you decide to go with a commercial product or a freeware/shareware product.
If you choose to write your own backup software, you can obviously include as much functionality as you want, but when something breaks, someone in your organization must fix the problem. For this reason, code documentation is one of the most important things to remember when writing your own backup software. Home-grown solutions are often optimal for small, stable sites with few system administrators. Writing and supporting a home-grown backup system can turn into a full-time job for the unsuspecting system administrator, however, if the site suddenly outgrows the capabilities of the software. This is particularly true if the software is not well written or if the author leaves the company with undocumented code.
Creating a Backup Schedule
Once you decide on the hardware and the software for your backup scheme, you need to plan the site's backup schedule. Creating a backup schedule includes planning incremental and archive runs, deciding how far back data should be kept, testing tapes, and rotating old tapes out of the cycle. Each machine on your network may have varying amounts of important and volatile data, therefore each machine or each partition may have a unique schedule. What levels of backups you perform and how often you perform them depends largely on the total amount of data to be backed up, the importance of the data, how often the data changes, and what constitutes the acceptable level of resource usage (CPU cycles, network bandwidth, storage media, etc) at your site.
In general, make sure that you can always restore data from the time of the last archive run until as close as possible to the present. The fewer incremental restorations that need to be done, the faster the machine will be back up and running. For this reason, most sites with sensitive data perform archive backups frequently, sometimes skipping incremental backups altogether by doing archive backups every night. This is usually not a viable option for most sites, because it uses too many tapes and may take too long or bog down the network. More often, sites will run an archive backup of each partition once a week or month and then run some level of incremental backups in the intervening nights. Some sites may not even back up their data every night if there is little change or the data is not important enough.
Many sites also keep archive tapes for long periods of time, even indefinitely, but this is not a requirement for everyday emergency restoration. Storing archive backups for an extended period of time can be useful when you need to retrieve data from a period not spanned by your weekly/monthly backup cycle. Many sites will keep such items as end-of-month or end-of-year reports on archive tapes that are permanently stored in a secure location. For security purposes, many sites also have a permanently stored "virgin" backup of each machine, a copy of the system as it was when first brought onto the network. Virgin backups, along with successive stored archive backups since the time of installation can help to track when a security breach, such as a trojaned system binary, may have occurred.
Another important part of the backup schedule is periodically checking tapes for data integrity and deciding when to rotate tapes out of the normal cycle. Many sites keep writing to the same tape over and over and never realize that the tape can't be read until there's a need to do a restore. Be proactive; don't get caught in such scenarios. Since all storage media wear out after prolonged use, you should remove tapes from the backup cycle after a site-specified number of reads and writes. This helps prevent you from being stranded with corrupt and unusable data during a critical situation. You should also frequently pick a random tape and try to restore data to ensure that the tape is still good and that your backup software is still functioning properly.
Document, Document, Document
One of the most important things you can do besides having a disaster recovery strategy and performing backups is to document everything about your scheme. To bring the system back to a known state requires knowledge of the hardware, software, and any site-specific procedures. In an emergency, you can't rely on the fact that someone who knows the backup system inside and out will be there or even that any documentation you write will be available online. The key to rapid data recovery is ensuring that all required information is printed on paper in a secure but accessible location. This sounds like an amazing amount of work, and it is, but it is the one thing that can really save you in the event of a major crisis.
Everything necessary to bring a system back from the dead should be documented. This includes the contact numbers of any personnel who need to know about or might help recover from the outage, specific hardware specs and part numbers of each component, each disk's geometry and layout (including any special RAID configuration or other such modifications), directions on how to initialize the hardware (like disks) or software (like setting up software RAID devices), numbers and addresses of emergency replacement part vendors, detailed instructions on how to install and operate the base operating system and the restore software, location of tapes and what data is stored on each tape, along with any other site-specific instructions. All information should be cataloged, so that even a novice could follow the instructions and bring the system back online.
Once you've put your backup scheme together and documented everything, declare an "emergency" and have someone attempt to rebuild the system on a spare machine. Assume for this exercise that you are unreachable and someone who is not as familiar with the backup system must handle the crisis. Don't answer any questions if that person needs help; your documentation should speak for you. Time the recovery and carefully note any places where things did not go exactly as planned. Also make sure that all aspects of the disaster recovery plan are tested, including calling all the appropriate people, attempting to replace faulty hardware (but have the spare sitting by), and making sure that restorations can be done when the backup machine fails. This sort of test should point out the weaknesses in your analysis, implementation, and documentation if any exist. Make all necessary revisions and test the recovery again. Rinse, lather, and repeat until flawless.
An Evolving Procedure
Most sites are continually changing, sometimes by adding or removing hardware, upgrading software, or just tweaking a configuration file. Once you have a disaster recovery plan, don't undermine it by tossing it in a corner, never to see the light of day again. Each time something at your site changes, the same process should be followed: analyze, implement, document, test. Failing to keep your disaster recovery plans current can, at best, result in a headache while the person responsible for the recovery searches your documentation for information that doesn't exist. At worst, the lack of established procedure can result in an extended and costly downtime, not to mention departmental and corporate embarrassment and a potential loss of business.
If keeping your machines running is an important part of your job, you shouldn't be caught without a disaster recovery plan. It takes quite a bit of work to create the initial plan and stay on top of changes, but a well-planned and tested disaster recovery scheme can save you time, money, and face when disaster strikes. To help with your risk analysis and disaster recovery planning, GUIDE publishes a thorough pamphlet available for a small fee. See http://www.guide.org/ \
pubslist.htm#GPP236 for information on ordering this and other GUIDE pamphlets. The Federal Emergency Management Agency also provides a independent study program on natural and technological hazards to help develop emergency preparedness plans; see http://www.fema.gov/home/emi/is2.htm for more information.
About the Author
Amy Rich, president of the Boston-based Oceanwave Consulting, Inc. (http://www.oceanwave.com), has been a UNIX systems administrator for more than five years. She received a BSCS at Worcester Polytechnic Institute, and can be reached at: firstname.lastname@example.org.