Disaster Prevention

Russ Hill

Disaster recovery has received a lot of press since the bombings in Oklahoma City and at the World Trade Center in New York, but little has been publicized about disaster prevention. It is not likely that your business will be bombed tomorrow, but what actions are you taking to prevent the disaster lurking in your facility? Without the experience of disaster, it is nearly impossible to be aware of all the potential sources of disaster and to take measures to prevent them. The intent of this article is to raise your awareness before disaster strikes and to help you be prepared.

If you feel that your maintenance agreement specifying a four-hour or next-day response time is sufficient, ask yourself how long it would take to recover your main fileserver if all the disk drives had to be replaced. How long would it take you just to get the new drives from your vendor? Do you even have a maintenance agreement, or do you handle failures on a case-by-case basis, paying for parts, labor, and materials? How much money would your company lose through lost revenues and lost work? Even with a four-hour guaranteed response maintenance agreement, it can still take days to get a failed system back up and operational. At the company I work for, we had a drive fail on a server, which prompted a support call. A maintenance representative showed up within four hours of the reported failure, but it still took three days to get the system operational again.

A Complete Disaster

When the system disk drive crashed on our main server, a call went immediately to the support contractor's service hotline. Three hours later a field support engineer arrived to replace the disk drive. After he had removed the failed drive, he realized that he had the wrong replacement drive. He immediately called the local office and discovered that they did not have the correct replacement drive, but they said one could be sent overnight. The next day, the support person came out with the new drive, and, once again, he had the wrong one. The system internal and external disk drives on the failed system had two different part numbers and two different styles. The support person had read the part number off of the external drives when he had placed the overnight order.

Again, he called the local office, and again they didn't have the correct drive in stock - another overnight order. The disk drive was finally replaced the following day at noon - two and a half days after the failure occurred. Restoring the data on the drive took another five hours, and 500 employees lost three complete business days!

This is not a story about an incompetent support contractor. These comedies of error happen all the time. The local support office normally keeps two drives on the shelf at all times, but two days before our drive failed, a power spike in the area had caused several companies to lose computer equipment at the same time. The replacement stocks were on ground transport when our server failed. (We were saved from the power spike because all of the systems in our building were supported by an uninterruptible power system with power conditioning.)

After the experience described above, our company allowed us to buy spare drives to keep on the shelf. We also bought a spare server as a backup. Our maximum time to repair dropped from three days to 15 minutes (plus data restoration time). We had the opportunity to prove that repair time just three weeks later when a second server went down. The cost of the spare system and disk drives was significantly less than the cost of paying 500 people who were unable to work for three days.

Disaster prevention is a form of insurance. It is very important to make sure that you insure your greatest risks; however, you do not want to overinsure by making every computer system redundant. That said, underinsuring is worse than overinsuring. The important question is: how much will it cost your company, per day, if all of the computer systems are not operational? This is not a subjective question. Everything in business has an associated cost, and you need to determine that cost. Assess it in terms of lost revenue and labor expenses per day. Compare that figure with the cost of spare equipment, factor in the benefit of each, and it becomes a simple decision to fund disaster prevention.
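The comparison can be worked through with a few lines of arithmetic. The sketch below is only an illustration: the 500 employees and the three-day outage come from the story above, but the labor cost, lost revenue, and spare equipment price are assumed figures that you would replace with your own.

```python
# Illustrative only: every dollar figure below is an assumption.

def daily_downtime_cost(employees, labor_cost_per_employee_day, lost_revenue_per_day):
    """Cost of one day with all systems down: idle labor plus lost revenue."""
    return employees * labor_cost_per_employee_day + lost_revenue_per_day


def break_even_days(spare_equipment_cost, cost_per_day):
    """Days of avoided downtime needed for the spares to pay for themselves."""
    return spare_equipment_cost / cost_per_day


cost_per_day = daily_downtime_cost(
    employees=500,                      # from the outage described above
    labor_cost_per_employee_day=200.0,  # assumed
    lost_revenue_per_day=50_000.0,      # assumed
)
outage_cost = 3 * cost_per_day          # a three-day outage
spares_cost = 40_000.0                  # assumed price of a spare server and drives

print(f"Cost per day of downtime:   ${cost_per_day:,.0f}")
print(f"Cost of a three-day outage: ${outage_cost:,.0f}")
print(f"Break-even for the spares:  {break_even_days(spares_cost, cost_per_day):.2f} days")
```

With numbers of this size, the spares pay for themselves in a fraction of a single avoided day, which is why the decision tends to be a simple one once the daily cost is known.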

System architecture will vary depending on the required uptime. Some systems require continuous uptime, others can be down a few hours a week, and still others can be down for days without anyone asking any questions. A business decision must be made as to how much uptime is necessary. A fully redundant fault-tolerant system is quite expensive; however, depending on the business need, it may be much cheaper than any amount of downtime. If some systems can tolerate some downtime, then either on-site or overnight spare parts may be the answer. As systems in your business become more critical, the uptime costs will need to be re-evaluated. For example, if your Web server is a critical part of your company's international marketing efforts, you will need continuous uptime. Zero tolerance for downtime could require your company to run several redundant Web servers located all over the world.

Disaster prevention is not a one-time task; it is a continuing process. Do not assume you can examine your company's current environment, allocate the necessary resources to prevent disaster, and then rest contentedly knowing that you have prevented all disasters for the next ten years. Like everything else in life, priorities change. For companies, priorities can change from quarter to quarter, week to week, and day to day. With these changing priorities, additional computer resources can become critical, and adequate resources must be available to prevent impending disaster.

The Basics of Disaster Prevention

Once the decision to invest in disaster prevention has been made, it is important to determine your Maximum Time To Repair (MTTR). This is NOT your Mean Time To Repair; it is the maximum time required to repair a single point failure (in this case, a single point is a single system). The MTTR should be determined for every foreseeable problem, whether it is a minor hardware problem or a major one. Once these MTTRs have been determined, efforts need to be made to reduce the times for each potential problem identified.
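The difference between the maximum and the mean is easy to see with a small table of repair estimates. The following is a minimal sketch; the failure names and hours are hypothetical and stand in for whatever list of foreseeable problems you draw up for your own systems.

```python
from statistics import mean

# Hypothetical repair-time estimates, in hours, for one server.
repair_hours = {
    "system disk failure": [4.0, 9.0, 77.0],
    "power supply failure": [2.0, 3.0, 4.0],
    "monitor failure": [0.5, 1.0],
}


def repair_summary(estimates):
    """Return {failure: (maximum_hours, mean_hours)} for each failure type."""
    return {name: (max(hours), mean(hours)) for name, hours in estimates.items()}


def worst_case(estimates):
    """Name of the failure with the largest Maximum Time To Repair."""
    return max(estimates, key=lambda name: max(estimates[name]))


for name, (maximum, average) in repair_summary(repair_hours).items():
    print(f"{name:22s} maximum {maximum:5.1f} h   mean {average:5.1f} h")
print("Reduce this one first:", worst_case(repair_hours))
```

The mean for the disk failure looks tolerable, but the maximum is what your users will live through on a bad week, so the maximum is the number to plan against.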

If your company handles failures on a case-by-case basis, it may take a significant period of time to get an order placed for replacement parts. This time can be reduced by having the paperwork completed (and signed if possible) prior to any failure. Even with this preparation, or if your company has hardware maintenance agreements, you may still be faced with an overnight shipment of the replacement parts, which in most cases amounts to a minimum of one lost work day. Having spare parts on hand, such as disk drives of the same models used by your systems (or bigger), monitors, controllers, motherboards, keyboards, and other parts, will reduce your wait time for replacement parts. In short, you should have the parts for making at least one complete and available spare replacement server. That way, if a failure occurs and you use a spare, you will have the luxury of replacing the spare at your convenience.

If your company has a maintenance agreement, work with the maintenance provider to split the work load. It is possible to create a situation producing more up-time for your company and less work for your maintenance provider. Ask the provider to train your hardware technician to maintain your systems, and then keep spare parts at your site. The maintenance provider can send a field support engineer to pick up the bad parts and bring new ones on a regular schedule. The maintenance provider then will not have to drive to your site for each problem. Additionally, because the maintenance provider's costs are reduced, the maintenance agreements will be less costly. Save yourself both time and money by becoming partners with your maintenance provider.

Plan for disasters. Plan for different types of disasters. A Florida site should probably plan for hurricanes. A Texas site might want to plan for hail storms and tornados. A site in a flood zone should plan how to prevent and recover from potential water damage. This type of planning for disaster is geographic as well as site specific. Each disaster plan should have a contingency to prevent damage to the systems - whether that involves raised floors for flood areas, or extra air conditioning units for sites that get hot. Different sites will have different risks.

Once you initiate a plan, you must be ready to recover from anything. This does not mean you can do just a simple analysis. It is important to know exactly how to restore your company's environment and to have that process documented. If you do not have a documented procedure for bringing your servers back up, then arrange a time for those servers (or their spares) to be unavailable and reload them from scratch. Reload them five or six times until you have documented exactly how to get them back up in any emergency. Until you can walk into your lab knowing exactly how long and what steps it would take you to get any of your servers operational, you are not preventing disaster; you are just hoping a disaster does not happen to you. If you are scared to reload any one of your systems from scratch, then you will lose valuable time when your system fails. Even if your company's environment is operating with fantastic panache, it's a good idea to reload your main server from scratch just to be sure you can. Once you've documented and tested the process, continue testing it on a regular basis to prevent any new potential snags from cropping up.

Most Critical: Backups

The most critical step in disaster prevention is the creation of valid backups. Valid backups are backups that you know are readable and timely. You should verify tapes to make sure that what you think is written on them is actually there. A simple way to do this is to create a table of contents file from the tape immediately after it has been written and compare that to a table of contents generated from disk. This is not a complete verification, though. A thorough verification that compares the files on tape to those on disk should be performed weekly.
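The table-of-contents comparison might look something like the sketch below. It assumes the backup was written as a tar archive and uses an ordinary archive file to stand in for the tape; the function names are made up for this example. Like the quick check described above, it compares file names only, not file contents.

```python
import os
import tarfile


def archive_table_of_contents(archive_path):
    """Names of the regular files recorded in the tar archive."""
    with tarfile.open(archive_path, "r") as archive:
        return {os.path.normpath(member.name)
                for member in archive.getmembers() if member.isfile()}


def disk_table_of_contents(root):
    """Paths of the regular files under root, relative to root."""
    names = set()
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(directory, name)
            # Skip symbolic links so both lists cover regular files only.
            if os.path.isfile(full) and not os.path.islink(full):
                names.add(os.path.normpath(os.path.relpath(full, root)))
    return names


def compare_contents(archive_path, root):
    """Return (missing_from_backup, unexpected_in_backup) as sorted lists."""
    on_tape = archive_table_of_contents(archive_path)
    on_disk = disk_table_of_contents(root)
    return sorted(on_disk - on_tape), sorted(on_tape - on_disk)
```

Two empty lists mean the names match. Files created after the backup started will show up as missing, so run the comparison as soon as the backup finishes.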

Tape use generates wear on the tape and requires that the tape be replaced after about 30 rewrites. Old tapes should be thrown away after that time. Readable data is worth more than a new tape, so commit your tapes to a life cycle. You should also clean the tape drives periodically. Tape oxides can build up on the read/write heads causing soft and eventually hard errors. Keep your backup tapes in a neat and clean storage area, preferably a fireproof data vault. All too often, a box of five 8-mm tapes left on top of the server represents the entire week's backups. If something were to happen to the server, the same fate could befall the backup tapes. Tapes should be removed from the lab and stored under lock and key in clean locations both within the company and off-site. It is not sufficient for administrators to take tapes home and store them in the garage. Make the effort to be neat and consistent - these tapes are very important.
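A tape life cycle only works if somebody counts the writes. Here is one minimal way to keep the count, assuming each tape carries a label; the labels and the dictionary used as a log are invented for this example, and the limit of 30 is simply the rule of thumb given above.

```python
MAX_WRITES = 30  # rule of thumb from the text; adjust to your media


def record_write(write_counts, tape_label):
    """Log one more write to a tape; return True if the tape should now be retired."""
    write_counts[tape_label] = write_counts.get(tape_label, 0) + 1
    return write_counts[tape_label] >= MAX_WRITES


def tapes_to_retire(write_counts):
    """Labels of every tape that has reached the write limit."""
    return sorted(label for label, count in write_counts.items()
                  if count >= MAX_WRITES)


# Hypothetical log: label -> number of times the tape has been written.
write_counts = {"MON-01": 29, "TUE-01": 3}
if record_write(write_counts, "MON-01"):
    print("Retire after tonight's backup:", tapes_to_retire(write_counts))
```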

Another disaster prevention activity that can help reduce recovery time is backing up the data on the system's boot disk drive to a spare disk drive and keeping it re-synched weekly. The boot drive should contain the operating system, so when you replace a failed boot drive with the spare, you can skip loading the operating system from scratch. If a data drive fails instead, you can still use the spare drive: restore the lost data onto it, overwriting the backed-up system data.

The Computer Center Lab

Another important step in preventing a disaster is keeping your computer center lab clean. You should proactively clean all systems and equipment, inside and out, twice a year with compressed air, removing dust from all surfaces. All boxes and unnecessary equipment should be removed. All cabling should be tied up and neatly kept. Keep the computer center lab a little colder than normal working conditions, and keep it locked at all times. Only authorized personnel should be allowed into the lab - everyone else should be escorted. No unauthorized person should be left alone in the lab. Performing these simple tasks can save you from many disasters. If your lab is utilized as a storage area, it is easy for someone to shift a box, disconnect or destroy a cable, or accidentally hit a switch or bump a keyboard, causing several hours of unnecessary work for the sys admin.

Do not work directly on the system consoles or on the console ports of systems unless it is absolutely necessary. Most consoles have special privileges and were meant to be locked inside a computer center lab. Also, most systems report error messages and system status through the system console. Connect a printer to the system console to provide a hardcopy log of the actions reported on the console. If you need to perform some work in the lab, get another small, independent system that you do not mind crashing and use that system in the lab. System console ports generally should not be used except for logging and emergencies.

Be static safe. Do not open a system's cabinet and begin fiddling with the system's boards and cables. Use a static wrist strap. Technicians sometimes use a wrist strap to remove a board and then lay the board directly on the floor. That is not the correct way to prevent static damage. Do not leave computer boards lying around the lab or on shelves without static protection. Static electricity damage can be very expensive, but can easily be prevented with the proper procedures.

Facilities issues also enter into disaster prevention. Extra air handlers are typically installed in computer center labs as most large server systems are not built to operate in temperatures over 80 degrees Fahrenheit (27 degrees Centigrade). Without proper air conditioning, most labs get hot very quickly. Your computer center lab should have an automatic power cut-off for use when the A/C fails. It should also have a manual override for those additional situations that warrant a power cut-off before the automatic system takes over. If your systems monitor the ambient temperature and perform a clean shutdown when temperatures exceed operating parameters, the automatic power cut-off should be set above that temperature. Heat can build up and damage system components.
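The ordering of the two thresholds is the part that is easy to get wrong. The sketch below shows the intended logic with assumed temperatures; the 85 and 95 degree settings are placeholders for illustration, not vendor recommendations.

```python
# Assumed thresholds in degrees Fahrenheit; use your vendor's operating limits.
CLEAN_SHUTDOWN_F = 85.0
POWER_CUTOFF_F = 95.0


def lab_action(temperature_f, clean_shutdown_f=CLEAN_SHUTDOWN_F,
               power_cutoff_f=POWER_CUTOFF_F):
    """Decide what should happen at a given lab temperature."""
    if power_cutoff_f <= clean_shutdown_f:
        # Otherwise the power would be cut before systems could shut down cleanly.
        raise ValueError("power cut-off must be set above the clean-shutdown temperature")
    if temperature_f >= power_cutoff_f:
        return "cut power"
    if temperature_f >= clean_shutdown_f:
        return "clean shutdown"
    return "normal"


for reading in (72.0, 88.0, 96.0):
    print(f"{reading:.0f} F: {lab_action(reading)}")
```

Leaving a gap between the two settings gives the systems time to finish a clean shutdown before the power cut-off acts as the last line of defense.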

Even if your systems will operate at high temperatures, they almost certainly are not built to operate under water. Studies have indicated that water-based fire suppression systems are more effective and less damaging in computer center fires than Halon or other systems, but they do require an automatic power cut-off before the water is applied. Of course, the best way to recover from water damage is to prevent it. Make sure your lab's fire suppression sprinklers are not connected to the system outside the lab, so that a fire erupting in another section of the building does not unnecessarily trigger the sprinklers in the lab. Water and computers are not a good combination in the best of circumstances; be sure to have some sheet plastic and tape on hand just in case the sprinkler system releases (or the roof leaks).

Network Traffic

Today, the computer center is just a small portion of a company's computing resources. Most companies have Local Area Networks tying many computing resources together. Although typically not a source of disaster, a poorly operating network can cause unnecessary delays, intermittent errors, and great hair loss for administrators. Disaster can happen when a problem occurs and the network is not documented and labeled. Place maps of the network in each wiring closet and near each router and bridge. Label every cable on both ends and match it on the maps. When a problem occurs, these maps will save valuable time. It is important to keep the maps updated, too.

Provide Proper Help

Provide a centralized help desk with a telephone "hotline." Route all problems through this help desk no matter what the problem is. Often, within a company, there are too many support telephone numbers, and users become frustrated by not knowing who to call. The help desk should have a problem tracking and reporting system to track each and every call that comes in. The help desk should also provide appropriate follow-up feedback for all calls. Post the help desk hotline number on each system. Also, post the procedures for booting and shutting down each system. Label the hostname of every machine directly on the machine in an easily visible location. Because monitors frequently fail and get replaced, the monitor is probably not the best place to label the machine. Label printers so that your users will know the name of each printer just by looking. Post a map with the locations of all other printers at each printer location, so users will know where to go when they cannot find their printout.

Everyone in the computer systems administration group should know the home telephone and beeper numbers of everyone else in the group. Outside of this group, it is important to post an authorized personnel contact list for use when no administrator is available. These lists should also be kept online and updated often. Each system vendor's customer support telephone number should also be posted in the computer center lab.

Make certain that any security guards also know who to call in case of emergency. Implement a rotating "hot" beeper policy within the system administration group so that every week someone "owns" the beeper and is on call 24 hours a day. Train every person in the group on the proper emergency procedures.
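A weekly rotation is simple enough to compute rather than maintain by hand. The following sketch assumes a fixed list of administrators and a rotation that changes hands every seven days from a chosen start date; the names and dates are placeholders.

```python
from datetime import date


def on_call(admins, rotation_start, day):
    """Return the administrator who owns the beeper on the given day.

    The beeper changes hands every seven days, starting on rotation_start.
    """
    weeks_elapsed = (day - rotation_start).days // 7
    return admins[weeks_elapsed % len(admins)]


# Placeholder names and start date (a Monday).
admins = ["alice", "bob", "carol"]
rotation_start = date(2024, 1, 1)

print(on_call(admins, rotation_start, date(2024, 1, 10)))
```

Posting the output next to the contact list, or generating the next few months of the schedule in advance, lets the security guards see at a glance who is on call.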

Become Proactive

The single most effective way to prevent a disaster is to become proactive. This can be accomplished in several ways, some of which have been discussed above. Additional efforts can be made to prevent disasters through the monitoring of your system's operational statistics. Check for increasing soft error counts on disk and tape activity. Check systems for continuing occurrences of zombie processes (zombie processes are processes that have exited but have not been cleaned up by their parent process). Check memory and disk utilization. Become proactive by looking at these statistics and fixing little problems before they become big problems. Physically examine each system when you clean it and replace any worn or broken parts.
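Some of these checks are easy to automate. The sketch below counts zombie processes and reports disk utilization. It assumes a ps command that accepts "-eo stat=" and marks zombies with a state beginning with "Z" (as the procps version on Linux does); other systems may need a different invocation. The warning limits are arbitrary examples.

```python
import shutil
import subprocess


def count_zombies(ps_stat_output):
    """Count lines whose process state begins with 'Z' (zombie)."""
    return sum(1 for line in ps_stat_output.splitlines()
               if line.strip().startswith("Z"))


def current_zombie_count():
    """Ask ps for the state of every process (assumes 'ps -eo stat=' works here)."""
    result = subprocess.run(["ps", "-eo", "stat="],
                            capture_output=True, text=True, check=True)
    return count_zombies(result.stdout)


def disk_percent_used(path):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def warnings(zombies, percent_used, max_zombies=5, max_percent=90.0):
    """Return warning messages; the default limits are arbitrary examples."""
    messages = []
    if zombies > max_zombies:
        messages.append(f"{zombies} zombie processes")
    if percent_used > max_percent:
        messages.append(f"disk {percent_used:.0f}% full")
    return messages
```

Run from a scheduler such as cron, a script built on these functions could mail the warnings to the administration group each morning, so the trend is visible before it becomes an outage.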

Scheduled Down Time

Few sites have the requirement of 100% uptime. Sites that do should already have the proper fault-tolerant equipment to handle those tough requirements. If your site does not have those requirements, schedule periodic down-time for maintenance. Most operating systems today are very mature and generally do not require periodic rebooting, but system performance may still be increased or system crashes prevented by periodic rebooting. As an operating system runs, buffers, process space, queues, and other structures are constantly being allocated and deallocated. If a process crashes or hangs, the resources used by that process can remain allocated and be unavailable for use by other processes, thereby reducing your system's performance. Proactive rebooting reduces the chances of an unexpected crash due to locked up resources.

In addition to a regular reboot, you might want to schedule a regular period of extended (four-hour) down-time during off hours to perform the periodic cleaning, reconfiguration, and maintenance required on the systems. A regular schedule will allow users and your maintenance provider to make proper arrangements.

Scheduled rebooting can be clean and at your command, which is better than letting your system crash periodically. If you have a system that crashes regularly, watch it carefully and check the system logs. Reboot the system often to head off crashes and try to determine the cause. When the system does crash, either you or your system vendor need to examine the crash dump to determine the root cause.

Systems do not need to crash. Often, a system crash indicates that the administrator is not doing his or her job. The administrator needs to proactively head off disaster, although he or she must have management support to do that. A system that crashes frequently may need to be replaced by a system from another vendor, or you may need a different maintenance provider. Replacing equipment is cheaper than repeated down-time.

System Administration Responsibility

When a server system does go down, your number one priority is to make the system operational again. Do not wait for anything - kludge it, hack it, whatever - get the system up. Do not wait for the maintenance provider to bring a new drive or motherboard. Assume you will get no help from anyone. Too often, help will not arrive on time, and you will be left hanging. If you assume the worst, you will not get burned. Remember, down-time is expensive. Make plans just in case the worst does happen.

Systems administration responsibilities are different at different sites. However, the responsibilities for each person in the disaster/recovery plan should be fully documented so everyone knows what to do in the event of a disaster.

About the Author

Russ is a production UNIX consultant with Lucent Technologies. He can be reached at rhill@dallas.cp.lucent.com.