Surviving Large-Scale Distributed Computing Powerdowns
Andrew B. Sherman and Yuval Lirov
Introduction
A full or partial building powerdown destroys an orderly
universe
in the data center and stimulates hardware and software
failures.
Unfortunately, building power outages are a necessary
evil. On one
hand, computer systems require conditioned and uninterruptible
power
sources. On the other hand, those very power sources
require periodic
testing and maintenance. Thus, a series of scheduled
outages is the
price paid for avoiding random power interruptions.
On the average
we experience one scheduled powerdown per month.
In a distributed computing environment, the difficulties
of coping
with powerdowns are compounded. The distributed environment
for a
large organization may comprise hundreds of systems
in the data center
and thousands of systems on desktops. These systems
may be administered
by a handful of people, each of whom may be responsible
for anywhere
from 50 to 150 machines. In such an environment powerdown
recovery
requires scalable procedures. Other factors that may
affect recovery
time include time constraints, imperfect configuration
management
data, a chaotic powerup sequence, stimulated disk failures,
and disguised
network outages.
The focus here is on data center powerdowns, since they
pose the greatest
threat. Each individual failure of a server can have
an impact on
a large number of users and pose a substantial business
risk. Against
that backdrop, failures of client machines are less
important individually,
although very trying in large numbers. Given these priorities,
recovery
from a powerdown affecting both the data center and
other business
areas should proceed in two stages, server recovery
and client recovery.
The objective is to minimize the time required for the
first stage,
to try to guarantee 100 percent server availability.
ICMP echo requests (pings) are typically used
to find faults after the powerup. However, ping can
only confirm
that a machine is sufficiently operational to run the
lowest layers
of the kernel networking code -- it cannot tell whether the machine is fully operational. And a damaged system
that escapes
detection (because it still answers pings) can exacerbate
confusion and even create cascading faults.
This article presents a diagnostic methodology that
reduces the time
needed for the server recovery stage. The methodology
exploits the
synergy of a few very simple tests. By emphasizing early
fault detection,
it helps system administrators and operators direct
their effort where
it is needed. At our site this methodology has eliminated
much of
the uncertainty that system administrators used to face
the Monday
morning after a weekend powerdown.
A Time-Critical Business Environment
Our organization, Lehman Brothers, is a global financial
services
firm. Thousands of Sun and Intel-based machines (hundreds
of them
residing in the data centers) are interconnected by
a global wide
area network, reaching beyond the New York and New Jersey
locations
to all domestic and international branch offices.
For the computer support staff, this means that there
are very few
times during the week when there isn't a business need
for many of
the systems. In North America alone, analysts, sales
people, and traders
use the systems from 7:00 AM to 6:00 PM Monday through
Friday, and production batch cycles run from 4:00 PM to
7:00 AM
Sunday through Friday. This leaves Saturday as the prime
day for any
maintenance activity, especially a powerdown that affects
the entire
data center.
The primary goal of our recovery procedures is to complete
the first
recovery phase (100 percent server availability) by
Sunday afternoon,
in time for the start of business in the Far East and
the start of
Sunday night production in New York. The secondary goal
is to complete
the second phase (100 percent of the desktop machines
in operation)
by the start of business on Monday.
Problems and Risks
Our organization experiences an average of one scheduled
powerdown
per month. Powerdowns affecting distributed systems
are coordinated
by the Distributed Operations group. This group produces
and runs
scripts that halt all affected machines and then turn
off power on
all machines and peripherals. When power is restored,
Operations turns
on power to all peripherals and processors. Each individual
support
group is responsible for taking care of details such
as cleanly halting
database servers prior to the powerdown.
There are several risks in this procedure, no matter
how effectively
it is carried out:
- The order in which machines are brought back up is dictated more by geography than by the software and hardware interdependencies. For example, many of our machines are dependent upon a set of AFS servers for software. In an ideal world, those servers would be powered up first.
- Disk failures occur in great numbers at power-cycle time. A basic and unchangeable fact of life of computers is that disks fail. You are inevitably reminded of that fact when you spin up a large number of disks at once.
- In a large data center, it is inevitable that some peripherals will not be turned on during the power-up process.
- How often, if ever, machines are routinely rebooted is a matter that varies widely with local tastes. If your taste is to leave your machines up for a very long time, a powerdown exposes you to hitherto undiscovered bugs in your start-up scripts.
- If you cross-monitor your systems, the entropy in your monitoring system will be rather high during and immediately after the powerup, until things stabilize.
- The databases that were shut down so cleanly need to be brought back up.
Bringing It All Back Home
From experience and from our analysis of the risk factors
we determined
that the single largest problem was coping with disks
that were off-line
(due to either component or human failure) when the
system was rebooted.
This is not surprising given that even in normal circumstances
the
most common hardware fault is a disk drive failure.
Facing this fact led us to change our normal reboot procedures, as distinct from the powerup procedures described below. Since trying to boot with dead
disks is a problem
at any time, our normal bootup procedure (see code in
Listing 1) now
issues a quick dd command for every disk referenced
either in /etc/fstab or in the database configuration files to
ensure
that each is spinning. If all the disks are okay, pages
and email
are sent to the SAs for the machine. If there is a disk
error, the
pages and email list the failed partitions. Later, the
reboot process
starts the log monitors and cross-monitors (if enabled).
On dataservers
an additional script starts up the server process and
database monitors,
and notifies the database administrator that the machine
is up.
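A minimal sketch of the idea behind that bootup check (not the actual Listing 1) follows. It assumes a Bourne shell, takes its partition list only from /etc/fstab, and uses a hypothetical notify_sas helper to stand in for the paging and email logic:

#!/bin/sh
# Sketch of the boot-time disk check, assuming partitions are listed in
# /etc/fstab (the real script also parses the database configuration
# files) and that a hypothetical notify_sas command sends the pages
# and email described in the text.

FSTAB=/etc/fstab
BAD=/tmp/bad_partitions.$$

# Collect every /dev partition named in fstab.
awk '$1 ~ /^\/dev\// { print $1 }' $FSTAB |
while read part
do
    # A single-block read is enough to prove the disk is spinning
    # and readable.
    dd if=$part of=/dev/null bs=512 count=1 >/dev/null 2>&1 ||
        echo $part >> $BAD
done

if [ -s $BAD ]
then
    notify_sas "`hostname`: disk errors on: `cat $BAD`"
else
    notify_sas "`hostname`: reboot complete, all disks spinning"
fi
rm -f $BAD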
This level of testing and notification is adequate for
the occasional
reboot. However, it is not an effective diagnostic procedure
during
a powerup. Since there is no particular reboot order,
there is no
way to predict when certain network-wide services, such
as paging
and mail, will be available. Further, given the level
of chaos in
the network, it is best to turn the cross-monitors off
before the
powerdown and leave them off until the network begins
to approach
a steady state.
The ultimate resolution entails performing extraordinary
surveillance
in an extraordinary situation. On the Saturday night
or Sunday morning
after a powerdown, we run a script that takes attendance
of all our
group's servers (see code in Listing 2). While the sample
code given
is driven by a file containing the hostnames, the script
as implemented
is driven by a netgroup (which originates from a central
configuration
management database) that identifies the universe of
servers. In either
case, for each machine in the list, the script runs
the following
diagnostic sequence:
1. Does the machine answer ICMP echo requests? (ping)
2. Is the RPC subsystem responding? (a local utility, rpcping)
3. Are the disks all spinning? (also tests the functioning of inetd and in.rshd)
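A stripped-down sketch of such a roll-call loop (in the spirit of Listing 2, but not the listing itself) might look like the following. It assumes a servers.list file of hostnames, Sun-style ping syntax, the site-local rpcping utility mentioned above (its interface is assumed here), and a chk_disks script at a hypothetical path on every server:

#!/bin/sh
# Sketch of a post-powerdown roll call over a list of servers.
# Assumes: a servers.list file of hostnames, the site-local rpcping
# utility, and a /usr/local/adm/chk_disks script (hypothetical path)
# on each server that prints failed partitions and is silent when
# all disks are spinning.

SERVERS=servers.list

while read host
do
    # 1. Does the machine answer ICMP echo requests?
    #    (Sun-style "ping host timeout" syntax.)
    if ping $host 5 >/dev/null 2>&1
    then
        :
    else
        echo "$host: no answer to ping"
        continue
    fi

    # 2. Is the RPC subsystem responding?  rpcping is the local
    #    utility mentioned in the text; rpcinfo -p would be a rough
    #    substitute.
    rpcping $host >/dev/null 2>&1 || echo "$host: RPC not responding"

    # 3. Are the disks all spinning?  Running the check over rsh also
    #    exercises inetd and in.rshd on the target.  rsh -n keeps rsh
    #    from eating the hostname list on stdin; any output at all
    #    (a failed partition, or an rsh error) is reported as a fault.
    bad=`rsh -n $host /usr/local/adm/chk_disks 2>&1`
    [ -n "$bad" ] && echo "$host: $bad"
done < $SERVERS

Every line of output identifies a server that needs a system administrator's or field engineer's attention before Sunday night production.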
The code for the disk check is shown in Listing 3. As
in the routine
bootup version of the disk check, the script identifies
all disk partitions
of interest from /etc/fstab and database configuration
files,
and checks their status with a dd command. In addition
to
running the disk check, this simple script serves as
an amazingly
robust test of the health of the system. If some major
piece of the
reboot procedure has failed, the chances are good that
the attempt
to run the disk check using rsh will fail. The early
damage
assessment provided by this diagnostic script gives
system administrators
and field service engineers a headstart on solving software
and hardware
problems. Once all the servers are operational or under
repair, the
recovery procedure can focus on the clients and can
begin cataloging
desktop machines that have not come back up.
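A correspondingly small sketch of that remote disk check (standing in for Listing 3, with the same assumptions as before) reuses the one-block dd read shown earlier, but prints the failed partitions and stays silent on success, so the roll-call script can treat any output as a fault:

#!/bin/sh
# Sketch of the remote disk check (the hypothetical
# /usr/local/adm/chk_disks).  Prints each partition that fails a
# one-block read; silent when all disks respond, so any output at all
# signals trouble to the calling script.

FSTAB=/etc/fstab   # the real script also reads the database config files

awk '$1 ~ /^\/dev\// { print $1 }' $FSTAB |
while read part
do
    dd if=$part of=/dev/null bs=512 count=1 >/dev/null 2>&1 || echo $part
done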
Experience
In the many powerdowns our site has experienced since
we implemented
these procedures, we have had from 98 to 100 percent
availability
of the servers by the start of Sunday night production,
and 100 percent
availability by the start of business Monday morning.
In general our
problem resolution priorities are: primary production
servers, backup
production servers, development servers. This has worked
for us, as
we have not missed the start of production since we
introduced these
procedures.
There are a few pitfalls, however. For powerdowns extending
beyond
the data center, we have seldom achieved 100 percent
client availability
on Monday morning. One reason for this is that despite
our best efforts
to dissuade them, users will sometimes have machines
moved without
notifying the System Administration groups, which means
that there
are invalid locations in our configuration management
system. It can
sometimes take days to find a machine on the fixed income
trading
floor if it is down, the primary user is on vacation,
and it was moved
without a database update. Furthermore, some machines sit behind locked office doors; depending on when the door was locked, they will either have been powered off uncleanly or still be down when their users arrive for work.
Finally, we advise any support manager coping with powerdowns
to expect
the unexpected. For example, it is implicitly assumed
in this discussion
that once power is restored on Saturday night there
will be no further
power outage. This is not necessarily a valid assumption:
Among other
possibilities, human errors by the people doing the power
maintenance
could lead to unscheduled outages later in the weekend.
In that case,
there would be no orderly process; rather, the entire
data center
(or building) would go down in an instant, and come
back all at once.
In such cases, the same risks appear but the failure
probabilities
are much higher.
Acknowledgment
The authors thank Jeff Borror for inspiration and his
vision of a
fault-tolerant computing environment.
About the Authors
Yuval Lirov is Vice President of Global UNIX Support
at Lehman
Brothers. He manages 24x7 administration of systems,
databases, and
production for 3,000 workstations. A winner of the Innovator '94 award from Software Development Trends magazine and of the Outstanding Contribution '87 award from the American Institute of Aeronautics and Astronautics, he has authored over 70 patents and technical
publications
in distributed systems management, troubleshooting,
and resource allocation.
Andrew Sherman is a manager of Systems Administration
in Lehman
Brothers' Global UNIX Support department. He is a cum
laude graduate
of Vassar College in Physics and received a Ph.D. in
Physics from
Rensselaer Polytechnic Institute. Prior to joining Lehman
Brothers,
he led the Unix Systems Support team at Salomon Brothers
and was a
major driver of the effort to standardize the system
architecture
there.
Drs. Lirov and Sherman are currently editing Mission
Critical
Systems Management -- an anthology to be published in
1996.