Surviving Large-Scale Distributed Computing Powerdowns
Andrew B. Sherman and Yuval Lirov
Introduction
A full or partial building powerdown destroys an orderly
universe
in the data center and stimulates hardware and software
failures.
Unfortunately, building power outages are a necessary
evil. On one
hand, computer systems require conditioned and uninterruptible
power
sources. On the other hand, those very power sources
require periodic
testing and maintenance. Thus, a series of scheduled
outages is the
price paid for avoiding random power interruptions.
On the average
we experience one scheduled powerdown per month.
In a distributed computing environment, the difficulties
of coping
with powerdowns are compounded. The distributed environment
for a
large organization may comprise hundreds of systems
in the data center
and thousands of systems on desktops. These systems
may be administered
by a handful of people, each of whom may be responsible
for anywhere
from 50 to 150 machines. In such an environment powerdown
recovery
requires scalable procedures. Other factors that may
affect recovery
time include time constraints, imperfect configuration
management
data, a chaotic powerup sequence, stimulated disk failures,
and disguised
network outages.
The focus here is on data center powerdowns, since they
pose the greatest
threat. Each individual failure of a server can have
an impact on
a large number of users and pose a substantial business
risk. Against
that backdrop, failures of client machines are less
important individually,
although very trying in large numbers. Given these priorities,
recovery
from a powerdown affecting both the data center and
other business
areas should proceed in two stages, server recovery
and client recovery.
The objective is to minimize the time required for the
first stage,
to try to guarantee 100 percent server availability.
ICMP echo requests (pings) are typically used
to find faults after the powerup. However, ping can
only confirm
that a machine is sufficiently operational to run the
lowest layers
of the kernel networking code -- it cannot tell whether the machine is fully operational. And a damaged system
that escapes
detection (because it still answers pings) can exacerbate
confusion and even create cascading faults.
This article presents a diagnostic methodology that
reduces the time
needed for the server recovery stage. The methodology
exploits the
synergy of a few very simple tests. By emphasizing early
fault detection,
it helps system administrators and operators direct
their effort where
it is needed. At our site this methodology has eliminated
much of
the uncertainty that system administrators used to face
the Monday
morning after a weekend powerdown.
A Time-Critical Business Environment
Our organization, Lehman Brothers, is a global financial
services
firm. Thousands of Sun and Intel-based machines (hundreds
of them
residing in the data centers) are interconnected by
a global wide
area network, reaching beyond the New York and New Jersey
locations
to all domestic and international branch offices.
For the computer support staff, this means that there
are very few
times during the week when there isn't a business need
for many of
the systems. In North America alone, analysts, sales
people, and traders
use the systems from 7:00 AM to 6:00 PM Monday through
Friday, and production batch cycles run from 4:00 PM to
7:00 AM
Sunday through Friday. This leaves Saturday as the prime
day for any
maintenance activity, especially a powerdown that affects
the entire
data center.
The primary goal of our recovery procedures is to complete
the first
recovery phase (100 percent server availability) by
Sunday afternoon,
in time for the start of business in the Far East and
the start of
Sunday night production in New York. The secondary goal
is to complete
the second phase (100 percent of the desktop machines
in operation)
by the start of business on Monday.
Problems and Risks
Our organization experiences an average of one scheduled
powerdown
per month. Powerdowns affecting distributed systems
are coordinated
by the Distributed Operations group. This group produces
and runs
scripts that halt all affected machines and then turn
off power on
all machines and peripherals. When power is restored,
Operations turns
on power to all peripherals and processors. Each individual
support
group is responsible for taking care of details such
as cleanly halting
database servers prior to the powerdown.
There are several risks in this procedure, no matter
how effectively
it is carried out:
- The order in which machines are brought back up is dictated more by geography than by the software and hardware interdependencies. For example, many of our machines are dependent upon a set of AFS servers for software. In an ideal world, those servers would be powered up first.
- Disk failures occur in great numbers at power-cycle time. A basic and unchangeable fact of life of computers is that disks fail. You are inevitably reminded of that fact when you spin up a large number of disks at once.
- In a large data center, it is inevitable that some peripherals will not be turned on during the power-up process.
- How often, if ever, machines are routinely rebooted is a matter that varies widely with local tastes. If your taste is to leave your machines up for a very long time, a powerdown exposes you to hitherto undiscovered bugs in your start-up scripts.
- If you cross-monitor your systems, the entropy in your monitoring system will be rather high during and immediately after the powerup, until things stabilize.
- The databases that were shut down so cleanly need to be brought back up.
Bringing It All Back Home
From experience and from our analysis of the risk factors
we determined
that the single largest problem was coping with disks
that were off-line
(due to either component or human failure) when the
system was rebooted.
This is not surprising given that even in normal circumstances
the
most common hardware fault is a disk drive failure.
Facing this fact led us to change our normal reboot procedures, as distinct from the powerup procedures described below. Since trying to boot with dead
disks is a problem
at any time, our normal bootup procedure (see code in
Listing 1) now
issues a quick dd command for every disk referenced
either in /etc/fstab or in the database configuration files to
ensure
that each is spinning. If all the disks are okay, pages
and email
are sent to the SAs for the machine. If there is a disk
error, the
pages and email list the failed partitions. Later, the
reboot process
starts the log monitors and cross-monitors (if enabled).
On dataservers
an additional script starts up the server process and
database monitors,
and notifies the database administrator that the machine
is up.
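A minimal sketch of the idea behind that bootup check (not the actual Listing 1) follows. It assumes a Bourne shell, takes its partition list only from /etc/fstab, and uses a hypothetical notify_sas helper to stand in for the paging and email logic:

#!/bin/sh
# Sketch of the boot-time disk check, assuming partitions are listed in
# /etc/fstab (the real script also parses the database configuration
# files) and that a hypothetical notify_sas command sends the pages
# and email described in the text.

FSTAB=/etc/fstab
BAD=/tmp/bad_partitions.$$

# Collect every /dev partition named in fstab.
awk '$1 ~ /^\/dev\// { print $1 }' $FSTAB |
while read part
do
    # A single-block read is enough to prove the disk is spinning
    # and readable.
    dd if=$part of=/dev/null bs=512 count=1 >/dev/null 2>&1 ||
        echo $part >> $BAD
done

if [ -s $BAD ]
then
    notify_sas "`hostname`: disk errors on: `cat $BAD`"
else
    notify_sas "`hostname`: reboot complete, all disks spinning"
fi
rm -f $BAD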
This level of testing and notification is adequate for
the occasional
reboot. However, it is not an effective diagnostic procedure
during
a powerup. Since there is no particular reboot order,
there is no
way to predict when certain network-wide services, such
as paging
and mail, will be available. Further, given the level
of chaos in
the network, it is best to turn the cross-monitors off
before the
powerdown and leave them off until the network begins
to approach
a steady state.
The ultimate resolution entails performing extraordinary
surveillance
in an extraordinary situation. On the Saturday night
or Sunday morning
after a powerdown, we run a script that takes attendance
of all our
group's servers (see code in Listing 2). While the sample
code given
is driven by a file containing the hostnames, the script
as implemented
is driven by a netgroup (which originates from a central
configuration
management database) that identifies the universe of
servers. In either
case, for each machine in the list, the script runs
the following
diagnostic sequence:
1. Does the machine answer ICMP echo requests? (ping)
2. Is the RPC subsystem responding? (a local utility, rpcping)
3. Are the disks all spinning? (also tests the functioning of inetd and in.rshd)
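A stripped-down sketch of such a roll-call loop (in the spirit of Listing 2, but not the listing itself) might look like the following. It assumes a servers.list file of hostnames, Sun-style ping syntax, the site-local rpcping utility mentioned above (its interface is assumed here), and a chk_disks script at a hypothetical path on every server:

#!/bin/sh
# Sketch of a post-powerdown roll call over a list of servers.
# Assumes: a servers.list file of hostnames, the site-local rpcping
# utility, and a /usr/local/adm/chk_disks script (hypothetical path)
# on each server that prints failed partitions and is silent when
# all disks are spinning.

SERVERS=servers.list

while read host
do
    # 1. Does the machine answer ICMP echo requests?
    #    (Sun-style "ping host timeout" syntax.)
    if ping $host 5 >/dev/null 2>&1
    then
        :
    else
        echo "$host: no answer to ping"
        continue
    fi

    # 2. Is the RPC subsystem responding?  rpcping is the local
    #    utility mentioned in the text; rpcinfo -p would be a rough
    #    substitute.
    rpcping $host >/dev/null 2>&1 || echo "$host: RPC not responding"

    # 3. Are the disks all spinning?  Running the check over rsh also
    #    exercises inetd and in.rshd on the target.  rsh -n keeps rsh
    #    from eating the hostname list on stdin; any output at all
    #    (a failed partition, or an rsh error) is reported as a fault.
    bad=`rsh -n $host /usr/local/adm/chk_disks 2>&1`
    [ -n "$bad" ] && echo "$host: $bad"
done < $SERVERS

Every line of output identifies a server that needs a system administrator's or field engineer's attention before Sunday night production.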
The code for the disk check is shown in Listing 3. As
in the routine
bootup version of the disk check, the script identifies
all disk partitions
of interest from /etc/fstab and database configuration
files,
and checks their status with a dd command. In addition
to
running the disk check, this simple script serves as
an amazingly
robust test of the health of the system. If some major
piece of the
reboot procedure has failed, the chances are good that
the attempt
to run the disk check using rsh will fail. The early
damage
assessment provided by this diagnostic script gives
system administrators
and field service engineers a headstart on solving software
and hardware
problems. Once all the servers are operational or under
repair, the
recovery procedure can focus on the clients and can
begin cataloging
desktop machines that have not come back up.
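A correspondingly small sketch of that remote disk check (standing in for Listing 3, with the same assumptions as before) reuses the one-block dd read shown earlier, but prints the failed partitions and stays silent on success, so the roll-call script can treat any output as a fault:

#!/bin/sh
# Sketch of the remote disk check (the hypothetical
# /usr/local/adm/chk_disks).  Prints each partition that fails a
# one-block read; silent when all disks respond, so any output at all
# signals trouble to the calling script.

FSTAB=/etc/fstab   # the real script also reads the database config files

awk '$1 ~ /^\/dev\// { print $1 }' $FSTAB |
while read part
do
    dd if=$part of=/dev/null bs=512 count=1 >/dev/null 2>&1 || echo $part
done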
Experience
In the many powerdowns our site has experienced since
we implemented
these procedures, we have had from 98 to 100 percent
availability
of the servers by the start of Sunday night production,
and 100 percent
availability by the start of business Monday morning.
In general our
problem resolution priorities are: primary production
servers, backup
production servers, development servers. This has worked
for us, as
we have not missed the start of production since we
introduced these
procedures.
There are a few pitfalls, however. For powerdowns extending
beyond
the data center, we have seldom achieved 100 percent
client availability
on Monday morning. One reason for this is that despite
our best efforts
to dissuade them, users will sometimes have machines
moved without
notifying the System Administration groups, which means
that there
are invalid locations in our configuration management
system. It can
sometimes take days to find a machine on the fixed income
trading
floor if it is down, the primary user is on vacation,
and it was moved
without a database update. Furthermore, some machines sit behind locked office doors; depending on when the door was locked, they will either have been powered off uncleanly or still be down when their users arrive for work.
Finally, we advise any support manager coping with powerdowns
to expect
the unexpected. For example, it is implicitly assumed
in this discussion
that once power is restored on Saturday night there
will be no further
power outage. This is not necessarily a valid assumption:
Among other
possibilities, human errors by the people doing the power
maintenance
could lead to unscheduled outages later in the weekend.
In that case,
there would be no orderly process; rather, the entire
data center
(or building) would go down in an instant, and come
back all at once.
In such cases, the same risks appear but the failure
probabilities
are much higher.
Acknowledgment
The authors thank Jeff Borror for inspiration and his
vision of a
fault-tolerant computing environment.
About the Authors
Yuval Lirov is Vice President of Global UNIX Support
at Lehman
Brothers. He manages 24x7 administration of systems,
databases, and
production for 3,000 workstations. A winner of the Innovator '94 award from Software Development Trends magazine and of the Outstanding Contribution '87 award from the American Institute of Aeronautics and Astronautics, he has authored over 70 patents and technical
publications
in distributed systems management, troubleshooting,
and resource allocation.
Andrew Sherman is a manager of Systems Administration
in Lehman
Brothers' Global UNIX Support department. He is a cum
laude graduate
of Vassar College in Physics and received a Ph.D. in
Physics from
Rensselaer Polytechnic Institute. Prior to joining Lehman
Brothers,
he led the Unix Systems Support team at Salomon Brothers
and was a
major driver of the effort to standardize the system
architecture
there.
Drs. Lirov and Sherman are currently editing Mission
Critical
Systems Management -- an anthology to be published in
1996.