Publisher's Forum
Well, after several years of nearly trouble-free operation, we have been blessed with a system administration nightmare. One of our large hard disks died in late January. Our initial response was "no big deal, we'll just replace it." Guess again. This type of disk is no longer available.
Our second response was, "OK, we'll replace the entire disk subsystem." (We needed some extra space and a better upgrade path anyway.) So we ordered a disk and controller from the nearest and quickest vendor. Unfortunately, the controller was so slow that we were left with an unusable system.
So now we're down to Plan C: reshuffle all our file systems so they fit on the remaining drive. Finally (after far more than a few hours of downtime), we are operating, but with a noticeable limp!
For one thing, the root file system (residing in a partition on the "good" drive) gets corrupted every time we write a large file to the /tmp directory. Apparently we have a marginal track near the end of this partition. When the partition becomes nearly full, the controller fails and the free-list becomes corrupted. We've tested the drive for bad sectors, reformatted the drive, and rebuilt the file systems -- and we still have an intermittent problem. AARGHH!
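In hindsight, a watchdog that warns before the partition fills would at least have reduced our exposure to the marginal track. The sketch below is only illustrative (the mount point and the 90% threshold are assumptions, not our actual configuration); it reports when the watched file system approaches the danger zone:

    #!/usr/bin/env python
    # Warn before a file system gets dangerously full.
    # A minimal sketch: the mount point and threshold are
    # illustrative assumptions, not our real configuration.
    import os
    import sys

    MOUNT_POINT = "/"     # file system to watch (assumed)
    THRESHOLD = 0.90      # warn at 90% full -- an arbitrary example

    def usage_fraction(path):
        """Return the fraction of the file system at path in use."""
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        avail = st.f_bavail * st.f_frsize  # space usable by non-root
        return 1.0 - (avail / total)

    if __name__ == "__main__":
        used = usage_fraction(MOUNT_POINT)
        if used >= THRESHOLD:
            sys.stderr.write("WARNING: %s is %.0f%% full\n"
                             % (MOUNT_POINT, used * 100))
            sys.exit(1)

Run from cron every few minutes, a check like this gives you a chance to clean out /tmp before the free-list gets clobbered.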
So we're benchmarking replacement machines. (We needed to increase our system capacity anyway.) When we're done we'll have more disk space, greater robustness, and nearly double the throughput (presuming our early measurements hold up under careful examination).
There are some lessons here (lessons that I thought I already knew). First, when faced with a failure in a mission-critical machine, the goal is not to get the machine fixed. The correct goal is to get the organization operating. If we had kept the correct goal firmly in mind, we would have reshuffled file space as a first response. Nothing else could have gotten us operational so quickly.
Second, even if you have a good relationship with a local vendor, you can't blindly rely upon the vendor's inventory for backup. The component you need might be out of production by the time you need it. From here on out, we'll have a complete set of backup components on site, or a binding agreement with the vendor.
Third, it's not enough to "know" how to recover from a serious failure. You must practice the procedure. It is amazing how "simple" revisions and configuration changes can increase the time necessary to fully restore a system. At one point we lost over an hour because of a minor mistake in the command that restores the root from tape. We had done it before, but so long ago that some of the details got "lost." From now on we practice major recoveries on a regular schedule.
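One way to keep those details from getting "lost" again is to make the drill itself a script. The sketch below is hypothetical, not our actual restore procedure: the step list is a placeholder to be replaced with your own documented commands, and the script simply walks an operator through the steps and times each one, so any drift shows up at the next scheduled practice.

    #!/usr/bin/env python
    # Time a recovery drill step by step.
    # A hypothetical sketch: the steps below are placeholders,
    # not our actual procedure -- substitute your own runbook.
    import time

    STEPS = [
        "Boot the miniroot from the installation tape",
        "newfs the root partition",
        "Restore root from the level-0 dump tape",
        "Install the boot blocks",
        "Reboot and verify the file systems",
    ]

    def run_drill(steps):
        start = time.time()
        for number, step in enumerate(steps, 1):
            t0 = time.time()
            input("Step %d: %s  [press Enter when done] " % (number, step))
            print("  elapsed: %.0f seconds" % (time.time() - t0))
        print("Total drill: %.1f minutes" % ((time.time() - start) / 60.0))

    if __name__ == "__main__":
        run_drill(STEPS)

Comparing the elapsed times from one drill to the next is a cheap way to spot the "simple" changes that quietly make a restore longer.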
Besides frustrating us, our downtime has affected some of our customers. We were off the net for several days, causing a significant delay in correcting an omission in our most recent code posting, and had problems with some of our mail. We apologize and ask your understanding. Those problems have been fixed (or at least patched). We think we see the "light at the end of the tunnel."
On a brighter note... With this issue Sys Admin passes a significant milestone: our first issue with over 100 pages! Starting very soon we'll be using this extra space to focus on specific topics. The next issue will have special coverage of performance and kernel tuning. That will be followed by issues with special coverage of network administration and file system maintenance. If you have useful tools or advice related to any of these special topics, please contact us. We're always looking for meaty, useful technical information.
Sincerely yours,
Robert Ward
saletter@rdpub.com ("...!uunet!rdpub!saletter")