Well, after several years of nearly trouble-free operation, we have been blessed with a system administration nightmare. One of our large hard disks died in late January. Our initial response was, "no big deal, we'll just replace it." Guess again. This type of disk is no longer available.
Our second response was, "OK, we'll replace the disk subsystem." (We needed some extra space and a better upgrade path anyway.) So, we ordered a disk and controller from the nearest and quickest vendor. Unfortunately, the controller was so slow that we were left with an unusable system.
So now we're down to Plan C: reshuffle all our files so they fit on the remaining drive. Finally (after far more than a few hours of downtime), we are operating, but with a few problems. For one thing, the root file system (residing in a partition on our "good" drive) gets corrupted every time we write a large file to the /tmp directory. Apparently we have a marginal track near the end of this partition. When the partition becomes nearly full, the controller fails and the free-list becomes corrupted. We have scanned the drive for bad sectors, reformatted the drive, and rebuilt the file systems -- and we still have an intermittent problem.
So we're benchmarking replacement machines. (We needed to expand our system capacity anyway.) When we're done, we'll have more disk space, greater robustness, and nearly double the throughput (if our early measurements hold up under careful examination).
There are some lessons here (lessons that I thought I already knew). First, when faced with a failure in a mission-critical system, the goal is not to get the machine fixed. The correct goal is to get the organization operating. If we had kept the correct goal firmly in mind, we would have reshuffled file space as a first response. Nothing else could have gotten us operational so quickly.
Second, even if you have a good relationship with a vendor, you can't blindly rely upon the vendor's inventory for spares. The component you need might be out of production by the time you need it. From here on out, we'll have a complete set of backup components on site, or a binding agreement with the vendor.
Third, it's not enough to "know" how to recover from a serious failure. You must practice the procedure. It is amazing how much "simple" revisions and configuration changes can add to the time necessary to fully restore a system. At one point we lost over an hour because of a minor mistake in the command used to restore the root from tape. We had done it before, but so long ago that some of the details got "lost." From now on we will practice major recoveries on a regular schedule.
Besides frustrating us, our downtime has affected some of our customers. We were off the net for several days, causing a delay in correcting an omission in our most recent code, and we had problems with some mail communication. We apologize and ask your understanding. Those problems have been fixed (or at least we think so). We think we see the "light at the end of the tunnel."
On a brighter note . . . With this issue, Sys Admin passes a significant milestone: our first issue with over 100 pages. Very soon we'll be using this extra space to focus on special topics. The next issue will have special coverage of performance tuning. That will be followed by issues with special coverage of network administration and file system maintenance. If you have experience or advice that relates to any of these special topics, please contact us. We're always looking for meaty, useful technical material.
email@example.com (". . . !uunet!rdpub!saletter")