Publisher's Forum
Well, after several years of nearly trouble-free operation, we have been blessed with a system administration nightmare. One of our large hard disks died in late January. Our initial response was "no big deal, we'll just replace it." Guess again. This type of disk is no longer available.
Our second response was, "OK, we'll replace the entire disk subsystem." (We needed some extra space and a better upgrade path anyway.) So we ordered a disk and controller from the nearest and quickest vendor. Unfortunately, the controller was so slow that we were left with an unusable system.
So now we're down to Plan C: reshuffle all our file systems so they fit on the remaining drive. Finally (after far more than a few hours of downtime), we are operating, but with a noticeable limp!
For one thing, the root file system (residing in a partition on the "good" drive) gets corrupted every time we write a large file to the /tmp directory. Apparently we have a marginal track near the end of this partition. When the partition becomes nearly full, the controller fails and the free-list becomes corrupted. We've tested the drive for bad sectors, reformatted the drive, and rebuilt the file systems -- and we still have an intermittent problem. AARGHH!
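In hindsight, a watchdog that warns before the partition fills would at least have reduced our exposure to the marginal track. The sketch below is only illustrative (the mount point and the 90% threshold are assumptions, not our actual configuration); it reports when the watched file system approaches the danger zone:

    #!/usr/bin/env python
    # Warn before a file system gets dangerously full.
    # A minimal sketch: the mount point and threshold are
    # illustrative assumptions, not our real configuration.
    import os
    import sys

    MOUNT_POINT = "/"     # file system to watch (assumed)
    THRESHOLD = 0.90      # warn at 90% full -- an arbitrary example

    def usage_fraction(path):
        """Return the fraction of the file system at path in use."""
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        avail = st.f_bavail * st.f_frsize  # space usable by non-root
        return 1.0 - (avail / total)

    if __name__ == "__main__":
        used = usage_fraction(MOUNT_POINT)
        if used >= THRESHOLD:
            sys.stderr.write("WARNING: %s is %.0f%% full\n"
                             % (MOUNT_POINT, used * 100))
            sys.exit(1)

Run from cron every few minutes, a check like this gives you a chance to clean out /tmp before the free-list gets clobbered.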
So we're benchmarking replacement machines. (We needed to increase our system capacity anyway.) When we're done we'll have more disk space, greater robustness, and nearly double the throughput (presuming our early measurements hold up under careful examination).
There are some lessons here (lessons that I thought I already knew). First, when faced with a failure in a mission-critical machine, the goal is not to get the machine fixed. The correct goal is to get the organization operating. If we had kept the correct goal firmly in mind, we would have reshuffled file space as a first response. Nothing else could have gotten us operational so quickly.
Second, even if you have a good relationship with a local vendor, you can't blindly rely upon the vendor's inventory for backup. The component you need might be out of production by the time you need it. From here on out, we'll have a complete set of backup components on site, or a binding agreement with the vendor.
Third, it's not enough to "know" how to recover from a serious failure. You must practice the procedure. It is amazing how "simple" revisions and configuration changes can increase the time necessary to fully restore a system. At one point we lost over an hour because of a minor mistake in the command that restores the root from tape. We had done it before, but so long ago that some of the details got "lost." From now on we practice major recoveries on a regular schedule.
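One way to keep those details from getting "lost" again is to make the drill itself a script. The sketch below is hypothetical, not our actual restore procedure: the step list is a placeholder to be replaced with your own documented commands, and the script simply walks an operator through the steps and times each one, so any drift shows up at the next scheduled practice.

    #!/usr/bin/env python
    # Time a recovery drill step by step.
    # A hypothetical sketch: the steps below are placeholders,
    # not our actual procedure -- substitute your own runbook.
    import time

    STEPS = [
        "Boot the miniroot from the installation tape",
        "newfs the root partition",
        "Restore root from the level-0 dump tape",
        "Install the boot blocks",
        "Reboot and verify the file systems",
    ]

    def run_drill(steps):
        start = time.time()
        for number, step in enumerate(steps, 1):
            t0 = time.time()
            input("Step %d: %s  [press Enter when done] " % (number, step))
            print("  elapsed: %.0f seconds" % (time.time() - t0))
        print("Total drill: %.1f minutes" % ((time.time() - start) / 60.0))

    if __name__ == "__main__":
        run_drill(STEPS)

Comparing the elapsed times from one drill to the next is a cheap way to spot the "simple" changes that quietly make a restore longer.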
Besides frustrating us, our downtime has affected some of our customers. We were off the net for several days, causing a significant delay in correcting an omission in our most recent code posting, and had problems with some of our mail. We apologize and ask your understanding. Those problems have been fixed (or at least patched). We think we see the "light at the end of the tunnel."
On a brighter note... With this issue Sys Admin passes a significant milestone: our first issue with over 100 pages! Starting very soon we'll be using this extra space to focus on specific topics. The next issue will have special coverage of performance and kernel tuning. That will be followed by issues with special coverage of network administration and file system maintenance. If you have useful tools or advice related to any of these special topics, please contact us. We're always looking for meaty, useful technical information.
Sincerely yours,
Robert Ward
saletter@rdpub.com ("...!uunet!rdpub!saletter")