
The Golden Rules of Sun Systems Administration

Peter Baer Galvin

This month, Peter provides a hard-earned set of rules that can help guide systems administrators toward good systems practices. True-life examples show why these rules are so important.

When in Doubt, Reboot

As silly and hackneyed as it sounds, this is probably the most important rule. There are many circumstances when reboots are required, such as after installing kernel patches and after making changes to /etc/system. There are other times when a reboot is debatable. In those instances, the correct decision usually is to perform the reboot. For example, I once made a simple change to a startup script, and because it was so simple, I did not feel a test reboot was necessary. A month or so later, the system crashed and failed to complete its startup. After much debugging, I found that the change (which I had forgotten I had made) contained an error that caused the failure. A reboot right after the edit would have exposed the problem in minutes. Instead, it was time to cancel those service calls I'd made for new hardware and systems patches.
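A minimal sketch of that habit on Solaris follows; the script name is only a placeholder, and the grace period and message are whatever suits your site:

    # /etc/init.d/myapp is a placeholder; substitute the script you
    # actually edited.  sh -n parses the script without executing it.
    sh -n /etc/init.d/myapp && echo "syntax OK"

    # Then reboot during a maintenance window, while the change is
    # still fresh in your mind (-i 6 requests a reboot).
    shutdown -y -g 300 -i 6 "Rebooting to verify startup change"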

One client was having a problem with software that had been ported from Windows to Solaris. The vendor recommended that we reinstall the operating system and all the application software, which would have been a multi-day project. Rather than reinstall everything, we took the Sun approach and debugged the problem. It turned out to be a bad Sun jumbo patch cluster; backing down to the previous revision solved the problem and saved several days on the project.
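On Solaris, backing out a suspect patch is usually far cheaper than a reinstall. Roughly, and with a made-up patch ID standing in for the real one, the sequence looks like this:

    # List installed patches to identify the suspect revision
    # (108528-07 is an example ID, not a recommendation).
    showrev -p | grep 108528

    # Back the suspect patch out; this works only if the patch was
    # installed with backout data saved (that is, without patchadd -d).
    patchrm 108528-07

    # If needed, reinstall the earlier revision from wherever it was
    # unpacked, then reboot and retest.
    patchadd /var/tmp/108528-05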

Communicate with Users

Veteran sys admins realize that they are supposed to simplify the lives of users. Those who are going to enjoy long careers as admins take this to heart, and make decisions and recommendations based on usability, stability, and security. The antithesis of this rule is the joke I used to tell as a university sys admin: "This university job would be great if it wasn't for the dang students messing up my systems." Good admins realize that this is a joke, not gospel!

If You Have a Problem, Check the Cables

This rule is similar to the first question that appliance customer-service representatives ask when someone calls in for support -- is it plugged in? I've heard of (and experienced) several instances where the problem turned out to be a cable, even when it did not appear to be. In one instance, an I/O board was added to a server and the system then failed to boot properly. After much debugging, the problem was traced to a bad cable -- one that had not even been touched during the upgrade.

All Projects Take Twice the Estimated Time and Money

Project planning, even when performed by experienced planners, misses small but important steps that can delay the project or destroy the plan (or the resulting systems). The most experienced planners pad the time they think each step will take, doubling it for any given plan. Then, when the plan proves to be accurate, the planner is considered a genius. In reality, the planner thought "this step should take four days, so I'll put down eight".

It's Not Done Until It's Tested

Many sys admins like their work because of the technical challenges, minute details, and creative processes involved in running systems. The type of personality drawn to these challenges is not typically the type that will thoroughly test a solution, and then retest whenever a variable changes. Unfortunately for them, testing is required for system assurance and for good systems administration. I shudder to think how much system downtime, and how much sys admin time, has been wasted by less-than-thorough testing.

Never Change Anything on Fridays

Whatever the cause (in a hurry to get home, goblins, gamma rays), changes made just before weekends or trips frequently turn into disasters. Do not tempt the fates -- just wait until there are a couple of days in a row available to make the change and monitor it. Everyone will be happier.

Use Defaults Whenever Possible

I recall a conversation with a client in which the client was trying to go outside the box, save some time and money, and produce a complex but theoretically workable solution. The solution had problems, was difficult to debug, was difficult for vendors to support, and generally made the company miserable for a while.

Some sys admins, for another example, make changes for convenience that end up making the system different from others. Some still put files into /usr/local/bin, even though that location has been deprecated for many years. It may make users' lives easier because that's where they expect to find files, but other admins may be unpleasantly surprised when they use standard methods and find that they conflict with the current system configuration.

Create a Backup

Back up the entire system before making systemic changes, such as operating system upgrades or major application upgrades. Back up individual files before making minor changes.
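On a Solaris box, that can be as simple as a level 0 ufsdump of the affected filesystems before the big change, and a dated copy of a single file before the small one. A rough sketch, with the tape device and file names as examples only:

    # Systemic change coming: take a full (level 0) dump of the root
    # filesystem to tape and record it in /etc/dumpdates.
    ufsdump 0uf /dev/rmt/0 /

    # Minor change coming: keep a dated copy of the one file you are
    # about to edit.
    cp -p /etc/system /etc/system.`date +%Y%m%d`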

Avoid Poor Decisions from Above

Management can have bad data and make bad decisions, or can even have good data and make bad decisions. Admins, however, have to live with (and suffer with) these decisions. Take a stand, with reason and data, to "encourage" the correct decision. Even if you lose the battle, you can always say "I told you so" and feel good about yourself (while looking for a new job).

If You Haven't Seen It Work, It Probably Doesn't

Products are announced, purchased, and installed with the expectation that they will perform as advertised, and sometimes they actually do. Most of the time, however, they are over-promised and under-delivered.

If You Are Constantly Fighting Fires, Find the Source

Fighting computer system "fires", instead of finding the causes of those fires, wastes many man-hours. To avoid this problem, automate whenever possible (especially monitoring and log checking).
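A bare-bones log sweep run from cron goes a long way here. The following sketch assumes a Solaris-style /var/adm/messages; the paths, patterns, and recipient are examples, not standards:

    #!/bin/sh
    # checklogs.sh -- minimal log sweep, run hourly from root's crontab:
    #   0 * * * * /opt/admin/bin/checklogs.sh

    LOG=/var/adm/messages
    TMP=/tmp/checklogs.$$

    # Pull out anything that looks like trouble.  A fancier version
    # would remember what it has already reported.
    egrep -i 'error|warning|panic|fault' $LOG > $TMP

    # Mail the findings only if there is something to report.
    if [ -s $TMP ]; then
        mailx -s "`uname -n`: check $LOG" root < $TMP
    fi
    rm -f $TMP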

If You Don't Understand It, Don't Play with It on the Production Systems

In my university days, there was a student programmer who noticed the key switch on our main production Sun server, wondered what it did, and turned it to find out. He probably should have asked someone about it first, or at least tried it on a test system -- as is the case with just about everything administration-related.

If It Can be Accidentally Used, and Can Produce Bad Consequences, Protect It

For example, if there is a big red power button, right around shoulder height, at a spot on a wall that is just begging to be leaned against, be sure to put something uncomfortable around it to discourage that action. Otherwise, power to the entire workstation lab could be lost and lots of students could lose their work. Not that such a thing ever happened on the systems I was managing...

This rule should be extrapolated to just about everything users have access to, hardware or software.

Ockham's Razor is Very Sharp Indeed

William of Ockham (or Occam) advanced humanity immensely when he devised his "razor", or the "law of parsimony", which says that the simplest of two or more competing theories is the preferable one.

In one example where this rule was ignored, a production system kept rebooting and then crashing partway through the boot. The problem appeared during the boot sequence, when the disks were being fsck-checked and mounted. It looked as though the disks or the controller had a problem, but replacing the controller, moving the disks, and other time-wasting experiments proved it could not be the disks. A couple more disk-oriented experiments and lots of head-scratching later, we examined a similar system and noticed that networking was enabled right before the disks were. After the interface was brought up, packets started flowing, and during the disk-check step a bad packet was being received, crashing the system. Once the SGI server was prevented from annoying the Sun by sending it large packets (this was a long time ago), abject depression set in over the amount of human time we had wasted and the downtime we had caused by not looking for the root cause first.
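These days, a few minutes with snoop would have confirmed the network theory much sooner. A quick sketch, with the interface and host names as placeholders:

    # hme0 and sgibox stand in for the real interface and host.
    # Watch live traffic arriving on the server's interface:
    snoop -d hme0 host sgibox

    # Or capture to a file for later study with snoop -i:
    snoop -d hme0 -o /var/tmp/boot-trouble.cap host sgibox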

What Rules Do You Live By?

I hope you enjoyed these rules, and the practical examples of their use and abuse. If you have any to add to the "official" golden rules, please send them to me at pbg@petergalvin.org.

Peter Baer Galvin (http://www.petergalvin.org) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column, for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.