The Great Solaris Administration Best Practices Debate

Peter Baer Galvin

In the February 2002 Sys Admin magazine, my column covered "The Golden Rules of Sun Systems Administration" (http://www.sysadminmag.com/documents/s=2293/sam0202j/sam0202j.htm). This led to a flurry of feedback, point, and counterpoint about the true Solaris systems administration best practices. This month, my column includes the points from readers, and my view on them.

So much time, effort, and discussion is spent on specific sys admin tasks ("how to get this thing to do that thing"), and relatively little time is spent on the larger issue of the best ways to install systems, the best ways to run systems, and in general, the best practices of systems administrators. Systems administration is a journeyman's trade. Most admins don't earn a degree in systems administration. Most don't even take any courses in it, except for the occasional "how to" tutorial or vendor course. Rather, good sys admin practices are learned on the job, through interactions with other administrators and direct experience (i.e., "the hard way").

Thus, I believe that a Solaris Sys Admin Best Practices FAQ is a good and useful thing to create and maintain. This month, I include new ideas and points of discussion from readers. In the May issue of Sys Admin, the wisdom from "Golden Rules" and here, plus your new feedback, will be collected into such a FAQ. I hope it will continue to evolve and improve over time, again with reader input.

Without further ado, here is the most salient feedback on the Golden Rules column. Stewart Dean wrote:

Comments/Additions:

1. PAY ATTENTION: Your world is hostile; "smell" the wind, listen to the slightest noise, *know* your world. You can generally deal with problems two ways:

Early and cheap, if you're paying attention and are proactive
Late and expensive, if you are smashed under the avalanche

So, pay attention to how things work when they work well; pay attention when things change. Devise trip wires and reports. Related to this: pay attention to error messages and clean up the errors behind them. You ignore them at your peril: the problem will snowball and devour you at the worst possible time and/or messages will mount up until they hide a really important one that you miss because you begin ignoring all messages.
2. STRATEGY & TACTICS: It's essential to learn the difference between strategy and tactics and learn the place for both. Strategy is arranging the battlefield so that your chances are maximized: you may thus be able to win easily or without even a fight. Tactics refer to hand-to-hand combat. You win by being good at both.
Subrules:
a. An ounce of strategy can be worth a couple of tons of tactics. See Rule 1.
b. Don't be overly clever with strategy where tactics will do. Life is short.
3. HELP USERS FIX IT THEMSELVES: If you have reports that tell you about quota abuse, overly large mail Trash and Sent folders and the like, extend the exec to send messages to users informing them of the approaching fubar while they can still deal with it. Besides, you will have the immense satisfaction of being able to say to them, "You got a warning message, you turkey, why didn't you pay attention", when they call up to whine at you.
[Plus, following this rule tends to get users who appreciate and respect you, and who are your allies later, when you need it. - PBG]
4. HTML SYSADMIN DOC: Document your system administration by using basic HTML (Netscape Composer is plenty sufficient). It is "indexed": grep on the search argument; it hyperlinks to your own stuff and that of your vendors. You can find the answers to fubars that happened a year ago (when they recur).
You can access the data from anywhere (although you'll want to restrict that naturally); you can ftp it all from the server to your laptop for when you're at a remote site or on vacation and problems occur. You can share the info and add contributions easily. You can burn a CD of it to archive with the backups.
[I've frequently kicked myself for not writing down some solution to some problem that recurred. If I'd taken 10 minutes to document it, I would have saved 10 hours at the next occurrence. - PBG]
Another Occam's Razor corollary. I thought I thought it up, but Dick Paddock seems to have originated it: Never attribute to malice what can be explained as the result of sheer idiot stupidity. Re: users, bosses, the guy who left you the mess you're cleaning up, etc.
[Very true. I learned long ago not to pass judgment on whoever created or left the mess that I was dealing with. I once criticized a system configuration in front of a client, and it turned out to have been done by one of my co-workers! - PBG]
Some minor things:
a. Most cell phones now come with text messaging so it's now dead easy to have your trip wire execs call your cell phone.
b. Don't just save a backup, also save your system configuration (with Sun, it's essential, with AIX less so) and your system admin doc.
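
As a rough illustration of the trip wires and user warnings described above, here is a minimal sketch in Python. The language choice, the `Usage` record, the `build_warnings` function, the 90% threshold, and the sample users are all assumptions made for this example; none of them come from the letter. A real version would read usage from the quota system and mail or page the message rather than print it.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    user: str      # login name (hypothetical sample data below)
    used_mb: int   # space currently used, in MB
    quota_mb: int  # space allowed, in MB


def build_warnings(usages, threshold=0.9):
    """Return (user, message) pairs for users at or above `threshold` of quota."""
    warnings = []
    for u in usages:
        if u.quota_mb <= 0:
            continue  # no quota set, so there is nothing to compare against
        fraction = u.used_mb / u.quota_mb
        if fraction >= threshold:
            msg = (f"{u.user}: you are using {u.used_mb} MB of your "
                   f"{u.quota_mb} MB quota ({fraction:.0%}). "
                   "Please clean up before you hit the limit.")
            warnings.append((u.user, msg))
    return warnings


if __name__ == "__main__":
    # Made-up sample data; only the first user crosses the threshold.
    sample = [Usage("alice", 950, 1000), Usage("bob", 200, 1000)]
    for user, msg in build_warnings(sample):
        print(msg)  # a real trip wire would mail or page this instead
```

Keeping the message-building step separate from the delivery step means the same report can feed email to users and a text message to the administrator.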

That reminds me of one of the more important but hard to understand tenets I follow: do not be "too" clever. Of course the challenge is in knowing when "too" is occurring. For example, a client was devising multiple, complicated ways to use a set of systems, rather than simplifying the solution by using more equipment. There are always tradeoffs like that in any facility architecture. I was concerned by the complexity and advised them of this tenet. The systems manager in charge responded, "oh, I'm not". He was, and it ended up costing him his job when the facility failed and failed again. Frequently, implementation and management complexity is unnecessary and results from "too clever" systems administration.

An anonymous reader is miffed about one of my golden rule statements:

Since you say that putting files in /usr/local/bin is "deprecated", how about informing the totally clueless where the now standard location would be?

Well, as with all of the best practice advice, there are really no hard-and-fast rules. In this case, the direction of Sun, as discerned by their installation changes over time, is to make /usr as read-only as possible. Given that direction, /opt/local is the best place to make local changes. If users (or administrators) cannot stand the change, a link from /usr/local to /opt/local is reasonable.

Ken Stone states:

I read your article on the Golden rules of Sun administration and can appreciate them all. We have recently begun collecting what we call principles. Many are similar to what you have documented, but here are a couple you might find merit in adding to the list:

Completion backward principle -- Never move forward with something unless you are fully prepared to return the server to the original starting point. Make images, back stuff up and test, make sure you have the original CDs, or whatever; just make sure you do it before you begin.

Gita principle -- When users diagnose their own problems and provide a solution based upon their limited knowledge of possible solutions. This is a little tougher to explain. Basically it's important as an admin to recognize when this is happening while communicating with users. The example we use is when someone brings his car to a shop for a tune-up; usually it's because the car is running badly and he has diagnosed it as a tune-up because that's all he knows about. In reality, it's something like the wheel being flat, etc.

Let me add these are not policy of the company I work for or official in any way, shape, or form. These are just things we use to identify situations when we commiserate over the finer points of being an administrator.

Those are wonderful and true Golden Rules.
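
The completion backward principle lends itself to a small sketch: take a copy before making a change, and keep a way to put it back. The Python below is one possible shape for that, under stated assumptions: the function names `backup_then_edit` and `rollback` and the timestamped `.bak` naming are invented for this example, and it handles a single text file rather than a whole server image.

```python
import shutil
import time
from pathlib import Path


def backup_then_edit(path, edit):
    """Copy `path` to a timestamped sibling file, then apply `edit` to its text.

    `edit` is any function that takes the old text and returns the new text.
    Returns the backup path so the change can be rolled back later.
    """
    path = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)  # copy contents and timestamps before touching anything
    path.write_text(edit(path.read_text()))
    return backup


def rollback(path, backup):
    """Restore `path` from the backup taken before the change."""
    shutil.copy2(backup, Path(path))
```

The backup is made before the file is opened for writing, so a failed edit still leaves the original recoverable. Note that two edits to the same file within one second would share a backup name in this sketch.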

An anonymous reader contributes:

I enjoyed reading your article in Sys Admin mag. Mostly, I got quite a few chuckles -- although the suggestions you put forth are sound in basis, many are impractical for large shops. When in doubt, reboot? Sometimes a sys admin doesn't have the luxury to reboot systems every time there is a problem. One of the beauties of UNIX (above other operating systems that I shall refrain from naming) is that it separates user- and system-level functions cleanly and frequent reboots to solve problems are not absolutely necessary. For every time that a corrupted jumbo patch was the culprit in my professional experiences, there have been ten that rebooting ad nauseam proved nothing and some real detective work was needed to resolve a problem. It is true that editing /etc/system typically requires a reboot (unless you're really brave and are willing to test your changes by running adb on a live kernel), but I think you're doing your readers a bit of a disservice by encouraging them to use the reboot as a crutch rather than as a last resort; knowing when to reboot -- and when not to -- is one characteristic of a good, experienced admin.

[That is a great point for all of these rules -- knowing when to apply them is as important as knowing the rule. And I agree that rebooting is sometimes pointless. That said, I've been surprised at the number of issues that are still resolved on Solaris by a reboot. Sometimes, fighting with the machine may waste time, when a reboot could save time. -PBG]

Similarly, I would take exception with your suggestion that it is a best practice to "Use Defaults Whenever Possible." Again, this seems to be a suggestion geared at inexperienced administrators that don't have the time to immerse themselves in the operating system enough to become expert at it. For instance, were I to follow this tenet in my daily life, I would install the entire OEM+ Solaris distribution on all my machines... only to be nagged by our security administrator every other week when the latest sadmind, dtspcd, or rpc.ttdbserverd exploit is released in the wild. By learning enough about my system to know the minimum setup that it needs to operate properly, not only do I eliminate security risk, but I reduce some unnecessary dead weight in my environment. It is especially disappointing to see this subject treated so lightly since your bio states that you teach security topics worldwide (actually, I think I attended one of your talks at SANS '97).

[Rats, I hate it when people use my words against me. But this is a very good point. If security is a primary concern, then many of these rules should be used in antithesis. For example, being too clever is bad, unless you're trying to secure a machine by adding layers of defenses. Then being too clever is good. The point I was trying to make is not to change the defaults without good reason, and security is one of the best reasons. -PBG]

I like many of your suggestions and overall it was a good column (never change anything on Fridays is an especially insightful point), but I really think you should have prefaced it with an explanation of who your target audience is. The administrator that is looking to make the leap from novice to expert would not be well-served by taking some of these suggestions as gospel.

Art Kufeldt lets us know:

Another corollary to the "Check the Cables" rule is: Check the Simple and Obvious Stuff First. A good example is the user who calls and says, "I can't login anymore." Rather than reset their password or worse yet, tear into the NIS server, I just ask them if they have the "Caps Lock" key on. Probably 70-80% of the time it is something simple like that. Also certain apps lock up the mouse and keyboard when the "Num Lock" key is invoked, so rather than kill their processes or make them reboot, I ask them that question also. Systems administration can be really time consuming as you are well aware of, but things like these rules can make your life a lot easier. Excellent article!

Perhaps the most concise (and true) rule comes from Juan Pablo Sánchez Beltrán:

If everything's working ok don't change anything.

When I think back on all the time I wasted (mostly on Windows systems) making just one last change, or one small improvement, only to cause irreparable harm resulting in prolonged debugging (or an operating system reinstallation), it makes me wish I had this rule tattooed somewhere obvious. The beauty of this rule is that it works in real life, not just for systems administration.

Pete Tamas writes:

I enjoyed your article "The Golden Rules of Sun Systems Administration". It should be required reading for all admins.

I'd like to mention a few rules of my own. This one would have saved a lot of people a lot of anxiety:

Power off before you remove or rebuild -- if you power it off, you may find out that there are people depending on services running. If you have already removed the server, replaced hard drives, or started putting a new OS on, it's much harder to resolve the unforeseen issues. I've often seen complaints from people who got the email but thought the change did not affect them. Sometimes even technical staff will forget to mention a dependency.

Here's a sneaky one to help you get your project management work done:

Announce big changes far earlier than you really need them done -- this is an extension to the "All projects take twice the Estimated Time and Money". Let's say you need to power off a data center or replace a critical server by October 31. Announce it for September. People will be much more forthcoming about possible gotchas as the "deadline" approaches. You can adjust to "push back" very diplomatically and generously because your real deadline is not imperiled.

And something philosophical:

The really good stuff is never funny at the time.

Regards, Pete Tamas

Very nice rules to live by. I especially like the idea of announcing an earlier date. We could also call this the "Scotty rule" (from Star Trek), although Scotty not only acted as if the problem would take longer than it would, but also as if it were harder than it was. This could make you a hero in the eyes of your users...until they catch on!

Finally, this from Christopher Jones:

It was the article on "The Golden Rules of Sun System Administration" that I just got in the mail yesterday. I fully agreed with every point you made on there, I've seen all of them. I've been a UNIX admin here at NASA for over five years now. But I wanted to point out to you (and maybe you're just partial to Sun) that everything you said applies to all UNIX platforms out there -- Sun, SGI, HP-UX, AIX, and Linux (and I'm sure I've missed some). Even though Sun seems to be the big dog out there these days, there's still plenty of the other operating systems in play still.

Here where I am, we've got pretty much a 50/50 mix of Sun and SGI (with most of our high-end, high-memory, high-processor-count systems being the SGI). Even though I've got my favorites, I try to treat all the flavors equally.

I did enjoy the article though... Keep writing them!

Very true. Given that most of my experience is with Solaris, I did not want to speak for other operating systems. My experiences with Linux, Win2K, and even Tops-20 (way back when), indicate that these are life-long systems administrator rules, regardless of the operating system being administered.

For another perspective about the sys admin life, you may want to check out:

http://www.linux.com/enhance/newsitem.phtml?sid=1&aid=11529

Conclusions

These comments seem to hit home with long-term systems administration experiences. Are there any others that we've missed? If so, send them along to pbg@petergalvin.org, and I'll include them in The Solaris Sys Admin Best Practices FAQ.

Peter Baer Galvin (http://www.petergalvin.org) is the Chief Technologist for Corporate Technologies, a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider. Peter is coauthor of the Operating Systems Concepts and Applied Operating Systems Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.