The Great Solaris Administration Best Practices Debate
Peter Baer Galvin
In the February 2002 Sys Admin magazine, my column covered
"The Golden Rules of Sun Systems Administration" (http://www.sysadminmag.com/documents/s=2293/sam0202j/sam0202j.htm).
This led to a flurry of feedback, point, and counterpoint about
the true Solaris systems administration best practices. This month,
my column includes the points from readers, and my view on them.
So much time, effort, and discussion is spent on specific sys
admin tasks ("how to get this thing to do that thing"),
and relatively little time is spent on the larger issue of the best
ways to install systems, the best ways to run systems, and in general,
the best practices of systems administrators. Systems administration
is a journeyman's trade. Most admins don't earn a degree
in systems administration. Most don't even take any courses
in it, except for the occasional "how to" tutorial or
vendor course. Rather, good sys admin practices are learned on the
job, through interactions with other administrators and direct experience
(i.e., "the hard way").
Thus, I believe that a Solaris Sys Admin Best Practices FAQ is
a good and useful thing to create and maintain. This month, I include
new ideas and points of discussion from readers. In the May issue
of Sys Admin, the wisdom from "Golden Rules" and
here, plus your new feedback, will be collected into such a FAQ.
I hope it will continue to evolve and improve over time, again with
reader input.
With no further ado, here is the more salient feedback to the
Golden Rules column. Stewart Dean wrote:
Comments/Additions:
1. PAY ATTENTION: Your world is hostile; "smell" the
wind, listen to the slightest noise, *know* your world. You can
generally deal with problems two ways:
- Early and cheap, if you're paying attention and are proactive
- Late and expensive if you are smashed under the avalanche.
So, pay attention to how things work when they work well; pay
attention when things change. Devise trip wires and reports.
Related to this: pay attention to error messages and clean up the
errors behind them. You ignore them at your peril: the problem will
snowball and devour you at the worst possible time and/or messages
will mount up until they hide a really important one that you miss
because you begin ignoring all messages.
2. STRATEGY & TACTICS: It's essential to learn the difference
between strategy and tactics and learn the place for both. Strategy
is arranging the battlefield so that your chances are maximized:
you may thus be able to win easily or without even a fight. Tactics
refer to hand-to-hand combat. You win by being good at both.
Subrules:
a. An ounce of strategy can be worth a couple of tons of tactics.
See Rule 1.
b. Don't be overly clever with strategy where tactics will
do. Life is short.
3. HELP USERS FIX IT THEMSELVES: If you have reports that tell
you about quota abuse, overly large mail Trash and Sent folders
and the like, extend the exec to send messages to users informing
them of the approaching fubar while they can still deal with it.
Besides, you will have the immense satisfaction of being able to
say to them, "You got a warning message, you turkey, why didn't
you pay attention?" when they call up to whine at you.
[Plus, following this rule tends to get users who appreciate and
respect you, and who are your allies later, when you need it. -
PBG]
4. HTML SYSADMIN DOC: Document your system administration by using
basic HTML (Netscape Composer is plenty sufficient). It is "indexed":
grep on the search argument; it hyperlinks to your own stuff
and that of your vendors. You can find the answers to fubars that
happened a year ago (when they recur).
You can access the data from anywhere (although you'll want
to restrict that naturally); you can ftp it all from the server
to your laptop for when you're at a remote site or on vacation
and problems occur. You can share the info and add contributions
easily. You can burn a CD of it to archive with the backups.
[I've frequently kicked myself for not writing down some
solution to some problem that recurred. If I'd taken 10 minutes
to document it, I would have saved 10 hours at the next occurrence.
- PBG]
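Rule 3 above can be sketched in a few lines. This is a minimal Python illustration, not the reader's actual exec: the Usage record, the warning_for function, the 90% threshold, and the sample users are all assumptions made for the example. A real script would read its numbers from your own quota report and hand the message to the local mailer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """One row of a (hypothetical) quota report."""
    user: str
    used_mb: int
    quota_mb: int


def warning_for(u: Usage, threshold: float = 0.9) -> Optional[str]:
    """Return a warning message if the user is at or above threshold, else None."""
    if u.quota_mb <= 0:
        return None  # no quota set, so there is nothing to warn about
    fraction = u.used_mb / u.quota_mb
    if fraction < threshold:
        return None
    return (
        f"To: {u.user}\n"
        f"Subject: Disk quota warning\n\n"
        f"You are using {u.used_mb} MB of your {u.quota_mb} MB quota "
        f"({fraction:.0%}). Please clean up before you hit the limit."
    )


if __name__ == "__main__":
    # Sample data standing in for a real report.
    report = [Usage("alice", 950, 1000), Usage("bob", 200, 1000)]
    for entry in report:
        message = warning_for(entry)
        if message:
            print(message)  # a real script would mail this instead
```

The point of the threshold is timing: the user hears about the problem while there is still room to fix it.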
Another Occam's Razor corollary. I thought I thought it up,
but Dick Paddock seems to have originated it: Never attribute to
malice what can be explained as the result of sheer idiot stupidity.
Re: users, bosses, the guy who left you the mess you're cleaning
up, etc.
[Very true. I learned long ago not to pass judgment on whoever
created or left the mess that I was dealing with. I once criticized
a system configuration in front of a client, and it turned out to
have been done by one of my co-workers! - PBG]
Some minor things:
a. Most cell phones now come with text messaging so it's
now dead easy to have your trip wire execs call your cell phone.
b. Don't just save a backup, also save your system configuration
(with Sun, it's essential, with AIX less so) and your system
admin doc.
That reminds me of one of the more important but hard to understand
tenets I follow: do not be "too" clever. Of course the
challenge is in knowing when "too" is occurring. For example,
a client was devising multiple, complicated ways to use a set of
systems, rather than simplifying the solution by using more equipment.
There are always tradeoffs like that in any facility architecture.
I was concerned by the complexity and advised them of this tenet.
The systems manager in charge responded, "Oh, I'm not."
He was, and it ended up costing him his job when the facility failed
and failed again. Frequently, implementation and management complexity
is unnecessary and results from "too clever" systems administration.
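The "trip wires" from rule 1 of the letter above can start out very simple. The following Python sketch is one possible shape, under stated assumptions: the ALERT and KNOWN_NOISE patterns, the trip_wire function, and the sample log lines are all hypothetical and would need tuning for a real site.

```python
import re
from typing import Iterable, List

# Hypothetical patterns; adjust both for your own logs.
ALERT = re.compile(r"error|fail|panic|warning", re.IGNORECASE)
KNOWN_NOISE = [re.compile(r"printer out of paper")]


def trip_wire(lines: Iterable[str]) -> List[str]:
    """Return lines that look like trouble and are not known, accepted noise."""
    hits = []
    for line in lines:
        if not ALERT.search(line):
            continue  # nothing alarming on this line
        if any(p.search(line) for p in KNOWN_NOISE):
            continue  # a message we have decided to live with
        hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    sample = [
        "02:10:00 web1 sshd: accepted login for alice\n",
        "02:11:05 web1 disk0: WARNING: retryable error\n",
        "02:12:00 web1 lpd: error: printer out of paper\n",
    ]
    for hit in trip_wire(sample):
        print(hit)  # a real script might send this as a text message
```

Keep the noise list short. As the letter says, the better fix is to clean up the cause of a recurring message, because a long ignore list is just another way of ignoring all messages.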
An anonymous reader is miffed about one of my golden rule statements:
Since you say that putting files in /usr/local/bin is "deprecated",
how about informing the totally clueless where the now standard
location would be?
Well, as with all of the best practice advice, there are really
no hard-and-fast rules. In this case, the direction of Sun, as discerned
by their installation changes over time, is to make /usr
as read-only as possible. Rather, /opt/local is the best
place to make local changes. If users (or administrators) cannot
stand the change, a link from /usr/local to /opt/local
is reasonable.
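For those who want the compatibility link, here is a cautious sketch in Python. The function name link_usr_local and its root parameter are assumptions for illustration (root exists only so the function can be tried in a scratch directory). It refuses to touch an existing /usr/local directory, which should be migrated by hand.

```python
import os


def link_usr_local(root: str = "/") -> str:
    """Create <root>/usr/local as a symlink to /opt/local if nothing is there.

    Returns "created", "already-linked", or "skipped".
    """
    link = os.path.join(root, "usr", "local")
    target = "/opt/local"
    if os.path.islink(link):
        # A link exists; only report success if it points where we expect.
        return "already-linked" if os.readlink(link) == target else "skipped"
    if os.path.exists(link):
        return "skipped"  # a real directory or file is there; do not clobber it
    os.makedirs(os.path.dirname(link), exist_ok=True)
    os.symlink(target, link)
    return "created"
```

Running it a second time is harmless: it reports "already-linked" and changes nothing.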
Ken Stone states:
I read your article on the Golden rules of Sun administration
and can appreciate them all. We have recently begun collecting what
we call principles. Many are similar to what you have documented,
but here are a couple you might find merit in adding to the list:
Completion backward principle -- Never move forward with something
unless you are fully prepared to return the server to the original
starting point. Make images, back stuff up and test, make sure you
have the original CDs, or whatever; just make sure you do it before
you begin.
Gita principle -- When users diagnose their own problems and
provide a solution based upon their limited knowledge of possible
solutions. This is a little tougher to explain. Basically it's
important as an admin to recognize when this is happening while
communicating with users. The example we use is when someone brings
his car to a shop for a tune-up; usually it's because the car
is running badly and he has diagnosed it as a tune-up because that's
all he knows about. In reality, it's something like the wheel
being flat, etc.
Let me add these are not policy of the company I work for or official
in any way, shape, or form. These are just things we use to identify
situations when we commiserate over the finer points of being an
administrator.
Those are wonderful and true Golden Rules.
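The completion backward principle applies even to a single configuration file. The Python sketch below shows one way to make a change reversible, assuming a helper named reversible_edit and a ".orig" backup suffix, both invented for this example. For a whole server the same idea means images and tested backups, not a file copy.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def reversible_edit(path: str):
    """Copy `path` aside before the change; restore the copy if the change raises."""
    backup = path + ".orig"
    shutil.copy2(path, backup)  # the known-good starting point
    try:
        yield path
    except Exception:
        shutil.copy2(backup, path)  # return to the original state
        raise


if __name__ == "__main__":
    # Demonstrate on a scratch file rather than a real system file.
    workdir = tempfile.mkdtemp()
    conf = os.path.join(workdir, "app.conf")
    with open(conf, "w") as f:
        f.write("a=1\n")
    try:
        with reversible_edit(conf):
            with open(conf, "w") as f:
                f.write("broken")
            raise ValueError("validation failed")
    except ValueError:
        pass
    with open(conf) as f:
        print(f.read(), end="")  # the original contents are back
```

The backup is deliberately left in place after a successful change, so the way back still exists later. A production version would probably timestamp the copy so that a second edit does not overwrite the first backup.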
An anonymous reader contributes:
I enjoyed reading your article in Sys Admin mag. Mostly, I got
quite a few chuckles -- although the suggestions you put forth
are sound in basis, many are impractical for large shops. When in
doubt, reboot? Sometimes a sys admin doesn't have the luxury
to reboot systems every time there is a problem. One of the beauties
of UNIX (above other operating systems that I shall refrain from
naming) is that it separates user- and system-level functions cleanly
and frequent reboots to solve problems are not absolutely necessary.
For every time that a corrupted jumbo patch was the culprit in my
professional experience, there have been ten where rebooting ad
nauseam proved nothing and some real detective work was needed to
resolve a problem. It is true that editing /etc/system typically
requires a reboot (unless you're really brave and are willing
to test your changes by running adb on a live kernel), but
I think you're doing your readers a bit of a disservice by
encouraging them to use the reboot as a crutch rather than as a
last resort; knowing when to reboot -- and when not to --
is one characteristic of a good, experienced admin.
[That is a great point for all of these rules -- knowing when
to apply them is as important as knowing the rule. And I agree that
rebooting is sometimes pointless. That said, I've been surprised
at the number of issues that are still resolved on Solaris by a
reboot. Sometimes, fighting with the machine may waste time, when
a reboot could save time. -PBG]
Similarly, I would take exception to your suggestion that it
is a best practice to "Use Defaults Whenever Possible."
Again, this seems to be a suggestion geared at inexperienced administrators
that don't have the time to immerse themselves in the operating
system enough to become expert at it. For instance, were I to follow
this tenet in my daily life, I would install the entire OEM+ Solaris
distribution on all my machines... only to be nagged by our security
administrator every other week when the latest sadmind, dtspcd,
or rpc.ttdbserverd exploit is released in the wild. By learning
enough about my system to know the minimum setup that it needs to
operate properly, not only do I eliminate security risk, but I reduce
some unnecessary dead weight in my environment. It is especially
disappointing to see this subject treated so lightly since your
bio states that you teach security topics worldwide (actually, I
think I attended one of your talks at SANS '97).
[Rats, I hate it when people use my words against me. But this
is a very good point. If security is a primary concern, then many
of these rules should be used in antithesis. For example, being
too clever is bad, unless you're trying to secure a machine
by adding layers of defenses. Then being too clever is good. The
point I was trying to make is not to change the defaults without
good reason, and security is one of the best reasons. -PBG]
I like many of your suggestions and overall it was a good column
(never change anything on Fridays is an especially insightful point),
but I really think you should have prefaced it with an explanation
of who your target audience is. The administrator that is looking
to make the leap from novice to expert would not be well-served
by taking some of these suggestions as gospel.
Art Kufeldt lets us know:
Another corollary to the "Check the Cables" rule is:
Check the Simple and Obvious Stuff First. A good example is the
user who calls and says, "I can't login anymore."
Rather than reset their password or worse yet, tear into the NIS
server, I just ask them if they have the "Caps Lock" key
on. Probably 70-80% of the time it is something simple like that.
Also certain apps lock up the mouse and keyboard when the "Num
Lock" key is invoked, so rather than kill their processes or
make them reboot, I ask them that question also. Systems administration
can be really time consuming as you are well aware of, but things
like these rules can make your life a lot easier. Excellent article!
Perhaps the most concise (and true) rule comes from Juan Pablo
Sánchez Beltrán:
When I think back at all the time I wasted (mostly on Windows
systems) making just one last change, or one small improvement,
only to cause irreparable harm resulting in prolonged debugging
(or an operating system reinstallation), it makes me wish I had
this rule tattooed somewhere obvious. The beauty of this rule is
that it works in real life, not just for systems administration.
Pete Tamas writes:
I enjoyed your article "The Golden Rules of Sun Systems Administration".
It should be required reading for all admins.
I'd like to mention a few rules of my own. This one would
have saved a lot of people a lot of anxiety:
Power off before you remove or rebuild -- if you power it
off, you may find out that there are people depending on services
running. If you have already removed the server, replaced hard drives,
or started putting a new OS on, it's much harder to resolve
the unforeseen issues. I've often seen complaints from people
who got the email but thought the change did not affect them. Sometimes
even technical staff will forget to mention a dependency.
Here's a sneaky one to help you get your project management
work done:
Announce big changes far earlier than you really need them done
-- this is an extension of the "All projects take twice
the Estimated Time and Money" rule. Let's say you need to power
off a data center or replace a critical server by October 31. Announce
it for September. People will be much more forthcoming about possible
gotchas as the "deadline" approaches. You can adjust to
"push back" very diplomatically and generously because
your real deadline is not imperiled.
And something philosophical:
The really good stuff is never funny at the time.
Regards, Pete Tamas
Very nice rules to live by. I especially like the idea of announcing
an earlier date. We could also call this the "Scotty rule"
(from Star Trek), but he not only acted as if the problem would take
longer than it would, but also as if it were harder than it was. This
could make you a hero in the eyes of your users...until they catch
on!
Finally, this from Christopher Jones:
It was the article on "The Golden Rules of Sun System Administration"
that I just got in the mail yesterday. I fully agreed with every
point you made on there, I've seen all of them. I've been
a UNIX admin here at NASA for over five years now. But I wanted
to point out to you (and maybe you're just partial to Sun)
that everything you said applies to all UNIX platforms out there
-- Sun, SGI, HP-UX, AIX, and Linux (and I'm sure I've
missed some). Even though Sun seems to be the big dog out there
these days, there are still plenty of other operating systems
in play.
Here where I am, we've got pretty much a 50/50 mix of Sun
and SGI (with most of our high-end, high-memory, high-processor-count
systems being the SGI). Even though I've got my favorites,
I try to treat all the flavors equally.
I did enjoy the article though... Keep writing them!
Very true. Given that most of my experience is with Solaris, I
did not want to speak for other operating systems. My experiences
with Linux, Win2K, and even Tops-20 (way back when), indicate that
these are life-long systems administrator rules, regardless of the
operating system being administered.
For another perspective about the sys admin life, you may want
to check out:
http://www.linux.com/enhance/newsitem.phtml?sid=1&aid=11529
Conclusions
These comments seem to hit home with long-term systems administration
experiences. Are there any others that we've missed? If so,
send them along to pbg@petergalvin.org, and I'll include
them in The Solaris Sys Admin Best Practices FAQ.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies, a premier systems
integrator and VAR. Before that, Peter was the systems manager for
Brown University's Computer Science Department. He has written
articles for Byte and other magazines, and previously wrote
Pete's Wicked World, the security column, and Pete's Super
Systems, the systems management column for Unix Insider.
Peter is coauthor of the Operating Systems Concepts and Applied
Operating Systems Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.