Solaris™ Administration Best Practices
Peter Baer Galvin
Over the past few months in Sys Admin, both online and
in print, a discourse has been taking place about the best practices
of Solaris administrators. This month in the Solaris Corner, I take
the best of the old, add the best of the new, and create a consensus
"best practices" document. In this document, you'll
see some repeats from past columns, but it seemed logical to put
forth a complete version of the document this once. I hope this
will be useful to both experienced and novice administrators, and
that it will continue to evolve and grow as more administrators
contribute their wisdom.
"Consensus" and "systems administration" are
words that are rarely used near each other. Systems administration
is a journeyman's trade, with knowledge earned through experience
and hard work. Every sys admin performs his or her work with slight
and not-so-slight variation from colleagues. Some of that variation
is caused by personal choice, some by superstition, and some by
a differing knowledge set. The more this hard-won knowledge is spread,
the more systems will be run alike, and the more stable, usable,
and manageable they will be, thus the need for a Best Practices
document. Note that most of the information here applies to other
operating systems, and sometimes to the real world as well.
Solaris Administration Best Practices, Version 1.0
This document is the result of input from many of the top administrators
in the Solaris community. Please help your fellow sys admins (as
they have helped you here) by contributing your best practices to
this document. Email them to me at: bestpractice@petergalvin.org.
Keep an Eye Peeled and a Wall at Your Back
The best way to prepare to debug problems is to know how your
systems run when there are no problems. Then, when a problem occurs,
it can easily be discerned. For example, if you aren't familiar
with the normal values of system statistics (CPU load, number of
interrupts, memory scan rates, and so on), determining that one
value is unusual will be impossible. Devise trip wires and reports
to help detect when something unusual is happening. swatch
and netsaint are good programs for this purpose.
Also, pay attention to error messages and clean up the errors
behind them. You ignore them at your peril -- the problem will
snowball and devour you at the worst possible time, or they'll
mount until they hide a really important problem that you miss in
all the noise.
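As one way to picture a trip wire, here is a minimal Python sketch. It assumes you have saved samples of a statistic while the system was healthy. The function name, the sample load averages, and the three-standard-deviation threshold are all illustrative choices of mine, not something swatch or netsaint prescribes.

```python
from statistics import mean, stdev


def is_unusual(baseline, current, threshold=3.0):
    """Return True if `current` lies more than `threshold` standard
    deviations away from the mean of the healthy `baseline` samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical load-average samples gathered while the system was healthy.
normal_load = [0.8, 1.1, 0.9, 1.0, 1.2, 0.95]

print(is_unusual(normal_load, 1.05))  # within the normal range
print(is_unusual(normal_load, 6.0))   # far outside the normal range
```

A check this simple will not suit every statistic, but it makes the point: without the baseline list, there is nothing to compare today's value against.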
Communicate with Users
Veteran sys admins realize that they are there to make users'
lives easier. Those who are going to enjoy long sys admin careers
take this to heart, and make decisions and recommendations based
on usability, stability, and security. The antithesis of this rule
is the joke I used to tell as a university sys admin: "This
university job would be great if it wasn't for the dang students
messing up my systems". Good sys admins realize that this is
a joke, not gospel!
Also remember the example set by the World Champion New England
Patriots -- teamwork can overcome strong foes. Talk with your
fellow admins, bounce ideas around, share experiences, and learn
from each other. It will make you, them, and your systems better.
Help Users Fix It Themselves
Helping users help themselves means that you can spend your time
on the hard and interesting problems (not to mention fewer calls
on your off-hours, and happier users). For example, if you have
reports that tell you about quota abuse (overly large mail folders
and the like), enable the users to solve the problem, rather than
repeatedly complaining to them about their abuse. The counter to
this rule is that a little knowledge can be dangerous. Users may
think they understand the problem when they don't, or might
think they are solving the problem when they are making it worse.
Know When to Use Strategy, and When to Use Tactics
Sys admins must learn the difference between strategy and tactics
and learn the place for both. Being good at this requires experience.
Strategy means arranging the battlefield so that your chances are
maximized, possibly allowing you to win easily or even without a
fight. Tactics mean hand-to-hand combat. You win by being good at
both. An ounce of strategy can be worth a couple of tons of tactics.
But don't be overly clever with strategy where tactics will
do.
Another way to think of this rule is: "Do it the hard way
and get it over with." Too often admins try to do things the
easy way, or take a shortcut, only to have to redo everything anyway
and do it the "hard" way. (Note that the "hard way"
may not be the vendor-documented way of doing things.)
All Projects Take Twice as Long as They Should
Project planning, even when performed by experienced planners,
misses small but important steps that, at a minimum, delay the project,
and at a maximum, can destroy the plan (or the resulting systems).
The most experienced planners pad the times they think each step
will take, doubling the time of a given plan. Then, when the plan
proves to be accurate, the planner is considered to be a genius. In
reality, the planner thought "this step should take four days,
so I'll put down eight".
Another way that knowledge of this rule can be used to your advantage
-- announce big changes far earlier than you really need them
done. If you need to power off a data center or replace a critical
server by October 31, announce it for September. People will be
much more forthcoming about possible problems as the "deadline"
approaches. You can adjust to "push back" very diplomatically
and generously because your real deadline is not imperiled. (Of
course, it's also very important to be honest with your users
to establish trust, so be careful with this "Scotty" rule.)
It's Not Done Until It's Tested
Many sys admins like their work because of the technical challenges,
minute details, and creative processes involved in running systems.
The type of personality drawn to those types of challenges typically
is not the type that is good at thoroughly testing a solution, and
then retesting after changes to variables. Unfortunately for them,
testing is required for system assurance, and for good systems administration.
I shudder to think how much system (and administrator) time has
been wasted by less-than-thorough testing.
Note that this rule can translate into the need for a test environment
of some sort. If the production environment is important, there
should be a test environment for learning and experimentation.
It's Not Done Until It's Documented
Documentation is never a favorite task. It's hard to do and
even harder to keep up to date, but it pays great dividends when
done properly. Note that documentation does not need to be in the
form of a novel. It can be the result of a script run to capture
system configuration files and status command output, in part. Another
strategy -- document your systems administration using basic
HTML. (Netscape Composer is sufficient for this task.) Documents
can be stored remotely, and links can be included to point to useful
stuff. You can even burn a CD of the contents to archive with the
backups.
If you keep a system history this way, searching the documents
can help solve recurring problems. Frequently, admins waste time
working on problems that have previously been solved, but not properly
documented.
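The "script run to capture system configuration files and status command output" might look something like the following Python sketch. The function name, the directory layout, and the particular commands and files are assumptions for illustration; substitute whatever matters on your own systems.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def snapshot(dest_root, commands, config_files):
    """Save command output and copies of config files under a
    timestamped directory, and return that directory."""
    dest = Path(dest_root) / time.strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)

    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=60
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot.
            text = f"could not run {argv}: {exc}\n"
        (dest / f"{name}.txt").write_text(text)

    for path in map(Path, config_files):
        if path.is_file():
            shutil.copy2(path, dest / path.name)

    return dest


if __name__ == "__main__":
    # Illustrative choices only; a temporary directory keeps the demo harmless.
    out = snapshot(
        tempfile.mkdtemp(prefix="sysdoc-"),
        {"uname": ["uname", "-a"], "df": ["df", "-k"]},
        ["/etc/hosts"],
    )
    print(f"snapshot written to {out}")
```

Run from cron and archived with the backups, a directory of these snapshots becomes a crude but searchable system history.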
Never Change Anything on Fridays
Whatever the cause (in a hurry to get home, goblins, gamma rays),
changes made before weekends (or vacations) frequently turn into
disaster. Do not tempt the fates -- just wait until there are
a couple of days in a row to make the change and monitor it. Some
admins even avoid making a change late in the day, preferring to
do so at the start of a day to allow more monitoring and debugging
time.
Use Defaults Whenever Possible
I recall a conversation with a client in which the client was trying
to go outside of the box, save some time and money, and produce
a complex but theoretically workable solution. My response was "there's
such a thing as being too clever". He continued being clever
(too much so, in my opinion; just right, in his opinion). The solution
had problems, was difficult to debug, was difficult for vendors
to support, and was quite a bother for a while.
In another example, some admins make changes for convenience that
end up making the system different from others. For example, some
still put files into /usr/local/bin, even though Sun has
discouraged that (and encouraged /opt/local) for many years.
It may make the users' lives easier, because that's where
they expect to find files, but other admins may be unpleasantly
surprised when they use standard methods and find they conflict
with the current system configuration.
This rule (as with the others) can be violated with good cause.
For example, security can sometimes be increased by not using
defaults.
Furthermore, standardize as much as possible. Try to run the same
release of Solaris on all machines, with the same patch set, with
the same revisions of applications, with the same hardware configurations
-- this is easier said than done, as with most of these rules.
It is important to set goals, and drive toward them, when possible.
This is a good goal, even if it is not 100% attainable.
With this in mind, isolate site- and system-specific changes.
Try to keep all nonstandard things in one place so it is easy to
manage them, move them, and to know that something is non-standard.
Always Be Able to Undo What You Are About to Do
Never move forward with something unless you are fully prepared
to return the server to the original starting point (e.g., make
images, back stuff up and test, make sure you have the original
CDs). Back up the entire system if making systemic changes, such
as system upgrades or major application upgrades. Back up individual
files if making minor changes. Rather than deleting something (a
directory or application), try renaming or moving it first. Everyone
who has administered systems for any amount of time has seen the
result of ignoring this rule.
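For the small-change case, the two habits above (copy a file before editing it, rename rather than delete) can be sketched in a few lines of Python. The helper names and the ".bak" and ".removed" suffixes are hypothetical conventions chosen for this example.

```python
import shutil
import time
from pathlib import Path


def backup_file(path):
    """Copy `path` to a timestamped sibling and return the backup's path.

    Two backups taken within the same second share a name, and the
    later copy overwrites the earlier one."""
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, backup)  # copy2 also preserves timestamps and mode
    return backup


def set_aside(path):
    """Rename a file or directory instead of deleting it, so that
    undoing the change is a single rename back."""
    src = Path(path)
    target = src.with_name(src.name + ".removed")
    if target.exists():
        raise FileExistsError(f"{target} already exists; resolve it by hand")
    src.rename(target)
    return target


# Usage (paths are examples only):
#   saved = backup_file("/etc/vfstab")      # before editing the file
#   set_aside("/opt/local/oldapp")          # instead of removing it
```

Neither helper replaces a full, tested backup before a systemic change; they only make the minor changes cheap to reverse.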
Avoid Poor Decisions from Above
This certainly falls under the category of easier said than done,
unfortunately. Management can have bad data and make bad decisions,
or can even have good data and make bad decisions. You as the admin
usually have to live with (and suffer with) these decisions, so,
with reason and data, encourage the correct decision. Even if you
lose the battle, you can always say "I told you so" and
feel good about yourself (while looking for a new job).
If You Haven't Seen It Work, It Probably Doesn't
Also known as "the discount of marketing". Products
are announced, purchased, and installed with the expectation that
they will perform as advertised, and in some small percentage of
time they actually do. Most of the time, they are overpromised
and underdelivered.
If You're Fighting Fires, Find the Sources
I would posit that thousands of admin-lives have been wasted by
fighting computer system "fires", instead of the causes
of those fires. To avoid this problem, you must automate what can
be automated, especially monitoring and log checking. This can free
up enough time to allow you to make progress on projects, rather
than spending time on tedious work. Those projects in turn can stabilize
systems, improve manageability, increase performance, and in general
make the systems happier.
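Automated log checking does not have to be elaborate. The Python sketch below scans log lines for a handful of patterns and reports the matches. The patterns and the sample log lines are invented for illustration; a real list would grow out of the messages that have actually bitten you.

```python
import re

# Hypothetical patterns; tune them to the messages that matter at your site.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bpanic\b", r"file system full", r"\bi/o error\b")
]


def scan(lines, patterns=PATTERNS):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits


# Invented sample lines in a syslog-like layout.
sample = [
    "Mar  3 02:10:01 host cron[123]: job started\n",
    "Mar  3 02:11:17 host ufs: NOTICE: alloc: /var: file system full\n",
    "Mar  3 02:12:40 host sshd[456]: accepted connection\n",
]

for number, line in scan(sample):
    print(f"line {number}: {line}")
```

Feed it an open log file instead of the sample list, mail yourself the hits, and one small fire source is under watch without any further effort.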
If You Don't Understand It, Don't Play with It on Production Systems
In my university days, we had a student programmer look at the
key switch on our main production Sun server, wonder what it does,
and turn it to find out. It would have been much better if he had asked
someone about it first, or at least tried it on a test system. That
turns out to be the case with just about everything admin-related.
(Of course, he's now a .com millionaire, so who got the last
laugh?)
If It Can Be Accidentally Used, and Can Produce Bad Consequences, Protect It
For example, if there is a big red power button at shoulder height
on a wall that is just begging to be leaned against, protect it.
Otherwise, the power could be lost in the entire workstation lab
and lots of students could lose their work (not that such a thing
happened on the systems I was managing...). This rule should be
extrapolated to just about everything users have access to --
hardware or software. For instance, if it's a dangerous command
or procedure, wrap it in a protective script.
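A protective script can be as simple as making the operator prove they know which machine they are on. Here is a minimal Python sketch of that idea; the function names and the "type the host name" convention are my own assumptions, and the `prompt` and `runner` parameters exist only so the wrapper can be exercised without touching a real system.

```python
import socket
import subprocess


def confirmed(expected, prompt=input):
    """Ask the operator to type `expected`; return True only on an exact match."""
    answer = prompt(f"This affects {expected}. Type the host name to continue: ")
    return answer.strip() == expected


def guarded_run(argv, prompt=input, runner=subprocess.run):
    """Run `argv` only after the operator types this machine's host name."""
    host = socket.gethostname()
    if not confirmed(host, prompt):
        print("Confirmation failed; nothing was run.")
        return None
    # check=True raises an exception if the command exits non-zero.
    return runner(argv, check=True)


# Usage (the command is an example only):
#   guarded_run(["/usr/sbin/shutdown", "-y", "-g0", "-i0"])
```

The design choice is deliberate friction: the extra few seconds cost nothing on the right machine and save the day on the wrong one.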
Ockham's Razor Is Very Sharp Indeed
William of Ockham (or Occam) advanced humanity immensely when
he devised his "razor", or the "law of parsimony",
which says that the simplest of two or more competing theories is
the preferable one.
Checking the simple and obvious stuff first can save a lot of
time and aggravation. A good example is the user who calls and says,
"I can't log in anymore." Rather than reset the password
or worse yet, tear into the NIS server, just ask if he has the "Caps
Lock" key on. Note that to properly execute Ockham's Razor,
you must start with no preconceived notions. Assume nothing, gather
data, test hypotheses, and then decide on the problem and the solution.
Another Ockham's Razor corollary: Never attribute to malice
what can be explained as the result of sheer idiot stupidity.
Frequently, implementation and management complexity is unnecessary
and results from "too clever" systems administration.
This subtle rule has wide-ranging ramifications. Many times, a sys
admin is in an untenable position (debugging a problem, transitioning,
upgrading, and so on) because of too much cleverness in the past
(their own or someone else's).
When in Doubt, Reboot
As silly and hackneyed as it sounds, this is probably the most
important rule. It's also the most controversial, with much
argument on both sides. Some argue that it's amateur to reboot;
others talk of all the time saved and problems solved by rebooting.
Of course, sometimes a sys admin doesn't have the luxury
to reboot systems every time there is a problem. For every time
that a corrupted jumbo patch was the culprit and a reboot solved
the problem, there is a time that rebooting ad nauseam proved nothing
and some real detective work was needed to resolve a problem. As
with many of these rules, experience is a guide. Knowing when to
reboot -- and when not to -- is one characteristic of a
good, experienced admin.
If It Ain't Broke, Don't Fix It
It's amazing that clichés are often true. When
I recall all the time I've wasted making just one last change,
or one small improvement, only to cause irreparable harm resulting
in prolonged debugging (or an operating system reinstallation),
I wish this rule were tattooed somewhere obvious. The beauty of
this rule is that it works in real life, not just for systems administration.
Save Early and Often
If you added up all the hours wasted because of data lost due
to program and system crashes, it would be a very large number.
Saving early and often reduces the amount of loss when a problem
occurs.
I've heard a story that Bill Joy was adding a lot of features
to his program, the vi editor. It had multiple windows and
a bunch of good stuff. The system crashed and he lost the changes.
Frustrated, he didn't repeat the changes, leaving us with a
much less useful vi. Don't let this happen to you!
Dedicate a System Disk
So much depends on the system disk that it is worth keeping the
disk only for system use. Swapping it then becomes possible without
affecting users. Performing upgrades, cloning, and mirroring are
all easier as well. Performance also improves if the system disk is not
used for other purposes.
Have a Plan
Develop a written task list for important tasks. Develop it during
initial testing and use/refine it as the changes are applied to
increasingly critical servers. By the time you get to production,
you typically have a short maintenance window and avoiding mistakes
is critical. A finely tuned plan can make all the difference, not
to mention that the next time (and the time after that) that you
are going to be doing something similar, you already have a template
for a new task list.
Don't Panic and Have Fun
Rash decisions usually turn out to be bad ones. Logic, reason,
testing, deduction, and repeatability -- these techniques usually
result in solved problems and smooth-running systems. And in spite
of the complexity of the job, and the pressure that can result from
the inherent responsibilities, try to enjoy your work and your co-workers.
Acknowledgements
Thanks to the following folks for contributing to this document:
Stewart Dean, Ken Stone, Art Kufeldt, Juan Pablo Sanchez Beltrán,
Pete Tamas, Christopher Jones, Leslie Walker, Dave Powell, Mike
Zinni, Peggy Fenner, John Kelly, Lola Brown, Chris Gait, Timothy
R. Geier, Michael McConnell, David J. DeWolfe, Christopher Corayer,
and Tarjei Jensen.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column,
and Pete's Super Systems, the systems management column for
Unix Insider (http://www.unixinsider.com). Peter is
coauthor of the Operating System Concepts and Applied
Operating System Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.