The Golden Rules of Sun Systems Administration
Peter Baer Galvin
This month, Peter provides a hard-earned set of rules that can
help guide systems administrators toward good systems practices.
True-life examples show why these rules are so important.
When in Doubt, Reboot
As silly and hackneyed as it sounds, this is probably the most
important rule. There are many circumstances when reboots are required,
such as after installing kernel patches, and after making changes
to /etc/system. There are other times when a reboot is debatable.
In these instances, the correct decision usually is to perform the
reboot. For example, I once made a simple change to a startup script
and because it was so simple, I did not feel testing was necessary.
A month or so later, the system crashed and failed to complete its
startup. After much debugging, I found that the change (which I had
forgotten I had made) contained an error that had caused the problem.
A reboot right after the change would have exposed the error
immediately. It was then time to cancel those service calls I'd made
for new hardware and systems patches.
One client was having a problem with software that had been ported
from Windows to Solaris. The vendor recommended that we reinstall
the operating system and all the application software, which would
have been a multi-day project. Rather than reinstalling the software
and operating system, we decided to take the Sun approach of debugging
the problem. It turned out to be a bad Sun jumbo patch cluster,
and backing down to the previous revision solved the problem and
saved several days on the project.
Communicate with Users
Veteran sys admins realize that they are supposed to simplify
the lives of users. Those who are going to enjoy long careers as
admins take this to heart, and make decisions and recommendations
based on usability, stability, and security. The antithesis of this
rule is the joke I used to tell as a university sys admin: "This
university job would be great if it wasn't for the dang students
messing up my systems." Good admins realize that this is a
joke, not gospel!
If You Have a Problem, Check the Cables
This rule is similar to the first question that appliance customer-service
representatives ask when someone calls in for support -- is
it plugged in? I've heard of (and experienced) several instances
where the problem was with a cable, even if it did not appear to
be. In one instance, a server had an I/O board added and the system
failed to boot properly. After much debugging, the problem was traced
to a bad cable. The cable had not even been touched during the upgrade.
All Projects Take Twice the Estimated Time and Money
Project planning, even when performed by experienced planners,
misses small but important steps that can delay the project or destroy
the plan (or the resulting systems). The most experienced planners
pad the time they think each step will take, doubling it for any
given plan. Then, when the plan proves to be accurate, the planner
is considered a genius. In reality, the planner thought "this
step should take four days, so I'll put down eight".
It's Not Done Until It's Tested
Many sys admins like their work because of the technical challenges,
minute details, and creative processes involved in running systems.
The type of personality drawn to these challenges is not typically
the type that will thoroughly test a solution, and then retest after
changes to variables. Unfortunately for them, testing is required
for system assurance, and for good systems administration. I shudder
to think how much system downtime, and sys admin time, has been
wasted by less-than-thorough testing.
Never Change Anything on Fridays
Whatever the cause (in a hurry to get home, goblins, gamma rays),
changes made before weekends or trips frequently turn into disasters.
Do not tempt the fates -- just wait until you have a couple
of consecutive days to make the change and monitor it. Everyone
will be happier.
Use Defaults Whenever Possible
I recall a conversation with a client in which the client was trying
to go outside of the box, save some time and money, and produce
a complex but theoretically workable solution. The solution had
problems, was difficult to debug, was difficult for vendors to support,
and generally helped make the company miserable for a while.
In another example, some sys admins make changes for convenience
that end up making the system different from others. For example,
some still put files into /usr/local/bin, even though that
has been deprecated for many years. It may make the users'
lives easier because that's where they expect to find files,
but other admins may be unpleasantly surprised when they use standard
methods and find they conflict with the current system configuration.
Create a Backup
Back up the entire system if making systemic changes, such as
system upgrades or major application upgrades. Back up individual
files for minor changes.
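
For the minor-change case, a minimal sketch in Python might look like
the following. The function name backup_file and the timestamped
".bak" naming scheme are assumptions made for illustration, not part
of any Solaris tool; the idea is simply to copy a file aside before
editing it.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, stamp_format="%Y%m%d-%H%M%S"):
    """Copy `path` to a timestamped sibling file and return the backup's path.

    Hypothetical helper: the ".<timestamp>.bak" naming is an assumption.
    Two backups of the same file within one second would share a name,
    and the second would overwrite the first.
    """
    source = Path(path)
    stamp = time.strftime(stamp_format)
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 copies the contents and tries to preserve timestamps and mode bits.
    shutil.copy2(source, backup)
    return backup
```

Calling backup_file("/etc/system") before an edit leaves a dated copy
next to the original, so backing out the change is a single copy in
the other direction.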
Avoid Poor Decisions from Above
Management can have bad data and make bad decisions, or can even
have good data and make bad decisions. Admins, however, have to
live with (and suffer with) these decisions. Take a stand, with
reason and data, to "encourage" the correct decision.
Even if you lose the battle, you can always say "I told you
so" and feel good about yourself (while looking for a new job).
If You Haven't Seen It Work, It Probably Doesn't
Products are announced, purchased, and installed with the expectation
that they will perform as advertised, and sometimes they actually
do. Most of the time, however, they are over-promised and under-delivered.
If You Are Constantly Fighting Fires, Find the Source
Fighting computer system "fires", instead of finding
the causes of those fires, wastes many man-hours. To avoid this
problem, automate whenever possible (especially monitoring and log
checking).
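
As one hedged illustration of automated log checking, the Python
sketch below scans log lines for a few suspicious patterns. The
function name scan_log and the pattern list are assumptions chosen
for the example; a real site would tune the patterns to the messages
its own systems emit.

```python
import re

# Hypothetical starting patterns; adjust them for your own environment.
DEFAULT_PATTERNS = [
    r"\bpanic\b",
    r"\berror\b",
    r"\bwarning\b",
    r"file system full",
]


def scan_log(lines, patterns=DEFAULT_PATTERNS):
    """Return (line_number, text) pairs for lines matching any pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(rx.search(line) for rx in compiled):
            # Strip the trailing newline so hits print cleanly.
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file such as /var/adm/messages to scan_log from a
cron job, and mailing any hits to the admin, is one way to hear about
a problem before the users report it.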
If You Don't Understand It, Don't Play with It on
the Production Systems
In my university days, there was a student programmer who noticed
the key switch on our main production Sun server, wondered what
it did, and turned it to find out. He probably should have asked
someone about it first, or at least tried it on a test system, which
is good advice for just about everything administration-related.
If It Can Be Accidentally Used, and Can Produce Bad Consequences,
Protect It
For example, if there is a big red power button, right around
shoulder height, at a spot on a wall that is just begging to be
leaned against, be sure to put something uncomfortable around it
to discourage that action. Otherwise the power could be lost in
the entire workstation lab and lots of students could lose their
work. Not that such a thing happened on the systems I was managing...
This rule should be extrapolated to just about everything users
have access to, hardware or software.
Ockham's Razor is Very Sharp Indeed
William of Ockham (or Occam) advanced humanity immensely when
he devised his "razor", or the "law of parsimony",
which says that the simplest of two or more competing theories is
the preferable one.
In one example where this rule was ignored, a production system
kept booting and crashing during the boot. The problem appeared
in the boot sequence when the disks were being checked by fsck
and mounted. It appeared that the disks or controller had a problem,
but replacing the controller, moving the disks, and other time-wasting
experiments proved it could not be the disk. A couple more disk-oriented
experiments and lots of head-scratching later, a similar system
was examined and it was noticed that right before the disks were
enabled, networking was enabled. After the interface was brought
up, packets started flowing, and during the disk-check step a bad
packet was being received and causing the system to crash. Once
the SGI server was prevented from annoying the Sun by sending it
large packets (this was a long time ago), abject depression set
in about the amount of human time we wasted and downtime we caused
by not looking for the root cause.
What Rules Do You Live By?
I hope you enjoyed these rules, and the practical examples of
their use and abuse. If you have any to add to the "official"
golden rules, please send them to me at pbg@petergalvin.org.
Peter Baer Galvin (http://www.petergalvin.org)
is the Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column,
and Pete's Super Systems, the systems management column for
Unix Insider (http://www.unixinsider.com).
Peter is coauthor of the Operating Systems Concepts and Applied
Operating Systems Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.