              The Golden Rules of Sun Systems Administration
 Peter Baer Galvin
              This month, Peter provides a hard-earned set of rules that can 
              help guide systems administrators toward good systems practices. 
              True-life examples show why these rules are so important.
              When in Doubt, Reboot
              As silly and hackneyed as it sounds, this is probably the most 
              important rule. There are many circumstances when reboots are required, 
              such as after installing kernel patches, and after making changes 
              to /etc/system. There are other times when a reboot is debatable. 
              In these instances, the correct decision usually is to perform the 
              reboot. For example, I once made a simple change to a startup script 
              and because it was so simple, I did not feel testing was necessary. 
              A month or so later, the system crashed and failed to complete its 
              startup. After much debugging, I found that the change (which I had forgotten
              I had made) contained an error that caused the problem. A reboot right
              after the change would have exposed the error immediately. It was then
              time to cancel those service calls I'd made for new hardware
              and systems patches.
              One client was having a problem with software that had been ported 
              from Windows to Solaris. The vendor recommended that we reinstall 
              the operating system and all the application software, which would 
              have been a multi-day project. Rather than reinstalling the software 
              and operating system, we decided to take the Sun approach of debugging 
              the problem. It turned out to be a bad Sun jumbo patch cluster, 
              and backing down to the previous revision solved the problem and 
              saved several days on the project.
              Communicate with Users
              Veteran sys admins realize that they are supposed to simplify 
              the lives of users. Those who are going to enjoy long careers as 
              admins take this to heart, and make decisions and recommendations 
              based on usability, stability, and security. The antithesis of this 
              rule is the joke I used to tell as a university sys admin: "This 
              university job would be great if it wasn't for the dang students 
              messing up my systems." Good admins realize that this is a 
              joke, not gospel!
              If You Have a Problem, Check the Cables
              This rule is similar to the first question that appliance customer-service 
              representatives ask when someone calls in for support -- is 
              it plugged in? I've heard of (and experienced) several instances 
              where the problem was with a cable, even if it did not appear to 
              be. In one instance, a server had an I/O board added and the system
              failed to boot properly. After much debugging, the problem was traced
              to a bad cable, one that had not even been touched during the upgrade.
              All Projects Take Twice the Estimated Time and Money
              Project planning, even when performed by experienced planners, 
              misses small but important steps that can delay the project or destroy 
              the plan (or the resulting systems). The most experienced planners
              pad the time they think each step will take, doubling it for any
              given plan. Then, when the plan proves to be accurate, the planner
              is considered to be a genius. In reality, the planner thought "this
              step should take four days, so I'll put down eight".
              It's Not Done Until It's Tested
              Many sys admins like their work because of the technical challenges, 
              minute details, and creative processes involved in running systems. 
              The type of personality drawn to these challenges is not typically 
              the type that will thoroughly test a solution, and then retest after 
              changes to variables. Unfortunately for them, testing is required 
              for system assurance, and for good systems administration. I shudder 
              to think how much system downtime, and sys admin time, has been 
              wasted by less-than-thorough testing.
              Never Change Anything on Fridays
              Whatever the cause (in a hurry to get home, goblins, gamma rays),
              changes made before weekends or trips frequently turn into disasters.
              Do not tempt the fates -- just wait until you have a couple
              of days in a row to make the change and monitor it. Everyone
              will be happier.
              Use Defaults Whenever Possible
              I recall a conversation with a client in which the client was trying
              to go outside of the box, save some time and money, and produce 
              a complex but theoretically workable solution. The solution had 
              problems, was difficult to debug, was difficult for vendors to support, 
              and generally helped make the company miserable for a while.
              In another example, some sys admins make changes for convenience 
              that end up making the system different from others. For example, 
              some still put files into /usr/local/bin, even though that
              has been deprecated for many years. It may make the users'
              lives easier because that's where they expect to find files,
              but other admins may be unpleasantly surprised when they use standard 
              methods and find they conflict with the current system configuration.
              Create a Backup 
              Back up the entire system if making systemic changes, such as 
              system upgrades or major application upgrades. Back up individual 
              files for minor changes.
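For the minor-change case, a small helper can make the backup step nearly effortless. The following is only a sketch in Python, not a tool from this column: the function name backup_file and the ".bak" naming scheme are assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, stamp_format="%Y%m%d-%H%M%S"):
    """Copy `path` to a timestamped sibling file and return the backup's path.

    The name `backup_file` and the `<name>.<timestamp>.bak` scheme are
    illustrative assumptions, not an established convention.
    """
    source = Path(path)
    stamp = time.strftime(stamp_format)
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    if backup.exists():
        # Refuse to overwrite an earlier backup made in the same second.
        raise FileExistsError(f"backup already exists: {backup}")
    # copy2 copies the contents and tries to preserve timestamps and
    # permission bits; it does not preserve ownership.
    shutil.copy2(source, backup)
    return backup
```

Calling backup_file on a configuration file before editing it leaves a dated copy beside the original, so backing out a bad change is a single copy in the other direction.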
              Avoid Poor Decisions from Above
              Management can have bad data and make bad decisions, or can even 
              have good data and make bad decisions. Admins, however, have to 
              live with (and suffer with) these decisions. Take a stand, with 
              reason and data, to "encourage" the correct decision. 
              Even if you lose the battle, you can always say "I told you 
              so" and feel good about yourself (while looking for a new job).
              If You Haven't Seen It Work, It Probably Doesn't
              Products are announced, purchased, and installed with the expectation 
              that they will perform as advertised, and sometimes they actually 
              do. Most of the time, however, they are over-promised and under-delivered.
              If You Are Constantly Fighting Fires, Find the Source
              Fighting computer system "fires", instead of finding 
              the causes of those fires, wastes many man-hours. To avoid this 
              problem, automate whenever possible (especially monitoring and log 
              checking).
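As one illustration of automated log checking, the sketch below scans log lines for a few trouble patterns. It is a minimal example under stated assumptions: the pattern list, the function names scan_log and scan_log_file, and the idea of returning line numbers are all hypothetical choices, and a real site would tune the patterns to its own logs.

```python
import re

# Hypothetical trouble patterns; adjust them to match your own systems' logs.
DEFAULT_PATTERNS = (
    r"\bpanic\b",
    r"\berror\b",
    r"\bwarning\b",
    r"file system full",
)


def scan_log(lines, patterns=DEFAULT_PATTERNS):
    """Return (line_number, text) for every line matching any pattern."""
    matcher = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if matcher.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path, patterns=DEFAULT_PATTERNS):
    """Scan the file at `path`, replacing any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return scan_log(handle, patterns)
```

Run on a schedule, with the hits mailed to the admin, a check like this can surface a recurring problem before it turns into the next fire.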
              If You Don't Understand It, Don't Play with It on 
              the Production Systems
              In my university days, there was a student programmer who noticed
              the key switch on our main production Sun server, wondered what
              it did, and turned it to find out. He probably should have asked
              someone about it first, or at least tried it on a test system, which
              is the case with just about everything administration-related.
              If It Can Be Accidentally Used, and Can Produce Bad Consequences,
              Protect It
              For example, if there is a big red power button, right around 
              shoulder height, at a spot on a wall that is just begging to be 
              leaned against, be sure to put something uncomfortable around it 
              to discourage that action. Otherwise the power could be lost in 
              the entire workstation lab and lots of students could lose their 
              work. Not that such a thing happened on the systems I was managing...
              This rule should be extrapolated to just about everything users 
              have access to, hardware or software.
              Ockham's Razor Is Very Sharp Indeed
              William of Ockham (or Occam) advanced humanity immensely when 
              he devised his "razor", or the "law of parsimony", 
              which says that the simplest of two or more competing theories is 
              the preferable one.
              In one example where this rule was ignored, a production system 
              kept booting and crashing during the boot. The problem appeared 
              in the boot sequence when the disks were being checked with fsck
              and mounted. It appeared that the disks or controller had a problem, 
              but replacing the controller, moving the disks, and other time-wasting 
              experiments proved it could not be the disk. A couple more disk-oriented
              experiments and lots of head-scratching later, a similar system 
              was examined and it was noticed that right before the disks were 
              enabled, networking was enabled. After the interface was brought 
              up, packets started flowing, and during the disk-check step a bad 
              packet was being received and causing the system to crash. Once 
              the SGI server was prevented from annoying the Sun by sending it 
              large packets (this was a long time ago), abject depression set
              in about the amount of human time we had wasted and downtime we had
              caused by not looking for the root cause.
              What Rules Do You Live By?
              I hope you enjoyed these rules, and the practical examples of 
              their use and abuse. If you have any to add to the "official" 
              golden rules, please send them to me at pbg@petergalvin.org.
              Peter Baer Galvin (http://www.petergalvin.org) 
              is the Chief Technologist for Corporate Technologies (www.cptech.com), 
              a premier systems integrator and VAR. Before that, Peter was the 
              systems manager for Brown University's Computer Science Department. 
              He has written articles for Byte and other magazines, and 
              previously wrote Pete's Wicked World, the security column, 
              and Pete's Super Systems, the systems management column for 
              Unix Insider (http://www.unixinsider.com). 
              Peter is coauthor of the Operating System Concepts and Applied
              Operating System Concepts textbooks. As a consultant and trainer,
              Peter has taught tutorials and given talks on security and systems 
              administration worldwide.