

Questions and Answers

Bjorn Satdeva

I often wonder about the current state of system administration practices. When I started doing UNIX system administration almost 20 years ago, it was a hit-or-miss affair. I had no mentor to ask questions of, and the only written materials available were the UNIX manuals that came with the system. Thus, the only way to learn the system in those days was through trial and error. At least the system was a lot smaller in those days. UNIX version 7 did not have any networking capabilities, unless you call UUCP networking. I did not administer my first TCP/IP network of machines until 1984, so although the available educational material was scarce, the required learning curve was less steep than it is today.

However, now almost 20 years later, it appears to me that too many sites are still doing system administration in an ad hoc manner. This is in spite of the fact that in 1997 you can attend classes in system administration and find a large number of books offering advice and guidance on the topic. So, why has the situation not significantly improved?

For one thing, we need to know and be able to support larger and more complicated systems. We also need to be able to support a collection of widely differing systems, both UNIX and non-UNIX based.

Also, we still do not have adequate tools. In spite of the many tools available, both as freeware and commercial software, there is nothing that will decisively help the system administrator maintain his or her site. Part of the problem is that system administration abounds with policy issues (how our organization does business). Adding to that problem is the fact that many vendors still do not understand that it takes a system administrator to understand the needs of a system administrator, and that a programmer, however good at his trade, will not have the experience required to develop adequate tools without guidance and supervision. Such development requires the involvement of an experienced system administrator, something that rarely happens.

Management in many organizations also carries a large portion of the responsibility. In my consulting business, I have seen many situations in which the system administration staff was unable to provide adequate support because they had been set up to fail by their management, whether through inadequate staffing, lack of funding, lack of organization, or lack of reasonable support for their work. The same can sometimes be said of the user community. Especially at sites where the system administration team is unable to cope effectively, the users may, through their actions, make the situation worse. That the users' actions are justifiable and understandable from their perspective does not change the big picture much.

None of this is new, and that is not the reason why I have chosen to bring it up. Over the years I have come to believe that to a degree this situation is created by the system administrators themselves. While this may sound like victim bashing, please bear with me, and I will try to explain. At almost every site I have come in contact with, system administration has been practiced in a very ad hoc manner. As a consultant, I am naturally in contact mostly with sites that have a problem needing a solution. However, the problem appears to be fairly universal, based on evidence from social gatherings of system administrators. One favorite pastime of system administrators is sharing horror stories. This was the case at the early LISA conferences, and it seems still to be the case today. However, nobody seems to notice that many of those awful situations could have been avoided if the site had developed good and systematic system administration practices. Most system administrators are able to justify the situation by showing how overworked they are, but they fail to understand that the lack of good practices adds significantly to their workload. Building good practices takes time, but such practices will reward the people who use them by freeing up time previously consumed by fighting fires or tracking problems.

Often when I talk with system administrators, they flatly deny that this is the case, and state they do not have time to develop such practices. However, when I ask some simple questions about the way they are practicing system administration, they start to see that there might indeed be room for improvement.

I think it could be very interesting to gather some substantial and systematic data about this topic and publish the findings. I have in fact decided to do just that. To do so, I will need data from a significant number of system administrators from all parts of the world. I have created a questionnaire on my Web site, which I would like all our readers to fill out. All information will be treated as absolutely confidential, and everybody who fills out the questionnaire will, if they want, receive a copy of the completed paper.

So, please come on over and let me know about your system administration practices. The URL is http://www.sysadmin.com/questions.

In the September issue, there was a question from one reader who was unable to create a file. David A. Bandel sent me email stating that he had seen very similar behavior on systems using dynamically linked libraries, on which libc had been damaged. I thought his observation worth sharing with the readers of Sys Admin magazine. Bandel writes:

I have seen similar circumstances produced both by an administrator's mistake and by file corruption. In both cases, the fault lies with libc.

In the first instance, the administrator wanted to upgrade libc, but not understanding the importance of this library, deleted the link to the actual library (instead of replacing it with ln -sf) and of course could not recreate it.

In the second instance, libc became corrupted because it occupied a bad sector on the disk. Any program that relies on libc will obviously not work - often this includes shutdown, mount, ln, and any number of others.
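
To illustrate the first case: the way to avoid the trap is to let ln replace the link in a single operation, so that a working libc is present at every moment. A minimal sketch (the library names are Linux-style examples and will differ on your system):

   # Unsafe: after the rm, the dynamically linked ln can no longer find libc,
   # and the link cannot be recreated:
   #   rm /lib/libc.so.5
   #   ln -s /lib/libc.so.5.4.33 /lib/libc.so.5
   # Safe: replace the link in one step, so libc is never missing:
   ln -sf /lib/libc.so.5.4.33 /lib/libc.so.5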

And now this month's questions:

 Q Bjorn, I read your response to a reader's question in July's Sys Admin regarding the port setup in /etc/services. This has spurred me to find an answer to a problem that has been bugging me for about four months. The problem is this: I have an anonymous ftp port and a Sybase server on the same box. I want to monitor the connections that go through these ports periodically.

I know that for the ftp port I can use the -L switch in /etc/inetd.conf to write logins to /var/adm/messages. But how do I monitor or view the connections that use our Sybase port (5034)? Is this a C programming job (beyond my scope), or can I do it with shell scripts?

 A You do not describe how the two services are used. However, if you are offering anonymous ftp on a machine that also supports a database for internal use, you have a major security problem on your hands. If this is the case, the only prudent action is to move the ftp server to a disposable machine outside your security perimeter, i.e., outside your firewall. Using the -L switch for the ftp server will indeed give you information about who accessed the system. However, it will only give you that information after the fact, and you will only know if you take the time to read the log files. If you use the Firewall Toolkit from TIS (ftp://ftp.tis.com), you will be able not only to log all accesses, but also to build more security around the ftp server than you can with plain ftpd alone. As an alternative, you can use Wietse Venema's TCP wrapper, found at:

ftp://www.sysadmin.com/pub/admin/firewalls/tcp_wrappers

which will give you some ability to filter connections. But it will not give you the advantage of changing the server's root directory within the filesystem (a chroot environment) the way the Firewall Toolkit does.
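
As a rough illustration, the wrapper is controlled by two files, /etc/hosts.allow and /etc/hosts.deny (a sketch only; the ftp daemon name varies from system to system, so check your own inetd.conf):

   # /etc/hosts.allow
   in.ftpd: ALL                  # leave anonymous ftp open to everyone
   # /etc/hosts.deny
   ALL: ALL                      # refuse every other tcpd-controlled service

Every connection that passes through tcpd is also logged via syslog, so you get a connection record for free.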

You can create listings of who connects to the Sybase port with a shell script or Perl program, if you use a program like Van Jacobson's tcpdump to get the information off the network. This program can be found at:

ftp://ftp.sysadmin.com/pub/admin/tools/networking/tcpdump

This strategy will not, however, prevent anybody from connecting to the server.
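
To make the tcpdump approach concrete, something like the following logs one line for every connection attempt to the Sybase port (a sketch only; it assumes tcpdump is installed and that the log file location suits your site):

   # log every TCP connection setup (SYN) to port 5034, one line per attempt
   tcpdump -l -n "tcp dst port 5034 and tcp[13] & 2 != 0" \
       >> /var/adm/sybase-connections.log &

The filter matches only packets with the SYN flag set, so established sessions do not flood the log.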

I do not know much about Sybase, but I believe that the TCP port number is not defined in /etc/services. It can therefore be problematic to determine which ports are used by the server. You can use the portscan program from the Firewall Toolkit to find out which TCP ports your system is listening on, which is always a good practice regardless. You might get some unpleasant surprises when you find out what services it is offering to the network. It is certainly better that you find out than that somebody else does.
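
If you do not want to install the toolkit just for this, the standard tools will give you a rough first look (a sketch; netstat options and output format vary from vendor to vendor):

   netstat -a -n | grep LISTEN   # every TCP port the machine is listening on
   grep 5034 /etc/services       # see whether the port has been given a name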

You might be able to use the TCP wrappers to limit access to the Sybase database. Try it out and see if it works. If it does not work right away, try using the tcpdump program to find out what is happening on the network. It might give you a clue to the problem.

 Q In your column a while back, you suggested some ways of checking the daily backup. It seems overkill to me; do you really think these elaborate procedures are necessary?

 A This question refers to suggestions I gave earlier, which in brief consist of a quality assurance test of the backup: restoring a randomly chosen file each day and doing a full restore of a filesystem each month.

If you really think it is overkill, you do not have to follow this recommendation. However, if you consider the possibilities of what can go wrong, and the consequences for your organization, your own professional reputation, and even your job security, then it might be worthwhile for you to reconsider.

During my years as a system administrator and in system administration consulting, the number one cause of problems that I have seen has been failed backups. The problem is especially big for sites that use homegrown backup shell scripts. Most backup scripts do not detect problems very well, and I have seen clients who were unable to restore critical data because a failure in the daily backups had gone undetected for some time.
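
A common gap is that the script never checks whether the backup command actually succeeded. Even something as small as the following catches many silent failures (a sketch only; the dump command, tape device, and mail alias are assumptions to be adapted to your site):

   # excerpt from a nightly backup script
   dump 0uf /dev/nrst0 /home
   if [ $? -ne 0 ]; then
       echo "dump of /home failed on `hostname`" | mail backup-admin
   fi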

Even commercial backup programs can fail, and so can the equipment. I still maintain that the only way you can know that data is written to the backup media is to perform a quality assurance check and go through some kind of restore operation. It is possible for the hardware to fail and produce backup tapes that cannot later be read. I know of some sites that read the "table of contents" back from the tape, but with the possible exception of tar, this is not a reliable mechanism. Several backup programs, such as dump, produce a separate table of contents, which can be read very quickly even while the data area is damaged or unreadable.
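
For the daily check, a short script that pulls one file back off the tape and compares it with the original on disk is usually enough. A minimal sketch, assuming a tar-format backup on /dev/nrst0 in which etc/hosts was written with a relative path (substitute your own device, backup program, and a randomly chosen file):

   #!/bin/sh
   # restore a single file from last night's tape and compare it to the original
   cd /tmp || exit 1
   tar xf /dev/nrst0 etc/hosts
   if cmp -s /tmp/etc/hosts /etc/hosts; then
       echo "backup QA check passed"
   else
       echo "backup QA check FAILED - do not trust this tape until you know why"
   fi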

In my opinion, if you do not make daily backups, do not create an effective quality assurance procedure, or do not follow the procedures you have created, you are playing Russian roulette with your livelihood and your professional reputation. But as I said, if you disagree, you do not have to follow my recommendations.

 Q We are a small ISP, and we are finding that our name server needs to support a growing number of domains. This has led to the data becoming disorganized and sometimes not being maintained correctly. Do you have any suggestions for a strategy for maintaining the name server data files in a reasonable manner?

 A Together with sendmail, the name service must be the most misconfigured service on the Internet. However, getting it right is not all that hard. The new BIND implementation has many data consistency checks that were not present in the older versions, so by paying attention to the messages generated by named, you will be able to catch some of the worst mistakes. Van Jacobson's team at the Berkeley Lab has built a very nice utility, called nslint, which provides additional checks and will find many common problems, such as missing reverse addresses. If you maintain a larger name server, this utility is simply priceless.

Another way to reduce the number of problems is to have a clear procedure for adding, deleting, and modifying named entries. If your procedures require people to enter or modify data in the reverse zone files before they make any changes to the regular zone files, many problems can be avoided, because one of the most common mistakes is forgetting to update the reverse zone. Also, it is a good idea to require people to use nslookup or dig to verify that added or modified host entries produce the expected result. Another common mistake when updating the name server is forgetting the period after the domain name, which results in the local domain being appended to the entry. In other words, heimdal.sysadmin.com becomes heimdal.sysadmin.com.sysadmin.com.
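
Such a verification can be as simple as querying the primary server directly after the change (a sketch; the hostname is the example from above, and the address is made up):

   nslookup heimdal.sysadmin.com localhost   # forward lookup against the primary
   nslookup 192.168.1.5 localhost            # the reverse lookup should name the same host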

About the Author

Bjorn Satdeva is the president of /sys/admin, inc., a consulting firm which specializes in large installation system administration. Bjorn is also co-founder and former president of Bay-LISA, a San Francisco Bay Area user's group for system administrators of large sites. Bjorn can be contacted at /sys/admin, inc., 2787 Moorpark Ave., San Jose, CA 95128; electronically at bjorn@sysadmin.com; or by phone at (408) 241-3111.