Article

Questions and Answers

Bjorn Satdeva

I'm fairly new to UNIX system administration, and I want to learn more about shell scripting. I can write very basic shell scripts but need to learn it more in depth. Some have told me that I should just learn Perl and write my scripts in that. Is this realistic or should I go with shell scripting, at least initially?

Whoever told you to use Perl instead of shell scripts was absolutely right. Using any of the UNIX command shells for anything other than the smallest scripts is silly. It is much, much harder to write a fully debugged and reliable shell script than it is to do the same in Perl. To mention just one issue, when you run a shell script, the shell interpreter is only checking the syntax of the statements you execute. If you have a syntax error in a branch you do not test, the script will crash if you later execute that branch.

We are a large HP/UNIX shop (300+ servers) and want to create an in-house certification process for our Sys Admins that steps them through career progression from console operators to systems engineers. Are you aware of, or can you recommend, a structured process or program that provides progressive training and testing to qualify technical staff for increasingly more challenging UNIX work?

Your approach is a good one. Many large UNIX installations have learned that grooming their own system administrators is the best way to get a group of high-quality people together. However, be aware of a pitfall. When you say certification, it makes me think of some of the existing certification programs, where the focus is on memorizing certain things, rather than getting a good understanding of the underlying issues. I think it is best to have people with troubleshooting and problem resolving skills, and not just ones who have memorized the certification manual.

Unfortunately, I cannot point you to an existing structured process. However, if you look at the SAGE System Administration Job Descriptions (www.usenix.org/sage/jobs/sage-jobs.html), you will have a skills model that can be applied to your training.

Which backup method/software is better, tar or cpio? What exactly is the difference between them? Can I compress/decompress data during backup/restore operation?

This seems to be an eternal question. In reality, both tools are unsuited for doing backup. It is necessary to make a clear distinction between backup (which enables the system administrator to recover individual files or entire disk volumes lost very recently) and archiving (which is a way to create a snapshot at a given time, e.g., a development project). The latter should include all the related files (including development tools, such as compilers and loaders).

tar is often a good program for archiving purposes. It is for this purpose that it originally was written (tar = tape archiver). cpio is, at least in my opinion, an anachronism, which belongs back in history with other obsolete archiving tools such as volcopy.

cpio and tar suffer from the problem of accessing all files through the UNIX file system. This is slower than accessing the raw disk directly, and therefore, they have the problem of not understanding files with "holes", such as DBM files. If you back up such files with cpio or tar (except GNU tar, which has provisions to avoid this), then when you restore that file, you will get a file without the holes, which most likely will no longer fit on the disk.

For backup, I prefer to use dump and restore (ufsdump & ufsrestore on System V machines, including Solaris). It has been my experience that these programs are both faster and more reliable than for backup, compared to tar and especially cpio. If you disagree, please read Elizabeth Zwicky's paper "Torture-testing Backup and Archive Programs: Things you Ought to Know But Probably Would Rather Not" from the Usenix LISA V Conference Proceedings. It might change your mind about backup programs.

How should I automate backup on alternate Saturdays? At the same time, should I also shutdown databases? After doing backup again, should I start the database automatically?

The standard UNIX scheduler (cron) does not solve this problem gracefully. With cron, you can specify that a given command must be executed at a given time, based on cron's time parameters (which are the minute of the hour, the hour of the day, the day in month, the month in year, and the day in the week, or any combination thereof).

There is no facility that allows the system administrator to directly specify that a job should be executed every other Saturday. You either need to get a commercial scheduler, or you can go the traditional system administrator's way and write a wrapper.

If you choose to automate the backup process (including stopping and starting your database server), make sure that you have some kind of human review in the loop to catch possible problems. I have all too often seen backup processes automated so much that nobody was getting notified of possible problems. It is better to learn about problems with your backup software before you get to the point were you are unable to restore lost data.

I'm looking for a utility to clean up (delete old messages) users' mailboxes automatically. It should also include an exclude list.

I know of no tools other than the normal mail readers that can do this. You will need a tool that understands the mailbox format and can safely rewrite the mailbox.

There are, however, other issues surrounding this question. The question comes from Europe, and the legal issues there might be very different from in the United States. In any case, several countries now have email privacy laws, and I am not sure how deleting email from users mailboxes would be interpreted under such a law.

Thus, this is an area where it would be wise to tread very carefully. It might simply be better to disallow new mail delivery if the size of the system mailbox file grows bigger than a certain size.

From a system administration perspective, you can probably safely move a large mailbox file to the user's home directory, and then send them an email notifying them of the move and telling them how to access the old mail files.

I'm looking for some info about system administration history. Do you know where I can get more information, such as online documents, books, etc.?

I don't know of any books dedicated to the history of UNIX system administration. However, there are two books that tell a more generalized history of UNIX that you might like. One is A Quarter Century of Unix by Peter H. Salus, and the other is Life With Unix: A Guide for Everyone by Don Libes and Sandy Ressler.

If these books do not meet your needs, you could try finding people who have done UNIX system administration for a long time and talking with them. The problem is, however, in the past system administration was considered something you did until "you grew up and became a real programmer". Back in 1988, when I started /sys/admin as a UNIX system administration consulting business, most of my friends thought I was committing professional suicide. However, there were even then a few people who found system administration interesting. You will find the names of many of these people in early LISA conference proceedings.

How can I restrict a user to some specific port so that he can have services from that specific port and none other.

The short answer is "you can't". The TCP/IP network model is one of servers and clients. There is no concept of users and user IDs. If you want only certain users to get access to a port, you will need to implement user authentication and user authorization (note that the two are not the same).

Is there a way to convert from NIS+ to NIS on the Solaris? If so, is it worth it?

I do not know of an easy, straightforward way to do this. NIS+ keeps all data in data files separate from the standard UNIX configuration files. This has been done to offer some new functionality in NIS+, but with a clear loss of backward compatibility. In answer to the question of whether it is worth going back to NIS, the answer will have to be "it depends". If NIS+ does not offer you any additional functionality over NIS, then it is certainly worth giving it some thought. When you make that analysis, don't forget that NIS+ has a number of security enhancements too, although much of that gets lost if you must use NIS+ in NIS compatibility mode.

The big advantage of NIS is that it uses the standard UNIX configuration files (/etc/hosts, /etc/passwd, etc.). It is much simpler to write tools that interact with those text files, compared to the data files from NIS+.

Our UNIX box has been acting a bit strange lately. For the past two days, for no apparent reason, the system reboots by itself. It has already done this twice. We checked the system in and out for any hackers, and we found zero. We just recently changed our root passwords, and only two users, including me, have access to sudo (su is disabled for security reasons). We have a very tight box, and we check major locations for hacking scripts (such as rootshell and bugtraq) and have found none! What could it be?

There can easily be other reasons for system crashes than hacker activities. Check the system log files for system-related error conditions, and check your change log to see if any system changes were made around the time the system started crashing. If you are able to get a kernel dump, it will also be a good idea to check what the system was doing when it crashed.

Security is very important, but we should never lose sight of the simpler solutions. System security problems are certainly a much bigger problem than in the past, but the majority of the problems we encounter are still caused by simple mistakes.

Currently, I have a UNIX machine running a Web server. The users are complaining that the Web access is slow. I have done vmstat, iostat, and netstat, but I do not know how to interpret the information. I have printed out the man pages to get some reference. Could you please help me understand how to interpret the numbers? Thank you.

A complete guide to those programs would probably take a small book, so I will limit this to the most important items.

vmstat is, first of all, a tool to monitor memory usage. The output format differs between the various versions of UNIX, but the following should be true on almost every type of UNIX. The most important column is the one marker "po" (page out). This tells you the number of memory pages that have been paged out to the disk. It is okay if the system is doing a little bit of paging now and then, but if it gets to be too much, performance will suffer. Another interesting column is "procs", the number of processes in various states. "r" is the number of processes in the run queue (i.e., waiting for a CPU slice). "b" is the number of processes waiting (typically on I/O), and "w" is waiting (swapped out) processes. The latter should nearly always be zero on a system that has enough memory.

vmstat also gives some CPU statistics. "sy" is the percentage of the CPU time used for system processes, and "us" is the same for processes in user space. The split between those two values depends on the type of load on the system, but is typically a little over twice the time used in user space compared to system space. "id" is the CPU idle time. If it gets to zero, it means you are using all the available CPU cycles. This might mean that you need a faster CPU. Check the number of waiting processes to see how long the process queue is.

iostat gives some of the same information, but is otherwise used to monitor the disk access. You can use this program to see if you have a balanced load between the disks. An even disk load is almost impossible to achieve, but aim for a reasonable distribution. If you have many disks, but all the traffic is going to only one or two, you might consider changing the way you have your file systems allocated.

Both of the above commands will, when called without arguments, attempt to provide data with some kind of average since the system were booted last time. This information is almost always useless, and may not be traceable at all. So when you use these commands, keep them running and watch the output for problems. A good interval is normally around 5 seconds.

netstat is very different from the two above. Find the option for your version of UNIX that shows you the number of collisions and errors on the network (this is -i on many systems). The collisions should be an order of magnitude smaller than the number of packets transmitted, and the number of errors should be very, very low.

About the Author

Bjorn Satdeva is the president of /sys/admin, inc., a consulting firm which specializes in large installation system administration. Bjorn is also co-founder and former president of Bay-LISA, a San Francisco Bay Area user's group for system administrators of large sites. Bjorn can be reached at questions@sysadmin.com.