
A Host Health Probe

John Lees

The Department of Computer Science at Michigan State University operates several hundred computers for its students, staff, and faculty. Computers are located in three buildings on a large campus. Although we have a small systems management staff (three full-time people and ten or so half-time graduate students), we are expected to keep the computers up and running around the clock. We have developed a number of tools to ease this task. The subject of this article is a perl script named probe, which checks on the health of each of our systems every morning and mails a status report to the appropriate system managers.

We operate a total of about 250 computers running the UNIX operating system. Most are Suns running SunOS 4.1.x or Solaris 2.x, but there are a few NeXTs, half a dozen DEC Alphas running OSF/1, and an Apple workgroup server running A/UX. Any tool we use for systems management has to work in this diverse and ever-changing environment. We've chosen Larry Wall's perl language because the perl interpreter is easy to port to new versions of UNIX, is well documented by two O'Reilly handbooks, and provides a powerful base for writing custom scripts.

The probe script began as a way of checking that the xntpd daemon, which keeps computer clocks synchronized using the Network Time Protocol, was running on all our computers. We quickly saw the utility of making a few more simple checks on all our computers every day, and the probe script grew into its present form.

Our computers are divided into seven NIS (Sun Network Information System) domains for the purpose of controlling access. Because the probe script is driven by the netgroup map, it is necessary to understand a little about this database and about Sun's NIS (see the sidebar, "Sun Network Information Systems (NIS)").

The probe script uses the NIS netgroup map to find the hostnames of the computers to probe. This adds to the complexity of the script, but we make too many changes to the netgroup (and hosts) databases for any other scheme to be practical. A simple list of systems to probe would have to be updated several times each week.

Sample Output

Listing 1 shows the kind of output generated by the probe script. This sample has been edited to condense it a little, but it gives the general idea. The fields for each computer being probed are:

The hostname

Absolute value of the offset, in seconds, between the time on the computer running the probe and the computer being probed

Utilization of the /, /usr, /var, and /home filesystems

System load

Number of users

Status of six selected daemons (lowercase if not running)

Uptime in days

Hostname of the NIS server to which the computer is bound

The probe Script

Listing 2 shows the probe script. I'll briefly discuss the major sections, but see the comments in the script for details.

The main program sets up global constants and variables, then calls the getngrp() subroutine to build the global list of NIS domains and computers within each domain. Using this list, the do_poll() subroutine probes each computer.

The do_poll() subroutine does most of the work. First it uses the newping program to determine if the computer is up and usable (newping is a modified version of a program published in Sys Admin 2.4). It next determines the time offset between the two computers, then forks a child process to gather more information. The child uses rsh to run several commands on the computer being probed. Running these commands in a separate process decreases the chance that a single hung rsh command will hang the entire probe. Finally, all the information is formatted and displayed using the do_print() subroutine.
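The idea of isolating each remote command so that one hung rsh cannot stall the whole run can be sketched as follows. This is a minimal illustration in Python rather than the perl of Listing 2; the names run_with_timeout and remote_uptime, the 30-second limit, and the choice of uptime as the remote command are all assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv; return its stdout, or None if it fails or exceeds timeout_s."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # on expiry the child is killed and TimeoutExpired is raised
            text=True,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the command could not be started at all
        return None
    if done.returncode != 0:
        return None
    return done.stdout


def remote_uptime(host, timeout_s=30):
    # Hypothetical helper: "rsh HOST uptime" stands in for the commands the real probe runs
    return run_with_timeout(["rsh", host, "uptime"], timeout_s)
```

A caller would treat a None result as "no data for this host" and move on to the next computer instead of waiting indefinitely.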

The do_print() subroutine writes each report to stdout, mails it to the manager of the current NIS domain, or both. When reports are being mailed, this routine starts a new report each time a new domain begins.
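The per-domain behavior can be pictured with a short sketch, written in Python rather than perl. It assumes the probe results arrive as (domain, status line) pairs in domain order; the function name build_reports and the report heading are invented for the example, and delivery by mail is left to the caller.

```python
from itertools import groupby


def build_reports(records):
    """Group consecutive (domain, status_line) records into one report per domain."""
    reports = []
    # groupby starts a new group whenever the domain changes, which mirrors
    # beginning a new report when a new domain begins
    for domain, group in groupby(records, key=lambda rec: rec[0]):
        lines = [line for _, line in group]
        body = "Status report for domain %s\n%s\n" % (domain, "\n".join(lines))
        reports.append((domain, body))
    return reports
```

Because groupby only merges adjacent records, this sketch depends on all hosts of a domain being probed together, as they are when the probe walks one domain at a time.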

The getngrp() subroutine reads the NIS netgroup map and builds a global associative array of all the netgroups and computers. (This subroutine has been made into a perl library routine, and has found use in a number of our other local scripts.)
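To give a feel for what reading the netgroup map involves, here is a minimal sketch in Python (not the perl library routine itself). It assumes the map is available as text lines of the form "name member member ...", such as ypcat -k netgroup prints, where each member is either a (host,user,domain) triple or the name of another netgroup. The function names and the sample group names are hypothetical.

```python
import re

# Matches one (host,user,domain) triple; any of the three fields may be empty
TRIPLE = re.compile(r"\(([^,()]*),([^,()]*),([^,()]*)\)")


def parse_netgroups(text):
    """Map each netgroup name to (hosts, subgroups) from 'name member ...' lines."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        hosts = []
        for match in TRIPLE.finditer(rest):
            host = match.group(1).strip()
            if host and host != "-":  # an empty or "-" host field names no host
                hosts.append(host)
        # Whatever is left once the triples are removed is a list of nested netgroup names
        subgroups = TRIPLE.sub(" ", rest).split()
        groups[name] = (hosts, subgroups)
    return groups


def expand(groups, name, seen=None):
    """Return all hosts in a netgroup, following nested groups and ignoring cycles."""
    seen = set() if seen is None else seen
    if name in seen or name not in groups:
        return []
    seen.add(name)
    hosts, subgroups = groups[name]
    result = list(hosts)
    for sub in subgroups:
        result.extend(expand(groups, sub, seen))
    return result
```

Calling expand() on a top-level netgroup yields the flat list of hostnames to probe for that domain.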

Like many system administration tools, the health probe script grew over time rather than being designed as a single piece. If I were to redesign it, I would probably break it into two pieces, one running on the master and one running on each computer, to gather the information more efficiently and to smooth out the differences among the five different versions of UNIX we have in use.

Using the probe Script

We run the probe script once each day, in the early morning so we can catch problems before the heavy user load begins. Because we have had quite a few network problems, we run the probe script indirectly, using a Bourne shell wrapper. The wrapper (see Listing 3) attempts to kill the probe script and any hung children if we are having a bad network day.
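The wrapper's job of killing the probe together with any hung children can be approximated by running the probe in its own process group and killing the whole group when a time limit expires. The sketch below is in Python rather than the Bourne shell of Listing 3, works only on POSIX systems, and uses an invented function name and a caller-supplied time limit.

```python
import os
import signal
import subprocess
import sys


def run_guarded(argv, limit_s):
    """Run argv in its own process group; kill the whole group after limit_s seconds.

    Returns the exit status, or None if the time limit was reached.
    """
    # start_new_session=True makes the child a session and process group leader,
    # so its process group ID equals its PID
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=limit_s)
    except subprocess.TimeoutExpired:
        # Kill the command and every descendant still in its process group
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return None
```

A descendant that deliberately moves itself into a different process group would escape this kill, so the approach is a convenience rather than a guarantee.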

The probe script is normally run with no command-line options. By default, the generated reports are sent to manager@ for each of the top-level netgroups; we have appropriate mail aliases set up for this. The complete set of reports is also copied to stdout, which is then mailed to an appropriate person (me, as I run the probe from my crontab). The -m option suppresses copying all the reports to stdout, and the -s option suppresses mailing the individual reports to lab managers.
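The two options amount to a pair of independent switches, both on by default. A minimal sketch of that logic, in Python rather than perl, might look like this; the function name and the dictionary keys are assumptions made for the example.

```python
import argparse


def parse_options(argv):
    """Translate -m and -s into two independent output switches."""
    parser = argparse.ArgumentParser(prog="probe")
    parser.add_argument("-m", action="store_true",
                        help="do not copy the reports to stdout")
    parser.add_argument("-s", action="store_true",
                        help="do not mail the individual reports")
    opts = parser.parse_args(argv)
    # With no options given, both destinations are active
    return {"to_stdout": not opts.m, "to_mail": not opts.s}
```

Giving both options at once would leave the sketch with nowhere to send its output, so a real script might want to reject that combination.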

Modifying the probe Script for Your Situation

You will almost certainly need to modify (that sounds better than "hack," doesn't it?) the probe script to fit your exact mix of machines. The first thing you have to look at is how the netgroup database is set up. You must either use a scheme like ours or modify the getngrp() subroutine to work with your scheme. Modifying getngrp() should be easy. You can even replace it with a simple routine to read a list of hostnames from a file.
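A replacement that reads hostnames from a file can be very small. The sketch below is in Python rather than perl and assumes a file format of one hostname per line, with blank lines and # comments ignored; that format and the function names are choices made for the example, not something the probe script prescribes.

```python
def parse_hostnames(lines):
    """Return the hostnames found in an iterable of text lines."""
    hosts = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            hosts.append(line.split()[0])     # the first word on the line is the hostname
    return hosts


def read_hostnames(path):
    """Read hostnames from the file at path (one per line)."""
    with open(path) as handle:
        return parse_hostnames(handle)
```

The trade-off is the one noted earlier: a flat file like this has to be kept up to date by hand whenever machines are added or moved.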

Here are several other modifications you may need or want to make.

Use ping instead of newping, if you have not installed newping and experience no problems with the simple ping shipped with your system.

Check on different filesystems. The probe script checks on /, /usr, /var, and /home. This may not be appropriate for you.

Check for a different set of daemons, perhaps adding a section for a different operating system flavor.

And of course it is up to you whether to use the prun wrapper or run the probe script directly.

About the Author

John Lees has an M.S. in computer science and has worked during the past twenty years about equally as a teacher, technical writer, programmer, and system administrator. His computer experience began in the days of front panels and paper tape, and he doesn't have enough fingers and toes to count the operating systems he has used. His love/hate relationship with UNIX dates to early 1985. Currently Mr. Lees is a systems analyst with the Department of Computer Science, and manager of the Pattern Recognition and Image Processing Laboratory, at Michigan State University. He is a member of ACM, Computer Professionals for Social Responsibility, the IEEE Computer Society, the Society for Technical Communication, and the TeX Users Group. He may be contacted as lees@cps.msu.edu.