A Host Health Probe
The Department of Computer Science at Michigan State University operates
several hundred computers for its students, staff, and faculty. The computers
are located in three buildings on a large campus. Although we have
a small systems management staff (three full-time people and ten or so
half-time graduate students), we are expected to keep the computers up
and running around the clock. We have developed a number of tools
to ease this task. The subject of this article is a script
named probe, which checks on the health of each of our computers
every morning and mails a status report to the appropriate manager.
We operate a total of about 250 computers running the UNIX operating
system. Most are Suns running SunOS 4.1.x or Solaris 2.x, but there
are a few NeXTs, half a dozen DEC Alphas running OSF/1, and an Apple
workgroup server running A/UX. Any tool we use for systems management
has to work in this diverse and ever-changing environment. We use
Larry Wall's perl language because the perl interpreter
is easy to port to new versions of UNIX, is well documented in the
O'Reilly handbooks, and provides a powerful base for
The probe script began as a way of checking that the
daemon was running on all our computers. This is a daemon that keeps
computers synchronized using the Network Time Protocol. We soon
saw the utility of making a few more simple checks on all our computers
every day, and the probe script grew into its present form.
Our computers are divided into seven NIS (Sun Network Information
System) domains for the purpose of controlling access. Because the
probe script is driven by the netgroup map, it is necessary
to understand a little about this database and about Sun's NIS (see
the sidebar, "Sun Network Information Systems (NIS)").
The probe script uses the NIS netgroup map to find the host names
of the computers to probe. This adds to the complexity
of the script, but we make too many changes to the netgroup
databases for any other scheme to be practical. A simple list of systems
to probe would have to be updated several times each
Listing 1 shows the kind of output generated by the probe
script. This sample has been edited to condense it a little, but it
gives the general idea. The fields for each computer are:
Absolute value of the offset, in seconds, between the time on the
computer running the probe and the computer being probed
Utilization of the /, /usr, /var, and /home filesystems
Number of users
Status of six selected daemons (lowercase if not running)
Uptime in days
Hostname of the NIS server to which the computer is bound
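
The fields above can be pictured as one line of text per computer. The following is a minimal Python sketch (not the perl of Listing 2) of how such a line might be assembled. The column widths, the function names, and the six daemon names and letters are all assumptions made for illustration; the article does not say which daemons are checked or how Listing 1 is laid out.

```python
# Hypothetical daemons and one-letter codes; the real probe checks six
# daemons that this article does not name.
DAEMONS = [
    ("inetd", "I"), ("cron", "C"), ("sendmail", "S"),
    ("lpd", "L"), ("ypbind", "Y"), ("ntpd", "N"),
]


def daemon_flags(running):
    """One letter per daemon: uppercase if running, lowercase if not."""
    return "".join(
        letter.upper() if name in running else letter.lower()
        for name, letter in DAEMONS
    )


def report_line(host, offset_s, disk_pct, users, running, uptime_days, nis_server):
    """Format one computer's status as a single line (layout is assumed)."""
    disks = " ".join(f"{pct:3d}%" for pct in disk_pct)  # /, /usr, /var, /home
    return " ".join([
        f"{host:<12}",
        f"{abs(offset_s):5d}s",   # absolute value of the time offset
        disks,
        f"{users:3d}u",
        daemon_flags(running),
        f"{uptime_days:4d}d",
        nis_server,
    ])


if __name__ == "__main__":
    print(report_line("alpha", -3, [45, 80, 12, 91], 2, {"inetd"}, 17, "nis1"))
```

Showing a stopped daemon as a lowercase letter keeps every line the same width, so a missing daemon stands out when scanning down a column.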
The probe Script
Listing 2 shows the probe script. I'll briefly discuss the
major sections, but see the comments in the script for details.
The main program sets up global constants and variables, then calls
the getngrp() subroutine to build the global list of domains
and computers within each domain. Using this list, the do_poll()
subroutine probes each computer.
The do_poll() subroutine does most of the work. First it runs
the newping program to determine if the computer is
usable (newping is a modified version of a program published
in Sys Admin 2.4). It next determines the time offset between
the two computers, then forks a process to gather more information.
The child process uses rsh to run several commands on the
computer being probed. A process is forked to do this, to reduce
the chance of hanging the entire probe if one of the commands
hangs. Finally, all the information is formatted and passed to
the do_print() subroutine.
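
The point of running the remote commands in a separate process is that one hung computer cannot stall the whole probe. The following Python sketch shows that idea; it is not the perl fork used in Listing 2, and it swaps in the timeout support of Python's subprocess module. The helper names and the 30-second limit are assumptions.

```python
import subprocess


def run_with_timeout(argv, timeout_s=30):
    """Run argv in a child process.

    Returns the command's stdout, or None if the command fails, cannot be
    started, or is still running after timeout_s seconds (in which case
    the child is killed).
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout if result.returncode == 0 else None


def remote_argv(host, command):
    """Build an rsh command line for one remote command (hypothetical helper)."""
    return ["rsh", host, command]


if __name__ == "__main__":
    # "somehost" is a placeholder; substitute a host that accepts rsh.
    print(run_with_timeout(remote_argv("somehost", "uptime"), timeout_s=10))
```

A host that returns None can simply be reported as unreachable, and the probe moves on to the next computer.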
The do_print() subroutine displays the report on stdout and/or
mails it to the manager of the current NIS domain. When reports are being
mailed, this routine begins a new report each time a new domain is begun.
The getngrp() subroutine reads the NIS netgroup map and builds
a global associative array of all the netgroups and their members. (This
subroutine has been made into a perl library routine, and
has found use in a number of our other local scripts.)
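
To make the idea concrete, here is a minimal Python sketch of reading netgroup entries into a dictionary and expanding one netgroup into host names. It is not the getngrp() of Listing 2. It assumes the usual netgroup text form, in which each line holds a group name followed by (host,user,domain) triples and the names of other netgroups, and it assumes the text has already been obtained (for example, from the output of ypcat -k netgroup). Continuation lines are not handled.

```python
import re

# A (host,user,domain) triple; any of the three fields may be empty.
TRIPLE = re.compile(r"\(([^,()]*),([^,()]*),([^,()]*)\)")


def parse_netgroups(text):
    """Map each netgroup name to (list of hosts, list of nested netgroups)."""
    groups = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        hosts = [
            m.group(1).strip()
            for m in TRIPLE.finditer(rest)
            if m.group(1).strip() not in ("", "-")
        ]
        subgroups = TRIPLE.sub(" ", rest).split()
        groups[name] = (hosts, subgroups)
    return groups


def hosts_in(groups, name, seen=None):
    """Expand a netgroup into host names, following nested groups once each."""
    seen = set() if seen is None else seen
    if name in seen or name not in groups:
        return []
    seen.add(name)
    hosts, subgroups = groups[name]
    result = list(hosts)
    for sub in subgroups:
        result.extend(hosts_in(groups, sub, seen))
    return result


if __name__ == "__main__":
    sample = "labs lab1 lab2\nlab1 (alpha,,) (beta,,)\nlab2 (gamma,,)\n"
    print(hosts_in(parse_netgroups(sample), "labs"))
```

The seen set guards against netgroups that refer to each other, which would otherwise cause endless recursion.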
Like many system administration tools, the health probe
grew over time rather than being designed as one piece. If I were
to redesign it, I would probably break it into two pieces,
one running on the master and one running on each computer, to more
efficiently gather the information and to smooth out the differences
among the five different versions of UNIX we have in use.
Using the probe Script
We run the probe script once each day, in the early morning,
so we can catch problems before the heavy user load begins. Because
we have had quite a few network problems, we run the probe
script indirectly, using a Bourne shell wrapper. The wrapper (prun,
Listing 3) attempts to kill the probe script and any of its
children if we are having a bad network day.
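
The wrapper's job is to act as a watchdog: if the probe and its children are still running after a reasonable time, kill them all. The following Python sketch illustrates that pattern on a POSIX system. It is not the Bourne shell wrapper of Listing 3; the function name, the deadline, and the five-second grace period are assumptions.

```python
import os
import signal
import subprocess


def run_with_deadline(argv, deadline_s):
    """Run argv in its own process group and enforce an overall deadline.

    Returns the exit status, or None if the deadline passed and the whole
    group (the command and any children it started) had to be killed.
    """
    # start_new_session makes the child a group leader, so its pid is
    # also the process group id used by killpg below.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group exited on its own in the meantime
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)  # last resort
            proc.wait()
        return None


if __name__ == "__main__":
    # Stand-in for the probe: a command that starts a child and then hangs.
    print(run_with_deadline(["sh", "-c", "sleep 60 & sleep 60"], 2))
```

Signalling the process group rather than a single pid is what lets the watchdog reach children that the probe forked for individual computers.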
The probe script is normally run with no command-line options.
This sends the reports generated to manager@ for each of the
top-level netgroups. We have appropriate mail aliases
set up for this.
The complete set of reports is also copied to stdout, which
is then mailed to an appropriate person (me, as I run it
from my crontab). A -m option will suppress copying
all the reports to stdout, and a -s option will suppress
mailing the individual reports to lab managers.
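
The two options can be thought of as switches that each turn off one of the two default outputs. Here is a minimal Python sketch of that option handling using argparse. It only mirrors the behavior described above; the attribute names to_stdout and mail_managers are invented for illustration and do not come from Listing 2.

```python
import argparse


def parse_options(argv):
    """Parse probe-style options; both outputs are enabled by default."""
    parser = argparse.ArgumentParser(prog="probe")
    parser.add_argument(
        "-m", dest="to_stdout", action="store_false",
        help="do not copy the complete set of reports to stdout",
    )
    parser.add_argument(
        "-s", dest="mail_managers", action="store_false",
        help="do not mail the individual reports to lab managers",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    options = parse_options(["-s"])
    print(options.to_stdout, options.mail_managers)
```

Running with -s alone is a convenient way to test changes, since the full report still appears on stdout but no lab manager receives mail.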
Modifying the probe Script for Your Situation
You will almost certainly need to modify (that sounds better than
"hack," doesn't it?) the probe script to fit your
exact mix of machines. The first thing you have to look at is how
the netgroup database is set up. You must either use a scheme like
ours or modify the getngrp subroutine to work with your scheme.
Modifying getngrp should be easy. You can even replace it
with a simple routine to read a list of hostnames from a file.
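
A replacement that reads host names from a flat list could be as small as the following Python sketch. It is offered only as an illustration and is not part of the probe script; the one-name-per-line format with # comments is an assumption.

```python
def read_hosts(path):
    """Return the host names listed in a file, one per line.

    Blank lines and anything after a '#' are ignored.
    """
    hosts = []
    with open(path) as handle:
        for line in handle:
            name = line.split("#", 1)[0].strip()
            if name:
                hosts.append(name)
    return hosts
```

The drawback is the one already noted for any simple list: someone has to keep the file current as computers come and go.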
Here are several other modifications you may need or
want to make.
Use ping instead of newping, if you have not installed newping and
experience no problems with the simple ping shipped with your system.
Check on different filesystems. The probe script checks on /, /usr,
/var, and /home. This may not be appropriate for you.
Check for a different set of daemons, perhaps adding a section for a
different operating system flavor.
And of course it is up to you whether to use the prun wrapper or run
the probe script directly.
About the Author
John Lees has an M.S. in computer science and has worked for
the past twenty years about equally as a teacher, technical writer,
programmer, and system administrator. His computer experience began
in the days of front panels and paper tape, and he doesn't have enough
fingers and toes to count the operating systems he has used. His love/hate
relationship with UNIX dates to early 1985. Currently Mr. Lees is
a systems analyst with the Department of Computer Science,
of the Pattern Recognition and Image Processing Laboratory, Michigan
State University. He is a member of ACM, Computer Professionals for
Social Responsibility, the IEEE Computer Society, the Society for
Technical Communication, and the TeX Users Group. He
may be contacted