A Host Health Probe
John Lees
The Department of Computer Science at Michigan State
University operates
several hundred computers for its students, staff, and
faculty. Computers
are located in three buildings on a large campus. Although
we have
a small systems management staff (three full-time people
and ten or
so half-time graduate students), we are expected to
keep the computers
up and running around the clock. We have developed a
number of tools
to ease this task. The subject of this article is a
perl script
named probe, which checks on the health of each of our
systems
every morning and mails a status report to the appropriate
system
managers.
We operate a total of about 250 computers running the
UNIX operating
system. Most are Suns running SunOS 4.1.x or Solaris
2.x, but there
are a few NeXTs, half a dozen DEC Alphas running OSF/1,
and an Apple
workgroup server running A/UX. Any tool we use for systems
management
has to work in this diverse and ever-changing environment.
We've chosen
Larry Wall's perl language because the perl interpreter
is easy to port to new versions of UNIX, is well documented
by two
O'Reilly handbooks, and provides a powerful base for
writing custom
scripts.
The probe script began as a way of checking that the xntpd
daemon, which keeps computers' clocks synchronized using the
Network Time Protocol (NTP), was running on all our computers.
We quickly saw the utility of making a few more simple checks on
all our computers every day, and the probe script grew into its
present form.
Our computers are divided into seven NIS (Sun Network Information
System) domains for the purpose of controlling access. Because the
probe script is driven by the NIS netgroup map, you need to
understand a little about this database and about NIS itself (see
the sidebar, "Sun Network Information Systems (NIS)").
The probe script uses the netgroup map to find the host names of
the computers to probe. This adds to the complexity of the script,
but we change the netgroup (and hosts) databases too often for any
other scheme to be practical: a simple list of systems to probe
would have to be updated several times each week.
Sample Output
Listing 1 shows the kind of output generated by the
probe
script. This sample has been edited to condense it a
little, but it
gives the general idea. The fields for each computer
being probed
are:
The hostname
Absolute value of the offset, in seconds, between the time on the computer running the probe and the computer being probed
Utilization of the /, /usr, /var, and /home filesystems
System load
Number of users
Status of six selected daemons (lowercase if not running)
Uptime in days
Hostname of the NIS server to which the computer is bound
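The daemon-status field is the least obvious of these, so here is a rough illustration of the convention, written in Python rather than the perl of the probe script. It assumes each daemon is shown as a name that is uppercase when running and lowercase when not; the exact tokens in Listing 1 may differ. Apart from xntpd, the daemon names are hypothetical, since the six that are checked are not named here.

```python
def daemon_flags(expected, running):
    """Return one token per expected daemon.

    A daemon that is running is shown in upper case; one that is
    not running is shown in lower case so it stands out in a report.
    """
    return " ".join(
        name.upper() if name in running else name.lower()
        for name in expected
    )


# Hypothetical daemon list: only xntpd is named in the article.
EXPECTED = ["xntpd", "cron", "inetd"]

print(daemon_flags(EXPECTED, {"xntpd", "inetd"}))  # prints: XNTPD cron INETD
```

A reader scanning a long report can then spot a dead daemon by its case alone, without reading every field.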
The probe Script
Listing 2 shows the probe script. I'll briefly discuss
the
major sections, but see the comments in the script for
details.
The main program sets up global constants and variables,
then calls
the getngrp() subroutine to build the global list of
NIS domains
and computers within each domain. Using this list, the
do_poll()
subroutine probes each computer.
The do_poll() subroutine does most of the work. First
it uses
the newping program to determine if the computer is
up and
usable (newping is a modified version of a program published
in Sys Admin 2.4). It next determines the time offset
between
the two computers, then forks a process to gather more
information.
The child process uses rsh to run several commands on the
computer being probed. Running them in a forked child decreases
the chance that one hung rsh command will hang the entire probe.
Finally, all the information is formatted and displayed using
the do_print() subroutine.
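The idea of isolating a remote command so that a hang cannot stall the whole run can be sketched as follows. This is a Python illustration, not the perl code of Listing 2: it uses the timeout support in Python's subprocess module instead of an explicit fork, and the function name run_with_timeout is made up for the example.

```python
import subprocess
import sys


def run_with_timeout(argv, seconds):
    """Run argv and return its stdout as text.

    Returns None if the command cannot be started, exits with a
    non-zero status, or is still running after `seconds` seconds.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=seconds,  # the child is killed when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if done.returncode != 0:
        return None
    return done.stdout.decode(errors="replace")


# Stand-ins for remote commands, so the sketch runs anywhere.
# A real probe might pass something like ["rsh", host, "uptime"].
quick = run_with_timeout([sys.executable, "-c", "print('up')"], 10)
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], 1
)
```

Here quick holds the command's output, while hung is None because the second command was still running when its one-second limit expired.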
The do_print() subroutine displays each report on stdout and/or
mails it to the manager of the current NIS domain. When reports
are being mailed, this routine starts a new report each time a
new domain begins.
The getngrp() subroutine reads the NIS netgroup map
and builds
a global associative array of all the netgroups and
computers. (This
subroutine has been made into a perl library routine,
and
has found use in a number of our other local scripts.)
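To make the netgroup lookup concrete, here is a minimal Python sketch of the same idea; it is not the perl getngrp() routine. It assumes the map is available as text with one netgroup per line (roughly what "ypcat -k netgroup" prints), where each member is either a (host,user,domain) triple or the name of another netgroup. The group and host names are hypothetical.

```python
import re

# Matches one (host,user,domain) triple; any field may be empty.
TRIPLE = re.compile(r"\(([^,()]*),([^,()]*),([^,()]*)\)")


def parse_netgroups(text):
    """Map each netgroup name to (set of hosts, list of nested group names)."""
    groups = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # blank line
        name = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        hosts = {m.group(1).strip() for m in TRIPLE.finditer(rest)}
        hosts -= {"", "-"}  # empty is a wildcard, "-" means no host
        nested = TRIPLE.sub(" ", rest).split()
        groups[name] = (hosts, nested)
    return groups


def expand(groups, name, seen=None):
    """Return every host reachable from `name`, following nested groups."""
    seen = set() if seen is None else seen
    if name in seen or name not in groups:
        return set()  # already visited (cycle) or unknown group
    seen.add(name)
    hosts, nested = groups[name]
    result = set(hosts)
    for child in nested:
        result |= expand(groups, child, seen)
    return result


# Hypothetical map: one top-level group containing two lab groups.
SAMPLE = """
cps     lab_a lab_b
lab_a   (sun1,,) (sun2,,)
lab_b   (alpha1,,) (next1,,)
"""

groups = parse_netgroups(SAMPLE)
print(sorted(expand(groups, "cps")))
```

Keeping the parsed map in a dictionary keyed by netgroup name mirrors the associative array described above, and the seen set guards against netgroups that refer to each other.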
Like many system administration tools, the health probe script
grew over time rather than being designed as a whole.
If I were
to redesign it, I would probably break it into two pieces,
one running
on the master and one running on each computer, to more
efficiently
gather the information and to smooth out the differences
among the
five different versions of UNIX we have in use.
Using the probe Script
We run the probe script once each day, in the early
morning
so we can catch problems before the heavy user load
begins. Because
we have had quite a few network problems, we run the
probe
script indirectly, using a Bourne shell wrapper. The
wrapper (see
Listing 3) attempts to kill the probe script and any
hung
children if we are having a bad network day.
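The wrapper in Listing 3 is a Bourne shell script. Purely as an illustration of the same idea in Python, and not a translation of that listing, the sketch below starts a command in its own process group and kills the whole group if it runs too long, which also takes care of any hung children. The function name and the time limits are made up for the example, and the code assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def run_in_group(argv, limit):
    """Run argv in its own process group, with a time limit.

    Returns the exit status, or None if the command was still
    running after `limit` seconds and the group had to be killed.
    """
    # start_new_session makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=limit)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill child and descendants
        proc.wait()                          # reap the child
        return None


# Stand-ins for the probe, so the sketch runs anywhere.
ok = run_in_group([sys.executable, "-c", "pass"], 10)
stuck = run_in_group(
    [sys.executable, "-c", "import time; time.sleep(60)"], 1
)
```

Killing by process group rather than by a single pid is what lets the wrapper clean up children the probe has forked, not just the probe itself.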
The probe script is normally run with no command-line
options.
This sends the reports generated to manager@ for each
of the
top-level netgroups. We have appropriate mail aliases
set up for this.
The complete set of reports is also copied to stdout,
which
is then mailed to an appropriate person (me, as I run
the probe
from my crontab). A -m option will suppress copying
all the reports to stdout, and a -s option will suppress
mailing the individual reports to lab managers.
Modifying the probe Script for Your Situation
You will almost certainly need to modify (that sounds
better than
"hack," doesn't it?) the probe script to fit
your
exact mix of machines. The first thing you have to look
at is how
the netgroup database is set up. You must either use
a scheme like
ours or modify the getngrp subroutine to work with your
scheme.
Modifying getngrp should be easy. You can even replace
it
with a simple routine to read a list of hostnames from
a file.
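A replacement that reads hostnames from a file could be as small as the following Python sketch (the probe itself is perl, so treat this as an outline of the logic only). The file format is an assumption: one hostname per line, with blank lines and "#" comments ignored.

```python
def read_hosts(lines):
    """Return hostnames from an iterable of lines.

    Blank lines and anything after a '#' are ignored, and a host
    listed twice is kept only once, in first-seen order.
    """
    hosts = []
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name and name not in hosts:
            hosts.append(name)
    return hosts


# With a real file (the name is hypothetical):
#     with open("probe.hosts") as fh:
#         hosts = read_hosts(fh)
sample = ["sun1\n", "# spare machines\n", "\n", "sun2  # lab\n", "sun1\n"]
print(read_hosts(sample))  # prints: ['sun1', 'sun2']
```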
Here are several other modifications you may need or
want to make.
Use ping instead of newping, if you have not installed newping and experience no problems with the simple ping shipped with your system.
Check on different filesystems. The probe script checks on /, /usr, /var, and /home. This may not be appropriate for you.
Check for a different set of daemons, perhaps adding a section for a different operating system flavor.
And of course it is up to you whether to use the prun
wrapper or run
the probe script directly.
About the Author
John Lees has an M.S. in computer science and has worked
during
the past twenty years about equally as a teacher, technical
writer,
programmer, and system administrator. His computer experience
began
in the days of front panels and paper tape, and he doesn't
have enough
fingers and toes to count the operating systems he has
used. His love/hate
relationship with UNIX dates to early 1985. Currently
Mr. Lees is
a systems analyst with the Department of Computer Science,
and manager
of the Pattern Recognition and Image Processing Laboratory,
at Michigan
State University. He is a member of ACM, Computer Professionals
for
Social Responsibility, the IEEE Computer Society, the
Society for
Technical Communication, and the TeX Users Group. He
may be contacted
as lees@cps.msu.edu.