The Downtime Notification and Tracking System
Dave Brillhart
Abstract
System downtime in distributed environments can be a
real challenge
to system administrators, since in many cases the associated
hardware
may not be within the administrator's physical control.
In such circumstances
tools that automatically detect system problems and
notify support
personnel are particularly valuable.
This article describes and provides installation instructions
for
the Downtime Notification and Tracking System (DNAT),
a simple, "light-weight"
tool for monitoring the connectivity of UNIX and VMS-based
computers.
DNAT generates a system status report every morning
and notifies you
within 15 minutes when one of your computers goes down.
The system
uses very little bandwidth, can track an unlimited number
of computers
over a wide-area network, and is easily extended to
include other
types of system checks (e.g., disk space, printer queues).
It
can help you both prevent downtime and respond immediately
when downtime
does occur.
System Overview
Every morning, DNAT produces a system status report
for the computers
it monitors (see Figure 1). The report either confirms
that your systems
are okay or lets you know you've got problems.
To detect machines that need immediate attention when
a problem occurs,
you can run DNAT every few minutes so that it will notify
you as soon
as a loss-of-contact condition is detected. When it
encounters a problem,
DNAT immediately sends mail similar to that shown in
Figure 3 to the
appropriate account. (One of the recipients at Harris
is the account
"autopager," which is an in-house utility
that converts the
subject of incoming mail into an alphanumeric page.
I've been in meetings
where three or four pagers go off simultaneously with
the message:
"DNAT: HELP!!".)
Finally, DNAT specifies a format for reporting the details
of each
unscheduled downtime (see Figure 4). This part of DNAT
helps us to
dig deeper into the cause of each incident, exposing
problems which
otherwise would have come back to haunt us. In addition,
these reports
are merged into a tracking database to facilitate quick
searches.
You can use any or all of these three facets of the
DNAT system. Daily
status reports and short-interval system checks are
set up to run via
cron. Implementing the incident report system is simply
a
matter of a little extra up-front work.
Administrative Setup
If you decide to install the DNAT system, you must begin
by addressing
some administrative setup issues.
First, choose a reliable UNIX machine to serve as the
DNAT hub. This
machine will be used to run DNAT scripts and distribute
e-mail. You
will need to install Perl and create a non-privileged
account called
"dnat" on your hub. If you are going to use
DNAT to monitor
VMS machines, you should choose either an Ultrix or
a Sun machine
as your hub; otherwise almost any UNIX machine will work.
The rdcl
program (Listing 2) allows batch (and interactive) access
to VMS machines
using the DECnet protocol. The version described here
is for Ultrix,
but if you have SunLink DNI and prefer a Sun hub, retrieve
the file
rdcl_sun.c via anonymous ftp from dave.mis.semi.harris.com.
Next, consider that DNAT uses rup to request uptime
information
from remote UNIX hosts. This utility requires that the
rstatd
daemon be running. Although most systems include and
run rstatd
by default, Ultrix does not. An Ultrix port is available,
however,
thanks to Sun Microsystems, Kazuro Furukawa, and Brendan
Kehoe. You
can retrieve the source via anonymous ftp from dave.mis.semi.harris.com:
just pull down the file called UltrixRPC.tar.Z. (If
you don't
have ftp access, I'll be happy to mail it to you.) Once
you process
the file and run it through make, put the rup binary
in a
local bin directory and execute the rstatd daemon from
/etc/rc.local.
If you would rather not count on rup and rstatd, you
can create a non-privileged account called "dnat"
on all the remote
hosts. Configure the accounts' .rhosts files so that
the hub
machine can execute rsh commands. Depending on your
setup,
you may need to have a matching entry in each machine's
/etc/hosts
file.
HUB_HOSTNAME dnat (~dnat/.rhosts file)
HUB_IP_ADDRESS HUB_HOSTNAME (/etc/hosts file)
Next, add the following four mail aliases to the /etc/aliases
file on the hub machine:
dnatAdmin: name1, name2, ...
dnatSunday: name1, name2, ...
dnatAlert: name1, name2, ...
dnatHelp: name1, name2, ...
Members of the dnatAdmin alias receive status reports
every
morning. Members of dnatSunday get the same report,
but only
on Sundays. dnatAlert members also receive the same
report,
but only on days when DNAT has found a problem (at Harris
this alias
contains developers and the help desk). Finally, members
of dnatHelp
receive incident notifications within 15 minutes of
a system crash.
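The routing rules for the morning report can be summarized in a few lines of Bourne shell. This is only a sketch: report_recipients is a hypothetical helper, not part of the DNAT listings, and it assumes the day name is passed in the form printed by `date +%a` in an English locale.

```shell
#!/bin/sh
# Sketch only: report_recipients is a hypothetical helper, not part of DNAT.
#   $1 = day of week as printed by `date +%a` (e.g. Sun, Mon)
#   $2 = number of problems found in this morning's report
report_recipients() {
    day=$1
    problems=$2
    list="dnatAdmin"                  # receives the report every morning
    if [ "$day" = "Sun" ]; then
        list="$list dnatSunday"       # same report, Sundays only
    fi
    if [ "$problems" -gt 0 ]; then
        list="$list dnatAlert"        # same report, only when DNAT found a problem
    fi
    echo "$list"
}
```

For example, `report_recipients Sun 2` names all three aliases. The dnatHelp alias is deliberately absent, since it is used only for the short-interval incident notifications.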
Remember to ask sendmail to rebuild its databases after
any
change to the /etc/aliases file. You can do this with
the
newaliases command.
The final setup procedure involves configuring your
VMS machines to
allow proxy access to a non-privileged account. See
the sidebar, "VMS
Account and Proxy Setup," for a step-by-step procedure.
Setup Verification
To verify your setup, log in to the "dnat"
account on the
hub machine and check your mail aliases as follows:
dnat> /usr/lib/sendmail -bv dnatAlert
Next, verify access to the remote UNIX machines. One
or both of the following commands should return uptime
information.
dnat> rup HOSTNAME
dnat> rsh HOSTNAME uptime (if rup failed)
To verify proxy access to the VMS machines, one of the
following commands should return a VMS directory listing.
dnat> dls HOSTNAME:: (from Ultrix running DECnet)
dnat> dnils HOSTNAME:: (from SunOS running DNI)
crontab Configuration
With the setup complete and verified, the next step
is to configure
the hub machine. You'll set up crontab first, although
I suggest
that you comment out the entries until the system is
completely installed.
Below are the entries we use on an Ultrix machine with
a single root
crontab file. If you use a SunOS machine as your hub,
place
the entries in dnat's crontab file and specify only
the command
names. All three scripts should be placed in the home
directory of
the hub machine's dnat account.
00 5 * * * su - dnat -c 'dnatDaily'
05,20,35,50 * * * * su - dnat -c 'dnatOften'
00 6 * * * su - dnat -c 'dnatOn'
The following two-line dnatDaily script simply calls
the main
Perl program (Listing 1, dnatRpt) with options that
tell
it to generate and mail a status report. The silent
option is useful
for batch execution as it prevents tty output.
#!/bin/sh
dnatRpt -silent -rpt dnatHosts
The dnatOften script checks the return value ($? for
Bourne shell) of dnatRpt; if a value greater than zero
is
returned (meaning at least one host is unreachable),
mail is sent
to the alias called dnatHelp:
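#!/bin/sh
# Check the production hosts; dnatRpt exits with the number unreachable.
dnatRpt -silent dnatProd
[ $? -gt 0 ] && {
    # Notify dnatHelp, then disable this script until dnatOn is run.
    Mail -s 'DNAT: HELP!!' dnatHelp < /tmp/dnat911
    mv dnatOften dnatOften.OFF
}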
The last script is used to reset dnatOften. As you can
imagine, extended downtime would cause dnatOften to
send mail
every 15 minutes for the same problem. One message is
enough. After
you have resolved the problem, run dnatOn manually to
restart
DNAT's short-interval monitoring. If you forget, cron
will
reset it for you the next morning.
#!/bin/sh
# Re-enable short-interval monitoring if dnatOften was switched off.
[ -f dnatOften.OFF ] &&
    mv dnatOften.OFF dnatOften
The Main Event
The Perl script that makes it all happen is in Listing
1. This script
should be placed in the home directory of the dnat account.
After defining some site-specific variables, the dnatRpt
script
begins by parsing command-line options and opening required
files.
The two binary options turn on status report distribution
(-rpt)
and turn off verbose mode (-silent). Running dnatRpt
interactively without options will allow you to test
both the script
and your connectivity, without sending reports. The
third option allows
you to specify alternate hostname files. This option
is very useful
for running DNAT separately against different classes
of computers.
You might, for instance, group certain machines into
a separate file
and instruct DNAT to check them every hour between 7am
and 6pm. Figure 2
is an example of a hostname file.
You will notice that dnatRpt creates two temporary files.
One of the files contains the system status report as
shown in Figure 1.
This file is unlinked (deleted) when the script exits.
The other
file contains the names of computers which either could
not be reached
or were restarted recently. This file, shown in Figure
3, is mailed
to you by the dnatOften script to help you track down
and
solve problems.
Finally, dnatRpt enters its main loop, where each line
of
the hostname file is processed. Empty lines and comments
preceded
by a "#" are skipped; report titles enclosed
in double-quotes
are printed; and whatever is left over is considered
a hostname. The
script uses the fact that DECnet hostnames end with
a double-colon
("::") to determine the appropriate transport
protocol (TCP/IP
or DECnet).
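The parsing rules just described can be sketched as a small shell function. The real logic lives in the Perl of Listing 1; classify_line is a hypothetical name used only for illustration, and the sketch assumes that comments and titles start in the first column.

```shell
#!/bin/sh
# Sketch only: classify_line is a hypothetical helper that mirrors the
# hostname-file rules described in the text; it is not taken from dnatRpt.
classify_line() {
    case "$1" in
        '')      echo skip ;;      # empty line
        '#'*)    echo skip ;;      # comment introduced by "#"
        '"'*'"') echo title ;;     # report title enclosed in double quotes
        *::)     echo decnet ;;    # trailing "::" marks a DECnet node name
        *)       echo tcpip ;;     # whatever is left over is a TCP/IP hostname
    esac
}
```

Because the order of the patterns matters, the DECnet test comes before the catch-all: a name such as `VAX1::` is otherwise indistinguishable from an ordinary hostname.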
If the remote host is to be accessed via TCP/IP, dnatRpt
sends
an ICMP echo request (ping) to determine if the host
is reachable.
If an echo reply is returned, the script uses rup to
find
the elapsed time since the last reboot. If rup fails
(for
example, if rstatd is unavailable on the remote host),
an
uptime command is attempted via rsh. If both fail
(or if ping failed), the host is flagged as "UNAVAILABLE."
This message often indicates a network problem, but
as far as your
users are concerned, it's your problem.
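The ping, rup, rsh fallback chain might look roughly like this in Bourne shell. It is a sketch under stated assumptions, not the code from Listing 1: host_status and the PING, RUP, and RSH override variables are hypothetical, and the default `ping -c 1` suits Linux and BSD systems (ping options differ between platforms, so adjust it for yours).

```shell
#!/bin/sh
# Sketch only. PING, RUP and RSH are hypothetical override variables so the
# function can be pointed at different commands (or at stubs for testing).
: "${PING:=ping -c 1}" "${RUP:=rup}" "${RSH:=rsh}"

host_status() {
    host=$1
    # Step 1: ICMP echo request. No reply means the host is unreachable.
    if ! $PING "$host" >/dev/null 2>&1; then
        echo "$host UNAVAILABLE"
        return 1
    fi
    # Step 2: ask rstatd for uptime via rup.
    # Step 3: if that fails, fall back to running uptime over rsh.
    if out=$($RUP "$host" 2>/dev/null) && [ -n "$out" ]; then
        echo "$out"
    elif out=$($RSH "$host" uptime 2>/dev/null) && [ -n "$out" ]; then
        echo "$host $out"
    else
        echo "$host UNAVAILABLE"
        return 1
    fi
}
```

The function returns a non-zero status whenever it prints UNAVAILABLE, so a caller can count failures simply by testing the return value for each host.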
You can gather similar information from a VMS host by
looking at the
first line of the output of a show system command. Listing 2,
a C program called rdcl.c (described later in this
article),
executes the show system command and determines the
uptime
information. If you don't have DECnet on your DNAT hub
machine, just
don't put any DECnet node names in the hostname file.
Finally, after all the hosts have been processed, dnatRpt
ends by optionally sending out its system status report
and returning
an exit value equal to the number of unreachable hosts.
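One caveat is worth keeping in mind if a hostname file grows large: a UNIX exit status carries only eight bits, so a count of exactly 256 would reach the calling shell as 0 and look like success. Whether dnatRpt guards against this is not stated here; the sketch below shows one way a wrapper could clamp the count. The name exit_code_for is hypothetical.

```shell
#!/bin/sh
# Sketch only: exit_code_for is a hypothetical helper, not part of DNAT.
# It maps a count of unreachable hosts to a value that survives as an
# exit status (0-255), so "more than zero" is never mistaken for zero.
exit_code_for() {
    count=$1
    if [ "$count" -gt 255 ]; then
        echo 255
    else
        echo "$count"
    fi
}
```

A script would then finish with something like `exit "$(exit_code_for "$unreachable")"`, where `$unreachable` stands for whatever counter the script keeps.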
The VMS Connection
Two programs are required to emulate rsh-style command
execution
on a VMS-based host. The C code that runs on the DNAT
hub machine
hub machine
might look intimidating (see Listing 2, rdcl.c), but
as you
will see it is pretty basic stuff. The version described
here, which
runs on an Ultrix machine, contains only two DECnet-specific
lines
in the entire program. The library call dnet_conn returns
a standard UNIX socket descriptor which can be used
to read and write
through a channel to a VMS object. The SunOS version
of rdcl
differs in that SunLink DNI is implemented as a STREAMS
driver supporting
the Transport Layer Interface (TLI). If you download
and install the
SunOS version (as described earlier), DNAT will otherwise
run as advertised.
The rdcl binary should be placed in a local bin directory
on the hub machine (compile instructions are contained
in the header
comments of the code). The program actually supports
both interactive
and batch command modes, although DNAT only uses the
batch mode to
issue a show system command. Briefly, rdcl.c parses
the command line, decides if you want batch or interactive
mode, opens
the socket channel, writes the specified VMS command
to the socket,
and displays the results on stdout.
The VMS com procedure (Listing 3, dnet_cmd.com), referred
to as a DECnet object, should be placed in the SYS$LOGIN
directory
of the VMS-based "dnat" account. The key to
this program is
a system logical called SYS$NET. Opening this logical
allows
the object to communicate with the rdcl program. dnet_cmd.com
opens the network channel, reads in and executes a VMS
command, directing
its output to a temporary file, and finally writes the
contents of
that file back out over the channel. You can debug this
connection
by checking the NETSERVER.LOG files created in the DNAT
directory.
The Hard Part
Thus far, I've shown you some useful utilities that
provide you with
status reports and let you know when a connection is
down. Now I turn
to a powerful strategy for following up on unscheduled
downtime.
The natural response to downtime in a high-pressure
environment is
to apply a quick fix and go on to the next problem.
For the last year
I've forced myself (and the local administrators at
our remote sites)
to complete the form shown in Figure 4. The incident
report is designed
to capture enough information to allow a new system
administrator
to solve the problem just by reading your notes. What
might have taken
hours to track down and solve should only take minutes
the next time
around. In addition to a detailed explanation, I include
the actual
commands I used to solve the problem. In some instances
I am able
to simply cut and paste a solution right from a report
to the affected
computer.
Over time, I've observed two interesting side effects
of enforcing
the use of incident reports. The first is a significant
reduction
in the number of unscheduled downtimes: the incident
investigations
have enabled us to find and eliminate chronic problems.
The second
is a reduction in the duration of the downtime we do
experience, attributable
in part to the step-by-step solutions provided by the
downtime tracking
database. A quick search through the database can save
hours.
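How you search depends on how you store the reports. As a minimal sketch, assume (this is an assumption, not a description of the Harris database) that each incident report is kept as a plain-text file in a single directory. A case-insensitive grep then lists every report that mentions a symptom; find_incidents is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: find_incidents is a hypothetical helper. It assumes one
# plain-text incident report per file in a single directory.
#   $1 = directory holding the reports
#   $2 = keyword or phrase to look for (matched case-insensitively)
find_incidents() {
    dir=$1
    keyword=$2
    # -i ignores case, -l prints only the names of matching files
    grep -il -- "$keyword" "$dir"/* 2>/dev/null
}
```

For example, `find_incidents ~dnat/incidents 'disk full'` would print the names of the reports to read first; the directory name here is only a placeholder.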
Conclusions
If this article has done its job, you should be able
to incorporate
the DNAT system into your own environment within a few
hours. Although
it is tempting to add features, I've tried to keep the
system simple
and focused. There is always room for improvement and
I would love
to hear how you have customized it for your own site.
Feel free to communicate with me regarding this system.
I'd appreciate
your comments and suggestions and will try to answer
your questions
promptly. In addition, I'll provide bug fixes and other
limited maintenance.
Enjoy!!
About the Author
Dave Brillhart is a UNIX Systems Manager for the Manufacturing
Systems Division of Harris Semiconductor. He has worked
in the UNIX
environment since 1989, assuming responsibility for
worldwide mission-critical
servers and development platforms. His current focus
is on developing
system admin tools which will free up his time to dig
deeper into
X-window and network programming. Dave can be reached
at dbrillha@harris.com
or you can visit his Mosaic home page at http://dave.mis.semi.harris.com/home.html.