Cover V03, I03
Article
Figure 1
Figure 2
Figure 3
Figure 4
Listing 1
Listing 2
Listing 3
Sidebar 1

may94.tar


The Downtime Notification and Tracking System

Dave Brillhart

Abstract

System downtime in distributed environments can be a real challenge to system administrators, since in many cases the associated hardware may not be within the administrator's physical control. In such circumstances tools that automatically detect system problems and notify support personnel are particularly valuable.

This article describes and provides installation instructions for the Downtime Notification and Tracking System (DNAT), a simple, "light-weight" tool for monitoring the connectivity of UNIX and VMS-based computers. DNAT generates a system status report every morning and notifies you within 15 minutes when one of your computers goes down. The system uses very little bandwidth, can track an unlimited number of computers over a wide-area network, and is easily extended to included other types of system checks (e.g., disk space, printer queues, etc). It can help you both prevent downtime and respond immediately when downtime does occur.

System Overview

Every morning, DNAT produces a system status report for the computers it monitors (see Figure 1). The report either confirms that your systems are okay or lets you know you've got problems.

To detect machines that need immediate attention when a problem occurs, you can run DNAT every few minutes so that it will notify you as soon as a loss-of-contact condition is detected. When it encounters a problem, DNAT immediately sends mail similar to that shown in Figure 3 to the appropriate account. (One of the recipients at Harris is the account "autopager," which is an in-house utility that converts the subject of incoming mail into an alphanumeric page. I've been in meetings where three or four pagers go off simultaneously with the message: "DNAT: HELP!!".)

Finally, DNAT specifies a format for reporting the details of each unscheduled downtime (see Figure 4). This part of DNAT helps us to dig deeper into the cause of each incident, exposing problems which otherwise would have come back to haunt us. In addition, these reports are merged into a tracking database to facilitate quick searches.

You can use any or all of these three facets of the DNAT system. Daily status reports and short interval system checks are setup to run via cron. Implementing the incident report system is simply a matter of a little extra up-front work.

Administrative Setup

If you decide to install the DNAT system, you must begin by addressing some administrative setup issues.

First, choose a reliable UNIX machine to serve as the DNAT hub. This machine will be used to run DNAT scripts and distribute e-mail. You will need to install Perl and create a non-privileged account called "dnat" on your hub. If you are going to use DNAT to monitor VMS machines, you should choose either an Ultrix or a Sun machine as your hub; otherwise most any UNIX machine will work. The rdcl program (Listing 2) allows batch (and interactive) access to VMS machines using the DECnet protocol. The version described here is for Ultrix, but if you have SunLink DNI and prefer a Sun hub, retrieve the file rdcl_sun.c via anonymous ftp from dave.mis.semi.harris.com.

Next, consider that DNAT uses rup to request uptime information from remote UNIX hosts. This utility requires that the rstatd daemon be running. Although most systems include and run rstatd by default, Ultrix does not. An Ultrix port is available, however, thanks to Sun Microsystems, Kazuro Furukawa, and Brendan Kehoe. You can retrieve the source via anonymous ftp from dave.mis.semi.harris.com: just pull down the file called UltrixRPC.tar.Z. (If you don't have ftp access, I'll be happy to mail it to you.) Once you process the file and run it through make, put the rup binary in a local bin directory and execute the rstatd daemon from /etc/rc.local.

If you would rather not count on rup and rstatd, you can create a non-privileged account called "dnat" on all the remote hosts. Configure the accounts' .rhosts files so that the hub machine can execute rsh commands. Depending on your setup, you may need to have a matching entry in each machine's /etc/hosts file.

HUB_HOSTNAME    dnat            (~dnat/.rhosts file)
HUB_IP_ADDRESS  HUB_HOSTNAME    (/etc/hosts file)

Next, add the following four mail aliases to the /etc/aliases file on the hub machine:

dnatAdmin: name1, name2, ...
dnatSunday: name1, name2, ...
dnatAlert: name1, name2, ...
dnatHelp: name1, name2, ...

Members of the dnatAdmin alias receive status reports every morning. Members of dnatSunday get the same report, but only on Sundays. dnatAlert members also receive the same report, but only on days when DNAT has found a problem (at Harris this alias contains developers and the help desk). Finally, members of dnatHelp receive incident notifications within 15 minutes of a system crash. Remember to ask sendmail to rebuild its databases after any change to the /etc/aliases file. You can do this with the newaliases command.

The final setup procedure involves configuring your VMS machines to allow proxy access to a non-privileged account. See the sidebar, "VMS Account and Proxy Setup," for a step-by-step procedure.

Setup Verification

To verify your setup, log in to the "dnat" account on the hub machine and check your mail aliases as follows:

dnat> /usr/lib/sendmail -bv dnatAlert

Next, verify access to the remote UNIX machines. One or both of the following commands should return uptime information.

dnat> rup HOSTNAME
dnat> rsh HOSTNAME uptime     (if rup failed)

To verify proxy access to the VMS machines, one of the following commands should return a VMS directory listing.

dnat> dls   HOSTNAME::   (from Ultrix running DECnet)
dnat> dnils HOSTNAME::   (from SunOS running DNI)

crontab Configuration

With the setup complete and verified, the next step is to configure the hub machine. You'll set up crontab first, although I suggest that you comment out the entries until the system is completely installed. Below are the entries we use on an Ultrix machine with a single root crontab file. If you use a SunOS machine as your hub, place the entries in dnat's crontab file and specify only the command names. All three scripts should be placed in the home directory of the hub machine's dnat account.

00 5 * * * su - dnat -c 'dnatDaily'
05,20,35,50 * * * * su - dnat -c 'dnatOften'
00 6 * * * su - dnat -c 'dnatOn'

The following two-line dnatDaily script simply calls the main Perl program ( Listing 1, dnatRpt) with options that tell it to generate and mail a status report. The silent option is useful for batch execution as it prevents tty output.

#!/bin/sh
dnatRpt -silent -rpt dnatHosts

The dnatOften script checks the return value ($? for Bourne shell) of dnatRpt; if a value greater than zero is returned (meaning at least one host is unreachable), mail is sent to the alias called dnatHelp:

#!/bin/sh
dnatRpt -silent dnatProd
[ $? -gt 0 ] && {
Mail -s 'DNAT: HELP!!' dnatHelp < /tmp/dnat911

mv dnatOften dnatOften.OFF
}

The last script is used to reset dnatOften. As you can imagine, extended downtime would cause dnatOften to send mail every 15 minutes for the same problem. One message is enough. After you have resolved the problem, run dnatOn manually to restart DNAT's short-interval monitoring. If you forget, cron will reset it for you the next morning.

#!/bin/sh
[ -f dnatOften.OFF ] &&
mv dnatOften.OFF dnatOften

The Main Event

The Perl script that makes it all happen is in Listing 1. This script should be placed in the home directory of the dnat account.

After defining some site-specific variables, the dnatRpt script begins by parsing command-line options and opening required files. The two binary options turn on status report distribution (-rpt) and turn off verbose mode (-silent). Running dnatRpt interactively without options will allow you to test both the script and your connectivity, without sending reports. The third option allows you to specify alternate hostname files. This option is very useful for running DNAT separately against different classes of computers. You might, for instance, group certain machines into a separate file and instruct DNAT to check them every hour between 7am and 6pm. Figure 2 is an example of a hostname file.

You will notice that dnatRpt creates two temporary files. One of the files contains the system status report as shown in Figure 1. This file is unlinked (deleted) when the script exits. The other file contains the names of computers which either could not be reached or were restarted recently. This file, shown in Figure 3, is mailed to you by the dnatOften script to help you track down and solve problems.

Finally, dnatRpt enters its main loop, where each line of the hostname file is processed. Empty lines and comments preceded by a "#" are skipped; report titles enclosed in double-quotes are printed; and whatever is left over is considered a hostname. The script uses the fact that DECnet hostnames end with a double-colon ("::") to determine the appropriate transport protocol (TCP/IP or DECnet).

If the remote host is to be accessed via TCP/IP, dnatRpt sends an ICMP echo request (ping) to determine if the host is reachable. If an echo reply is returned, the script uses rup to find the elapsed time since the last reboot. If rup fails (for example, if rstatd is unavailable on the remote host), an uptime command is attempted via rsh. If both fail (or if ping failed), the host is flagged as "UNAVAILABLE." This message often indicates a network problem, but as far as your users are concerned, it's your problem.

You can gather similar information from a VMS host by looking at the first line of the output of a show system command. Listing 2, a C program called rdcl.c (described later in this article), executes the show system command and determines the uptime information. If you don't have DECnet on your DNAT hub machine, just don't put any DECnet node names in the hostname file.

Finally, after all the hosts have been processed, dnatRpt ends by optionally sending out its system status report and returning an exit value equal to the number of unreachable hosts.

The VMS Connection

Two programs are required to emulate the rsh command execution to a VMS-based host. The C code that runs on the DNAT hub machine might look intimidating (see Listing 2, rdcl.c), but as you will see it is pretty basic stuff. The version described here, which runs on an Ultrix machine, contains only two DECnet specific lines in the entire program. The library call dnet_conn returns a standard UNIX socket descriptor which can be used to read and write through a channel to a VMS object. The SunOS version of rdcl differs in that SunLink DNI is implemented as a STREAMS driver supporting the Transport Layer Interface (TLI). If you download and install the SunOS version (as described earlier), DNAT will otherwise run as advertised.

The rdcl binary should be placed in a local bin directory on the hub machine (compile instructions are contained in the header comments of the code). The program actually supports both interactive and batch command modes, although DNAT only uses the batch mode to issue a show system comand. Briefly, rdcl.c parses the command line, decides if you want batch or interactive mode, opens the socket channel, writes the specified VMS command to the socket, and displays the results on stdout.

The VMS com procedure (Listing 3, dnet_cmd.com), referred to as a DECnet object, should be placed in the SYS$LOGIN directory of the VMS-based "dnat" account. The key to this program is a system logical called SYS$NET. Opening this logical allows the object to communicate with the rdcl program. dnet_cmd.com opens the network channel, reads in and executes a VMS command, directing its output to a temporary file, and finally writes the contents of that file back out over the channel. You can debug this connection by checking the NETSERVER.LOG files created in the DNAT directory.

The Hard Part

Thus far, I've shown you some useful utilities that provide you with status reports and let you know when a connection is down. Now I turn to a powerful strategy for following up on unscheduled downtime.

The natural response to downtime in a high-pressure environment is to apply a quick fix and go on to the next problem. For the last year I've forced myself (and the local administrators at our remote sites) to complete the form shown in Figure 4. The incident report is designed to capture enough information to allow a new system administrator to solve the problem just by reading your notes. What might have taken hours to track down and solve should only take minutes the next time around. In addition to a detailed explanation, I include the actual commands I used to solve the problem. In some instances I am able to simply cut and paste a solution right from a report to the affected computer.

Over time, I've observed two interesting side effects of enforcing the use of incident reports. The first, is a significant reduction in the number of unscheduled downtimes: the incident investigations have enabled us to find and eliminate chronic problems. The second, is a reduction in the duration of the downtime we do experience attributable in part to the step-by-step solutions provided by the downtime tracking database. A quick search through the database can save hours.

Conclusions

If this article has done its job, you should be able to incorporate the DNAT system into your own environment within a few hours. Although it is tempting to add features, I've tried to keep the system simple and focused. There is always room for improvement and I would love to hear how you have customized it for your own site.

Feel free to communicate with me regarding this system. I'd appreciate your comments and suggestions and will try to answer your questions promptly. In addition, I'll provide bug fixes and other limited maintenance. Enjoy!!

About the Author

Dave Brillhart is a UNIX Systems Manager for the Manufacturing Systems Division of Harris Semiconductor. He has worked in the UNIX environment since 1989, assuming responsibility for worldwide mission-critical servers and development platforms. His current focus is on developing system admin tools which will free up his time to dig deeper into X-window and network programming. Dave can be reached at dbrillha@harris.com or you can visit his Mosaic home page at http://dave.mis.semi.harris.com/home.html.