Cover V02, I04

newping: Remote Host Downtime Detection

Eric Horne

Over the course of nights spent waiting for the Sun cluster in our lab to recover from yet another crash, my lab partners and I reached the conclusion that the main problem with recoveries lay not in the execution but in letting the administrator know that the system was down. In fact, knowing whether a system is down, or even hung, is the first step in solving the problem.

A Tool Called ping

A reasonable tool for this sort of detection might be the ping(1) command, which sends an echo request from the source machine to the destination machine. Upon receipt of the request, the destination machine echoes the data sent along with it back to the source machine. When the source machine receives the echoed data, it reports that the destination machine is up. If the source machine does not get the echoed data within a certain timeout period, it reports the destination as down. ping also reports other problems it finds, such as an unreachable host or network, or an unknown hostname. Using ping in an automated process that checks each host's status is easy: wrap it in a shell script and have cron run the script every few minutes.

A Closer Look at ping

Looking at ping a little more closely, I found that it wasn't as reliable as I had hoped. ping uses ICMP (Internet Control Message Protocol) to ask for a simple echo of some data. The data consists of a sequence number and the time at which the source sent the request. The destination machine receives the ICMP echo request and sends back an ICMP echo reply. When the source machine receives the echo reply, it subtracts the time carried in the reply from the current time to find the total time the message took to bounce back. If the destination machine is down, the source machine will eventually time out waiting for a response that will never come.
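The elapsed-time calculation described above can be sketched in a few lines of C. This is only an illustration, not ping's actual source: the function name rtt_ms() is made up for the example, and it assumes the timestamp travels in the echoed data as a struct timeval.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Round-trip time in milliseconds between the timestamp carried in the
 * echoed data (sent) and the moment the reply arrived (now).
 * rtt_ms() is a hypothetical name used only for this sketch. */
double rtt_ms(const struct timeval *sent, const struct timeval *now)
{
    assert(sent != NULL && now != NULL);

    /* Subtract the two fields separately; the microsecond difference
     * may be negative, which the final sum corrects for. */
    double secs  = (double)(now->tv_sec - sent->tv_sec);
    double usecs = (double)(now->tv_usec - sent->tv_usec);

    return secs * 1000.0 + usecs / 1000.0;
}
```

A caller would take the current time with gettimeofday() when the reply arrives and pass it in together with the timestamp copied out of the reply.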

I didn't really think much about this (in fact, I figured it was exactly what I needed) until I stumbled onto one important fact: user processes on the destination machine never know about the ICMP echo request, because it is handled automatically at the lowest level of the system. In other words, the network interface (the Ethernet card together with the kernel code that services it) intercepts the ICMP echo request and sends the reply all by itself, without any help from user-level processes.

Allowing the Ethernet card alone to handle the echo request implies a big assumption on ping's part. ping also misleads the user into thinking that the Ethernet card's status represents the machine's status, which is not always true. It is possible for the Ethernet card to keep answering ICMP echo requests while the rest of the machine has come to a grinding halt because of a disk error or some other problem. Like it or not, systems hang, and ping won't tell you a thing about it.

A Look at newping

newping's purpose is to help correct ping's shortcomings. While this program performs many of the same services as ping, it differs significantly in how it works. The most significant difference between newping and ping is the choice of protocol. ping uses ICMP, while newping uses TCP, which forces its way past the Ethernet card and demands attention from inetd, and thus from the CPU, on the destination machine. Using TCP allows newping to connect() to the destination host on a well-known port. (newping uses the TIME/TCP port.) Through inetd, each well-known port is associated with a daemon or with some other piece of runnable code. When inetd detects a connection to a particular port, it notifies the daemon and redirects the data, if any, to it. The daemon wakes up and handles the request. Notice that the inetd and daemon actions both require a response from the CPU, not only from the Ethernet card. Both newping and ping notify the user of the test results, although newping returns much more detailed and meaningful values than ping (see Table 1 for some details).
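As a rough illustration of the connect() step, the following C sketch opens a TCP connection to the TIME port (port 37). It is not the code from Listing 1. The names make_time_addr() and connect_time() are invented for the example, and it takes a dotted-quad IP address rather than a hostname to stay short; it also has no timeout of its own.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define TIME_PORT 37   /* well-known port of the TIME service */

/* Fill in a TCP address for the TIME port from a dotted-quad string.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address.
 * (Hypothetical helper; real code would also resolve hostnames.) */
int make_time_addr(const char *dotted, struct sockaddr_in *addr)
{
    assert(dotted != NULL && addr != NULL);

    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_port = htons(TIME_PORT);
    return inet_pton(AF_INET, dotted, &addr->sin_addr) == 1 ? 0 : -1;
}

/* Open a TCP connection to the TIME port. Returns a connected socket
 * descriptor, or -1 on failure. Blocks until connect() succeeds or
 * fails, which is why a separate timer is needed around it. */
int connect_time(const char *dotted)
{
    struct sockaddr_in addr;

    if (make_time_addr(dotted, &addr) != 0)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A successful return from connect_time() only shows that the TCP handshake completed; whether the machine can actually run the service is a separate question.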

newping Code

Apart from the initial setup, the newping code is straightforward. newping begins by looking at the command line to determine any options, the name of the host to check, and the amount of time, in seconds, to use as a timeout. The program processes all of this information and stores it in several variables and structures used later by connect(). After verifying other miscellaneous data, newping uses signal() and alarm() calls to set an internal alarm to go off in one second. When the second is up, the alarm calls an action routine, noconnect(), which keeps track of the time that has passed. If the time passed is more than the timeout value, a connection was not made within the time limit, and newping times out. Otherwise, newping continues, with the alarm set to go off in another second. In effect, noconnect() either gives newping another second to connect, or exits newping with a return code of 1 (Listing 2). If a connection is made before the timeout, the connection is deemed successful and the next phase begins.
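The one-second alarm loop might look roughly like the sketch below. It is written from the description above, not copied from Listing 2. The handler name noconnect() and the exit status 1 come from the article; timed_out(), start_timer(), and the two counters are assumptions made for the example.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t elapsed = 0;  /* seconds since the test began */
static volatile sig_atomic_t limit = 5;    /* timeout in seconds (assumed default) */

/* True once more seconds have passed than the timeout allows.
 * (Hypothetical helper, split out so the rule is easy to check.) */
int timed_out(int elapsed_secs, int limit_secs)
{
    return elapsed_secs > limit_secs;
}

/* SIGALRM action: either give up with status 1, or grant the pending
 * connect() one more second by re-arming the alarm. */
void noconnect(int signo)
{
    (void)signo;
    elapsed++;
    if (timed_out(elapsed, limit))
        _exit(1);                 /* _exit() is safe inside a signal handler */
    signal(SIGALRM, noconnect);   /* some systems reset the action on delivery */
    alarm(1);
}

/* Arm the one-second timer; call this just before connect(). */
void start_timer(int timeout_secs)
{
    assert(timeout_secs > 0);
    elapsed = 0;
    limit = timeout_secs;
    signal(SIGALRM, noconnect);
    alarm(1);
}
```

Because the handler ends the process itself when the limit is exceeded, the main line of the program never needs to check the clock; it simply sits in connect() until it either returns or the process exits.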

If the connection is successful, which implies that the Ethernet card is OK, newping next tests the CPU. A quick call to signal() changes the SIGALRM action from noconnect() to noresponse(). Both functions perform the same task; the difference between them lies in the status code they return via exit() if the count exceeds the timeout value (see Listing 3). It is important to notice that the time passed is not reset after the noconnect() phase. The CPU test works in much the same way as the Ethernet card test. The only real difference is that newping calls recv() instead of connect(). It simply blocks, waiting for the destination machine to send its data. If newping times out before recv() detects any data, the destination machine is most likely hung. The Ethernet card will respond, but the CPU is either so loaded with work that it does not have time to return data, or it has halted. Either case is worth investigating.
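The second phase can be sketched the same way. This is a simplified illustration rather than Listing 3: it gives up at the first alarm instead of counting seconds, the exit status 2 is a placeholder (the real values are in Table 1), and wait_for_data() is a name invented for the example.

```c
#include <assert.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define EXIT_NORESPONSE 2   /* placeholder status; see Table 1 for real codes */

/* SIGALRM action for the second phase. Simplified: exits at the first
 * alarm rather than counting seconds as the article describes. */
void noresponse(int signo)
{
    (void)signo;
    _exit(EXIT_NORESPONSE);
}

/* Block until the peer sends something. Returns the number of bytes
 * received, 0 if the peer closed without sending, or -1 on error.
 * If nothing arrives within timeout_secs, the process exits. */
ssize_t wait_for_data(int fd, int timeout_secs)
{
    char buf[16];

    assert(fd >= 0 && timeout_secs > 0);

    signal(SIGALRM, noresponse);    /* swap the SIGALRM action */
    alarm((unsigned)timeout_secs);

    ssize_t n = recv(fd, buf, sizeof buf, 0);

    alarm(0);                       /* data arrived: cancel the alarm */
    return n;
}
```

The contents of buf are deliberately thrown away; only the fact that something arrived matters.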

Recall that newping connects to the TIME/TCP port of the destination host. I chose the TIME/TCP port because when it detects a connection, it automatically sends the local time through it and closes the connection. The data returned is not used for anything; its presence simply means that the machine is responding. Receiving the data allows newping to exit with a code of 0. Detecting any sort of error forces newping to exit with the proper return code (see Table 1).

Automating Detection

newping's ability to return distinguishable codes for different states of a destination machine, which ping lacks, makes it very useful in shell scripts. A script using newping, a list of destination hosts, and a few loops can be surprisingly effective. Listing 4 shows a simple script, worthy of your improvement.
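Listing 4 does this in sh, but the same idea, running a command and branching on its exit status, can be shown in C for consistency with the rest of the code. The sketch below is an assumption-laden illustration: run_status() and describe() are invented names, and only the meanings of statuses 0 and 1 are taken from the article text.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run a shell command and return its exit status, or -1 if it could
 * not be run or was killed by a signal. (Hypothetical helper.) */
int run_status(const char *command)
{
    assert(command != NULL);

    int raw = system(command);
    if (raw == -1 || !WIFEXITED(raw))
        return -1;
    return WEXITSTATUS(raw);
}

/* Map a status to a short description. Only 0 and 1 are described in
 * the article; every other value is lumped together here. */
const char *describe(int status)
{
    switch (status) {
    case 0:  return "up";
    case 1:  return "no connection within timeout";
    default: return "other problem (see Table 1)";
    }
}
```

A monitoring loop would build the newping command line for each host (the exact argument order is given in Listing 1 and is not reproduced here), pass it to run_status(), and log or report the result of describe().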

The script in Listing 4 acts as a base script. You can change it to match your needs. It could easily be altered to record the times a host went down and came back up again. It could notify a list of people using write(1) or mail(1) that there is a particular problem with some host. It could even test the stability of a network.

cron can run the script in Listing 4 every X minutes, allowing an ongoing, automatic notification system. My department even hooked a terminal to the back of my Sun 4, through /dev/ttya, to which a script could then write the status of each host in a list. The script in Listing 4 can be altered to do this by changing the /dev/console to /dev/ttya. We ran our particular version every five minutes through cron. This gave us an updated status for every important host in a list with the oldest update only five minutes old. Our script logged host downtime for a monthly report, which summarized how often a system went down and for how long. As your needs change, the script can change with them. Over time, this simple script grew to be one of the more complicated scripts I have ever written.

newping is by no means the be-all and end-all of detecting remote host downtime. Used in the right way, however, it can greatly speed up your response to down hosts.

About the Author

Eric T. Horne is a graduating senior at Cal Poly, San Luis Obispo. He worked as a programmer analyst for nine months at Teradyne, Inc. (ST division), where he assisted system administrators and wrote several utility sh scripts to help manage and measure the performance of systems. He hopes to graduate in August 1993. You may contact Eric at 40 San Antonio Street, Newbury Park, CA 91320, or at ehorne@phoenix.csc.calpoly.edu.