newping: Remote Host Downtime Detection
 
Eric Horne 
Over the course of nights spent waiting for the Sun
cluster in our 
lab to recover from yet another crash, my lab partners
and I reached 
the conclusion that the main problem with recoveries
lay not in the 
execution but in letting the administrator know that
the system was 
down. In fact, knowing whether a system is down, or
even hung, is 
the first step in solving the problem. 
A Tool Called ping 
A reasonable tool for this sort of detection might be
the ping(1) 
command, which sends an echo request from the source
machine to the 
destination machine. Upon receipt of the echo request, the destination
machine 
echoes the data sent along with the echo request back
to the source 
machine. When the source machine receives the echoed
data, it reports 
that the destination machine is up. If the source machine
does not 
get the echoed data within a certain timeout period,
the source machine 
reports the destination as down. ping also reports other
problems 
it finds, like unreachable host/network or unknown hostname.
Using ping in an automated process that checks each host's status
is easy: write a short shell script around ping and have cron
run it every few minutes. 
A Closer Look at ping 
Looking at ping a little more closely, I found that
it wasn't 
as reliable as I had hoped. ping uses ICMP (Internet
Control 
Message Protocol) to ask for a simple echo of some data.
The data 
consists of a message number and the time sent from
the source. The 
destination machine receives the ICMP echo request and
sends back 
an ICMP echo reply. When the source machine receives
the echo reply, 
it calculates the difference between the current time
and the time 
data in the reply to find the total elapsed time it
took the message 
to bounce back. If the destination machine is down, the source machine
will eventually time out waiting for a response that will never come. 
I didn't think much about this (in fact, I figured it was exactly
what I needed) until I stumbled onto one important factor.
The destination 
machine CPU never knows about the ICMP echo request,
because the Ethernet 
card handles it automatically. In other words, the Ethernet
card intercepts 
the ICMP echo request and sends the reply all by itself,
allowing 
the CPU to handle other things, like user processes. 
Allowing the Ethernet card alone to handle the echo
request implies 
a big assumption on ping's part. ping also misleads
the user into thinking that the Ethernet card's status
represents the 
machine's status, which is not always true. It is possible
that the 
Ethernet card is responding and handling its ICMP echo
requests, while 
the CPU has come to a grinding halt because of a disk
error or some 
other problem. Like it or not, systems hang, and ping
won't 
tell you a thing about it. 
A Look at newping 
newping's purpose is to help correct ping's shortcomings.
While this program performs many of the same services as ping,
it works quite differently. One significant difference is the
choice of protocol. ping uses ICMP, while newping uses TCP, which
gets past the Ethernet card and demands attention from inetd,
and thus from the CPU, on the destination machine. Using TCP
allows newping to connect() to the destination host
via a well-known port. (newping uses the TIME/TCP port.)
Each well-known port is associated, through inetd, with a daemon
or with some other piece of runnable code. 
When inetd detects a connection to a particular port,
it notifies 
the daemon and redirects the data, if any, to it. The
daemon wakes 
up and handles the request. Notice that the inetd and
daemon 
actions both require a response from the CPU, not only
the Ethernet 
card. Both newping and ping notify the user of
the test results, although newping returns much more detailed
and meaningful values than ping. (See Table 1 for some
details.) 
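As a rough illustration of the connection step, the sketch below opens a
TCP connection to the TIME service, which lives on well-known port 37.
It is not the newping source. The helper names time_addr() and
connect_time() are hypothetical, and the sketch assumes an IPv4
dotted-quad address rather than a hostname.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define TIME_PORT 37   /* well-known TIME/TCP port */

/* Fill in an IPv4 socket address for the TIME service on `ip`.
 * Returns 0 on success, -1 if `ip` is not a dotted-quad address.
 * (Hypothetical helper, not taken from the newping listings.) */
int time_addr(const char *ip, struct sockaddr_in *addr)
{
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_port = htons(TIME_PORT);
    return inet_pton(AF_INET, ip, &addr->sin_addr) == 1 ? 0 : -1;
}

/* Open a TCP connection to the TIME port of `ip`.
 * Returns the connected socket descriptor, or -1 on any failure. */
int connect_time(const char *ip)
{
    struct sockaddr_in addr;
    int fd;

    if (time_addr(ip, &addr) != 0)
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    /* Blocks until the connection is made or the attempt fails. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Left alone, a blocking connect() to a dead host can take a long time to
give up, which is why a tool of this kind has to enforce a timeout itself.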
newping Code 
Besides the initial startup instructions, the newping
code 
is straightforward. newping begins by reading the command
line to determine any options, the name of the host to check, and
the amount of time, in seconds, to use as a timeout.
The program processes 
all of this information and stores it in several variables
and structures 
used later by connect().  After completing verification
of 
other miscellaneous data, newping uses signal() and
alarm() calls to set an internal alarm to go off in
exactly 
one second. When the second is up, the alarm calls an
action routine, 
noconnect(), which keeps track of the time that has
passed. 
If the time passed is more than the timeout value, a
connection was 
not made within the time limits, and newping times out.
Otherwise, 
if the timeout has not been violated, newping continues,
with 
the alarm set to go off in another second. In effect,
noconnect() 
either gives newping another second to connect, or exits
newping 
with a return code set to 1 (Listing 2). If a connection
is made 
prior to a timeout, the connection is deemed successful
and the next 
phase begins. 
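The one-second alarm loop described above might look something like the
following. This is a minimal sketch, not Listing 2. The variable names
and start_connect_timer() are hypothetical, and the handler calls
_exit() rather than exit() because _exit() is safe to call from a
signal handler.

```c
#include <signal.h>
#include <unistd.h>

/* Seconds elapsed since the timer was started (hypothetical name). */
static volatile sig_atomic_t seconds_passed = 0;
/* Timeout in seconds; the default here is an arbitrary assumption. */
static int timeout_secs = 5;

/* SIGALRM action for the connection phase. */
void noconnect(int sig)
{
    (void)sig;
    seconds_passed++;
    if (seconds_passed > timeout_secs)
        _exit(1);               /* no connection within the timeout */
    /* Re-install the handler, since some systems reset it to the
     * default after each delivery, then allow one more second. */
    signal(SIGALRM, noconnect);
    alarm(1);
}

/* Arm the timer just before calling connect(). */
void start_connect_timer(int timeout)
{
    timeout_secs = timeout;
    seconds_passed = 0;
    signal(SIGALRM, noconnect);
    alarm(1);
}
```

Whether a blocked connect() resumes after the handler returns, or fails
with EINTR, depends on the platform's signal() semantics, so a real
program would have to check for that case.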
If the connection is successful, which implies that
the Ethernet card 
is OK, newping next tests the CPU. A quick call to signal()
resets the SIGALRM action from calling noconnect() to
calling 
noresponse(). Both functions perform the same task:
the difference 
between them lies in the status code they return via
exit() 
if the count exceeds the timeout value (see Listing 3). Note that
the elapsed time is not reset after the noconnect() phase;
both phases share a single timeout. The CPU test works in much the
same way as the Ethernet card test. The only real difference is that
newping calls recv() instead of connect(), and simply blocks, waiting
for the destination machine to send its data. If newping times
out 
before recv() detects the presence of any data, the
destination 
machine is most likely a hung machine. The Ethernet
card will respond, 
but the CPU is either so loaded with work it does not
have time to 
return data, or the CPU has halted. Either case is worth
investigating. 
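The second phase can be sketched in the same style. The code below is an
illustration under stated assumptions rather than Listing 3:
wait_for_reply() is a hypothetical helper, and EXIT_NORESPONSE is a
placeholder value, since the real return codes are the ones given in
Table 1. The elapsed-time counter is deliberately not reset here, so time
spent connecting still counts against the timeout.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define EXIT_NORESPONSE 2   /* placeholder; see Table 1 for real codes */

/* Carried over from the connection phase, never reset in this phase. */
static volatile sig_atomic_t seconds_passed = 0;
static int timeout_secs = 5;   /* arbitrary default for the sketch */

/* SIGALRM action for the response phase. */
void noresponse(int sig)
{
    (void)sig;
    if (++seconds_passed > timeout_secs)
        _exit(EXIT_NORESPONSE);   /* connected, but no data in time */
    signal(SIGALRM, noresponse);
    alarm(1);
}

/* Block until the remote side sends something on `fd`.
 * Returns 0 if any data arrived, -1 if the peer closed the
 * connection without sending or recv() failed. */
int wait_for_reply(int fd)
{
    char buf[4];    /* the reply is discarded; only its arrival matters */
    ssize_t n;

    signal(SIGALRM, noresponse);   /* switch handlers, keep the count */
    alarm(1);
    do {
        n = recv(fd, buf, sizeof buf, 0);
    } while (n < 0 && errno == EINTR);   /* retry if the alarm interrupted us */
    alarm(0);                            /* cancel the pending alarm */
    return n > 0 ? 0 : -1;
}
```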
Recall that newping connects to the TIME/TCP port of
the destination 
host. I chose the TIME/TCP port because the service behind it, on
accepting a connection, automatically sends the current time and
closes the connection. The data returned is not used for anything;
its mere presence means that the machine is responding. Receiving the
data allows newping 
to exit with a code of 0. Detecting any sort of error
forces newping 
to exit with the proper return code (see Table 1). 
Automating Detection 
Unlike ping, newping returns distinguishable codes for the
different states of a destination machine, which makes it
very useful in shell scripts. A script using newping,
a list 
of destination hosts, and a few loops can be surprisingly
effective. 
Listing 4 shows a simple script, worthy of your improvement. 
The script in Listing 4 acts as a base script. You can
change it to 
match your needs. It could easily be altered to record
the times a 
host went down and came back up again. It could notify
a list of people 
using write(1) or mail(1) that there is a particular
problem with some host. It could even test the stability
of a network. 
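The heart of any such script is simply "run the check, then branch on
the exit code." For readers who would rather drive the checks from C
than from sh, the same idea is sketched below. check_status() and
describe() are hypothetical helpers. Only codes 0 and 1 are taken from
the description above; the other codes are in Table 1 and are not
decoded here.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Run `cmd` through the shell and return its exit code, or -1 if
 * the command could not be run or was killed by a signal. */
int check_status(const char *cmd)
{
    int st = system(cmd);

    if (st == -1 || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st);
}

/* Translate an exit code into a message for the console or a log. */
const char *describe(int code)
{
    switch (code) {
    case -1: return "could not run check";
    case 0:  return "up";
    case 1:  return "no connection within timeout";
    default: return "other problem (see Table 1)";
    }
}
```

A caller would pass whatever newping command line it uses for a given
host to check_status() and then log or mail the string returned by
describe().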
cron can run the script in Listing 4 every X minutes,
allowing 
an ongoing, automatic notification system. My department
even hooked 
a terminal to the back of my Sun 4, through /dev/ttya,
to which 
a script could then write the status of each host in
a list. The script 
in Listing 4 can be altered to do this by changing
/dev/console to /dev/ttya. We ran our particular version every five
minutes 
through cron. This gave us an updated status for every
important 
host in a list with the oldest update only five minutes
old. Our script 
logged host downtime for a monthly report, which summarized
how often 
a system went down and for how long. As your needs change,
the script 
can change with them. Over time, this simple script
grew to be one 
of the more complicated scripts I have ever written. 
newping is not by any means the be-all and end-all of remote host
downtime detection. Used in the right way, however, it can
greatly speed up your response to down hosts.
 
 
 About the Author
 
Eric T. Horne is a graduating senior from Cal Poly
at San Luis 
Obispo. He worked as a programmer analyst for 9 months
at Teradyne, 
Inc. (ST division), where he assisted system administrators
and wrote 
several utility sh scripts to help manage and measure
performance 
of systems. He hopes to graduate sometime in August 1993. 
You may contact Eric at 40 San Antonio Street, Newbury Park, CA 91320, 
or at ehorne@phoenix.csc.calpoly.edu. 
 
 