newping: Remote Host Downtime Detection
Eric Horne
Over the course of nights spent waiting for the Sun
cluster in our
lab to recover from yet another crash, my lab partners
and I reached
the conclusion that the main problem with recoveries
lay not in the
execution but in letting the administrator know that
the system was
down. In fact, knowing whether a system is down, or
even hung, is
the first step in solving the problem.
A Tool Called ping
A reasonable tool for this sort of detection might be
the ping(1)
command, which sends an echo request from the source
machine to the
destination machine. Upon receipt of the request, the destination
machine echoes the accompanying data back to the source
machine. When the source machine receives the echoed
data, it reports
that the destination machine is up. If the source machine
does not
get the echoed data within a certain timeout period,
the source machine
reports the destination as down. ping also reports other
problems
it finds, like unreachable host/network or unknown hostname.
Automating the check is easy: a shell script runs ping against each
host, and cron runs the script every X minutes.
A Closer Look at ping
Looking at ping a little more closely, I found that
it wasn't
as reliable as I had hoped. ping uses ICMP (Internet
Control
Message Protocol) to ask for a simple echo of some data.
The data
consists of a message number and the time sent from
the source. The
destination machine receives the ICMP echo request and
sends back
an ICMP echo reply. When the source machine receives
the echo reply,
it calculates the difference between the current time
and the time
data in the reply to find the total elapsed time it
took the message
to bounce back. If the destination machine is down, the source machine
will eventually time out waiting for a response that will never come.
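The elapsed-time calculation described above can be sketched in a few
lines of C. This is only an illustration: the echo_payload structure
and the round_trip_ms() function are hypothetical names invented here,
not taken from the ping source, and the real payload layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical echo payload: a message number plus the send time.
 * The field names are illustrative only. */
struct echo_payload {
    uint16_t       seq;   /* message number */
    struct timeval sent;  /* time the request left the source */
};

/* Milliseconds between the time stored in an echoed payload and the
 * time the reply arrived at the source. */
long round_trip_ms(const struct echo_payload *reply,
                   const struct timeval *now)
{
    long sec, usec;

    assert(reply != NULL && now != NULL);
    sec  = (long)(now->tv_sec - reply->sent.tv_sec);
    usec = (long)(now->tv_usec - reply->sent.tv_usec);
    return sec * 1000L + usec / 1000L;
}
```

Because the send time travels inside the echoed data, the source needs
no per-request bookkeeping; it only compares the echoed timestamp with
the current time.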
At first I didn't think much about this. In fact, I figured it was
exactly what I needed, until I stumbled onto one important fact: no
user-level process on the destination machine ever sees the ICMP echo
request, because it is answered at the lowest level of the networking
code. Loosely speaking, the Ethernet card intercepts the ICMP echo
request and sends the reply all by itself, leaving the CPU free to
handle other things, like user processes.
Letting the Ethernet card alone answer the echo request means that ping
rests on a big assumption: that the Ethernet card's status represents
the machine's status. That is not always true. The Ethernet card can
keep answering ICMP echo requests while the CPU has come to a grinding
halt because of a disk error or some other problem. Like it or not,
systems hang, and ping won't tell you a thing about it.
A Look at newping
newping's purpose is to help correct ping's shortcomings. While this
program performs many of the same services as ping, it works quite
differently. One significant difference between newping and ping is the
choice of protocol. ping uses ICMP, while newping uses TCP, which forces
its way past the Ethernet card and demands attention from inetd, and
thus from the CPU, on the destination machine. Using TCP allows newping
to connect() to the destination host via a well-known port. (newping
uses the TIME/TCP port.) Through inetd, each well-known port is
associated with a daemon or, failing that, with some other piece of
runnable code.
When inetd detects a connection to a particular port,
it notifies
the daemon and redirects the data, if any, to it. The
daemon wakes
up and handles the request. Notice that the inetd and
daemon
actions both require a response from the CPU, not only
the Ethernet
card. Both newping and ping report the test results to the user,
although newping returns much more detailed and meaningful values than
ping (see Table 1 for details).
newping Code
Apart from the initial startup instructions, the newping code is
straightforward. newping begins by reading the command line to determine
any options, the name of the host to check, and the amount of time, in
seconds, to use as a timeout. The program processes this information and
stores it in several variables and structures used later by connect().
After verifying other miscellaneous data, newping uses signal() and
alarm() calls to set an internal alarm to go off in one second. When the
second is up, the alarm calls an action routine, noconnect(), which
keeps track of the time that has passed. If the time passed is more than
the timeout value, a connection was not made within the time limit, and
newping times out. Otherwise, newping continues, with the alarm set to
go off in another second. In effect, noconnect() either gives newping
another second to connect or exits newping with a return code of 1
(Listing 2). If a connection is made before the timeout, the connection
is deemed successful and the next phase begins.
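The one-second tick described above might look roughly like the
following C sketch. It is not Listing 2. The names may_keep_waiting(),
start_connect_timer(), elapsed_secs, and timeout_secs are hypothetical,
and the handler calls _exit() rather than exit() because only the
former is safe to call from a signal handler; that substitution is my
assumption, not something taken from the listing.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <signal.h>
#include <unistd.h>

#define EXIT_NOCONNECT 1  /* return code the article gives for a connect timeout */

static volatile sig_atomic_t elapsed_secs = 0;   /* seconds waited so far */
static volatile sig_atomic_t timeout_secs = 10;  /* hypothetical default */

/* Decision step: 1 if the wait may continue, 0 once the time passed
 * is more than the timeout value. */
int may_keep_waiting(int elapsed, int timeout)
{
    return elapsed <= timeout;
}

/* SIGALRM action while connect() is still pending. */
void noconnect(int signo)
{
    (void)signo;
    elapsed_secs++;
    if (!may_keep_waiting(elapsed_secs, timeout_secs))
        _exit(EXIT_NOCONNECT);   /* timed out: leave with code 1 */
    signal(SIGALRM, noconnect);  /* re-install for System V style signal() */
    alarm(1);                    /* grant one more second */
}

/* Arm the first one-second tick; call this just before connect(). */
void start_connect_timer(int timeout)
{
    assert(timeout > 0);
    elapsed_secs = 0;
    timeout_secs = timeout;
    signal(SIGALRM, noconnect);
    alarm(1);
}
```

A caller would invoke start_connect_timer(timeout), call connect(), and
then call alarm(0) to cancel the pending tick once the connection
succeeds.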
If the connection is successful, which implies that the Ethernet card
is OK, newping next tests the CPU. A quick call to signal() resets the
SIGALRM action from calling noconnect() to calling noresponse(). Both
functions perform the same task; the difference between them lies in
the status code they return via exit() if the count exceeds the timeout
value (see Listing 3). It is important to notice that the time passed
is not reset after the noconnect() phase, so both phases share one
timeout. The CPU test works in much the same way as the Ethernet card
test does. The only real difference is that newping calls recv()
instead of connect(). The call simply blocks, waiting for the
destination machine to send its data. If newping times out before
recv() detects the presence of any data, the destination machine is
most likely hung. The Ethernet card responds, but the CPU is either so
loaded with work that it does not have time to return data, or it has
halted. Either case is worth investigating.
Recall that newping connects to the TIME/TCP port of the destination
host. I chose the TIME/TCP port because when it detects a connection,
it automatically sends the local time through it and closes the
connection. The data returned is not used for anything; its mere
presence means that the machine is responding. Receiving the data
allows newping to exit with a code of 0. Detecting any sort of error
forces newping to exit with the proper return code (see Table 1).
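To make the two phases concrete, here is a minimal C sketch of a
connect-then-recv probe against the TIME/TCP port (port 37). It is a
variant, not the newping source: it uses a single alarm() whose handler
merely interrupts the blocked call, instead of the per-second handlers
described above, and it uses getaddrinfo() for the lookup. The function
name probe_time_port() and the result values 2 and 3 are assumptions
made for this example; only 0 and 1 come from the text.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum probe_result {
    PROBE_OK = 0,           /* data arrived: the host is responding */
    PROBE_NO_CONNECT = 1,   /* connect() timed out */
    PROBE_NO_RESPONSE = 2,  /* assumed value: connected, no data in time */
    PROBE_ERROR = 3         /* assumed value: lookup, socket, or other error */
};

/* Empty handler: its only job is to make the blocked call fail with EINTR. */
static void on_alarm(int signo) { (void)signo; }

enum probe_result probe_time_port(const char *host, unsigned timeout_secs)
{
    struct addrinfo hints, *res;
    struct sigaction sa, old;
    enum probe_result result;
    unsigned char buf[4];
    ssize_t n;
    int fd;

    assert(host != NULL && timeout_secs > 0);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, "37", &hints, &res) != 0)
        return PROBE_ERROR;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0) {
        freeaddrinfo(res);
        return PROBE_ERROR;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;   /* sa_flags stays 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old);
    alarm(timeout_secs);        /* one budget shared by both phases */

    /* A signal landing exactly between connect() and recv() would be
     * missed; a sturdier version would need to close that window. */
    if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        result = (errno == EINTR) ? PROBE_NO_CONNECT : PROBE_ERROR;
    } else {
        n = recv(fd, buf, sizeof buf, 0);
        if (n > 0)
            result = PROBE_OK;  /* contents ignored, presence is enough */
        else if (n < 0 && errno == EINTR)
            result = PROBE_NO_RESPONSE;
        else
            result = PROBE_ERROR;
    }

    alarm(0);
    sigaction(SIGALRM, &old, NULL);
    close(fd);
    freeaddrinfo(res);
    return result;
}
```

A call such as probe_time_port("somehost", 10) returns PROBE_OK only if
the remote side actually sent data, which is the property that
distinguishes this test from an ICMP echo.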
Automating Detection
Unlike ping, newping returns distinguishable codes for the different
states of a destination machine, which makes it very useful in shell
scripts. A script using newping, a list of destination hosts, and a few
loops can be surprisingly effective. Listing 4 shows a simple script,
worthy of your improvement.
The script in Listing 4 acts as a base script. You can
change it to
match your needs. It could easily be altered to record
the times a
host went down and came back up again. It could notify
a list of people
using write(1) or mail(1) that there is a particular
problem with some host. It could even test the stability
of a network.
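Listing 4 is a shell script, but the same idea of acting on newping's
return code can be sketched in C for readers who want to drive it from
a program. Everything here is illustrative: describe_status() and
check_host() are hypothetical names, the "newping host timeout" command
form is an assumption about the command line, and only codes 0 and 1
are taken from the text (all others are lumped together).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Map a newping return code to a short label. Codes 0 and 1 come from
 * the article; anything else is reported generically. */
const char *describe_status(int code)
{
    switch (code) {
    case 0:  return "up";
    case 1:  return "no connection";
    default: return "problem";
    }
}

/* Run newping against one host and return its exit code, or -1 if the
 * command could not be run. Host names must come from a trusted list,
 * since the string is handed to the shell. */
int check_host(const char *host, int timeout_secs)
{
    char cmd[256];
    int n, status;

    assert(host != NULL);
    n = snprintf(cmd, sizeof cmd, "newping %s %d", host, timeout_secs);
    if (n < 0 || n >= (int)sizeof cmd)
        return -1;               /* command line did not fit */
    status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Looping over a list of hosts and printing describe_status(check_host(h,
10)) for each one gives the same kind of report as the base script.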
cron can run the script in Listing 4 every X minutes,
allowing
an ongoing, automatic notification system. My department
even hooked
a terminal to the back of my Sun 4, through /dev/ttya,
to which
a script could then write the status of each host in
a list. The script in Listing 4 can be altered to do this by changing
/dev/console to /dev/ttya. We ran our version from cron every five
minutes, so the status shown for each important host was never more
than five minutes old. Our script
logged host downtime for a monthly report, which summarized
how often
a system went down and for how long. As your needs change,
the script
can change with them. Over time, this simple script
grew to be one
of the more complicated scripts I have ever written.
newping is by no means the be-all and end-all of detecting remote host
downtime. Used in the right way, however, it can greatly speed up your
response to down hosts.
About the Author
Eric T. Horne is a graduating senior at Cal Poly, San Luis Obispo. He
worked as a programmer analyst for nine months at Teradyne, Inc. (ST
division), where he assisted system administrators and wrote several
utility sh scripts to help manage and measure the performance of
systems. He hopes to graduate in August 1993. You may contact Eric at
40 San Antonio Street, Newbury Park, CA 91320, or at
ehorne@phoenix.csc.calpoly.edu.