Up and Down
Yufan Hu
As a systems administrator managing more than one system at geographically separate locations, ensuring that all the systems are up and running is of vital importance. This is especially true when these systems receive 24x7 client access. To achieve this goal, some kind of proactive monitoring tool is necessary. This article describes a simple tool written in Perl, which helps me keep an eye on most major services running on a local or remote server, and sends an alert to my pager and email box whenever the up/down status of the service is changed. It discusses some issues related to the design and use of the tool, such as monitoring multiple services on one server, multiple servers, and reliability.
What is Up and What is Down?
Most admins test whether a remote host is up by using the ping command. For example, if you want to see whether www.yahoo.com is up and reachable, you would issue:
ping www.yahoo.com
and expect the following output on a Solaris platform:
www.yahoo.akadns.net is alive
The host name part will be different for each host, but you would expect the substring is alive to be in the output for every host that is alive, or up.
However, this test does not provide anything more than that the host is reachable over the Internet and that the OS kernel is running. It does not tell us anything about whether any services, such as the HTTP server, is running properly. Actually, it is normal for an entire system to be in a totally crashed state but still reachable by ping. On the other hand, the ICMP packets used by ping may be blocked by a firewall and make the ping command useless for testing the availability of a system.
When monitoring a system, we need not only to test whether it is reachable over the net using ping (if possible), but more importantly to test the proper functioning of each individual service that the system provides to the users. These services include, but are not limited to, HTTP, email, DNS, ftp, NNTP, etc.
The simplest way of testing whether any of the above services is up and running is to make a connection to it, then send something over the connection if necessary, and see what we get in return.
For example, to test whether an HTTP server is up and running, we make a connection using the telnet command to port 80 on the remote server, and send HEAD / HTTP/1.0 followed by two return keys:
telnet www.foo.com 80
Trying 10.0.0.1...
Connected to www.foo.com.
Escape character is ^].
HEAD / HTTP/1.0
HTTP/1.1 200 OK
Server: Netscape-Enterprise/3.6 SP2
Date: Sat, 29 Apr 2000 16:02:33 GMT
Content-type: text/html
Connection: close
If the HTTP server is running heathily, we should expect the following line:
HTTP/1.1 200 OK
in the output.
The status of a SMTP service can be tested using a similar approach:
telnet smtp.foo.com smtp
Trying 10.0.0.2...
Connected to smtp.foo.com.
Escape character is ^].
220 smtp.foo.com ESMTP Sendmail 8.9.1/8.9.1; \
Sat, 29 Apr 2000 12:16:11 -0400
(EDT)
quit
221 smtp.foo.com closing connection
Connection closed by foreign host.
We make a connection to the SMTP port and then send a quit command. We should expect a line starting with:
220
if the server is healthy and running.
Based on the above discussion, we could conclude that the approach to determining whether a service is up and running is as follows:
1. Make a connection to the service and send some service-specific strings over the connection, if necessary.
2. Check the output received via the connection to see whether some predefined string pattern is there.
3. If the expected pattern is received, the service is considered up and running. Otherwise it is considered down.
24x7 Monitoring
The above-mentioned approach is good for testing the up/down status of a service, but it is not enough to monitor the system 24x7. We are not going to sit in front of our computer and issue a command every once a while. This task should be done automatically, and the tool to do this should have the following features:
1. It should enable the administrator to perform periodic testing in the background.
2. It should detect the change of up/down status.
3. It should notify us only when the up/down status is changed.
4. It should be generic so that it can be used to monitor different services.
It is very important that the tool send out notifications only when status changes. The notifications are usually sent to a pager using email. If we monitor a service with 5 minutes interval, we certainly do not expect a pager alert every 5 minutes telling us that the server is up and running!
UpDown The Simple Tool
In the past few years, I have administered a couple of servers. Some of them are located in our office, some are located in a server hosting site 200 miles away from our office, and some are located in Asia. Obviously, I was not happy to manually check the status of all these servers. It is an annoying job with many services running on each of the servers. To help myself, I wrote a Perl script called UpDown to perform the tasks that I did, but more frequently and systematically.
UpDown is a very simple but flexible tool. It monitors one service, on one host at a time. The service to monitor is defined in a service configuration file. The general use of this tool is as follows:
updown [-test] service.cfg
where the -test option will show the result on screen as well as sending the notification using email. Otherwise, the result will be sent only via email.
The service.cfg file contains all information about the service to be monitored. It has the following format:
host=www.foo.com
email=yufan@yahoo.com
status=/tmp/www.foo.com.http.status
log=/tmp/www.foo.com.http.log
service=Web Server
command=socket www.foo.com 80
send=HEAD / HTTP/1.0\n\n
expect=^HTTP.*200 OK
In this example:
Host identifies the host name of the host running the service.
Email gives a space-separated list of contact email addresses when notification needs to be sent out. I put both my email address and the email address for my pager there so that I receive alerts through my pager when I am not at my desk.
The status line gives a full file path name to the file in which we are going to store the status. UpDown will store the status of the service it detected upon the previous run in this file. It will compare the previous runs status with the current status and will decide whether to send out notification based on the comparison of the two entries.
log defines another file, which logs the history of up/down status changes.
service provides the name of the service being monitored, so we will know which service is having trouble when we receive a notification. It is not used during testing, but is very important when monitoring more than one service on the same host.
The command line gives the command to do the testing. It can be any command line you type at the shell prompt to do the testing manually. Since most testing can be done by making a simple socket connection to the service, an internal command called socket is implemented in UpDown to simplify the task. If a external command is used, the command should send the result to its stdout and should terminate itself in a short period of time. Currently, UpDown will wait up to 30 seconds for a result after the command is started.
The send line contains the string to be sent to the service once the command is fired. It can include \n to indicate a newline and \r for carriage return, as may be required by individual service protocols. If an external command is used for testing, it should expect input from its stdin, so UpDown can send it the text specified in the send line. Some commands may not need to be fed with anything.
The expect line is a Perl regular expression indicating what is expected for a healthy service. This is where we determine whether the service is up and running.
In the above example configuration, we will test the HTTP service running on port 80 on host www.foo.com. We do this by making a socket connection to the service and then sending a string HEAD / HTTP/1.0\n\n. We expect a string matching ^HTTP.*200 OK returned from the service. If UpDown finds such a string, the service is up; otherwise it is considered to be down and a notifcation will be sent out to email addresses listed in the email line.
A typical notification message will look similar to this:
Date: Sat, 29 Apr 2000 15:55:57 -0400
Subject: Down: problem in establishing the \
connection to www.foo.com:80 Broken pipe
From Name: monitor.foo.com
Webserver on www.foo.com
Down: problem in establishing the \
connection to www.foo.com:80
Broken pipe
Since Sat Apr 29 15:55:57 2000
The notification message tells us that the service Webserver on host www.foo.com must be down, because we cannot make connections to it. It also tells us that the monitoring was done on host monitor.foo.com in case we have set up multiple monitoring sites.
More Examples
Here are some additional example configuration files for monitoring various services.
To monitor the SMTP service on smtp.foo.com:
host=smtp.foo.com
email=yufan@yahoo.com
status=/tmp/smtp.foo.com.smtp.status
log=/tmp/smtp.foo.com.smtp.log
service=SMTP Server
command=socket smtp.foo.com 25
send=quit\n
expect=^220
To monitor the POP3 service on pop.foo.com:
host=pop.foo.com
email=yufan@yahoo.com
status=/tmp/pop.foo.com.pop.status
log=/tmp/pop.foo.com.pop.log
service=POP Server
command=socket pop.foo.com 110
send=quit\n
expect=^\+OK
To monitor the DNS service on ns.foo.com using the external command nslookup:
host=ns.foo.com
email=yufan@yahoo.com
status=/tmp/ns.foo.com.ns.status
log=/tmp/ns.foo.com.ns.log
service=DNS Server
command=nslookup www.foo.com ns.foo.com
send=quit\n
expect=^Server:
The above example shows how a protocol-specific tool can be used to test the availability of the service. For DNS, we are unable to make a simple TCP socket connection and expect a simple string pattern over the connection. There are too many protocol details to implement it. Instead, we fire an existing tool specific to the service to test its availability. This approach makes UpDown extremely flexible. As long as there is a command-line tool to use a service, and the command-line interface only interacts with the user through stdin/stdout, then this command can be used to test the availability of the service. UpDown need not know all the details of the protocol to do the monitoring. The real testing can be done using mature, existing, service-specific tools.
Design Issues
Genericity
UpDown is a very simple tool. It did not try to implement various ways of testing the availability of services by itself. Instead, it tries to encapsulate the generic approach for such a testing procedure: fire a command that best suits the testing of the service; feed the command with some input if necessary; wait for the command to finish and check its standard output for expected string patterns; if it finds the pattern, the service is marked as up and otherwise down.
Periodical Monitoring of the Service
UpDown will do the test only once for each invocation. For repeated testing, use the UNIX cron daemon, which is also mature and reliable. For example, to monitor the http service on www.foo.com based on the configuration file http-foo.cfg every 5 minutes, we can create a crontab entry like this on a Linux box:
*/5 * * * * /usr/local/bin/updown \
/usr/local/etc/updown/http-foo.cfg
However, UpDown does record the status of last run to determine the change of status of the service under monitoring.
Monitoring Multiple Services on One Server and Multiple Servers
UpDown currently only monitors one service on one server for each configuration file in order to keep its simplicity. To monitor multiple services, we can create a configuration file for each service and invoke UpDown once for each configuration. A simple shell script can be created to invoke UpDown for each configuration consecutively, or to put every invocation into the background to fire them in parallel.
Similarly, we can monitor multiple servers. A configuration file is created for each service on every server, and we fire the tests one by one. UpDown can be used to monitor both local hosts and remote hosts. There is no big difference in the configuration for the two.
Notification
UpDown will send notification messages once it detects a change from down to up. There is a little special treatment when the status is changed from up to down. The program will continually send notification if the current run is within half a hour since it first sent out the Down notification. This is to alert me to the server situation and urge me to take prompt action. The multiple notification is also intended to address the unreliability of my paging service. As long as it does not lose all the alerts sent in half an hour, I will receive the notification.
Reliability
UpDown has been running on multiple sites since last December. It has been quite reliable for determining the status of our servers since then.
UpDown relies heavily on the email service to send out notifications. Its reliability actually depends partly on the email service. Although email is becoming more and more reliable these days, it is still subject to failure and is also one of the major targets we want to monitor.
When monitoring remote sites, it is also possible for the connection between the host running UpDown and the host being monitored to become unstable. This will affect the results of testing and will cause UpDown to send out many false alarms due to connecctivity problems.
Currently, there is no satisfactory solution to cope with the above problems. However if we have multiple sites under our control, then we can run UpDown on mulitple servers at the same time to enhance the reliability. UpDown is lighweight and generic. It can be run on any UNIX system with Perl installed. It does not require a lot of resources in order to run.
There is, however, a down side of this distributed monitoring solution. The configuration files will have to be duplicated. This can be solved by maintaining the configuration files in a central location and synchronizing them to various monitoring sites using a tool such as GNU rsync. Another problem is that multiple monitoring sites will send out a lot of down notification messages when you are bringing down the server for maintenance, a somewhat annoying situation and very noisy if you send notification to your pager.
Conclusion
UpDown is a simple, generic, and reliable tool for monitoring services running on the hosts we administer. It relies not on the special testing approach implemented in it, but on more mature existing tools to test each individual service using the command most suited for the testing. It is a command-line based solution that provides a flexible tool for UNIX sys admins. When combined with shell and crontab, it can become a good monitoring solution, using few resources at no extra cost. It is not a perfect tool, but very useful.
About the Author
Dr. Yufan Hu started his systems administration and software development career using UNIX in 1983 on a PDP-11/23. Since then, UNIX systems and network administration have been part of his research and development career. He is currently in charge of networking and system-related activities for Regent Electronics Corp. He can be reached at: yufan@recmail.com.
|