Babysitting Your Intranet and Internet
Jonathan Feldman
At work 12 months ago, we didn't have a good tool, homegrown or otherwise, to monitor our external Internet services and public library Intranet. We needed to make sure that the various servers and services were up at all times because people were going on vacation, and we weren't sure someone who knew how to ping, telnet, and netstat and such would be around at all times. Clearly, the automation department needed some automation. So, we wrote a script, sitter, which does basic network management (i.e., watches the network) and screams if something goes wrong.
We decided to write rather than buy a tool because we know people who've spent a lot of money on "network management" packages. Although some of the packages live up to that lofty name, others do not, because they are overly complex and try to be smarter and more effective than a computer can be. The fact is, if something is wrong with the network, you need a person to fix it, not a tool that uses the network to fix it.
So, the germane function of "network management" should probably be monitoring, with a bonus function of interpreting events in plain terms so that users or administrators can take appropriate action (i.e., "Please reset the mail server," or "Please get someone to check the physical network, more than one gateway is not responding").
And if you know exactly what you want to monitor, it is nice to track precisely the desired services, no more, no less. For example, who cares if your Web server is responding to SNMP or ICMP packets? Your management package might tell you that the server is "up," but this does you no good if, for example, it is not serving up HTML because the httpd process has died. That hunk of silicon might as well be a boat anchor; it's not doing what it's supposed to be doing!
With these points in mind, we chose to code a simple and direct sitter script that would do simple network monitoring. Because the script monitors services rather than servers, it can monitor through a proxy (socks-like) firewall, which not all network manager software can do. The sitter script does the equivalent of sitting at a terminal and continually trying the Web site or sending email back and forth to ensure that the gateway's up. Because this is drudgery of the highest order, thank goodness for automation! You can even avoid the drudgery of typing in the sitter script by getting it from:
ftp://ftp.co.chatham.ga.us/pub/sitter.bash
The script is not fancy, but it works, and you can't beat the price. This tool has saved us a lot of irate phone calls (which is one of the points of network management packages), because we know first thing when something goes down. You do, however, have to remember to check the console periodically. You could code a network "write" message to replace the console messages, but again, it is of questionable value to use the network as a transport for a service or a message when the whole network may be down.
If you were serious about timely intervention, you could hardwire a printer to your help desk from the "sitter" machine, or hook an outgoing modem to the sitter machine and make the sitter script dial several pagers when something is amiss. But, if you assiduously check the console, these measures are unnecessary.
Implementation
The toughest part about coding a monitoring program was figuring out what to do with the monitoring information, and how to write intelligent alert messages based on that information. But, as with any automation, once we figured out how we would do it manually, the process naturally turned into a program.
When figuring out any problem, network or otherwise, it helps to start on the inside and work your way out, that is, to start simple and progressively add complexity. In the network world, this means you should first figure out whether your local network is okay before starting to freak out about other networks. And, always check IP numbers before complicating things with name services. Check those separately, as well as the other services. So, the process, which we had always done manually, turned out to be:
1. Check the local network, because the local network or network card may be the problem.
2. Check other internal networks. Routing may be the problem.
3. Check outside (Internet) networks. ISP may be the problem.
4. Check naming services; a lot of services won't work unless these are working.
5. Check vital services via WKS socket numbers.
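The inside-out order of these five steps can be sketched as a small driver that runs each stage in turn. This is only an illustration: the stage function names are hypothetical placeholders (stubbed out so the sketch runs on its own), and stopping at the first failing stage is a design choice made for this sketch, not necessarily what sitter does.

```shell
#!/bin/bash
# Hypothetical stage functions; each should return 0 when its layer is healthy.
# They are stubs here so that the sketch is self-contained.
check_localnet() { return 0; }
check_internal() { return 0; }
check_internet() { return 0; }
check_dns()      { return 0; }
check_services() { return 0; }

# Run the stages from the inside out and stop at the first failure,
# since a broken inner layer makes the outer results meaningless.
run_checks() {
    local stage
    for stage in "$@"; do
        if ! "$stage"; then
            echo "FAILED at stage: $stage"
            return 1
        fi
    done
    echo "all stages OK"
}

# Usage:
# run_checks check_localnet check_internal check_internet check_dns check_services
```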
Initially, I wrote a pingok() function for the script, which pings the IP numbers given in items #1 through #3 and returns a failure value when passed an IP number that does not respond. The body of the script goes through all the IP numbers and complains appropriately to the console. For item #4, the script uses the nslookup program to go directly to the DNS server in question and ask it what its own name is, because even if it doesn't know anything else, it had better know its own name. This was a check I used to run manually. For example, while logged into "degobah," to check a couple of other machines' DNS, I would type:
$ for i in hoth alderan endor ; do nslookup $i $i ; done
Any error meant that the name server was down. I simply applied this to the monitoring script.
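A minimal sketch of how these two checks might look as shell functions follows. It is not the code from sitter itself. The "-c 1" ping flag is an assumption that holds on most Unix systems but not all, dnsok and report_down are hypothetical names, and dnsok relies on nslookup returning a non-zero exit status on failure, which not every version does reliably.

```shell
#!/bin/bash
# pingok HOST: return 0 if HOST answers one ICMP echo, non-zero otherwise.
# "-c 1" (send a single packet) is an assumption; ping flags vary by system.
pingok() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# dnsok SERVER: ask a name server to resolve its own name.
# Assumes nslookup exits non-zero when the lookup fails.
dnsok() {
    nslookup "$1" "$1" >/dev/null 2>&1
}

# report_down CHECK HOST...: print every host that fails the given check.
report_down() {
    local check=$1 host
    shift
    for host in "$@"; do
        "$check" "$host" || echo "$host is not responding to $check"
    done
}

# Usage (the addresses are placeholders):
# report_down pingok 192.0.2.1 192.0.2.2
# report_down dnsok hoth alderan endor
```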
The next step was to figure out exactly how to monitor a service rather than a server. Each service usually lives on a WKS port, which is typically listed in /etc/services on UNIX, in %WINBOOTDIR%\services on Win 95, or in %SystemRoot%\system32\drivers\etc\services on NT. Some services, notably the (relatively) new ones like http, aren't listed by default in some distributions, but you can always add the line:
http 80/tcp
to your services file, which makes for lovely symbolic resolution of service names.
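To see what a name will resolve to, you can read the services file yourself. The function below is a sketch with a hypothetical name, service_port. It assumes the usual "name port/protocol aliases" layout with "#" comments.

```shell
#!/bin/bash
# service_port NAME [PROTO] [FILE]: print the port number registered for a
# service name or alias, reading a file in /etc/services format.
service_port() {
    local name=$1 proto=${2:-tcp} file=${3:-/etc/services}
    awk -v name="$name" -v proto="$proto" '
        { sub(/#.*/, "") }                 # strip trailing comments
        NF < 2 { next }                    # skip blank or incomplete lines
        {
            split($2, pp, "/")             # "80/tcp" -> pp[1]=80, pp[2]=tcp
            if (pp[2] != proto) next
            for (i = 1; i <= NF; i++) {    # field 1 is the name, 3+ are aliases
                if (i != 2 && $i == name) { print pp[1]; exit }
            }
        }
    ' "$file"
}

# Usage:
# service_port http        # prints the TCP port listed for http, if any
# service_port domain udp
```

An empty result means the name is not listed for that protocol, which is also a quick way to tell whether a service is TCP or UDP.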
We used to check services by hand using the versatile Unix telnet tool. For example, a simple way to see if any process is listening at the http socket of your Web server might be:
echo -e "J close\r" | telnet -e J www.degobah.com http ||
echo "Can't reach the web server!"
If the server is up, but nothing is listening at the socket (or the process doing the listening has died), you'll get a "Connection Refused." If the server is down, you'll hang for as long as the operating system's connect() timeout allows. For Linux 1.2.x, this could mean several minutes, but when doing this interactively, you can just hit the interrupt key. This is usually a good enough test of whether a service is up, because unless somebody malevolent or prankish has reconfigured your box, the service is either listening or it's down.
So, although you could type:
echo -e "J close\r" | telnet -e J smtp.my.com smtp || echo "Can't reach our mail server!"
you could also type:
( sleep 5 ; echo -e "J close\r" ; ) | telnet -e J smtp.my.com 25 |
grep -i SMTP || echo "Can't reach our mail server!"
which looks for the server to say something about SMTP. Again, you can extend this idea to any TCP service. If, for example, you are working for the Ultra-Paranoid HTML Corporation, you might want to make darn sure that your web server is up, and serving up the absolutely correct HTML index:
( echo -e "GET /index.html\r" ; sleep 5 ; echo -e "J close\r" ; ) |
telnet -e J www.my.com 80 |
grep "Some Unique Text in Your Index Page" ||
echo "I can take it, but I can't dish it out! (WWW Down!)"
This is probably overkill for basic monitoring functions, but it is possible.
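The SMTP and HTML one-liners above follow the same pattern, so they can be folded into one function. The following is a sketch under stated assumptions: banner_ok and WAIT are hypothetical names, and the telnet client is assumed to accept the -e escape-character option as in the examples above.

```shell
#!/bin/bash
# WAIT is a hypothetical tunable: seconds to hold the connection open so the
# server has time to answer before we close.
WAIT=${WAIT:-5}

# banner_ok HOST PORT PATTERN [REQUEST]: connect with telnet, optionally send
# one request line, and succeed only if the reply contains PATTERN.
banner_ok() {
    local host=$1 port=$2 pattern=$3 request=$4
    {
        [ -n "$request" ] && printf '%s\r\n' "$request"
        sleep "$WAIT"
        printf 'J close\r\n'           # escape character J, then "close"
    } | telnet -e J "$host" "$port" 2>/dev/null | grep -q -- "$pattern"
}

# Usage:
# banner_ok smtp.my.com 25 SMTP || echo "Can't reach our mail server!"
# banner_ok www.my.com 80 "Some Unique Text" "GET /index.html" || echo "WWW Down!"
```

Because the function's exit status comes from grep, it reports success only when the expected text actually arrives, not merely when the connection opens.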
So, since services can be checked manually using telnet, it is also possible, through the magic of shell scripting, to do it in an automated, unattended fashion. This is what I did in the sitter script. Bear in mind that the only services discussed here are TCP services, because the telnet tool reads from and writes to a TCP socket, but fortunately, many critical services use TCP sockets. When deciding which services to monitor, if you're unsure whether it is a TCP or UDP service, you can always check the services file.
The neat thing about using telnet in this way is that you are pretending that you are a real client of whatever service you're checking, and so you can monitor a service even through a firewall or proxy gateway. So, instead of pinging your way around networks, you could simply connect to a WKS that is allowed through the firewall. To this end, so that the sitter script could be "inside" and monitoring, I implemented a variable, "FAKEPING," which, instead of real-pinging via ICMP, uses telnet to connect to the FAKEPING socket. (See Figure 1.)
Of course, to use telnet to "FAKEPING," you must specify a socket that you know the receiving end will accept when all is working correctly. The "echo" socket used to be a good one to use, but since some denial-of-service attacks use this socket as a target, most serious machines like the DNS root servers no longer accept echo requests. The workaround for these servers is to use the "domain" socket, because they do allow this type of request. (See Figure 1.)
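The idea behind FAKEPING can be sketched as a single function that switches between the two methods. This is an illustration rather than the code in Figure 1: the function name reachable is hypothetical, the "-c 1" ping flag is an assumption, and the telnet branch relies on the client returning a non-zero exit status when it cannot connect.

```shell
#!/bin/bash
# FAKEPING: empty means "use real ICMP ping"; otherwise it names a TCP socket
# that the far end is known to accept, such as "domain".
FAKEPING=${FAKEPING:-}

# reachable HOST: return 0 if HOST answers by whichever method is configured.
reachable() {
    local host=$1
    if [ -z "$FAKEPING" ]; then
        ping -c 1 "$host" >/dev/null 2>&1
    else
        # Connect, then escape and close at once; the result is telnet's exit status.
        printf 'J close\r\n' | telnet -e J "$host" "$FAKEPING" >/dev/null 2>&1
    fi
}

# Usage:
# FAKEPING=domain
# reachable 192.0.2.53 || echo "192.0.2.53 is not responding"
```

Note that this simple version still blocks for the full connect() timeout when the address is dead.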
The implementation of FAKEPING, in our Linux environment at any rate, proved to be interesting. Since Linux's telnet tool takes more than a couple of minutes to time out on a dead IP address, checking IP addresses with FAKEPING would be time consuming at best. In the interests of quick reportage, the script spawns the telnet process (along with its associated "OK or not OK" subshell), and if it doesn't report back within the TIMEOUT period, assumes that the FAKEPING failed. (See Figure 1, lines 64-70.)
I hate using temporary files to hold information of a flag-oriented nature, because you can never be absolutely sure that you are using a uniquely named file. But, to communicate backward from a spawned process, I either had to: (a) create a named pipe, thus creating a uniquely named temporary file anyway, and then worry about permissions of the calling process; or (b) have the spawned process create a uniquely named and previously agreed upon regular file that would either be of zero size on failure, or greater than zero on success. I chose option b, because it was simpler. This was one of the few limitations of using a shell script versus C or Perl encountered during this particular adventure, so it was no big deal - just annoying from a purist's point of view.
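The spawn, flag file, and timeout mechanism might be sketched as follows. This is not the listing from Figure 1. The function name try_with_timeout is hypothetical, and it swaps in mktemp (available on most current systems, but an assumption here) to get a unique file name instead of a previously agreed-upon one.

```shell
#!/bin/bash
# TIMEOUT is in seconds.
TIMEOUT=${TIMEOUT:-5}

# try_with_timeout CMD [ARGS...]: run CMD in the background and give it
# TIMEOUT seconds to succeed. Success is signalled by a non-empty flag file.
try_with_timeout() {
    local flag pid waited=0
    flag=$(mktemp "${TMPDIR:-/tmp}/sitter.XXXXXX") || return 2
    # The subshell writes to the flag file only if the command succeeds.
    ( "$@" >/dev/null 2>&1 && echo OK > "$flag" ) &
    pid=$!
    # Poll once a second until the job ends or the time is up.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$TIMEOUT" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    kill "$pid" 2>/dev/null      # fails harmlessly if the job already finished
    wait "$pid" 2>/dev/null      # reap the job so no zombie is left behind
    if [ -s "$flag" ]; then
        rm -f "$flag"
        return 0
    fi
    rm -f "$flag"
    return 1
}

# Usage, with a hypothetical check function:
# try_with_timeout some_check 192.0.2.1 || echo "192.0.2.1 did not answer in time"
```

Killing the subshell here does not necessarily kill the command it started, which may linger until it finishes on its own.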
However, the bash (Bourne Again Shell) does have a nice feature, the $! variable, which holds the process ID of the last spawned background process, so it provided a good handle to kill the process when necessary. Note that job control does not work in a batch script; it only works from the command line. Thus, referring to the spawned process as % and trying to kill it with kill % does not work from a script, although it does work from the command line.
Linux telnet once again proved somewhat stubborn, in that it does not die when its parent dies, but decides to stick around with a parent ID of 1 (init). To keep things tidy, we did not kill the subshell; instead, we killed the telnet explicitly with:
ps -l | awk '/'$!'/ && /telnet/ {print $3}' | \
xargs kill >/dev/null 2>&1
instead of:
kill $!
The shell dies on its own once the telnet dies. I think that the kill $! of the shell would kill the telnet child with some default distribution telnet clients, but probably not most. It does not work under Dynix/ptx 2.01 or AIX 3.2.5, but these clients also have a much shorter timeout, making it less of an issue. However, their default telnet timeouts are still greater than a minute, which is silly when checking a service. If a service doesn't respond in a couple of seconds, something is wrong. So, spawning the process and killing it off in this way actually provides more control and lets you define your own acceptable response time.
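One weakness of matching the parent's process ID as a pattern anywhere on the ps line is that 123 also matches 1234. A stricter variant compares the parent-ID field exactly. The sketch below makes some assumptions: child_pids and kill_child are hypothetical names, the ps program must accept the POSIX -o option (older ones may not), and the command column must show the bare program name.

```shell
#!/bin/bash
# child_pids PARENT NAME: read "PID PPID COMMAND" lines on standard input and
# print the PID of every process named NAME whose parent is exactly PARENT.
child_pids() {
    awk -v parent="$1" -v name="$2" '$2 == parent && $3 == name { print $1 }'
}

# kill_child PARENT NAME: kill the matching children of PARENT.
# Assumes a ps that supports "-o pid= -o ppid= -o comm=" (headers suppressed).
kill_child() {
    local pids
    pids=$(ps -e -o pid= -o ppid= -o comm= | child_pids "$1" "$2")
    [ -n "$pids" ] && kill $pids 2>/dev/null
}

# Usage, right after spawning the subshell that runs telnet:
# kill_child "$!" telnet
```

Because the columns are requested by name, the awk field numbers no longer depend on the local layout of ps -l.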
Usage and Porting Concerns
Some caveats: the sitter script was written in and tested with bash, and may or may not work with the Korn (ksh) shell. It definitely will not work with the standard Bourne (sh) shell or the C (csh) shell. Fortunately, bash is freely available from the Free Software Foundation, or as source from your nearest Linux distribution.
Also, the ps command works differently on different machines. You will almost definitely have to change the PID field number that the awk command extracts from the ps command ($3 under Linux) in Listing 1, line 70. You might also have to change the flag that ps issues (same line), but -l will probably work for most.
The script relies on standard tools like grep, awk, ps, ping, telnet, and nslookup, so you don't have to reinvent those particular wheels. However, the script won't work unless those tools are in the calling environment's search path. (They usually are on most systems.) Also, if you want to use the script from the inside of a proxy firewall, say, socks, you must have "socksified" your system. (This is probably going to be true if you are actually using socks for anything.)
Using the script is easy. Substitute your own IP numbers in place of the LOCALNET, REMOTENET, and INET numbers (starting on line 24, Listing 1). Remember that to get a good sense of whether a network or just one host is down, the script needs to be given more than one "known good host" IP number per variable, separated by spaces.
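The reason for listing more than one known good host can be sketched like this: if every host in a list fails, suspect the network, and if only some fail, suspect those hosts. The addresses are placeholders, and net_status is a hypothetical function written for illustration, not taken from Listing 1.

```shell
#!/bin/bash
# Placeholder addresses; substitute your own known good hosts.
LOCALNET="192.0.2.1 192.0.2.2"

# net_status CHECK LABEL HOST...: run CHECK against each host and report
# whether the whole network or only some hosts appear to be down.
net_status() {
    local check=$1 label=$2 host down="" total=0 failed=0
    shift 2
    for host in "$@"; do
        total=$((total + 1))
        if ! "$check" "$host"; then
            failed=$((failed + 1))
            down="$down $host"
        fi
    done
    if [ "$failed" -eq 0 ]; then
        return 0
    elif [ "$failed" -eq "$total" ]; then
        echo "$label: no known-good host answers; the network may be down"
    else
        echo "$label: host(s) not responding:$down"
    fi
    return 1
}

# Usage, with a hypothetical pingok function that returns 0 for a live host:
# net_status pingok "local network" $LOCALNET
```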
To test your own DNS, change line 32 to reflect your own DNS hosts. And to change the script so that your own services are tested, change line 34 to reflect "host:service," separated by spaces.
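A "host:service" list is easy to take apart in the shell. The sketch below shows one way to do it. The variable name SERVICES and the function each_service are assumptions made for illustration, and may not match the names used in the listing.

```shell
#!/bin/bash
# Placeholder list in "host:service" form, separated by spaces.
SERVICES="www.my.com:http smtp.my.com:smtp"

# each_service CHECK PAIR...: split each pair at the colon, call
# CHECK HOST SERVICE, and report the pairs for which CHECK fails.
each_service() {
    local check=$1 pair host service
    shift
    for pair in "$@"; do
        host=${pair%%:*}       # text before the first colon
        service=${pair#*:}     # text after the first colon
        "$check" "$host" "$service" || echo "$service on $host is not answering"
    done
}

# Usage, with a hypothetical check function that takes a host and a service:
# each_service some_check $SERVICES
```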
Of course, testing the script with "known bad" services is important too, just to make certain it works properly in your environment, so just give the script IP numbers that don't exist, or services that aren't in use, then run it and watch it react. Hopefully, once you run the script with the "known good" hosts and services, everything will look terrific, and the sitter will exit silently. I like to tell cron to run the script every hour on the half hour. If you are more Type A, you may want to do it more often.
The great thing is, this sitter will never tell you that it's unavailable because it's going to the prom, it will never bring over unwanted guests, it won't complain to its friends about your kid, and it won't use your phone unless you ask it to.
About the Author
Jonathan Feldman works at the Chatham County Government in Savannah, Georgia, with UNIX and NetWare. He likes to keep things simple so that even he can understand them. In his spare time, he likes to write spooky poetry and grow babies, roses, and grapes with his pal, Stacy.