Viagra:
Keeping Services Running on BSD
James O'Gorman
Services must stay up. To make this happen, the system must be
set up and maintained correctly. Otherwise, daemons will die on
you. When that happens, your users will not be happy because they
will not get the expected services. So the question is: "How
do I keep my services up?" You may believe that if you build
the server correctly and run bug-free code, then bad things will
not happen.
However, bad things do happen. You can spend a lot of money on
good hardware that may still fail. For instance, what if there is
a previously unknown bug in your smtp server, then some strange
client connects to it, sends a date string that is too long, and
the smtpd dies? What if you change to Apache and instead
of sending it a -HUP, you kill it without realizing you've
done so. No matter how long you have been working with UNIX, you
are still human and mistakes happen. Even the best administrator
will have daemons die once in a while.
There are two common ways to deal with these problems. One way
is that someone lets you know when a service dies, then you fix
the problem and hope that it does not happen again. But what if
you are unavailable when the service dies?
The other thing you can try is to make the service restart itself
and then notify you. Your users then will not experience downtime,
you know what happened, and you can deal with the problem.
A while ago, I was looking for a way to keep services up and running
on my servers. I wanted a tool that would be small, easy to administer,
and give me good troubleshooting tools to track down the source
of any problem that may arise.
I checked different Web sites and found a few programs that looked
promising -- the best of which was Daemon Tools (http://cr.yp.to/daemontools.html).
After looking through the documentation, I decided that I did not
need the amount and type of features that Daemon Tools provided.
Furthermore, for the number of machines I wanted to use it on, the
setup time was going to take me a while. Although it did not fit
my needs, many people use it with great success. If you are looking
for a nice service-watching utility, compare the script I discuss
here with Daemon Tools, then choose the one that best fits your
needs. If you find that you need to do advanced things (e.g., easily
pause services, get their status, or have log file monitoring),
Daemon Tools may be your only option.
After I decided against using Daemon Tools, I found a promising
script on UGU (http://www.ugu.com). I tried to find the URL
to the original script, but was unsuccessful. The original script
was a csh script that was for Sys V, and was not very flexible.
I changed a bit on it to make it work with BSD and also altered
the way services are restarted. I named my changed copy Viagra,
because it keeps the services up and at the ready. A few months
after that, I bought the book Unix Hints and Hacks by Kirk
Waingrow and saw the original version in there. Because I had always
meant to make my version available to others, I figured now is as
good a time as any (see Listing 1). Let's take a quick look
at the script:
#!/bin/csh
foreach DAEMON ( inetd apache )
ps -cax | fgrep "$DAEMON:t" | cut -c27-80 > /dev/null
if ( $status > 0) then
echo "Restarting $DAEMON"
date
/root/scripts/start/$DAEMON &
endif
end
The Script
I will walk through the script to give you an idea on how it works
and how to use it.
The first line:
foreach DAEMON ( inetd apache )
simply defines the variables for the script. In between the parentheses,
insert the name of each service you want to watch as it would appear
in a ps listing. Only services that you want to monitor should
be placed in here:
ps -cax | fgrep "$DAEMON:t" | cut -c27-80 > /dev/null
Next, we do a process listing of the machine, ps -cax, pipe
the output of that to a fgrep statement that searches for the
service's name, fgrep "$DAEMON:t", then pipe that to a
cut statement. The cut statement deletes everything
up to column 27, because in the ps listing, column 27 is where
the names of the daemons first appear. We are not interested in anything
that comes before that. The output from all of this is piped to /dev/null,
because we are not really interested in what it returns, just its
exit status:
if ( $status > 0) then
echo "Restarting $DAEMON"
date
Once we know whether the service is running, we have to act. The if
statement will check the exit status of the fgrep command.
If the exit status is 0, the condition will not match, and the script
will move on. If it does not match, we echo out a statement
that tells which daemon is restarting and the date so we know when
this happened.
If Apache has died, for example, any output from cron will get
emailed to the owner of the cron file. Root will receive an email
containing "Restarting Apache" with the date and time,
and Apache will be restarted.
Restarting the Service
/root/scripts/start/$DAEMON &
Once the previous steps are completed, we must restart the service.
We need to execute another script. I have Viagra set to execute a
script that is stored in /root/scripts/start/ and is named
the name of the daemon that you need to restart. I think this gives
us a lot of room in what we want to do next.
For instance, when looking for Apache, I normally start it on
boot using apachectl. To keep this simple, we could place
a file in /root/scripts/start/ called apache, then
place just a couple of lines into the file. We could make those
lines just:
#!/bin/sh
/usr/local/sbin/apachectl start
Then, when that script is executed, apachectl will be started
with the start command just like on boot-up of the system.
Let's say you have been having a problem with Apache --
it keeps dying on you and you do not know why. We could use the
script that restarts Apache to do a few other things. For instance,
do a ps aux to get a snapshot of what is occurring in the
process before you restart Apache. Perhaps a w to see who
is logged in and what they are doing. You could also play around
with vmstat to see what type of memory usage appears at at
that time or send an email to your pager to let you know your box
is having problems. This could be a great troubleshooting tool for
your servers.
You could also change the way Viagra runs the scripts. For instance,
if there are a lot of scripts you run in /usr/local/etc/rc.d/
on boot, and you want to use those to restart your services, you
just change the line /root/scripts/start/$DAEMON to /root/scripts/start/$DAEMON
start. Then, make symlinks from your /usr/local/etc/rc.d's
scripts to your /root/scripts/start dir. For instance, if
you look at Apache and in /usr/local/etc/rc.d there is a
file called apache.sh, you could symlink that script to /usr/root/start/apache.
Then, if you want to change the way Apache starts up, you only have
to make changes in one place. I prefer not to do this because, if
a service dies, I like to restart the daemon in a different manner
(e.g., to get a process listing mailed to me as well).
From there, the script loops back and goes through again for any
other daemons you might have defined the steps. Once it runs through
them all, the script ends.
When to Run?
After you have the script running the way you want, and watching
the daemons that you want it to watch, you must automate the running
of the script. In root's crontab, I have */10 * * * * /path/to/viagra.
This sets Viagra to run every ten minutes. Depending on your servers,
you may want it to run more or less often. Simply change the /10
to /5 if you want to run the script every five minutes, and
so on.
There is a lot you can do with this script. It may not be perfect,
but it has worked great for me. Feel free to look at it, poke it,
prod it, and change it. Use it if you like it, expand on it if you
like, or ignore it forever. Just make sure you have some way of
keeping your services up.
Jim O'Gorman lives in Lincoln, Nebraska with his wife,
son, and soon-to-be second child. He works for iPlanet E-Commerce
Solutions (a Sun-Netscape Alliance).
|