Viagra: Keeping Services Running on BSD

James O'Gorman

Services must stay up. To make this happen, the system must be set up and maintained correctly. Otherwise, daemons will die on you. When that happens, your users will not be happy because they will not get the expected services. So the question is: "How do I keep my services up?" You may believe that if you build the server correctly and run bug-free code, then bad things will not happen.

However, bad things do happen. You can spend a lot of money on good hardware that may still fail. For instance, what if there is a previously unknown bug in your SMTP server, and some strange client connects to it, sends a date string that is too long, and the smtpd dies? What if you make a change to Apache and, instead of sending it a -HUP, you kill it without realizing you have done so? No matter how long you have been working with UNIX, you are still human, and mistakes happen. Even the best administrator will have daemons die once in a while.

There are two common ways to deal with these problems. One way is that someone lets you know when a service dies, then you fix the problem and hope that it does not happen again. But what if you are unavailable when the service dies?

The other thing you can try is to make the service restart itself and then notify you. Your users then will not experience downtime, you know what happened, and you can deal with the problem.

A while ago, I was looking for a way to keep services up and running on my servers. I wanted a tool that would be small, easy to administer, and give me good troubleshooting tools to track down the source of any problem that may arise.

I checked different Web sites and found a few programs that looked promising -- the best of which was Daemon Tools (http://cr.yp.to/daemontools.html). After looking through the documentation, I decided that I did not need the amount and type of features that Daemon Tools provided. Furthermore, for the number of machines I wanted to use it on, the setup time was going to take me a while. Although it did not fit my needs, many people use it with great success. If you are looking for a nice service-watching utility, compare the script I discuss here with Daemon Tools, then choose the one that best fits your needs. If you find that you need to do advanced things (e.g., easily pause services, get their status, or have log file monitoring), Daemon Tools may be your only option.

After I decided against using Daemon Tools, I found a promising script on UGU (http://www.ugu.com). I tried to find the URL to the original script, but was unsuccessful. The original was a csh script written for System V, and it was not very flexible. I modified it to work with BSD and also altered the way services are restarted. I named my modified copy Viagra, because it keeps the services up and at the ready. A few months after that, I bought the book Unix Hints and Hacks by Kirk Waingrow and saw the original version in there. Because I had always meant to make my version available to others, I figured now was as good a time as any (see Listing 1). Let's take a quick look at the script:

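#!/bin/csh

# List each daemon to watch as it appears in a ps listing.
foreach DAEMON ( inetd apache )
        # A non-zero status means the name was not found in the process list.
        ps -cax | fgrep "$DAEMON:t" | cut -c27-80 > /dev/null
        if ( $status > 0 ) then
                echo "Restarting $DAEMON"
                date
                /root/scripts/start/$DAEMON &
        endif
end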

The Script

I will walk through the script to give you an idea of how it works and how to use it.

The loop header:

foreach DAEMON ( inetd apache )

names the services to watch and sets the loop variable DAEMON to each one in turn. Between the parentheses, insert the name of each service you want to monitor, exactly as it appears in a ps listing. The body of the loop begins with the check itself:

ps -cax | fgrep "$DAEMON:t" | cut -c27-80 > /dev/null

Next, we take a process listing of the machine with ps -cax and pipe the output to an fgrep statement that searches for the service's name, fgrep "$DAEMON:t". (The :t modifier strips any leading path from the name.) The result is then piped to a cut statement, which keeps only columns 27 through 80, because column 27 is where the names of the daemons first appear in the ps listing. We are not interested in anything that comes before that. The output from all of this is redirected to /dev/null, because we are not really interested in what it returns, just its exit status:

if ( $status > 0) then
    echo "Restarting $DAEMON"
    date

Once we know whether the service is running, we have to act. The if statement checks the exit status of the fgrep command. If the exit status is 0, fgrep found the daemon, the condition does not match, and the script moves on. If the exit status is greater than 0, the daemon was not found, so we echo a statement that tells which daemon is restarting, followed by the date, so we know when this happened.

Because the script runs from cron, any output it produces is emailed to the owner of the crontab. If Apache has died, for example, root will receive an email containing "Restarting apache" with the date and time, and Apache will be restarted.
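
To try the exit-status test in isolation, here is a minimal sketch of the same check written in sh rather than csh. It is not the original script: the function name check_daemon is hypothetical, grep -F stands in for fgrep, and the process listing is read from standard input so that the logic can be exercised on canned text instead of live ps output.

```shell
#!/bin/sh
# check_daemon NAME: read a process listing on stdin.
# Returns 0 if NAME appears in it; otherwise prints the
# "Restarting" notice and the date, and returns 1.
check_daemon() {
    if grep -F "$1" > /dev/null; then
        return 0                # found: nothing to do
    fi
    echo "Restarting $1"        # under cron, this output is mailed out
    date
    return 1
}

# Canned listing in which inetd is running and apache is not.
for d in inetd apache; do
    printf 'inetd\nsendmail\n' | check_daemon "$d" ||
        echo "(would run /root/scripts/start/$d here)"
done
```

Note that a fixed-string search matches substrings, so a name such as inetd would also match a process called xinetd. Pick names that are specific enough for your machines.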

Restarting the Service

/root/scripts/start/$DAEMON &

Once the previous steps are completed, we must restart the service. To do that, we execute another script. I have Viagra set to execute a script that is stored in /root/scripts/start/ and has the same name as the daemon that needs restarting. I think this leaves us a lot of room in what we do next.

For instance, I normally start Apache on boot using apachectl. To keep this simple, we could place a file called apache in /root/scripts/start/ containing just a couple of lines:

#!/bin/sh
/usr/local/sbin/apachectl start

Then, when that script is executed, apachectl is run with the start argument, just as it is when the system boots.

Let's say you have been having a problem with Apache -- it keeps dying on you and you do not know why. We could use the script that restarts Apache to do a few other things. For instance, run ps aux to get a snapshot of what is happening on the system before you restart Apache, or perhaps w to see who is logged in and what they are doing. You could also use vmstat to see what memory usage looks like at that time, or send an email to your pager to let you know your box is having problems. This could be a great troubleshooting tool for your servers.
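
As a rough sketch of that idea, a restart script along the following lines could save a few snapshots before starting Apache again. The report location, the capture helper, and the mail step are assumptions made for illustration. Only the apachectl path is taken from the short restart script shown earlier.

```shell
#!/bin/sh
# Hypothetical /root/scripts/start/apache that records system state
# before restarting. REPORT and capture are illustrative names.

REPORT=${REPORT:-`mktemp /tmp/apache-restart.XXXXXX`}
APACHECTL=${APACHECTL:-/usr/local/sbin/apachectl}

# capture CMD [ARGS...]: append a header and the command's output
# to $REPORT; note the failure if the command cannot be run.
capture() {
    {
        echo "=== $* ==="
        "$@" 2>&1 || echo "(command failed: $*)"
        echo
    } >> "$REPORT"
}

capture date
capture ps aux      # what was running
capture w           # who was logged in and what they were doing
capture vmstat      # memory and CPU activity

# Mail the snapshot to root if a mail command is available (assumption).
if command -v mail > /dev/null 2>&1; then
    mail -s "apache restarted on `hostname`" root < "$REPORT"
fi

# Finally, start Apache the same way as at boot.
if [ -x "$APACHECTL" ]; then
    "$APACHECTL" start
fi
```

Keeping the snapshots in a file as well as mailing them means the evidence is still on the box if the mail never arrives.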

You could also change the way Viagra runs the scripts. For instance, if there are a lot of scripts you run in /usr/local/etc/rc.d/ on boot, and you want to use those to restart your services, you just change the line /root/scripts/start/$DAEMON & to /root/scripts/start/$DAEMON start &. Then, make symlinks in your /root/scripts/start directory that point to the scripts in /usr/local/etc/rc.d. For instance, if /usr/local/etc/rc.d contains a file called apache.sh, you could symlink that script to /root/scripts/start/apache. Then, if you want to change the way Apache starts up, you only have to make changes in one place. I prefer not to do this because, if a service dies, I like to restart the daemon in a different manner (e.g., to get a process listing mailed to me as well).

From there, the script loops back and repeats the same steps for any other daemons you have defined. Once it has run through them all, the script ends.
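
The symlink setup could be sketched as follows. The helper name link_rc and the two overridable variables are hypothetical. The directories are the ones used in this article.

```shell
#!/bin/sh
# Sketch: point Viagra's start directory at existing rc.d scripts.
RC_DIR=${RC_DIR:-/usr/local/etc/rc.d}
START_DIR=${START_DIR:-/root/scripts/start}

# link_rc SCRIPT NAME: create $START_DIR/NAME as a symlink to $RC_DIR/SCRIPT.
link_rc() {
    ln -s "$RC_DIR/$1" "$START_DIR/$2"
}

# Example (run as root on the server):
#   link_rc apache.sh apache
```

With the link in place, Viagra's restart line has to pass the start argument, because rc.d scripts expect to be told what to do.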

When to Run?

After you have the script running the way you want, and watching the daemons that you want it to watch, you must automate the running of the script. In root's crontab, I have */10 * * * * /path/to/viagra. This sets Viagra to run every ten minutes. Depending on your servers, you may want it to run more or less often. Simply change the /10 to /5 if you want to run the script every five minutes, and so on.
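
For reference, the crontab entries might look like the fragment below. The MAILTO line is an assumption: it is honored by the Vixie-style cron shipped with the BSDs, and it only makes explicit where the "Restarting" messages go. /path/to/viagra remains a placeholder for wherever you keep the script.

```
# Send any output from these jobs to root.
MAILTO=root

# Check every ten minutes:
*/10 * * * * /path/to/viagra

# Or check every five minutes instead (use only one of the two):
# */5 * * * * /path/to/viagra
```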

There is a lot you can do with this script. It may not be perfect, but it has worked great for me. Feel free to look at it, poke it, prod it, and change it. Use it if you like it, expand on it if you like, or ignore it forever. Just make sure you have some way of keeping your services up.

Jim O'Gorman lives in Lincoln, Nebraska with his wife, son, and soon-to-be second child. He works for iPlanet E-Commerce Solutions (a Sun-Netscape Alliance).