Cover V04, I06
Article
Listing 1

nov95.tar


Scheduled Rebooting

Larry Reznick

UNIX runs 24 hours a day every day of the year, and it does a pretty good job of that. But I've found that, after a long time of continuous running, it can sometimes get confused. Various UNIX systems lose track of certain processes, some internal tables appear to get out of whack, and zombie processes seem to take over the system. No manufacturer's UNIX seems better than any other in this respect. After seeing this happen on SunOS, HP-UX, SCO, and ESIX, I can't blame it on any manufacturer's variances. UNIX just needs rebooting every so often.

Rebooting every UNIX workstation and server about once a month is sufficient. Leave any workstation or server alone for longer and the system may work just fine until you need it the most. Can you afford to have your system act bizarrely in the midst of important operations? I couldn't. Systems might run for months without serious troubles, but then strange anomalies would start showing up during an application that nobody noticed on other stations running the same application. Occasionally, unusual error messages popped up. People lost time waiting for the reboot that corrected everything. Rebooting regularly, before the system demands fixing and while it is comparatively idle, works more smoothly.

You can set a cron job that wakes up once a month -- more or less often as needed -- to see whether the system is running a critical application. If the system is not running anything very important, advise everybody that the system is going down and reboot. The simplest way to do this appears in the rebootsys script (Listing 1). rebootsys checks the process list for any critical programs. If any are currently running, it won't reboot. With all critical programs absent, it reboots.

rebootsys Script

Listing 1 uses the PROGS variable to hold a list of critical program names, separated by vertical bars. egrep(1) uses the bar (|) to separate multiple regular expressions for simultaneous searching. If egrep finds any of those strings, it shows the line containing the match. When egrep succeeds, it sets the $? exit status variable to a nonzero. Finding a 0 exit code means at least one critical program matched. Strictly, egrep returns one nonzero value indicating an operation error even though it may have found a match, but I didn't think this would happen frequently enough to worry about.

PROGS only holds so many program names. egrep handles a limited number of parallel searches. If you have more the half-dozen critical programs demonstrated in Listing 1, put their program names in a file, each listed on a separate line. Then change the PROGS line to:

PROGS="-f /usr/local/data/noreboot.progs"

Be sure to set read-only permissions for noreboot.progs to prevent editing. This PROGS= line runs without changing the script's egrep command line. egrep applies every string in the match file against the process list that ps(1) pipes into it. However, egrep still takes only a limited number of expressions in this file. With many more critical processes to check for, you may need to set up multiple files, inserting a loop in the script to deliver each file to egrep's search through the process list.

Using egrep's -f option, $PROGS looks a little strange in the warning mail message. This message tells the users aliased to siteadmin that rebootsys couldn't run. When the critical program names appear more directly in the PROGS variable, those names are included in this mail message. But with the -f method, you'd see the filename containing the matches preceded by that -f. Seeing the filename might not be a bad idea, acting as a reminder of where to find the critical program names, but seeing the program names spelled out might be more helpful. If so, change the line reading

"$PROGS"

to use

'echo'cat\\'echo $PROGS | cut -d" " -f2\\\'\''

This messy-looking command line embeds an echo command inside the mail message's here script. echo starts cat as another shell substitution to output the contents of the file named in the PROGS variable. However, PROGS has egrep's -f argument, which cat doesn't like. So, I embed yet another shell substitution to isolate the filename in the $PROGS second argument. I send $PROGS to cut(1), which splits the string into fields separated by a space and selects the filename in the second field. Extra backslashes escape either the back-apostrophe or the backslash itself. I submit the extracted filename to cat, which submits the file's contents to echo, which outputs each critical program's name on one long line.

BOOTPROG holds the command line for rebooting the system. Listing 1 shows /etc/reboot, a special program that SCO provides for a quick reboot. You may also find /usr/ucb/reboot(1M) on some systems. These programs are equivalent to using init 6 (see init(1M)). Don't use reboot or init 6 on anything but a test system, because neither gives any warning to users. Both just reboot -- now! Not very nice, but great for testing. When you're ready to install rebootsys change BOOTPROG to use shutdown(1M):

BOOTPROG=\
"/usr/sbin/shutdown -y -g120 -i6"

This allows users working on the system -- if anyone is working at the time you set cron to run this script -- two minutes' grace before rebooting. Anyone depending on the system over the network will have to wait for the system to return, but they'll get a warning message, too. Change the -g option's number of seconds if you prefer a different grace period.

Rebooting Using cron

Once you've set your system's critical programs, and you've tested to be sure that rebootsys reboots only when your programs aren't running, install the following line in your root crontab.

1 5 1 * * /usr/local/bin/rebootsys

This simple version runs rebootsys at 5:01 a.m. on the first of every month. It may be too simple. If any critical program is running, rebootsys won't try to reboot the system again until the first of next month. Should your system be quieter at a different time, set that time instead.

Assume you want to reboot no more frequently than once a month but each time rebootsys looks at the process status those pesky critical programs are running. You may find your system not rebooting for months at a time. Adjusting the crontab time may not help much if you have shifts working round the clock on the system. Specifying several hours in the crontab won't work much better. For instance, if you try

1 1,3,5,7,9,11,13,15,17,19,21,23 1 * *\
/usr/local/bin/rebootsys

one of those two-hour periods may reboot. The problem comes when you get another reboot two hours later. Rather than turn the first of every month into hell day, consider using an empty file. So long as the file exists, rebootsys hasn't run yet. Try rebootsys again. Without that file, don't let rebootsys run. Try these crontab entries:

0 0 1 * * touch /tmp/try.reboot
1 * 1 * * test -f /tmp/try.reboot &&\
/usr/local/bin/rebootsys

At exactly midnight on the first of every month, cron creates try.reboot in the /tmp directory. One minute later, and every hour thereafter, cron checks the existence of /tmp/try.reboot. If it still exists, cron runs rebootsys. Some critical applications run so frequently that you may find checking only once an hour insufficient to catch when the application stops running. Change the minute setting to every half-hour, every quarter-hour, or as often as you think appropriate.

Every reboot should clean out files in the /tmp directory, so try.reboot will go away, thus preventing rebootsys from running again for the rest of that day. If you'd rather the cron job clean up the file after rebootsys runs successfully, add

&& rm -f /tmp/try.reboot

after rebootsys's invocation on the crontab line. If rebootsys fails, its exit 1 prevents the rm. Only when $BOOTPROG runs does rebootsys quit successfully, triggering the rm.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX, MS-DOS, and OS/2. He teaches C, C++, and UNIX language courses at the University of California, Davis Extension.