Scheduled Rebooting
Larry Reznick
UNIX runs 24 hours a day every day of the year, and
it does a pretty
good job of that. But I've found that, after a long
time of continuous
running, it can sometimes get confused. Various UNIX
systems lose track
of certain processes, some internal tables appear to
get out of whack,
and zombie processes seem to take over the system. No
manufacturer's
UNIX seems better than any other in this respect. After
seeing this
happen on SunOS, HP-UX, SCO, and ESIX, I can't blame
it on any
manufacturer's variances. UNIX just needs rebooting
every so often.
Rebooting every UNIX workstation and server about once
a month is
sufficient. Leave any workstation or server alone for
longer and the
system may work just fine until you need it the most.
Can you afford to
have your system act bizarrely in the midst of important
operations? I
couldn't. Systems might run for months without serious
troubles, but
then strange anomalies would start showing up during
an application that
nobody noticed on other stations running the same application.
Occasionally, unusual error messages popped up. People
lost time waiting
for the reboot that corrected everything. Rebooting
regularly, before
the system demands fixing and while it is comparatively
idle, works more
smoothly.
You can set a cron job that wakes up once a month --
more or less often
as needed -- to see whether the system is running a
critical application.
If the system is not running anything very important,
advise everybody
that the system is going down and reboot. The simplest
way to do this
appears in the rebootsys script (Listing 1). rebootsys
checks the
process list for any critical programs. If any are currently
running, it
won't reboot. With all critical programs absent, it
reboots.
rebootsys Script
Listing 1 uses the PROGS variable to hold a list of
critical program
names, separated by vertical bars. egrep(1) uses the
bar (|) to separate
multiple regular expressions for simultaneous searching.
If egrep finds
any of those strings, it shows the line containing the
match. When egrep
succeeds, it sets the $? exit status variable to a nonzero.
Finding a 0
exit code means at least one critical program matched.
Strictly, egrep
returns one nonzero value indicating an operation error
even though it
may have found a match, but I didn't think this would
happen frequently
enough to worry about.
PROGS only holds so many program names. egrep handles
a limited number
of parallel searches. If you have more the half-dozen
critical programs
demonstrated in Listing 1, put their program names in
a file, each
listed on a separate line. Then change the PROGS line
to:
PROGS="-f /usr/local/data/noreboot.progs"
Be sure to set read-only permissions for noreboot.progs
to prevent
editing. This PROGS= line runs without changing the
script's egrep
command line. egrep applies every string in the match
file against the
process list that ps(1) pipes into it. However, egrep
still takes only a
limited number of expressions in this file. With many
more critical
processes to check for, you may need to set up multiple
files, inserting
a loop in the script to deliver each file to egrep's
search through the
process list.
Using egrep's -f option, $PROGS looks a little strange
in the warning
mail message. This message tells the users aliased to
siteadmin that
rebootsys couldn't run. When the critical program names
appear more
directly in the PROGS variable, those names are included
in this mail
message. But with the -f method, you'd see the filename
containing the
matches preceded by that -f. Seeing the filename might
not be a bad
idea, acting as a reminder of where to find the critical
program names,
but seeing the program names spelled out might be more
helpful. If so,
change the line reading
"$PROGS"
to use
'echo'cat\\'echo $PROGS | cut -d" " -f2\\\'\''
This messy-looking command line embeds an echo command
inside the mail
message's here script. echo starts cat as another shell
substitution to
output the contents of the file named in the PROGS variable.
However,
PROGS has egrep's -f argument, which cat doesn't like.
So, I embed yet
another shell substitution to isolate the filename in
the $PROGS second
argument. I send $PROGS to cut(1), which splits the
string into fields
separated by a space and selects the filename in the
second field. Extra
backslashes escape either the back-apostrophe or the
backslash itself. I
submit the extracted filename to cat, which submits
the file's contents
to echo, which outputs each critical program's name
on one long line.
BOOTPROG holds the command line for rebooting the system.
Listing 1
shows /etc/reboot, a special program that SCO provides
for a quick
reboot. You may also find /usr/ucb/reboot(1M) on some
systems. These
programs are equivalent to using init 6 (see init(1M)).
Don't use reboot
or init 6 on anything but a test system, because neither
gives any
warning to users. Both just reboot -- now! Not very
nice, but great for
testing. When you're ready to install rebootsys change
BOOTPROG to use
shutdown(1M):
BOOTPROG=\
"/usr/sbin/shutdown -y -g120 -i6"
This allows users working on the system -- if anyone
is working at the
time you set cron to run this script -- two minutes'
grace before
rebooting. Anyone depending on the system over the network
will have to
wait for the system to return, but they'll get a warning
message, too.
Change the -g option's number of seconds if you prefer
a different grace
period.
Rebooting Using cron
Once you've set your system's critical programs, and
you've tested to be
sure that rebootsys reboots only when your programs
aren't running,
install the following line in your root crontab.
1 5 1 * * /usr/local/bin/rebootsys
This simple version runs rebootsys at 5:01 a.m. on the
first of every
month. It may be too simple. If any critical program
is running,
rebootsys won't try to reboot the system again until
the first of next
month. Should your system be quieter at a different
time, set that time
instead.
Assume you want to reboot no more frequently than once
a month but each
time rebootsys looks at the process status those pesky
critical programs
are running. You may find your system not rebooting
for months at a
time. Adjusting the crontab time may not help much if
you have shifts
working round the clock on the system. Specifying several
hours in the
crontab won't work much better. For instance, if you
try
1 1,3,5,7,9,11,13,15,17,19,21,23 1 * *\
/usr/local/bin/rebootsys
one of those two-hour periods may reboot. The problem
comes when you get
another reboot two hours later. Rather than turn the
first of every
month into hell day, consider using an empty file. So
long as the file
exists, rebootsys hasn't run yet. Try rebootsys again.
Without that
file, don't let rebootsys run. Try these crontab entries:
0 0 1 * * touch /tmp/try.reboot
1 * 1 * * test -f /tmp/try.reboot &&\
/usr/local/bin/rebootsys
At exactly midnight on the first of every month, cron
creates try.reboot
in the /tmp directory. One minute later, and every hour
thereafter, cron
checks the existence of /tmp/try.reboot. If it still
exists, cron runs
rebootsys. Some critical applications run so frequently
that you may
find checking only once an hour insufficient to catch
when the
application stops running. Change the minute setting
to every half-hour,
every quarter-hour, or as often as you think appropriate.
Every reboot should clean out files in the /tmp directory,
so try.reboot
will go away, thus preventing rebootsys from running
again for the rest
of that day. If you'd rather the cron job clean up the
file after
rebootsys runs successfully, add
&& rm -f /tmp/try.reboot
after rebootsys's invocation on the crontab line. If
rebootsys fails,
its exit 1 prevents the rm. Only when $BOOTPROG runs
does rebootsys quit
successfully, triggering the rm.
About the Author
Larry Reznick has been programming professionally since
1978. He is
currently working on systems programming in UNIX, MS-DOS,
and OS/2. He
teaches C, C++, and UNIX language courses at the University
of
California, Davis Extension.
|