The Sentinel
William Genosa
System administrators are responsible for keeping their machines
available to users. When servers crash or backups fail, the users are
inconvenienced, and the system administrator usually hears about it.
If you administer more than one machine and your machines are
networked together with TCP/IP, you may want to set up a Sentinel to
watch over your network and inform you of a system crash or backup
failure. You will also need a modem and a beeper.
My network consists of eight 3B2 hosts running AT&T System V Rel 3.2.2,
networked together with TCP/IP over Ethernet. The Sentinel is cpu4,
and the two servers I am concerned with are cpu5 and cpu8.
My servers reboot each weeknight and run unattended level zero backups
to back up the entire system. A failure on a server would result in
downtime the next morning that would cost my company thousands of
dollars. If a server crashes at night, the Sentinel alerts me so that
I can take corrective action before my users arrive in the morning.
This makes my boss very happy.
The program makes use of the ruptime command, which displays the
status of hosts on a local area network. Output of the command uses
the following format:
cpu1 up 9:40, 5 users, load 0.04, 0.08, 0.05
cpu2 up 2+04:25, 8 users, load 0.08, 1.23, 0.40
cpu3 up 8:35, 5 users, load 1.00, 1.01, 1.06
cpu4 up 8:43, 6 users, load 1.04, 1.00, 0.95
cpu5 up 8:47, 75 users, load 4.60, 3.90, 4.65
cpu6 up 5+07:50, 0 users, load 0.00, 0.00, 0.00
cpu7 down 0:20
cpu8 up 8:45, 65 users, load 4.70, 4.20, 3.85
The first field is the name of the host. The second field is the
status of the host, either up or down. The third field shows how long
the host has been in that state, in days, hours, and minutes (2+04:25
means two days, four hours, and 25 minutes). The fourth and fifth
fields tell how many users are currently logged in. The last four
fields display the system load, that is, the average number of jobs in
the run queue over the last one, five, and fifteen minutes. A host
that is down shows only the first three fields.
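As a rough illustration of how a script might pull the status field
out of this output, the sketch below defines a hypothetical helper,
host_status, that reads ruptime-style lines on standard input and
prints the second field for the named host. The helper name and the
"unknown" fallback are assumptions made for this example; they are not
taken from Listing 2.

```shell
#!/bin/sh
# host_status: hypothetical helper, not part of the Sentinel listing.
# Reads ruptime-style output on standard input and prints the status
# field ("up" or "down") for the host named in $1.
host_status() {
    result=unknown              # printed if the host is not listed
    while read name state rest
    do
        if [ "$name" = "$1" ]
        then
            result=$state
        fi
    done
    echo "$result"
}

# Demonstration with two lines copied from the sample output above.
host_status cpu7 <<'EOF'
cpu5 up 8:47, 75 users, load 4.60, 3.90, 4.65
cpu7 down 0:20
EOF
# prints: down
```

On the Sentinel, the helper would be fed by the real command, for
example: ruptime | host_status cpu5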
The program also makes use of uucp. Because I use cpio to create
backups, my backup scripts have been modified to test the exit status
of cpio. If the backup is successful on the server, I send a file to
cpu4, the Sentinel, using the following commands:
cpio -oBv -O/dev/RSA/qtape2 < /tmp/backup.list
if [ "$?" -ne 0 ]
then
    echo "The backup failed on `hostname` at `date`." | mail root
else
    echo "The backup was successful on `hostname` at `date`." | mail root
    # Create a zero-length flag file and send it to the Sentinel.
    >/tmp/backup.ok
    uuto /tmp/backup.ok cpu4!bill
fi
The routine tests for a successful backup by checking $?, the exit
status of cpio. If the exit status is zero, I create a zero-length
file called backup.ok and send it to cpu4 using the uuto utility. A
file sent from the server cpu5 is placed on cpu4 in the directory
/usr/spool/uucppublic/receive/bill/cpu5. If the Sentinel doesn't find
the file /usr/spool/uucppublic/receive/bill/cpu5/backup.ok, it will
use the cu utility to page me on my beeper.
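The check described above might be sketched as follows. This is a
minimal illustration, not the code from Listing 2: check_backup, SPOOL,
and PAGE_CMD are hypothetical names, and removing the flag file after
a successful check (so that a stale file from last night cannot mask
tonight's failure) is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical settings; both can be overridden from the environment.
SPOOL=${SPOOL:-/usr/spool/uucppublic/receive/bill}
PAGE_CMD=${PAGE_CMD:-cu}

# check_backup: hypothetical helper, not part of the Sentinel listing.
#   $1 = server name, for example cpu5
#   $2 = Systems entry to call if the flag file is missing
# Returns 0 if the backup flag arrived, 1 if a page was attempted.
check_backup() {
    if [ -f "$SPOOL/$1/backup.ok" ]
    then
        # Assumption: clear the flag so the next check starts fresh.
        rm -f "$SPOOL/$1/backup.ok"
        return 0
    fi
    # No flag file: call the entry that pages the administrator.
    $PAGE_CMD "$2" || :
    return 1
}
```

With the Systems entries used in this article, the calls would look
like: check_backup cpu5 badback5 and check_backup cpu8 badback8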
The sample entries here have been appended to the uucp configuration
file, /usr/lib/uucp/Systems. Similar entries would be required on your
Sentinel host. The first field of the Systems file usually represents
a hostname. I have created bogus hostnames so that the Sentinel host
can dial a beeper number and transmit a code that briefly describes
the problem.
The entries that follow are called with cu; they notify the system
administrator by sending a code to his or her beeper.
### The server called cpu5 has crashed.
cpu5down Any ACU 9600 93631448,,,,,5551111
### The server called cpu8 has crashed.
cpu8down Any ACU 9600 93631448,,,,,8881111
### The backup has failed on cpu5.
badback5 Any ACU 9600 93631448,,,,,5552222
### The backup has failed on cpu8.
badback8 Any ACU 9600 93631448,,,,,8882222
The next entry is a sample entry for those of you who have a SkyPager
beeper. Notice the use of the pound key and the trailing comma.
Any ACU 9600 918007597243,,,,,6182093#,,,,5551111#,
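Calling one of these entries amounts to running cu with the bogus
hostname as its argument. The sketch below wraps that call in a
hypothetical function, page_admin. It assumes that cu may not return
promptly, because a paging service never answers with a modem carrier,
so the call runs in the background and is killed after a waiting
period. PAGE_CMD and PAGE_WAIT are names invented for this example,
and the 60-second default is a guess that would need tuning for your
modem and paging service.

```shell
#!/bin/sh
# page_admin: hypothetical helper, not part of the Sentinel listing.
#   $1 = Systems entry to call, for example cpu5down
page_admin() {
    # Dial in the background so a hung call cannot stall the caller.
    ${PAGE_CMD:-cu} "$1" </dev/null >/dev/null 2>&1 &
    pid=$!
    # Assumption: this is long enough to dial and send the code.
    sleep "${PAGE_WAIT:-60}"
    # Hang up if cu is still running; ignore the error if it has exited.
    kill "$pid" 2>/dev/null || :
    wait "$pid" 2>/dev/null || :
    return 0
}
```

For example, page_admin cpu5down would dial the first entry shown
above and send the code 5551111.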
The Sentinel program should run constantly to keep watch over the
critical hosts. For this reason I start the program from an rc script
which runs at boot time. (See the sidebar for a brief explanation of
how rc scripts work.)
The rc script I use (see Listing 1) is called sentinel and is located
in the directory /etc/init.d. It is then linked to
/etc/rc2.d/S99sentinel. No action will be needed upon shutdown. If
there were a need to take action, the same file would also be linked
to /etc/rc0.d/K99sentinel.
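For readers unfamiliar with the convention, the general shape of such
a script is sketched below. This is not Listing 1; it is a minimal
skeleton that assumes the monitoring program lives at a hypothetical
path held in SENTINEL, and it wraps the start/stop logic in a function
named sentinel_rc purely for illustration.

```shell
#!/bin/sh
# Skeleton of an rc script in the style of /etc/init.d/sentinel.
# SENTINEL is an assumed location for the monitoring program.
SENTINEL=${SENTINEL:-/usr/local/bin/sentinel}

sentinel_rc() {
    case "$1" in
    start)
        # Run the monitor in the background so booting can continue.
        if [ -x "$SENTINEL" ]
        then
            "$SENTINEL" &
        fi
        ;;
    stop)
        # Nothing to do at shutdown.
        :
        ;;
    *)
        echo "Usage: $0 {start|stop}" >&2
        return 1
        ;;
    esac
}

# Act only when called with an argument, as the rc mechanism does.
if [ $# -gt 0 ]
then
    sentinel_rc "$@"
fi

# The link described in the text would be created once, by hand:
#   ln /etc/init.d/sentinel /etc/rc2.d/S99sentinel
```

The S99 prefix makes the script one of the last to run on entry to run
level 2, by which time networking should already be up.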
The Sentinel program (Listing 2) can be modified to monitor other
critical events on your network. This is an example of proactive
system administration. Systems will crash and backups will fail, but
you can still attempt to minimize the effect this will have on your
user community.
About the Author
William Genosa is the Chief System Administrator for a leading systems
integrator. He may be reached at 186 Bryant Avenue, Floral Park, NY
11001.