Cover V02, I03

The Sentinel

William Genosa

System administrators are responsible for keeping their machines available to users. When a server crashes or a backup fails, users are inconvenienced, and the system administrator usually hears about it. If you administer more than one machine and your machines are networked with TCP/IP, you may want to set up a Sentinel to watch over the network and notify you of a system crash or backup failure. You will also need a modem and a beeper.

My network consists of eight 3B2 hosts running AT&T System V Release 3.2.2, networked with TCP/IP over Ethernet. The Sentinel is cpu4, and the two servers I am concerned with are cpu5 and cpu8. My servers reboot each weeknight and run unattended level-zero backups of the entire system. A failure on a server would mean downtime the next morning that would cost my company thousands of dollars. If a server crashes at night, the Sentinel alerts me so that I can take corrective action before my users arrive in the morning. This makes my boss very happy.

The program makes use of the ruptime command, which displays the status of hosts on a local area network. Output of the command uses the following format:

cpu1    up     9:40,    5 users,   load 0.04, 0.08, 0.05
cpu2    up  2+04:25,    8 users,   load 0.08, 1.23, 0.40
cpu3    up     8:35,    5 users,   load 1.00, 1.01, 1.06
cpu4    up     8:43,    6 users,   load 1.04, 1.00, 0.95
cpu5    up     8:47,   75 users,   load 4.60, 3.90, 4.65
cpu6    up  5+07:50,    0 users,   load 0.00, 0.00, 0.00
cpu7  down     0:20
cpu8    up     8:45,   65 users,   load 4.70, 4.20, 3.85

The first field is the name of the host. The second field is the status of the host, either up or down. The third field shows how long the host has been in that state, in days, hours, and minutes; for example, 2+04:25 means 2 days, 4 hours, and 25 minutes. The fourth and fifth fields tell how many users are currently logged in. The last four fields show the system load: the word load followed by the average number of jobs in the run queue over the last one, five, and fifteen minutes.
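
As a rough illustration of how the status field can be tested, the sketch below pulls the second field for a named host out of ruptime-style output. The function name host_status is hypothetical and is not taken from Listing 2. The sketch also assumes an awk that accepts the -v option, which some older systems provide only as nawk.

```shell
# host_status HOST
# Reads ruptime-style lines on standard input and prints the second
# field (up or down) for HOST, or "unknown" if HOST is not listed.
# Hypothetical helper, not part of the author's listings.
# Typical use:  ruptime | host_status cpu5
host_status() {
    awk -v host="$1" '
        $1 == host { print $2; found = 1; exit }
        END        { if (!found) print "unknown" }
    '
}
```

Treating a missing host as unknown rather than down is a design choice in this sketch; a monitor could reasonably decide to page in either case.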

The program also makes use of uucp. Because I create backups with cpio, my backup scripts have been modified to test the exit status of cpio. If the backup is successful on the server, I send a file to cpu4, the Sentinel, using the following commands:

cpio -oBv -O/dev/RSA/qtape2 < /tmp/backup.list

if [ "$?" -ne 0 ]
then
    echo The backup failed on `hostname` at `date`. | mail root
else
    echo The backup was successful on `hostname` at `date`. | mail root
    # Create a zero-length flag file and send it to the Sentinel.
    >/tmp/backup.ok
    uuto /tmp/backup.ok cpu4!bill
fi

The if conditional checks $?, the exit status of cpio. If the exit status is zero, I create a zero-length file called backup.ok and send it to cpu4 with the uuto utility. A file sent from the server cpu5 is placed on cpu4 in the directory /usr/spool/uucppublic/receive/bill/cpu5. If the Sentinel doesn't find the file /usr/spool/uucppublic/receive/bill/cpu5/backup.ok, it uses the cu utility to page me on my beeper.
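
The check on the Sentinel side might look something like the sketch below. The function name backup_ok and the SPOOL variable are hypothetical. Removing the flag file after it is seen is an assumption of this sketch, made so that yesterday's file cannot hide tonight's failure. It is not necessarily how Listing 2 handles it.

```shell
# Directory where uuto deposits files for user bill on the Sentinel.
# SPOOL is a hypothetical variable; the default is the path given above.
SPOOL=${SPOOL:-/usr/spool/uucppublic/receive/bill}

# backup_ok HOST
# Succeeds if HOST has delivered backup.ok. The flag is removed once
# seen (an assumption of this sketch) so that a stale file from an
# earlier night cannot mask a later failure.
backup_ok() {
    flag="$SPOOL/$1/backup.ok"
    if [ -f "$flag" ]
    then
        rm -f "$flag"
        return 0
    fi
    return 1
}
```

A caller would then write something like: backup_ok cpu5 || cu badback5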

The sample entries here have been appended to the uucp configuration file, /usr/lib/uucp/Systems, on my Sentinel host; you would need similar entries on yours. The first field of the Systems file usually holds a hostname. I have created bogus hostnames so that the Sentinel host can dial a beeper number and transmit a code that briefly describes the problem.

The entries that follow are called with cu; each one notifies the system administrator by sending a code to his or her beeper.

### The server called cpu5 has crashed.
cpu5down Any ACU 9600 93631448,,,,,5551111


### The server called cpu8 has crashed.
cpu8down Any ACU 9600 93631448,,,,,8881111

### The backup has failed on cpu5.
badback5 Any ACU 9600 93631448,,,,,5552222

### The backup has failed on cpu8.
badback8 Any ACU 9600 93631448,,,,,8882222

The next entry is a sample for those of you who have a SkyPager beeper. Notice the use of the pound key and the trailing comma.

Any ACU 9600 918007597243,,,,,6182093#,,,,5551111#,
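
To show how the bogus hostnames tie an event to a cu call, here is a minimal sketch. The function names page_name and page and the PAGE_CMD variable are hypothetical. The pseudo-hostnames are the ones defined in the Systems entries for cpu5 and cpu8, and the sketch assumes the host is passed as its number (5 or 8).

```shell
# PAGE_CMD is a hypothetical override so the function can be dry-run
# without a modem (for example PAGE_CMD=echo). The default is cu.
PAGE_CMD=${PAGE_CMD:-cu}

# page_name EVENT NUMBER
# Prints the pseudo-hostname from the Systems file for an event
# (down or backup) on host cpuNUMBER. Fails on an unknown event.
page_name() {
    case "$1" in
    down)   echo "cpu${2}down" ;;
    backup) echo "badback$2" ;;
    *)      return 1 ;;
    esac
}

# page EVENT NUMBER
# Calls cu with the pseudo-hostname, which dials the beeper number
# and sends the code that follows it in the Systems entry.
page() {
    name=`page_name "$1" "$2"` || return 1
    $PAGE_CMD "$name"
}
```

With this in place, page down 5 would run cu cpu5down, and page backup 8 would run cu badback8.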

The Sentinel program should run constantly to keep watch over the critical hosts. For this reason I start the program from an rc script which runs at boot time. (See the sidebar for a brief explanation of how rc scripts work.)

The rc script I use (see Listing 1) is called sentinel and is located in the directory /etc/init.d. It is then linked to /etc/rc2.d/S99sentinel. No action will be needed upon shutdown. If there were a need to take action, the same file would also be linked to /etc/rc0.d/K99sentinel.
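
The linking step can be sketched as follows. This assumes a hard link made with ln, which the text does not state explicitly. The function name link_sentinel and the ROOT prefix are hypothetical and exist only so the step can be rehearsed in a scratch directory.

```shell
# ROOT is a hypothetical prefix, empty on a live system, that lets the
# step be tried in a scratch tree before touching /etc.
ROOT=${ROOT:-}

# link_sentinel
# Hard-links the init.d script into the run level 2 directory under
# the name S99sentinel. A hard link is an assumption of this sketch.
link_sentinel() {
    ln "$ROOT/etc/init.d/sentinel" "$ROOT/etc/rc2.d/S99sentinel"
}
```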

The Sentinel program (Listing 2) can be modified to monitor other critical events on your network. This is an example of proactive system administration. Systems will crash and backups will fail, but you can still attempt to minimize the effect this will have on your user community.

About the Author

William Genosa is the Chief System Administrator for a leading systems integrator. He may be reached at 186 Bryant Avenue, Floral Park, NY 11001.