Automating Basic System Activity Monitoring
Bruce Alan Wynn
A distressingly large percentage of UNIX systems administrators spend an equally large percentage of their time putting out fires, thereby reducing their overall effectiveness. The solution for that problem lies in consciously shifting from reactive system management to a proactive approach. One of the most effective ways to proactively administer a network of UNIX hosts is to perform the necessary monitoring of status and activity through automated software.
Various methods for monitoring collections of UNIX hosts exist. Commercial products, freeware, and home-grown solutions are found on virtually every network. Most of these products focus on collecting information and generating status reports. Unfortunately, this doesn't help system administrators detect problems before their users do.
Earlier this year, I had the opportunity to develop some simple tools that monitored common systems problems. In this article and the accompanying scripts, I will show how to convert simple monitoring tools into something even more useful: rudimentary event management tools.
An event management system responds in a predefined way to any of a set of predefined events. Advanced features of event management systems support the automatic escalation of events based on some criteria.
When I investigated some common problems, I found that they fell into three primary areas:
- Filesystems filling up
- Hosts dropping off the network
- Critical processes failing
Using this list, I set out to build the tools that would make my work easier.
Before designing the monitoring tools themselves, I decided how problems should be reported. Most of the popular commercial monitoring packages support varying levels of problem reporting, so I chose to do the same. Through discussions with the user community and management, we agreed upon four levels of alarms. They are:
- Warning: log the event
- Severe: send mail to the administration team
- Critical: page the administrative team
- Fatal: page the administrative team and their manager
Listing 1, mon_error, implements these alarm methods. It takes two arguments: the alarm level and a text string.
An alarm level of "WARN" logs the text string using the "logger" system utility.
An alarm level of "SEVERE" sends mail to the administrative team. The recipients are taken from the variable ADMINS, a whitespace-delimited list of account names that is easily modified.
An alarm level of "CRITICAL" causes the administrative team to be paged, sending them the text string. I carry an alphanumeric pager that supports email messaging, but this may need to be modified if your pager does not have this capability. The list of email addresses used to send alpha messages to the administrative team is stored in the variable ADMINS_PGR.
And finally, an alarm of "FATAL" pages both the administrative team and their manager. Our manager also carries an alphanumeric pager and felt that he needed to be aware when key filesystems overflowed.
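The four alarm levels map naturally onto a single case statement. The following is a minimal sketch of that idea, not Listing 1 itself. The function name raise_alarm, the MANAGER_PGR, LOGGER, and MAILER variables, and all addresses are assumptions made for illustration; only ADMINS and ADMINS_PGR are named in the article.

```shell
#!/bin/sh
# Hypothetical alarm dispatcher in the spirit of mon_error.
# Addresses and the LOGGER/MAILER/MANAGER_PGR variables are assumptions.

ADMINS="${ADMINS:-root}"                               # whitespace-delimited accounts
ADMINS_PGR="${ADMINS_PGR:-admin-pager@example.com}"    # pager email addresses
MANAGER_PGR="${MANAGER_PGR:-manager-pager@example.com}"
LOGGER="${LOGGER:-logger}"                             # kept in variables so they
MAILER="${MAILER:-mail}"                               # can be changed per platform

# raise_alarm LEVEL TEXT...
raise_alarm() {
    level=$1
    shift
    text=$*
    case "$level" in
        WARN)       # log the event only
            $LOGGER -p user.warning "$text" ;;
        SEVERE)     # mail the administration team
            echo "$text" | $MAILER $ADMINS ;;
        CRITICAL)   # page the administration team
            echo "$text" | $MAILER $ADMINS_PGR ;;
        FATAL)      # page the team and their manager
            echo "$text" | $MAILER $ADMINS_PGR $MANAGER_PGR ;;
        *)
            echo "raise_alarm: unknown level: $level" >&2
            return 1 ;;
    esac
}
```

A call such as raise_alarm SEVERE "/var is 91% full" would mail the text to every account in ADMINS. The list variables are deliberately left unquoted so that the shell splits them into separate recipients.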
Monitoring events began as a simple process. The initial versions of the scripts listed here as mon_dsk, mon_proc, and mon_host (Listings 2, 3, and 4) were run periodically via cron. After the first day of running these processes, we realized that we needed to keep track of the events we had already detected. Checking a mission-critical filesystem every 5 minutes had caused more than 600 mail messages to be sent to each of the administrators over the weekend, which filled yet another filesystem.
Enter the status file, mon_stat (Listing 5). This file is used to record events as they are detected. If the monitoring script for filesystems finds that it has already reported that a filesystem is at the "SEVERE" level, the script will not report it again. It will, however, report a change in severity.
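To illustrate how a status file suppresses repeated alarms while still reporting a change in severity, here is a minimal sketch. The one-entry-per-line "key level" file format, the default file location, and the function names are assumptions for illustration and may not match Listing 5. It relies on a POSIX awk (on older Solaris releases that would typically be nawk).

```shell
#!/bin/sh
# Hypothetical status-file helpers; the "key level" format is an assumption.
STATFILE="${STATFILE:-/tmp/mon_stat}"

# Print the level last recorded for KEY (prints nothing if none).
last_level() {
    [ -f "$STATFILE" ] || return 0
    awk -v key="$1" '$1 == key { level = $2 }
                     END { if (level != "") print level }' "$STATFILE"
}

# Replace any existing entry for KEY with LEVEL.
record_level() {
    touch "$STATFILE"
    awk -v key="$1" '$1 != key' "$STATFILE" > "$STATFILE.tmp"
    echo "$1 $2" >> "$STATFILE.tmp"
    mv "$STATFILE.tmp" "$STATFILE"
}

# Succeed (and record the new level) only when LEVEL differs from
# what was last reported for KEY; fail when it is a repeat.
report_if_changed() {
    if [ "$(last_level "$1")" = "$2" ]; then
        return 1
    fi
    record_level "$1" "$2"
}
```

A monitoring script could then guard each alarm with something like: report_if_changed /var SEVERE && send the alarm. A repeat of the same level is silently skipped, while a move from SEVERE to CRITICAL is reported once.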
Monitoring Filesystem Utilization
We decided to support all four levels of alarms for filesystem utilization. This way, slowly growing filesystems would generate lower level alarms that could be addressed at our convenience; faster growing filesystems would cause us - and eventually our manager - to be notified via our pagers.
The filesystem monitoring script, mon_dsk, reads a list of filesystems and severity levels from its input data file, mon_dsk.dat (Listing 6). For each filesystem listed, the script checks the utilization and generates the appropriate alarms. The administrator controls how often the listed filesystems are checked by running mon_dsk via cron at the desired interval. Because we needed to support incoming data from our customers, we monitored the incoming ftp directories very closely. The administrator also need not stop and restart the monitoring process to modify the list of filesystems: he or she can simply edit the input data file.
This script was designed to be run on each host with critical filesystems. Although we originally monitored every filesystem on a given host, we quickly decided that we should monitor only key filesystems.
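The core of such a check can be sketched in a few lines. This is an illustration rather than Listing 2. It assumes a data file with one filesystem per line followed by four percentage thresholds (warn, severe, critical, fatal), which may differ from the real mon_dsk.dat format, and it assumes a df that accepts -kP so that each filesystem is reported on a single line. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical filesystem check; data-file layout and names are assumptions.
DF="${DF:-df -kP}"     # OS-dependent command kept in a variable

# Print the percentage used for the filesystem holding PATH.
fs_usage() {
    $DF "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# classify_usage PCT WARN SEVERE CRITICAL FATAL -> prints the alarm level
classify_usage() {
    if   [ "$1" -ge "$5" ]; then echo FATAL
    elif [ "$1" -ge "$4" ]; then echo CRITICAL
    elif [ "$1" -ge "$3" ]; then echo SEVERE
    elif [ "$1" -ge "$2" ]; then echo WARN
    else echo OK
    fi
}

# Read "filesystem warn severe critical fatal" lines from DATAFILE and
# print "filesystem percent level" for each one; '#' lines are comments.
check_filesystems() {
    while read fs warn severe critical fatal; do
        case "$fs" in ''|'#'*) continue ;; esac
        used=$(fs_usage "$fs")
        [ -n "$used" ] || continue
        echo "$fs $used $(classify_usage "$used" "$warn" "$severe" "$critical" "$fatal")"
    done < "$1"
}
```

With a data line such as "/var 80 90 95 98", a filesystem at 91% would be classified as SEVERE. Run from cron, the output of check_filesystems could be fed to whatever alarm routine is in use.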
Certain processes were necessary for our business operations. If these processes failed in the middle of the night, we needed to have a mechanism in place to notify us. We then decided to allow the monitoring script itself to attempt to restart the failed process, and to notify us only if it was unable to do so.
As a result, the process monitoring script, mon_proc, reads from its input data file, mon_proc.dat (Listing 7), a process name and a command string that it will use to restart the process. The first time the script detects that the process is not running, it will log the fact that it is attempting to restart the process, and attempt to restart it. If the script detects that it has already tried to restart the process, it will generate a "CRITICAL" alarm, paging the administration team.
This script was also run on each host via cron. As with mon_dsk, the administrator can edit the input data file at any time to change the list of processes being monitored or the command used to restart a process.
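The restart-once-then-escalate logic can be sketched as follows. This is an illustration, not Listing 3. The flag-file mechanism, the STATEDIR location, the function names, and the use of "ps -e -o comm=" to list command names are all assumptions; a real script would hand the messages to its alarm routine rather than print them.

```shell
#!/bin/sh
# Hypothetical process check; state handling and names are assumptions.
PS="${PS:-ps -e -o comm=}"                    # OS-dependent, kept in a variable
STATEDIR="${STATEDIR:-/tmp/mon_proc.state}"   # where restart flags are kept

# Succeed if a process whose command basename equals NAME is running.
is_running() {
    $PS | awk -v name="$1" '
        { n = split($1, parts, "/"); if (parts[n] == name) found = 1 }
        END { exit !found }'
}

# check_process NAME RESTART_COMMAND
check_process() {
    name=$1
    restart_cmd=$2
    flag="$STATEDIR/$name.restarted"
    mkdir -p "$STATEDIR"
    if is_running "$name"; then
        rm -f "$flag"              # healthy again: forget earlier attempts
        return 0
    fi
    if [ -f "$flag" ]; then        # a restart was already tried: escalate
        echo "CRITICAL: $name is down and a restart was already attempted"
        return 2
    fi
    echo "WARN: $name not running; attempting restart"
    : > "$flag"                    # remember that we tried
    sh -c "$restart_cmd"
}
```

On the first failed check the function logs a warning and runs the restart command. If the process is still missing on the next cron run, it reports a CRITICAL condition instead of restarting again.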
Monitoring Host Availability
Although we had high-availability software in place, we wanted to be alerted when one of the Web server machines became unavailable so that we could immediately address the problem.
The mon_host script is designed to run on a single machine. It reads a list of hosts from the input data file mon_host.dat (Listing 8) and uses a simple ping to determine whether it can reach each one. This has the inherent weakness that if part of the network itself fails - for example a hub - the administrator will receive misleading alarms for hosts that are actually still running. We felt that this was an acceptable limitation for our network topology, although we did monitor our addressable network hardware as well.
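A ping-based reachability check needs very little code. The sketch below is an illustration, not Listing 4. It assumes a data file with one host name per line, and the function names are hypothetical. The ping invocation is the least portable part: "ping -c 1" suits Linux and BSD systems, whereas on Solaris a plain "ping host" already returns after a single answer, so the command is kept in a variable.

```shell
#!/bin/sh
# Hypothetical host check; data-file layout and names are assumptions.
PING="${PING:-ping -c 1}"     # OS-dependent; on Solaris, PING=ping

# Succeed if HOST answers a ping.
host_is_up() {
    $PING "$1" > /dev/null 2>&1
}

# Read one host per line from DATAFILE and print those that do not answer.
unreachable_hosts() {
    while read host rest; do
        case "$host" in ''|'#'*) continue ;; esac
        # Redirect stdin so the ping command cannot consume the host list.
        host_is_up "$host" < /dev/null || echo "$host"
    done < "$1"
}
```

Each name printed by unreachable_hosts is a candidate for an alarm. As noted above, a failed hub or router between the monitoring machine and its targets will make healthy hosts appear in this list.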
These scripts do not provide the level of sophistication or the convenient GUIs that some commercially available products do - particularly the host monitoring script. They do, however, provide basic problem detection and reporting. They are written in the Bourne shell for maximum portability. Although the scripts were developed in a Solaris environment, operating system dependent commands are stored in variables to aid in porting. Also, the installation directory of the scripts is stored in a variable and used to define the search path during execution. Finally, the price is unbeatable.
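As an illustration of keeping operating system dependent commands in variables, a script might choose them once at startup. This is a sketch under stated assumptions: the MON_HOME directory, the function names, and the particular command choices for each platform are hypothetical and are not taken from the listings.

```shell
#!/bin/sh
# Hypothetical portability setup; paths and command choices are assumptions.
MON_HOME="${MON_HOME:-/usr/local/monitor}"   # installation directory

# Set DF and PING for the operating system named by OSNAME (as from uname -s).
select_commands() {
    case "$1" in
        SunOS)
            DF="df -k"
            PING="/usr/sbin/ping"
            ;;
        Linux)
            DF="df -kP"
            PING="ping -c 1"
            ;;
        *)                      # conservative fallback for other systems
            DF="df -k"
            PING="ping"
            ;;
    esac
}

# Pick commands for this host and restrict the search path to known directories.
mon_setup() {
    select_commands "$(uname -s)"
    PATH="$MON_HOME:/usr/bin:/bin:/usr/sbin:/sbin"
    export PATH
}
```

Porting to a new platform then means adding one branch to the case statement rather than editing every script.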
Although more sophisticated products are available, we found that this simple implementation of an automated event management system greatly reduced the problems we faced as an administrative team by alerting us to problems before they escalated beyond recovery.
About the Author
Bruce Alan Wynn is a Senior Member of Pencom Systems Administration. He is the co-founder of the Seattle SAGE Group.