
Configurable Subscription-Based Scripts for System Monitoring

Jeffrey Soto and Ravindra Nemlekar

Networked distributed systems require remote logging and monitoring. However, for general-purpose monitoring, SNMP-based systems are difficult to configure and limited in functionality.

This article describes a subscription-based system monitoring facility that provides fine-grained reporting of system events and conditions. The described system uses only native monitoring facilities and scripts and a message paging service. It does not require any form of sophisticated daemon for monitoring or reporting of events.

Systems, monitoring services, and contacts are maintained through subscription lists. A separate notification facility manages administrator notification. This provides per-host reporting on a by-condition, by-user basis. You can target any host, for many kinds of system events, and notify any person or group, by mail or page, at various levels of urgency. Monitoring can be distributed to scale on large networks. In addition, the system can automatically notify on problem resolution, a feature that is not found even in expensive system management tools.

The tool contains about 500 lines of Korn shell code and is free of the intricacies of different UNIX versions over heterogeneous platforms. Systems can be monitored irrespective of their hardware and operating system details. This tool is currently being used at Donaldson, Lufkin and Jenrette in New York to monitor over 100 UNIX and Novell servers.

System and Log Monitoring

Large networks of distributed systems are continuously producing events or entering states that are of interest to various groups or individuals. The operating system, backups, databases, batches, and communication gateways all produce logged messages. Filesystems, processes, file transfers, and print and mail queues all generate copious amounts of information that needs to be monitored.

Monitoring and reporting intelligently on these events is a significant challenge. Obtaining information that can be used to provide support depends on being able to gather the information, filter it, and distribute it to the interested parties. Integrating these various logs with commercial SNMP-based event monitors is difficult. SNMP-based event monitors offer only a limited class of threshold monitors and network-oriented event traps and provide little flexibility in defining notification policy. Furthermore, such monitors often furnish metrics that are of marginal interest to application or systems administrators.

It is also easy to become overwhelmed with information. Out of all the available data, only a small subset is meaningful. Reporting everything indiscriminately results in messages being ignored. Separating the important events from the noise is key to managing enterprise-wide distributed systems. However, there is no standard means of managing this data. There is no standard way of logging messages, and there are no daemons that work for all types of error logs.

The bright side is that there exists a common denominator among all these disparate systems. That common denominator is flat ASCII text. UNIX commands that report in ordinary ASCII text can be used to generate information on most system facilities. By first identifying key parameters and substrings that correspond to specific, high-interest events, and then filtering and distributing those events according to a predefined notification policy, you can significantly improve the quality of the information you receive.

Specific information on various conditions can be obtained by interrogation scripts, polled at various frequencies. Logs of any kind can be scanned using string searches. Once information on a condition or event has been obtained, it must be dispatched according to interest and priority. Polling can be distributed to scale. Events can be either pulled from the host by the monitor or pushed from the host by local process monitors, such as a batch monitor.

At Donaldson, Lufkin, and Jenrette, we have developed a simple, yet effective, set of scripts to gather and distribute these events.

Monitoring Service Types

There are two types of monitoring services:

Event/Log based

State based

Event-based reporting applies to system events that are singular in nature. Notification is once per occurrence. An example would be a SCSI disk error or a system reboot. Information on these events is generally parsed from system logs. These events are of interest primarily to system administrators. The Log Monitor (Listing 1) is a standalone script that parses the system log searching for predefined error strings.

State-based reporting deals with states or conditions. These services keep track of conditions and report on entry as well as exit from particular conditions. They are thus stateful in nature. An example would be an unreachable host, or a file system that was over capacity. Knowing about a change in the condition (i.e., that the host is now reachable or that the filesystem is now safely within operating parameters) allows support groups to coordinate their activities. If multiple parties simultaneously receive an urgent page, they may be unable to reach each other to know if the problem is being addressed. Automatic notification on problem resolution means that on-call administrators don't have to log in to find out whether or not a problem has already been resolved.

As we have implemented it, the system log monitor and condition monitor are on two separate systems (see Figure 1). This allows the two monitor systems to independently keep an eye on each other. It also provides some protection against human error as the two systems are configured separately and both are capable of independently reporting on a down host (see Figure 2). If one monitor server is malfunctioning or is misconfigured, the other will report the down host.

Each host receiving any monitoring service gets a dispatch directory entry on the monitor hosts (Figure 3). A separate subscription list is maintained for each service. Current probe services are:

Network Connection Monitor

Filesystem Capacity Monitor

System Log Monitor

Database Log Monitor

Uptime Monitor

Each monitor service obtains configuration information for the problem type from a generic problem configuration file where defaults can be set. More specific behavior can be set at the individual host level. Each monitor service may require different parameters. For example, a disk monitor will have a default warning threshold of 85 percent. Some systems may require a higher or lower threshold or may only be interested in some filesystems and not others. Monitored hosts are listed in a minimum of two lists: a specific service list, which controls access to a particular service, and a master monitor list, which controls access to all monitor services. This provides a mechanism to turn off all monitoring services for a host without having to remove the name from multiple monitor service lists.

Once an error condition in a given host exists, a named condition or event problem ticket gets created in the dispatch directory established for that host. The event is simultaneously logged for inclusion in daily and weekly reports.

The Notifier

Notification of events and conditions is handled by a separate process that cycles through the host dispatch directories looking for event tickets. When one is found, the notifier consults the hierarchical configuration files for contact information and notification options. The name of the file determines its type and whether it is stateful or not. The actual error message is contained in the file. In this way the notifier is generalized, and can provide this service for any type of ticket. It is the responsibility of the monitoring service that created the ticket to remove it. Stateful tickets are removed by the creation of an "OK" ticket. This signals the notifier to send the "all clear" message, as defined by the monitoring service. Singular event tickets are removed by the notifier since they require no follow-up.
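The notifier's walk through the dispatch directories might be sketched as follows. This is a minimal illustration, not the code of Listing 7: it assumes the Condition.<TYPE> and End.Condition.<TYPE> ticket names and the hosts/<hostname> directory layout used by the connection monitor, and the function names ticket_action and scan_dispatch are hypothetical.

```shell
# Sketch only: classify dispatch tickets by file name.
# Assumes Condition.<TYPE> / End.Condition.<TYPE> ticket names and a
# <base>/hosts/<hostname> layout; function names are hypothetical.

# ticket_action NAME: print what a notifier would do with this ticket.
ticket_action() {
    case $1 in
        End.Condition.*) echo "all-clear ${1#End.Condition.}" ;;
        Condition.*)     echo "notify ${1#Condition.}" ;;
        *)               echo "ignore $1" ;;
    esac
}

# scan_dispatch BASE: print "<host> <action> <type>" for every ticket found.
scan_dispatch() (
    for path in "$1"/hosts/*/*Condition.*; do
        [ -f "$path" ] || continue          # glob matched nothing
        host=$(basename "$(dirname "$path")")
        echo "$host $(ticket_action "$(basename "$path")")"
    done
)
```

Because the action is derived from the file name alone, a new monitoring service only has to drop a correctly named file into the host's directory to use the same backend.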

Unresolved tickets trigger a daily reminder message, so that outstanding problems don't get forgotten or overlooked.

Conditions can optionally trigger a script listed in the configuration file. This script will be invoked once on entry into the condition and can perform advanced diagnostics or execute a contingency option. For example, an over-capacity filesystem could trigger an emailed disk usage report to facilitate the clean-up. A down host could trigger a router diagnostic to check the network environment.

This structure allows new types of monitoring services to be added while using the same backend notification mechanism. A host with its own process monitor can create and send a problem ticket to a relay host to be picked up by the notifier. This eliminates the need to reinvent the subscription, configuration and reporting backend. Also, if the relay host is used, it isn't necessary to rsh to a trusted host to deposit tickets, and polling is reduced. The monitor host needs only to poll the relay hosts.

Further enhancements will include a network named pipe to transport events directly to the monitor host using a socket. This would facilitate local process monitoring and eliminate the latency associated with the relay host. It would also permit high-priority events to be asynchronously logged to the dispatch directories. The process would resemble the print spooler in structure.

The end result is reliable and consistent system event monitoring and a consolidated log regardless of the subsystem source. Notification policy can be defined very precisely and the system can be easily administered. Nothing needs to be done to the host being actively monitored, so there is no licensing or need for a local agent. This is an important cost consideration in large networks. Since only generic scripting and plain vanilla TCP/IP utilities are used, portability across heterogeneous platforms is assured.

Services

The following is a description of the files, scripts, and directories used as part of the monitor system:

monlist (a list of monitored systems)

config.filesystems (examples provided below)

event_log.ksh (Listing 1)

check_alive.ksh (Listing 2)

dskmon.ksh (Listing 3)

check_filesystem_limit.ksh (Listing 4)

logevent.ksh (Listing 5)

logok.ksh (Listing 6)

notification.ksh (Listing 7)

collect_events.ksh (Listing 9)

config.generic (Listing 10)

Service Subscription Lists and Configuration Directories

Service subscription lists are maintained for each service in a file called monlist. This file contains the names of all the machines which are to be monitored. Each machine is listed on a separate line. Comments should be preceded by a # sign and should occur at the start of the line only. The monitor lists are maintained for each monitor service. The master monitor list provides the method to enable or disable all monitoring services for a particular host.
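As a rough sketch of how the two lists interact, a service could check both files before polling a host. The function names in_list and is_monitored are hypothetical, as are the file paths passed to them; only the one-host-per-line format with # comments at the start of a line comes from the description above.

```shell
# Sketch only: a host is polled for a service when it is listed in BOTH
# the master monitor list and that service's own list.
# Function names and file locations are illustrative assumptions.

# in_list FILE HOST: succeed if HOST appears as a whole, non-comment line.
in_list() {
    grep -v '^#' "$1" 2>/dev/null | grep -Fqx -- "$2"
}

# is_monitored MASTER_LIST SERVICE_LIST HOST
is_monitored() {
    in_list "$1" "$3" && in_list "$2" "$3"
}
```

Commenting out a host in the master list is then enough to silence every service for it, while the per-service lists stay untouched.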

Figure 3 shows the configuration directories used by these scripts.

The Event/Log Monitor Script

event_log.ksh polls all monitored systems and performs a string search on each system log. The strings are carefully selected to report on the most serious error events. This is the key to filtering the information and only reporting critical events. The script first performs a ping to the system to ensure that it's connected. The script is capable of distinguishing new events from previously reported ones.
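The text does not say how event_log.ksh tells new events from old ones. One simple approach, shown here purely as an assumption, is to remember how many lines of the log have already been scanned and search only the lines added since. The sketch works on a local copy of the log; scan_new, the state file, and the pattern file are hypothetical names.

```shell
# Sketch only: report matching log lines that were not seen on the
# previous pass. The line-count state file is an assumed mechanism.

# scan_new LOG STATE PATTERNS
#   LOG       local copy of the system log
#   STATE     file holding the number of lines already scanned
#   PATTERNS  file with one fixed error string per line
scan_new() (
    log=$1 state=$2 patterns=$3
    seen=0
    [ -f "$state" ] && seen=$(cat "$state")
    total=$(wc -l < "$log")
    total=$((total + 0))                    # normalize padded wc output
    [ "$total" -lt "$seen" ] && seen=0      # log was rotated or truncated
    tail -n +"$((seen + 1))" "$log" | grep -F -f "$patterns"
    echo "$total" > "$state"
)
```

Keeping the pattern file short and specific is what does the filtering; the state file only prevents the same line from being reported twice.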

Connection Monitor Script

check_alive.ksh checks whether the server is alive on the network. It uses an RPC version of ping, called newping, which checks if the machine can respond to a connection request. For non-UNIX based machines (like Novell and Tandem), the PING variable should be set to the generic ping command. The value of this variable is the command which will be used for checking the connectivity of the system. If the machine does not respond, the script waits for some time (the default is three seconds) before making the next attempt. If the second attempt fails, the script responds with an error. This reduces the number of false alarms.

If an error is detected, a file named Condition.NO_CONNECT is created in the ${BASEDIR}/hosts/<hostname> directory. This file contains the specific error message generated by the ping/newping command. If the machine responds to a poll, then the script checks for the presence of a previously generated Condition.NO_CONNECT file. If one exists, the script then creates an End.Condition.NO_CONNECT file. The presence of this file signals the notifier to remove the condition files and, optionally, send an all-clear message.
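The retry and ticket logic described above could look roughly like this. It is a sketch under stated assumptions rather than Listing 2: the default of "ping -c 1" stands in for newping (whose options are not given here), /var/monitor is a made-up BASEDIR, and RETRY_WAIT and check_alive are hypothetical names.

```shell
# Sketch only: two-attempt reachability check that writes condition tickets.
# PING default, BASEDIR default, and RETRY_WAIT are assumptions.
PING=${PING:-"ping -c 1"}         # any command that exits 0 when the host answers
RETRY_WAIT=${RETRY_WAIT:-3}       # seconds to wait before the second attempt
BASEDIR=${BASEDIR:-/var/monitor}  # hypothetical location of the dispatch tree

# check_alive HOST
check_alive() (
    host=$1
    dir=$BASEDIR/hosts/$host
    mkdir -p "$dir"
    if msg=$($PING "$host" 2>&1) ||
       { sleep "$RETRY_WAIT"; msg=$($PING "$host" 2>&1); }; then
        # Host answered: close a previously opened condition, if any.
        if [ -f "$dir/Condition.NO_CONNECT" ]; then
            echo "$host is reachable again" > "$dir/End.Condition.NO_CONNECT"
        fi
    else
        # Both attempts failed: record the ping output as the ticket text.
        echo "$msg" > "$dir/Condition.NO_CONNECT"
    fi
)
```

Only a failure on both attempts opens a ticket, which is how a single dropped packet is kept from paging anyone.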

This script accepts parameters and can be used standalone or incorporated into other scripts. If no parameters are passed, all the machines present in monlist are checked. System names can also be passed as parameters for checking.

The Disk Monitoring Scripts

dskmon.ksh is the disk monitoring script. It calls check_filesystem_limit.ksh to check for two limits, a low-water mark and a high-water mark. The low-water mark can be used for warning messages and the high-water mark for more serious reporting. The low-water mark is called the LOW_FILE_LIMIT and is set by default to 80 percent. The high-water mark is called the HIGH_FILE_LIMIT and defaults to 95 percent filesystem full. The error messages are contained in the files Condition.LOW_FILE_LIMIT and Condition.HIGH_FILE_LIMIT, respectively. Limits can be set on a per-filesystem basis in the file config.filesystems.

config.filesystems contains the limits on a per-filesystem basis. The configuration file can be present at a global level for all hosts or for each individual host. The format of the file is:

<filesystem> <low_filesystem_limit> <high_filesystem_limit>
e.g.,
/usr 85 95

This specifies that the low limit for the /usr filesystem is 85 percent and the high limit is 95 percent.
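A lookup against config.filesystems might be sketched as below. The functions limits_for and classify are hypothetical, the percentage is passed in rather than parsed from df output, and treating "at or above the limit" as a breach is an assumption. The defaults of 80 and 95 are the ones given above.

```shell
# Sketch only: resolve per-filesystem limits and classify a usage figure.
# Function names and the >= comparison are assumptions.
LOW_FILE_LIMIT=${LOW_FILE_LIMIT:-80}
HIGH_FILE_LIMIT=${HIGH_FILE_LIMIT:-95}

# limits_for FS CONFIG...: print "low high" from the first config file
# that lists FS (pass the host file before the global one), else defaults.
limits_for() (
    fs=$1; shift
    for cfg in "$@"; do
        [ -f "$cfg" ] || continue
        line=$(awk -v want="$fs" '$1 == want { print $2, $3; exit }' "$cfg")
        [ -n "$line" ] && { echo "$line"; exit 0; }
    done
    echo "$LOW_FILE_LIMIT $HIGH_FILE_LIMIT"
)

# classify FS PERCENT_USED CONFIG...: print the condition name, or OK.
classify() (
    fs=$1 pct=$2; shift 2
    set -- $(limits_for "$fs" "$@")     # $1 = low limit, $2 = high limit
    if   [ "$pct" -ge "$2" ]; then echo HIGH_FILE_LIMIT
    elif [ "$pct" -ge "$1" ]; then echo LOW_FILE_LIMIT
    else echo OK
    fi
)
```

With the /usr line from the example in a host-level file, classify /usr 90 would report LOW_FILE_LIMIT and classify /usr 96 would report HIGH_FILE_LIMIT.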

Generic Event Notification Scripts

logevent.ksh is used to log events and conditions onto a single host, called loggerhost, using the userid loguser. This script requires a filename as its parameter. It copies this file to the system loggerhost with userid loguser as Condition.EVENT_TYPE, where EVENT_TYPE is the filename passed to the script. The filename identifies the event. The contents of the file consist of a description of the event. For example,

logevent.ksh /tmp/SYBASE_SERVER_NOT_RESPONDING

logok.ksh is used to log error-resolved messages. These messages are logged onto a single host, called loggerhost, using the userid loguser.

logok.ksh requires a filename as its parameter. It copies this file to the system loggerhost with userid loguser as End.Condition.CONDITION_TYPE, where CONDITION_TYPE is the filename passed to the script.

The filename identifies the event. The contents of the file consist of a description of the okay message. Care should be taken that the filename CONDITION_TYPE is the same as that used for logging the condition, e.g.:

logok.ksh /var/tmp/SYBASE_SERVER_NOT_RESPONDING

collect_events.ksh gathers the condition and okay files from the loggerhost system. These files are then deleted from loggerhost to prevent duplicate logging of messages.
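The naming performed by logevent.ksh and logok.ksh can be sketched in a few lines. This is not the code of Listings 5 and 6: the use of rcp, the COPY and DEST variables, and the function names are assumptions made for illustration, and the default destination simply drops tickets into loguser's home directory.

```shell
# Sketch only: push a ticket to the logger host under the expected name.
# COPY, DEST, and the function names are hypothetical.
COPY=${COPY:-rcp}
DEST=${DEST:-loguser@loggerhost:.}   # assumed drop directory on loggerhost

# log_event FILE: deposit FILE as Condition.<basename of FILE>
log_event() { $COPY "$1" "$DEST/Condition.$(basename "$1")"; }

# log_ok FILE: deposit FILE as End.Condition.<basename of FILE>
log_ok() { $COPY "$1" "$DEST/End.Condition.$(basename "$1")"; }
```

Only the base name of the file is used, so the condition and its okay message can be written in different directories as long as the file name itself matches.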

Notification Script

notification.ksh is the heart of the monitoring system. It picks up all the errors, events, and conditions deposited by the monitor services, or collected by the collect_events.ksh script, and sends them to the subscribed users or groups. The users or groups to which notification should be sent can be configured in the config.generic file. Two types of notification are currently supported, mailing and paging. The two notification styles can be configured differently. Parameters can be set at the global level by editing the global config.generic file. Parameters can be set for each condition or event type on a global basis. Further tuning can be done within the configuration file present per host for each condition/event type, e.g.

#global config file.
config.generic
#event/cond specific file.
<CONDITION or EVENT>/config.generic
# condition per host
hosts/<host>/config.<CONDITION or EVENT>
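Since the parameter files in this section use the same name=value form as shell assignments, one plausible way to resolve the hierarchy is to source the files from least to most specific, so that later files override earlier ones. The sketch below assumes that approach; get_param is a hypothetical helper and is not taken from Listing 7.

```shell
# Sketch only: resolve one parameter through the three configuration levels.
# Sourcing the files is an assumed mechanism; get_param is hypothetical.

# get_param BASE HOST TYPE NAME
#   BASE  directory holding the global config.generic
#   TYPE  condition or event name, e.g. NO_CONNECT
#   NAME  parameter to look up, e.g. BEEP_COUNT
get_param() (
    base=$1 host=$2 type=$3 name=$4
    # Least specific first; a later file overrides an earlier one.
    for cfg in "$base/config.generic" \
               "$base/$type/config.generic" \
               "$base/hosts/$host/config.$type"; do
        [ -f "$cfg" ] && . "$cfg"
    done
    eval "printf '%s\n' \"\${$name}\""
)
```

The function body runs in a subshell, so nothing set by the configuration files leaks into the caller. Because the files are executed as shell code, they should be writable only by the monitor's administrators.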

The format of the configuration files for all the conditions is the same. These files make up the basis of the subscription system. The format is as follows:

<parameter_name>=<parameter_value>

Whitespace is not supported in parameter_name. A parameter_value that contains whitespace should be enclosed in double quotes. There should be no whitespace on either side of the "=" sign. The following parameter_names are supported.

BEEP_CONTACTS: This parameter consists of the names of the users and/or groups which should be notified, by paging, for conditions, events, and ends of condition. If there are multiple arguments to this parameter they should be enclosed within double quotes.

BEEP_CONTACTS="sa dba" # Page sa and dba.

BEEP_COUNT: The argument to this parameter is the number of times the BEEP_CONTACTS will be paged. For events which do not maintain a state, this parameter is ignored.

BEEP_COUNT=2    # Beep only twice on a Condition.

BEEP_DAILY: Turn the paging of daily reminders on or off. Setting this value to 1 turns it on. Any other value turns it off.

BEEP_DAILY=0

MAIL_DAILY: Turn the mailing of daily reminders on or off. Setting this value to 1 turns it on. Any other value turns it off.

MAIL_DAILY=1

MAIL_CONTACTS: The users or groups who should be notified by mail. Multiple names or groups should be enclosed in double quotes.

MAIL_CONTACTS="sa dba"

MAIL_FLAG: Setting this value to 1 turns the mail sending option on. Any other value turns mail off.

MAIL_FLAG=1

OK_FLAG: This flag sends an okay message when the condition changes. Setting the value to 1 sends the okay; any other value does not.

OK_FLAG=1

EVENT_FLAG: This value is set to 1 for a condition and to 0 for an event. Events require no follow-up. The notifier removes the event after notification.

EVENT_FLAG=1

SCRIPT: This parameter specifies a program which is to be run when the condition or event is encountered for the first time. For example, in case of a full filesystem, a find command can be run to delete all the core dumps.

SCRIPT="find / -name core -exec /bin/rm -f {} \; "

This feature lets you extend the monitoring system to perform extended diagnostics, execute contingencies, kick off back-ups, or perform some other job as a result of a monitored condition.
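How the notifier remembers that SCRIPT has already run for a given condition is not described here. One possible approach, shown only as an assumption, is a marker file kept next to the ticket; run_once, clear_marker, and the .ran.<ticket> marker name are all hypothetical.

```shell
# Sketch only: run a condition's SCRIPT the first time its ticket is seen.
# The marker-file mechanism and all names here are assumptions.

# run_once TICKET SCRIPT: run SCRIPT unless it already ran for TICKET.
run_once() (
    ticket=$1 script=$2
    marker=$(dirname "$ticket")/.ran.$(basename "$ticket")
    [ -n "$script" ] || exit 0      # no SCRIPT configured
    [ -e "$marker" ] && exit 0      # already ran for this condition
    : > "$marker"
    sh -c "$script"
)

# clear_marker TICKET: forget the run, e.g. once the condition has ended.
clear_marker() {
    rm -f "$(dirname "$1")/.ran.$(basename "$1")"
}
```

Clearing the marker when the condition ends lets the script run again the next time the host enters that condition.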

Conclusion

With some planning, the system presented here can be configured to monitor and report on your site's events and states. You'll need to consider what kind of information you want and decide who should receive the information. Once you've installed it, you'll find that it provides both a highly useful information filter and a reliable means of notifying interested parties of system conditions.

About the Authors

Jeffrey Soto is Vice President, Distributed Systems, at Donaldson, Lufkin and Jenrette in New York. He can be reached via the Internet as jsoto@dlj.com.

Ravindra Nemlekar is a system administrator at Donaldson, Lufkin and Jenrette and is responsible for application-based services. He can be reached via the Internet as ravi@dlj.com.