Configurable Subscription-Based Scripts for System Monitoring
Jeffrey Soto and Ravindra Nemleker
Networked distributed systems require remote logging
and monitoring.
However, for general purpose monitoring, SNMP-based
systems are
difficult to configure and limited in functionality.
This article describes a subscription-based system monitoring
facility
that provides fine-grained reporting of system events
and conditions.
The described system uses only native monitoring facilities
and scripts
and a message paging service. It does not require any
form of
sophisticated daemon for monitoring or reporting of
events.
Systems, monitoring services, and contacts are maintained
through
subscription lists. A separate notification facility
manages
administrator notification. This provides reporting
on a host on a by
condition, by user basis. You can target any host, for
many kinds of
system events, and notify any person or group, by mail
or page, at
various levels of urgency. Monitoring can be distributed
to scale on
large networks. In addition, the system can automatically
notify on
problem resolution _ a feature that is not found even
in expensive
system management tools.
The tool contains about 500 lines of Korn shell code
and is free of the
intricacies of different UNIX versions over heterogeneous
platforms.
Systems can be monitored irrespective of their hardware
and operating
system details. This tool is currently being used at
Donaldson, Lufkin
and Jenrette in New York to monitor over 100 UNIX and
Novell servers.
System and Log Monitoring
Large networks of distributed systems are continuously
producing events
or entering states that are of interest to various groups
or
individuals. The operating system, backups, databases,
batches, and
communication gateways all produce logged messages.
Filesystems,
processes, file transfers, and print and mail queues
all generate
copious amounts of information that needs to be monitored.
Monitoring and reporting intelligently on these events
is a significant
challenge. Obtaining information that can be used to
provide support
depends on being able to gather the information, filter
it, and
distribute it to the interested parties. Integrating
these various logs
with commercial SNMP-based event monitors is difficult.
SNMP-based event
monitors offer only a limited class of threshold monitors
and
network-oriented events traps and provide little flexibility
in defining
notification policy. Furthermore, such monitors often
furnish metrics
that are of marginal interest to application or systems
administrators.
It is also easy to become overwhelmed with information.
Out of all the
available data, only a small subset is meaningful. Such
reporting
results in messages being ignored. Separating the important
events from
the noise is key to managing enterprise wide distributed
systems.
However, there is no standard means of managing this
data. There is no
standard way of logging messages, and there are no daemons
that work for
all types of error logs.
The bright side is that there exists a common denominator
among all
these disparate systems. That common denominator is
flat ASCII text.
UNIXcommands that report in ordinary ASCII text can
be used to generate
information on most system facilities. By first identifying
key
parameters and substrings that correspond to specific,
high-interest
events, and then filtering and distributing those events
according to a
predefined notification policy, you can significantly
improve the
quality of the information you receive.
Specific information on various conditions can be obtained
by
interrogation scripts, polled at various frequencies.
Logs of any kind
can be scanned using string searches. Once information
on a condition or
event has been obtained, it must be dispatched according
to interest and
priority. Polling can be distributed to scale. Events
can be both pulled
passively from the host by the monitor or pushed by
the host by local
process monitors, such as a batch monitor.
At Donaldson, Lufkin, and Jenrette, we have developed
a simple, yet
effective, set of scripts to gather and distribute these
events.
Monitoring Services Types
There are two types of monitoring services:
Event/Log based
State based
Event-based reporting applies to system events that
are singular in
nature. Notification is once per occurrence. An example
would be a SCSI
disk error or a system reboot. Information on these
events is generally
parsed from system logs. These events are of interest
primarily to
system administrators. The Log Monitor (Listing 1) is
a standalone
script that parses the system log searching for predefined
error
strings.
State-based reporting deals with states or conditions.
These services
keep track of conditions and report on entry as well
as exit from
particular conditions. They are thus stateful in nature.
An example
would be an unreachable host, or a file system that
was over capacity.
Knowing about a change in the condition _ i.e., that
the host is now
reachable or that the filesystem is now safely within
operating
parameters _ allows support groups to coordinate their
activities. If
multiple parties simultaneously receive an urgent page,
they may be
unable reach each other to know if the problem is being
addressed.
Automatic notification on problem resolution means that
on-call
administrators don't have to login to find out whether
or not a problem
has already been resolved.
As we have implemented it, the system log monitor and
condition monitor
are on two separate systems (see Figure 1). This allows
the two monitor
systems to independently keep an eye on each other.
It also provides
some protection against human error as the two systems
are configured
separately and both are capable of independently reporting
on a down
host (see Figure 2). If one monitor server is malfunctioning
or is
misconfigured, the other will report the down host.
Each host receiving any monitoring service gets a dispatch
directory
entry on the monitor hosts (Figure 3). A separate subscription
list is
maintained for each service. Current probe services
are
Network Connection Monitor
Filesystem Capacity Monitor
System Log Monitor
Database Log Monitor
Uptime Monitor
Each monitor service obtains configuration information
for the problem
type from a generic problem configuration file where
defaults can be
set. More specific behavior can be set at the individual
host level.
Each monitor service may require different parameters.
For example, a
disk monitor will have a default warning threshold of
85 percent. Some
systems may require a higher or lower threshold or may
only be
interested in some filesystems and not others. Monitored
hosts are
listed in a minimum of two lists, a specific service
list, which
controls access to a particular service, and a master
monitor list,
which controls access to all monitor services. This
provides a mechanism
to turn off all monitoring services for a host without
having to remove
the name from multiple monitor service lists.
Once an error condition in a given host exists, a named
condition or
event problem ticket gets created in the dispatch directory
established
for that host. The event is simultaneously logged for
inclusion in daily
and weekly reports.
The Notifier
Notification of events and conditions is handled by
a separate process
that cycles through the host dispatch directories looking
for event
tickets. When one is found, the notifier consults the
hierarchical
configuration files for contact information and notification
options.
The name of the file determines its type and whether
it is stateful or
not. The actual error message is contained in the file.
In this way the
notifier is generalized, and can provide this service
for any type of
ticket. It is the responsibility of the monitoring service
that created
the ticket to remove it. Stateful tickets are removed
by the creation of
an "OK" ticket. This signals the notifier
to send the "all clear"
message, as defined by the monitoring service. Singular
event tickets
are removed by the notifier since they require no follow-up.
Unresolved tickets trigger a daily reminder message,
so that outstanding
problems don't get forgotten or overlooked.
Conditions can optionally trigger a script listed in
the configuration
file. This script will be invoked once on entry into
the condition and
can perform advanced diagnostics or execute a contingency
option. For
example, an over-capacity filesystem could trigger an
emailed disk usage
report to facilitate the clean-up. A down host could
trigger a router
diagnostic to check the network environment.
This structure allows new types of monitoring services
to be added while
using the same backend notification mechanism. A host
with its own
process monitor can create and send a problem ticket
to a relay host to
be picked up by the notifier. This eliminates the need
to reinvent the
subscription, configuration and reporting backend. Also,
if the relay
host is used, it isn't necessary to rsh to a trusted
host to deposit
tickets, and polling is reduced. The monitor host needs
only to poll the
relay hosts.
Further enhancements will include a network named pipe
to transport
events directly to the monitor host using a socket.
This would
facilitate local process monitoring and eliminate the
latency associated
with the relay host. It would also permit high-priority
events to be
asynchronously logged to the dispatch directories. The
process would
resemble the print spooler in structure.
The end result is reliable and consistent system event
monitoring and a
consolidated log regardless of the subsystem source.
Notification policy
can be defined very precisely and the system can be
easily administered.
Nothing needs to be done to the host being actively
monitored, so there
is no licensing or need for a local agent. This is an
important cost
consideration in large networks. Since only generic
scripting and plain
vanilla TCP/IP utilities are used, portability across
heterogeneous
platforms is assured.
Services
The following is a description of the files, scripts,
and directories
used as part of the monitor system;
monlist (a list of monitored systems)
config.filesystems (examples provided below)
event_log.ksh (Listing 1)
check_alive.ksh (Listing 2)
dskmon.ksh (Listing 3)
check_filesystem_limit.ksh (Listing 4)
logevent.ksh (Listing 5)
logok.ksh (Listing 6)
notification.ksh (Listing 7)
collect_events.ksh (Listing 9)
config.generic (Listing 10)
Service Subscription Lists and Configuration Directories
Service subscription lists are maintained for each service
in a file
called monlist. This file contains the names of all
the machines which
are to be monitored. Each machine is listed on a separate
line. Comments
should be preceded by a # sign and should occur at the
start of the line
only. The monitor lists are maintained for each monitor
service. The
master monitor list provides the method to enable or
disable all
monitoring services for a particular host.
Figure 3 shows the configuration directories used by
these scripts.
The Event/Log Monitor Script
event_log.ksh polls all monitored systems and performs
a string search
on each system log. The strings are carefully selected
to report on the
most serious error events. This is the key to filtering
the information
and only reporting critical events. The script first
performs a ping to
the system to ensure that it's connected. The script
is capable of
distinguishing new events from previously reported ones.
Connection Monitor Script
check_alive.ksh is used for checking if the server is
alive on the
network. It uses an rpc version of ping, called newping,
which checks if
the machine can respond to a connection request. For
non-UNIX based
machines (like Novell and Tandem), the PING variable
should be set to
the generic ping command. The value of this variable
is the command
which will be used for checking the connectivity of
the system. If the
machine does not respond, the script waits for some
time (the default is
three seconds) before making the next attempt. If the
second attempt
fails, the script responds with an error. This reduces
the number of
false alarms.
If an error is detected, a file named Condition.NO_CONNECT
is created in
the ${BASEDIR}/hosts/<hostname> directory. This
file contains the
specific error message generated by the ping/newping
command. If the
machine responds to a polling, then the script checks
for the presence
of a previously generated Condition.NO_CONNECT file.
If one exists, the
script then creates a End.Condition.NO_CONNECT file.
The presence of
this file signals the notifier to remove the condition
files and,
optionally, send an all-clear message.
This script accepts parameters and can be used standalone
or
incorporated into other scripts. If no parameters are
passed, all the
machines present in monlist are checked. System names
can also be passed
as parameters for checking.
The Disk Monitoring Scripts
dskmon.ksh is the disk monitoring script. It calls
check_filesystem_limit.ksh to check for two limits,
a low-water mark and
a high-water mark. The low-water mark can be used for
warning messages
and the high-water mark for more serious reporting.
The low-water mark
is called the LOW_FILE_LIMIT and is set by default to
80 percent. The
high-water mark is called the HIGH_FILE_LIMIT and defaults
to 95 percent
filesystem full. The error messages are contained in
the files,
Condition.LOW_FILE_LIMIT and Condition.HIGH_FILE_LIMIT,
for the
respective cases. Limits can be set on a per-filesystem
basis in the
file config.filesystems.
config.filesystems contains the limits on a per-filesystem
basis. The
configuration file can be present at a global level
for all hosts or for
each individual host. The format of the file is:
<filesystem> <low_filesystem_limit> <high_filesystem_limit>
e.g.,
/usr 85 95
This specifies that the low limit for the /usr filesystem
is 85 percent
and the high limit is 95 percent.
Generic Event Notification Scripts
logevent.ksh is used to log events and conditions onto
a single host,
called as loggerhost with userid loguser. This script
requires a
filename as its parameter. It copies this file to the
system loggerhost
with userid loguser as Condition.EVENT_TYPE, where EVENT_TYPE
is the
filename passed to the script. The filename identifies
the event. The
contents of the file consist of a description of the
event. For example,
logevent.ksh
/tmp/SYBASE_SERVER_NOT_RESPONDING
logok.ksh is used to log error-resolved messages. These
messages are
logged onto a single host, called as loggerhost, using
the userid
loguser.
logok.ksh requires a filename as its parameter. It copies
this file to
the system loggerhost with userid loguser as
End.Condition.CONDITION_TYPE, where CONDITION_TYPE is
the filename
passed to the script.
The filename identifies the event. The contents of the
file consist of a
description of the okay message. Care should be taken
that the filename
CONDITION_TYPE is the same as that used for logging
the condition, e.g.:
logok.ksh
/var/tmp/SYBASE_SERVER_NOT_RESPONDING
collect_events.ksh gathers the condition and okay files
from the
loggerhost system. These files are then deleted from
loggerhost to
prevent duplicate logging of messages.
Notification Script
notification.ksh is the heart of the monitoring system.
It picks up all
the errors, events, and conditions deposited by the
monitor services, or
collected by collect_events.ksh script and sends them
to the subscribed
users or groups. The users or groups to which notification
should be
sent can be configured in the config.generic file. Two
types of
notification are currently supported, mailing and paging.
The two
notification styles can be configured differently. Parameters
can be set
at the global level by editing the global config.generic
file.
Parameters can be set for each condition or event type
on a global
basis. Further tuning can be done within the configuration
file present
per host for each condition/event type, e.g.
#global config file.
config.generic
#event/cond specific file.
<CONDITION or EVENT>/config.generic
# condition per host
hosts/<host>/config.<CONDITION or EVENT>
The format of the configuration files for all the conditions
is the
same. These files make up the basis of the subscription
system. The
format is as follows:
<parameter_name>=<parameter_value>
White spaces are not supported in parameter_name. White
spaces in
parameter_value should be enclosed in double quotes.
There should be no
whitespace(s) on either side of the "=" sign.
The following
parameter_names are supported.
BEEP_CONTACTS: This parameter consists of the names
of the users and/or
groups which should be notified, by paging, for conditions,
events, and
ends of condition. If there are multiple arguments to
this parameter
they should be enclosed within double quotes.
BEEP_CONTACTS="sa dba" # Page sa and dba.
BEEP_COUNT: The argument to this parameter is the number
of times the
BEEP_CONTACTS will be paged. For events which do not
maintain a state,
this parameter is ignored.
BEEP_COUNT=2 # Beep only twice on a Condition.
BEEP_DAILY: Turn the paging of daily reminders on or
off. Setting this
value to 1 turns it on. Any other value turns it off.
BEEP_DAILY=0
MAIL_DAILY: Turn the mailing of daily reminders on or
off. Setting this
value to 1 turns it on. Any other value turns it off.
MAIL_DAILY=1
MAIL_CONTACTS: The users or groups who should be notified
by mail.
Multiple names or groups should be enclosed in double
quotes.
MAIL_CONTACTS="sa dba"
MAIL_FLAG: Setting this value to 1 turns the mail sending
option on. Any
other value turns mail off.
MAIL_FLAG=1
OK_FLAG: This flag sends an okay message when the condition
changes.
Setting the value to 1 sends the okay, any other value
does not send an
okay.
OK_FLAG=1
EVENT_FLAG: This value is set to 1 for a condition and
to 0 for an
event. Events require no follow-up. The notifier removes
the event after
notification.
EVENT_FLAG=1
SCRIPT: This parameter specifies a program which is
to be run when the
condition or event is encountered for the first time.
For example, in
case of a full filesystem, a find command can be run
to delete all the
core dumps.
SCRIPT="find / -name core -exec /bin/rm -f {} \; "
This feature lets you extend the monitoring system to
perform extended
diagnostics, execute contingencies, kick off back-ups,
or perform some
other job as a result of a monitored condition.
Conclusion
With some planning, the system presented here can be
configured to
monitor and report on your site's events and states.
You'll need to
consider what kind of information you want and decide
who should receive
the information. Once you've installed it, you'll find
that it provides
both a highly useful information filter and a reliable
means of
notifying interested parties of system conditions.
About the Authors
Jeffrey Soto is Vice President, Distributed Systems,
at Donaldson,
Lufkin and Jenrette in New York. He can be reached via
the Internet as
jsoto@dlj.com.
Ravindra Nemlekar is a system administrator at Donaldson,
Lufkin and
Jenrette and is responsible for application-based services.
He can be
reached via the Internet as ravi@dlj.com.
|