Log File Event Management with Waldo
Joe Aguiar
A common problem for systems administrators is the efficient identification and prioritization of problems as they occur on a variety of platforms across a network. A popular tool used for this purpose is the UNIX syslog utility. By feeding messages from multiple services (inetd, SNMP traps, etc.) into syslog and forwarding messages to one or more platforms acting as a centralized syslog console, the systems administrator can monitor the health of the network from a single point.
If the site is large or very busy, the number of log messages can quickly become overwhelming. The problem now becomes:
How can the truly serious problems be identified and the proper sys admin notified?
How can appropriate action be taken automatically?
How can redundant notices or actions be avoided?
How can each sys admin monitor messages according to responsibility?
There are a number of commercial applications as well as a few publicly available tools that can accomplish these things. This article describes how an application called Waldo was developed using open source tools (Perl, Tcl/Tk, flex, gcc) to accomplish these goals at one site.
A Solution
Waldo is a Perl application that uses regular expressions to identify events and to execute an action associated with the event. The action may be the execution of any script or program that the uid of the user has privilege to execute. In fact, the action can be a semicolon-separated list of actions to be taken in sequence. Waldo provides no predefined actions -- only those specifically defined by the user may be executed. It is therefore a simple matter to integrate existing email notification, pager scripts, and the like.
Waldo uses an anomaly file to associate the occurrence of an event with some action, be it corrective, informative, or otherwise. Each record that Waldo reads from the specified log file is represented by the symbol, LINE, in the anomaly file. This may be used as part of the specified action or passed to another program or script as a parameter if desired.
The anomaly file is a four-field data file with each column separated by two colons (::). The first column contains the regular expression or search string that identifies an event (e.g., file system full). The second column is the Waldo id, which is a unique three-digit decimal number that the user chooses to identify the event. The third column is the acknowledgment timer. This is expressed in minutes and represents the time period during which this action will not be executed again, even if the event recurs. The fourth and last column is the action to be performed. A sample anomaly file is shown below:
# Filename: /var/log/anomalies
# Date: 12/29/98
# Descr: Sample use of anomaly file
#
###############################################################
# Look for log messages with a kernel tag. Use cron job to
# mail file at later time.
#
kernel::001::0:: echo LINE >> /var/adm/kernel.dat
#
# Create a list of unauthorized or denied requests for
# later mailing as reported by tcp_wrappers...and write the
# message to sysadmin as it happens.
#
(unauthor|Unauthor|refused|denied)::002::0:: echo LINE \
>> /var/adm/secure.dat; echo LINE | write sysadmin
#
# If var file system full try clearing some old log files.
# Send message to all ttys. Wait at least 30 minutes before
# doing this again. Check mail box size and shame users
# into cleaning mailboxes
#
/var file system full::245::30:: wall LINE; \
find /var/log -name "snmpd_log*" -mtime +6 -exec rm -f {} \;
du -b /var/spool/mail/* | \
sed -e 's/\([2-9][0-9]\{6,\}\)\|\([0-9]\{7,\}\)/\
Clean up your mailbox NOW please!/g' | wall
#
# Backups failed? Use a pager script to inform user fred.
# Wait an hour before sending similar pages.
#
amdump failed::734::60:: pageme.pl -u fred -m LINE
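The record layout described above can be sketched in a few lines of Python. This is only an illustration of the four-field format, not Waldo's own Perl code: the names parse_anomalies and build_command, the treatment of a trailing backslash as a continuation marker, and the shell quoting of LINE are all assumptions made for the example.

```python
import re
import shlex
from typing import NamedTuple, Optional


class Anomaly(NamedTuple):
    pattern: re.Pattern
    waldo_id: str      # three-digit id chosen by the user
    ack_minutes: int   # 0 means the action is never suppressed
    action: str        # may hold several commands separated by ';'


def parse_anomalies(text: str) -> list:
    """Parse '::'-separated records, skipping comments and blank lines."""
    records, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue
        if line.endswith("\\"):          # assumed continuation marker
            pending += line[:-1].rstrip() + " "
            continue
        pending += line
        regex, waldo_id, ack, action = pending.split("::", 3)
        records.append(Anomaly(re.compile(regex), waldo_id.strip(),
                               int(ack), action.strip()))
        pending = ""
    return records


def build_command(record: Anomaly, log_line: str) -> Optional[str]:
    """Return the action with LINE filled in, or None if no match."""
    if not record.pattern.search(log_line):
        return None
    # Quoting is a choice made for this sketch so that shell
    # metacharacters in a log message are not interpreted.
    return record.action.replace("LINE", shlex.quote(log_line))
```

Splitting with a limit of three keeps any "::" inside the action intact, at the cost of not allowing "::" inside the search expression.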
Sometimes it is necessary to have multiple anomaly files for different administrators or purposes. For instance, there may be a service running on a host, like an http server, that writes to its own set of log files. It may be necessary to monitor these logs for an entirely different set of events. Consider the following sample anomaly file:
# Filename: /var/adm/httpanom
# Descr: Sample application to http log.
###########################################################
#
# Keep a running total of all access from Lynx browsers.
# Assuming the shell is sh/ksh. Perhaps this will be used
# in a cgi script.
#
Lynx::100::0:: \
i=`cat lynx.cnt`; i=`expr $i + 1`; echo $i > lynx.cnt
#
# Send interesting messages to the console ( see http RFC)
#
# Client side
(401|403|414)::400::0:: echo LINE > /dev/console; \
echo LINE >> /var/adm/client-msgs
#
# Server side
(5[0-9][0-9])::500::0:: echo LINE > /dev/console; \
echo LINE >> /var/adm/server-msgs
In this way, Waldo may be tailored to the service being monitored. The next question is, how does Waldo know which anomaly file to use? There are a number of command-line switches available with a full explanation presented in the man pages; however, the easiest way is to use a Waldo config file--usually this is called waldo.conf. The next few paragraphs show two sample Waldo config files. The first is for a typical messages file:
# Filename: /etc/waldo.conf
###########################################################
# turn verbose messages on
verbose = 1
# define logfile to read from
logfile = /var/adm/messages
# define anomaly file
anomaly = /var/adm/anomalies
# location of waldo ack file
ackfile = /var/adm/wack
The above file sets Waldo to operate in verbose mode, causing various informative messages to be written to standard output. It is important to place the verbose setting at the top of the config file so that the settings that follow it are echoed to the screen as well. Note particularly the location of the logfile (file being read from), anomaly file (file that associates an event with an action), and the ack file (file that keeps track of the ack timers). These files are all located in the /var/adm path. These settings are typical for a systems administrator who wants to monitor syslog activity. Compare the previous /etc/waldo.conf file with the config file below, which might be typical for a Web administrator:
# Filename: /apache/adm/waldo.conf
###########################################################
# turn on verbose messages
verbose = 1
logfile = /apache/var/logs/access_log
anomaly = /apache/var/logs/anomalies
ackfile = /apache/var/wack
In this case, the log file being monitored is the Apache access log at /apache/var/logs/access_log. Because all the files being written to in the above examples (/var/adm/wack and /apache/var/wack) are in different locations, starting two instances of Waldo should present no problems. To do so, the -F option would be used as shown:
% waldo -F /etc/waldo.conf &
% waldo -F /apache/adm/waldo.conf &
In reality, the first instance could have simply been started with the command waldo because, by default, Waldo looks for a config file at /etc/waldo.conf. The second instance of Waldo also looks for /etc/waldo.conf and reads it, if it can. It then reads the /apache/adm/waldo.conf file and resets any parameters to the values found in that file. In other words, the parameters found in /etc/waldo.conf are set first. If these settings are not overridden by a user config file, they remain in effect. This makes for a convenient way to set up global configuration defaults on a host.
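The layering of a global file and a user file can be illustrated with a short Python sketch. It is not taken from Waldo; the function names are hypothetical, and the parser assumes only the simple "name = value" lines and "#" comments seen in the sample config files.

```python
from pathlib import Path


def read_config(path) -> dict:
    """Read 'name = value' lines; '#' starts a comment line."""
    settings = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def effective_config(user_file=None, global_file="/etc/waldo.conf") -> dict:
    """Global settings first, then any user file overrides them."""
    settings = {}
    for path in (global_file, user_file):
        if path is None:
            continue
        try:
            settings.update(read_config(path))
        except OSError:
            pass  # missing or unreadable file: keep what we have
    return settings
```

Because the user file is applied last, a parameter such as verbose that appears only in the global file stays in effect, while logfile is replaced by the user's value.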
If multiple administrators need to monitor the same log file but look for different events, they can each maintain their own config files, anomaly files, and ack files. In fact, they should be able to share the same ack file, since Waldo uses flock when modifying it. In that case, each administrator should use a separate set of Waldo id numbers for the events described in the individual anomaly files, to prevent someone else's instance of Waldo from resetting the wrong timer.
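A small Python sketch shows why shared timers need distinct ids. The article does not describe the ack file format, so the sketch simply keeps the timers in memory; the class and method names are hypothetical and the logic is an assumption based on the description of the acknowledgment timer.

```python
import time


class AckTimers:
    """Track, per Waldo id, the time before which an action is suppressed."""

    def __init__(self):
        self.expires = {}   # waldo id -> expiry time (seconds since epoch)

    def should_run(self, waldo_id, ack_minutes, now=None):
        """Return True if the action may run, and start a new timer."""
        now = time.time() if now is None else now
        if now < self.expires.get(waldo_id, 0):
            return False                      # still acknowledged
        if ack_minutes > 0:
            self.expires[waldo_id] = now + ack_minutes * 60
        return True

    def clear(self, waldo_id):
        """Forget a timer so the next occurrence is acted on at once."""
        self.expires.pop(waldo_id, None)
```

Since the timers are keyed only by id, two anomaly files that reuse the same id against a shared set of timers would suppress or clear each other's events.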
All this is fine, but many sites have a variety of UNIX platforms mixed in with other systems, such as Windows NT. The idea is to somehow centralize logging for easier identification and response to current or potential problems.
The first step is to configure the UNIX hosts to send system log information to a central loghost. It may be a good idea to keep local logging in place, but send a duplicate to the central loghost. This is accomplished by editing the syslog.conf file and restarting the syslogd on each host, perhaps with different options on the loghost (to enable it to accept remote log messages). Please consult your local man pages for syslog.conf and syslogd, because this can vary a bit from platform to platform. A few articles in Sys Admin have covered this particular topic, and it may be wise to consult your back issues. In general, the syslog.conf can simply be directed to forward a copy of everything of interest to the central log. For instance, given a line in the syslog.conf that looks like this:
kern.* /dev/console
Add an additional line that looks like the following:
kern.* @targethost
where targethost is the name of your central loghost. This may be applied to any facility level messages for which forwarding is desired. Generally, I take the approach of forwarding everything and then throttling back if need be. Use your judgment and consult the references as needed.
Some applications may write to a separate log file from the syslog. It may be desirable, instead of running and maintaining a separate instance of Waldo for each, to send these messages to the central syslog. There are several simple tools included in the Waldo package to help with this type of task. For instance, instead of maintaining a separate set of config files for the http server mentioned earlier, the httpd anomaly file entries could be added to those on the central syslog host anomaly file, and the application log file entries forwarded using the included log file forwarder program:
% logfwder -p local0.notice -t httpd -h targethost \
/apache/var/logs/access_log
This will send a copy of each new message written to the access_log directly to the loghost (targethost) using UDP port 514, with a tag of httpd and a priority of local0.notice. The same program can be used to forward access_log entries to the local syslog by leaving off the -h targethost option.
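The general idea of such a forwarder can be sketched in Python. This is not the logfwder source; the function names are hypothetical, and the "<PRI>tag: text" datagram is a minimal form that syslog receivers commonly accept rather than a statement of what logfwder actually sends.

```python
import socket
from logging.handlers import SysLogHandler


def read_new_lines(handle):
    """Return the complete lines appended since the last call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):      # EOF or a partial line
            handle.seek(position)        # retry the partial line later
            return lines
        lines.append(line.rstrip("\n"))


def make_packet(facility, severity, tag, text):
    """Build a '<PRI>tag: text' datagram; PRI = facility * 8 + severity."""
    priority = (facility << 3) | severity
    return f"<{priority}>{tag}: {text}".encode("utf-8", "replace")


def forward(lines, host, tag="httpd", port=514):
    """Send each line to the loghost as local0.notice over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for text in lines:
            packet = make_packet(SysLogHandler.LOG_LOCAL0,
                                 SysLogHandler.LOG_NOTICE, tag, text)
            sock.sendto(packet, (host, port))
```

A caller would open the log, seek to its end with handle.seek(0, 2), and then call read_new_lines and forward at a regular interval. UDP gives no delivery guarantee, so messages can be lost without any error being reported to the sender.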
If there are Windows NT boxes that must be maintained, it may be desirable to apply event/action scenarios to the output of the event log. There is currently no version of Waldo that runs on Windows NT, but there is a script included with the package that will forward event log messages to the central loghost. It is called evtfwder.pl and requires Perl and the Win32 module (see www.activestate.com). The following line illustrates its usage:
C:\> perl -w evtfwder.pl -p 9 -h targethost Application
This sends a copy of any new entries in the Application Event Log to targethost (there are three event files: System, Security, and Application) with a priority of user.alert (the priority must be expressed as a number in this case).
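The numeric priority follows the usual syslog encoding, facility code times eight plus severity code, which is how user.alert becomes 9. The Python sketch below works through that arithmetic with the standard facility and severity codes; the function names are hypothetical and nothing here is taken from evtfwder.pl itself.

```python
# Standard syslog facility codes; 12-15 are omitted, local0-7 are 16-23.
FACILITIES = {name: code for code, name in enumerate(
    ["kern", "user", "mail", "daemon", "auth", "syslog",
     "lpr", "news", "uucp", "cron", "authpriv", "ftp"])}
FACILITIES.update({f"local{n}": 16 + n for n in range(8)})

# Standard syslog severity codes, 0 (most urgent) to 7.
SEVERITIES = {name: code for code, name in enumerate(
    ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"])}


def priority_number(name: str) -> int:
    """Convert 'facility.severity' to the numeric syslog priority."""
    facility, severity = name.split(".")
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def priority_name(number: int) -> str:
    """Convert a numeric priority back to 'facility.severity'."""
    facility, severity = divmod(number, 8)
    by_code = {code: name for name, code in FACILITIES.items()}
    return f"{by_code[facility]}.{list(SEVERITIES)[severity]}"
```

For example, priority_number("user.alert") gives 1 * 8 + 1 = 9, and priority_number("local0.notice") gives 16 * 8 + 5 = 133.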
Waldo may be run in a server configuration, in which events are written to named pipes that are created and removed by client applications. This requires Waldo to be started with the -s command-line option or the server-related parameters, such as server=1, to be included in the waldo.conf file. This allows the use of such applications as wgclient, which is a Tcl/Tk-based graphical interface to Waldo. It consists of a text widget, which displays the messages, and an entry box for executing miscellaneous commands. The first column of the wgclient display shows the Waldo id, followed by the ack expiration time, followed by the logfile entry. Figure 1 shows a sample wgclient display.
It is possible to highlight various messages in the display with a meaningful color. For instance, specific strings can be color coded with red, yellow, or green to indicate a level of importance. Any color your display supports can be used for whatever meaning desired. This is controlled using the .wgc_profile file. The example below shows a typical .wgc_profile file:
^(9[0-9][0-9])::red
^(8[0-9][0-9])::yellow
^(7[0-9][0-9])::green
^(6[0-9][0-9])::blue
|(warn|WARN)::yellow
|(fatal|FATAL|error|ERROR|fail|FAIL)::red
|(waldo restart...)::green
|(INFORMATION|info)::blue
amanda::pink
The first few lines assign a color to a range of Waldo id numbers that appear at the beginning of the line. The | at the beginning of the next few lines means apply an OR with the other regular expressions associated with the indicated color. In other words, 800 - 899 would be yellow at the line beginning as well as any occurrence of warn or WARN.
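The way such a profile maps a display line to a color can be sketched in Python. This is not wgclient's Tcl/Tk code: the function names are hypothetical, and the sketch assumes that a leading | only marks a rule as an alternative for its color and that the first matching rule wins, neither of which the description above spells out.

```python
import re


def parse_profile(text):
    """Return (compiled regex, color) pairs in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "::" not in line:
            continue
        pattern, color = line.rsplit("::", 1)
        if pattern.startswith("|"):
            pattern = pattern[1:]        # '|' only marks an OR-ed rule
        rules.append((re.compile(pattern), color.strip()))
    return rules


def color_for(display_line, rules, default=None):
    """Pick the color of the first rule that matches the line."""
    for regex, color in rules:
        if regex.search(display_line):
            return color
    return default
```

With the sample profile, a line beginning with 845 would come out yellow through the id rule, and a line with any other id that contains WARN would come out yellow through the OR-ed rule.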
The Waldo acknowledgment capability is really designed to take advantage of the GUI client although it can be used from the command line. The GUI shows the current time just above the text window, and the text window shows the Waldo id and the ack expiration time. For instance, consider the message shown below:
528: Mon Jun 28 09:10:51 1999: Jun 28 07:10:45 host \
nfs-client nfs server not responding
Here the ack timer is two hours: the event was logged at 07:10:45 and the timer expires at 09:10:51. If the problem gets fixed before the timer expires, the timer may be cleared by typing in the client entry box:
waldo -ack 528
The entry for anomaly 528 will be removed from the ack file, and Waldo will now report any further occurrences of this event. Waldo only sends events that match regular expressions in the anomaly file to the clients. As a result, if it is desired that a message show up on the client display without taking any specific action, add a line similar to the following to the anomaly file:
swap space limit exceeded::999::0:: echo LINE > /dev/null
This causes the log file message to be sent to the client named pipe, while the action itself simply sends the message to /dev/null.
Waldo also has some support for monitoring the wtmp/utmp files. Since these files are not ASCII text, the Perl unpack function is used to translate new records into something printable. The waldo.conf parameters of utmp, template, and utmpsize may be used to tailor things to your particular utmp structure. There is built-in support for several popular UNIX platforms already. If it is necessary to tailor things a bit, the utmpsize.c file under the misc directory may be helpful. To compile it, type:
% gcc -o utmpsize utmpsize.c
Running this should spit out the size of various data types including the overall size of your system's particular utmp structure. This information, along with the utmp.h file listing, should allow you to create an appropriate template for the unpack routine.
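To show what such a template does, here is a Python sketch that uses the struct module in the same role as Perl's unpack. The layout is an assumption modelled on the glibc struct utmp found on Linux (384 bytes); other platforms differ, so treat the format string and the function names as hypothetical and check them against your own utmp.h.

```python
import struct

# Assumed layout, modelled on glibc's struct utmp on Linux; check your
# own utmp.h and the output of utmpsize before relying on it.
# type, pad, pid, line, id, user, host, exit(2), session, tv(2), addr(4), pad
UTMP = struct.Struct("<h2xi32s4s32s256s2hi2i4i20x")


def decode(field: bytes) -> str:
    """Strip NUL padding from a fixed-width character field."""
    return field.split(b"\0", 1)[0].decode("ascii", "replace")


def parse_record(record: bytes) -> dict:
    """Unpack one fixed-size utmp record into printable fields."""
    fields = UTMP.unpack(record)
    ut_type, pid, line, _id, user, host = fields[:6]
    seconds = fields[9]                 # ut_tv.tv_sec
    return {"type": ut_type, "pid": pid, "line": decode(line),
            "user": decode(user), "host": decode(host), "time": seconds}


def read_records(path):
    """Yield every complete record in a utmp/wtmp style file."""
    with open(path, "rb") as handle:
        while len(chunk := handle.read(UTMP.size)) == UTMP.size:
            yield parse_record(chunk)
```

The record size reported by the struct (UTMP.size) plays the same part as the utmpsize parameter: if it does not match the size printed for your system, the template needs adjusting.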
Installation
A Perl 5 interpreter will be needed. Binaries are available from various sources, but it is probably best to compile it yourself. See www.perl.com for source code. Follow the installation instructions for the package. If the GUI client will be used, a Tcl/Tk interpreter will also be needed. See the www.scriptics.com site for this. As stated earlier, the ActiveState site is a good place to get Perl 5 for Win32. If you intend to use the anomaly file syntax checker, download and compile the latest flex from ftp.gnu.org.
Once these items have been installed, unpack the Waldo package and check to see that the proper locations of your interpreters are shown at the top of each script that you will be using in the waldo, waldo/clients, and waldo/misc directories. Examine the Makefile in the waldo directory and make any changes needed for your site. Type make, and then as root type make install. That should do it.
Now comes the task of configuring everything for your particular needs. If you are going to run a single instance of Waldo, it will be necessary to decide under what uid it will run. I typically run it as root. This means whatever actions are set in the anomaly file are executed as root by default. This can be dangerous, particularly if the anomaly file permissions allow write access to a number of users. Waldo will not write to files that it uses for normal processing if they are hard or symbolic links. It does not check files that are written to as a result of anomaly file actions, however, so be careful.
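The same kind of link check can be applied by hand to files that anomaly actions write to. The Python sketch below is one way to express it, assuming that "hard link" means a link count greater than one; the function name is hypothetical and this is not Waldo's own test.

```python
import os
import stat


def safe_to_write(path) -> bool:
    """Reject symbolic links and files with more than one hard link."""
    try:
        info = os.lstat(path)            # lstat does not follow symlinks
    except FileNotFoundError:
        return True                      # nothing there yet to be a link
    if stat.S_ISLNK(info.st_mode):
        return False
    return stat.S_ISREG(info.st_mode) and info.st_nlink == 1
```

A check like this is subject to a race between the test and the later open, so it reduces the risk rather than removing it.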
It may be wise to run Waldo under a separate, non-privileged uid and group that has read access to the log file being monitored and write access to the ack file. Also, Waldo will refuse to read configuration files if the permissions are not exclusive enough. Waldo comes with a decent set of man pages so be sure to reference them as needed. I hope you find the Waldo package useful (see http://dogbert.inreach.com/~jaguiar for download).
References
Wall, L., T. Christiansen, and R. L. Schwartz. 1996. Programming Perl, Second Edition. O'Reilly & Associates.
Christiansen, T. and N. Torkington. 1998. Perl Cookbook. O'Reilly & Associates.
Roth, D. 1998. Win32 Perl Programming: The Standard Extensions. Macmillan Technical Publishing.
Stevens, W. R. 1992. Advanced Programming in the UNIX Environment. Addison-Wesley.
Stevens, W. R. 1998. UNIX Network Programming. Volume 1. Prentice Hall.
Friedl, J. 1997. Mastering Regular Expressions. O'Reilly & Associates.
Ousterhout, J. K. 1994. Tcl and the Tk Toolkit. Addison-Wesley.
libwin32 source code from CPAN archive, logger.c source code from www.OpenBSD.org.
About the Author
Joe Aguiar has worked as a Systems and Network Administrator and Software Engineer since 1986. He holds a BSEE degree from CSU, Fresno. He can be reached at jaguiar@inreach.com.