Versatile
Fault Management with Perl
Tim Shouldice
When I started working with "enterprise" fault management
consoles, I was surprised by how limited they were in many areas.
One of the biggest shortcomings is the difficulty of programming
them to manage events based on the service hours of the device creating
the fault. In the organizations in which I've worked, we had
many 24x7 devices, such as public Web servers and routers. However,
we also had many servers, such as internal database and file servers,
with service hours of only 8am to 5pm, Monday to Friday. Our support
staff continually dealt with alerts from these non-critical servers
at night and on weekends.
Another limitation I found was that many of these consoles required
a two-way network connection between the console and the various
management systems in order to collect the fault events. This made
it hard to implement the console for servers outside an internal
network, such as public Web servers on a demilitarized zone (DMZ).
In this article, I'll review fault management and its requirements,
then I'll show how to build a complete solution using my favorite
language -- Perl.
What Is a Fault Management System?
A vast array of monitoring tools is available for systems administrators.
Some common ones include open source tools such as Scotty (http://www.ibr.cs.tu-bs.de/projects/scotty)
and Mon (http://www.kernel.org/software/mon), in-house monitoring
scripts, and commercial products such as BMC Patrol, MKS Toolkit,
and HP Openview. When a monitoring tool detects a fault, it usually
sends an alert. Most of the tools can be configured or customized
to send the alert to a fault management system. Many monitoring
tools are capable of performing management activities on faults
prior to issuing alerts, but the job of a dedicated fault management
system is to manage alerts coming from either multiple monitoring
tools or performing management actions that are not easily accomplished
with the monitoring tool.
Fault management systems are usually centralized in such a way
that all events from one monitoring system are sent to only one
fault management system. However, many organizations have multiple
fault management systems, each receiving events from different monitoring
systems. This usually occurs where there is little benefit in integrating
the separate fault management systems because the events managed
by them are unlikely to ever need to be correlated (e.g., with one
console managing an organization's mainframes and another managing
the network and servers).
Commercial Fault Management Products
Vendors who sell multiple monitoring tools recognize the need
for a dedicated fault management system. The need to automate monitoring
and notification activities requires events from multiple sources
to be properly managed and correlated. The big four are:
- Tivoli Enterprise Console
- BMC Enterprise Manager
- CA Unicentre TNG
- HP Openview
Although these products perform well, there will always be areas
where they will not meet your organization's exact requirements.
Additionally, they can be fairly expensive. However, there is an
alternative.
A Perl Solution
One thing that makes these commercial products expensive is the
user interface. The interface is used to display the events to operators
and provide an interface for administrators to graphically create
the processing rules. I think that a fault management system should
have no interface of its own and that it should direct its output
(such as notification or problem management systems) to other systems.
I also find that no matter how flexible a graphical rule builder
is meant to be, it is no substitute for the power of a real programming
language. In the rest of this article, I will present a fault management
system called fault_manager that is written in Perl (Listing 1).
It performs all the essential functions of a robust fault management
system.
Sending the Alerts
The first step in building the fault management system is to establish
a standardized and reliable way to communicate the events from the
various monitoring systems to the fault management system. I chose
a direct and reliable route -- FTP. Each monitoring system is
modified to call an external program when they have an event to
communicate. This program then creates a text file and sends the
text file to the fault management system using anonymous FTP.
This approach pushes the events from each monitoring system to
the central fault management system, which provides a solution to
moving events out of secured areas. I usually place the fault system
on a dedicated server on a non-Internet routable portion of a DMZ.
This allows the firewall to be configured to route the FTP events
from all areas to the fault management server without any other
risks to either the server or the managed hosts. Figure 1 illustrates
this concept.
I usually use wu-ftpd (http://www.wu-ftpd.org) as the anonymous
FTP server. It has many security features over the standard FTP
servers included with most versions of UNIX. The other benefit of
using FTP as the communications mechanism is that wu-ftpd maintains
detailed logs. If there is a question of whether an alert was sent
or received by the fault management server, the xferlog can be consulted
to verify transmissions.
I've modified all sorts of monitoring systems to FTP alerts
using all types of languages. I first modify the alerting mechanism
of the monitoring tool to create a text file that looks like this
(examples are included in the source code):
Fault_Manager 1.0
Thu Feb 7 15:25:45 2002
Alert_Class.Alert_Instance.Alert_Parameter
Server1
Batch job #12 on server1 failed.
WARNING
Each fault management system uses some type of standard format for
their events. The format used by fault_manager and shown in the file
above is as follows:
Fault_Manager 1.0 -- A check string to prevent erroneous
files FTP'd into the directory from being processed
Thu Feb 7 15:25:45 2002 -- The date/time the event
occurred
Alert_Class -- A high-level class given to the event
(i.e., BATCH_JOB)
Alert_Instance -- A middle-level class given to an
instance of the class (i.e., JOB#12)
Alert_Parameter -- The parameter of the instance (i.e.,
Job_Status)
Server1 -- The host/device on where the event occurred
Batch job #12 on server1 failed. -- A free-text description
of the event
WARNING -- The severity of the event
Next, the notify.patrol.pl script sends the file and looks like
this:
#!/bin/perl
use Net::FTP;
# Set initial variables
$FTP_HOST="fault_server@your_domain.com";
$FTP_USER="anonymous";
$FTP_PASSWD="user@your_domain.com";
$FTP_DIR="/fault";
$FILENAME="/tmp/alert_file ";
$random=int(rand(1000));
$OUTFILE=$HOST.".".$random;
$ftp=Net::FTP->new("$FTP_HOST") or warn "Couldn't connect to $FTP_HOST\n";
$ftp->login("$FTP_USER","$FTP_PASSWD") or warn "Couldn't login\n";
$ftp->cwd("$FTP_DIR");
$ftp->put($FILENAME, $OUTFILE) or die "Couldn't put file\n";
$ftp->quit() or warn "Couldn't quit\n";
I use the Trap Demon provided in the NET-SNMP toolkit (http://net-snmp.sourceforge.net/)
to transmit an SNMP trap. I install the demon on the server where
the traps are being sent, and then use the configuration file to specify
trap handlers to hand off the trap to a shell script similar to the
examples above. This will create and send an appropriate alert file
from the trap.
I have written Patrol PSL code that works in conjunction with
an external Perl program to transmit a BMC Patrol alert. The code
can be found in the source code files. There is also an example
of transmitting an event from a shell script in the source code
files.
How fault_manager Works
The main routine of fault_manager is a loop that looks like this:
until($time_to_die) {
opendir(DIR, $dir) or warn "Cannot open $dir\n";
$event=readdir(DIR);
while (length($event) > 0) {
if (($event ne ".") and ($event ne "..")) {
GetEvent();
ApplyRules();
PushOntoQueue();
unlink($event);
NotifyPending();
}
$event=readdir(DIR);
}
sleep 10;
}
The key to how the program works is that after applying the rules
to the event, it puts the event into a queue of open events. Then,
it loops through this queue to perform any required notifications.
The reasons for the queue are threefold:
1. After processing, each event is pushed onto the event queue.
Each time the program runs through the queue, it checks the time
and compares it against the service hours of the device to see whether
the device is in its service-hour time period. If it is and it has
not already been notified, then it gets notified. If it isn't,
it remains on the queue and is rechecked each time through until
the time reaches the start of the service-hour time period, at which
point the event is notified. For example, an event from an 8-to-5
device could arrive on Friday night and sit on the queue until Monday
morning at 9am, when it would enter the service-hour time period
and get notified.
2. The queue is used by correlation rules. The correlation rules
loop through the contents of the open event queue looking for correlating
events. Because event correlation always occurs within a certain
time window, the script is set to drop notified events from the
queue after 30 minutes so that new events do not get correlated
with old, unrelated events. Additionally, events cannot be kept
on the queue indefinitely as the queue would continually grow in
size. Another approach is to add time logic to the correlation rules
and keep the events on the queue either for a longer period of time
or keep a fixed number of events on the queue and drop each one
off as a new one is put on. A queue of the last 100 events is adequate
for most correlation rules.
3. Some monitoring systems send an event twice -- once when
the parameter goes into alarm and then again when the parameter
goes out of alarm. I call these types of events "normal"
events. If a normal event arrives, the event queue is searched;
and if the initial alarm event is found, then it is removed. This
allows events to come in during non-service hours. Then, if a normal
event comes in, the original event gets deleted and support staff
will never receive an alert.
Be aware of some potential problems with this approach. If an
event such as a filesystem alert goes away in the middle of the
night, there is little support staff can do when the alert comes
in at 9am in the morning. By removing the alert, staff is not bothered
by nuisance events. However, it is very important to set appropriate
warning thresholds so that when there is a problem, the event remains
open.
On the other hand, some events may come and go, yet still be critically
important. One example is a server that seems to be unavailable
and then available again. If the problem was a network service outage,
then it is appropriate that the normal event removes the down event.
However, if the server was unavailable because it rebooted itself,
this could be indicative of intermittent hardware failure and be
something that support staff should know about. To solve this for
one client, I implemented the Mon reboot alert (which has no corresponding
"normal" event). So, if a server reboots in the middle
of the night, the availability alerts cancel themselves out but
the reboot alert remains open. For cases where servers are rebooted
for maintenance and no alerts are desired, this is handled by stopping
and restarting monitoring (which is covered in a later section).
In the fault_manager script, each event is put into a standard
Perl array, and each element of the array is an attribute of the
event. This is handled by the GetEvent() subroutine. After the event
rules are processed by the ApplyRules() subroutine, the event gets
pushed on the event queue by the PushOntoQueue() subroutine. The
event queue is a two-dimensional array.
When the fault event first arrives, it is logged. After all the
rules are applied and some events dropped, the event is logged again
when it is notified. Reports on numbers of faults received vs. number
of problems alerted can be of interest. In some cases, management
may also want to see reports broken down by applications, instances,
hosts, or some combination. If this is the case, the information
should be logged in a database. The choice of database, along with
the choice of reporting tool, depends on your organization's
standards. If the logs are not going to be reported on, then the
information is best logged into a simple text file. The code in
fault_manager uses the Perl DBI interface to log to an ASCII text
file, but it could be changed to use any of the databases supported
by the DBI.
The event processing rules are each written as a subroutine in
the main processing loop. Each rule has some standards. First, a
previous rule may have indicated that the event should be dropped,
which is done by setting the global variable $CONTINUE to
"false". At the start of each subroutine, the rule should
check for status of the $CONTINUE variable and process only
if it is true. Second, each rule should log the results of its decision
into a separate log called the Rule_Trace_Log, which is a text file
containing the decisions of each rule acting on an event. This information
is important when trying to determine why an event was processed
in a certain way.
The process of rule tracing can write a large amount of information
to the Rule_Trace_Log, so two flag variables are set up to govern
whether the logging occurs. First, there is a global variable, $GLOBAL_RULE_TRACE,
which will turn on rule trace logging within all rules. This variable
is turned off by default. Second, each rule should have a
local flag variable, $LOCAL_RULE_TRACE, which is turned on
by default. After the rule is fully tested and debugged, this local
trace flag could be turned off. If the event rule determines the
event should be dropped, it sets the $CONTINUE flag to be
false. When the section of code executes (which would push the event
onto the notification queue), it will drop the event if the $CONTINUE
flag is false. This flag is reset when the next event is read into
the program.
Here is a sample rule that filters events based on their application.
If the application is not in the configuration file, then the alert
is dropped. This works as a preventive measure in case administrators
accidentally turn on new alert types:
sub Check_Application() {
my $LOCAL_RULE_TRACE=1;
my $FOUND_FLAG=0;
# Event filtering based on event class listed in a text configuration file
# Check that the class is supposed to be alerted
# Note there is no rule logging if the event is not dropped
# as this test is applied to all events and logging a
# negative result would not be informative
if($CONTINUE) {
open(CONF, "< $application_list_file") or warn \
"Cannot open $application_list_file: $!\n";
@lines=<CONF>;
close(CONF);
foreach $line (@lines) {
chomp($line);
if ($line eq $fault[$_fault_application]) {
$FOUND_FLAG=1;
last;
}
}
if (not $FOUND_FLAG) {
$CONTINUE=0;
if(($GLOBAL_RULE_TRACE) or ($LOCAL_RULE_TRACE)) {
Log_Rule_Trace("Application $fault[$_fault_application]
is not in the list of applications, dropping alert.");
}
}
}
}
Event Notification and Problem Management Integration
Events can be notified in several ways. The three most common
are paging, email, and integration with a problem management system.
It is my belief that the fault management system should not provide
a visual console where events are viewed by support staff. Once
an event has been processed by a fault management system, then the
result is a problem, and problems should be managed by a dedicated
problem management system. These systems excel at things like logging
activities as work on the problem progresses, allowing transfer
of the problem to different groups, and escalating the problem if
it is not quickly solved. These systems also have extensive reporting
mechanisms. Most large IT support organizations revolve their operations
around the problem management system. For small organizations without
a problem management system, paging and emailing, along with reports
from the logs, may be sufficient -- until management starts
asking more questions.
Communicating the event to the problem system in some manner is
necessary in order to integrate with a problem management system.
Some problem systems can accept problem tickets via an SMTP interface,
which works well even if the server is on a DMZ (see Figure 1).
However, if the problem management integration requires a direct
network connection, then the location of the event server could
pose a problem. One way to solve the problem is to have the event
notification write the problem information into another file in
a separate directory on the event server. Then, you could have a
Perl program running on the problem management server that connects
to the event server via anonymous FTP every 15 or 30 seconds that
gets these problem record files. This program would then initiate
the connection to the problem management system to create the problem
tickets. I've done this, and it works quite well.
Determining which system will handle the other forms of notification
(paging and emails) is done when fault_manager integrates with a
problem management system. Some problem management systems excel
at this, others do not perform it at all. In some cases, the notifications
are split with one system notifying some people or groups, and the
other notifying other people or groups. We must also define which
system will determine the support group to which the problem will
be assigned. Some problem management systems are configured to do
this automatically based on the host/device and type of problem.
Others need the support group defined when the problem ticket is
raised.
The notification actions can either be written into the fault_manager
application or put in a separate script. I have put the notifications
into a separate script, which helps provide some logical separation
between the functions. This is completely arbitrary, and it could
be argued that it's not worth shelling out to a separate process
to achieve this separation.
The notification script that I've included provides a highly
flexible way to map events to support groups. The details are included
in the comments section of the script. Once the notification script
looks up the group to notify, it always raises a problem ticket.
Then, it pages and emails if the event is a FATAL-level event, and
sends only email if it's a WARNING-level event.
Paging
The paging is done by shelling out to the qpage (http://www.qpage.org)
snpp (simple network paging protocol) client. Snpp is the best way
to manage paging. The client makes the request and a dedicated snpp
server handles the work of managing the modems, queuing requests
for the modem, and handling retries. I recommend installing the
snpp server and modem on the same system as fault_manager because
if there is a network problem, the snpp request may not make it
to another server, and there will be no notification.
The qpage client then uses its configuration file to look up the
necessary pager service provider and pager pin number information.
The configuration file can also page multiple people within the
group, or page different people based on service hours. There is
another excellent snpp client/server package, Quickpage (http://sendpage.cpoint.net),
which is written in Perl. Its client program is written in Perl
and could be completely written inside the notification script,
thus eliminating the need to shell out the snpp client. Note that
Quickpage needs the group's pager number and PIN number passed
on the command line. This information must be maintained in the
notify.conf file along with other group information and looked up
before a page is issued:
sub page {
my $Group_Name = $_[0];
my $msg = "$Host: $Event_Class alert on $Event_Instance";
$return=system("/opt/bin/qpage -p $Group_Name $msg");
}
Emailing
The emailing is done through Sendmail. If the fault_manager script
is modified to run on NT, the Blat Sendmail-like program can be used.
Some of my clients use MS Exchange or Lotus Notes as their internal
email systems. In these cases, I configured a Sendmail alias for each
group and have Sendmail forward the email to their Exchange or Notes
SMTP email address:
sub email {
# Subroutine to email - make sure path to your sendmail is correct!
my $Group_Name = $_[0];
my $msg="\"$Msg\"";
my $to="$Group_Name";
my $subject="$Host: $Event_Class alert on $Event_Instance";
open(SENDMAIL, "|/usr/lib/sendmail -oi -t")
or warn "Cannot fork for sendmail: $!\n";
print SENDMAIL <<"EOF";
From: Fault_manager
To: $to
Subject: $subject
$msg
EOF
close(SENDMAIL) or warn "Sendmail didn't close nicely";
}
The problem management integration portion is specific to each organization
and problem management system. I have written integration to the Remedy
ARS system using the ARSPerl modules (http://arsinfo.cit.buffalo.edu/perl),
to the Peregrine ServiceCenter using the ServiceCenter SDK, and even
to the mainframe-based INFOMAN system using FTP and the client's
INFOMAN's batch API.
Setting Up the System
To get fault_manager set up and running in your environment, it
is necessary to do the following:
1. Perform an analysis of your organization's monitoring
requirements and support structure.
2. Configure your existing or new monitoring system(s) to send
events via FTP in the form expected by fault_manager.
3. Install the fault_manager script in the directory of your choice.
4. Set the paths within the fault_manager script.
5. Modify existing rules to suit your environment, remove rules
that don't, and write new ones as required.
6. If required, install the database for event logging and modify
the logging subroutine accordingly.
7. Configure the notification script.
8. Install and configure qpage.
9. Write startup scripts to automatically start fault_manager.
10. Generate test alerts and view results and rule trace logs
to ensure that everything is working as planned.
11. Write and test problem management system integration.
12. When everything is working properly, move into production.
Helper Scripts
During evenings and weekends, staff members make changes to the
system as part of planned maintenance activities. During these activities,
it is necessary to stop alerts from being falsely generated. The
easiest way to do this is to leave the monitoring systems alone
(i.e., let them go into alert) and configure the fault management
system to drop the alerts. I have included a simple script that
is executed by staff to disable monitoring. Staff members can log
in to the fault manager server and execute monitoring_stop hostname.
The script then adds the hostname to a configuration file. A rule
reads this configuration file for each incoming event and drops
the event if the host is in the file.
I think having staff members use this script is more effective
than asking them to edit the configuration file itself. I ensure
the script is in their path, so they don't need to know where
either it or the configuration file is located. I also have the
script send out an email to an administrator indicating that monitoring
has been turned off. When the maintenance is done, staff members
need to log in again and execute another script, monitoring_start
hostname, to re-enable alerting.
If a staff member forgets to restart monitoring, then important
events can get missed. To prevent this, I added a warning to the
monitoring_stop script. I also wrote a small script run by cron
each morning that emails out a list of hosts that have monitoring
disabled. This helps catch anything missed:
#!/bin/perl
# Get the host from the command line
$host=$ARGV[0];
$user='id';
$start=index($user, "(");
$end=index($user, ")");
$user=substr($user, $start+1, $end-$start-1);
if ($user eq 'root') {
print "You cannot run this as root, su to yourself and try again...\n";
exit;
}
# Read in the existing file of stopped hosts
$host_list_file = '/opt/bin/alert/hosts.conf';
open(CONF, "< $host_list_file") or warn "Cannot open $host_list_file:$!\n";
@lines=<CONF>;
close(CONF);
$FOUND_FLAG=1;
# Check if the host is already stopped
foreach $line (@lines) {
chomp($line);
if ($line eq $host) {
print "Monitoring already stopped for $host.\n";
$FOUND_FLAG=0;
last;
}
}
# Add host to list and rewrite file
if (not $FOUND_FLAG) {
push @lines, $host;
open(CONF, "> $host_list_file") or warn "Cannot open $host_list_file for writing:!\n";
foreach $line (@lines) {
print CONF "$line\n";
}
close(CONF);
# Send email to notify that monitoring was stopped
open(SENDMAIL, "|/usr/lib/sendmail -oi -t") or warn "Cant fork for sendmail: $!\n";
print SENDMAIL <<"EOF";
From: Fault_manager
To: root
Subject: Monitoring stopped for $host
Monitoring has been stopped for $host by $user.
EOF
close(SENDMAIL) or warn "Sendmail didn't close nicely";
}
print "Mnitoring has now been stopped for $host, please remember \
to restart monitoring when you are finished.\n";
I've also had cron run these scripts for those clients who say,
"it's a 24x7 system, except for the 10-minute window each
night between 3:40 and 3:50 when we reboot the system." To handle
this, I use cron to turn monitoring off then back on immediately before
and after the reboot.
Redundant and Fault-Tolerant System
In a large organization, a fault management system is a very important
component of the infrastructure. Such as system should have some
level of fault-tolerance and redundancy. From a fault-tolerance
perspective, the system should be set up on fault-tolerant hardware.
From a redundancy perspective, it is possible to set up a completely
separate server running all the same programs (ideally located on
a different part of the network). To make this work, the monitoring
system's FTP script that sends the alerts to the fault management
server needs to be modified. The script needs to detect the failure
to make an FTP connection. When this happens, the script could attempt
an FTP connection to the secondary server. The secondary server
would then receive and process the event.
When the secondary system starts processing events, the event
and rule trace logs will be spread across two systems. These will
need to be later re-joined for comprehensive reporting. It is also
important for both systems to contain exactly the same versions
of the scripts and the configuration files. It is rare that the
secondary server would be used, but a fully redundant system is
quite simple to set up and administer.
Conclusion
The programs outlined in this article are in production for real
clients, managing their environments 24 hours a day. For one client,
I duplicated the functionality of their expensive name-brand system,
and over a weekend replaced it with this system. At first, nobody
noticed. Soon, management and support staff noticed when previously
impossible things became possible and the number of nuisance events
dropped. Later, the accountants noticed when they had a significantly
lower software maintenance contract to sign.
In most large IT organizations, change seems to be the only constant
thing. The fault management system is subject to ongoing changes
as the support environment changes -- new procedures are added,
support duties get split, new monitoring systems are implemented,
and management makes new requests. It is good to know that your
Perl solution is ready to handle any such future request.
Tim Shouldice is a computer consultant living in Ottawa, Canada.
He specializes in designing and implementing monitoring, fault management,
and performance management systems. He can be reached by email at:
tims@rogers.com.
|