
Versatile Fault Management with Perl

Tim Shouldice

When I started working with "enterprise" fault management consoles, I was surprised by how limited they were in many areas. One of the biggest shortcomings is the difficulty of programming them to manage events based on the service hours of the device creating the fault. In the organizations in which I've worked, we had many 24x7 devices, such as public Web servers and routers. However, we also had many servers, such as internal database and file servers, with service hours of only 8am to 5pm, Monday to Friday. Our support staff continually dealt with alerts from these non-critical servers at night and on weekends.

Another limitation I found was that many of these consoles required a two-way network connection between the console and the various management systems in order to collect the fault events. This made it hard to implement the console for servers outside an internal network, such as public Web servers on a demilitarized zone (DMZ). In this article, I'll review fault management and its requirements, then I'll show how to build a complete solution using my favorite language -- Perl.

What Is a Fault Management System?

A vast array of monitoring tools is available for systems administrators. Some common ones include open source tools such as Scotty (http://www.ibr.cs.tu-bs.de/projects/scotty) and Mon (http://www.kernel.org/software/mon), in-house monitoring scripts, and commercial products such as BMC Patrol, MKS Toolkit, and HP OpenView. When a monitoring tool detects a fault, it usually sends an alert. Most of the tools can be configured or customized to send the alert to a fault management system. Many monitoring tools are capable of performing management activities on faults prior to issuing alerts, but the job of a dedicated fault management system is either to manage alerts coming from multiple monitoring tools or to perform management actions that are not easily accomplished within the monitoring tool.

Fault management systems are usually centralized in such a way that all events from one monitoring system are sent to only one fault management system. However, many organizations have multiple fault management systems, each receiving events from different monitoring systems. This usually occurs where there is little benefit in integrating the separate fault management systems because the events managed by them are unlikely to ever need to be correlated (e.g., with one console managing an organization's mainframes and another managing the network and servers).

Commercial Fault Management Products

Vendors who sell multiple monitoring tools recognize the need for a dedicated fault management system. The need to automate monitoring and notification activities requires events from multiple sources to be properly managed and correlated. The big four are:

Tivoli Enterprise Console
BMC Enterprise Manager
CA Unicenter TNG
HP OpenView

Although these products perform well, there will always be areas where they will not meet your organization's exact requirements. Additionally, they can be fairly expensive. However, there is an alternative.

A Perl Solution

One thing that makes these commercial products expensive is the user interface. The interface is used to display the events to operators and to let administrators create the processing rules graphically. I think that a fault management system should have no interface of its own and that it should direct its output to other systems (such as notification or problem management systems). I also find that no matter how flexible a graphical rule builder is meant to be, it is no substitute for the power of a real programming language. In the rest of this article, I will present a fault management system called fault_manager that is written in Perl (Listing 1). It performs all the essential functions of a robust fault management system.

Sending the Alerts

The first step in building the fault management system is to establish a standardized and reliable way to communicate the events from the various monitoring systems to the fault management system. I chose a direct and reliable route -- FTP. Each monitoring system is modified to call an external program when it has an event to communicate. This program then creates a text file and sends the text file to the fault management system using anonymous FTP.

This approach pushes the events from each monitoring system to the central fault management system, which provides a solution to moving events out of secured areas. I usually place the fault system on a dedicated server on a non-Internet routable portion of a DMZ. This allows the firewall to be configured to route the FTP events from all areas to the fault management server without any other risks to either the server or the managed hosts. Figure 1 illustrates this concept.

I usually use wu-ftpd (http://www.wu-ftpd.org) as the anonymous FTP server. It has many more security features than the standard FTP servers included with most versions of UNIX. Another benefit of using FTP as the communications mechanism is that wu-ftpd maintains detailed logs. If there is a question of whether an alert was sent or received by the fault management server, the xferlog can be consulted to verify transmissions.

I've modified all sorts of monitoring systems to FTP alerts using all types of languages. I first modify the alerting mechanism of the monitoring tool to create a text file that looks like this (examples are included in the source code):

Fault_Manager 1.0
Thu Feb  7 15:25:45 2002
Alert_Class.Alert_Instance.Alert_Parameter
Server1
Batch job #12 on server1 failed.
WARNING

Each fault management system uses some type of standard format for its events. The format used by fault_manager and shown in the file above is as follows:

Fault_Manager 1.0 -- A check string to prevent erroneous files FTP'd into the directory from being processed

Thu Feb 7 15:25:45 2002 -- The date/time the event occurred

Alert_Class -- A high-level class given to the event (e.g., BATCH_JOB)

Alert_Instance -- A middle-level class given to an instance of the class (e.g., JOB#12)

Alert_Parameter -- The parameter of the instance (e.g., Job_Status)

Server1 -- The host/device on which the event occurred

Batch job #12 on server1 failed. -- A free-text description of the event

WARNING -- The severity of the event
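
As an illustration of the format above, here is a minimal sketch of how a monitoring script might build the alert text before handing it to the FTP step. The subroutine name format_alert and its argument order are assumptions made for this example; they are not part of fault_manager.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the six-line alert text described above.
# The name and the argument order are illustrative only.
sub format_alert {
    my ($time, $class, $instance, $parameter, $host, $text, $severity) = @_;
    return join("\n",
        "Fault_Manager 1.0",               # check string
        scalar(localtime($time)),          # date/time the event occurred
        "$class.$instance.$parameter",     # class, instance, and parameter
        $host,                             # host/device that raised the event
        $text,                             # free-text description
        $severity,                         # severity of the event
    ) . "\n";
}

# Example: print the alert for a failed batch job
print format_alert(time, "BATCH_JOB", "JOB#12", "Job_Status",
                   "Server1", "Batch job #12 on server1 failed.", "WARNING");
```

Keeping the formatting in one subroutine means every monitoring script on a host produces the same layout, which is what the check string on the first line is meant to guard.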

Next, the notify.patrol.pl script sends the file. It looks like this:

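#!/bin/perl
# Send an alert file to the fault management server using anonymous FTP

use Net::FTP;
use Sys::Hostname;

# Set initial variables (single quotes stop Perl from interpolating the @)

$FTP_HOST='fault_server.your_domain.com';
$FTP_USER='anonymous';
$FTP_PASSWD='user@your_domain.com';
$FTP_DIR='/fault';
$FILENAME='/tmp/alert_file';
# Name the remote file after this host, plus a random suffix to reduce
# the chance of overwriting another alert from the same host
$HOST=hostname();
$random=int(rand(1000));
$OUTFILE=$HOST.".".$random;
# Without a connection or login nothing else can work, so stop on failure
$ftp=Net::FTP->new($FTP_HOST) or die "Couldn't connect to $FTP_HOST\n";
$ftp->login($FTP_USER,$FTP_PASSWD) or die "Couldn't login\n";
$ftp->cwd($FTP_DIR) or die "Couldn't change to $FTP_DIR\n";
$ftp->put($FILENAME, $OUTFILE) or die "Couldn't put file\n";
$ftp->quit() or warn "Couldn't quit\n";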

To transmit an SNMP trap, I use the trap daemon (snmptrapd) provided in the NET-SNMP toolkit (http://net-snmp.sourceforge.net/). I install the daemon on the server where the traps are being sent, and then use the configuration file to specify trap handlers that hand off the trap to a shell script similar to the examples above. The script creates and sends an appropriate alert file from the trap.

I have written Patrol PSL code that works in conjunction with an external Perl program to transmit a BMC Patrol alert. The code can be found in the source code files. There is also an example of transmitting an event from a shell script in the source code files.

How fault_manager Works

The main routine of fault_manager is a loop that looks like this:

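until($time_to_die) {
      opendir(DIR, $dir) or warn "Cannot open $dir\n";
      # Process every event file currently waiting in the directory
      while (defined($event=readdir(DIR))) {
            if (($event ne ".") and ($event ne "..")) {

                  GetEvent();
                  ApplyRules();
                  PushOntoQueue();

                  unlink($event);
            }
      }
      closedir(DIR);
      # Check the queue on every pass, so that events held until service
      # hours are notified even when no new events have arrived
      NotifyPending();
      sleep 10;
}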

The key to how the program works is that after applying the rules to the event, it puts the event into a queue of open events. Then, it loops through this queue to perform any required notifications. The reasons for the queue are threefold:

1. After processing, each event is pushed onto the event queue. Each time the program runs through the queue, it checks the current time against the service hours of the device that raised the event. If the device is within its service hours and the event has not already been notified, then the event gets notified. If not, the event remains on the queue and is rechecked on each pass until the start of the next service-hour period, at which point the event is notified. For example, an event from an 8-to-5 device could arrive on Friday night and sit on the queue until Monday morning at 8am, when it would enter the service-hour time period and get notified.

2. The queue is used by correlation rules. The correlation rules loop through the contents of the open event queue looking for correlating events. Because event correlation always occurs within a certain time window, the script is set to drop notified events from the queue after 30 minutes so that new events do not get correlated with old, unrelated events. Additionally, events cannot be kept on the queue indefinitely, as the queue would continually grow in size. Another approach is to add time logic to the correlation rules and either keep the events on the queue for a longer period of time or keep a fixed number of events on the queue, dropping the oldest one as each new one is put on. A queue of the last 100 events is adequate for most correlation rules.

3. Some monitoring systems send an event twice -- once when the parameter goes into alarm and then again when the parameter goes out of alarm. I call these types of events "normal" events. If a normal event arrives, the event queue is searched; and if the initial alarm event is found, then it is removed. This allows events to come in during non-service hours. Then, if a normal event comes in, the original event gets deleted and support staff will never receive an alert.
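
The first and third uses of the queue can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: events are shown as hashes for readability (fault_manager itself uses arrays), and the %service_hours table, the field names, and the subroutine names are all hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(mktime);

# Hypothetical service-hours table:
# host => [first day, last day, start hour, end hour]
# Days follow localtime(): 0 = Sunday ... 6 = Saturday.
my %service_hours = (
    server1   => [1, 5, 8, 17],    # Monday to Friday, 8am to 5pm
    webserver => [0, 6, 0, 24],    # 24x7
);

# Return 1 if $time falls inside the service hours of $host, else 0.
# Unknown hosts are treated as 24x7 so that no alert is silently held.
sub in_service_hours {
    my ($host, $time) = @_;
    my $hours = $service_hours{$host} or return 1;
    my ($first_day, $last_day, $start, $end) = @$hours;
    my ($hour, $wday) = (localtime($time))[2, 6];
    return ($wday >= $first_day && $wday <= $last_day
         && $hour >= $start     && $hour <  $end) ? 1 : 0;
}

# Return the queued events that are due for notification at $time
sub events_to_notify {
    my ($queue, $time) = @_;
    return grep { !$_->{notified} && in_service_hours($_->{host}, $time) } @$queue;
}

# Remove from the queue the open alarm that a "normal" event cancels
sub cancel_alarm {
    my ($queue, $normal) = @_;
    @$queue = grep { !($_->{host} eq $normal->{host}
                    && $_->{key}  eq $normal->{key}) } @$queue;
    return scalar(@$queue);
}

# Example: an alert from an 8-to-5 server arrives on Friday night
my @queue = ({ host => "server1", key => "BATCH_JOB.JOB#12.Job_Status", notified => 0 });
my $friday_night   = mktime(0, 0, 22, 8, 1, 102);    # Fri Feb  8 22:00 2002
my $monday_morning = mktime(0, 0, 8, 11, 1, 102);    # Mon Feb 11 08:00 2002
my @due_friday = events_to_notify(\@queue, $friday_night);
my @due_monday = events_to_notify(\@queue, $monday_morning);
printf "Due on Friday night: %d, due on Monday morning: %d\n",
       scalar(@due_friday), scalar(@due_monday);
```

Because the service-hour test takes the time as an argument rather than reading the clock itself, the same subroutine can be exercised with any date when testing new service-hour definitions.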

Be aware of some potential problems with this approach. If an event such as a filesystem alert goes away in the middle of the night, there is little support staff can do when the alert comes in the next morning. Removing the alert spares staff from nuisance events. However, it is very important to set appropriate warning thresholds so that when there is a problem, the event remains open.

On the other hand, some events may come and go, yet still be critically important. One example is a server that seems to be unavailable and then available again. If the problem was a network service outage, then it is appropriate that the normal event removes the down event. However, if the server was unavailable because it rebooted itself, this could be indicative of intermittent hardware failure and be something that support staff should know about. To solve this for one client, I implemented the Mon reboot alert (which has no corresponding "normal" event). So, if a server reboots in the middle of the night, the availability alerts cancel themselves out but the reboot alert remains open. When servers are rebooted for maintenance and no alerts are desired, monitoring can be stopped and restarted (which is covered in a later section).

In the fault_manager script, each event is put into a standard Perl array, and each element of the array is an attribute of the event. This is handled by the GetEvent() subroutine. After the event rules are processed by the ApplyRules() subroutine, the event gets pushed on the event queue by the PushOntoQueue() subroutine. The event queue is a two-dimensional array.
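
To make the data structures concrete, here is a rough sketch of parsing an alert into an event array and pushing it onto a two-dimensional queue. It is not the GetEvent() code from the listing. The index variable $_fault_application appears in fault_manager; the other index names, the parse_event name, and the mapping of the alert class to the application field are assumptions.

```perl
use strict;
use warnings;

# Index names for the event array. Only $_fault_application is taken from
# fault_manager; the remaining names are assumed for this example.
my ($_fault_time, $_fault_application, $_fault_instance,
    $_fault_parameter, $_fault_host, $_fault_text, $_fault_severity) = (0 .. 6);

my @queue;    # the open event queue: an array of array references

# Parse the text of one alert file into an event array.
# Returns an empty list if the check string is missing or lines are absent.
sub parse_event {
    my ($text) = @_;
    my ($check, $time, $name, $host, $msg, $severity) = split /\n/, $text;
    return () unless defined $check && $check eq "Fault_Manager 1.0";
    return () unless defined $severity;
    my ($class, $instance, $parameter) = split /\./, $name, 3;
    return ($time, $class, $instance, $parameter, $host, $msg, $severity);
}

# Example: parse the sample alert and push it onto the queue
my $text = join("\n", "Fault_Manager 1.0", "Thu Feb  7 15:25:45 2002",
                "BATCH_JOB.JOB#12.Job_Status", "Server1",
                "Batch job #12 on server1 failed.", "WARNING") . "\n";
my @fault = parse_event($text);
push @queue, [ @fault ] if @fault;
print "Queued $queue[0][$_fault_severity] event from $queue[0][$_fault_host]\n" if @queue;
```

Each row of the queue is one event, and each column is one attribute, so a rule can address any attribute of any open event as $queue[$row][$index].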

When the fault event first arrives, it is logged. After all the rules are applied and some events dropped, the event is logged again when it is notified. Reports on numbers of faults received vs. number of problems alerted can be of interest. In some cases, management may also want to see reports broken down by applications, instances, hosts, or some combination. If this is the case, the information should be logged in a database. The choice of database, along with the choice of reporting tool, depends on your organization's standards. If the logs are not going to be reported on, then the information is best logged into a simple text file. The code in fault_manager uses the Perl DBI interface to log to an ASCII text file, but it could be changed to use any of the databases supported by the DBI.

The event processing rules are each written as a subroutine called from the main processing loop. Each rule follows some standards. First, a previous rule may have indicated that the event should be dropped, which is done by setting the global variable $CONTINUE to "false". At the start of each subroutine, the rule should check the status of the $CONTINUE variable and process only if it is true. Second, each rule should log the results of its decision into a separate log called the Rule_Trace_Log, which is a text file containing the decisions of each rule acting on an event. This information is important when trying to determine why an event was processed in a certain way.

The process of rule tracing can write a large amount of information to the Rule_Trace_Log, so two flag variables are set up to govern whether the logging occurs. First, there is a global variable, $GLOBAL_RULE_TRACE, which will turn on rule trace logging within all rules. This variable is turned off by default. Second, each rule should have a local flag variable, $LOCAL_RULE_TRACE, which is turned on by default. After the rule is fully tested and debugged, this local trace flag can be turned off. If the event rule determines the event should be dropped, it sets the $CONTINUE flag to false. When the section of code that would push the event onto the notification queue executes, it drops the event if the $CONTINUE flag is false. This flag is reset when the next event is read into the program.

Here is a sample rule that filters events based on their application. If the application is not in the configuration file, then the alert is dropped. This works as a preventive measure in case administrators accidentally turn on new alert types:

sub Check_Application {
      my $LOCAL_RULE_TRACE=1;
      my $FOUND_FLAG=0;
      # Event filtering based on event class listed in a text configuration file
      # Check that the class is supposed to be alerted

      # Note there is no rule logging if the event is not dropped
      # as this test is applied to all events and logging a
      # negative result would not be informative
      if($CONTINUE) {
            open(CONF, "< $application_list_file") or
                  warn "Cannot open $application_list_file: $!\n";
            @lines=<CONF>;
            close(CONF);
            foreach $line (@lines) {
                  chomp($line);
                  if ($line eq $fault[$_fault_application]) {
                        $FOUND_FLAG=1;
                        last;
                  }
            }
            if (not $FOUND_FLAG) {
                  $CONTINUE=0;
                  if(($GLOBAL_RULE_TRACE) or ($LOCAL_RULE_TRACE)) {
                        Log_Rule_Trace("Application $fault[$_fault_application] " .
                                       "is not in the list of applications, dropping alert.");
                  }
            }
      }
}
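
The rule above calls Log_Rule_Trace(), which is not shown in this excerpt. A minimal sketch of such a subroutine might look like the following. The log path and the timestamp layout are assumptions; the version in the fault_manager listing may differ.

```perl
use strict;
use warnings;

# Assumed location of the Rule_Trace_Log; adjust for your installation
our $rule_trace_log = "/tmp/rule_trace.log";

# Append one timestamped line to the Rule_Trace_Log.
# Returns 1 on success and 0 if the log could not be opened.
sub Log_Rule_Trace {
    my ($message) = @_;
    my $stamp = localtime();
    open(my $log, '>>', $rule_trace_log)
        or do { warn "Cannot open $rule_trace_log: $!\n"; return 0; };
    print $log "$stamp $message\n";
    close($log);
    return 1;
}
```

Opening the log in append mode for each message keeps the subroutine simple and means the file can be rotated while fault_manager is running.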

Event Notification and Problem Management Integration

Events can be notified in several ways. The three most common are paging, email, and integration with a problem management system. It is my belief that the fault management system should not provide a visual console where events are viewed by support staff. Once an event has been processed by a fault management system, then the result is a problem, and problems should be managed by a dedicated problem management system. These systems excel at things like logging activities as work on the problem progresses, allowing transfer of the problem to different groups, and escalating the problem if it is not quickly solved. These systems also have extensive reporting mechanisms. Most large IT support organizations revolve their operations around the problem management system. For small organizations without a problem management system, paging and emailing, along with reports from the logs, may be sufficient -- until management starts asking more questions.

To integrate with a problem management system, the event must be communicated to that system in some manner. Some problem systems can accept problem tickets via an SMTP interface, which works well even if the server is on a DMZ (see Figure 1). However, if the problem management integration requires a direct network connection, then the location of the event server could pose a problem. One way to solve the problem is to have the event notification write the problem information into another file in a separate directory on the event server. Then, you could have a Perl program running on the problem management server that connects to the event server via anonymous FTP every 15 or 30 seconds and gets these problem record files. This program would then initiate the connection to the problem management system to create the problem tickets. I've done this, and it works quite well.

When fault_manager integrates with a problem management system, you must determine which system will handle the other forms of notification (paging and emails). Some problem management systems excel at this; others do not perform it at all. In some cases, the notifications are split, with one system notifying some people or groups and the other notifying other people or groups. You must also define which system will determine the support group to which the problem will be assigned. Some problem management systems are configured to do this automatically based on the host/device and type of problem. Others need the support group defined when the problem ticket is raised.

The notification actions can either be written into the fault_manager application or put in a separate script. I have put the notifications into a separate script, which helps provide some logical separation between the functions. This is completely arbitrary, and it could be argued that it's not worth shelling out to a separate process to achieve this separation.

The notification script that I've included provides a highly flexible way to map events to support groups. The details are included in the comments section of the script. Once the notification script looks up the group to notify, it always raises a problem ticket. Then, it pages and emails if the event is a FATAL-level event, and sends only email if it's a WARNING-level event.
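
The severity logic just described can be sketched as follows. The real notification script reads its group mapping from notify.conf, whose format is not shown here, so the %group_for_host table, the "operations" fallback group, and both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical mapping from host to support group
my %group_for_host = (
    Server1 => "unix_support",
    router1 => "network_support",
);

# Look up the support group for a host, falling back to an assumed default
sub group_for_event {
    my ($host) = @_;
    return exists $group_for_host{$host} ? $group_for_host{$host} : "operations";
}

# Return the list of notification actions for a severity level
sub notification_actions {
    my ($severity) = @_;
    my @actions = ("ticket");              # a problem ticket is always raised
    if ($severity eq "FATAL") {
        push @actions, "page", "email";    # FATAL events page and email
    }
    elsif ($severity eq "WARNING") {
        push @actions, "email";            # WARNING events only email
    }
    return @actions;
}

# Example: decide what to do with a FATAL event from Server1
my $group   = group_for_event("Server1");
my @actions = notification_actions("FATAL");
print "Notify $group via: @actions\n";
```

Returning the actions as a list keeps the decision separate from the code that actually pages, emails, or raises the ticket, which makes the policy easy to change.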

Paging

The paging is done by shelling out to the qpage (http://www.qpage.org) SNPP (Simple Network Paging Protocol) client. SNPP is the best way to manage paging. The client makes the request, and a dedicated SNPP server handles the work of managing the modems, queuing requests for the modem, and handling retries. I recommend installing the SNPP server and modem on the same system as fault_manager because if there is a network problem, the SNPP request may not make it to another server, and there will be no notification.

The qpage client then uses its configuration file to look up the necessary pager service provider and pager PIN information. The configuration file can also page multiple people within the group, or page different people based on service hours. There is another excellent SNPP client/server package, Quickpage (http://sendpage.cpoint.net), which is written in Perl. Because its client program is written in Perl, it could be incorporated directly into the notification script, eliminating the need to shell out to the SNPP client. Note that Quickpage needs the group's pager number and PIN passed on the command line. This information must be maintained in the notify.conf file along with other group information and looked up before a page is issued:

sub page {
      my $Group_Name = $_[0];
      my $msg = "$Host: $Event_Class alert on $Event_Instance";
      # Pass the arguments as a list so that the shell cannot split
      # or interpret the message text
      $return=system("/opt/bin/qpage", "-p", $Group_Name, $msg);
}

Emailing

The emailing is done through Sendmail. If the fault_manager script is modified to run on NT, the Blat Sendmail-like program can be used. Some of my clients use MS Exchange or Lotus Notes as their internal email systems. In these cases, I configure a Sendmail alias for each group and have Sendmail forward the email to the group's Exchange or Notes SMTP email address:

sub email {
      # Subroutine to email - make sure path to your sendmail is correct!
      my $Group_Name = $_[0];
      my $msg="\"$Msg\"";
      my $to="$Group_Name";
      my $subject="$Host: $Event_Class alert on $Event_Instance";
      open(SENDMAIL, "|/usr/lib/sendmail -oi -t")
            or warn "Cannot fork for sendmail: $!\n";
      # A blank line must separate the headers from the message body
print SENDMAIL <<"EOF";
From: Fault_manager
To: $to
Subject: $subject

$msg
EOF
      close(SENDMAIL) or warn "Sendmail didn't close nicely";
}

The problem management integration portion is specific to each organization and problem management system. I have written integration to the Remedy ARS system using the ARSPerl modules (http://arsinfo.cit.buffalo.edu/perl), to the Peregrine ServiceCenter using the ServiceCenter SDK, and even to the mainframe-based INFOMAN system using FTP and the client's INFOMAN batch API.

Setting Up the System

To get fault_manager set up and running in your environment, it is necessary to do the following:

1. Perform an analysis of your organization's monitoring requirements and support structure.

2. Configure your existing or new monitoring system(s) to send events via FTP in the form expected by fault_manager.

3. Install the fault_manager script in the directory of your choice.

4. Set the paths within the fault_manager script.

5. Modify existing rules to suit your environment, remove rules that don't apply, and write new ones as required.

6. If required, install the database for event logging and modify the logging subroutine accordingly.

7. Configure the notification script.

8. Install and configure qpage.

9. Write startup scripts to automatically start fault_manager.

10. Generate test alerts and view results and rule trace logs to ensure that everything is working as planned.

11. Write and test problem management system integration.

12. When everything is working properly, move into production.

Helper Scripts

During evenings and weekends, staff members make changes to the system as part of planned maintenance activities. During these activities, it is necessary to stop alerts from being falsely generated. The easiest way to do this is to leave the monitoring systems alone (i.e., let them go into alert) and configure the fault management system to drop the alerts. I have included a simple script that is executed by staff to disable monitoring. Staff members can log in to the fault manager server and execute monitoring_stop hostname. The script then adds the hostname to a configuration file. A rule reads this configuration file for each incoming event and drops the event if the host is in the file.

I think having staff members use this script is more effective than asking them to edit the configuration file itself. I ensure the script is in their path, so they don't need to know where either it or the configuration file is located. I also have the script send out an email to an administrator indicating that monitoring has been turned off. When the maintenance is done, staff members need to log in again and execute another script, monitoring_start hostname, to re-enable alerting.

If a staff member forgets to restart monitoring, then important events can get missed. To prevent this, I added a warning to the monitoring_stop script. I also wrote a small script run by cron each morning that emails out a list of hosts that have monitoring disabled. This helps catch anything missed. The monitoring_stop script looks like this:

#!/bin/perl
# monitoring_stop: stop alerting for a host by adding it to the
# list of stopped hosts

# Get the host from the command line
$host=$ARGV[0];
die "Usage: monitoring_stop hostname\n" unless defined $host;

# Extract the user name from the output of id, e.g., uid=100(tim) gid=...
$user=`id`;
$start=index($user, "(");
$end=index($user, ")");
$user=substr($user, $start+1, $end-$start-1);
if ($user eq 'root') {
      print "You cannot run this as root, su to yourself and try again...\n";
      exit;
}
# Read in the existing file of stopped hosts
$host_list_file = '/opt/bin/alert/hosts.conf';
open(CONF, "< $host_list_file") or warn "Cannot open $host_list_file: $!\n";
@lines=<CONF>;
close(CONF);

$FOUND_FLAG=0;
# Check if the host is already stopped
foreach $line (@lines) {
      chomp($line);
      if ($line eq $host) {
            print "Monitoring already stopped for $host.\n";
            $FOUND_FLAG=1;
            last;
      }
}

# Add host to list and rewrite file
if (not $FOUND_FLAG) {
      push @lines, $host;
      open(CONF, "> $host_list_file") or warn "Cannot open $host_list_file for writing: $!\n";
      foreach $line (@lines) {
            print CONF "$line\n";
      }
      close(CONF);

      # Send email to notify that monitoring was stopped
      # (a blank line separates the headers from the message body)
      open(SENDMAIL, "|/usr/lib/sendmail -oi -t") or warn "Can't fork for sendmail: $!\n";
      print SENDMAIL <<"EOF";
From: Fault_manager
To: root
Subject: Monitoring stopped for $host

Monitoring has been stopped for $host by $user.
EOF
      close(SENDMAIL) or warn "Sendmail didn't close nicely";

      print "Monitoring has now been stopped for $host, please remember " .
            "to restart monitoring when you are finished.\n";
}
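
The companion monitoring_start script is not reproduced in this article. As a rough sketch, under the assumption that it simply removes the host from the same hosts.conf file, it could be built around something like the following. The subroutine names are hypothetical, and the root check and notification email from monitoring_stop are omitted for brevity.

```perl
use strict;
use warnings;

# Return the list of hosts with every occurrence of $host removed
sub remove_host {
    my ($host, @hosts) = @_;
    return grep { $_ ne $host } @hosts;
}

# Rewrite the stopped-hosts file without $host.
# Returns the number of entries that were removed.
sub monitoring_start {
    my ($host, $file) = @_;
    open(my $in, '<', $file) or die "Cannot open $file: $!\n";
    chomp(my @lines = <$in>);
    close($in);

    my @kept = remove_host($host, @lines);

    open(my $out, '>', $file) or die "Cannot open $file for writing: $!\n";
    print $out "$_\n" for @kept;
    close($out);
    return @lines - @kept;
}

# Example: monitoring_start hostname
if (@ARGV) {
    my $removed = monitoring_start($ARGV[0], '/opt/bin/alert/hosts.conf');
    print $removed ? "Monitoring restarted for $ARGV[0].\n"
                   : "Monitoring was not stopped for $ARGV[0].\n";
}
```

Reporting whether the host was actually in the file gives staff a quick check that they typed the same hostname they used when stopping monitoring.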

I've also had cron run these scripts for those clients who say, "it's a 24x7 system, except for the 10-minute window each night between 3:40 and 3:50 when we reboot the system." To handle this, I use cron to turn monitoring off immediately before the reboot and back on immediately after it.

Redundant and Fault-Tolerant System

In a large organization, a fault management system is a very important component of the infrastructure. Such a system should have some level of fault-tolerance and redundancy. From a fault-tolerance perspective, the system should be set up on fault-tolerant hardware. From a redundancy perspective, it is possible to set up a completely separate server running all the same programs (ideally located on a different part of the network). To make this work, the monitoring system's FTP script that sends the alerts to the fault management server needs to be modified. The script needs to detect the failure to make an FTP connection. When this happens, the script could attempt an FTP connection to the secondary server. The secondary server would then receive and process the event.
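
The failover logic in the sending script might be sketched as follows. This is an illustration under stated assumptions, not the author's script: the subroutine names, the server names in the usage comment, and the 30-second timeout are all hypothetical. The sender is passed in as a code reference so that the failover loop can be tested without a network.

```perl
use strict;
use warnings;

# Try each server in order. Return the name of the first server that
# accepts the alert, or undef if none of them does.
sub send_with_failover {
    my ($servers, $send) = @_;
    foreach my $server (@$servers) {
        return $server if $send->($server);
        warn "Could not deliver alert to $server, trying next server\n";
    }
    return;
}

# Send one file to one server by anonymous FTP. Returns 1 on success.
sub ftp_send {
    my ($server, $file, $remote_name) = @_;
    require Net::FTP;    # loaded only when a real transfer is attempted
    my $ftp = Net::FTP->new($server, Timeout => 30) or return 0;
    my $ok = $ftp->login('anonymous', 'user@your_domain.com')
          && $ftp->cwd('/fault')
          && $ftp->put($file, $remote_name);
    $ftp->quit;
    return $ok ? 1 : 0;
}

# Usage (hypothetical server names):
# my $used = send_with_failover(
#     ['fault1.your_domain.com', 'fault2.your_domain.com'],
#     sub { ftp_send($_[0], '/tmp/alert_file', 'server1.123') },
# );
# warn "Alert could not be delivered to any server\n" unless defined $used;
```

Listing the primary server first means the secondary is contacted only when the primary cannot be reached, which keeps the logs on one system in normal operation.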

When the secondary system starts processing events, the event and rule trace logs will be spread across two systems. These will need to be rejoined later for comprehensive reporting. It is also important for both systems to contain exactly the same versions of the scripts and the configuration files. It is rare that the secondary server would be used, but a fully redundant system is quite simple to set up and administer.

Conclusion

The programs outlined in this article are in production for real clients, managing their environments 24 hours a day. For one client, I duplicated the functionality of their expensive name-brand system, and over a weekend replaced it with this system. At first, nobody noticed. Soon, management and support staff noticed when previously impossible things became possible and the number of nuisance events dropped. Later, the accountants noticed when they had a significantly lower software maintenance contract to sign.

In most large IT organizations, change seems to be the only constant thing. The fault management system is subject to ongoing changes as the support environment changes -- new procedures are added, support duties get split, new monitoring systems are implemented, and management makes new requests. It is good to know that your Perl solution is ready to handle any such future request.

Tim Shouldice is a computer consultant living in Ottawa, Canada. He specializes in designing and implementing monitoring, fault management, and performance management systems. He can be reached by email at: tims@rogers.com.