Cover V03, I02
Article
Figure 1
Listing 1
Listing 2

mar94.tar


Improved Error Reporting in AIX

Thomas Richter

Error Reporting in AIX

AIX 3.2* logs all warnings and errors, both hardware- and software-related, in the file /var/adm/ras/errlog. Entries are always appended and are never deleted. All entries contain binary data. The system administrator uses the command errpt to create a readable output of this error log. No tools exist to automatically highlight severe errors or to notify the system administrator of new entries in the error log file. As a result, it's often the case that the error log file gets completely ignored. This is unfortunate because the error log file contains vital information, such as disk drive failures and adapter problems, which can't be ignored.

I present here two tools that I have developed to automate error reporting. They can be invoked on the command line or by cron. Their purpose is to notify the system administrator of new entries in the error log file. The tools can be used for daily and monthly error reporting and can mail the output to user root. Both are shell scripts which can be executed by either the Bourne or Korn shell.

Existing Commands

The following commands are supplied by the AIX operating system and are used in the scripts below.

  • errpt -- Prints binary entries from the system error log file /var/adm/ras/errlog to stdout. Various flags select entries by different criteria, such as time, error identifier, error type, device on which an error occurred.

  • errlogger -- Appends command-line parameters to the error log file.

  • errclear <days> -- Removes all entries in the error log file which are older than the number of days supplied as the parameter. This command must be used to reduce the size of the error log file.

    Daily Error Reporting

    The shell script /var/adm/ras/daily (in Listing 1) generates the daily error report. It is usually invoked by cron but can also be invoked from the command line.

    This program mails all selected errors which occurred today (flag -t), or since the last boot (default), to user root. The last boot time is found in /etc/utmp and can be displayed with the command who -b. The program uses the errpt command to extract entries from the system's error log.

    Even if no errors occurred, daily mails its output to root to indicate that the error report program was executed. In this case the system administrator receives mail with an empty message. A sample output of errpt is shown in Figure 1.

    As the figure shows, each line is a separate entry. The first column is a unique error identification. The second column specifies the time the entry was made, in the format mmddhhmmyy (month, day, hour, minute, and year). The third column describes the error type, which can be T (temporary), P (permanent), or U (unknown). The fourth column describes the class of the error, which can be H (hardware), S (software), or O (entries made by the super-user using the errlogger command). The fifth column is the resource name of the software product which made the entry, and the last column is a verbose explanation of the error.

    A powerful report generator, errpt can select entries in the error log file by time, error identifier, type of error (hardware, software), class of error (permanent, temporary), device type for hardware errors, and product names for software errors.

    If a spare terminal is available, errpt can be invoked in concurrent mode. All error log entries are reported as soon as they have been added to the error log file.

    Monthly Error Reports

    Daily reports are for system administrator notification. If error logs are to be kept for a longer period of time, they must be stored elsewhere. This is mandatory: if the system's error log file is not reduced with the errclear command, the error log will eat up all available disk space in that file system. The script in Listing 2 groups error reports by month and stores them in a compressed file. It also cleans the system's error log file by deleting all entries from the previous month. cron invokes the script the first five days of each month.

    This script keeps monthly error reports for exactly one year. The reports are stored in directory /var/adm/errors, which has to be created by the system administrator.

    cron Entries

    To automate the reporting, two entries are added to the crontab file for user root. The first entry generates daily error reports. The second entry invokes the monthly error reporting, which also removes last month's error log entries. This script is invoked the first five days each month, in case the system is shut down during the weekend or bank holidays. The two crontab entries are as follows:

    59   23   *     *   *   /usr/bin/ksh   /var/adm/errors/daily -t
    59   23   1-5   *   *   /usr/bin/ksh   /var/adm/errors/monthly

    Now, when the system administrator logs in each day, he or she will find a mail message containing a list of the most recently reported errors.

    Bibliography

    IBM. IBM AIX Commands Reference: Vol 2. GC23-2366-02.

    About the Author

    Thomas Richter has studied mathematics and Computer Science at the University of Ulm, Germany. He has worked on various UNIX platforms as a software developer and C/C++ as main programming languages. His projects include compiler construction, device drivers, and network programming. He has also administered various UNIX machines for the last 8 hears. He has worked for IBM UK since January 1993.


     



  •