SM2UR: Simple-Minded System Monitoring & Uptime Reporting
Bill Gray
[Note: The software described here was developed as
a (very, very) small
part of activities pursuant to the conduct of US DOE
contract
DE-AC07-94ID13223 at the Idaho National Engineering Laboratory
(INEL).
The author is presently employed by Lockheed-Martin
Idaho Technologies,
the INEL's Managing and Operating Contractor. Opinions
expressed herein,
if indeed there are any, are solely those of the author.]
Some time ago -- more than two years now -- we system
administrators for a
ragtag collection of UNIX workstations and servers were
told that uptime
reporting would be required, and had to be in place
in two weeks. Then a
mainframe (Cray) shop, we hadn't regarded these dwarfish
boxes as
full-fledged production systems. After all, they just
plugged in to the
wall, and were right there in the same rooms with us.
How could one take
seriously something that didn't require at least a ton
of air
conditioning? Be that as it may, we had marching orders
and a simple
requirement: a monthly report on the availability of
each production
server. Not too unreasonable, really, except maybe that
bit about two
weeks.
Of course, this also required that we accept the idea
of something other
than a mainframe as a production server. Back then,
this also took some
getting used to.
Why make rather than buy?
Simple. No money -- something about the shoemaker's
children. Besides,
what we ultimately did is really cheap.
Some Additional Requirements
Simplicity. The various system administrators would
not tolerate further
complications in their lives. A rococo, hard-to-understand-and-install package would not fly; Louis XV may make for elegant furniture, but generally Bauhaus is better for software.
Fire and forget. For the same reasons, something requiring
constant care
and feeding also would not fly; we wanted something
we could install and
essentially ignore.
Apart from these two considerations, we would also like
the package to
be self-contained; i.e., with no fingers into other parts of the system and no heavy dependence on local configuration details.
How It Works: An Overview
... the representation of the data or tables. This is
where the heart of
a program lies.
-- Fred P. Brooks, Jr.
The Mythical Man-Month
The idea behind our design is very straightforward:
somewhere, maintain
a "heartbeat" for the system; during times
when there is no heartbeat,
the system is down. Read the system up/downtime from
the EKG strip
chart, which in our case is a file. This file is also
designed to be
collapsible without loss of information, and still be
human-readable.
There is actually a collection of these files, but that's
not a concern
for now. Implementation is nearly trivial; the telling
takes more than
the doing.
As I write this, the current EKG file is /var/log/D.1995.09.28.
The
first few records in it look like:
+ 1995 09 28 00 12 02
+ 1995 09 28 00 27 01
+ 1995 09 28 00 42 01
+ 1995 09 28 00 57 00
These are heartbeat records, which are produced every
15 minutes, a
granularity that I arbitrarily picked as being "good
enough" for the
purpose. They are being emitted by the following crontab(5)
entry:
12,27,42,57 * * * * /usr/local/etc/sysmon > /dev/null 2>&1
If the system goes down, no cron, no heartbeat.
In order to make these records collapsible, we (paradoxically)
use
another record that is inserted in the EKG file at boot-up:
an
"incident" record. It so happens that the
system crashed around 0945 on
September 18; here are the records surrounding that
event:
+ 1995 09 18 09 12 01
- 1995 09 18 09 46 48
+ 1995 09 18 23 57 02
The line preceded by "-" records an incident;
it was put there as the
last thing done by the last multi-user start-up script
(in this case,
/etc/rc.local) by the following sh(1) commands:
if [ -f /usr/local/etc/sysmon ]; then
        /usr/local/etc/sysmon init
fi
These records tell us that the system was down after
9:12:01 and was
back up at 9:46:48; i.e., it was down at most 34 min.,
47 sec.
Scheduled downtime can be recorded by editing the appropriate
boot-up
record the next day, replacing the "-" with
a "*".
From the above, you have also probably gathered that
the thing that
produces these records is called sysmon (Listing 1).
This is a fairly
simple shell script whose role in life is to format
heartbeat and
incident records and append them to the appropriate
file.
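Listing 1 is not reproduced here, but a minimal sketch conveys the flavor. What follows is only an approximation of the idea -- the log directory and the use of date(1) are my assumptions -- and not the actual script:

#!/bin/sh
# Sketch only -- not the actual sysmon of Listing 1.
# Appends a heartbeat ("+") record, or an incident ("-") record when
# called as "sysmon init", to today's daily log.
LOGDIR=/var/log
STAMP=`date '+%Y %m %d %H %M %S'`
TODAY=`date '+%Y.%m.%d'`
case "$1" in
init)   FLAG="-" ;;     # boot-up incident record
*)      FLAG="+" ;;     # ordinary heartbeat
esac
echo "$FLAG $STAMP" >> $LOGDIR/D.$TODAY

Run from cron, such a script writes the "+" records shown above; run as "sysmon init" from the start-up script, it writes the "-" incident record described above.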
How It Works: Log File Naming, Collapsing, ... Details
The devil is in the details.
-- Anon.
Files are named using a scheme that makes them easy
to process and
identify. The lowest-level log is a daily one, named
according to the
pattern "D.yyyy.mm.dd"; today's log as I write
this is
-rw-r--r-- 1 root 1232 Sep 29 13:57 /var/log/D.1995.09.29
Today's log is the only one touched by cron, or in other
words, is
"active". We don't otherwise mess with the
active log. Since successive
heartbeats up to an incident record are redundant --
we know the thing
was alive up to the last heartbeat before an incident
-- we can collapse
a log file by removing them. Files collapsed this way
have .Y appended;
yesterday's collapsed log is
-rw-r--r-- 1 root 44 Sep 29 01:00 /var/log/D.1995.09.28.Y
It was collapsed by the following crontab entry:
0 1 * * * /usr/local/etc/cdl -r > /dev/null 2>&1
cdl ("collapse daily log" Listing 2) is a
shell script concocted for
the purpose. The -r flag tells cdl to remove unnecessary
logs after it
has collapsed yesterday's daily log for example, raw
daily logs that
have already been subsumed into collapsed logs.
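Listing 2 itself is not reproduced here, but the heart of the collapsing step can be sketched in a few lines of awk. The sketch below reflects my reading of the description -- keep the first and last heartbeat of each consecutive run, pass incident records through untouched -- rather than the actual code:

#!/bin/sh
# Sketch only -- one way to collapse a sysmon-format log.
# Runs of consecutive "+" (heartbeat) records are reduced to the first
# and last record of each run; "-" and "*" records pass through as-is.
awk '
/^[+]/  { if (run == 0) print; prev = $0; run++; next }
        { if (run > 1) print prev; run = 0; print }
END     { if (run > 1) print prev }
' "$1"

Run over an incident-free daily log, this leaves just the day's first and last heartbeats, which squares with the 44-byte collapsed file shown above.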
Naturally, daily logs can be consolidated monthly, and
once all the
daily information is subsumed into the monthly, the
daily logs can be
discarded. Monthly logs are named according to the pattern
"M.yyyy.mm",
or, when collapsed, "M.yyyy.mm.Y". Last month's
log is
-rw-r--r-- 1 root 1496 Sep 1 08:11 /var/log/M.1995.08
It was collapsed by the following crontab entry:
11 8 1 * * /usr/local/etc/cml > /dev/null 2>&1
(You can doubtless guess why it is named cml; it appears in Listing 3.)
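Listing 3 is likewise omitted; in spirit, the job is little more than concatenating last month's daily logs and collapsing the result. A rough sketch, with the month selection elided and "collapse" standing in for the package's own collapsing filter:

#!/bin/sh
# Sketch only -- not the actual cml of Listing 3.
# Gathers last month's collapsed daily logs into a single monthly log.
# MONTH is assumed to have been worked out already; "collapse" is a
# placeholder for the real collapsing filter.
LOGDIR=/var/log
MONTH=1995.08
cat $LOGDIR/D.$MONTH.*.Y | /usr/local/etc/collapse > $LOGDIR/M.$MONTH.Y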
Notice from the sizes that none of these logs, especially when collapsed and consolidated, ever gets very big unless the system crashes a lot; each record is only 22 bytes, so even a full raw daily log runs to just 96 x 22 = 2112 bytes, and an incident-free day collapses to two records, or 44 bytes.
In a similar vein, a yearly log is named according to
the pattern
"Y.yyyy", or, when collapsed, "Y.yyyy.Y".
Presently, we don't keep
yearly logs.
Reporting
None love the bearer of bad news.
-- Sophocles
Perhaps, in light of Sophocles' dictum, it is a bad
idea to automate the
process of reporting news, which, unless it reads "100%
uptime," is in
itself bad. Be that as it may, we are required to report
uptime monthly.
For this purpose, there is a sh(1) script, rmu (you
guessed it, "report
monthly uptime" in Listing 4). It is called by
the following crontab
entry:
27 8 1 * * /usr/local/etc/rmu -r > /dev/null 2>&1
In the only concession to medium-tech -- there is no
high-tech -- in the
entire package, rmu calls a C program, rst.c ("report
system time" in
Listing 5), to actually analyze the log and report uptime.
rst can
report using any uptime log over any time period; it
is rmu's job to
feed it data for a particular month, by default last
month. So it is
called, here at 0827 on the first of every month, to
report last month's
uptime. The -r option has the same sense as for the
cdl script.
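rst itself is not reproduced here either. For the curious, the bookkeeping it does can be sketched in awk; the sketch assumes GNU awk for its mktime() function and charges the gap between the last record and a following incident record as down time, which is my reading of the report format rather than a transcription of Listing 5:

#!/bin/sh
# Sketch only -- not the actual rst of Listing 5.
# Reads sysmon-format records and reports up/down time for the interval
# they span.  Requires GNU awk for mktime().
awk '
{
    t = mktime($2 " " $3 " " $4 " " $5 " " $6 " " $7)
    if (NR == 1)        start = t
    else if ($1 == "-") unsched += t - prev     # unscheduled outage
    else if ($1 == "*") sched   += t - prev     # scheduled outage
    prev = t
}
END {
    total = prev - start
    up    = total - unsched - sched
    printf "Total time in interval (sec) = %d\n", total
    printf "Up time                      = %d\n", up
    printf "Unscheduled down time        = %d\n", unsched
    printf "Scheduled down time          = %d\n", sched
    if (total > 0)
        printf "Percent uptime               = %.2f\n", 100 * up / total
}
' "$@"

Something like "cat /var/log/M.1995.06* | sh rst.sketch" would then report on June; rst.sketch is just whatever name you give the file above.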
The following is an actual rmu/rst report. The scheduled downtime came from a full-day scheduled building power outage; it was entered by editing the log file and replacing the "-" flag on the boot-up incident record with a "*".
From root@myhostname Sat Jul 1 08:32:07 1995
Return-Path: <...>
...
Date: Sat, 1 Jul 95 08:27:05 MDT
From: root@myhostname (Operator)
Message-Id: <...>
To: [...]
Subject: "Uptime Report from myhostname"
Status: R
Start time - 6/ 1/95 0:12: 4 (801987124)
Endtime - 6/30/95 23:57: 1 (804578221)
---
Total time in interval (sec) = 2591097
Check cumulative time = 2591097
---
Up time = 2553695
Unscheduled down time = 4046
Scheduled down time = 33356
---
Percent uptime = 99.84
The big numbers in parentheses by the start and end times are the corresponding UNIX epoch times, that is, the number of seconds since 00:00:00Z, January 1, 1970. The total time is calculated from these two. The "[c]heck cumulative time" is a running total kept by the rst program; it must equal the total interval time, or something has gone wrong.
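As a quick sanity check on the report above: 804578221 - 801987124 = 2591097, which matches the total interval time, and 2553695 + 4046 + 33356 = 2591097 as well, so the up, unscheduled, and scheduled figures account for every second of it.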
Some Remarks about the Package
SM2UR is mainly a collection of scripts, each of which
is documented
with man(1)-style comments at the very start; for example,
the make(1)
file starts with
# SM2UR - Simple Minded System Monitor and Uptime Reporter.
#
# SYNOPSIS
# make [ all | install | clean | wipe | <trgts> ]
...
I believe that this makes the documentation much more likely to be modified appropriately along with the code than is the case with man pages in separate files. The intention was to provide an extraction utility that would build real man pages as part of the installation process, but that remains to be implemented. Nevertheless, the information is still there, and it is current with the code. A sketch of one possible starting point for such an extractor is shown below; an inventory of the package follows it.
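The following is only a sketch, not part of SM2UR: it pulls the leading comment block out of each named file and strips the comment markers, leaving plain text rather than true man-page source.

#!/bin/sh
# Sketch only -- a possible man-page extractor, not part of the package.
# Prints the leading "#" comment block of each named file with the
# comment markers stripped; "#!" lines are discarded.
for f in "$@"
do
        echo "==== $f ===="
        sed -n -e '/^#!/d' -e '/^#/!q' -e 's/^# \{0,1\}//p' "$f"
done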
C programs
rst(8): report system time. Passes over a log file in
the format
maintained by sysmon(8) and reports uptime.
sh(1) scripts
cdl(8): collapse daily log(s). Uses clf to collapse
all inactive daily
logs in a log directory.
clf(8) (Listing 6): collapse log file. Removes redundant
consecutive
heartbeats from a log file.
cml(8): collapse monthly log(s). Collapses daily logs
for a month into a
single log associated with that month.
rmu(8): report monthly uptime. Calls rst to report uptime
for a given
month.
sysmon(8): Maintains downtime/heartbeat/incident logs
for calculating
availability. Normally called from cron(8) and initially
from a system
startup file such as /etc/rc.local.
For information on access to the source code, please
contact Bill Gray
at whg@INEL.GOV.
About the Author
Bill Gray is a programmer and system administrator
at the Idaho National
Engineering Laboratory, operated for the Department
of Energy by
Lockheed Martin Corp.