SM2UR: Simple-Minded System Monitoring & Uptime Reporting
Bill Gray
[Note: The software described here was developed as
a (very, very) small
part of activities pursuant to the conduct of US DOE
contract
DE-AC07-94ID13223 at the Idaho National Engineering Laboratory
(INEL).
The author is presently employed by Lockheed-Martin
Idaho Technologies,
the INEL's Managing and Operating Contractor. Opinions
expressed herein,
if indeed there are any, are solely those of the author.]
Some time ago -- more than two years now -- we system
administrators for a
ragtag collection of UNIX workstations and servers were
told that uptime
reporting would be required, and had to be in place
in two weeks. Then a
mainframe (Cray) shop, we hadn't regarded these dwarfish
boxes as
full-fledged production systems. After all, they just
plugged in to the
wall, and were right there in the same rooms with us.
How could one take
seriously something that didn't require at least a ton
of air
conditioning? Be that as it may, we had marching orders
and a simple
requirement: a monthly report on the availability of
each production
server. Not too unreasonable, really, except maybe that
bit about two
weeks.
Of course, this also required that we accept the idea
of something other
than a mainframe as a production server. Back then,
this also took some
getting used to.
Why make rather than buy?
Simple. No money -- something about the shoemaker's
children. Besides,
what we ultimately did is really cheap.
Some Additional Requirements
Simplicity. The various system administrators would
not tolerate further
complications in their lives. A rococo, hard-to-understand-and-install package would not fly; Louis XV may make for elegant furniture, but generally Bauhaus is better for software.
Fire and forget. For the same reasons, something requiring
constant care
and feeding also would not fly; we wanted something
we could install and
essentially ignore.
Apart from these two considerations, we would also like
the package to
be self-contained; i.e., with no fingers into other parts of the system and no heavy dependence on local configuration details.
How It Works: An Overview
... the representation of the data or tables. This is
where the heart of
a program lies.
-- Fred P. Brooks, Jr.
The Mythical Man-Month
The idea behind our design is very straightforward:
somewhere, maintain
a "heartbeat" for the system; during times
when there is no heartbeat,
the system is down. Read the system up/downtime from
the EKG strip
chart, which in our case is a file. This file is also
designed to be
collapsible without loss of information, and still be
human-readable.
There is actually a collection of these files, but that's
not a concern
for now. Implementation is nearly trivial; the telling
takes more than
the doing.
As I write this, the current EKG file is /var/log/D.1995.09.28.
The
first few records in it look like:
+ 1995 09 28 00 12 02
+ 1995 09 28 00 27 01
+ 1995 09 28 00 42 01
+ 1995 09 28 00 57 00
These are heartbeat records, which are produced every
15 minutes, a
granularity that I arbitrarily picked as being "good
enough" for the
purpose. They are being emitted by the following crontab(5)
entry:
12,27,42,57 * * * * /usr/local/etc/sysmon > /dev/null 2>&1
If the system goes down, no cron, no heartbeat.
In order to make these records collapsible, we (paradoxically)
use
another record that is inserted in the EKG file at boot-up:
an
"incident" record. It so happens that the
system crashed around 0945 on
September 18; here are the records surrounding that
event:
+ 1995 09 18 09 12 01
- 1995 09 18 09 46 48
+ 1995 09 18 23 57 02
The line preceded by "-" records an incident;
it was put there as the
last thing done by the last multi-user start-up script
(in this case,
/etc/rc.local) by the following sh(1) commands:
if [ -f /usr/local/etc/sysmon ]; then
        /usr/local/etc/sysmon init
fi
These records tell us that the system was down after
9:12:01 and was
back up at 9:46:48; i.e., it was down at most 34 min.,
47 sec.
Scheduled downtime can be recorded by editing the appropriate
boot-up
record the next day, replacing the "-" with
a "*".
From the above, you have also probably gathered that
the thing that
produces these records is called sysmon (Listing 1).
This is a fairly
simple shell script whose role in life is to format
heartbeat and
incident records and append them to the appropriate
file.
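Listing 1 is not reproduced here, but a minimal sketch conveys the flavor. What follows is only an approximation of the idea -- the log directory and the use of date(1) are my assumptions -- and not the actual script:

#!/bin/sh
# Sketch only -- not the actual sysmon of Listing 1.
# Appends a heartbeat ("+") record, or an incident ("-") record when
# called as "sysmon init", to today's daily log.
LOGDIR=/var/log
STAMP=`date '+%Y %m %d %H %M %S'`
TODAY=`date '+%Y.%m.%d'`
case "$1" in
init)   FLAG="-" ;;     # boot-up incident record
*)      FLAG="+" ;;     # ordinary heartbeat
esac
echo "$FLAG $STAMP" >> $LOGDIR/D.$TODAY

Run from cron, such a script writes the "+" records shown above; run as "sysmon init" from the start-up script, it writes the "-" incident record described above.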
How It Works: Log File Naming, Collapsing, ... Details
The devil is in the details.
-- Anon.
Files are named using a scheme that makes them easy
to process and
identify. The lowest-level log is a daily one, named
according to the
pattern "D.yyyy.mm.dd"; today's log as I write
this is
-rw-r--r-- 1 root 1232 Sep 29 13:57 /var/log/D.1995.09.29
Today's log is the only one touched by cron, or in other
words, is
"active". We don't otherwise mess with the
active log. Since successive
heartbeats up to an incident record are redundant --
we know the thing
was alive up to the last heartbeat before an incident
-- we can collapse
a log file by removing them. Files collapsed this way
have .Y appended;
yesterday's collapsed log is
-rw-r--r-- 1 root 44 Sep 29 01:00 /var/log/D.1995.09.28.Y
It was collapsed by the following crontab entry:
0 1 * * * /usr/local/etc/cdl -r > /dev/null 2>&1
cdl ("collapse daily log" Listing 2) is a
shell script concocted for
the purpose. The -r flag tells cdl to remove unnecessary
logs after it
has collapsed yesterday's daily log for example, raw
daily logs that
have already been subsumed into collapsed logs.
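Listing 2 itself is not reproduced here, but the heart of the collapsing step can be sketched in a few lines of awk. The sketch below reflects my reading of the description -- keep the first and last heartbeat of each consecutive run, pass incident records through untouched -- rather than the actual code:

#!/bin/sh
# Sketch only -- one way to collapse a sysmon-format log.
# Runs of consecutive "+" (heartbeat) records are reduced to the first
# and last record of each run; "-" and "*" records pass through as-is.
awk '
/^[+]/  { if (run == 0) print; prev = $0; run++; next }
        { if (run > 1) print prev; run = 0; print }
END     { if (run > 1) print prev }
' "$1"

Run over an incident-free daily log, this leaves just the day's first and last heartbeats, which squares with the 44-byte collapsed file shown above.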
Naturally, daily logs can be consolidated monthly, and
once all the
daily information is subsumed into the monthly, the
daily logs can be
discarded. Monthly logs are named according to the pattern
"M.yyyy.mm",
or, when collapsed, "M.yyyy.mm.Y". Last month's
log is
-rw-r--r-- 1 root 1496 Sep 1 08:11 /var/log/M.1995.08
It was collapsed by the following crontab entry:
11 8 1 * * /usr/local/etc/cml > /dev/null 2>&1
(You can doubtless guess why it is named cml; it appears in Listing 3.)
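Listing 3 is likewise omitted; in spirit, the job is little more than concatenating last month's daily logs and collapsing the result. A rough sketch, with the month selection elided and "collapse" standing in for the package's own collapsing filter:

#!/bin/sh
# Sketch only -- not the actual cml of Listing 3.
# Gathers last month's collapsed daily logs into a single monthly log.
# MONTH is assumed to have been worked out already; "collapse" is a
# placeholder for the real collapsing filter.
LOGDIR=/var/log
MONTH=1995.08
cat $LOGDIR/D.$MONTH.*.Y | /usr/local/etc/collapse > $LOGDIR/M.$MONTH.Y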
Notice from the sizes that none of these logs, especially when collapsed and consolidated, ever gets very big unless the system crashes a lot; each record is only 22 bytes, so even a full raw daily log runs to just 96 x 22 = 2112 bytes, and an incident-free day collapses to two records, or 44 bytes.
In a similar vein, a yearly log is named according to
the pattern
"Y.yyyy", or, when collapsed, "Y.yyyy.Y".
Presently, we don't keep
yearly logs.
Reporting
None love the bearer of bad news.
-- Sophocles
Perhaps, in light of Sophocles' dictum, it is a bad
idea to automate the
process of reporting news, which, unless it reads "100%
uptime," is in
itself bad. Be that as it may, we are required to report
uptime monthly.
For this purpose, there is a sh(1) script, rmu (you
guessed it, "report
monthly uptime" in Listing 4). It is called by
the following crontab
entry:
27 8 1 * * /usr/local/etc/rmu -r > /dev/null 2>&1
In the only concession to medium-tech -- there is no
high-tech -- in the
entire package, rmu calls a C program, rst.c ("report
system time" in
Listing 5), to actually analyze the log and report uptime.
rst can
report using any uptime log over any time period; it
is rmu's job to
feed it data for a particular month, by default last
month. So it is
called, here at 0827 on the first of every month, to
report last month's
uptime. The -r option has the same sense as for the
cdl script.
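rst itself is not reproduced here either. For the curious, the bookkeeping it does can be sketched in awk; the sketch assumes GNU awk for its mktime() function and charges the gap between the last record and a following incident record as down time, which is my reading of the report format rather than a transcription of Listing 5:

#!/bin/sh
# Sketch only -- not the actual rst of Listing 5.
# Reads sysmon-format records and reports up/down time for the interval
# they span.  Requires GNU awk for mktime().
awk '
{
    t = mktime($2 " " $3 " " $4 " " $5 " " $6 " " $7)
    if (NR == 1)        start = t
    else if ($1 == "-") unsched += t - prev     # unscheduled outage
    else if ($1 == "*") sched   += t - prev     # scheduled outage
    prev = t
}
END {
    total = prev - start
    up    = total - unsched - sched
    printf "Total time in interval (sec) = %d\n", total
    printf "Up time                      = %d\n", up
    printf "Unscheduled down time        = %d\n", unsched
    printf "Scheduled down time          = %d\n", sched
    if (total > 0)
        printf "Percent uptime               = %.2f\n", 100 * up / total
}
' "$@"

Something like "cat /var/log/M.1995.06* | sh rst.sketch" would then report on June; rst.sketch is just whatever name you give the file above.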
The following is an actual rmu/rst report. The scheduled downtime came from a full-day scheduled building power outage; it was entered by editing the log file and replacing the "-" flag on the boot-up incident record with a "*".
From root@myhostname Sat Jul 1 08:32:07 1995
Return-Path: <...>
...
Date: Sat, 1 Jul 95 08:27:05 MDT
From: root@myhostname (Operator)
Message-Id: <...>
To: [...]
Subject: "Uptime Report from myhostname"
Status: R
Start time - 6/ 1/95 0:12: 4 (801987124)
Endtime - 6/30/95 23:57: 1 (804578221)
---
Total time in interval (sec) = 2591097
Check cumulative time = 2591097
---
Up time = 2553695
Unscheduled down time = 4046
Scheduled down time = 33356
---
Percent uptime = 99.84
The big numbers in parentheses by the start and end times are the corresponding UNIX epoch times, that is, the number of seconds since 00:00:00Z, January 1, 1970. The total time is calculated from these two. The "[c]heck cumulative time" is a running total kept by the rst program; it must equal the total interval time, or something has gone wrong.
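As a quick sanity check on the report above: 804578221 - 801987124 = 2591097, which matches the total interval time, and 2553695 + 4046 + 33356 = 2591097 as well, so the up, unscheduled, and scheduled figures account for every second of it.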
Some Remarks about the Package
SM2UR is mainly a collection of scripts, each of which
is documented
with man(1)-style comments at the very start; for example,
the make(1)
file starts with
# SM2UR - Simple Minded System Monitor and Uptime Reporter.
#
# SYNOPSIS
# make [ all | install | clean | wipe | <trgts> ]
...
I believe that this makes the documentation much more likely to be modified appropriately along with the code than is the case with man pages in separate files. The intention was to provide an extraction utility that would build real man pages as part of the installation process, but that remains to be implemented. Nevertheless, the information is still there, and it is current with the code. A sketch of one possible starting point for such an extractor is shown below; an inventory of the package follows it.
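The following is only a sketch, not part of SM2UR: it pulls the leading comment block out of each named file and strips the comment markers, leaving plain text rather than true man-page source.

#!/bin/sh
# Sketch only -- a possible man-page extractor, not part of the package.
# Prints the leading "#" comment block of each named file with the
# comment markers stripped; "#!" lines are discarded.
for f in "$@"
do
        echo "==== $f ===="
        sed -n -e '/^#!/d' -e '/^#/!q' -e 's/^# \{0,1\}//p' "$f"
done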
C programs
rst(8): report system time. Passes over a log file in
the format
maintained by sysmon(8) and reports uptime.
sh(1) scripts
cdl(8): collapse daily log(s). Uses clf to collapse
all inactive daily
logs in a log directory.
clf(8) (Listing 6): collapse log file. Removes redundant
consecutive
heartbeats from a log file.
cml(8): collapse monthly log(s). Collapses daily logs
for a month into a
single log associated with that month.
rmu(8): report monthly uptime. Calls rst to report uptime
for a given
month.
sysmon(8): Maintains downtime/heartbeat/incident logs
for calculating
availability. Normally called from cron(8) and initially
from a system
startup file such as /etc/rc.local.
For information on access to the source code, please
contact Bill Gray
at whg@INEL.GOV.
About the Author
Bill Gray is a programmer and system administrator
at the Idaho National
Engineering Laboratory, operated for the Department
of Energy by
Lockheed Martin Corp.