Multi-Platform
Performance Monitoring on the Cheap
Dale Southard
The monitoring system presented in this article grew out of a
simple need for a portable system with which I could remotely monitor
performance metrics on UNIX hosts. When I began looking for a solution,
I considered many of the available commercial and free products. The standard
xload is easy to understand, but only provides a single metric
(load average) and doesn't retain information across invocations.
The top command is better, but again provides only a snapshot,
not a history trail. Sun's perfmeter and rstatd
provide more metrics and the ability to save trails, but are only
available under a few architectures. SGI's Performance Co-Pilot
can monitor and save an incredible number of metrics, but at the
time was only available under IRIX (it has since been ported to
Linux as well). Finally, SNMP looked like a future contender, but
still suffered from a lack of affordable monitoring packages and
security issues on some platforms.
What I really wanted was the ability to collect and save a group
of performance metrics and then reduce them to a form that is easy
to understand. Ideally, the tools should be portable to a wide range
of UNIX flavors. Upon further consideration, I found my needs were
simple enough to be met with syslog and some common UNIX
utilities.
The original inspiration for the design came from one of syslogd's
built-in features, the mark timestamp. Most modern syslog
daemons provide a mark function that places a timestamp in the logfile
at regular intervals. This is often used to help fix the time of
catastrophic system events (such as sudden power loss) that would
otherwise provide no log evidence. What limits the usefulness of
the standard syslogd mark function is that it provides only
a mark indicating that the machine is powered on and running syslogd.
In most cases, users and sys admins are interested in monitoring
more than the state of machine power and correct syslogd
function.
Collecting the Metrics
For this article, I assume that we have a network of six machines
at foo.com. One is the central "monhost" where we will
be doing the monitoring. The other five are the client hosts that
we will monitor. Each client host is running a different operating
system (Solaris, IRIX, IRIX64, Linux, and MacOS X). For this example,
we will monitor the following metrics on each client host:
- Load average of the machine
- Amount of free memory in MB
- Amount of free swap space in MB
The first step in the process was determining what metrics to
monitor and how to obtain them. Getting the load average is trivial.
Most variants of SVR4 UNIX (Solaris 2.x, IRIX, etc.) and BSD UNIX
(SunOS 4.x, BSD, MacOS X, etc.) include the "uptime" command
that includes the system load averages for the past 1, 5, and 15
minutes. Since I was interested in the five-minute load average,
a simple awk command is enough to select the appropriate field:
uptime | awk '{gsub(",",""); print $(NF-1)}'
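The extraction can be sanity-checked against a canned uptime line before deploying it; the sample below mimics typical SVR4/BSD output (exact spacing varies between flavors):

```shell
# Sample uptime output (spacing and wording vary slightly per UNIX flavor)
sample=' 11:02pm  up 45 days,  3:41,  2 users,  load average: 0.12, 0.34, 0.56'

# Same awk as the collection pipeline: strip the commas, then take the
# next-to-last field, which is the 5-minute load average
echo "$sample" | awk '{gsub(",",""); print $(NF-1)}'
# prints 0.34
```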
Free memory and swap are more difficult since the commands used to
monitor them differ wildly between UNIX flavors. Because the "freemem"
and free swap values are related and share the same units, I chose
to extract them to a single line of output -- first memory, then
swap, separated by white space.
Solaris and IRIX are standard SVR4 variants and provide the sar
command for monitoring a variety of performance metrics. For memory
and swap, the -r flag will narrow sar's output
to the metrics we are interested in. By default, sar reports
freemem in pages and free swap in disk blocks. sar's
notion of a basic block is 512 bytes on all platforms presented
here, so dividing by 2048 will convert blocks to MB.
Conversion of freemem pages to MB is OS-dependent. For Solaris,
"pagesize" is 8192 bytes, so dividing by 128 gives MB.
IRIX comes in two "widths" -- the smaller desktop
machines use 32-bit kernels with a pagesize of 4096 bytes; the larger
servers use 64-bit kernels with a pagesize of 16384 bytes --
so we will need to divide by 256 or 64, respectively. We will again
use awk to filter for our desired metrics and perform the necessary
conversions.
Solaris free memory and free swap in MB:
sar -r 1 | awk '{m=int($2/128);s=int($3/2048)} END {print m,s}'
IRIX free memory and free swap in MB:
sar -r 1 | awk '{m=int($2/256);s=int($3/2048)} END {print m,s}'
IRIX64 free memory and free swap in MB:
sar -r 1 | awk '{m=int($2/64);s=int($3/2048)} END {print m,s}'
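On systems that provide the POSIX getconf utility (an assumption; some older platforms lack it), the pages-to-MB divisor can be derived from the page size rather than hard-coded per OS:

```shell
# MB = pages * pagesize / 1048576, so the divisor is 1048576/pagesize:
# 8192-byte pages give 128, 4096-byte give 256, and 16384-byte give 64
divisor=$((1048576 / $(getconf PAGESIZE)))
echo "$divisor"
```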
Linux and MacOS both lack sar, and obtaining the memory and
swap information is more difficult. In the case of Linux, the memory
information can be accessed through the /proc/meminfo pseudo-file.
Since that file presents the output as a series of lines, we will
need to use awk's pattern-matching abilities to select the correct
line for each of our metrics:
awk '/MemFree/ {m=int($2/1024)} \
/SwapFree/{f=int($2/1024)}\
END {print m,f}' /proc/meminfo
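The awk program can be tried against a canned /proc/meminfo fragment (values in kB, as the kernel reports them; the numbers below are made up):

```shell
# Canned /proc/meminfo fragment standing in for the real pseudo-file
printf 'MemTotal:   515548 kB\nMemFree:    102400 kB\nSwapTotal: 1048576 kB\nSwapFree:   524288 kB\n' |
awk '/MemFree/ {m=int($2/1024)} /SwapFree/ {f=int($2/1024)} END {print m,f}'
# prints 100 512
```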
For MacOS X, things are more difficult. OS X uses a dynamic paging
system that can create swap files as needed (assuming that the disk
has enough free space to accommodate such files). This makes our notion
of "free swap space" somewhat bogus since the OS will simply
create additional swap files as space is required. Rather than present
unrealistic numbers for free swap, we'll just punt and report
only the free memory information under Mac OS X. This can be obtained
from the output of vm_stat:
vm_stat | awk '/free:/ {gsub("\\.","");print int($3/256)}'
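A canned vm_stat sample shows the conversion at work (4096-byte pages, so dividing by 256 gives MB; the page count below is made up). Only the "Pages free:" line matches the /free:/ pattern:

```shell
# Canned vm_stat output standing in for the real command; the gsub
# strips the trailing period so $3 becomes a plain page count
printf 'Mach Virtual Memory Statistics: (page size of 4096 bytes)\nPages free:    25600.\nPages active:  10000.\n' |
awk '/free:/ {gsub("\\.","");print int($3/256)}'
# prints 100
```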
Using syslogd for Transporting Information
Now that we have determined how to extract the metrics we want,
the next step is to provide a way to remotely monitor them. Luckily,
most UNIX variants provide this capability in the form of syslog.
The syslog daemon handles messages according to their
priority, which in syslog's case is a "facility.level"
pair. "Facility" refers to the class of information contained
in the message -- common facilities are "kern" for
kernel messages, "daemon" for system daemon messages,
and "auth" for security messages. Because our performance
metrics are a local addition, we will use one of the local facilities.
(For these examples, I chose the "local3" facility.)
The level part of the priority refers to how serious the message
is, ranging from emerg (meaning the system is unusable)
down to debug (normal debug-level messages). Since our
service will be informational, we will log at the info level.
The first step is to configure our central loghost to save the
messages in a file separate from the usual syslog information
(in this example, /var/log/perflog). This can be done by
adding the following line to the syslog.conf file and then
sending a SIGHUP to syslogd (note that many syslogd
implementations require tabs, not spaces, between the selector
and the action):
local3.info	/var/log/perflog
We also need to configure syslog on each client host to send
the performance data to the monitoring host. Again, this is done by
adding a single line to syslog.conf and sending a SIGHUP to
the syslog daemon:
local3.info @monhost.foo.com
Finally, we should test that the above changes are working. Running
the command logger -p local3.info -t TEST hello world on one
of the clients should generate a line like the following in the /var/log/perflog
file on the monhost:
Jan 25 23:26:51 irixhost TEST: hello world
Note that the syslog entry includes both "timestamp"
and "hostname" information, which will be useful later when
we parse the perflog file for the metrics we want.
Gathering the Data
With syslogd configured on the hosts, we can now use the
command pipelines determined in the first step to extract and send
data to the monitoring host. Because we want to monitor the metrics
over time, we will execute the collection command from cron on the
client hosts. As an example, the following crontab entries could
be used to send the load, memory, and swap information from the
irixhost every 15 minutes (linebreaks have been added for clarity):
0,15,30,45 * * * * uptime | awk '{gsub(",",""); print $(NF-1)}' |\
logger -p local3.info -t load
0,15,30,45 * * * * sar -r 1 | awk '{m=int($2/256);s=int($3/2048)}\
END {print m,s}' | logger -p local3.info -t memswp
Similar entries should be made in the other client hosts, adjusting
the memswp line to incorporate the OS-specific metric collection
method previously determined. At this point, we should now be accumulating
performance data in the perflog file we configured on the monhost.
Data Reduction
The final step is to take the gathered data and turn it into something
understandable by non-technical users. We will use the open source
GNUplot program for this task. The first step is to split the data
into separate files for each machine and metric. Each file should
contain XY data for plotting (where the X data will be the time
and date information, and the Y data will be one or more related
metrics). Again, we can do this with awk or sed. For example, we
can extract the load average data for the irixhost client using
a command like the following:
awk '/irixhost load:/ {print $1,$2,$3,$6}' /var/log/perflog >irixhost.load
Or, we can use a simpler sed command to filter out the values we are
not interested in:
sed 's/irixhost load://;t;d' /var/log/perflog >irixhost.load
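With more than a couple of hosts, the splitting can also be done in a single awk pass over the perflog. This is only a sketch: a canned sample stands in for /var/log/perflog, and the tag names match the -t arguments from the crontab entries:

```shell
# Canned perflog sample standing in for /var/log/perflog
cat > perflog.sample <<'EOF'
Jan 25 23:00:00 irixhost load: 0.34
Jan 25 23:00:01 irixhost memswp: 120 800
Jan 25 23:00:02 solarishost load: 1.02
EOF

# $4 is the hostname, $5 the logger tag; write each record to a
# <host>.<metric> datafile, keeping the timestamp for the X axis
awk '$5 == "load:" || $5 == "memswp:" {
    tag = $5; sub(":", "", tag)
    file = $4 "." tag
    line = $1 " " $2 " " $3
    for (i = 6; i <= NF; i++) line = line " " $i
    print line > file
}' perflog.sample

cat irixhost.load
# prints Jan 25 23:00:00 0.34
```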
Once we have created one or more XY datafiles, we can plot them using
the GNUplot program. I find it useful to do the initial plots using
GNUplot's interactive mode. Again using the load average data
from the irixhost client as an example, the following GNUplot settings
will produce a basic plot that is a good starting point for further
customization:
set title "Load Average"
set data style lines
set yrange [0:]
set xdata time
set timefmt "%b %d %H:%M:%S"
set format x "%m/%d %H:%M:%S"
plot "irixhost.load" using 1:4 title "IRIX Workstation"
GNUplot is a very capable program, and entire articles could be devoted
to exploring various options. Here are a few suggestions:
- GNUplot has extensive online help. When in doubt, try the help
command.
- The set xrange command can be used to select a range
of dates to plot.
For example:
set xrange ["Jan 1 00:00:00":"Feb 1 00:00:00"]
- Multiple data sets can be plotted using a single plot command
(e.g., plotting a total of four metrics selected from two different
data files):
plot \
"irixhost.memswp" using 1:4 title "irixhost.foo.com free mem",\
"irixhost.memswp" using 1:5 title "irixhost.foo.com free swap",\
"solarishost.memswp" using 1:4 title "solarishost.foo.com free mem",\
"solarishost.memswp" using 1:5 title "solarishost.foo.com free swap"
- GNUplot supports several output formats including .png and
.eps. Use the set terminal command to select a format,
and the set output command to choose an output file.
- The save and load commands can be used to store the variable
settings for later use once you've found a good layout. This
is especially useful if you tend to look at the same metrics frequently,
because the plot can be updated by simply re-parsing the perflog
file and re-running gnuplot.
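As an illustration of that workflow (the filenames here are just examples): after arranging a plot interactively, save "loadplot.gp" stores the current settings, and a short batch script run from cron can regenerate the graph. Depending on your GNUplot version, the saved file may include its own terminal setting that you will want to edit out:

```gnuplot
set terminal png
set output "irixhost-load.png"
load "loadplot.gp"
```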
Where to Go from Here
It's easy to extend this system to metrics beyond those presented
here. Almost anything that can be run from cron can be timed or
filtered through awk, sed, or Perl to produce a metric that can
be sent to the monitoring host. It's also fairly easy to use
GNUplot from within a cron script to generate near-real-time performance
graphs. Such graphs can even be generated or copied into a directory
that is exported by a Web server to provide wider access to the
performance data. At one site I even extended this concept to send
alpha pages to the sys admin by parsing the perflog file
every few minutes and testing metrics against some established minimum/maximum
values.
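A minimal sketch of that alerting idea, with a made-up host name and limit; a real version would read /var/log/perflog and pipe the alert to a mail or paging command instead of echoing it:

```shell
#!/bin/sh
# Hypothetical threshold check; HOST, LIMIT, and the sample data are
# examples, not values from the article
HOST=irixhost
LIMIT=8.0

# Canned perflog sample standing in for /var/log/perflog
sample='Jan 25 23:00:00 irixhost load: 2.10
Jan 25 23:15:00 irixhost load: 9.50'

# Latest load entry for the host ($4 = hostname, $6 = 5-minute load)
load=$(echo "$sample" | awk -v h="$HOST" '$4 == h && $5 == "load:" {v = $6} END {print v}')

# awk does the floating-point comparison portably
if awk -v l="$load" -v max="$LIMIT" 'BEGIN {exit !(l > max)}'; then
    echo "ALERT: $HOST load is $load (limit $LIMIT)"
fi
```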
Although the system presented here lacks many of the high-end
features found in packages like PCP, I have found it useful on many
occasions over the years. Because this method relies on components
found on almost all UNIX-like operating systems, I often find it
easier to make a couple of syslog.conf and crontab entries
than to install a more complex package just to monitor a single
metric.
Links
http://www.gnuplot.info/
http://www.sgi.com/software/co-pilot/
http://oss.sgi.com/projects/pcp/
Dale Southard is currently a systems administrator with the
Accelerated Strategic Computing Initiative at Lawrence Livermore
National Laboratory in Livermore, California. He can be contacted
at: dsouth@llnl.gov.