Article

SysMon - A Systems Monitor

Bob Ess

Managing UNIX desktops and servers in a large installation (more than 100 desktops and servers) can mean different things to different people. To a system administrator, it can mean "How do I put out all of the fires these users generate on a daily basis without losing my mind?". To an IT manager, it can mean "How many Ultra 5s do I have, how many Ultra 1s have 512 megs or more of memory, and what has my disk space consumption been like for the last 6 months?". Both of these perspectives present their own significant challenges, and their solutions are non-trivial to address.

Any seasoned UNIX system administrator has written scripts to address the monitoring of system resources, whether they be disk, CPU, network, user logins, etc. Data of this nature is constantly requested by management to help them better allocate resources and plan budgets. Thus, it is imperative that the data be as accurate and timely as possible.

SysMon is actually a series of scripts and tools, written over the past several years, that work in concert to provide accurate and timely system data. By using these tools in the manner described in this article, you will be able to track and identify the following:

Workstation/Server Inventory
Status of vital system processes on any machine
Monitors for CPU, DISK and NETWORK for each monitored system
Status of network printers including printer queues
Status of network attached storage (Network Appliance filers)
Status of any server and desktop with hypertext links showing:

- Change history
- Who is logged on to the machine
- Who has been logged in
- Uptime data
- System messages
- Who is doing what (whodo)

Disk consumption over a period of time. Data format exportable to common spreadsheet formats

The scripts used to construct SysMon include:

sys.conf - For more information, see Sys Admin, September 1996

syschange - For system change notification and history tracking (syschange.html, syschange.cgi)

namon.sh - To monitor Network Appliance Filers

wsmon.sh - To monitor any desktop or server system

nightly.sh - To keep all systems updated

prtmon.sh - To monitor network printers (HP specifically)

syslist.cgi - For presenting system configuration data via http

getethers - For polling all machines on all segments

inv.cgi - For presenting system inventory data via http

diskdaily.sh - For tracking server disk resources

I will look at these programs one at a time then show how they work together to help you better manage your systems. Figure 1 depicts the overall structure of the programs. (All listings for this article are available from ftp.mfi.com in /pub/sysadmin.)

sys.conf

This program and a related article (called "Configuration Tracking") were published in the September 1996 issue of Sys Admin. The latest code has a few additions, but it remains relatively unchanged from its initial publication. This program is used to gather configuration information about a system, including both hardware and software data. A sample report that sys.conf will gather for you is shown in Figure 2.

The newest version of sys.conf (available at ftp.mfi.com) uses GNU's RCS (Revision Control System) to maintain multiple versions of the baseline files that sys.conf will generate. This keeps the program from polluting the /var/tmp directory with unwanted copies of the configuration file.

This program should be run from cron each night on every machine in your environment. It can be run by itself, or it can be called from another cron job, such as the nightly.sh script, which will be described later in this article.

syschange

In any UNIX environment that employs more than one system administrator, there must be accurate and timely communication among the administrators concerning significant changes performed on major servers. Why? Perhaps you have made a change to the sendmail.cf file on the mail server. Other administrators currently have no knowledge of the change you have made. Suddenly mail stops working. Another administrator goes into the sendmail.cf and starts making his/her own changes. Soon, changes are made that significantly impact your production environment, and there is no history or documentation of the changes to be referenced. RCS addresses this in regard to files. However, if a file is modified that is not under RCS control, or a significant change is made to a system (add a disk, add memory, crontab changes, reboots, etc.), there must be a mechanism in place to notify all responsible administrators and provide a repository for historical reference. syschange provides this mechanism, either from the command line or from a Web page. The concept and code behind syschange are very simple: send email to a mailing list with a message describing changes made to a system. Figure 3 shows a sample mail output for a change made to a system.

Notice in particular the first line that reads:

[ System Change history for acme95: file:/acme/adm/syschange/acme95 ]

Any mail recipient using the Netscape mail reader can click the file anchor, and the browser will display the directory containing all of the changes submitted for that particular system.

The shell script syschange has three options: -submit, -view, and -list. You can create and distribute a change notice for system server1 as follows:

syschange -submit server1

The script then prompts you for a change summary, the date of the change (defaulting to the current date), the time of the change (defaulting to the current time), whether downtime is anticipated and if so, how long the system will be unavailable. It also provides the option of opening an edit session to include in-depth details of the change.

Once the change is documented, you are prompted for a list of mailing addresses to email. Obviously, an alias distribution list should be set up so you don't have to enter more than one email address.

Finally, the script gives you the option of bailing out. If you do so, no harm done. If you commit to the change, it is recorded in the change database and email is sent to those you specified.

To see which systems have change history files associated with them, use:

syschange -list

This will provide a list of systems and the number of changes recorded against them.
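A report of this kind could be produced with a short loop. The following is a minimal sketch, not the code from the listings: it assumes one directory per system under a base directory, with one file per recorded change, and the function name list_changes is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: print each system that has change records and how
# many there are. Assumes one directory per system, one file per change.
list_changes() {
    base=$1
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue          # no systems recorded yet
        system=$(basename "$dir")
        count=$(find "$dir" -type f | wc -l | tr -d ' ')
        echo "$system $count"
    done
}
```

With the directory named in the sample mail, this would be called as list_changes /acme/adm/syschange.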

To view the change history of a system, use:

syschange -view systemname

syschange will then search the database and provide you with the number of changes recorded for the "systemname", and ask if you want the change records printed, mailed, or viewed.

syschange.html and syschange.cgi have been provided to allow you to use the syschange facility with a Web interface.

namon.sh

We recently migrated from a centralized data access model to a distributed model. This minimizes the impact on the engineering community in the event of a server failure. We are also able to deliver data faster and more reliably using file server appliances than was achieved with a mammoth centralized server. We chose Network Appliance filers for several reasons, including cost, speed, and simplicity.

Although the NetApp filers are simple to install, maintain, and operate, there are still subsystems that must be monitored for optimum performance. Since the filers do not have a traditional shell or a cron facility, I had to make use of the rsh facility available on the filers to implement monitoring. namon.sh is the script that must run on a host trusted to the monitored Network Appliance. It must either run as root or, as we run it in our shop, submit the rsh commands via sudo.

The SysMon main screen will provide access to the following information about a filer:

"green" status if the filer responds to a ping
nfsstat statistics
Operating system version
Disk space statistics
Network interface statistics
Snapshot schedule and disk usage
Uptime numbers
System configuration
Real-time NFS, CIFS, HTTP, and CPU statistics

The script creates a small file for each of these above items, referenced as a link underneath the filer name on the SysMon opening page. The filer name is a link to the Network Appliance Web-based administration page.
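The one-file-per-item approach might look like the sketch below. It is an illustration under stated assumptions, not the namon.sh listing: the function name collect_filer, the default value of WSMONDIR, the item names, and the particular filer commands are all assumptions. REMOTE holds the command prefix used to reach the filer, so it can be replaced with echo for a dry run.

```shell
#!/bin/sh
# Hypothetical sketch: save the output of one remote command per status file.
# REMOTE is the command prefix used to reach the filer ("sudo rsh" in the
# setup described in the article). Item names and commands are assumptions.
REMOTE=${REMOTE:-"sudo rsh"}
WSMONDIR=${WSMONDIR:-/usr/local/sysmon/servers}

collect_filer() {
    filer=$1
    mkdir -p "$WSMONDIR/$filer" || return 1
    # Each line below: output file name, then the command to run remotely.
    while read -r item cmd; do
        # </dev/null keeps the remote shell from consuming the item list.
        $REMOTE "$filer" $cmd > "$WSMONDIR/$filer/$item" 2>&1 < /dev/null
    done <<EOF
version version
df df
uptime uptime
nfsstat nfsstat
sysconfig sysconfig -v
EOF
}
```

Setting REMOTE=echo before calling collect_filer writes the command lines themselves into the files, which is a cheap way to check the loop without touching a filer.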

The shell code for this script is fairly straightforward with the exception of the line that returns the system real-time statistics. Again, since all commands must be done via the remote shell facility, all commands submitted must be processed as one-line commands. The Network Appliance command sysstat takes only one argument - a number for an interval between updates. I needed a method to provide a 10-second snapshot of the system at any point in time. The one-liner below accomplishes this:

((sudo rsh $filer sysstat 1 > ${WSMONDIR}/$filer/sysstat)&);\
sleep 10;sudo kill -9 `ps -ef | grep [s]ysstat | awk '{print $2}'`

How and why this works is left as an exercise to the reader.
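The same idea, running a command that never exits on its own for a fixed time and then stopping it, can also be written by remembering the process ID instead of searching the process table. The helper below is a sketch only; the name capture_for is hypothetical and it is not taken from namon.sh.

```shell
#!/bin/sh
# Hypothetical helper: run a command for a fixed number of seconds, saving
# its output to a file, then stop it using the PID the shell recorded.
capture_for() {
    seconds=$1
    outfile=$2
    shift 2
    "$@" > "$outfile" 2>&1 &      # start the command in the background
    pid=$!                        # PID of the background command
    sleep "$seconds"
    kill "$pid" 2>/dev/null       # stop it once the interval has passed
    wait "$pid" 2>/dev/null       # reap it so no zombie is left behind
    return 0
}
```

For example, capture_for 10 /tmp/sysstat.out rsh filer1 sysstat 1 would keep ten seconds of output. When the command is started through sudo, the kill may need to go through sudo as well, as it does in the one-liner above.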

The number of NetApp filers you have will determine how often and how many instances of namon.sh must run. We currently have five filers, and the script takes around 3 minutes on a Sun Ultra 1. Because the CPU load will vary on whatever machine this script runs on, it is imperative that some type of program locking be implemented. There is nothing uglier than a cron job shell program run amuck on a workstation. When namon.sh is first invoked, it checks for a lockfile. If it finds one, it bails out and waits to be invoked again. If it does not find the lockfile, it creates one to prevent further invocations until this instance terminates. Simple, but effective.
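A lock of this kind can be sketched in a few lines. Note that this version swaps in a lock directory for the lockfile described above, because mkdir tests and creates in one step; the names LOCKDIR, acquire_lock, and release_lock, and the default path, are assumptions.

```shell
#!/bin/sh
# Hypothetical lock helpers. mkdir either creates the directory or fails,
# in a single step, so two instances cannot both believe they hold the lock.
LOCKDIR=${LOCKDIR:-/tmp/namon.lock}

acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}
```

A script would call acquire_lock || exit 0 near the top and arrange for release_lock to run when it finishes, for instance from an exit trap. A lock left behind by a crashed run has to be removed by hand or by an age check.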

wsmon.sh

wsmon.sh is the underlying centerpiece to the SysMon suite of programs. When first conceived, it attempted to be all things to all UNIX boxes, be they workstations, servers, print spoolers, mail servers, NFS servers, etc. It did this by "classing" a CPU with corresponding alert classes. For example, if a machine was a desktop machine, it ran only the code that checked for desktop alerts, such as vold or sendmail. As is usually the case, simpler is better. Instead, it now checks for "critical" processes on each machine it runs on, gathers statistical data, and writes out the HTML code to a common area that syslist.cgi will gather later.

Several small files are output to the common repository for linking on the system's SysMon Web page:

who
whodo
uptime
last
/var/adm/messages

The static output of a top -d1 command for the top 15 processes

wsmon.sh can easily be modified to check for whatever processes you deem critical to your operation. I have selected the following:

inetd
syslogd
lpsched (or equivalent print daemon)
yp-related processes
nfs-related processes
sendmail
cron
automount
vold

I also check and flag the following three "critical" areas:

Disk usage above 95%
A process eating more than 5% of CPU
Network collisions greater than 5%

The critical processes are checked by dumping the output of the ps command to a file and grepping for the process name via a case statement. Since we are going to be outputting HTML code from this script, I set a failed grep to a red ball gif, otherwise I use a green ball gif.
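The check just described can be sketched as follows. This is not the wsmon.sh listing: the function name proc_status, the gif file names, and the daemon names grouped under nfs and yp are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: report a status image for one process name, given a
# file holding saved ps output. The gif file names are assumptions.
proc_status() {
    psfile=$1
    name=$2
    case $name in
        nfs) pattern='nfsd|mountd|lockd|statd' ;;   # nfs-related processes
        yp)  pattern='ypbind|ypserv' ;;             # yp-related processes
        *)   pattern=$name ;;
    esac
    if grep -E "$pattern" "$psfile" > /dev/null 2>&1; then
        echo "$name green.gif"
    else
        echo "$name red.gif"
    fi
}
```

Because this matches substrings anywhere on the line, a name such as cron would also match an editor session on a crontab file; a stricter pattern may be worth the effort on busy machines.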

The disk check simply determines whether usage is above a certain threshold. You can set this to whatever you like. I check for anything above 98% for realtime monitoring. In practice, if you have a disk approaching 80%, it is probably time to check historical usage and determine whether the system needs more disk. That is the purpose of diskdaily.sh. I check the output of that program each morning. I use wsmon.sh disk monitoring for things that run away during the day (e.g., a runaway Netscape process filling up the CDE errorlog). Catching and correcting such a condition via wsmon.sh keeps anomalous disk usage data from sneaking into the historical disk usage data.

wsmon.sh checks for processes eating more than 5% of the CPU with the command:

ps -ef -o pcpu -o pid -o comm | awk \
  '{print $1, $2, $3}' | grep "^[5-9].*"

If there is an offending process, the program tags it and writes out the process name, PID, and usage. This is then given a red blinking ball to indicate the CPU is experiencing abnormal usage.
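A pattern such as ^[5-9] looks only at the first character of the line, so a value like 12.5 does not match it. A numeric comparison in awk is one alternative. The filter below is a sketch, with the name cpu_hogs and its default threshold chosen for illustration; it reads lines of the form "pcpu pid comm" on standard input.

```shell
#!/bin/sh
# Hypothetical filter: read "pcpu pid comm" lines on standard input and print
# those whose CPU percentage exceeds the threshold (default 5).
cpu_hogs() {
    # Adding 0 forces a numeric comparison; the header line evaluates to 0.
    awk -v limit="${1:-5}" '$1 + 0 > limit + 0 { print $1, $2, $3 }'
}
```

It could be fed from ps, for example ps -e -o pcpu -o pid -o comm | cpu_hogs 5.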

The network indicator will change from green to red if collisions are in excess of 5%. This is done with simple division from the output of the netstat command. Figure 4 shows the HTML output of wsmon.sh running on a host.
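The division can be sketched as an awk filter over netstat -i output. This is an illustration, not the wsmon.sh code: it assumes Solaris-style columns (Name, Mtu, Net/Dest, Address, Ipkts, Ierrs, Opkts, Oerrs, Collis, Queue), takes the rate to be collisions divided by output packets, and uses the hypothetical name collision_pct.

```shell
#!/bin/sh
# Hypothetical filter: read netstat -i output on standard input and print each
# interface with its collision percentage (collisions / output packets).
# Assumes Solaris-style columns, with Opkts in field 7 and Collis in field 9.
collision_pct() {
    # Skip the header, and skip interfaces that have sent no packets.
    awk 'NR > 1 && $7 > 0 { printf "%s %.1f\n", $1, $9 * 100 / $7 }'
}
```

On a system with a different column order, the field numbers would need to change.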

nightly.sh

The nightly.sh script has proven invaluable for the following:

Non-realtime updates to a system's environment
Data collection
Verifying system integrity

The main concept of the nightly.sh is to put it in every machine's root crontab file (if it needs to run as root), and ensure all machines have NFS access to the same nightly.sh. Some people call this a network cron. You can make a change to only one file (nightly.sh), and it affects your entire environment. Generally, this is a good thing. Test your changes to the script on one machine before you push it out to the world. Depending on how you install workstations at your site, the root crontab should be modified on your master OS distribution server so all new installs get the new crontab without your intervention.

One can argue that everything the nightly.sh is doing can be done in the initial install itself. This argument is not without merit. However, it has been my experience during the past 10 years that the only thing constant is change. And a mechanism is needed for pushing out that change. rdist addresses this for almost every situation, but I have still found a need for having the nightly.sh in place for the following:

Collecting log data and moving to a common repository
Calling the sys.conf program
Checking disk space for morning report runs
Keeping motd current
Copying in crontab modifications

Listing 1 shows the code for the nightly.sh we run at our shop.

prtmon.sh

Printers are the bane of system administrators everywhere. They are a necessary evil in any computing environment. Any device with as much mechanics as modern day printers and plotters is going to make your life difficult. prtmon.sh will not relieve you of the heat, but it will provide you with the following information from a Web interface:

Printer name
Green ball if it's online, red ball if it's off the net
Model
Location
Control panel contents
Online status
Uptime
Link to the print spool queue
Link to verbose printer information

All of this data is collected from Hewlett Packard printers using the jetadmin software. I have experimented a bit with the QMS printers, and most of the information is available as well.

Listing 2 shows the code used to gather the printer data and to create the Web page. (All listings for this article can be found at ftp.mfi.com in /pub/sysadmin.)

syslist.cgi

syslist.cgi is the main CGI script that will provide you with the output you see in Figure 5. It tables the servers, printers, and filers and provides an at-a-glance update of their status.
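The general shape of such a CGI script, gathering HTML fragments written earlier by the monitoring scripts, might look like the sketch below. It is not the syslist.cgi listing: the function name emit_status_page and the layout of one *.html fragment per host in a single directory are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a status-page CGI: emit the HTTP header, then wrap
# the HTML fragments found in one directory in a table. The layout of one
# fragment per host (*.html) is an assumption.
emit_status_page() {
    dir=$1
    echo "Content-type: text/html"
    echo                                   # blank line ends the header
    echo '<html><body><table>'
    for frag in "$dir"/*.html; do
        [ -f "$frag" ] || continue         # directory may still be empty
        cat "$frag"
    done
    echo '</table></body></html>'
}
```

Keeping the CGI this thin means the page costs almost nothing to serve, since the expensive checks have already run.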

Servers

If you see a blinking red ball next to a server name, one of the following conditions exists:

A process is using an inordinate amount of CPU
Free disk space is at 3% or less
Network collisions are in excess of 5%
A process you have deemed critical on the machine is no longer in the process list

Regardless of the status of the machine, by clicking on the link of the server name, you are presented with the screen you see in Figure 6.

If the CPU status is green, the link will present you with the top 15 processes at the time wsmon.sh took a snapshot of the process table. If it is red, the link will show you the process consuming the most CPU.

If the DISK status is green, the link will show you the most recent df. Snapshots are taken every 20 seconds (configurable) so the data is usually quite accurate. If the DISK status is red, the link will show you what filesystems have exceeded the threshold.

If the network interface card is experiencing collisions at a rate of 5% or less, the NET status is green. Otherwise the status is red, and the link will show you the output of a netstat -i as well as the collision rate.

Other links that are shown below the critical processes are:

who - output of the who command during the last snapshot
whodo - output of the whodo command during the last snapshot
uptime - output of the uptime command during the last snapshot
last - output of the last command during the last snapshot
Messages - copy of /var/adm/messages
Change history - a link to the change history database created by the syschange script.

Printers

If the printer status is blinking red, the printer is not responding to a ping. Depending on how you feel about printers, this may not be such a bad thing. However, your users still want to print, so you should care.

The printer link presents the image in Figure 7. There are two links at the top of the HTML table: the printer name, which has verbose status information gleaned from the HP JetAdmin software; and a link to the actual printer queue cgi script (lpstat.cgi).

Filers

If the FILERS are blinking red, they are not responding to pings, and all other status links are irrelevant. Thankfully, we have found them to be quite reliable and use the links on the page to give us the following information:

nfsstat - A server side nfsstat showing a typical nfsstat output
version - The operating system version
df - Disk space stats for the filesystem and the NetApp snapshot directory
netstat - netstat statistics
snap - The NetApp snapshot schedule
uptime - Just as it says
sysconfig - The filer configuration
messages - The filer's messages file
sysstat - Current system performance

inv.html

This simple script parses the files created by sys.conf and creates an HTML page representing your site's inventory. Written in Korn shell, it can take several minutes to parse through several hundred configuration files. I run this at 2:00 a.m. every morning.

getethers

This is not a script but a C program written by Dave Curry of Purdue University. This program queries a segment and returns all responding MAC addresses and their corresponding IP addresses. I have found this to be an invaluable resource for keeping track of machines in a large shop. I run getethers every night for each segment, parse the output to create a list of all workstations and servers at our site, and use that list to remote copy the sys.conf configuration files.

diskdaily.sh

Although I don't currently employ the output of this program in the Web interface for SysMon, I have included it here because at some point I will. Also, I think it shows a simple but effective method for tracking disk usage at your site.

The script provides two types of output. One is a space-delimited file for export into Excel. Three times a day, at 10:00 a.m., 12:00 p.m., and 3:00 p.m., I snapshot all disk space on the servers. The script outputs a line similar to the following:

980619_12:00 1427600651 782951294 616719433

Field one is the timestamp; field two is total configured disk space; field three is disk space used; and field four is disk space available. This file format exports very nicely to Excel and makes pretty charts that our managers all love to see. It provides real data about your disk usage over time.
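A line in this format can be produced by totaling df output. The following is a sketch under assumptions rather than the diskdaily.sh listing: it assumes df -k style input with one filesystem per line, and the function name summarize_df is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: total the kbytes, used, and avail columns of df -k
# output read from standard input and print one timestamped line.
summarize_df() {
    stamp=${1:-$(date +%y%m%d_%H:%M)}      # e.g. 980619_12:00
    awk -v stamp="$stamp" '
        NR > 1 { total += $2; used += $3; avail += $4 }
        END    { printf "%s %.0f %.0f %.0f\n", stamp, total, used, avail }'
}
```

It might be used as df -k | summarize_df >> usage. Some versions of df wrap long device names onto a second line, which would need to be rejoined before the totals can be trusted.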

The script also outputs a detailed report for each server you designate. These reports can then be parsed at a later time to provide disk usage on a per-server basis.

Tying It All Together

Okay, so now we have three CGI scripts, seven Korn shell scripts and no clue how to make it all work. Granted, it can seem a bit like a spider web, but I hope the following will make installation and configuration a little less painful.

Find a common NFS-accessible area in which to install the scripts, and set aside a similar area as the data repository. For example, you can create the following hierarchy:

/usr/local/sysmon - This is the top level of the SysMon hierarchy

/usr/local/sysmon/bin - In this directory, the install program will install the following:

wsmon.sh
namon.sh
prtmon.sh
sys.conf
diskdaily.sh
invhtml.sh

/usr/local/sysmon/servers - This directory will initially be empty. After the first run of wsmon.sh, it will contain the HTML and data files used by syslist.cgi.

/usr/local/sysmon/disks - This directory will initially be empty as well. After the first run of diskdaily.sh, it will contain the directory "reports" and the file "usage".

/usr/local/sysmon/printers - After the first run of prtmon.sh, this directory will contain the HTML and statistics files for all configured printers.

/usr/local/sysmon/configs - This directory will hold all configuration files created by sys.conf.

/usr/local/sysmon/gifs - This directory holds the gifs used for the Web pages.

Pull down the code file from the Sys Admin ftp server (ftp.mfi.com in /pub/sysadmin). Uncompress and untar the file. After reviewing the README file for any gotchas, run the install.sh script. This will ask you a few pertinent questions about your environment. Once done, it will install as much of SysMon as it can.

You will still need to modify the root crontab on all machines. You can do this one of two ways:

If you are absolutely sure that all of your workstations and servers have the exact same root crontab (unlikely), you can rdist out a new crontab file.
If there are disparate crontabs running around, find or create a trusted host. Run installcron.sh on the trusted host. It will remote shell to all of the machines you specify and append the nightly.sh entry to the root crontab.
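When appending to crontabs that differ from machine to machine, it helps if the append is safe to repeat. The sketch below is not installcron.sh; the function name add_cron_entry is hypothetical, and it works on a saved copy of the crontab rather than on the live one.

```shell
#!/bin/sh
# Hypothetical sketch: append an entry to a saved crontab file only if the
# identical line is not already present, so reruns do not add duplicates.
add_cron_entry() {
    file=$1
    entry=$2
    # -F treats the entry as a literal string; -x requires a whole-line match.
    grep -F -x -e "$entry" "$file" > /dev/null 2>&1 && return 0
    echo "$entry" >> "$file"
}
```

On each machine the sequence would be to save the current table with crontab -l, call add_cron_entry on the saved file with the nightly.sh line (the schedule and path are site-specific), and load the result back with crontab.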

Run install_web on your web server. This will install the following:

syslist.cgi - This gives the sysmon main screen
syschange.html - Browser-based syschange interface
syschange.cgi - CGI script to process syschange data
lpstat.cgi - CGI script to query printer queues and display that output in HTML

Finally, run install_S99sysmon from your trusted host. This will install a startup script in /etc/init.d with a hard link into /etc/rc2.d. sysmon.sh will then be invoked at bootup.

Summary

This article and the accompanying software and documentation have shown how to automatically track your network's resources, monitor major computing and printing subsystems, and present the information in HTML. The scripts I have reviewed are like screwdrivers in your sysadmin toolbox. Bundled together, they present accurate and timely data about your network and its resources. Take the time to learn the ins and outs of these scripts and the SysMon framework. If you invest the time upfront, it will save you hundreds of hours over the long haul.

About the Author

Bob Ess is manager for the CAE UNIX Support Group for Fujitsu Network Communications in Richardson, Texas. Even though he is a manager, a trickle of his UNIX skills are allegedly intact. He can be reached at bob.ess@fnc.fujitsu.com.