One of the major responsibilities of any UNIX systems administrator is to know what is happening on the network at any given time. Once the number of machines on the network exceeds about seven, this process becomes dramatically more difficult. Determining the state of a system, before being told by an irate user that it's down, is an art form that only the paranoid and masochistic could embrace.
Having faced this situation too many times, I have written and rewritten innumerable little scripts to check on the health of a system. Reporting network information was always a problem. I would either send myself bits of mail (and hope I read them), or I would compromise security by automating direct connections between machines, generally using some form of remote shell. Nonetheless, I still spent a significant amount of time telnetting around the network, logging in and generally making sure that all was well, day and night.
So, I began looking into products to perform UNIX systems monitoring and notification. The systems I investigated had several serious drawbacks. They were quite expensive, some even requiring their own replicated servers to run. They seemed very complicated, requiring a considerable amount of time to install and configure. Most consumed vast amounts of system resources, because they were so powerful and were operated from a central console. I just wanted to know how my machines were doing, without having to go into the office.
Fortunately, the past few years have seen an explosion in the Internet and the Web. One great function of HTML is its ability to create an attractive, portable, and remotely accessible graphical user interface (GUI) quickly and easily. So, I decided to build Big Brother, a Web-based UNIX systems and network monitor.
Among my requirements was the ability to integrate information from other packages, like Legato Networker.
In short, I wanted a Big Brother to watch the entire network for me.
Design
The central problem was not so much what the scripts were to do, but how to transmit their results to a central location in a secure fashion. For that purpose, I wrote two simple programs: a client (bb) that sends single lines of data, and a server (bbd) that receives them on TCP port 1984.
Monitoring connectivity within the network proved straightforward. I used a Bourne shell script to ping every host I wanted monitored, and wrote bbnet to test any port on any server. If bbnet is given an HTTP address, it not only checks for the presence of a Web server, but also displays the output from the server. This network monitoring script is called bb-network.sh and is executed every 5 minutes.
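The connectivity check described above can be sketched in a few lines of Bourne shell. This is an illustration only, not the bb-network.sh script itself: the function names are hypothetical, and the default ping invocation (ping -c 1) is an assumption that varies between UNIX flavors.

```shell
#!/bin/sh
# Assumed default; the right ping flags differ from system to system.
PING=${PING:-ping -c 1}

# conn_color HOST: print green if the host answers a ping, red otherwise.
conn_color() {
    if $PING "$1" </dev/null >/dev/null 2>&1; then
        echo green
    else
        echo red
    fi
}

# check_hosts: read hostnames on stdin, print "hostname.conn color" for each.
check_hosts() {
    while read -r host; do
        echo "$host.conn $(conn_color "$host")"
    done
}

# Usage: printf 'coffee\ntea\n' | check_hosts
```

Keeping the ping command in a variable makes it easy to substitute the correct invocation for each operating system.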
I decided that each machine on the network should have a small client that measures local information and sends it back to a central location, the display server. The local client script, named bb-local.sh, runs every 5 minutes and measures disk usage (using df), CPU usage and number of users (using uptime), and whether certain important processes are running (with ps and grep). It sends all this information to the display server using bb.
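As a rough illustration of the disk portion of such a local check, the sketch below turns df output into color-coded lines. It is not the bb-local.sh code; the function names and the 90% warning level are assumptions made for the example (only the 95% alert level appears in this article).

```shell
#!/bin/sh
# Hypothetical thresholds; adjust to taste.
DISK_WARN=${DISK_WARN:-90}
DISK_PANIC=${DISK_PANIC:-95}

# disk_color PERCENT: map a usage percentage to a status color.
disk_color() {
    if [ "$1" -gt "$DISK_PANIC" ]; then
        echo red
    elif [ "$1" -gt "$DISK_WARN" ]; then
        echo yellow
    else
        echo green
    fi
}

# disk_report: read "df -kP"-style output on stdin and print one line per
# filesystem in the form "mount-point percent color".
disk_report() {
    # Skip the header line; column 5 is capacity, column 6 the mount point.
    awk 'NR > 1 { sub(/%/, "", $5); print $6, $5 }' |
    while read -r mount pct; do
        echo "$mount $pct $(disk_color "$pct")"
    done
}

# Usage: df -kP | disk_report
```

The -P flag asks df for the portable one-line-per-filesystem format, which keeps the column positions predictable.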
Finally, if any of these scripts notices trouble, it uses bb to send a numeric message to the pager server. The pager server then calls the script bb-page, which is just a wrapper for kermit that dials the pager number and transmits the numeric message. Figure 2 shows how these parts interconnect.
The Core of Big Brother: bb, bbd, and bbnet
Because all the elements of the system use bb to send messages and bbd to receive them, it is important to understand the format of the messages they send. Again, the emphasis was on simplicity, so single lines of data are sent and received. The format of the bb command is as follows:
bb [ip-address] "msg-type message"
ip-address the IP address of the machine running the bbd server
msg-type either status or pager
For status messages, the message portion is in the following format:
hostname.area color-code date explanation
hostname the hostname where the report is from
area the type of report (e.g., conn, cpu, disk, or msgs)
color-code green (OK), yellow (warning), or red (alert)
date the date and time the report was generated
explanation a descriptive message like "/usr is 65% full"
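A status line in this format could be assembled with a small helper like the one below. The function name status_msg is hypothetical, and the date field simply uses the output of the date command, which is an assumption based on the sample report shown later in this article.

```shell
#!/bin/sh
# status_msg HOST AREA COLOR EXPLANATION...
# Print a line in the form: status hostname.area color-code date explanation
status_msg() {
    host=$1
    area=$2
    color=$3
    shift 3
    echo "status $host.$area $color $(date) $*"
}

# Usage (the ip-address argument is that of the machine running bbd):
#   bb ip-address "$(status_msg coffee disk green "/usr is 65% full")"
```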
For pager messages, the message portion is in the following format:
pager-number error-code ip-address
pager-number the pager number to call
error-code a numeric error code from Big Brother indicating the problem:
100 - Disk Error. The disk is over 95% full.
200 - CPU Error. CPU load average is unacceptably high.
300 - Process Error. An important process has died.
400 - The message file contains a serious error.
500 - Ping connectivity error, can't connect.
600 - Web server HTTP error - server is down.
911 - User Page. Message is phone number to call back.
ip-address The ip-address of the affected system.
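The pager message can be built the same way. In the sketch below, the function names are hypothetical, and the pairing of each report area (disk, cpu, procs, msgs, conn, http) with an error code is inferred from the descriptions in the table above rather than taken from the Big Brother scripts.

```shell
#!/bin/sh
# pager_code AREA: print the numeric error code for a report area.
# The area-to-code mapping is an assumption made for this example.
pager_code() {
    case $1 in
    disk)  echo 100 ;;
    cpu)   echo 200 ;;
    procs) echo 300 ;;
    msgs)  echo 400 ;;
    conn)  echo 500 ;;
    http)  echo 600 ;;
    *)     return 1 ;;
    esac
}

# pager_msg PAGER-NUMBER AREA IP-ADDRESS
# Print a line in the form: pager pager-number error-code ip-address
pager_msg() {
    code=$(pager_code "$2") || return 1
    echo "pager $1 $code $3"
}

# Usage: bb ip-address "$(pager_msg 5551234 disk 10.0.0.1)"
```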
When bbd receives one of these messages, it determines what action to take based on whether it is a status message or a pager message. For a status message, it creates a log file in the log directory called hostname.area (e.g., coffee.disk) that gets processed by the display server. For a pager message, it calls the shell script bb-page to issue the numeric message using kermit.
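The dispatch logic can be illustrated in shell, although bbd itself is a C program and this is not its code. The function name handle_line and the BBPAGE hook (which stands in for bb-page and defaults to echo) are hypothetical. The check on the file name is an added precaution, since the name arrives over the network.

```shell
#!/bin/sh
# handle_line LOGDIR LINE: act on one received message line.
handle_line() {
    logdir=$1
    line=$2
    kind=${line%% *}          # first word: status or pager
    rest=${line#* }           # everything after the first word
    case $kind in
    status)
        name=${rest%% *}      # hostname.area, e.g. coffee.disk
        body=${rest#* }       # color-code, date, and explanation
        # Refuse names that could escape the log directory.
        case $name in
        */*|.*|'') return 1 ;;
        esac
        printf '%s\n' "$body" > "$logdir/$name"
        ;;
    pager)
        # Hand the pager-number, error-code, and ip-address to the hook.
        "${BBPAGE:-echo}" "$rest"
        ;;
    *)
        return 1
        ;;
    esac
}

# Usage: handle_line www/logs "status coffee.disk green ... /usr is 65% full"
```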
Included in Big Brother is bbnet, a generic server testing program. The idea is simple: open a port on a machine and see what comes back. Currently bbnet tests Web servers, but it could also be used to test ftp, telnet, or any other server. It displays the first 256 characters of the response from the server.
bbnet [URL | machine-name|ip-address:port]
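To make the two argument forms concrete, the following sketch splits such an argument into a host and a port. It is only an illustration of the syntax above; bbnet is a C program, and the function name parse_target and the default of port 80 for URLs without an explicit port are assumptions.

```shell
#!/bin/sh
# parse_target ARG: print "host port" for a URL or a host:port argument.
parse_target() {
    case $1 in
    http://*)
        hostport=${1#http://}        # drop the scheme
        hostport=${hostport%%/*}     # drop the path, if any
        case $hostport in
        *:*) echo "${hostport%%:*} ${hostport##*:}" ;;
        *)   echo "$hostport 80" ;;  # assumed default Web port
        esac
        ;;
    *:*)
        echo "${1%%:*} ${1##*:}"
        ;;
    *)
        return 1
        ;;
    esac
}

# Usage: parse_target http://coffee/index.html
#        parse_target coffee:21
```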
Finally, for completeness, touchtime reports what time it was precisely 30 minutes ago, in a format compatible with the touch command. It is used with touch to create a reference file on the display server; data files are compared against this file to ensure that all reports are current.
The Web Display
When running, Big Brother will create lots of little status files in the log directory. These files are named using the hostname of the reporting machine and the area being reported on. So, for my machine named "coffee", the following files would be created in the log directory:
coffee.conn Network connectivity status
coffee.cpu CPU status
coffee.disk Disk space status
coffee.procs Running processes status
coffee.msgs System files status
coffee.http HTTP status
The contents of these files will all resemble the following disk report:
green Mon Nov 18 14:04:53 EST 1996 /usr is 65% full
Because the files are named by machine and area, and given that the first word of each of these files tells us its color-coded status, this information can be displayed on a Web page as a matrix of machines and areas being monitored. Figure 1 shows what the Big Brother web display looks like. Within the matrix, colored balls correspond to the status of any area at the time the report was issued. Clicking on the colored ball will display additional information. The scripts mkbb.sh and mkbb2.sh create the Big Brother Web pages.
The only area that approaches cleverness in the Web display is checking that the reports in the log directory are fresh. If any report is over 30 minutes old, the corresponding ball in the display matrix is changed to purple. This means that reports from that host are not being received by bbd, usually because bb has stopped running on that host. As previously mentioned, touchtime and touch are used to create a file exactly 30 minutes old, and all reports are measured against the timestamp on this file.
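One way to perform this kind of comparison is with find and its -newer test, as sketched below. This is an assumed approach, not necessarily how mkbb.sh does it, and the function name stale_reports is hypothetical. The reference file is taken to be one whose timestamp has already been set to 30 minutes in the past.

```shell
#!/bin/sh
# stale_reports LOGDIR REF: list the status files in LOGDIR that are not
# newer than the reference file REF. Keep REF outside LOGDIR so that it is
# not listed itself.
stale_reports() {
    find "$1" -type f ! -newer "$2" -print
}

# Usage, assuming /tmp/bb-ref was stamped 30 minutes in the past:
#   stale_reports www/logs /tmp/bb-ref
```

Every file this prints would be shown as purple in the display.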
The overall state of the network is summarized by examining each of the log files in the directory to determine the most severe condition on the network at the time. The script mkbb.bkg uses this to generate the background color of the Big Brother display page. In order of increasing severity, these background colors are:
green All is well on the network
yellow There is a warning somewhere on the network
purple Host is not reporting in for some reason
red Severe condition needs attention
Thus, the network status is clearly visible from the background color of the screen. The downside of this simplicity is that absolutely anyone looking at a Big Brother display knows the status of the network. Everybody becomes an expert, and they'll want to know why that screen is red.
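Picking the most severe color is a simple ranking exercise. The sketch below shows one way to do it using the ordering given above; it is not the mkbb.bkg script, and the function names are hypothetical.

```shell
#!/bin/sh
# severity COLOR: print a numeric rank, higher meaning more severe.
severity() {
    case $1 in
    green)  echo 0 ;;
    yellow) echo 1 ;;
    purple) echo 2 ;;
    red)    echo 3 ;;
    *)      echo 0 ;;   # treat unknown words as harmless
    esac
}

# worst_color: read status lines on stdin (color is the first word of each)
# and print the most severe color seen, or green if there is no input.
worst_color() {
    worst=green
    while read -r color rest; do
        if [ "$(severity "$color")" -gt "$(severity "$worst")" ]; then
            worst=$color
        fi
    done
    echo "$worst"
}

# Usage: for f in www/logs/*; do head -n 1 "$f"; done | worst_color
```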
Downloading the Big Brother Source Code
The source code, demos, and additional information are available on the Big Brother home page: http://www.iti.qc.ca/iti/users/sean/bb-dnld/ and via ftp from the Miller Freeman site: ftp://ftp.mfi.com/pub/sysadmin/.
To install the archive, decide where you want Big Brother to live. The archive will create a new directory, bb, that will house the system. This directory shall henceforth be forever known by the environment variable BBHOME. To extract Big Brother, issue the following commands:
gzip -d bb-src.tgz
tar xvf bb-src.tar
The Big Brother system is structured as follows:
doc/ Documentation and configuration scripts
etc/ Where all the configuration files live
src/ bbd.c, bb.c, bbnet.c, and touchtime.c programs
bin/ Where the ported binaries and shell scripts live
web/ Scripts that create the Big Brother web pages
www/ The directory that should be linked into your Web site
www/logs The directory where bbd writes status files
www/notes A place to put information about monitored systems
Environment variables important to Big Brother are:
BBHOME The top level directory where Big Brother is installed
BBDISPLAY The machine with the Web Server, a.k.a. the Display Server
BBNET The machine that will run the network monitor
BBPAGER The machine that will process pager requests; needs kermit and a modem.
And finally, the whole package is completely dependent on the information you place in the file etc/bb-hosts.
Configuring Big Brother
Automatic configuration is supported for SCO, FreeBSD, Solaris, Linux, HPUX 10, SunOS 4.1, NetBSD, OSF, Ultrix, and Irix. To run the automatic configuration program from the top level Big Brother directory, enter:
cd doc
./bbconfig [OS-NAME]
where OS-NAME is one of: sco, freebsd, solaris, hpux, linux, sunos, netbsd, osf, ultrix, or irix. This program just adjusts src/Makefile and copies the appropriate system definitions from etc/bbsys.OS-NAME to etc/bbsys.local.
For security purposes, I isolated all commands used by Big Brother, and placed the full pathname of each command into its own environment variable. These definitions live in the file etc/bbsys.sh. The defaults can be overridden by redefining the variables in the file etc/bbsys.local. Listing 1 shows a sample of this file for FreeBSD.
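The general pattern of defaults plus local overrides looks something like the sketch below. The variable names, the paths, and the function load_bbsys are hypothetical and chosen only to show the idea; consult the files themselves for the real definitions.

```shell
#!/bin/sh
# load_bbsys BBHOME: set default command paths, then apply local overrides.
load_bbsys() {
    # Defaults (the role played by etc/bbsys.sh); names are illustrative.
    DF=/bin/df
    PS=/bin/ps
    GREP=/bin/grep
    # Site-specific overrides (the role played by etc/bbsys.local).
    if [ -r "$1/etc/bbsys.local" ]; then
        . "$1/etc/bbsys.local"
    fi
}

# Scripts then run commands only through the variables, for example:
#   load_bbsys "$BBHOME"
#   $DF -k
```

Because every command is invoked by its full pathname, a doctored PATH cannot trick the scripts into running the wrong program.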
If you are not running one of the automatically configured operating systems, you will have to edit the Makefile, create your own version of the bbsys.local file and define where these commands live on your system.
Big Brother is highly configurable. You can tell Big Brother what constitutes a warning and what constitutes an urgent situation. For each area, you can also specify whether or not you want to be paged. The defaults are quite sensible; however, you can adjust the pre-set values by editing the file etc/bbdef.sh. You will also have to tell Big Brother where to find kermit and the pager number to call. The standard version of etc/bbdef.sh is shown in Listing 2.
If you were able to autoconfigure, or once you've edited the Makefile, compile the binaries by issuing the following commands:
cd ../src
make
If there are no problems, you can install the binaries:
make install
Next, edit the file runbb.sh. It needs to have the environment variable BBHOME set to the directory where Big Brother lives:
BBHOME="/home/sean/bb"
The entire system rests on the information you place in the etc/bb-hosts file. This file is really similar to your /etc/hosts file (in fact it used to be the same), but with additional information in what were the comment fields. Refer to Listing 3 for an example of a very simple bb-hosts file.
Big Brother includes a network monitor script, which runs on the machine defined in the etc/bb-hosts file as BBNET. This script checks every host listed in the etc/bb-hosts file for connectivity via ping and, if a host's line contains a URL, also uses bbnet to check for a Web server.
Pager alerts are sent to the machine defined as BBPAGER in the etc/bb-hosts file, which needs kermit and a modem installed to work correctly.
Finally, the machine you define as BBDISPLAY in the etc/bb-hosts file is the Web server. All status reports are sent here, and bbd creates the files in the www/logs directory. This information becomes the basis for the Big Brother pages bb.html and bb2.html that live in the www/ directory.
The keywords required for configuration are:
BBDISPLAY machine for Web display
BBPAGER your pager server
BBNET network monitor machine
http:// check this URL on this box
Note that you can use one machine for all of the above servers or any combination of the above. It may seem a little confusing, but it makes things very flexible.
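To show how scripts might pick these keywords out of the file, here is a small sketch. It assumes each line has the form "ip-address hostname # keywords", in the manner of an /etc/hosts entry with a trailing comment; the function name hosts_with and the sample addresses are hypothetical.

```shell
#!/bin/sh
# hosts_with KEYWORD: read a bb-hosts style file on stdin and print the
# hostname of every line whose comment field contains KEYWORD.
hosts_with() {
    awk -v kw="$1" '
        /^[ \t]*#/ { next }                 # skip whole-line comments
        {
            n = index($0, "#")
            if (n == 0) next                # no keyword field on this line
            m = split(substr($0, n + 1), tags, " ")
            for (i = 1; i <= m; i++)
                if (tags[i] == kw) { print $2; break }
        }'
}

# Usage: hosts_with BBDISPLAY < "$BBHOME/etc/bb-hosts"
```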
Running Big Brother
Now you're ready to test Big Brother for the first time. It helps to run the test on the BBDISPLAY machine, since you can check the results fairly easily by looking in the www/logs directory.
Go to the directory you defined as BBHOME and issue the following command:
./runbb.sh &
If there are files appearing in the www/logs directory, then things are looking good. The corresponding pages www/bb.html and www/bb2.html should also have been created, and all these files should get re-created every 5 minutes.
If all of the above seems to be functioning, then it's time to view the pages. Since most Web servers isolate their data under DocumentRoot or the like, the easiest way to get these pages on-line is to choose where in your Web site to install them, and create a symbolic link there to the BBHOME/www directory. For example, if your Web document root directory is /usr/www/docs, and BBHOME is set to /usr/acct/sean/bb, then issue the following commands to make the pages accessible:
cd /usr/www/docs
ln -s /usr/acct/sean/bb/www bb
You should then be able to access the pages via your favorite browser using the URL:
http://your-machine-name-here/bb/bb.html
Password protecting this area is also highly recommended, just on principle.
Installing the Clients
The next thing to do is to put Big Brother on all the clients you want to monitor. You can simply replicate your Big Brother directory on the different clients and execute runbb.sh. The clients don't really need everything, but it's easier to install it this way. However, if you're running in a heterogeneous environment, you'll have to port Big Brother again for each platform. Assuming the information in the etc/bb-hosts file is correct, you should see information about the clients you are monitoring begin to come into the www/logs directory on the display server.
The alternative for those running in a homogeneous environment is to do the following:
cd $BBHOME/doc
./bbclient [client-hostname]
Assuming you've configured your etc/bb-hosts file correctly, this script will create a tar archive called bb-client-hostname.tar that you can install on the remote machine. Note that this archive is created in the directory above BBHOME. Bring it across to the client, extract it, and execute runbb.sh as described earlier. Once you're really confident with it, have runbb.sh executed at system startup.
Debugging Big Brother
By far the most common problem is that the etc/bb-hosts file is incorrect. Machine names are case sensitive, and the BBDISPLAY and BBNET variables must be defined. Next, if no status reports are being created in the www/logs directory, you can test bb manually by setting BBHOME in your environment, then issuing the following commands:
./bbd
./bb ip-address "status test.test hello world"
This should create a file called test.test in the www/logs directory, containing the text "hello world".
If client data does not appear to be coming in, remember that runbb.sh must be running on each client that you want to monitor completely. The bb-local.sh script collects the local client data and uses bb to transmit it to bbd on the server you've defined as BBDISPLAY in etc/bb-hosts, where bbd writes it to the www/logs directory.
Syntax errors and complaints about incorrect formats for commands can usually be isolated to etc/bbsys.local, where all commands are defined. If you're not using a system that bb has been ported to, some adjustments to this file may be required.
Conclusion
Big Brother is a useful example of using a small client-server routine combined with HTML and a Web server to create a GUI front-end for otherwise ordinary shell scripts. It's not perfect, but it does demonstrate the flexibility of the Web to share and disseminate vital and useful information.
Big Brother is not a replacement for a qualified systems administrator, but it is an excellent assistant. It's easy to set up and can be integrated with any tool that can execute a UNIX command. I can now go for coffee in peace, secure in the knowledge that if something goes awry, I will be paged with an error code and the IP address of the machine involved. Even if something goes wrong in the middle of the night, I can call in, check the Web page, and get concise information about the network.
Since Big Brother's creation, many people have helped with porting and with reporting problems, too many to list here. I've received lots of email from happy users and have seen Big Brother propagate from Canada through the United States, Europe, Australia, Russia, Tahiti, and Namibia. Your feedback has been essential, exciting, and greatly appreciated.
Big Brother has saved me lots of time and effort, and I hope it will do the same for any administrator who installs it. It's the only situation where I'm comfortable knowing that Big Brother is watching.
About the Author
Sean MacGuire is a consultant who has spent almost 15 years in the company of UNIX systems. He has a couple of patents pending and is publisher of the literary e-zine "It's a Bunny." His email address is sean@iti.qc.ca.