Writing a Web-Based Admin Tool
While working at a location where I was in charge of more than 50 servers located around the country, a manager would occasionally walk into my office and say, "I just got a call saying that there's a problem in Kansas City. What's wrong with the machines?" I would then log onto each machine in Kansas City to see if there was anything unusual, because no one knew which machine (if any) was having the problem. The more machines at a given site, the longer it took to investigate.
After many occasions where it was necessary to check all of the machines while someone was waiting for an answer, I decided that I could give better service to these kinds of queries, and keep better track of my machines, if I had a tool to give me a quick overview of the general health of the machines. While developing this tool, I discovered "The Big Brother System and Network Monitor" (available from http://www.iti.qc.ca/iti/users/sean/bb-dnld), but after my team evaluated the tool, we determined that it was not quite what we needed.
One of the features we wanted in a tool was the ability to tell it what to monitor. We had many "non-standard" items that we wanted to monitor, such as database connectivity. Other items of interest were the number of users currently logged on (from a load-balancing standpoint) and "the number of users running command Z." We also wanted to monitor the filesystem free space on each machine, with the capability of ignoring certain filesystems, since some are used by a database and would always be "full."
The first version of this tool used Tcl/Tk to display the results in an X-based window, but I decided that a Web page would be a better way to display the information. The Web approach would not require the user to be logged into a given machine just to get status information. A Web-based display would also make the information available to other people, such as managers or the help desk. If one of these people received a query about a machine being down, or about the connection to a database, they could find the answer by viewing the Web page without calling an administrator.
The earlier version of the tool collected the data every time the page was refreshed, but even with a parallel processing approach to data collection, updates took too long. This version also used remote shells to collect the data, but I wanted a more secure method to do so. I had never written any client/server code before, but decided this was a good time to learn.
I began with the client/server example in the back of the O'Reilly camel book (Learning Perl, Second Edition by Randal Schwartz and Tom Christiansen) and one off the Web at:
which implements a simple "time server." In the latter example, the clients would connect to the server, and the server would send them the time of day, which the client would then display. I understood what the scripts were doing, and realized that I wanted a server that accepted data instead of sending it, so I made some modifications.
For my tool, the client would run on the machines to be monitored, on a regular basis via a cron job. Every time the client ran, it would collect certain information and send it back to the server. Before sending the data, the client would make a "good or bad" determination for each data item. I chose to have the decision made on the client side, because then the definitions for "good" and "bad" could be tuned for each individual machine and also keep the load distributed. Some tests that do not have a distinct line between good and bad, so there is a third possibility of "warning" for these situations.
Two examples of items that needed tuning were filesystems and usercounts. Certain filesystems, such as those used by databases, will always be "full." Which filesystems are used by a database vary by machine. By having a configuration file on each machine, certain filesystems could be removed from the "full" list before that information was sent to the server. Another item that needed tuning that I was interested in monitoring was "usercounts" or "how many people are on the machine running program Z." Certain machines were configured to support 75 users, while other machines were configured for 150 users. Thus, for example, 80 users on a machine is either "good" or "bad" depending on its configuration.
The client sends one file to the server, regardless of how many tests it performs. The first line of the file is the machine name followed by a + (passed) or - (failed) or a value to display (warning) for each test, separated by whitespace. The rest of the lines of the file are the details for each of the tests. Each line consists of the name of the machine followed by the name of the test and the results of that test - one test per line. I included the machine name on every line so that if I wanted to get the "filesystem" results for every machine, I could do a grep filesystem * in the data directory and know where the results originated.)
The client reports five pieces of data:
- network - How long did it take to ping the server
- filesystems - Which filesystems have less than 15% free space
- sar -q - The output from "sar -q 5"
- sar -r - The output from "sar -r 5"
- sar -u - The output from "sar -u 5"
The client has been tested under AT&T/NCR MP-RAS, HP-UX, Solaris, and Linux. When the particular OS does not support a test (sar -r does not exist on HP), there is no report for that test.
The server is the least complicated part of the tool. It accepts connections from the clients, reads the data sent, and recreates the file in the data directory. It could be configured to do some error checking or attempt to contact machines that have not checked in recently, but at the moment, all it does is accept files. (See the "Roads Not Taken" section for additional comments.)
The Web page that displays the information is the most complicated part of the tool because it has so many different functions. The page (see Figure 1) is divided into two vertical frames. The left frame is where the status information is displayed, while the right frame is used to display the details for any item.
The status information is displayed in a table or grid, where the columns contain the tests, and the rows contain the machines. If a given machine "passed" the test, then the cell is green; whereas, if the machine "failed," then the cell is red. Because some tests need an indicator between "pass" and "fail," there is also a "warning" that will cause the cell to be yellow. If there is no current data from a machine, the cell containing the machine name will turn blue, and the first test column will become orange. The final possibility is that there is no information at all from a given machine. In this case, the cell containing the machine name will turn red.
Selecting an individual cell will bring up the details for that item in the right-hand frame. These details are the results of that particular test. For example, if the filesystems cell is red, choosing that cell will bring up a list of which filesystems are almost full, and the percentage of space on each that is available. Choosing a green cell may display details (as with the output of the sar command) or it may display nothing (as in the case of filesystems). Choosing the name of a machine will bring up information about that machine, including contact information (the administrator) and what the machine is used for.
The left-hand frame has an "auto-promote" feature that makes it easier to detect problems. At the top of the frame is a table consisting only of machines with a "problem." Machines with "non-green" entries are automatically included in this table to make them more visible. These "problem" machines are also included in their normal position in the bottom table.
The "status" frame has auto-refresh capability, so the data displayed will always be as current as possible. All of the pages (actually the frames within the page) are dynamically generated by CGI scripts. The status frame is built from the files in the data directory, as are the details for the tests that appear in the right-hand frame. The machine "details" are generated by a script that reads a data file in a separate directory.
An input "form" at the bottom of the page allows the user to search the machine information for a given string. For example, a user knows that application Z is down, but they do not know which machine it runs on. This form can be used to search for all machines running application Z. If the data is current (very useful, but difficult to maintain in some environments), the screen will display all of the machines running application Z and the location or contact information for those machines. This feature is useful for the help desk, managers, and even new administrators. This search ignores case and searches all fields of the data, so users can search for machine names, administrators, or applications.
Each cell of the table contains a picture of a colored ball, which looks more interesting than just a solid color. Another possibility is to display a value in each cell instead of the graphic of the ball. The background will still change color (red, green, or yellow) but this feature would provide some details without clicking on the cell. To use this feature, the client needs to concatenate the value to be displayed with the +, -, or !. For a cell to display a value of three as a warning, the entry would be !3 instead of just !, whereas +30 would be interpreted as "display a 30 in a green cell."
Figure 2 provides a list of the various components and files which comprise the tool. You can download these files from the Sys Admin Web site (www.samag.com) or ftp site (ftp.mfi.com in /pub/sysadmin).
Roads Not Taken
One of the "special cases" for data is that it is "old." If the data is collected every 10 minutes, and the data from machine X was received 12 minutes ago, then this data is "old" - we should have received new data by now. My current approach for handling this situation is "ignore it" (although I do flag the data as "old") and hope to get data the next time around. All of the other approaches that I can think of become very complicated, and I want to keep things as simple as possible.
An enhancement to the server (and Web page) could be to utilize a database instead of flat data files. This has benefits from a management standpoint, making it easy to generate statistics and reports, and even trend analysis. I have tried using an Access database running on an NT machine, but the performance for the flat files was superior.
Another modification could be to make the process of adding a new test easier, but all of the "improvements" add their own complications. At the moment, adding a new test requires three steps. First, a script must be written to gather the data. Second, the client must have a routine that will handle the new data; and third, the script that generates the Web page must know about the new test.
To simplify this process, there could be a configuration file for the client that contained a list of tests and the script to run for that test. The data collection scripts would then "massage" the data and pass it to the client in its final form (the format that is sent to the server). The script that generates the Web page could easily generate the columns dynamically, but it currently uses an ordered list of tests to control the order in which the tests are displayed.
I began writing this tool for my personal use, as something that would help with one aspect of my job, making sure that my machines were "healthy." As it evolved, I have added and deleted features and tested various options. The version presented here is fairly generic and is gathering data on four different platforms. I encourage you to make changes to suit your requirements.
Although I could have modified an existing tool to suit my needs, I chose to write my own tool because the final product was only part of my goal. I also wanted to expand my knowledge of Perl, CGI, and Web development. I also wanted to create a tool that used simple concepts, that would run on "old" machines, and that would be easy to use as a "building block" for others who cannot or do not wish to start from scratch.
I do not plan to sell support, or release new versions of this tool, but comments and suggestions can be sent to me at: email@example.com.
About the Author
Mark has a Masters Degree in Computer Science from Mississippi State University, and has been working as a Systems Administrator for more than 10 years. He is currently a contractor with Princeton Information working as a UNIX and Web Administrator for a major telecommunications firm in Northern New Jersey.