The average system administrator's job involves more than simply installing systems and creating user accounts. Most of the time, the administrator works to ensure system availability and reliability. One method of accomplishing that is through the use of network-specific software tools. Although such tools are extremely useful to a networking administrator, they are not always useful to the UNIX administrator worried about performance, network throughput at the NIC, processor load, and other platform-specific issues.
The recognition of the interdependence between systems and the networks to which they are attached has led to the development of enterprise monitoring tools such as HP's OpenView and IBM's Tivoli. Because these tools are designed to do so much, they are often difficult to install, tedious to configure, and time consuming to plan and implement. Such commercial tools are also relatively expensive, and thus difficult to cost-justify if management lacks the technical expertise to understand the scope of the problems the tools address.
In my position, I needed an inexpensive tool that would do some basic network-availability monitoring and, just as importantly, tell me something about the servers I am responsible for administering. I also wanted a tool that I could quickly deploy to another machine or set of machines, so I would not necessarily be confronted with trans-firewall communications issues while monitoring disparate networks separated by a firewall. This article describes the tool I developed, and the rationale behind the features it provides.
In my environment, I had to work within some significant development constraints. First, I did not have a large, isolated test network available, so all development had to proceed in such a way that errors would not cause system outages. I also had disparate platform types; the network here has several versions of UNIX (AIX, HP-UX, Solaris), NT servers, Novell servers, NT workstations, and even some LanMan servers.
These requirements led to my choice of Tcl/Tk as the primary language for this tool with ancillary support from Perl scripts. Specifically, I used Tcl v8.0p2 with the corresponding Tk v8.0p2, the TclX v8.0.2 extension, and Perl 5.004r4. The development of the tool up to this point has been entirely on a Sun platform running Solaris 2.6. This combination allowed me to quickly deploy this tool to other operating systems and platforms including NT, Win95, and UNIX variants.
The selection of Solaris 2.6 as a general development environment itself imposed some restrictions on how the tool must be implemented; some of the choices made regarding multithreading, shell commands issued by the different scripts, and data return methodology were dictated by the Solaris operating environment.
The tool, which I term a network monitor since its functionality is largely network based, was designed to provide a rapidly decipherable high-level status display of a LAN or WAN environment. I wanted a way to quickly determine the overall status of a large number of machines, with the ability to drill down into specific machines for remote system and connectivity diagnostics.
To provide these capabilities, I chose to begin analysis at the subnet level, effectively imposing a logical subnet mask of 255.255.255.0 on any given network. Once each subnet was displayed at this level, I implemented a search function for the machine name and other DNS-related information. I plan to add some machine polling functionality to the network monitor in the future.
Logic Flow and Data Organization
Each of the windows and subwindows displayed by this application is controlled hierarchically; when a "toplevel" item is declared in the Tcl/Tk script, only that procedure and the procedures it calls directly have access to the window created by the "toplevel" declaration. Likewise, the name of each "toplevel" item is generated deterministically: a string representing the depth within the execution hierarchy at which the item was created is appended to the name of the toplevel item immediately above it in the call tree. This keeps window names consistent throughout the tool and makes it relatively easy to locate the source of a bug. It also provides a straightforward way to extend the tool, because new functionality can be "plugged in" to the main control structures at appropriate points.
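The naming scheme can be illustrated with a short sketch. It is written in Python rather than the tool's Tcl, and the "lvl" tag pattern and the function name are assumptions made for illustration; the article does not give the actual strings used.

```python
def child_window_name(parent, depth):
    """Build a Tk-style window path by appending a depth tag to the parent path."""
    tag = f"lvl{depth}"  # hypothetical tag pattern, not taken from sysmon.tcl
    # The root window "." already ends in the separator.
    return f".{tag}" if parent == "." else f"{parent}.{tag}"


top = child_window_name(".", 1)   # window opened from the main panel
sub = child_window_name(top, 2)   # window opened from that window
print(top, sub)                   # prints: .lvl1 .lvl1.lvl2
```

Because each name embeds the names above it, a window path seen in an error message points directly at the place in the hierarchy where the window was created.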
Within some Tcl control routines, a shell call is made to a script that returns a formatted list. Though this could be achieved by remaining within the Tcl paradigm and simply parsing the output of the system command appropriately, I was able to develop and test the logic streams in Perl more rapidly than in Tcl/Tk. I also found it much easier to fork processes out within Perl than within Tcl/Tk because of window control issues within the Tk windowing interface.
In other situations, some calls to system commands within Tcl would at first glance seem to be better off moved to an ancillary Perl script. In these cases, however, I found it easier to remain in Tcl and trap any erroneous output using Tcl's catch command than to attempt to catch system command errors in Perl. Moving between languages does have some drawbacks, but overall I feel the advantages of a multiple-language environment significantly outweigh the costs.
A Look Under the Hood
The Tcl/Tk portion of the script consists of the main control block, contained in the file sysmon.tcl. It also comprises the two main subsidiary control branches. These two control blocks, the subnet scan block and the name lookup block (located in the files pingtest.tcl and namemap.tcl, respectively) were separated from the main control block to maximize the modularization of the code and ensure that the two blocks would not interfere with one another. (All scripts mentioned in this article can be found at www.samag.com or ftp.mfi.com in /pub/sysadmin.)
The first branch of the main control flow is a subnet scanner. Within the subnet scanning block, there are individual routines to verify that the specified values correspond to a valid subnet address, to generate the display, and to perform the actual scan. The scan is done by launching subnet-scan.pl in the background, which sends its output to a named pipe. The pipe is read by the Tcl control block for the subnet scan, which then updates the screen display. The Perl script forks multiple processes to minimize the time required to scan all 256 hosts on a subnet.
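A validation routine of the kind described might look like the following. This is a Python sketch rather than the Tcl routine in pingtest.tcl, and it assumes the user supplies the first three octets as a dotted string; the function name is hypothetical.

```python
def parse_subnet_prefix(text):
    """Return the three octets of a prefix such as '192.168.1', or raise ValueError."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated octets, e.g. 192.168.1")
    octets = []
    for part in parts:
        # Reject empty fields, signs, and anything that is not plain decimal.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"octet {part!r} is not a decimal number")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet {value} is out of range 0-255")
        octets.append(value)
    return tuple(octets)
```

Rejecting bad input before the scan starts avoids launching background processes against an address that can never resolve.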
This time minimization is accomplished by dividing the range of host addresses into 16 groups of 16 and forking a process to scan each host group in parallel. Because the network to be scanned is always a group of 256 hosts determined by the first three octets of the IP address, the scan process functions the same for class A, class B, or class C networks. Within each group, the individual hosts are scanned serially, resulting in a theoretical maximum of 16 simultaneous single-host ping scans at any one time. Further parallelization, by dividing the entire subnet into more groups of fewer hosts per group, is possible. The risk of overloading the process table on the host machine determined the chosen configuration of 16 processes.
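The grouping strategy can be sketched as follows. Note the substitutions: this Python version uses a thread pool where subnet-scan.pl forks processes, and the probe argument is a caller-supplied function (hypothetical) standing in for the ping call, so the sketch never touches the network on its own.

```python
from concurrent.futures import ThreadPoolExecutor

GROUPS = 16       # number of parallel workers, as in the article
GROUP_SIZE = 16   # hosts scanned serially by each worker


def host_groups():
    """Split host numbers 0-255 into 16 consecutive groups of 16."""
    return [list(range(g * GROUP_SIZE, (g + 1) * GROUP_SIZE)) for g in range(GROUPS)]


def scan_group(prefix, hosts, probe):
    """Scan one group serially; probe(address) returns True if the host answered."""
    return [(h, 1 if probe(f"{prefix}.{h}") else 0) for h in hosts]


def scan_subnet(prefix, probe):
    """Scan all groups in parallel and return {host_number: 0 or 1}."""
    with ThreadPoolExecutor(max_workers=GROUPS) as pool:
        results = pool.map(lambda hosts: scan_group(prefix, hosts, probe), host_groups())
        return dict(pair for group in results for pair in group)
```

Raising GROUPS and lowering GROUP_SIZE shortens the scan at the cost of more simultaneous workers, which is the trade-off the process-table concern refers to.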
Within each forked subgroup scan, as a host is scanned, the code examines the output of the ping call and determines availability. It then writes a simple message, consisting of the host number (generated by multiplying the iteration number of the external loop by the group size of 16 and adding the iteration number of the internal loop) and a 0 or 1 indicator, to a named pipe. This named pipe is distinguishable because part of its name is derived from the subnet being scanned. The 0 or 1 indicator tells the Tcl control block reading the pipe of the host's availability: a 0 indicates that the host did not respond; a 1 indicates that it did respond.
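As a worked example of the host-number arithmetic and the message it produces, consider the Python sketch below. The space-separated "host flag" line layout and the pipe path pattern are assumptions for illustration; the article specifies only that the message carries a host number and a 0 or 1, and that the pipe name includes the subnet.

```python
GROUP_SIZE = 16  # hosts per group, matching the 16 x 16 split


def host_number(group_index, offset):
    """Outer-loop iteration times the group size, plus the inner-loop iteration."""
    return group_index * GROUP_SIZE + offset


def format_status(group_index, offset, responded):
    """One line per host: '<host number> <0|1>' (layout is an assumption)."""
    return f"{host_number(group_index, offset)} {1 if responded else 0}\n"


def pipe_name(prefix):
    """Hypothetical subnet-specific pipe path; the real naming is not given."""
    return f"/tmp/scan-{prefix}.fifo"


print(host_number(3, 5))   # prints: 53
```

With group 3 and offset 5, the host number is 3 * 16 + 5 = 53, so a responding host at the .53 address would be reported as "53 1".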
As the Tcl control block reads data from the named pipe, it determines which hosts are up and which are down and updates the display accordingly. By using a series of forked Perl processes running in the background and subnet-specific named pipes, the application effectively interleaves multiple subnet scans. An unfortunate drawback to the current implementation is that the subnet scans are performed in a preemptive single-threaded LIFO tree. Thus, if one scan is in progress when a second scan is launched, the first scan will continue in the background but will not update the display window until after the second scan is complete (that is, until its forked Perl script has exited).
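The reading side can be sketched in the same spirit. This Python fragment assumes each line read from the pipe holds a host number and a 0 or 1 separated by whitespace (an assumed layout), and it simply skips anything that does not fit; the real Tcl block also redraws the display, which is omitted here.

```python
def apply_status_lines(lines, status=None):
    """Fold '<host> <0|1>' lines into a {host_number: is_up} mapping."""
    status = {} if status is None else status
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        host, flag = fields
        if not (host.isascii() and host.isdigit()) or flag not in ("0", "1"):
            continue
        number = int(host)
        if number <= 255:
            status[number] = (flag == "1")
    return status
```

Because lines from the parallel workers arrive in no particular order, keying the result by host number keeps the display update independent of arrival order.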
Another part of the subnet scanner is the per-host hostname lookup function. By clicking on the desired button within the subnet status window, a user can open another display window that retrieves hostname and domain information from the local machine's primary DNS server using the nslookup command. The determination of the hostname and domain portion of the FQDN (Fully Qualified Domain Name) uses standard DNS conventions. The hostname is determined by truncating the portion of the FQDN after the first dot. The domain name is determined by truncating the portion of the FQDN preceding the first dot.
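The split at the first dot is simple enough to show directly. The following Python sketch mirrors the rule described above; the function name is hypothetical and the trailing-dot handling is an added assumption.

```python
def split_fqdn(fqdn):
    """Split an FQDN at its first dot into (hostname, domain)."""
    name = fqdn.strip().rstrip(".")        # tolerate a trailing root dot
    host, _, domain = name.partition(".")  # domain is '' if there is no dot
    return host, domain


print(split_fqdn("www.example.com"))       # prints: ('www', 'example.com')
```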
The second branch of the main control flow is the DNS lookup interface. It is controlled by namemap.tcl, a file separate from the main control block. This piece allows for repeated, specific nslookup querying of the DNS server(s) listed in the local machine's /etc/resolv.conf file for specific DNS record types. The main control file first calls the get-namesvrs.pl script to obtain a list of name server IP addresses from the local resolv.conf file. This list is then used to present a dialog box, from which the user selects the server to run the queries against. Once a server is selected, a subroutine opens a second window presenting a two-paned view of the main query engine. The left pane allows the user to enter a host name, IP address, or domain name to query the DNS database with, and to specify which type of DNS record to query for. The right pane displays the query results.
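Extracting the name server list amounts to picking out the nameserver lines from resolv.conf. The sketch below shows the idea in Python; it is not the get-namesvrs.pl script, and it takes the file contents as a string so it can be tried without reading /etc/resolv.conf.

```python
def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver <address>' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines begin with '#' or ';', so their first field never matches.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


sample = "domain example.com\n# primary\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(nameservers(sample))   # prints: ['192.0.2.1', '192.0.2.2']
```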
The results of the query are returned to the Tcl control block managing the query window as a unit (regardless of success or failure of the query) and displayed. This allows a user not only to resubmit a modified query, but also to scroll backward to earlier queries issued in the same session. This scrolling capability allows somewhat greater freedom in chaining queries together quickly and interpreting the results after the entire chain has been submitted, because it removes the need to pipe the output of an nslookup command-line query to a file or through a screen paginator such as pg or more.
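One way to treat each query's output as a single unit and keep a scrollable history is sketched below in Python. It assumes the common nslookup invocation form of an option, a name, and a server; the function names and the history list are hypothetical, and the runner parameter exists only so the sketch can be exercised without a DNS server.

```python
import subprocess


def nslookup_args(name, record_type, server):
    """Build the argument list: nslookup -type=<record type> <name> <server>."""
    return ["nslookup", f"-type={record_type}", name, server]


def run_query(args, history, runner=subprocess.run):
    """Run a query, capture stdout and stderr together, and record the result."""
    try:
        done = runner(args, stdout=subprocess.PIPE,
                      stderr=subprocess.STDOUT, text=True)
        output = done.stdout
    except OSError as exc:
        # A missing or unrunnable command is still recorded as a result.
        output = f"query could not be run: {exc}"
    history.append((" ".join(args), output))
    return output
```

Merging the error stream into the normal output means a failed lookup appears in the history exactly where it occurred, which is what makes scrolling back through a chain of queries useful.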
Installation and Use
Because this network monitor was developed in a RAD-type high-level scripting language, the installation process is somewhat different from a standard program installation. Once the files are copied into place and made executable, the only installation task is to update the absolute paths of system calls within the various files. Most system calls from the ancillary Perl scripts will not need to be updated, as those calls are to OS-level programs. However, the paths to those ancillary scripts, as well as the paths to the secondary Tcl control blocks within the main block, must be updated to reflect the directory in which the tool was installed.
The directory paths that will need to be updated from the distribution are references to the /export/home/jberning tree; this directory tree was used initially because I was developing the system in my home directory. In future revisions, I plan to use a global directory prefix variable to simplify the installation to alternate paths.
Once the monitor is installed, simply invoke the top-level Tcl control block, sysmon.tcl. This brings up a GUI window that uses a Motif-type point-and-click interface schema. Once the top-level block is invoked, the usage of the tool becomes relatively implicit in the window display structure.
Applicability and Future Directions
This tool provides a method to monitor UNIX-based desktop machines, mission-critical servers, and network hardware. The tool that most people find most useful for whatever needs to be done is often one that they developed themselves. This tool is no exception, but I also believe it can readily be used in other areas and environments and provide similar levels of usefulness. It presents a relatively thorough picture of the overall network in a concise, windowed manner with a point-and-click GUI. It also encapsulates a wide variety of network and host information to help administrators perform detailed problem analysis.
This tool is also extensible in a relatively simple manner. Although not nearly as feature rich, and with nowhere near the extent of functionality of Enterprise Management packages, the simplicity and compactness of this tool nevertheless give it strong advantages over the commercially available packages. Its extensibility allows it to handle a variety of tasks. Its simplicity allows the rapid development of individual, useful extensions, and its modularity allows rapid integration of new features at virtually any level.
The main attraction of this tool is not what it does, but how much more it can be made to do. I attempted to design the tool so that it can easily be extended with future Tcl control scripts and files by simply plugging those extensions into a button on the main entry panel within the sysmon.tcl file. Extending the tool requires only generating a file containing the Tcl control process and logic for the addition, and then re-indexing the directory using Tcl's auto_mkindex command. Once the tclIndex file has been regenerated with the additional Tcl file(s) in place, and the sysmon.tcl file has been updated to provide an additional button for the new procedure, the new feature can be used immediately.
There is also a dim possibility that I will attempt to develop a client-server branch of the tool, which would allow more specific host-based information to be displayed at a central "command console" type of workstation. This piece would also allow an administrator to set up a timed polling mechanism for mission critical or fault-prone servers in an attempt to reduce the time delay from problem occurrence to administrator notification and intervention, and ultimately problem resolution. A drawback to this idea would be the obvious concerns over RPC and "home-grown" daemon security. Another drawback would be the necessity of placing a daemon "server" piece on each machine to be monitored in this fashion - something I was attempting to avoid in the initial conceptualization of this tool.
In the short time I have been using the monitor described here, I've found it to be a useful timesaver in first-run problem diagnosis. Although the monitoring system does not yet have all the capabilities I would like, it has proven valuable to me and has saved a greater amount of my personal time than it took to develop.
From building this network monitor, I've also gained a more thorough understanding of how different processes and different development languages interact. I also learned how multiple languages can be combined in sometimes mundane ways to provide a much more cost- and time-effective implementation of a solution that could otherwise be achieved with a single, sometimes monolithic, language.
About the Author
John Berninger is a UNIX Systems Administrator at Branch Banking and Trust's main Operations Center in Wilson, North Carolina. As the lead Solaris engineer, he is heavily involved with Sun Solaris platforms, but also has duties on HP-UX and AIX systems. Berninger is a graduate of the University of North Carolina at Chapel Hill, and holds a B.S. degree in Mathematical Sciences/Computer Science.