UNIX arrived at the Stanford Linear Accelerator Center (SLAC) somewhat later than at many other research labs, but it proliferated rapidly. SLAC is a high-energy physics research laboratory, funded by the Department of Energy and operated by Stanford University in Palo Alto, California. Demand for computing is limitless in this environment; an experiment under construction today that is scheduled to begin in 1999 will collect 50 terabytes of data per year and requires the equivalent of 20,000 mainframe MIPS. The perceived low cost of RISC computing drove us to UNIX workstations and servers to provide that capacity. The distributed computing environment turned out to be more complex and labor intensive to administer than the familiar VM/CMS mainframe. So, the sys admin staff decided to duplicate some of the tools we had used on the mainframe and to invent others demanded by the new environment.
The tool discussed in this article is patrol, a Perl script that periodically checks the health of all of our machines. patrol grew out of the need to control the runaway processes found on our machines from time to time. The image was that of a constable making the rounds, rattling doors and checking windows, and occasionally raising the alarm. Like most such tools, patrol grew in stages: features were added as the need arose, and one major rewrite replaced a lot of special-purpose code with generalized code and a flexible rules language to drive it.
patrol is too long to reproduce in its entirety here, but it's available from ftp.slac.stanford.edu in the /software/sysadmin directory. It currently runs on SunOS, AIX, and NextStep, since that's what is supported at SLAC. There are some dependencies on the formats of the ps and df command outputs that would need tweaking to run on other systems. There are also some SLAC dependencies for messaging, paging, etc. that would need to be customized to your location, although the core functions will still work.
The patrol script runs every 15 minutes, courtesy of cron. If it is licensed to kill (or renice or restart) processes, it must run as root; if it only observes and reports, it can run as an unprivileged user.
Each time the script runs, it does a ps command and a df command, saves information about processes and filesystems in an observations file, and computes the changes since the previous observation. It also checks various quantities and deltas against the rules in the file patrol.cf, and if a rule matches, performs the listed action. A sample rule is:
PC browser.* \
This rule checks interactive machines for processes whose name matches the regular expression ^(browser.*)$. If the process uses 25% or more of the CPU during the preceding interval, patrol mails the user a message about the amount of resource the process is using. If the process uses 50% or more of the CPU, patrol mails the user a message about a looping process, kills the process, and posts an alert.
The configuration file (Listing 1) contains any number of rule statements. Each statement has a number of fields delimited by white space. The number of fields and meanings vary somewhat by statement type. First, I'll describe the three common fields, and then go through the statement types.
The first field of every statement is a code that determines the resource to be checked. Table 1 lists the current set of statement identifiers.
The second field on each line is a host selector that determines whether that particular resource is checked on a given machine. The remainder of the statement is executed only if this field evaluates to true. This allows one common configuration file to serve many machines, with only the relevant checks performed on any given machine. The selector can take one of the following five forms:
1. a hostname as returned by the hostname command. Any Perl pattern is valid (e.g., pinto.*).
2. an architecture in square brackets such as [aix6000] or [sun4].
3. a feature name in angle brackets, such as <afs>, as explained below.
4. a command in backticks, indicating hosts on which the command returns true.
5. * to designate all hosts.
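The five selector forms can be sketched as a single dispatch function. This is an illustration in Python rather than patrol's own Perl, and it rests on several assumptions: the function name and its arguments are hypothetical, the features dictionary stands in for whatever patrol reads from /etc/tailor.opts, a command "returns true" when its exit status is zero, and a hostname pattern is anchored only at the start of the name.

```python
import re
import subprocess


def selector_matches(selector, hostname, arch, features, run=None):
    """Return True if a host selector applies to this machine.

    features is a dict such as {"afs": True, "usage": "interactive"};
    how it would be filled from /etc/tailor.opts is not shown here.
    """
    if run is None:
        # Assumption: "returns true" means a zero exit status.
        def run(cmd):
            return subprocess.run(cmd, shell=True).returncode == 0

    if selector == "*":                                   # form 5: all hosts
        return True
    if selector.startswith("[") and selector.endswith("]"):
        return selector[1:-1] == arch                     # form 2: architecture
    if selector.startswith("<") and selector.endswith(">"):
        name, _, value = selector[1:-1].partition("=")    # form 3: feature
        if value:
            return features.get(name) == value
        return bool(features.get(name))
    if selector.startswith("`") and selector.endswith("`"):
        return run(selector[1:-1])                        # form 4: command
    # Form 1: hostname pattern. Assumption: anchored at the start only.
    return re.match(selector, hostname) is not None
```

Passing the command runner as an argument keeps the sketch easy to exercise without actually spawning a shell.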
The features installed on each machine are listed in a small file named /etc/tailor.opts (Figure 1). This file drives SLAC's local software installation procedures. patrol consults this file to determine which resources to check. For instance, if AFS is installed on a machine, then the AFS daemons should be running.
One feature, named usage, describes the general class of service that a machine provides. It can take values such as interactive, batch, or fileserver. A selector such as <usage=fileserver> can be used to impose stricter limits on processes on a fileserver than on an interactive machine.
Most statements end with an action field that describes what to do. The available actions are listed in Table 2. Most of them are fairly intuitive from their names, so I'll defer the details until later. All of the actions that send some sort of message to a log or to a person try to avoid flooding by establishing an interval during which they will not send a duplicate message. This interval generally defaults to six hours, but can be set with the intvl parameter.
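The flood-avoidance idea can be illustrated with a small sketch, written in Python rather than patrol's Perl. The class name and the choice to identify a "duplicate" by recipient and message ID are assumptions made for illustration. The sketch keeps its state in memory, whereas a script started afresh by cron every 15 minutes would have to save this state between runs.

```python
import time

DEFAULT_INTERVAL = 6 * 60 * 60  # six hours, the default named in the text


class DuplicateSuppressor:
    """Remember when each message was last sent and refuse early repeats."""

    def __init__(self):
        self._last_sent = {}  # (recipient, message_id) -> time of last send

    def should_send(self, recipient, message_id,
                    interval=DEFAULT_INTERVAL, now=None):
        now = time.time() if now is None else now
        key = (recipient, message_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < interval:
            return False  # still inside the suppression interval
        self._last_sent[key] = now
        return True
```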
Now, on to the description of what each statement does. The PC statement checks process CPU use against a set of thresholds. A process is selected by matching the process name against the regular expression in the third field. The first rule on which both the hostname selector and the process name selector match is the rule that is used. CPU usage for the interval is computed by subtracting the CPU time used at the beginning of the interval from the amount used at the end, and dividing by the number of seconds in the interval.
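As a worked illustration of that calculation (in Python rather than patrol's Perl, with a hypothetical function name), the interval CPU percentage is the CPU time accumulated during the interval divided by the length of the interval:

```python
def cpu_percent(cpu_start, cpu_end, interval_seconds):
    """Percentage of one CPU used by a process over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * (cpu_end - cpu_start) / interval_seconds


# A process that accumulated 450 CPU seconds during a 900-second
# (15-minute) interval averaged 50% of a CPU.
print(cpu_percent(1200.0, 1650.0, 900))  # 50.0
```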
The fourth field on the PC statement is the threshold for the interval CPU use in percent. If the CPU use for the selected process is greater than or equal to the threshold, the actions are triggered. The threshold and actions may be repeated on subsequent lines with leading white space on each line to signal the continuation. Each threshold is tested in turn until one is met or the list is exhausted. Hence, thresholds should be listed from larger to smaller, presumably with corresponding severities of action; if a smaller threshold came first, it would always match before the larger one could be tested.
PC <usage=interactive>* 50 mail($user,LOOP),kill, \
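The first-match behavior of a threshold list can be sketched as follows. This is Python rather than patrol's Perl; the function name and the plain-string action labels are placeholders for illustration, not patrol's internal representation.

```python
def first_matching_actions(value, steps):
    """Return the actions of the first step whose threshold is met.

    steps is a list of (threshold, actions) pairs in the order
    they are written in the rule.
    """
    for threshold, actions in steps:
        if value >= threshold:
            return actions
    return None


# Thresholds listed from larger to smaller, as the text recommends.
steps = [(50, ["mail", "kill"]), (25, ["mail"])]
print(first_matching_actions(60, steps))  # ['mail', 'kill']
print(first_matching_actions(30, steps))  # ['mail']
print(first_matching_actions(10, steps))  # None
```

If the 25 step were listed first, a process at 60% would match it and the 50 step would never be reached.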
The PM statement checks process memory usage. It compares the SIZ field reported by ps against the listed thresholds, and executes actions similar to the PC statement.
PM <usage=interactive> m?xrn 30000 mail($user,XRNHOG)
PM <usage=interactive> * 60000 mail($user,MEMHOG), kill
This set of statements will mail users if the mxrn or xrn process memory size exceeds 30M (which tended to happen if users left processes running for long periods of time). For any other process, it will kill the process and send the user mail if the process exceeds 60M, or just send mail if it exceeds 30M. In both cases, the wording used suggests that if this is normal behavior for that program, the user should run the process on some specialized machines where it won't interfere with other interactive users.
The F statement checks mounted filesystems (generally locally mounted ones) for full conditions. patrol compares the percent full number reported by df against the listed thresholds and executes the corresponding actions if any of the thresholds match. The threshold conditions are more complex than on the previous statements: they can cause triggers on either threshold percentages or on threshold percentages plus a threshold rate of growth. Several such thresholds can be listed in one group, as follows:
F nfs0 /home0[1-9] 99 page(admin,FSFULL)
This example checks servers "nfs01" and "nfs02" for filesystems /home01 through /home09 (there's no error for filesystems that don't actually exist, since the filesystems reported by df are matched against the rules, not vice versa). If the filesystems are 99 percent full or more, the administrators are paged. If the filesystems are 95 percent full or more and have grown by at least one percent in the last interval, or are 90 percent full and have grown by at least two percent, or 80 percent full and have grown by at least five percent, the administrators are sent mail about the condition. This allows notification of both full filesystems and filesystems that are growing fast enough to fill up soon.
The F statement also causes patrol to perform a few other filesystem checks, for instance for a shortage of inodes. patrol is limited by the quality of the information returned by df; for instance, most systems do not report blocks in use by files that have not yet been closed. In practice, however, these checks give early warning of many troublesome conditions.
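The combination of a fullness threshold with an optional growth threshold can be sketched like this. The sketch is in Python rather than patrol's Perl, and the tuple layout, the function name, and the action labels are assumptions for illustration; only the numeric thresholds come from the description above.

```python
def filesystem_action(pct_full, growth, steps):
    """Return the action of the first step that matches, or None.

    Each step is (min_pct_full, min_growth, action). A min_growth of
    None means the step depends on fullness alone. growth is the change
    in percent full since the previous observation.
    """
    for min_pct, min_growth, action in steps:
        if pct_full >= min_pct and (min_growth is None or growth >= min_growth):
            return action
    return None


# Thresholds taken from the description of the example.
STEPS = [
    (99, None, "page"),  # 99% full or more: page, regardless of growth
    (95, 1, "mail"),     # 95% full and grew by at least 1 percent
    (90, 2, "mail"),     # 90% full and grew by at least 2 percent
    (80, 5, "mail"),     # 80% full and grew by at least 5 percent
]

print(filesystem_action(99, 0, STEPS))  # page
print(filesystem_action(92, 2, STEPS))  # mail
print(filesystem_action(92, 1, STEPS))  # None
```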
The D statement is a bit different from the previous ones. It provides a pattern that should match a daemon running on a selected machine. If no process in the ps list matches that pattern, the daemon is assumed not to be running, and the actions are triggered.
D <afs> afsd page(admin,NODAEMON)
D [sun4] lpd restart ('restartd lpd'), \
This feature gives an opportunity to notify the administrators of the failed daemon and to attempt to restart it if such an operation is safe.
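The daemon check inverts the usual logic: the rule fires when nothing matches. A minimal sketch in Python (not patrol's Perl) follows; the function name is hypothetical, and the use of a whole-name match is an assumption made by analogy with the ^(...)$ anchoring described for process names.

```python
import re


def daemon_missing(pattern, process_names):
    """Return True if no process name in the ps list matches the pattern."""
    regex = re.compile(pattern)
    return not any(regex.fullmatch(name) for name in process_names)


running = ["init", "inetd", "afsd"]
print(daemon_missing("afsd", running))  # False: the daemon is present
print(daemon_missing("lpd", running))   # True: the actions would be triggered
```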
The SP statement checks some Internet services on the local machine. It examines the output of netstat to determine whether any process is listening on the service's port. This simple test doesn't detect daemons that are hung but still hold the TCP port open. The SP statement can't test UDP- or RPC-based services, but in practice it does detect and report a number of problems.
SP * tcp/telnet restart('kickinetd')
This example checks for the presence of the telnet service and, if the service is not found, restarts it by sending a HUP signal to inetd.
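A rough sketch of the listening test is shown below, in Python rather than patrol's Perl. The format of netstat output differs between systems, so the parsing here is an assumption: it expects listening sockets to carry the word LISTEN and a local address ending in the service name after a dot or a colon. The function name is hypothetical, and the caller is expected to supply the captured netstat output as text.

```python
def is_listening(netstat_output, service):
    """Return True if some LISTEN line names the service as a local port.

    Assumption: a local address looks like "*.telnet" or "*:telnet".
    Real netstat formats vary and would need their own parsing.
    """
    suffixes = ("." + service, ":" + service)
    for line in netstat_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue  # not a listening socket
        if any(field.endswith(suffixes) for field in fields):
            return True
    return False
```

As the text notes, a positive result says only that something holds the port, not that the daemon behind it is still responsive.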
The M statement provides message texts for substitution in the mail and page actions. The message text takes the form of a here document, which is evaluated as a Perl string before sending. There are some predefined variables that may be substituted in the message at that time: $cmd gives the command from the ps command; $cmdline gives the entire command line with arguments; $pct gives the percentage of CPU used by the process or the percentage full of the filesystem; $host gives the current hostname; $sz gives the memory size in use for a process; and $mount gives the mount point of a filesystem. A few sample message texts are shown in Figure 2.
Finally, I will describe the possible actions in more detail. The log action, which is the default, writes a line to a systemwide logging daemon named slaclog. (slaclog is also available via ftp from the SLAC site.) The log action should be fairly easy to adapt to another logging system if you prefer. The log lines are mostly for statistics and for a record of the more heavy-handed actions patrol might take. (Think of it as the local police blotter.)
The mail action sends mail to the specified user. The word $user may be used to send mail to the owner of the process. The second parameter is the message ID to send. There must be an M record in the patrol.cf file with that ID. Optionally, the third parameter specifies the interval in which duplicate messages are suppressed: the interval may be specified as a number with a letter "s" for seconds, "m" for minutes, "h" for hours, or "d" for days. We set up the from address of the mail to be "The System Patrol," but have the reply-to address set to go to a real person, so that messages can say "If you need further assistance, simply reply to this message."
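Variable substitution of this kind can be imitated with Python's string.Template, which happens to use the same $name syntax. This is a stand-in for illustration, not patrol's Perl string evaluation, and the message wording and values below are invented rather than taken from the real message texts.

```python
from string import Template


def render_message(text, **values):
    """Substitute $name placeholders; raises KeyError if one is missing."""
    return Template(text).substitute(values)


# Hypothetical message text and values, for illustration only.
text = "Your process $cmd on $host used $pct percent of the CPU."
print(render_message(text, cmd="browser", host="pinto01", pct=62))
# Your process browser on pinto01 used 62 percent of the CPU.
```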
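The interval notation is simple to convert to seconds. The sketch below is in Python rather than patrol's Perl; the function name is hypothetical, and it assumes whole numbers with a mandatory unit letter.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(spec):
    """Convert a string such as '6h' or '30m' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", spec)
    if match is None:
        raise ValueError(f"bad interval: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(parse_interval("6h"))   # 21600
print(parse_interval("30m"))  # 1800
```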
The mcons action routes messages to the SLAC multiple console routing system. mcons is a central dispatcher that collects named streams of information from a variety of sources and sends it back out to subscribers to those streams. It allows any of us to subscribe to messages from patrol, from the server consoles, from syslog, and more. This is not currently available outside of SLAC.
The kill action sends a signal to the triggering process. The default is signal 9 (SIGKILL), but any other signal can be specified.
The restart action attempts to restart a daemon or a subsystem by executing a command. The command is executed by the system() call, and the log message indicates whether the command succeeded or failed (that is, returned a non-zero status code). There is one internally defined restart action that may be requested: kickinetd locates the inetd process and sends it a HUP signal.
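The success test for a restart command can be sketched as follows, in Python rather than patrol's Perl. subprocess.run with shell=True plays the role of Perl's system() here; the function name and the wording of the returned log line are invented for illustration.

```python
import subprocess


def restart(command):
    """Run a restart command through the shell and report the outcome.

    Returns (succeeded, log_line). A zero exit status counts as success.
    """
    status = subprocess.run(command, shell=True).returncode
    outcome = "succeeded" if status == 0 else f"failed (status {status})"
    return status == 0, f"restart '{command}' {outcome}"
```

A rule would pass in whatever command string appears in its restart action; the log line then records whether that command reported success.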
The nice action will renice a process to a specified level or to the lowest possible value if min is specified.
The page action sends an alphanumeric page to a specified person. The patrol script is set up to use the Telalert paging system (www.telamon.com), but another may be substituted for it. The text of the page is specified by an M record in the patrol.cf file.
The name action specifies an alternate name for a process to be substituted in messages. It is useful when the command is matched by a pattern but you want a single more descriptive name in the messages.
The patrol script works well for us. Generally, users have been happy when a looping process was pointed out to them. We've been careful to warn people about most situations rather than outright killing the offending processes, so we haven't had users upset about lost work. The messages try to be helpful, pointing out alternatives to the users or directing them to documentation. The users have generally been grateful to find out the local rules and to be directed to appropriate machines for their work. We've tried to capture the spirit of the best sort of constable, one who offers assistance and service beyond simple enforcement.
About the Author
Chuck Boeheim is the head of the Systems Group at the Stanford Linear Accelerator Center, a high-energy physics research lab at Stanford University. He has been working in university computing for more than 20 years on a wide range of computing systems. He has a special interest in building tools that solve real-world computing problems in an elegant manner. See http://www.slac.stanford.edu/~boeheim for contact information.