
procmon: A Process Monitor

Chris Hare

Anyone responsible for maintaining a UNIX system must be concerned with long-running processes, that is, those that are started when the system is brought up and that should never exit. In practice, however, processes do die. Common causes include a system limit being reached, programming errors, or a system resource not being available when needed.

For some processes, you can combat this situation by using the /etc/inittab file on System V systems. /etc/inittab contains a list of processes that are to be executed when the system enters a run level, and specifies what to do when the process exits.

The inittab file consists of four colon-separated fields:

identifier:run levels:action:command

A sample /etc/inittab file is shown in Figure 1.
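As a rough illustration of the four-field layout, an entry can be split as shown below. This is a Python sketch for explanation only; it is not part of procmon, and the sample entry and the names parse_inittab_line and InittabEntry are made up for the example.

```python
from typing import NamedTuple


class InittabEntry(NamedTuple):
    identifier: str
    run_levels: str
    action: str
    command: str


def parse_inittab_line(line: str) -> InittabEntry:
    """Split one inittab entry into its four colon-separated fields."""
    # Split at most three times: the command field may itself contain colons.
    fields = line.rstrip("\n").split(":", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return InittabEntry(*fields)


# Hypothetical entry: respawn /etc/named in run levels 2 and 3 whenever it exits.
entry = parse_inittab_line("nm:23:respawn:/etc/named")
print(entry.action, entry.command)  # prints: respawn /etc/named
```

The respawn action is the one that tells init to restart the command each time it exits.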

inittab is a powerful mechanism, but it is found only on System V variants of UNIX. In addition, /etc/inittab and the init command give no indication that the process has exited and been restarted unless the process continuously dies and init prints a message on the console. The message is typically something like "Command is respawning too rapidly."

BSD-based operating systems, such as SunOS, BSD/OS from Berkeley Software Design, Inc. (BSDI), FreeBSD, and others, do not use /etc/inittab to control processes spawned by init.

The problem then becomes "How can I provide a system-independent method of monitoring and restarting critical system processes?" The answer is procmon.

Introducing procmon

procmon is a perl script that is started during system startup, and runs for the life of the system. It has been written to be a system daemon, and it behaves as such. The purpose of this program is to monitor the operation of a set of defined processes, and if they are not present in the process list, to restart them. It also logs its actions, noting when the process fails and when it is restarted. A sample log, which is generated through the UNIX syslog facility, is shown in Figure 2.

The syslog output shows procmon starting up and recording what it is doing. The goal here is to capture as much logging information as possible about the process being monitored, in this case named.
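The core idea of checking the process list and restarting whatever is missing can be sketched in a few lines. The version below is a minimal Python illustration, not the perl code of Listing 1. It assumes the System V form ps -e with the PID in the first column, and the function names are invented for the example.

```python
import subprocess


def list_processes():
    """Return the raw process list. `ps -e` is the System V form (assumed)."""
    result = subprocess.run(["ps", "-e"], capture_output=True, text=True, check=True)
    return result.stdout


def find_pid(pattern, ps_output):
    """Return the PID of the first process line containing pattern, or None."""
    for line in ps_output.splitlines()[1:]:  # skip the column header
        fields = line.split()
        if pattern in line and fields and fields[0].isdigit():
            return int(fields[0])  # `ps -e` prints the PID in the first column
    return None


def check_and_restart(pattern, start_command, ps_output, run=subprocess.call):
    """Restart the command if pattern is absent; return a status line."""
    pid = find_pid(pattern, ps_output)
    if pid is not None:
        return f"{pattern} running as PID {pid}"
    # Not found in the process list: hand the start command to the shell.
    return_code = run(start_command, shell=True)
    return f"{start_command} returns {return_code}"
```

A call such as check_and_restart("named", "/etc/named", list_processes()) performs one check. Note that a plain substring match is crude: a short pattern can match an unrelated process, so patterns should be chosen with care.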

Automatically Starting procmon

The benefit of a program such as procmon can only be realized when the program is started at system boot time. How this is accomplished depends upon the UNIX variant you are using. On System V systems, the lines shown in Figure 3 are added to /etc/rc2, or to a file in the /etc/rc2.d directory, which is the preferred method.

procmon is a daemon process. It handles all of the system signals, and disconnects itself from a controlling terminal. When procmon starts, it prints a line indicating what configuration parameters it is using, and then quietly moves to the background. All logging at this point is accomplished through the UNIX syslog facility. The output printed when procmon starts is shown in Figure 4.

The procmon Files

procmon uses two configuration files: procmon.cfg and procmon.cmd. Of the two, only procmon.cmd is absolutely necessary. If procmon.cfg exists, then it will be used to alter the base configuration of the program. I will discuss these files in detail here.

The procmon Configuration File

The default configuration file is /etc/procmon.cfg. If this file is not found when procmon starts, then it uses the default parameters built into the program. This configuration file is intended to provide a mechanism for the system administrator to change the location of the procmon.cmd file and the delay between checking the commands in the list.

If no /etc/procmon.cfg file is found, then the program looks in the /usr/local/bin directory for procmon.cmd and uses a delay of five minutes between checks. The sample procmon.cfg file shown in Figure 5 illustrates using a delay of 15 minutes (900 seconds), and a configuration directory of /etc. Notice that the delay value is in seconds, not minutes.
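The relationship between the built-in defaults and the optional overrides can be pictured as follows. This is a Python sketch of the behavior described above, not the way procmon itself loads its settings. The variable names delay_between and ConfigDir come from the configuration file, while effective_settings and command_file are hypothetical names used only here.

```python
# Built-in defaults described in the text: five minutes, /usr/local/bin.
DEFAULTS = {"delay_between": 300, "ConfigDir": "/usr/local/bin"}


def effective_settings(overrides=None):
    """Merge optional overrides onto the built-in defaults."""
    settings = dict(DEFAULTS)
    settings.update(overrides or {})
    # The command file always has the same name; only its directory changes.
    settings["command_file"] = settings["ConfigDir"].rstrip("/") + "/procmon.cmd"
    return settings


# No configuration file: defaults apply.
print(effective_settings())
# Values matching the sample configuration: 15 minutes, /etc.
print(effective_settings({"delay_between": 900, "ConfigDir": "/etc"}))
```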

The /etc/procmon.cfg file is not parsed by procmon itself; if it exists, perl loads it directly into the program. This means that comments using the # symbol are supported, and the last line of the file must be the statement 1; so that the file returns a true value, which perl's require expects.

The reason for using this configuration file is that the parameters of the program can be changed without modifying the source code. The delay_between variable defines how long procmon waits between passes through the list of commands. For example, if delay_between is 300, procmon pauses for 300 seconds after each pass.
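The effect of delay_between is easiest to see in the shape of the main loop. The following Python sketch is an assumption about that shape, not a transcription of Listing 1; the names monitor, check_all, and cycles are hypothetical.

```python
import time


def monitor(check_all, delay_between=300, cycles=None, sleep=time.sleep):
    """Call check_all(), pause delay_between seconds, and repeat.

    cycles=None loops forever, as a daemon would. A number limits the
    passes, which is only useful for experimenting.
    """
    completed = 0
    while cycles is None or completed < cycles:
        check_all()             # one pass over every monitored process
        completed += 1
        sleep(delay_between)    # the pause set by delay_between
    return completed
```

Passing the sleep function as a parameter makes it possible to try the loop without actually waiting five minutes per pass.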

The ConfigDir variable tells procmon where the procmon.cmd file is located. The program defaults to looking for it in /usr/local/bin. The sample in Figure 5 places the file in /etc.

If you look at Listing 1, the procmon source code, you will see

require "/etc/procmon.cfg";

which is how the configuration file is loaded into procmon. The perl require command causes perl to read the named file, procmon.cfg, into the current program. This is a powerful feature of perl, and it allows developers the freedom to concentrate on the problem they are trying to solve, rather than on the mundane task of processing configuration files.

The procmon Command File

The procmon command file contains the list of processes that are to be monitored. It contains two exclamation mark (!)-separated fields: the pattern to search for in the process list, and the name of the command to execute if the pattern is not found. A sample file is shown in Figure 6.

In this file, procmon watches for named and cron. If named is not in the process list, then the command /etc/named is started. The same holds true for the cron command. Again, the purpose of keeping this information in a file is to let the system administrator change the list of monitored processes without touching the source code. If the contents of the file change, however, the procmon daemon must be restarted to read the changes.
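Reading such a file amounts to splitting each line at the exclamation mark. The Python sketch below shows one way to do that. It is an illustration only: the function name is invented, the treatment of blank lines is an assumption, and the start command /etc/cron in the sample is assumed rather than taken from the figure.

```python
def parse_command_file(text):
    """Return (pattern, start_command) pairs from procmon.cmd-style text."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split at the first "!" only, so the command may contain "!" itself.
        pattern, separator, command = line.partition("!")
        pattern, command = pattern.strip(), command.strip()
        if not separator or not pattern or not command:
            raise ValueError(f"line {number}: expected pattern!command, got {raw!r}")
        entries.append((pattern, command))
    return entries


# The /etc/cron path is an assumed example value.
sample = "named!/etc/named\ncron!/etc/cron\n"
for pattern, command in parse_command_file(sample):
    print(f"watch for {pattern!r}, restart with {command!r}")
```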

procmon Messages

The syslog facility records several messages. These messages are discussed below.

Startup Messages

Startup messages are recorded by syslog when procmon starts up. The appropriate information is substituted for the values in <value>. The <timestamp> is replaced by the current time through syslog. <PID> is the process identification number of the procmon process, and <system_name> is the name of the system, as recorded by syslog.

<timestamp> <system_name> procmon[<PID>]: Process Monitor started
<timestamp> <system_name> procmon[<PID>]: Loaded config file <value>
<timestamp> <system_name> procmon[<PID>]: Command File: <value>
<timestamp> <system_name> procmon[<PID>]: Loop Delay=<value>
<timestamp> <system_name> procmon[<PID>]: Adding <value> to stored process list
<timestamp> <system_name> procmon[<PID>]: Monitoring : <value> processes
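To show how the pieces of these records fit together, the Python sketch below builds the startup message bodies and hands them to syslog, which supplies the timestamp, system name, and PID. The message texts are those listed above; everything else is an assumption made for illustration, including the function names, the daemon facility, the info priority, one "Adding" record per monitored pattern, and skipping the "Loaded config file" record when no configuration file was read.

```python
import syslog


def startup_messages(config_file, command_file, delay, patterns):
    """Build the startup message bodies in the order listed above."""
    messages = ["Process Monitor started"]
    if config_file:  # assumption: logged only when a config file was loaded
        messages.append(f"Loaded config file {config_file}")
    messages.append(f"Command File: {command_file}")
    messages.append(f"Loop Delay={delay}")
    # Assumption: one "Adding" record per monitored pattern.
    messages.extend(f"Adding {p} to stored process list" for p in patterns)
    messages.append(f"Monitoring : {len(patterns)} processes")
    return messages


def log_startup(messages):
    """Send each body to syslog under the procmon identity."""
    # LOG_PID makes syslog append [<PID>]; the daemon facility is an assumption.
    syslog.openlog("procmon", syslog.LOG_PID, syslog.LOG_DAEMON)
    for body in messages:
        syslog.syslog(syslog.LOG_INFO, body)
```

Keeping the message construction separate from the call to syslog makes the wording easy to inspect without writing to the system log.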

Monitoring Messages

These messages are printed during the monitoring process; they represent the status of the monitored processes.

<timestamp> <system_name> procmon[<PID>]: <process> running as PID <PID>

This record is printed after every check, and indicates that the monitored process is running.

<timestamp> <system_name> procmon[<PID>]: <process> is NOT running

This record is printed when the monitored process cannot be found in the process list.

<timestamp> <system_name> procmon[<PID>]: Last Failure of <process> @ <time>

This record shows when the process last failed before the current failure.

<timestamp> <system_name> procmon[<PID>]: issuing <start_command> to system

This record is printed before the identified command is executed.

<timestamp> <system_name> procmon[<PID>]: <start_command> returns <return_code>

This record is printed after the command has been issued to the system.

Other syslog entries recorded around the same time may give you clues about what happened after the command was issued.

Enhancements and Deficiencies

The procmon code in Listing 1 was written to run on System V systems. It has been in operation successfully since December 18, 1994. However, some enhancements would be useful. For example, it would be wise to report a critical message in syslog if the command returns anything other than 0, since a non-zero return code generally indicates that the command did not start. Additionally, it would be better to include a BSD option to parse the ps output, and add an option in the configuration file to choose System V or BSD.
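Both suggested enhancements are small. The Python sketch below shows one possible shape for them, under stated assumptions: the flavor names "sysv" and "bsd" are hypothetical configuration values, ps -e and ps -ax are assumed to be the appropriate commands for each family, and the choice of the critical priority for a non-zero return code follows the suggestion above rather than any existing procmon behavior.

```python
import subprocess
import syslog

# Assumed process-list commands for each UNIX family.
PS_COMMANDS = {"sysv": ["ps", "-e"], "bsd": ["ps", "-ax"]}


def ps_command(flavor):
    """Return the ps invocation for the configured flavor."""
    if flavor not in PS_COMMANDS:
        raise ValueError(f"unknown flavor {flavor!r}; expected one of {sorted(PS_COMMANDS)}")
    return PS_COMMANDS[flavor]


def severity_for(return_code):
    """A non-zero return code usually means the command did not start."""
    return syslog.LOG_INFO if return_code == 0 else syslog.LOG_CRIT


def restart(start_command, run=subprocess.call):
    """Run the start command; return (syslog priority, message body)."""
    return_code = run(start_command, shell=True)
    return severity_for(return_code), f"{start_command} returns {return_code}"
```

The pair returned by restart can be passed straight to syslog.syslog(priority, message), so a failed restart stands out in the log instead of looking like a routine record.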

Conclusion

The procmon script helps to ensure that operation-critical applications remain in operation. While a similar mechanism is available through /etc/inittab and the init command, not all systems support it. Moreover, init provides no logging or history mechanism that would reveal whether a process is failing often enough to warrant investigation.

About the Author

Chris Hare is the Operations Manager of fONOROLA i*internet, a Canadian national Internet Access Provider. He is the co-author of the books Inside UNIX and Internet Firewalls and Network Security. Along with his full-time job and writing, he is the president of the Unilabs Research Group, and is presently working on his third book, The UNIX Professional Reference, for New Riders Publishing. Chris can be reached at chrish@fonorola.net or chrish@unilabs.org.