| procmon: A Process Monitor
 
Chris Hare 
Anyone responsible for maintaining a UNIX system must be concerned
with ongoing processes, that is, those that are started when the
system is brought up and that should never exit. Sometimes, however,
that just isn't possible. Processes die for reasons that include
configured limits being reached, programming errors, and system
resources being unavailable when needed.
For some processes, you can combat this situation by
using the /etc/inittab 
file on System V systems. /etc/inittab contains a list
of 
processes that are to be executed when the system enters
a run level, 
and specifies what to do when the process exits. 
Each entry in the inittab file consists of four colon-separated fields:
 
identifier:run levels:action:command 
 
A sample /etc/inittab file is shown in Figure 1. 
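
As a rough illustration of the four-field layout, the following perl sketch splits one entry into its parts. The function name parse_inittab_entry and the sample entry are hypothetical and are not taken from Figure 1. I assume that lines beginning with # are comments.

```perl
use strict;
use warnings;

# Split one inittab entry into its four fields. The limit of 4 keeps any
# colons inside the command field intact.
sub parse_inittab_entry {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
    my ($id, $levels, $action, $command) = split /:/, $line, 4;
    return unless defined($command);     # fewer than four fields: malformed
    return {
        id      => $id,
        levels  => $levels,
        action  => $action,
        command => $command,
    };
}

# Hypothetical entry, for illustration only.
my $entry = parse_inittab_entry("nm:23:respawn:/etc/named");
print "$entry->{id}: $entry->{action} $entry->{command} ",
      "at run levels $entry->{levels}\n";
```

The split limit matters because only the first three colons are field separators; anything after them belongs to the command.
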
inittab is a powerful mechanism, but it is found only
on System 
V variants of UNIX. In addition, /etc/inittab and the
init 
command give no indication that the process has exited
and been restarted 
unless the process continuously dies and init prints
a message 
on the console. The message is typically something like
"Command 
is respawning too rapidly." 
BSD-based operating systems, such as SunOS, BSD/OS from BSDI, FreeBSD,
and others, do not use /etc/inittab to control processes spawned
by init.
The problem then becomes "How can I provide a system-independent
method of monitoring and restarting critical system
processes?" 
The answer is procmon. 
Introducing procmon 
procmon is a perl script that is started during system
startup, and runs for the life of the system. It has
been written 
to be a system daemon, and it behaves as such. The purpose
of this 
program is to monitor the operation of a set of defined
processes, 
and if they are not present in the process list, to
restart them. 
It also logs its actions, noting when the process fails
and when it 
is restarted. A sample log, which is generated through
the UNIX syslog 
facility, is shown in Figure 2. 
The syslog output shows procmon starting up and recording
what it is doing. The goal here is to capture as much
logging information 
as possible about the process being monitored, in this
case named. 
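
To make the basic check concrete, here is a minimal perl sketch of looking for a process in a process list. It is not taken from Listing 1. The function name find_pids is hypothetical and the sample lines are invented. I assume the System V ps -e column layout (PID, TTY, TIME, CMD) and a literal substring match against the command column.

```perl
use strict;
use warnings;

# Return the PIDs of ps output lines whose command contains $pattern.
# Assumes the "ps -e" layout: PID TTY TIME CMD.
sub find_pids {
    my ($pattern, @ps_lines) = @_;
    my @pids;
    for my $line (@ps_lines) {
        # split ' ' ignores leading whitespace; the limit keeps CMD whole
        my ($pid, $tty, $time, $cmd) = split ' ', $line, 4;
        next unless defined($cmd) && $pid =~ /^\d+$/;   # skips the header row
        push @pids, $pid if $cmd =~ /\Q$pattern\E/;
    }
    return @pids;
}

# Canned sample lines, not real output from any system.
my @sample = (
    "  PID TTY      TIME CMD",
    "    1 ?        0:03 init",
    "  214 ?        0:00 named",
);

for my $process ("named", "cron") {
    my @pids = find_pids($process, @sample);
    print @pids ? "$process running as PID @pids\n"
                : "$process is NOT running\n";
}
```

On a live system the lines would come from the ps command itself, for example find_pids("named", `ps -e`).
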
Automatically Starting procmon 
The benefit of a program such as procmon can only be
realized 
when the program is started at system boot time. How
this is accomplished 
depends upon the UNIX variant you are using. On System
V systems, 
the lines shown in Figure 3 are added to /etc/rc2, or
to a 
file in the /etc/rc2.d directory, which is the preferred
method. 
procmon is a daemon process. It handles all of the system
signals, and disconnects itself from a controlling terminal.
When 
procmon starts, it prints a line indicating what configuration
parameters it is using, and then quietly moves to the
background. 
All logging at this point is accomplished through the
UNIX syslog 
facility. The output printed when procmon starts is
shown 
in Figure 4. 
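
The step of disconnecting from the controlling terminal can be sketched in perl as shown below. This is a common fork-and-setsid idiom, not necessarily the code in Listing 1. The function name daemonize is hypothetical, and signal handling is left out entirely.

```perl
use strict;
use warnings;
use POSIX qw(setsid);

# Detach the current process from its controlling terminal.
# Call this once, after the startup line has been printed, so that the
# line still reaches the terminal.
sub daemonize {
    defined(my $pid = fork()) or die "fork failed: $!";
    exit 0 if $pid;                               # parent returns to the shell
    (setsid() != -1) or die "setsid failed: $!";  # new session, no terminal
    chdir '/' or die "chdir failed: $!";
    # Point the standard handles away from the terminal.
    open STDIN,  '<', '/dev/null' or die "cannot reopen STDIN: $!";
    open STDOUT, '>', '/dev/null' or die "cannot reopen STDOUT: $!";
    open STDERR, '>', '/dev/null' or die "cannot reopen STDERR: $!";
    return $$;                                    # PID of the daemon
}
```

Once the standard handles point at /dev/null, syslog is the only place left for the daemon to report what it is doing, which is why all later output goes there.
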
The procmon Files 
procmon uses two configuration files: procmon.cfg 
and procmon.cmd. Of the two, only procmon.cmd is absolutely
necessary. If procmon.cfg exists, then it will be used
to 
alter the base configuration of the program. I will
discuss these 
files in detail here. 
The procmon Configuration File 
The default configuration file is /etc/procmon.cfg.
If this 
file is not found when procmon starts, then it uses
the default 
parameters built into the program. This configuration
file is intended 
to provide a mechanism for the system administrator
to change the 
location of the procmon.cmd file and the delay between
checking 
the commands in the list.  
If no /etc/procmon.cfg file is found, then the program
looks 
in the /usr/local/bin directory for procmon.cmd and
uses a delay of five minutes between checks. The sample
procmon.cfg 
file shown in Figure 5 illustrates using a delay of
15 minutes (900 
seconds), and a configuration directory of /etc. Notice
that 
the delay value is in seconds, not minutes. 
The /etc/procmon.cfg file is not parsed by procmon itself;
if it exists, it is loaded into procmon by perl. This means
that comments using the # symbol are supported, and that the last
line of the file must be the statement 1; so that the file
returns a true value when it is loaded.
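
For orientation, a configuration file matching this description might look like the fragment below. It is an illustrative sketch, not a reproduction of Figure 5. I assume that delay_between and ConfigDir are ordinary perl scalar variables.

```perl
# /etc/procmon.cfg (illustrative; the variable spelling is an assumption)
$delay_between = 900;      # seconds between checks: 15 minutes
$ConfigDir     = "/etc";   # directory that holds procmon.cmd
1;                         # true value, so that the load succeeds
```
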
The reason for using this configuration file is that the parameters
of the program can be modified without touching the source code.
The delay_between variable sets the number of seconds that procmon
pauses between passes through the list of commands. For example, a
delay_between value of 300 gives a pause of 300 seconds (five minutes)
between passes.
The ConfigDir variable tells procmon where
the procmon.cmd file is located. The program defaults
to looking
for it in /usr/local/bin. The sample in Figure 5 places
the file in /etc.
If you look at Listing 1, the procmon source code, you
will 
see  
 
require "/etc/procmon.cfg"; 
 
which is how the configuration file is loaded into procmon.
The perl require command causes perl to read the named
file, procmon.cfg, into the current program. This is
a powerful 
feature of perl, and it allows developers the freedom
to concentrate 
on the problem they are trying to solve, rather than
on the mundane 
task of processing configuration files. 
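
The sketch below shows one way the optional load could be combined with the built-in defaults described earlier. It is an assumption about structure, not a copy of Listing 1. The function name load_config is hypothetical, and the default values come from the text above.

```perl
use strict;
use warnings;

# Built-in defaults, as described in the text.
our $delay_between = 300;               # five minutes
our $ConfigDir     = "/usr/local/bin";

# Load $file if it exists; otherwise keep the defaults.
# Returns 1 if the file was loaded and 0 if it was absent.
sub load_config {
    my ($file) = @_;
    return 0 unless -e $file;
    # require dies if the file does not compile or does not end by
    # returning a true value, which is why the file must end with 1;
    require $file;
    return 1;
}

my $loaded = load_config("/etc/procmon.cfg");
print $loaded ? "Loaded config file\n" : "Using built-in defaults\n";
print "Command File: $ConfigDir/procmon.cmd, Loop Delay=$delay_between\n";
```

The existence test comes first because a bare require of a missing file is a fatal error. Since the loaded file is ordinary perl, it can simply assign new values to the two variables.
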
The procmon Command File 
The procmon command file contains the list of processes
that are to be monitored. It contains two exclamation
mark (!)-separated 
fields: the pattern to search for in the process list,
and the name 
of the command to execute if the pattern is not found.
A sample file 
is shown in Figure 6. 
In this file, procmon will be watching for named 
and cron. If named is not in the process list, then
the command /etc/named is started. The same holds true
for 
the cron command. Again, the purpose of using a configuration
file for this information is to allow the system administrator
to change the list of monitored processes without modifying the
program. Note, however, that if the contents of the file change,
the procmon daemon must be restarted to read the changes.
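
A minimal perl sketch of reading this two-field format follows. The function name parse_command_lines is hypothetical and the code is not taken from Listing 1. The named entry mirrors the description above. Skipping blank or incomplete lines is my assumption about reasonable behavior.

```perl
use strict;
use warnings;

# Parse "pattern!command" lines into a list of [pattern, command] pairs.
sub parse_command_lines {
    my @entries;
    for my $line (@_) {
        chomp(my $copy = $line);
        next if $copy =~ /^\s*$/;                        # blank line
        # Limit of 2: any later ! characters stay in the command
        my ($pattern, $command) = split /!/, $copy, 2;
        next unless defined($command) && length($command);
        push @entries, [$pattern, $command];
    }
    return @entries;
}

# A real program would read these lines from procmon.cmd.
for my $e (parse_command_lines("named!/etc/named\n")) {
    print "Adding $e->[0] to stored process list (start with $e->[1])\n";
}
```
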
procmon Messages 
The syslog facility records several messages. These
messages are discussed 
below. 
Startup Messages 
Startup messages are recorded by syslog when procmon
starts up. The appropriate information is substituted for each
<value> placeholder. <timestamp> is the time of the message,
as recorded by syslog. <PID> is the process identification number
of the procmon process, and <system_name> is the name
of the system, as recorded by syslog.
 
<timestamp> <system_name> procmon[<PID>]: Process Monitor started
<timestamp> <system_name> procmon[<PID>]: Loaded config file <value>
<timestamp> <system_name> procmon[<PID>]: Command File: <value>
<timestamp> <system_name> procmon[<PID>]: Loop Delay=<value>
<timestamp> <system_name> procmon[<PID>]: Adding <value> to stored process list
<timestamp> <system_name> procmon[<PID>]: Monitoring : <value> processes
 
Monitoring Messages 
These messages are printed during the monitoring process;
they represent 
the status of the monitored processes. 
 
<timestamp> <system_name> procmon[<PID>]: <process> running as PID <PID>
 
This record is printed after every check, and indicates
that the monitored 
process is running. 
 
<timestamp> <system_name> procmon[<PID>]: <process> is NOT running 
 
This record is printed when the monitored process cannot
be found 
in the process list.  
 
<timestamp> <system_name> procmon[<PID>]: Last Failure of <process> @ <time>
 
This record reports the time of the previous failure
of the process.
 
<timestamp> <system_name> procmon[<PID>]: issuing <start_command> to system
 
This record is printed before the identified command
is executed. 
 
<timestamp> <system_name> procmon[<PID>]: <start_command> returns <return_code>
 
This record is printed after the command has been issued
to the system.
The syslog entries that follow it may give you clues about what
happened after the command was issued.
Enhancements and Deficiencies 
The procmon code in Listing 1 was written to run on
System 
V systems. It has been in operation successfully since
December 18, 
1994. However, some enhancements would be useful. For
example, it 
would be wise to report a critical message in syslog
if the 
command returns anything other than 0, since a non-zero
return 
code generally indicates that the command did not start.
Additionally, 
it would be better to include a BSD option to parse
the ps 
output, and add an option in the configuration file
to choose System 
V or BSD. 
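
The first of these enhancements could be approached along the lines of the sketch below. It is a suggestion only and is not part of Listing 1. The function name run_start_command is hypothetical, and the choice of the syslog priorities info and crit is my assumption.

```perl
use strict;
use warnings;

# Run a start command and choose a syslog priority from its exit status.
# Returns the return code, the priority, and the message text.
sub run_start_command {
    my ($command) = @_;
    my $status = system($command);       # -1 if the command could not be run
    my $rc = $status == -1 ? -1 : $status >> 8;
    # Any non-zero status, including death by a signal, is critical.
    my $priority = $status == 0 ? 'info' : 'crit';
    return ($rc, $priority, "$command returns $rc");
}

# "true" stands in for a real start command such as /etc/named.
my ($rc, $priority, $message) = run_start_command("true");
print "$priority: $message\n";
```

The priority is chosen from the raw status rather than the shifted return code, so that a command killed by a signal is still reported as a failure.
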
Conclusion 
The procmon script helps to ensure that operation-critical
applications remain in operation. While a similar mechanism
is available 
through /etc/inittab and the init command, not all 
systems support it. Moreover, init provides no logging
or history mechanism
for determining whether there is a significant problem to be
reviewed.
 
 About the Author
 
Chris Hare is the Operations Manager of fONOROLA i*internet,
a Canadian national Internet Access Provider. He is
the co-author 
of the books Inside UNIX and Internet Firewalls and
Network 
Security. Along with his full-time job and writing,
he is the president 
of the Unilabs Research Group, and is presently working
on his third 
book, The UNIX Professional Reference, for New Riders
Publishing. 
Chris can be reached at chrish@fonorola.net or chrish@unilabs.org. 
 
 
 |