procmon: A Process Monitor
Chris Hare
Anyone responsible for maintaining a UNIX system must
be concerned
with ongoing processes, that is, those that are started
when the system
is brought up and that we never want to exit. Sometimes,
however,
that just isn't possible. Processes die for reasons that
include a system limit being reached, programming errors,
or a system resource not being available when needed.
For some processes, you can combat this situation by
using the /etc/inittab
file on System V systems. /etc/inittab contains a list
of
processes that are to be executed when the system enters
a run level,
and specifies what to do when the process exits.
The inittab file consists of four colon-separated fields:
identifier:run levels:action:command
A sample /etc/inittab file is shown in Figure 1.
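To make the four-field layout concrete, here is a minimal sketch in Python (not the perl used by procmon) that splits one inittab entry into its fields. The sample entry and the treatment of lines beginning with # as comments are assumptions for illustration; they are not taken from Figure 1.

```python
from typing import NamedTuple, Optional


class InittabEntry(NamedTuple):
    identifier: str
    run_levels: str
    action: str
    command: str


def parse_inittab_line(line: str) -> Optional[InittabEntry]:
    """Return the four fields of one entry, or None for blank/comment lines."""
    stripped = line.strip()
    # Assumption: blank lines and lines starting with '#' carry no entry.
    if not stripped or stripped.startswith("#"):
        return None
    # Split on the first three colons only, so a colon inside the
    # command field is left untouched.
    fields = stripped.split(":", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields: {line!r}")
    return InittabEntry(*fields)


# Hypothetical entry: respawn a getty on the console in run levels 2 and 3.
entry = parse_inittab_line("co:23:respawn:/etc/getty console console")
print(entry.identifier, entry.run_levels, entry.action, entry.command)
```

The action field (respawn in this hypothetical entry) is what tells init what to do when the command exits.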
inittab is a powerful mechanism, but it is found only
on System
V variants of UNIX. In addition, /etc/inittab and the
init
command give no indication that the process has exited
and been restarted
unless the process continuously dies and init prints
a message
on the console. The message is typically something like
"Command
is respawning too rapidly."
BSD-based operating systems, such as SunOS, BSD/OS from
BSDI, FreeBSD, and others, do not use /etc/inittab to control
processes spawned by init.
The problem then becomes "How can I provide a system-independent
method of monitoring and restarting critical system
processes?"
The answer is procmon.
Introducing procmon
procmon is a perl script that is started during system
startup, and runs for the life of the system. It has
been written
to be a system daemon, and it behaves as such. The purpose
of this
program is to monitor the operation of a set of defined
processes,
and if they are not present in the process list, to
restart them.
It also logs its actions, noting when the process fails
and when it
is restarted. A sample log, which is generated through
the UNIX syslog
facility, is shown in Figure 2.
The syslog output shows procmon starting up and recording
what it is doing. The goal here is to capture as much
logging information
as possible about the process being monitored, in this
case named.
Automatically Starting procmon
The benefit of a program such as procmon can only be
realized
when the program is started at system boot time. How
this is accomplished
depends upon the UNIX variant you are using. On System
V systems,
the lines shown in Figure 3 are added to /etc/rc2, or
to a
file in the /etc/rc2.d directory, which is the preferred
method.
procmon is a daemon process. It handles all of the system
signals, and disconnects itself from a controlling terminal.
When
procmon starts, it prints a line indicating what configuration
parameters it is using, and then quietly moves to the
background.
All logging at this point is accomplished through the
UNIX syslog
facility. The output printed when procmon starts is
shown
in Figure 4.
The procmon Files
procmon uses two configuration files: procmon.cfg
and procmon.cmd. Of the two, only procmon.cmd is absolutely
necessary. If procmon.cfg exists, then it will be used
to
alter the base configuration of the program. I will
discuss these
files in detail here.
The procmon Configuration File
The default configuration file is /etc/procmon.cfg.
If this
file is not found when procmon starts, then it uses
the default
parameters built into the program. This configuration
file is intended
to provide a mechanism for the system administrator
to change the
location of the procmon.cmd file and the delay between
checking
the commands in the list.
If no /etc/procmon.cfg file is found, then the program
looks
in the /usr/local/bin directory for procmon.cmd and
uses a delay of five minutes between checks. The sample
procmon.cfg
file shown in Figure 5 illustrates using a delay of
15 minutes (900
seconds), and a configuration directory of /etc. Notice
that
the delay value is in seconds, not minutes.
The /etc/procmon.cfg file is not parsed by procmon itself;
if it exists, it is loaded into procmon by perl. This
means that comments using the # symbol are supported, and
the last line of the file must contain the command 1; which
marks the end of the loaded file and gives perl the true
value it expects from a loaded file.
The reason for using this configuration file is that
the parameters
of the program can be modified without having to affect
the source
code. The delay_between variable is used to define the
amount
of delay between processing the list of commands. For
example, if
the delay_between variable is 300, then there will be
a pause
of 300 seconds between processing.
The ConfigDir variable tells procmon where
the procmon.cmd file is located. The program defaults
to looking for it in /usr/local/bin. The sample in Figure 5
places the file in /etc.
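The way the two variables combine with the built-in defaults can be sketched as follows. This is an illustration in Python rather than the perl of Listing 1; the function name effective_settings and the dictionary layout are hypothetical, while the variable names and default values come from the text above.

```python
import posixpath

# Built-in defaults as described in the article: a five-minute delay
# and a command file located in /usr/local/bin.
DEFAULTS = {"delay_between": 300, "ConfigDir": "/usr/local/bin"}


def effective_settings(overrides=None):
    """Merge optional overrides (standing in for procmon.cfg) with defaults."""
    settings = dict(DEFAULTS)
    settings.update(overrides or {})
    # The command file always has the same name; only its directory moves.
    settings["command_file"] = posixpath.join(settings["ConfigDir"], "procmon.cmd")
    return settings


# No configuration file present: the defaults apply.
print(effective_settings())

# Values matching the sample described in the text: 900 seconds and /etc.
print(effective_settings({"delay_between": 900, "ConfigDir": "/etc"}))
```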
If you look at Listing 1, the procmon source code, you
will
see
require "/etc/procmon.cfg";
which is how the configuration file is loaded into procmon.
The perl require command causes perl to read the named
file, procmon.cfg, into the current program. This is
a powerful
feature of perl, and it allows developers the freedom
to concentrate
on the problem they are trying to solve, rather than
on the mundane
task of processing configuration files.
The procmon Command File
The procmon command file contains the list of processes
that are to be monitored. It contains two exclamation
mark (!)-separated
fields: the pattern to search for in the process list,
and the name
of the command to execute if the pattern is not found.
A sample file
is shown in Figure 6.
In this file, procmon will be watching for named
and cron. If named is not in the process list, then
the command /etc/named is started. The same holds true
for
the cron command. Again, the purpose of using a configuration
file for this information is to allow the system administrator
to change the list without touching the source code. If the
contents of the file change, then the procmon daemon must be
restarted to read the changes.
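The command file format and the check against the process list can be sketched like this. It is a Python illustration under stated assumptions, not the logic of Listing 1: the path /etc/cron, the sample ps output, and the use of a plain substring match are all hypothetical.

```python
def parse_command_file(text):
    """Return (pattern, command) pairs from '!'-separated lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first '!' only; the rest is the start command.
        pattern, sep, command = line.partition("!")
        if not sep or not pattern or not command:
            raise ValueError(f"expected pattern!command, got {raw!r}")
        entries.append((pattern, command))
    return entries


def missing_entries(entries, ps_output):
    """Return the entries whose pattern appears on no line of the ps output."""
    lines = ps_output.splitlines()
    # Assumption: a simple substring test stands in for the real matching.
    return [(p, c) for p, c in entries if not any(p in l for l in lines)]


# Hypothetical command file and process list, for illustration only.
command_file = "named!/etc/named\ncron!/etc/cron\n"
ps_output = (
    "  PID TTY      TIME COMD\n"
    "    1 ?        0:03 init\n"
    "  212 ?        0:00 cron\n"
)

entries = parse_command_file(command_file)
print(missing_entries(entries, ps_output))
```

A substring match is simple but can be fooled by a pattern that also appears in another command's name or arguments, so the pattern should be chosen with care.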
procmon Messages
The syslog facility records several messages. These
messages are discussed
below.
Startup Messages
Startup messages are recorded by syslog when procmon
starts up. The appropriate information is substituted
for the values
in <value>. The <timestamp> is replaced
by the current time through
syslog. <PID> is the process identification number
of the
procmon process, and <system_name> is the name
of the system,
as recorded by syslog.
<timestamp> <system_name> procmon[<PID>]: Process Monitor started
<timestamp> <system_name> procmon[<PID>]: Loaded config file <value>
<timestamp> <system_name> procmon[<PID>]: Command File: <value>
<timestamp> <system_name> procmon[<PID>]: Loop Delay= <value>
<timestamp> <system_name> procmon[<PID>]: Adding <value> to stored process list
<timestamp> <system_name> procmon[<PID>]: Monitoring : <value> processes
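As a rough illustration of how records of this shape come about, the Python sketch below builds the startup message texts and hands them to syslog. The function names, the daemon facility, and the info priority are assumptions; the timestamp, system name, and the procmon[PID] tag are added by syslog itself once the program opens the log with its name and the PID option.

```python
def startup_messages(config_file, command_file, delay, patterns):
    """Build the message texts; syslog supplies the prefix on each record."""
    messages = [
        "Process Monitor started",
        f"Loaded config file {config_file}",
        f"Command File: {command_file}",
        f"Loop Delay= {delay}",
    ]
    messages += [f"Adding {p} to stored process list" for p in patterns]
    messages.append(f"Monitoring : {len(patterns)} processes")
    return messages


def log_startup(messages):
    """Send the messages to syslog, tagged as procmon[PID]."""
    import syslog  # Unix-only standard library module

    # Assumption: the daemon facility and info priority are illustrative.
    syslog.openlog("procmon", syslog.LOG_PID, syslog.LOG_DAEMON)
    for message in messages:
        syslog.syslog(syslog.LOG_INFO, message)


# Illustrative values matching the configuration described earlier.
msgs = startup_messages("/etc/procmon.cfg", "/etc/procmon.cmd", 900,
                        ["named", "cron"])
for m in msgs:
    print(m)
```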
Monitoring Messages
These messages are printed during the monitoring process;
they represent
the status of the monitored processes.
<timestamp> <system_name> procmon[<PID>]: <process> running as PID <PID>
This record is printed after every check, and indicates
that the monitored
process is running.
<timestamp> <system_name> procmon[<PID>]: <process> is NOT running
This record is printed when the monitored process cannot
be found
in the process list.
<timestamp> <system_name> procmon[<PID>]: Last Failure of <process> @ <time>
This record notes when the process last (previously) failed.
<timestamp> <system_name> procmon[<PID>]: issuing <start_command> to system
This record is printed before the identified command
is executed.
<timestamp> <system_name> procmon[<PID>]:<start_command> returns <return_code>
This record is printed after the command has been issued
to the system.
Looking at the syslog may give you clues regarding the
status
of things after the command was issued.
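The order in which these records appear during one check can be sketched as follows. This Python illustration is an assumption about the flow, not a transcription of Listing 1: the function check_process is hypothetical, the run argument is a stand-in for actually executing the command, and omitting the Last Failure record when there is no earlier failure is a guess.

```python
def check_process(name, start_command, pid, last_failure, now, run):
    """Return (messages, last_failure) for one check of one process.

    pid is the PID found in the process list, or None if the pattern
    was not found. run is a callable that starts the command and
    returns its return code.
    """
    if pid is not None:
        return [f"{name} running as PID {pid}"], last_failure

    messages = [f"{name} is NOT running"]
    # Assumption: the previous failure is reported only if one is on record.
    if last_failure is not None:
        messages.append(f"Last Failure of {name} @ {last_failure}")
    messages.append(f"issuing {start_command} to system")
    return_code = run(start_command)
    messages.append(f"{start_command} returns {return_code}")
    # This failure becomes the "last failure" for the next check.
    return messages, now


def fake_run(command):
    """Stand-in for starting the command; always reports success."""
    return 0


# First check: named is found. Second check: it has disappeared.
ok_msgs, last = check_process("named", "/etc/named", 131, None, "10:00", fake_run)
bad_msgs, last = check_process("named", "/etc/named", None, "09:15", "10:05", fake_run)
for m in ok_msgs + bad_msgs:
    print(m)
```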
Enhancements and Deficiencies
The procmon code in Listing 1 was written to run on
System
V systems. It has been in operation successfully since
December 18,
1994. However, some enhancements would be useful. For
example, it
would be wise to report a critical message in syslog
if the
command returns anything other than 0, since a non-zero
return
code generally indicates that the command did not start.
Additionally,
it would be better to include a BSD option to parse
the ps
output, and add an option in the configuration file
to choose System
V or BSD.
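Both proposed enhancements are small. The Python sketch below shows one possible shape for them; it is not part of Listing 1. The option values sysv and bsd and both function names are hypothetical, and the priority names follow the usual syslog levels.

```python
def ps_command(flavor):
    """Return a ps invocation that lists all processes for the given flavor."""
    commands = {
        "sysv": ["ps", "-e"],   # System V style: every process
        "bsd": ["ps", "-ax"],   # BSD style: all processes, with or without a tty
    }
    try:
        return commands[flavor]
    except KeyError:
        raise ValueError(f"unknown ps flavor: {flavor!r}") from None


def severity_for(return_code):
    """Pick a syslog priority name for the result of a start command."""
    # A non-zero return code generally means the command did not start,
    # so it is reported as critical rather than informational.
    return "info" if return_code == 0 else "crit"


print(ps_command("bsd"), severity_for(0), severity_for(1))
```

The two ps variants also lay out their columns differently, so the code that extracts the PID from each line would need a matching per-flavor branch.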
Conclusion
The procmon script helps to ensure that operation-critical
applications remain in operation. While a similar mechanism
is available
through /etc/inittab and the init command, not all
systems support it. Moreover, init provides no logging
or history mechanism to determine if there is a significant
problem to be reviewed.
About the Author
Chris Hare is the Operations Manager of fONOROLA i*internet,
a Canadian national Internet Access Provider. He is
the co-author
of the books Inside UNIX and Internet Firewalls and
Network
Security. Along with his full-time job and writing,
he is the president
of the Unilabs Research Group, and is presently working
on his third
book, The UNIX Professional Reference, for New Riders
Publishing.
Chris can be reached at chrish@fonorola.net or chrish@unilabs.org.