Cover V06, I02


Listing 1: Configuration file for patrol

# Machine and process names have been changed to protect the innocent.

# -------------------------------------------------------------------------------------------
# Processes that we want to control.  Thresh/Action pairs can be
# repeated on succeeding lines, to provide different levels of Actions
# for different thresholds.
# -------------------------------------------------------------------------------------------

# -------------------------------------------------------------------------------------------
# First, list some processes that we want to restrict.  By listing specific machines and
# processes first, we can have exceptions from our general rules.  For instance, the
# operator machine has usage=server, and xclock would normally not be allowed, but
# by listing it specifically first, it is allowed as long as it doesn't take more
# than 5% of the CPU.
# -------------------------------------------------------------------------------------------

#   host                  Process name          Thresh Action
#   --------------------  --------------------  ------ --------------------------------------
PC  operator              xclock                   5   kill,mcons(alert),mail(boeheim,LOOP)
PC  web.*                 midaswww|x?mosaic       50   mcons(warn),mail($user,BROWSER)
PC  web.*                 netscape.*              50   mcons(warn),mail($user,BROWSER)
PC  <usage=interactive>   xbiff|xclock             5   kill,mcons(alert),mail($user,LOOP)
PC  <usage=interactive>   xscreensaver             5   kill,mcons(alert),mail($user,LOOP)
PC  <usage=.*>            xeyes|xroach|xbiff       0   kill,mcons(info),mail($user,BAN_PUBLIC)
PC  <usage=.*>            xclock|xscreensaver      0   kill,mcons(info),mail($user,BAN_SERVER)
PC  <usage=compute>       m?[xt]?rn|midaswww       0   mcons(info,6h),mail($user,BAN_COMPUTE)
PC  <usage=batch>         m?[xt]?rn|midaswww       0   kill,mcons(info),mail($user,BAN_BATCH)
PC  <usage=fileserver>    m?[xt]?rn|midaswww       0   kill,mcons(info),mail($user,BAN_FILESERVER)
PC  <usage=server>        m?[xt]?rn|midaswww       0   kill,mcons(info),mail($user,BAN_SERVER)
PC  <usage=compute>       x?mosaic|netscape.*      0   mcons(info,6h),mail($user,BAN_COMPUTE)
PC  <usage=batch>         x?mosaic|netscape.*      0   kill,mcons(info),mail($user,BAN_BATCH)
PC  <usage=fileserver>    x?mosaic|netscape.*      0   kill,mcons(info),mail($user,BAN_FILESERVER)
PC  <usage=server>        x?mosaic|netscape.*      0   kill,mcons(info),mail($user,BAN_SERVER)

# -------------------------------------------------------------------------------------------
# Processes that are always re-niced.
# -------------------------------------------------------------------------------------------

#   host                  Process name          Thresh Action
#   --------------------  --------------------  ------ --------------------------------------
PC  *                     xeyes|xroach|xbiff       0   nolog,nice(min)
PC  *                     xclock|xload             0   nolog,nice(min)
PC  *                     xscreensaver             0   nolog,nice(min)

# -------------------------------------------------------------------------------------------
# Processes that are expected to be heavy users in specific places.  These rules allow
# the backup processes on the fileserver and tape server, any process on sctl,
# and ypserv on any machine that runs it.
# -------------------------------------------------------------------------------------------

#   host                  Process name          Thresh Action
#   --------------------  --------------------  ------ --------------------------------------
PC  fs01                  dsc                     50   mcons(info)
PC  fs01                  ds                      50   mcons(info)
PC  fs01                  find                    50   mcons(warn),mail($user,CPUHOG)
PC  fs01                  diskusg                 20   mcons(info)
PC  sserv[0-9]            kproc                   30
PC  sserv[0-9]            butc                    60   mcons(warn)
PC  sctl                  *                       10   mcons(info)
PC  *                     ypserv                  10   mcons(info)

# -------------------------------------------------------------------------------------------
# Web browsers and other processes that can get out of hand.  Processes are generally put
# here after a few experiences with runaways.
# -------------------------------------------------------------------------------------------

#   host                  Process name          Thresh Action
#   --------------------  --------------------  ------ --------------------------------------
PC  <usage=interactive>   (x?mosaic|netscape.*)   50   kill,mcons(alert),mail($user,LOOP)
                                                  25   mcons(warn),mail($user,BROWSER)
PC  *                     xterm                   50   kill,mcons(alert),mail($user,LOOP)
                                                  20   mcons(warn),mail($user,CPUHOG)
PC  *                     telnet                  50   kill,mcons(alert),mail($user,LOOP)
                                                  20   mcons(warn),mail($user,CPUHOG)
PC  *                     rlogin                  50   kill,mcons(alert),mail($user,LOOP)
                                                  20   mcons(warn),mail($user,CPUHOG)
PC  *                     midaswww                50   kill,mcons(alert),mail($user,LOOP)
                                                  20   mcons(warn),mail($user,CPUHOG)
PC  *                     \[?pine\]?              50   kill,mcons(alert),mail($user,LOOP)
                                                  20   mcons(warn),mail($user,CPUHOG)

# -------------------------------------------------------------------------------------------
# General process limits for specific machines and classes of machines.
# -------------------------------------------------------------------------------------------

#   host                  Process name          Thresh Action
#   --------------------  --------------------  ------ --------------------------------------
PC  web.*                 *                       10   mcons(info)
PC  news                  *                       10   mcons(info)
PC  pmon                  *                       20   mcons(info)
PC  dbserv                *                       20   mcons(info)
PC  <usage=fileserver>    *                        5   mcons(warn),mail($user,CPUHOG)
PC  <usage=server>        *                        5   mcons(warn),mail($user,CPUHOG)
PC  <usage=interactive>   *                       50   mcons(warn),mail($user,CPUHOG),nice(min)
                                                  20   mcons(info),mail($user,CPUHOG)


# -------------------------------------------------------------------------------------------
# Process memory thresholds: processes using more than the listed
# amount of memory cause an alert.
# -------------------------------------------------------------------------------------------

#   host                  Process name    Mem Limit  Action
#   --------------------  --------------  ---------  ----------------------------------------
PM  <usage=interactive>   xrn               20000    mcons(alert),kill,mail($user,XRNMEM)
PM  <usage=compute>       *                100000    mcons(alert),kill,mail($user,THRASH)
                                            80000    mcons(warn)
                                            40000    mcons(info)
PM  <usage=batch>         *                 60000    mcons(alert),kill,mail($user,THRASH)
                                            40000    mcons(warn)
                                            25000    mcons(info)
PM  <usage=.*>            *                 80000    mcons(alert),kill,mail($user,THRASH)
                                            40000    mcons(warn)
                                            20000    mcons(info)


# -----------------------------------------------------------------------------
# Filesystem thresholds: a list of comma-separated threshold values.  Each
# value is of the form t[+r], where t is the threshold percent full, and r is
# the threshold growth in percent per 15 minute interval.  The alert is sent if
# t and r are both reached or exceeded.  If r is not given, t is an absolute
# threshold.
# -----------------------------------------------------------------------------

# Direct messages from networking machines to networking group

#   host            Filesystem             Thresholds          Action
#   ------------    --------------         ------------------  --------------
F   tmon|pmon|nmon  /(var|tmp)             99                  mcons(warn),mail(netw,FSFULL,6h)
                                           95,90+1,80+5,50+50  mcons(warn),mail(netw,FSFULL,1d)
F   tmon|pmon|nmon  /u[0-9]+               95                  mcons(warn),mail(netw,FSFULL,6h)
                                           80+2                mcons(info),mail(netw,FSFULL,1h)


#   host          Filesystem               Thresholds          Action
#   ------------  --------------           ------------------  --------------
F   oper[12]      /u[12]                   99                  mcons(alert),page(boeheim,1d)
                                           98,90+1,2+2         mcons(alert),mail(boeheim,FSFULL)
F   fs01          /u[0-9]+                 100                 mcons(alert),mail(unix-admin,FSFULL,1d)
                                           99                  mcons(warn,2h)
                                           90+1,80+5,20+20     mcons(info)
F   fs01          /usr/local               100                 mcons(alert),mail(unix-admin,FSFULL,1d)
                                           99                  mcons(warn,2h)
                                           90+1,80+5,20+20     mcons(info)
F   fs01          /usr/work                100                 mcons(alert),mail(unix-admin,FSFULL,1d)
                                           99                  mcons(warn,2h)
                                           90+1,80+5,20+20     mcons(info)
F   fs01          /var/adm/acct/collect    90,80+1             mcons(warn),mail(account,FSFULL)
F   afs[0-9]+     /vice.*                  95                  mcons(alert),mail(unix-admin,FSFULL,1d)
                                           80+1,50+5,20+20     mcons(warn,2h)
F   news          /usr/lib/news            80,50+5,20+20       mcons(warn)
F   news          /var/spool/news          80,50+5,20+20       mcons(warn)
F   news          /var/spool/news/in.coming 80,50+5,20+20       mcons(warn)
F   news          /var/spool/news/out.going 80,50+5,20+20       mcons(warn)
F   news          /var/spool/threads       80,50+5,20+20       mcons(warn)
F   web1          /var/www                 99,90+5,95+1        mcons(warn),mail(www-admin,FSFULL)
F   <usage=.*>    /(var|tmp)               99                  mcons(alert),mail(unix-admin,FSFULL,6h),page(boeheim,6h)
                                           95,90+1,80+5,50+50  mcons(warn)
F   *             /(var|tmp)               99                  mcons(warn),mail(unix-admin,FSFULL,1d)
                                           95,90+1,80+5,50+50  mcons(info)


# -----------------------------------------------------------------------------
# Daemons: the absence of any listed daemon on a machine causes
# an alert.
# -----------------------------------------------------------------------------

#   host                  daemon          Action
#   --------------------  --------------  -------------------------------------
D   *                     inetd           mcons(alert)
D   <mailbox-server>      sendmail:?      mcons(alert),name(sendmail)
D   *                     cron            mcons(alert)
D   *                     ypbind          mcons(alert)
D   afs0x                 gated           nolog
D   <subnet=multiple>     gated           mcons(alert)
D   ![ultrix-mips]        syslogd         mcons(alert)
D   [ultrix-mips]         syslog          mcons(alert)
D   *                     portmap         mcons(alert)
D   [aix6000]             qdaemon         mcons(alert),restart('startsrc -s qdaemon')
D   [sun4]                lpd             mcons(info),restart('/usr/lib/lpd')
D   <afs>                 afs.inetd       mcons(alert)
D   <amd>                 amd             mcons(alert)
D   <lsf>                 res             mcons(warn)
D   <lsf>                 lim             mcons(warn)
D   !`grep '^$host' /configdir/lsb.hosts`    sbatchd  mcons(warn)
D   <console-server>      conserver       mcons(warn),restart('su - console -c exec /bin/startcs&')
D   ops                   dispatch        mcons(alert),page(boeheim,NO_DAEMON)
D   <watson-account>      acctd           mcons(warn),mail(account,NO_DAEMON)
D   [aix6000]             named           mcons(alert),mail(netw,NO_DAEMON),restart('startsrc -s named')

# ----------------------------------------------------------------------
# Service Port checks: see if anything is at least listening on the port.
# ----------------------------------------------------------------------

#   host              port            Action
#   --------------  --------------  -------------------------------------
SP  *                 tcp/telnet      mcons(warn),restart('kickinetd')
SP  *                 tcp/exec        mcons(warn),restart('kickinetd')
SP  *                 tcp/finger      mcons(warn),restart('kickinetd')
SP  *                 tcp/ftp         mcons(warn),restart('kickinetd')
SP  !<afs>            tcp/shell       mcons(warn),restart('kickinetd')
SP  !<afs>            tcp/login       mcons(warn),restart('kickinetd')
SP  <afs>             tcp/shell       mcons(warn),restart('kickafsinetd')
SP  <afs>             tcp/login       mcons(warn),restart('kickafsinetd')

# ----------------------------------------------------------------------
# Message texts for above conditions.  The variables $cmd, $cmdline,
# and $host are available to all messages, and $pct for PL and PK messages.
# ----------------------------------------------------------------------

M LOOP <<EOF
Your program $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes.  This is an apparent loop, and the
process has been killed.  If this happens repeatedly to you, or
if you have any questions, simply reply to this message.
EOF

M BAN_COMPUTE <<EOF
The program $cmd may not be run on $host.  This machine is
designated as a server for compute-intensive work.  Interactive
programs should be run on one of the interactive hosts listed in
the UNIX Web page.

Thank you for your cooperation.
EOF

M BAN_BATCH <<EOF
The program $cmd may not be run on $host.  This machine is
designated as a server for batch work.  Interactive
programs should be run on one of the interactive hosts listed in
the UNIX Web page.

Thank you for your cooperation.
EOF

M BAN_FILESERVER <<EOF
The program $cmd may not be run on $host. This machine is designated
as a fileserver, and any non-essential programs may degrade its
performance.  Interactive programs should be run on one of the
interactive hosts listed in the UNIX Web page.

Thank you for your cooperation.
EOF

M BAN_SERVER <<EOF
The program $cmd may not be run on $host. This machine is designated
as a special-purpose server, and any non-essential programs may
degrade its performance.  Interactive programs should be run on one
of the interactive hosts listed in the UNIX Web page.

Thank you for your cooperation.
EOF

M BAN_PUBLIC <<EOF
The program $cmd may not be run on any public server.  It causes a
high level of network load and system interrupts.

Thank you for your cooperation.
EOF

M CPUHOG <<EOF
Your program $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes.  If this is a runaway process, please
kill it.  If this is normal behavior for this program, please run it
on the morgan cluster, which is intended for compute-intensive work.
If you need assistance, simply reply to this message.

See the UNIX Web page for a complete list of machines and the type
of work they are designated for.  You might also want to use the
batch compute farm; contact us for more details about the farm.

Thank you for your cooperation.
EOF

M BROWSER <<EOF
Your Web browser $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes.  If this seems inconsistent with your
usage of it, it may be looping, and you should exit it and restart,
or kill the process with the UNIX kill command. If you need
assistance, simply reply to this message.
EOF

M FSFULL <<EOF
Filesystem $mount on $host is $pct% full.
EOF

M NO_DAEMON <<EOF
Daemon $name is not running on $host.
EOF

M XRNMEM <<EOF
Your program $cmd on $host has been using $sz KB of memory.
It has been killed to prevent it from causing problems for
other users of this machine. xrn has a known problem that
causes excessive memory use. If this persists, try using
mxrn, rn, or trn instead.

If you need assistance, simply reply to this message.
Thank you for your cooperation.
EOF

M THRASH <<EOF
Your program $cmd on $host has been using $sz KB of memory.
It has been killed to prevent it from causing problems for
other users of this machine. This exceeds the available
real memory of the machine, and causes excessive page
and swap activity.

If you need assistance, simply reply to this message.
Thank you for your cooperation.
EOF
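
The filesystem threshold syntax described in the listing (comma-separated values of the form t[+r]) can be illustrated with a short sketch. This is not patrol's own code. The names Threshold, parse_thresholds, and exceeded are hypothetical, and the sketch assumes that a rule fires as soon as any one value in the list is satisfied, which the listing's comments suggest but do not state outright.

```python
from typing import List, NamedTuple, Optional


class Threshold(NamedTuple):
    full: float               # t: percent full
    growth: Optional[float]   # r: percent growth per 15-minute interval, None if omitted


def parse_thresholds(spec: str) -> List[Threshold]:
    """Parse a spec such as '95,90+1,80+5' into Threshold tuples."""
    result = []
    for item in spec.split(","):
        t, sep, r = item.strip().partition("+")
        # Without a '+r' part, t is an absolute threshold.
        result.append(Threshold(float(t), float(r) if sep else None))
    return result


def exceeded(spec: str, pct_full: float, growth: float) -> bool:
    """Return True if any value in the spec is reached or exceeded.

    Assumption: one satisfied value is enough to trigger the action.
    """
    for th in parse_thresholds(spec):
        if pct_full >= th.full and (th.growth is None or growth >= th.growth):
            return True
    return False


if __name__ == "__main__":
    spec = "95,90+1,80+5,50+50"
    print(exceeded(spec, pct_full=91, growth=0.5))  # full enough, but growing too slowly
    print(exceeded(spec, pct_full=91, growth=1.0))  # satisfies 90+1
    print(exceeded(spec, pct_full=96, growth=0.0))  # satisfies the absolute value 95
```

Under this reading, a value such as 50+50 catches a filesystem that is only half full but filling very quickly, while a bare 95 catches one that is nearly full however slowly it grows.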