Listing 1: Configuration file for patrol
# Machine and process names have been changed to protect the innocent.
# -------------------------------------------------------------------------------------------
# Processes that we want to control. Thresh/Action pairs can be
# repeated on succeeding lines, to provide different levels of Actions
# for different thresholds.
# -------------------------------------------------------------------------------------------
# -------------------------------------------------------------------------------------------
# First, list some processes that we want to restrict. By listing specific machines and
# processes first, we can have exceptions from our general rules. For instance, the
# operator machine has usage=server, and xclock would normally not be allowed, but
# by listing it specifically first, it is allowed as long as it doesn't take more
# than 5% of the CPU.
# -------------------------------------------------------------------------------------------
#  host                 Process name           Thresh Action
#  -------------------- ---------------------- ------ --------------------------------------
PC operator             xclock                 5      kill,mcons(alert),mail(boeheim,LOOP)
PC web.*                midaswww|x?mosaic      50     mcons(warn),mail($user,BROWSER)
PC web.*                netscape.*             50     mcons(warn),mail($user,BROWSER)
PC <usage=interactive>  xbiff|xclock           5      kill,mcons(alert),mail($user,LOOP)
PC <usage=interactive>  xscreensaver           5      kill,mcons(alert),mail($user,LOOP)
PC <usage=.*>           xeyes|xroach|xbiff     0      kill,mcons(info),mail($user,BAN_PUBLIC)
PC <usage=.*>           xclock|xscreensaver    0      kill,mcons(info),mail($user,BAN_SERVER)
PC <usage=compute>      m?[xt]?rn|midaswww     0      mcons(info,6h),mail($user,BAN_COMPUTE)
PC <usage=batch>        m?[xt]?rn|midaswww     0      kill,mcons(info),mail($user,BAN_BATCH)
PC <usage=fileserver>   m?[xt]?rn|midaswww     0      kill,mcons(info),mail($user,BAN_FILESERVER)
PC <usage=server>       m?[xt]?rn|midaswww     0      kill,mcons(info),mail($user,BAN_SERVER)
PC <usage=compute>      x?mosaic|netscape.*    0      mcons(info,6h),mail($user,BAN_COMPUTE)
PC <usage=batch>        x?mosaic|netscape.*    0      kill,mcons(info),mail($user,BAN_BATCH)
PC <usage=fileserver>   x?mosaic|netscape.*    0      kill,mcons(info),mail($user,BAN_FILESERVER)
PC <usage=server>       x?mosaic|netscape.*    0      kill,mcons(info),mail($user,BAN_SERVER)
# -------------------------------------------------------------------------------------------
# Processes that are always re-niced.
# -------------------------------------------------------------------------------------------
#  host                 Process name           Thresh Action
#  -------------------- ---------------------- ------ --------------------------------------
PC *                    xeyes|xroach|xbiff     0      nolog,nice(min)
PC *                    xclock|xload           0      nolog,nice(min)
PC *                    xscreensaver           0      nolog,nice(min)
# -------------------------------------------------------------------------------------------
# Processes that are expected to be heavy users in specific places. These rules allow
# the backup processes on the fileserver and tape server, any process on sctl,
# and ypserv on any machine that runs it.
# -------------------------------------------------------------------------------------------
#  host                 Process name           Thresh Action
#  -------------------- ---------------------- ------ --------------------------------------
PC fs01                 dsc                    50     mcons(info)
PC fs01                 ds                     50     mcons(info)
PC fs01                 find                   50     mcons(warn),mail($user,CPUHOG)
PC fs01                 diskusg                20     mcons(info)
PC sserv[0-9]           kproc                  30
PC sserv[0-9]           butc                   60     mcons(warn)
PC sctl                 *                      10     mcons(info)
PC *                    ypserv                 10     mcons(info)
# -------------------------------------------------------------------------------------------
# Web browsers and other processes that can get out of hand. Processes are generally put
# here after a few experiences with runaways.
# -------------------------------------------------------------------------------------------
#  host                 Process name           Thresh Action
#  -------------------- ---------------------- ------ --------------------------------------
PC <usage=interactive>  (x?mosaic|netscape.*)  50     kill,mcons(alert),mail($user,LOOP)
                                               25     mcons(warn),mail($user,BROWSER)
PC *                    xterm                  50     kill,mcons(alert),mail($user,LOOP)
                                               20     mcons(warn),mail($user,CPUHOG)
PC *                    telnet                 50     kill,mcons(alert),mail($user,LOOP)
                                               20     mcons(warn),mail($user,CPUHOG)
PC *                    rlogin                 50     kill,mcons(alert),mail($user,LOOP)
                                               20     mcons(warn),mail($user,CPUHOG)
PC *                    midaswww               50     kill,mcons(alert),mail($user,LOOP)
                                               20     mcons(warn),mail($user,CPUHOG)
PC *                    \[?pine\]?             50     kill,mcons(alert),mail($user,LOOP)
                                               20     mcons(warn),mail($user,CPUHOG)
# -------------------------------------------------------------------------------------------
# General process limits for specific machines and classes of machines.
# -------------------------------------------------------------------------------------------
#  host                 Process name           Thresh Action
#  -------------------- ---------------------- ------ --------------------------------------
PC web.*                *                      10     mcons(info)
PC news                 *                      10     mcons(info)
PC pmon                 *                      20     mcons(info)
PC dbserv               *                      20     mcons(info)
PC <usage=fileserver>   *                      5      mcons(warn),mail($user,CPUHOG)
PC <usage=server>       *                      5      mcons(warn),mail($user,CPUHOG)
PC <usage=interactive>  *                      50     mcons(warn),mail($user,CPUHOG),nice(min)
                                               20     mcons(info),mail($user,CPUHOG)
# -------------------------------------------------------------------------------------------
# Process memory thresholds: processes using more than the listed
# amount of memory cause an alert.
# -------------------------------------------------------------------------------------------
#  host                 Process name   Mem Limit Action
#  -------------------- -------------- --------- ----------------------------------------
PM <usage=interactive>  xrn            20000     mcons(alert),kill,mail($user,XRNMEM)
PM <usage=compute>      *              100000    mcons(alert),kill,mail($user,THRASH)
                                       80000     mcons(warn)
                                       40000     mcons(info)
PM <usage=batch>        *              60000     mcons(alert),kill,mail($user,THRASH)
                                       40000     mcons(warn)
                                       25000     mcons(info)
PM <usage=.*>           *              80000     mcons(alert),kill,mail($user,THRASH)
                                       40000     mcons(warn)
                                       20000     mcons(info)
# -----------------------------------------------------------------------------
# Filesystem thresholds: a list of comma-separated threshold values. Each
# value is of the form t[+r], where t is the threshold percent full, and r is
# the threshold growth in percent per 15-minute interval. The alert is sent if
# t and r are both reached or exceeded. If r is not given, t is an absolute
# threshold.
# -----------------------------------------------------------------------------
# Direct messages from networking machines to networking group
#  host           Filesystem                Thresholds         Action
#  -------------- ------------------------- ------------------ --------------------------------------
F  tmon|pmon|nmon /(var|tmp)                99                 mcons(warn),mail(netw,FSFULL,6h)
                                            95,90+1,80+5,50+50 mcons(warn),mail(netw,FSFULL,1d)
F  tmon|pmon|nmon /u[0-9]+                  95                 mcons(warn),mail(netw,FSFULL,6h)
                                            80+2               mcons(info),mail(netw,FSFULL,1h)
#  host           Filesystem                Thresholds         Action
#  -------------- ------------------------- ------------------ --------------------------------------
F  oper[12]       /u[12]                    99                 mcons(alert),page(boeheim,1d)
                                            98,90+1,2+2        mcons(alert),mail(boeheim,FSFULL)
F  fs01           /u[0-9]+                  100                mcons(alert),mail(unix-admin,FSFULL,1d)
                                            99                 mcons(warn,2h)
                                            90+1,80+5,20+20    mcons(info)
F  fs01           /usr/local                100                mcons(alert),mail(unix-admin,FSFULL,1d)
                                            99                 mcons(warn,2h)
                                            90+1,80+5,20+20    mcons(info)
F  fs01           /usr/work                 100                mcons(alert),mail(unix-admin,FSFULL,1d)
                                            99                 mcons(warn,2h)
                                            90+1,80+5,20+20    mcons(info)
F  fs01           /var/adm/acct/collect     90,80+1            mcons(warn),mail(account,FSFULL)
F  afs[0-9]+      /vice.*                   95                 mcons(alert),mail(unix-admin,FSFULL,1d)
                                            80+1,50+5,20+20    mcons(warn,2h)
F  news           /usr/lib/news             80,50+5,20+20      mcons(warn)
F  news           /var/spool/news           80,50+5,20+20      mcons(warn)
F  news           /var/spool/news/in.coming 80,50+5,20+20      mcons(warn)
F  news           /var/spool/news/out.going 80,50+5,20+20      mcons(warn)
F  news           /var/spool/threads        80,50+5,20+20      mcons(warn)
F  web1           /var/www                  99,90+5,95+1       mcons(warn),mail(www-admin,FSFULL)
F  <usage=.*>     /(var|tmp)                99                 mcons(alert),mail(unix-admin,FSFULL,6h),page(boeheim,6h)
                                            95,90+1,80+5,50+50 mcons(warn)
F  *              /(var|tmp)                99                 mcons(warn),mail(unix-admin,FSFULL,1d)
                                            95,90+1,80+5,50+50 mcons(info)
# -----------------------------------------------------------------------------
# Daemons: the absence of any listed daemon on a machine causes
# an alert.
# -----------------------------------------------------------------------------
#  host                 daemon         Action
#  -------------------- -------------- -------------------------------------
D  *                    inetd          mcons(alert)
D  <mailbox-server>     sendmail:?     mcons(alert),name(sendmail)
D  *                    cron           mcons(alert)
D  *                    ypbind         mcons(alert)
D  afs0x                gated          nolog
D  <subnet=multiple>    gated          mcons(alert)
D  ![ultrix-mips]       syslogd        mcons(alert)
D  [ultrix-mips]        syslog         mcons(alert)
D  *                    portmap        mcons(alert)
D  [aix6000]            qdaemon        mcons(alert),restart('startsrc -s qdaemon')
D  [sun4]               lpd            mcons(info),restart('/usr/lib/lpd')
D  <afs>                afs.inetd      mcons(alert)
D  <amd>                amd            mcons(alert)
D  <lsf>                res            mcons(warn)
D  <lsf>                lim            mcons(warn)
D  !`grep '^$host' /configdir/lsb.hosts` sbatchd        mcons(warn)
D  <console-server>     conserver      mcons(warn),restart('su - console -c exec /bin/startcs&')
D  ops                  dispatch       mcons(alert),page(boeheim,NO_DAEMON)
D  <watson-account>     acctd          mcons(warn),mail(account,NO_DAEMON)
D  [aix6000]            named          mcons(alert),mail(netw,NO_DAEMON),restart('startsrc -s named')
# ----------------------------------------------------------------------
# Service Port checks: see if anything is at least listening on the port.
# ----------------------------------------------------------------------
#  host           port           Action
#  -------------- -------------- -------------------------------------
SP *              tcp/telnet     mcons(warn),restart('kickinetd')
SP *              tcp/exec       mcons(warn),restart('kickinetd')
SP *              tcp/finger     mcons(warn),restart('kickinetd')
SP *              tcp/ftp        mcons(warn),restart('kickinetd')
SP !<afs>         tcp/shell      mcons(warn),restart('kickinetd')
SP !<afs>         tcp/login      mcons(warn),restart('kickinetd')
SP <afs>          tcp/shell      mcons(warn),restart('kickafsinetd')
SP <afs>          tcp/login      mcons(warn),restart('kickafsinetd')
# ----------------------------------------------------------------------
# Message texts for above conditions. The variables $cmd, $cmdline,
# and $host are available to all messages, and $pct for PL and PK messages.
# ----------------------------------------------------------------------
M LOOP <<EOF
Your program $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes. This is an apparent loop, and the
process has been killed. If this happens repeatedly to you, or
if you have any questions, simply reply to this message.
EOF
M BAN_COMPUTE <<EOF
The program $cmd may not be run on $host. This machine is
designated as a server for compute-intensive work. Interactive
programs should be run on one of the interactive hosts listed in
the UNIX Web page.
Thank you for your cooperation.
EOF
M BAN_BATCH <<EOF
The program $cmd may not be run on $host. This machine is
designated as a server for batch work. Interactive
programs should be run on one of the interactive hosts listed in
the UNIX Web page.
Thank you for your cooperation.
EOF
M BAN_FILESERVER <<EOF
The program $cmd may not be run on $host. This machine is designated
as a fileserver, and any non-essential programs may degrade its
performance. Interactive programs should be run on one of the
interactive hosts listed in the UNIX Web page.
Thank you for your cooperation.
EOF
M BAN_SERVER <<EOF
The program $cmd may not be run on $host. This machine is designated
as a special-purpose server, and any non-essential programs may
degrade its performance. Interactive programs should be run on one
of the interactive hosts listed in the UNIX Web page.
Thank you for your cooperation.
EOF
M BAN_PUBLIC <<EOF
The program $cmd may not be run on any public server. It causes a
high level of network load and system interrupts.
Thank you for your cooperation.
EOF
M CPUHOG <<EOF
Your program $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes. If this is a runaway process, please
kill it. If this is normal behavior for this program please run it
on the morgan cluster, which is intended for compute-intensive work.
If you need assistance, simply reply to this message.
See the UNIX Web page for a complete list of machines and the type
of work they are designated for. You might also want to use the
batch compute farm; contact us for more details about the farm.
Thank you for your cooperation.
EOF
M BROWSER <<EOF
Your Web browser $cmd has been using $pct% of the cpu on $host for a
period of at least 15 minutes. If this seems inconsistent with your
usage of it, it may be looping, and you should exit it and restart,
or kill the process with the UNIX kill command. If you need
assistance, simply reply to this message.
EOF
M FSFULL <<EOF
Filesystem $mount on $host is $pct% full.
EOF
M NO_DAEMON <<EOF
Daemon $name is not running on $host.
EOF
M XRNMEM <<EOF
Your program $cmd on $host has been using $sz KB of memory.
It has been killed to prevent it from causing problems for
other users of this machine. xrn has a known problem that
causes excessive memory use. If this persists, try using
mxrn, rn, or trn instead.
If you need assistance, simply reply to this message.
Thank you for your cooperation.
EOF
M THRASH <<EOF
Your program $cmd on $host has been using $sz KB of memory.
It has been killed to prevent it from causing problems for
other users of this machine. This exceeds the available
real memory of the machine, and causes excessive page
and swap activity.
If you need assistance, simply reply to this message.
Thank you for your cooperation.
EOF
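
The comments at the top of the listing say that specific machines and processes are listed first so that they act as exceptions to the general rules. The Python sketch below illustrates that first-match idea with two of the PC rules. It is not patrol's own code: the names Rule, first_match, and actions_for are hypothetical, the <usage=.*> host class is stood in for by a plain regular expression, and firing at or above the threshold (rather than strictly above it) is an assumption.

```python
import re
from typing import List, Optional, Tuple

# (host pattern, process pattern, CPU threshold in percent, action string)
Rule = Tuple[str, str, int, str]

# Two rules taken from the listing. The second one uses "<usage=.*>" as its
# host field in the listing; here it is approximated by the regex ".*".
RULES: List[Rule] = [
    ("operator", "xclock", 5, "kill,mcons(alert),mail(boeheim,LOOP)"),
    (".*", "xclock|xscreensaver", 0, "kill,mcons(info),mail($user,BAN_SERVER)"),
]


def first_match(rules: List[Rule], host: str, process: str) -> Optional[Rule]:
    """Return the first rule whose host and process patterns both match."""
    for rule in rules:
        host_pat, proc_pat, _thresh, _actions = rule
        if re.fullmatch(host_pat, host) and re.fullmatch(proc_pat, process):
            return rule
    return None


def actions_for(rules: List[Rule], host: str, process: str,
                cpu_pct: float) -> Optional[str]:
    """Return the action string to run, or None if nothing should happen.

    Only the first matching rule is consulted, so an early, specific rule
    shields a process from the later, more general ones.
    """
    rule = first_match(rules, host, process)
    if rule is None:
        return None
    _host_pat, _proc_pat, thresh, actions = rule
    # Assumption: the action fires when usage reaches the threshold.
    return actions if cpu_pct >= thresh else None
```

With these two rules, xclock on the operator machine at 2% CPU triggers nothing, because the specific rule matches first and its 5% threshold is not reached. The same process on any other host falls through to the general rule and is killed.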
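
The filesystem section of the listing describes each threshold as t[+r]: t is the percent full, r is the growth in percent per 15-minute interval, and an entry without r is an absolute threshold. The Python sketch below parses such a list and evaluates it. It only illustrates the rule as the comments state it. The names Threshold, parse_thresholds, and alert_due are hypothetical, and accepting only whole-number values is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class Threshold(NamedTuple):
    percent_full: float        # t: percent full
    growth: Optional[float]    # r: growth per interval, None if absolute


# One entry of the form t or t+r (whole numbers only; an assumption).
_SPEC = re.compile(r"^(\d+)(?:\+(\d+))?$")


def parse_thresholds(field: str) -> List[Threshold]:
    """Parse a comma-separated threshold field such as '95,90+1,80+5'."""
    result = []
    for item in field.split(","):
        m = _SPEC.match(item.strip())
        if m is None:
            raise ValueError(f"bad threshold spec: {item!r}")
        t = float(m.group(1))
        r = float(m.group(2)) if m.group(2) is not None else None
        result.append(Threshold(t, r))
    return result


def alert_due(field: str, pct_full: float, growth: float) -> bool:
    """True if any entry is reached or exceeded.

    An entry t+r needs both pct_full >= t and growth >= r.
    An entry without r needs only pct_full >= t.
    """
    return any(
        pct_full >= th.percent_full
        and (th.growth is None or growth >= th.growth)
        for th in parse_thresholds(field)
    )
```

For the field 95,90+1,80+5,50+50, a filesystem that is 91% full and grew by 1.5 points in the last interval satisfies the 90+1 entry. The same filesystem growing by only 0.5 points satisfies none of the entries.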