A System Load Monitoring Trilogy
Leor Zolman
If you've been following my articles in the past two
issues of Sys
Admin, you've probably noticed that one of my big concerns
as system
administrator here at R&D Publications has been
to seek out new and
useful ways to smooth out the CPU system load on our
single-CPU Xenix
installation.
The overnight and background job spooling utilities
described previously
allow our users a great degree of direct control over
their use of
system resources. From time to time, the users must
make decisions
such as whether to launch a long series of reports in
the background
or to run them overnight, instead. Most of our users,
however, are
not technical enough to comfortably use the standard
UNIX/Xenix diagnostic
utilities to get a handle on the system load. Without
a tool to translate
the load figures spewed by programs such as uptime into
plain
English, those users would lack the information on which
to make job
scheduling decisions.
To address this problem, and to assist me in gauging
the effects of
various efficiency-related system policies and tools,
I have developed
the set of shell scripts described in this article.
The first script,
load, provides a single number and English-language
analysis
of the current system load for nontechnical users. The
a script
generates some useful instantaneous statistics for the
system administrator's
perusal, including the system load, the total number
of system jobs,
and the average number of jobs per user. The final script,
sysload.sh,
is a long-term system load tracking facility with automatic
periodic
averaging. All information processed by these scripts
is internally
generated using the standard UNIX utilities ps, who,
and uptime.
load: Characterizing the Current System Load
The system command uptime (actually a link to the w
command, equivalent to w -t) displays a line of system
statistics
containing the elapsed time since system boot-up, the
current number
of users, and the system CPU load (as the number of
jobs in the run
queue) averaged out over the last 1, 5, and 15 minutes.
The load
script (Listing 1) runs uptime and pipes the output
into an
awk script to extract the first of the three average
load values
and display a status report based on that value.
Line 11 extracts the load value based on the number
of tokens detected
in the uptime output text. The precise format of the
line produced
by uptime actually varies with the length of time the
system
has been up. Therefore, the awk script sets the val
variable to the value of the third-to-last token. Then,
lines 13-14
strip the trailing comma.
The rest of the script simply displays some text based
on the value
of val. The text tells a user what impact a CPU-intensive
background
job is likely to have on system performance at the current
load level.
The user is then in a better position to weigh the potential
performance
impact of his/her job against the criticality of that
job, and decide
whether or not to run the job in the background.
A sample output of the load script is shown in Figure
1. If your computer
system's horsepower differs significantly from ours
(a 486-33 ISA
machine), then you may want to alter the load values
hard-coded into
the script's comparison lines to better reflect the
load characteristics
of your particular machine.
a: Displaying System and User Processes Statistics
One very powerful window into the system process table
is the ps
command. I wrote the a shell script (Listing 2) to analyze
data provided by ps and display a summary containing
some basic
statistics otherwise difficult to glean from the raw
ps output.
When extracting data about user patterns and trends
from the system
process table, it is useful to first separate the "signal"
from the "noise." Therefore, a breaks the
list of all
system processes down into three categories: root processes,
printer
processes, and user processes. Root processes (getty,
cron,
other demons, etc.) and printer processes (the master
scheduler and
intermittent printer request handlers) are not large
contributors
to the system load, and are therefore segregated from
explicit user
applications when collecting user process data.
The a script recognizes one further dichotomy: shell
interpreters
are distinguished from other kinds of user processes.
Generally, shell
processes tend to be dormant while their subprocesses
are executing.
This is certainly not always the case, so I've included
a feature
to summarize the user process statistics both with and
without shell
interpreter instances taken into consideration.
The output from a sample a run is shown in Figure 2.
All analysis
is performed in lines 18-34. There is some tricky coding
involved,
so I'll annotate what I've done.
In line 18, the innermost in-line statement
ps -u root
generates a list of all processes owned by root.
This list is piped to
wc -l
to produce a single number representing a count of the
number of lines in the ps output. Finally, this number
is reduced
by 1 (using the expr command) to compensate for the
header
line produced by ps, and the result is assigned to the
rootpros
environment variable. The next line repeats the same
procedure to
count lp processes, and then the sum of the root and
lp process counts is assigned to the otherpros variable.
In line 22, a total system process count is computed
by running ps
-e, counting the output lines, and subtracting 3 (one
for the header
line, and two for the processes spawned by invocation
of the a
command itself). To get the number of user processes,
I subtract the
value of otherpros from totpros. The result is assigned
to userpros.
Lines 25-28 count up the number of user shell interpreters
currently
active, and assign that value to shpros. Since root
processes have already been counted up in a class of
their own, any
shell interpreters owned by root are excluded from the
shpros
count.
To calculate the total number of non-shell user processes,
the value
of shpros is subtracted from userpros and the result
is assigned to nonshpos (line 29).
To calculate the processes-per-user averages, it is
first necessary
to find out how many "distinct" users are
currently logged
in to the system, since a single user may be logged
in on multiple
terminals or have several multiscreen sessions active
on a single
terminal. Line 30 calculates the number of distinct
users by listing
the user ID of all processes, sorting by the ID, eliminating
duplicates,
and counting the number of lines in the output. The
resulting value
is assigned to nusers.
The final calculations in lines 31-34 produce the averages
to two
decimal places, applying a standard multiplication and
modulus kludge
useful with integer-only math. The integer and fractional
portions
of the average values are calculated separately.
sysload.sh: Recording a Periodic System Load History
The two scripts described above provide instantaneous
process information,
but contain no provisions for maintaining a history.
The last script
for this month is a facility for recording long-term
process load
history information into a set of log files. These files
may be inspected
periodically in order to seek out cyclical trends or
patterns of light
and heavy system usage.
sysload.sh (Listing 3) writes to three log files, given
the
symbolic names DAYLOG, LOADLOG, and AVGLOG. You
fill in the actual pathnames for these files in lines
26-28, and the
pathnames for the debugging versions in lines 30-32.
The DAYLOG file is used when the call to sysload.sh
has the form:
sysload.sh daily
You decide how often to sample the system load, and
create
a cron table entry that schedules the above command
accordingly.
For example, on our system the script runs every fifteen
minutes between
8 A.M. and 5:45 P.M. Monday through Friday. The cron
table entry appears as follows:
0,15,30,45 8-17 * * 1-5
/usr/local/sysload.sh daily
where /usr/local is where the sysload.sh
script resides. Figure 3 shows the entire contents of
our system's
DAILY log file as I write this. Each one-line entry
contains
the date, the time, and the system load. In Listing
3, these daily
runs are processed in lines 38-50.
After all sampling for the day is complete, sysload.sh
must
be run one more time with the argument final instead
of daily.
Several things happen at that point:
1. The entire contents of DAYLOG are appended onto LOADLOG.
LOADLOG thus contains a cumulative record of all daily
load
samples ever taken.
2. The average load for the day (as per all entries
in DAYLOG)
is computed, and a line containing this information
is appended onto
LOADLOG. The same line is also appended onto AVGLOG.
3. On Friday of each week, the five most recent daily
averages from
AVGLOG are themselves averaged, and a line containing
this
weekly average is appended onto AVGLOG.
4. The DAYLOG file is deleted, and the next weekday's
daily
averages are thus written to a new DAYLOG file.
Our cron table entry for the end-of-day sysload.sh invocation
is:
0 18 * * 1-5
/usr/local/sysload.sh final
The last daily run happens at 5:45, so the final run
is scheduled
for 6:00 P.M. Figure 4 shows the tail portion of the
contents
of a representative AVGLOG file.
Conclusion
These utilities have provided several benefits to me
as a system administrator.
With the help of the load program, nontechnical users
are now
confident enough to diagnose aberrant system slowdowns,
and often
bring such events to my attention before I'm even aware
of them.
The a program, in conjunction with SCO's vmstat utility,
gives me a fairly good, quick map of system utilization
at any one
given moment, and sysload.sh allows me to report long-term
system load statistics to management in order to help
evaluate hardware
and software requirements for the company. I hope the
tools prove
useful to you in your administration duties, as well.
Errata
I recently discovered a bug in one of the Onite system
scripts published
in the Sys Admin Premiere issue. In isonite.sh (Listing
7, page 24), the script that tells whether a particular
job name exists
in the overnight queue, the line printed as:
[ -r $SPOOLDIR/$1 ] && exit 0
is bogus. The line should be corrected to read:
[ -f $SPOOL
DIR/P$priority/$1 ] && exit 0
About the Author
Leor Zolman wrote BDS C, the first C compiler targeted
exclusively
for personal computers. He is currently a system administrator
and
software developer for R&D Publications, Inc., and
columnist for both
The C Users Journal and Windows/DOS Developer's Journal.
Leor's first book, Illustrated C, has just been published
by
R&D. He may be reached in care of R&D Publications,
Inc., or via net
E-mail as leor@rdpub.com ("...!uunet!rdpub!leor").
|