A System Load Monitoring Trilogy

Leor Zolman

If you've been following my articles in the past two issues of Sys Admin, you've probably noticed that one of my big concerns as system administrator here at R&D Publications has been to seek out new and useful ways to smooth out the CPU load on our single-CPU Xenix installation.

The overnight and background job spooling utilities described previously give our users a great degree of direct control over their use of system resources. From time to time, users must decide, for example, whether to launch a long series of reports in the background or to run them overnight instead. Most of our users, however, are not technical enough to comfortably use the standard UNIX/Xenix diagnostic utilities to get a handle on the system load. Without a tool to translate the load figures spewed by programs such as uptime into plain English, those users would lack the information needed to make job scheduling decisions.

To address this problem, and to assist me in gauging the effects of various efficiency-related system policies and tools, I have developed the set of shell scripts described in this article. The first script, load, provides a single number and English-language analysis of the current system load for nontechnical users. The a script generates some useful instantaneous statistics for the system administrator's perusal, including the system load, the total number of system jobs, and the average number of jobs per user. The final script, sysload.sh, is a long-term system load tracking facility with automatic periodic averaging. All information processed by these scripts is internally generated using the standard UNIX utilities ps, who, and uptime.

load: Characterizing the Current System Load

The system command uptime (actually a link to the w command, equivalent to w -t) displays a line of system statistics containing the elapsed time since system boot-up, the current number of users, and the system CPU load (as the number of jobs in the run queue) averaged out over the last 1, 5, and 15 minutes. The load script (Listing 1) runs uptime and pipes the output into an awk script to extract the first of the three average load values and display a status report based on that value.
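On our system, the output of uptime looks something like this (the exact spacing and wording vary among implementations and with how long the system has been up):

2:15pm  up 21 days, 4:09,  6 users,  load average: 0.52, 0.41, 0.30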

The precise format of the line produced by uptime varies with the length of time the system has been up, so the load value can't be located by counting fields from the front of the line. Line 11 therefore extracts the load value by position relative to the end of the line: the awk script sets the val variable to the value of the third-to-last token. Lines 13-14 then strip the trailing comma.
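Since the listing is printed separately, here is a minimal sketch of the extraction step. The variable name val comes from the listing; the surrounding awk text is only an approximation of it:

uptime | awk '
{
    # The three load averages are always the last three
    # fields, so take the third-from-last field no matter
    # how the front of the line is formatted:
    val = $(NF - 2)

    # Strip the trailing comma from uptime's "0.52," style:
    if (val ~ /,$/)
        val = substr(val, 1, length(val) - 1)

    print val
}'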

The rest of the script simply displays some text based on the value of val. The text tells a user what impact a CPU-intensive background job is likely to have on system performance at the current load level. The user is then in a better position to weigh the potential performance impact of his/her job against the criticality of that job, and decide whether or not to run the job in the background.

A sample output of the load script is shown in Figure 1. If your computer system's horsepower differs significantly from ours (a 486-33 ISA machine), then you may want to alter the load values hard-coded into the script's comparison lines to better reflect the load characteristics of your particular machine.
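Putting it all together, the whole of load reduces to roughly the following; the thresholds and wording here are placeholders, not the values tuned to our machine in the listing:

uptime | awk '
{
    val = $(NF - 2)
    if (val ~ /,$/)
        val = substr(val, 1, length(val) - 1)
    val = val + 0    # force a numeric comparison

    # Placeholder comparison lines; adjust for your hardware:
    if (val < 1.0)
        print "Load is light; a background job should have little impact."
    else if (val < 2.5)
        print "Load is moderate; a background job will slow things noticeably."
    else
        print "Load is heavy; consider running the job overnight instead."
}'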

a: Displaying System and User Processes Statistics

One very powerful window into the system process table is the ps command. I wrote the a shell script (Listing 2) to analyze data provided by ps and display a summary containing some basic statistics otherwise difficult to glean from the raw ps output.

When extracting data about user patterns and trends from the system process table, it is useful to first separate the "signal" from the "noise." Therefore, a breaks the list of all system processes down into three categories: root processes, printer processes, and user processes. Root processes (getty, cron, other daemons, etc.) and printer processes (the master scheduler and intermittent printer request handlers) are not large contributors to the system load, and are therefore segregated from explicit user applications when collecting user process data.

The a script recognizes one further dichotomy: shell interpreters are distinguished from other kinds of user processes. Generally, shell processes tend to be dormant while their subprocesses are executing. This is certainly not always the case, so I've included a feature to summarize the user process statistics both with and without shell interpreter instances taken into consideration.

The output from a sample a run is shown in Figure 2. All analysis is performed in lines 18-34. There is some tricky coding involved, so I'll annotate what I've done.

In line 18, the innermost backquoted command

ps -u root

generates a list of all processes owned by root. This list is piped to

wc -l

to produce a single number representing a count of the lines in the ps output. Finally, this number is reduced by 1 (using the expr command) to compensate for the header line produced by ps, and the result is assigned to the rootpros shell variable. The next line repeats the same procedure to count lp processes, and then the sum of the root and lp process counts is assigned to the otherpros variable.

In line 22, a total system process count is computed by running ps -e, counting the output lines, and subtracting 3 (one for the header line, and two for the processes spawned by invocation of the a command itself). To get the number of user processes, I subtract the value of otherpros from totpros. The result is assigned to userpros.
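Assembled, the counting steps of lines 18-24 come down to something like the following sketch (lppros is my name here for the printer-process count; the listing's may differ):

# Count root-owned and printer (lp) processes, subtracting 1
# from each count for the header line ps prints:
rootpros=`expr \`ps -u root | wc -l\` - 1`
lppros=`expr \`ps -u lp | wc -l\` - 1`
otherpros=`expr $rootpros + $lppros`

# Count every process on the system, subtracting 3: one for
# the ps header and two for the processes spawned by the
# invocation of a itself:
totpros=`expr \`ps -e | wc -l\` - 3`

# Whatever remains is user processes:
userpros=`expr $totpros - $otherpros`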

Lines 25-28 count up the number of user shell interpreters currently active, and assign that value to shpros. Since root processes have already been counted up in a class of their own, any shell interpreters owned by root are excluded from the shpros count.

To calculate the total number of non-shell user processes, the value of shpros is subtracted from userpros and the result is assigned to nonshpros (line 29).
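A rough cut at the shell-counting step: the pattern the listing uses to recognize shell interpreters is surely more careful than the one below, which merely matches command names ending in sh and skips root-owned entries:

# Count user-owned shell interpreters; in the ps -ef output,
# $1 is the owner and $NF is (roughly) the command name:
shpros=`ps -ef | awk '$1 != "root" && $NF ~ /sh$/ { n++ }
                      END { print n + 0 }'`

# Non-shell user processes are whatever is left over:
nonshpros=`expr $userpros - $shpros`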

To calculate the processes-per-user averages, it is first necessary to find out how many "distinct" users are currently logged in to the system, since a single user may be logged in on multiple terminals or have several multiscreen sessions active on a single terminal. Line 30 calculates the number of distinct users by listing the user ID of all processes, sorting by the ID, eliminating duplicates, and counting the number of lines in the output. The resulting value is assigned to nusers.
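That step boils down to a classic sort | uniq pipeline; a sketch:

# List the owner of every process (skipping the ps header
# line), fold duplicate names, and count what is left:
nusers=`ps -ef | awk 'NR > 1 { print $1 }' | sort | uniq | wc -l`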

The final calculations in lines 31-34 produce the averages to two decimal places, applying a standard multiplication and modulus kludge useful with integer-only math. The integer and fractional portions of the average values are calculated separately.
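For readers who haven't run into the trick, here is a sketch with hypothetical names whole and frac for the two pieces of the result:

# Scale by 100 so that expr's integer-only arithmetic
# carries two decimal places' worth of precision:
scaled=`expr $userpros \* 100 / $nusers`
whole=`expr $scaled / 100`
frac=`expr $scaled % 100`
# (A finished version must zero-pad $frac so that, for
# example, 3.05 doesn't print as "3.5".)
echo "$whole.$frac processes per user"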

sysload.sh: Recording a Periodic System Load History

The two scripts described above provide instantaneous process information, but contain no provisions for maintaining a history. The last script for this month is a facility for recording long-term process load history information into a set of log files. These files may be inspected periodically in order to seek out cyclical trends or patterns of light and heavy system usage.

sysload.sh (Listing 3) writes to three log files, given the symbolic names DAYLOG, LOADLOG, and AVGLOG. You fill in the actual pathnames for these files in lines 26-28, and the pathnames for the debugging versions in lines 30-32.

The DAYLOG file is used when the call to sysload.sh has the form:

sysload.sh daily

You decide how often to sample the system load, and create a cron table entry that schedules the above command accordingly. For example, on our system the script runs every fifteen minutes between 8 A.M. and 5:45 P.M. Monday through Friday. The cron table entry appears as follows:

0,15,30,45 8-17 * * 1-5 /usr/local/sysload.sh daily

where /usr/local is where the sysload.sh script resides. Figure 3 shows the entire contents of our system's DAYLOG file as I write this. Each one-line entry contains the date, the time, and the system load. In Listing 3, these daily runs are processed in lines 38-50.
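In outline, a daily run just appends one sample line to DAYLOG, reusing the same uptime parsing shown earlier for the load script. The date format below is a guess, not the one from the listing:

# Sample the current 1-minute load average:
load=`uptime | awk '{ v = $(NF - 2)
                      if (v ~ /,$/) v = substr(v, 1, length(v) - 1)
                      print v }'`

# Append one "date time load" entry to the day's log:
echo "`date +%m/%d/%y` `date +%H:%M` $load" >> $DAYLOG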

After all sampling for the day is complete, sysload.sh must be run one more time with the argument final instead of daily. Several things happen at that point (a rough sketch of the processing follows the list):

1. The entire contents of DAYLOG are appended onto LOADLOG. LOADLOG thus contains a cumulative record of all daily load samples ever taken.

2. The average load for the day (as per all entries in DAYLOG) is computed, and a line containing this information is appended onto LOADLOG. The same line is also appended onto AVGLOG.

3. On Friday of each week, the five most recent daily averages from AVGLOG are themselves averaged, and a line containing this weekly average is appended onto AVGLOG.

4. The DAYLOG file is deleted; the next weekday's samples are then written to a fresh DAYLOG file.
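Condensed into a sketch, with the log-entry field positions assumed rather than taken from the listing:

# 1. Fold the day's samples into the cumulative log:
cat $DAYLOG >> $LOADLOG

# 2. Average the load column (assumed here to be field 3 of
#    each "date time load" entry) and append the result to
#    both LOADLOG and AVGLOG:
avg=`awk '{ sum += $3; n++ } END { printf "%.2f", sum / n }' $DAYLOG`
echo "`date +%m/%d/%y` daily average: $avg" >> $LOADLOG
echo "`date +%m/%d/%y` daily average: $avg" >> $AVGLOG

# 3. On Fridays, average the five most recent daily averages.
#    (The real script must skip the weekly lines it has itself
#    written into AVGLOG; this cut ignores that wrinkle.)
dow=`date | awk '{ print $1 }'`    # weekday name, e.g. "Fri"
if [ "$dow" = "Fri" ]; then
    tail -5 $AVGLOG | awk '{ sum += $NF; n++ }
        END { printf "weekly average: %.2f\n", sum / n }' >> $AVGLOG
fi

# 4. Remove DAYLOG so the next weekday starts a fresh file:
rm $DAYLOG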

Our cron table entry for the end-of-day sysload.sh invocation is:

0 18 * * 1-5 /usr/local/sysload.sh final

The last daily run happens at 5:45, so the final run is scheduled for 6:00 P.M. Figure 4 shows the tail portion of the contents of a representative AVGLOG file.

Conclusion

These utilities have provided several benefits to me as a system administrator. With the help of the load program, nontechnical users are now confident enough to diagnose aberrant system slowdowns, and often bring such events to my attention before I'm even aware of them.

The a program, in conjunction with SCO's vmstat utility, gives me a fairly good, quick map of system utilization at any one given moment, and sysload.sh allows me to report long-term system load statistics to management in order to help evaluate hardware and software requirements for the company. I hope the tools prove useful to you in your administration duties, as well.

Errata

I recently discovered a bug in one of the Onite system scripts published in the Sys Admin Premiere issue. In isonite.sh (Listing 7, page 24), the script that tells whether a particular job name exists in the overnight queue, the line printed as:

[ -r $SPOOLDIR/$1 ] && exit 0

is bogus. The line should be corrected to read:

[ -f $SPOOLDIR/P$priority/$1 ] && exit 0

About the Author

Leor Zolman wrote BDS C, the first C compiler targeted exclusively for personal computers. He is currently a system administrator and software developer for R&D Publications, Inc., and columnist for both The C Users Journal and Windows/DOS Developer's Journal. Leor's first book, Illustrated C, has just been published by R&D. He may be reached in care of R&D Publications, Inc., or via net E-mail as leor@rdpub.com ("...!uunet!rdpub!leor").