Managing Performance on SCO OpenServer
Dan Reis and Bill Welch
As more sophisticated applications place greater loads
on our systems,
greater emphasis must be placed on monitoring and managing
the
performance of those systems. Although the methods of
achieving that
goal will vary between organizations, basic performance
management
concepts are applicable in a general way. In this article, we focus on performance management in an SCO OpenServer environment. Although we do not delve into the details of tuning specific kernel parameters (see the resources at the end of this article for pointers), the techniques presented here can be adapted to other environments, and we present various criteria to look for in performance management tools.
The Origins of Systems Management
Many different performance metrics can be used to track
and measure
system performance, most of which primarily consider
the ability to run
specific industry-accepted benchmarks as a measure of
discrete system
components. Consequently, the "systems" measured
are really only
individual platforms or applications, hard drives, I/O
cards, or other
separate components.
For performance metrics to have value in improving system
performance,
they must take into account the inter-relationship of
all the software
and hardware components that make up a computing environment.
When
performance tracking does this job, it becomes systems
management.
Systems management addresses the impact of these components, as a group, on overall performance as seen by administrators and users.
The ultimate goal of systems management is to deliver
optimum
performance by managing speed, along with reliability
and availability.
More precisely, systems management is the management
and control of
systems including CPUs, memory and operating systems,
as well as network
and peripheral devices, applications, and data running
together within a
computing environment, locally or remotely. Because
no software product
right out of the box can begin to address all companies'
computing
environments, or the way administrators manage those
environments, it is
important to select a flexible software tool that can
be modified to
solve current and future systems management issues.
Tool Selection
To get the necessary information to set up a system
(e.g., a Web server)
for optimum performance, you need some type of systems
management tool.
The tool choices in this area fall into three basic types: custom, in-house tools that use internal
resources to
build a product utilizing some type of shell scripting,
database, or
other language; commercial products that meet general
requirements; and
commercial products that meet the general requirements
and can be
customized to address specific site needs. A number
of tools on the
market today fulfill various parts of UNIX systems management
and may
meet specific needs. For SCO OpenServer environments,
one choice is SCO
Doctor as the base systems management tool along with
its Tcl scripting
language for customization.
The Tcl scripting language can be used to write custom systems management alert and action programs for our Web server example, while SCO Doctor supplies the performance information needed for tuning and customization. SCO Doctor provides a baseline level of performance information that enables administrators to check a system's health against initial parameters, and it continues to monitor the system using additional performance information gathered and logged over time.
Tuning from a Baseline
The first step in our analysis is establishing a performance
baseline
for the Web server, so we can later ensure that the
server has the
resources it needs. To do that, we examine the use of
CPU, memory, disk,
I/O, network, and all other related aspects to determine
whether any
bottlenecks exist in these areas, and if so, what they
are. With this
information, the step of tuning the kernel for maximum
speed can be
completed. The tuning involves setting parameters to
allocate the right
amount of buffers, swap space, processes, etc. This
accomplishes the
speed component of systems management by allowing the
operating system
to run as fast and efficiently as possible for the Web
server. The next
goal is to maximize reliability and availability to
users. To ensure
this, the system speed may actually be tuned down to arrive at an acceptable trade-off among the speed, reliability, and availability performance goals.
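As a rough illustration of capturing such a baseline, the following Tcl sketch records CPU activity with sar(ADM) during a known-quiet period and saves it for later comparison; the sampling interval, count, and output file are arbitrary assumptions rather than recommended values.

# Sketch: capture a CPU-utilization baseline with sar(ADM) for later
# comparison. The interval, count, and output file are arbitrary examples.
set interval 60      ;# seconds between samples
set count    10      ;# number of samples to take

# sar -u reports CPU usage (%usr, %sys, %wio, %idle) per interval.
if {![catch {exec sar -u $interval $count} samples]} {
    set out [open /usr/adm/cpu.baseline w]
    puts $out $samples
    close $out
}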
What Needs to be Managed
For a Web server, the areas requiring management include:
general server
health (swapping, processes, disk space, I/O, network);
Internet security
(is someone hacking, am I secure); and Internet load
checking (number of
http hits, PPP sessions, backlogging of requests, decrease
in access
time, slower Web access, network down, excessive user
access, user
accounting, concurrent POP usage, broken links, and
broken images).
Being able to effectively monitor and take action on
these performance
categories is a significant step in maintaining a system's
optimum
performance. You need to define and set tunable parameters,
gather
historical data, test the state of the system and data,
record and
correct any anomalies that you detect, report what has
been done or what
needs to be done, and be able to work with other systems
management
tools.
Be Proactive
In tracking the above parameters for systems management
decision making,
knowing what can affect the performance and reliability
of the Web
server is paramount. This includes: running out of disk
space, running
out of a kernel resource, not getting enough memory
or CPU time, and
failure of supporting services (DNS, other systems,
etc.). Also, the
ability to maintain a historical database to review
any and all problems
that have occurred leads to better insight into a system's
health.
Another benefit for both system administrators and users
is early
automatic detection and correction of any potential
or existing
problems. With proactive detection, the system will
start to look after
itself instead of forcing system administrators to react
when users or
customers experience degraded or lost service.
To implement a proactive solution, a few key components
must be put into
place. Proactive management requires some type of scheduling
facility
that can run system checks on a periodic basis. Ideally,
this mechanism
will allow scheduling for each check or test to be configured
individually. You also need some method, or collection
of methods, such
as a historical database, for gathering information
from the system.
There also must be a way to communicate with system
administrators about
the data collected and the results of system checks.
This could be via a log mechanism, email, pagers, SNMP, etc. In a more involved
scenario,
the communication could be between systems needing to
contact each other
before an administrator becomes involved. All these
services can be more
effective and meaningful if they are recorded for future
reference,
either by administrators, analysis programs, or even
the data collection
and system check programs themselves.
Automation Is Key
Computer systems can be set up to handle numerous problems
automatically
if given appropriate means of determining what constitutes a problem and how it can be solved. For example, the management
software can
be set up to tell whether the http server process has
died unexpectedly
by checking for it in the process table on a periodic
basis (e.g., once
a minute). The management software can instruct a Tcl
program to restart
the server process in this event, executing the same
command an
administrator would enter manually. The system can then
notify the
administrator (within prespecified parameters) about the failure and automatic correction, as well as record the failure for later
reference. In systems management terminology, this process
is often
referred to as event management. As an example, Listing
1 contains Tcl
code written for SCO Doctor to automatically restart
the Netscape http
server daemon if the system finds it is not running.
Two extensions to Tcl are used in Listing 1: dget and
forrow. These
commands provide access to SCO Doctor's database, which
tracks processes
running on the system, among other things. The dget
command opens the
database and also retrieves values from a database table
for a given row
and column. The forrow command iterates over all rows
in the database
table. It is akin to calling ps(C) and parsing each
row in the output,
but it is much faster and more convenient.
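As a rough approximation of the check Listing 1 performs, the following sketch uses standard Tcl and ps(C) instead of the dget and forrow database extensions; the daemon name, start script path, and log file are assumptions to adjust for the local installation.

# Sketch: restart the http daemon if it has disappeared from the
# process table. The daemon name and start script are assumptions.
proc check_httpd {} {
    set daemon   "ns-httpd"
    set startcmd "/usr/internet/ns_httpd/httpd-80/start"

    # Read the process table with ps(C); skip this pass if ps fails.
    if {[catch {exec ps -ef} pstable]} {
        return
    }

    # Restart the daemon only when it is absent from the process table.
    if {[string first $daemon $pstable] < 0} {
        catch {exec /bin/sh $startcmd}

        # Record the restart for later reference.
        set log [open /tmp/httpd_restarts.log a]
        puts $log "restarted $daemon at [clock format [clock seconds]]"
        close $log
    }
}

check_httpd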
A series of failures may be indicative of a more serious
problem, which
can be gleaned by examining the system's failure history
in conjunction
with the system's overall performance history. By adding
a few more
lines of code to the example in Listing 1, an administrator can be notified of the problem and the incident recorded for later examination.
(This illustrates one of the important reasons for having
a historical
database that works in conjunction with alert mechanisms.)
Listing 2 is
another sample of Tcl code used by SCO Doctor for this
system
administrator notification purpose.
Listing 1 and Listing 2 are written to a file, internet.tcl, in the /usr/lib/doctor/alert/alert.cf directory. The alert
specification is a
paragraph made up of attribute value pairs. The name
of the paragraph is
the name of the alert, and the attribute value pairs
specify the
conditions and behavior for the alert. Included are a description of the alert, instructions for when the alert should run, whom to notify if the alert condition is true, and how to notify them.
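The fragment below shows only the general shape of such a paragraph; the attribute names are hypothetical placeholders, not the actual keywords recognized in alert.cf.

restart_httpd_alert {
    # All attribute names here are hypothetical, shown only to
    # illustrate the paragraph structure described above.
    description = "Restart the http daemon if it has died";
    schedule    = "every 1 minute";
    notify      = "root";
    method      = "email";
}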
Historical data are also useful for analyzing usage
trends to plan for
changes in service as well as upgrades to a system.
Performance
management tools should provide hard data to justify
new systems or
upgrades to current hardware and software. Historical
data can also indicate the need to redistribute or reschedule activities, such as splitting ftp and http accesses across different servers, or deferring backups and other system-intensive activities away from times of peak usage.
Simplifying System Information
An example of something needing simplification is the
information from
an error log file. The Netscape server keeps an error
log in
/usr/internet/ns_httpd/httpd-80/logs/errors on an SCO
system. Many logs,
including this one, have a well-defined format for the
messages that are
written to the file, as in the example message below:
[06/Sep/1996:09:32:25] warning: for host tiger.shark.COM trying to GET /notAFile, send-file reports: can't find /u/sysmgmt_Web/notAFile (No such file or directory)
This message consists of the following five basic parts: the date, the category, the host, the operation, and the error message.
Date: [06/Sep/1996:09:32:25]
Category: warning:
Host: for host tiger.shark.COM
Operation: trying to GET /notAFile,
Error: send-file reports: can't find /u/sysmgmt_Web/notAFile (No such file or directory)
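As a rough illustration of how such a line can be split into those five parts, the following sketch uses a standard Tcl regular expression; it is not the code from Listing 3, and the pattern assumes the fixed layout shown above.

# Sketch: split one Netscape error-log line into its five basic parts.
proc parse_error_line {line} {
    # [date] category: for host HOST OPERATION, ERROR
    set pattern {^\[([^\]]+)\] ([a-z]+): for host ([^ ]+) ([^,]*), (.*)$}
    if {[regexp $pattern $line all date category host operation error]} {
        return [list $date $category $host $operation $error]
    }
    return {}
}

set sample {[06/Sep/1996:09:32:25] warning: for host tiger.shark.COM trying to GET /notAFile, send-file reports: can't find /u/sysmgmt_Web/notAFile (No such file or directory)}
puts [parse_error_line $sample]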
The problem at most sites is that there are too many
log files on too
many machines for administrators to read through. Also,
the format of
most log files does not lend itself to analysis of the
Web server's
health. For instance, because recurring problems do
not present
themselves clearly, it's hard to get a picture of the
regularity of a
problem or the total number of times it occurs. The
Netscape error log
has several categories of errors of less concern and
others that should
trigger alarms. With so many items to look for, system
administrators
are much more efficient if the relevant information is readily apparent.
SCO Doctor can record this kind of information on a regular basis into
a database that can then be viewed from the management
console, browsed
by an SNMP management station, or used as the foundation
for alerts and
actions.
The Netscape error log records messages in six different categories:
catastrophe, failure, inform, misconfig, warning, and
security. Inform
messages may be ignored, warnings and misconfig messages
may be looked
at occasionally, while failures are emailed to the administrator,
and
the administrator is paged in the event of a catastrophe.
The first step is to create a database table for the
components of the
error log messages. Each row in the table is a message
from the error
log file, and each column in the table is one of the
five basic
components of the error message (date, category, host,
operation, and
error message). You can create a table in SCO Doctor's
database by
adding a stanza type paragraph that defines each of
the columns included
in the table in the /usr/lib/doctor/db/smd_m.cf file.
This paragraph
contains the essential information about how to treat
the data kept in
the table and how the data can be acquired. This example
is set up
through a small bit of Tcl code included in the build
statement.
internet_errors_date {
    table = "internet_errors";
    column = "date";
    description = "Date of entry";
    product = "internet";
    type = string;
    dynamic = true;
    min_period = 30;
    build = '
        if { ! [info exists internet_procs_loaded] } {
            global D_PROD
            source $D_PROD/db/internet.tcl
        }
        return [internet_parseErrorLog]
    ';
}
The other statements in the paragraph define the data type (string, integer, shortint, byte, big, literal) and the lifetime of the data.
internet_errors_category {
    table = "internet_errors";
    column = "category";
    description = "Category of error";
    product = "internet";
    type = string;
    dynamic = true;
    min_period = 30;
    build = '
        dget internet_errors date
    ';
}
The Tcl code that parses the error log file is contained
in a separate
file called /usr/lib/doctor/db/internet.tcl, which is
sourced in when
the table needs to be built. The internet_parseErrorLog
procedure
populates the entire table when called, not just the
date column. This
is a useful trick, because the build statements for the other columns can simply perform a get on the date column.
This has the
advantage of keeping Tcl code in one place, reducing
the maintenance
required, and limiting the opportunities for introducing
errors. The
remaining columns (host, operation, and error message)
are similar to
the category definition, with the names and data types
being changed as
appropriate.
The next step is to create the Tcl script that contains
the procedures
for parsing the log file (Listing 3). The Tcl alert
code shown in
Listing 4 can then utilize the information collected
from the error log.
It will notify the administrator via a pager and email
the account "help
desk." If the problem is only a security or failure
alert, the program
will only send email.
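As a sketch of the kind of dispatch logic Listing 4 implements, the fragment below maps a message category to the notification behavior described above; the mail recipient is an example, and send_page is a purely hypothetical hook for the site's paging command.

# Sketch: route an error-log entry to the notification channel
# described above. send_page is a hypothetical site-specific hook.
proc send_page {who message} {
    # Replace with the local paging command for this site.
}

proc notify_admin {category message} {
    switch -- $category {
        catastrophe {
            # A catastrophe pages the administrator and emails the help desk.
            send_page root $message
            exec echo $message | mail helpdesk
        }
        failure -
        security {
            # Failures and security alerts are emailed only.
            exec echo $message | mail helpdesk
        }
        default {
            # inform, warning, and misconfig entries need no immediate action.
        }
    }
}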
File Space Management
There is significant value in tracking disk usage on the system for various groups or users. A Tcl script can be run on
a regular basis
recording this information into a database and producing
reports that
give an overview or specific snapshot of use by groups
or users. This
can give administrators and users better information
by employing a more
complete disk accounting technique. With a script that
is run on an ad
hoc basis, users might simply be notified that they
are over their
limit. If, however, the data was collected and organized
in a regular
fashion, improved messaging could be offered that not only notifies a user that he or she is over the limit, but also by how much and for how long.
Additionally, if appropriate, disk usage could be broken
up into the
categories of mail folders, Webspace, ftp, and personal
files to
maintain a more thorough understanding and tracking
of disk usage.
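A minimal sketch of such a script is shown below; it assumes home directories live under /usr, uses du(C) to total each user's blocks, and writes a dated report file, all of which are assumptions to adapt to the local layout.

# Sketch: record per-user disk usage so later runs can report trends.
set today  [clock format [clock seconds] -format %Y%m%d]
set report [open /tmp/disk_usage.$today w]

foreach home [glob -nocomplain /usr/*] {
    if {![file isdirectory $home]} continue

    # du -s reports the total blocks used under the directory.
    if {[catch {exec du -s $home} result]} continue
    set blocks [lindex $result 0]

    puts $report "[file tail $home] $blocks"
}
close $report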
Another important area is making sure that mail is flowing
correctly.
SCO Doctor can be used to check the directory /usr/spool/mqueue
for
excessively large numbers of files or files that have
been around too
long (see Listing 5).
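A minimal sketch of that kind of check appears below; the queue directory comes from the text, while the thresholds and the use of mail(C) for notification are arbitrary examples rather than the values used in Listing 5.

# Sketch: flag a backlogged mail queue. Thresholds are arbitrary examples.
set max_files 200                     ;# alert when more files than this are queued
set max_age   [expr {60 * 60 * 24}]   ;# alert on files older than one day

set queued [glob -nocomplain /usr/spool/mqueue/*]
set now    [clock seconds]
set stale  0

foreach f $queued {
    if {$now - [file mtime $f] > $max_age} { incr stale }
}

if {[llength $queued] > $max_files || $stale > 0} {
    exec echo "mail queue backlog: [llength $queued] files, $stale stale" | mail root
}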
Reporting Conclusions
In the same way that you watch the error log, you can
also incorporate
the activity log for Netscape into SCO Doctor. This
allows you to create
views and reports that graph access to the Web server
globally or on a
system-by-system basis. From these reports, you can
pinpoint critical
time periods in which the Web server is handling the
highest volume of
requests or determine the frequency with which the site
is revisited.
There is a lot of information here that can assist administrators with current systems and with planning for future growth, as long as it is easy to access and administer.
As systems management is incorporated into company methodologies,
the
definition of optimum performance and how to achieve
it will expand.
Performance management could include more obscure areas
such as managing
your network bandwidth to maximize performance for specific
types of
applications. It could mean linking an existing tool,
such as SCO
Doctor, into a hardware management tool to send alerts
to administrators
that specific hardware components are out of service.
Systems management
must allow administrators the flexibility to improve
and fix system
issues effectively. With increasingly complex computing
environments,
the only way system administrators can hope to provide
performance at
levels that are satisfactory to everyone is by effectively
using systems
management to increase their management capabilities.
Resources
Loukides, M. 1990. System Performance Tuning. O'Reilly & Associates, Inc.
SCO OpenServer documentation: SCO OpenServer Performance Guide.
Miscovich, G. 1994. SCO Performance Tuning Handbook. Prentice Hall.
About the authors
Dan Reis is the Product Manager for SCO Doctor. He has worked in the software industry for 15 years in systems-level software.
Bill Welch is the lead engineer for SCO Doctor. He has a degree in Computer Science. He has worked for SCO for 6 years and with SCO Doctor for 3 years.