Cover V05, I12


Managing Performance on SCO OpenServer

Dan Reis and Bill Welch

As more sophisticated applications place greater loads on our systems, greater emphasis must be placed on monitoring and managing the performance of those systems. Although the methods of achieving that goal vary among organizations, basic performance management concepts apply generally. In this article, we focus on performance management in an SCO OpenServer environment. Although we do not delve into the details of tuning specific kernel parameters (see the resources at the end of this article for pointers), the techniques presented here can be adapted to other environments, and the criteria we discuss can guide the evaluation of performance management tools.

The Origins of Systems Management

Many different performance metrics can be used to track and measure system performance, most of which primarily consider the ability to run specific industry-accepted benchmarks as a measure of discrete system components. Consequently, the "systems" measured are really only individual platforms or applications, hard drives, I/O cards, or other separate components.

For performance metrics to have value in improving system performance, they must take into account the inter-relationship of all the software and hardware components that make up a computing environment. When performance tracking does this job, it becomes systems management. Systems management manages the performance impact of these components, as a group, on overall performance as seen by administrators and users. The ultimate goal of systems management is to deliver optimum performance by managing speed, along with reliability and availability.

More precisely, systems management is the management and control of systems including CPUs, memory and operating systems, as well as network and peripheral devices, applications, and data running together within a computing environment, locally or remotely. Because no software product right out of the box can begin to address all companies' computing environments, or the way administrators manage those environments, it is important to select a flexible software tool that can be modified to solve current and future systems management issues.

Tool Selection

To get the information necessary to set up a system (e.g., a Web server) for optimum performance, you need some type of systems management tool. Tools in this area fall into three categories: custom, in-house tools that use internal resources to build a product utilizing some type of shell scripting, database, or other language; commercial products that meet general requirements; and commercial products that meet the general requirements and can be customized to address specific site needs. A number of tools on the market today fulfill various parts of UNIX systems management and may meet specific needs. For SCO OpenServer environments, one choice is SCO Doctor as the base systems management tool, along with its Tcl scripting language for customization.

The Tcl scripting language can be used to write custom systems management alert and action programs for our Web server example, while SCO Doctor supplies the performance information needed for tuning and customization. SCO Doctor provides a baseline level of performance information that enables administrators to check a system's health against initial parameters, and it continues to monitor the system using additional performance information that is gathered and logged over time.

Tuning from a Baseline

The first step in our analysis is establishing a performance baseline for the Web server, so we can later ensure that the server has the resources it needs. To do that, we examine the use of CPU, memory, disk, I/O, network, and all other related aspects to determine whether any bottlenecks exist in these areas, and if so, what they are. With this information, the kernel can be tuned for maximum speed. The tuning involves setting parameters to allocate the right amount of buffers, swap space, processes, etc. This accomplishes the speed component of systems management by allowing the operating system to run as fast and efficiently as possible for the Web server. The next goal is to maximize reliability and availability to users. To achieve this, the system speed may actually be tuned down to arrive at an acceptable trade-off between speed, reliability, and availability performance goals.
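As a simple illustration of baseline gathering (a sketch, not SCO Doctor itself), a shell script can append timestamped snapshots to a log for later comparison. The log path is an assumption, and a real baseline would also capture sar or vmstat figures for CPU, memory, and I/O:

```shell
#!/bin/sh
# Minimal baseline snapshot: record the time, the number of processes,
# and filesystem usage so later readings can be compared against it.
# The log location is an assumption; adjust for your site.

BASELINE_LOG=${BASELINE_LOG:-/usr/adm/baseline.log}

snapshot() {
    {
        echo "=== `date` ==="
        echo "processes: `ps -e | wc -l`"
        df                      # disk usage per filesystem
    } >> "$BASELINE_LOG"
}
```

Run from cron, a few weeks of such snapshots give a crude but useful picture of what "normal" load looks like on the server.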

What Needs to be Managed

For a Web server, the areas requiring management include: general server health (swapping, processes, disk space, I/O, network); Internet security (is someone hacking, am I secure); and Internet load checking (number of http hits, PPP sessions, backlogging of requests, decrease in access time, slower Web access, network down, excessive user access, user accounting, concurrent POP usage, broken links, and broken images). Being able to effectively monitor and take action on these performance categories is a significant step in maintaining a system's optimum performance. You need to define and set tunable parameters, gather historical data, test the state of the system and data, record and correct any anomalies that you detect, report what has been done or what needs to be done, and be able to work with other systems management tools.

Be Proactive

In tracking the above parameters for systems management decision making, knowing what can affect the performance and reliability of the Web server is paramount. This includes: running out of disk space, running out of a kernel resource, not getting enough memory or CPU time, and failure of supporting services (DNS, other systems, etc.). Also, the ability to maintain a historical database to review any and all problems that have occurred leads to better insight into a system's health. Another benefit for both system administrators and users is early automatic detection and correction of any potential or existing problems. With proactive detection, the system will start to look after itself instead of forcing system administrators to react when users or customers experience degraded or lost service.

To implement a proactive solution, a few key components must be put into place. Proactive management requires some type of scheduling facility that can run system checks on a periodic basis. Ideally, this mechanism will allow scheduling for each check or test to be configured individually. You also need some method, or collection of methods, such as a historical database, for gathering information from the system. There also must be a way to communicate with system administrators about the data collected and the results of system checks. This could be via a log mechanism, email, pagers, SNMP, etc. In a more involved scenario, the communication could be between systems needing to contact each other before an administrator becomes involved. All these services can be more effective and meaningful if they are recorded for future reference, either by administrators, analysis programs, or even the data collection and system check programs themselves.
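On systems without a dedicated scheduling facility, the standard cron mechanism can provide the periodic trigger. The entries below are illustrative only; the script paths are assumptions:

```
# hypothetical crontab entries: run health checks every ten minutes
# and scan the log files hourly
0,10,20,30,40,50 * * * * /usr/local/sysmgmt/health_check
0 * * * * /usr/local/sysmgmt/scan_logs
```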

Automation Is Key

Computer systems can be set up to handle numerous problems automatically if given appropriate criteria for determining what constitutes a problem and how it can be solved. For example, the management software can be set up to tell whether the http server process has died unexpectedly by checking for it in the process table on a periodic basis (e.g., once a minute). The management software can instruct a Tcl program to restart the server process in this event, executing the same command an administrator would enter manually. The system can then notify the administrator (within prespecified parameters) about the failure and automatic correction, as well as record the failure for later reference. In systems management terminology, this process is often referred to as event management. As an example, Listing 1 contains Tcl code written for SCO Doctor to automatically restart the Netscape http server daemon if the system finds it is not running.
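Outside SCO Doctor, the same event-management check can be sketched as a plain shell routine. The daemon name ns-httpd and the restart command below are assumptions for illustration, not taken from the product:

```shell
#!/bin/sh
# Hypothetical sketch of the check described above: decide from a
# process listing whether the http daemon is running, and restart it
# if not. The daemon name and restart path are assumed.

HTTPD_START="/usr/internet/ns_httpd/httpd-80/start"   # assumed path

httpd_running() {
    # reads a process listing (e.g., from ps -e) on stdin
    grep ns-httpd | grep -v grep > /dev/null
}

check_and_restart() {
    if ps -e | httpd_running; then
        return 0                                 # healthy, nothing to do
    fi
    echo "`date`: ns-httpd down, restarting"     # record the event
    "$HTTPD_START"                               # what an admin would type
}
```

Scheduled once a minute, this gives the minimal "detect and correct" loop; the SCO Doctor version in Listing 1 adds database access and notification on top of the same idea.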

Two extensions to Tcl are used in Listing 1: dget and forrow. These commands provide access to SCO Doctor's database, which tracks processes running on the system, among other things. The dget command opens the database and also retrieves values from a database table for a given row and column. The forrow command iterates over all rows in the database table. It is akin to calling ps(C) and parsing each row in the output, but it is much faster and more convenient.

A series of failures may be indicative of a more serious problem, which can be gleaned by examining the system's failure history in conjunction with the system's overall performance history. By adding a few more lines of code to the example in Listing 1, an administrator can be notified of the problem and record the incident for later examination. (This illustrates one of the important reasons for having a historical database that works in conjunction with alert mechanisms.) Listing 2 is another sample of Tcl code used by SCO Doctor for this system administrator notification purpose.

Listing 1 and Listing 2 are written to a file internet.tcl in the /usr/lib/doctor/alert/alert.cf directory. The alert specification is a paragraph made up of attribute value pairs. The name of the paragraph is the name of the alert, and the attribute value pairs specify the conditions and behavior for the alert. Included are a description of the alert, instruction for when the alert should run, who to notify if the alert condition is true, and how to notify them.

Historical data are also useful for analyzing usage trends to plan for changes in service as well as upgrades to a system. Performance management tools should provide hard data to justify new systems or upgrades to current hardware and software. Historical data can also indicate the need to reschedule some activities, such as splitting ftp and http accesses across different servers, or deferring backups and other system-intensive activities during times of peak usage.

Simplifying System Information

An example of something needing simplification is the information from an error log file. The Netscape server keeps an error log in /usr/internet/ns_httpd/httpd-80/logs/errors on an SCO system. Many logs, including this one, have a well-defined format for the messages that are written to the file, as in the example message below:

[06/Sep/1996:09:32:25] warning: for host tiger.shark.COM
trying to GET /notAFile, send-file reports: can't find
/u/sysmgmt_Web/notAFile (No such file or directory)

This message consists of the following five basic parts: the date, the category, the host, the operation, and the error message.

Date: [06/Sep/1996:09:32:25]
Category: warning:
Host: for host tiger.shark.COM
Operation: trying to GET /notAFile,
Error: send-file reports: can't find /u/sysmgmt_Web/notAFile (No such file or directory)
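As an illustration of how such a message can be decomposed mechanically, a single sed expression can split each line into the five parts. The field layout is assumed from the sample message above, so this is a sketch rather than a complete parser:

```shell
#!/bin/sh
# Extract one field from a Netscape-style error log line on stdin.
# Fields: 1=date 2=category 3=host 4=operation 5=error message.
# The pattern is assumed from the sample message in the text.

parse_field() {
    sed -n 's/^\[\([^]]*\)\] \([a-z]*\): for host \([^ ]*\) \(trying to [^,]*,\) \(.*\)$/\'"$1"'/p'
}

line="[06/Sep/1996:09:32:25] warning: for host tiger.shark.COM trying to GET /notAFile, send-file reports: can't find /u/sysmgmt_Web/notAFile (No such file or directory)"

echo "$line" | parse_field 2    # prints the category: warning
```

The same decomposition is what the Tcl procedure discussed later performs when it populates the error table column by column.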

The problem at most sites is that there are too many log files on too many machines for administrators to read through. Also, the format of most log files does not lend itself to analysis of the Web server's health. For instance, because recurring problems do not present themselves clearly, it's hard to get a picture of the regularity of a problem or the total number of times it occurs. The Netscape error log has several categories of errors of less concern and others that should trigger alarms. With so many items to look for, system administrators are much more efficient if the relevant information is most apparent. SCO Doctor can record this kind of information on a regular basis into a database that can then be viewed from the management console, browsed by an SNMP management station, or used as the foundation for alerts and actions.

The Netscape error log records messages in six different categories: catastrophe, failure, inform, misconfig, warning, and security. Inform messages may be ignored and warnings and misconfig messages may be looked at occasionally, while failures are emailed to the administrator, and the administrator is paged in the event of a catastrophe.
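That policy maps naturally onto a small dispatch routine. The sketch below encodes it in shell; treating security like a failure follows the email-only behavior of the alert code, and a real version would invoke mail and pager commands rather than print an action name:

```shell
#!/bin/sh
# Map a message category to a notification action, following the
# policy described in the text. This is an illustrative sketch,
# not SCO Doctor's own mechanism.

notify_for() {
    # $1 = message category; prints the action to take
    case "$1" in
        catastrophe)        echo "page" ;;
        failure|security)   echo "mail" ;;   # security assumed like failure
        warning|misconfig)  echo "log"  ;;   # reviewed occasionally
        *)                  echo "ignore" ;; # inform and anything unknown
    esac
}
```

Keeping the category-to-action mapping in one routine makes the escalation policy easy to audit and change.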

The first step is to create a database table for the components of the error log messages. Each row in the table is a message from the error log file, and each column in the table is one of the five basic components of the error message (date, category, host, operation, and error message). You can create a table in SCO Doctor's database by adding a stanza type paragraph that defines each of the columns included in the table in the /usr/lib/doctor/db/smd_m.cf file. This paragraph contains the essential information about how to treat the data kept in the table and how the data can be acquired. This example is set up through a small bit of Tcl code included in the build statement.

internet_errors_date {
    table = "internet_errors";
    column = "date";
    description = "Date of entry";
    product = "internet";
    type = string;
    dynamic = true;
    min_period = 30;
    build = '
        if { ! [info exists internet_procs_loaded] } {
            global D_PROD
            source $D_PROD/db/internet.tcl
        }
        return [internet_parseErrorLog]
    ';
}

The other statements in the paragraph define the data type (string, integer, shortint, byte, big, literal) and the lifetime of the data.

internet_errors_category {
    table = "internet_errors";
    column = "category";
    description = "Category of error";
    product = "internet";
    type = string;
    dynamic = true;
    min_period = 30;
    build = '
        dget internet_errors date
    ';
}

The Tcl code that parses the error log file is contained in a separate file called /usr/lib/doctor/db/internet.tcl, which is sourced in when the table needs to be built. The internet_parseErrorLog procedure populates the entire table when called, not just the date column. This is a useful trick, because the build statements for the other columns can be satisfied simply by performing a dget on the date column. This has the advantage of keeping the Tcl code in one place, reducing the maintenance required, and limiting the opportunities for introducing errors. The remaining columns (host, operation, and error message) are similar to the category definition, with the names and data types changed as appropriate.

The next step is to create the Tcl script that contains the procedures for parsing the log file (Listing 3). The Tcl alert code shown in Listing 4 can then utilize the information collected from the error log. It will notify the administrator via a pager and email the account "help desk." If the problem is only a security or failure alert, the program will only send email.

File Space Management

Significant value can be gained by tracking disk usage on the system for various groups or users. A Tcl script can be run on a regular basis, recording this information into a database and producing reports that give an overview or specific snapshot of use by groups or users. This gives administrators and users better information by employing a more complete disk accounting technique. With a script that is run on an ad hoc basis, users might simply be notified that they are over their limit. If, however, the data were collected and organized in a regular fashion, improved messaging could be offered that not only notifies users that they are over the limit, but tells them by how much and for how long. Additionally, if appropriate, disk usage could be broken up into the categories of mail folders, Webspace, ftp, and personal files to maintain a more thorough understanding and tracking of disk usage.
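As an illustration independent of SCO Doctor, per-user usage records can be summarized with awk. The record format ("user kilobytes", as might be produced by du or quota tools) and the limit are assumptions:

```shell
#!/bin/sh
# Accumulate per-user disk usage records read on stdin and report
# users over a limit. Record format and limit are illustrative.

LIMIT_KB=${LIMIT_KB:-100000}    # assumed site quota in KB

report_over_limit() {
    awk -v limit="$LIMIT_KB" '
        { usage[$1] += $2 }                    # accumulate per user
        END {
            for (u in usage)
                if (usage[u] > limit)
                    printf "%s over limit by %d KB\n", u, usage[u] - limit
        }'
}
```

Feeding this from a regularly scheduled collection job, rather than an ad hoc scan, is what makes the "by how much and for how long" style of messaging possible.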

Another important area is making sure that mail is flowing correctly. SCO Doctor can be used to check the directory /usr/spool/mqueue for excessively large numbers of files or files that have been around too long (see Listing 5).
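A plain-shell equivalent of that check might look like the following sketch; the thresholds are illustrative, and the real Listing 5 performs the check through SCO Doctor:

```shell
#!/bin/sh
# Warn when a mail queue directory holds too many files, and list
# files that have sat there for more than a day (possibly stuck mail).

check_mqueue() {
    # $1 = queue directory, $2 = maximum reasonable file count
    dir=$1; max=$2
    count=`ls "$dir" | wc -l | tr -d ' '`
    if [ "$count" -gt "$max" ]; then
        echo "mqueue: $count files exceeds limit of $max"
    fi
    # files untouched for more than a day may indicate stuck mail
    find "$dir" -type f -mtime +1 -print | sed 's/^/stale: /'
}
```

Run periodically as, say, check_mqueue /usr/spool/mqueue 50, this turns a silent mail backlog into a visible alert.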

Reporting Conclusions

In the same way that you watch the error log, you can also incorporate the activity log for Netscape into SCO Doctor. This allows you to create views and reports that graph access to the Web server globally or on a system-by-system basis. From these reports, you can pinpoint critical time periods in which the Web server is handling the highest volume of requests or determine the frequency with which the site is revisited. There is a lot of information that can assist administrators with current systems and planning for future systems growth as long as it is easy to access and administer.
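As a small illustration of the kind of report involved, request timestamps in the activity log can be tallied per hour with sed and uniq. The bracketed date format is assumed to match the style shown in the error-log example:

```shell
#!/bin/sh
# Count http requests per hour from log entries whose timestamps look
# like [06/Sep/1996:09:32:25]. Reads the log on stdin.

hits_per_hour() {
    # pull "06/Sep/1996:09" out of each entry, then tally duplicates
    sed -n 's/.*\[\([^:]*:[0-9][0-9]\).*/\1/p' | sort | uniq -c
}
```

Sorting the resulting counts quickly shows the peak periods referred to above, which is exactly the input needed for scheduling backups and planning capacity.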

As systems management is incorporated into company methodologies, the definition of optimum performance and how to achieve it will expand. Performance management could include more obscure areas such as managing your network bandwidth to maximize performance for specific types of applications. It could mean linking an existing tool, such as SCO Doctor, into a hardware management tool to send alerts to administrators that specific hardware components are out of service. Systems management must allow administrators the flexibility to improve and fix system issues effectively. With increasingly complex computing environments, the only way system administrators can hope to provide performance at levels that are satisfactory to everyone is by effectively using systems management to increase their management capabilities.

Resources

Loukides, M. 1990. System Performance Tuning. O'Reilly & Associates, Inc.

SCO OpenServer documentation: SCO OpenServer Performance Guide.

Miscovich, Gina. 1994. SCO Performance Tuning Handbook. Prentice Hall.

About the authors

Dan Reis is the Product Manager for SCO Doctor. He has worked in the software industry for 15 years in systems level software.

Bill Welch is the lead engineer for SCO Doctor. He has a degree in Computer Science. He has worked for SCO for 6 years, and with SCO Doctor for 3 years.