
Managing Solaris™ with Kstat

Alexander Golomshtok and Yefim Nodelman

Rapidly increasing demand for high-performance, mission-critical computer systems, and especially the proliferation of Internet-based business applications, has given birth to thousands of commercial and public-domain performance management solutions. Solaris, being a very mature operating environment with a significant installed base, enjoys the attention of many performance management software vendors and independent developers. Available tools range from very simple standalone programs, designed to monitor just a few aspects of the system's behavior, to very complex distributed systems with built-in troubleshooting, trend analysis, and forecasting capabilities.

Perhaps the most well-known commercial performance management solution for all UNIX flavors is BMC Patrol [1]. Patrol is a multi-tiered system, capable not only of monitoring various aspects of system performance, but also of advanced modeling and impact analysis. Although a good choice for enterprise-wide performance monitoring, Patrol may be overkill for those who just wish to quickly address particular performance concerns -- its distributed architecture, complexity, and especially its licensing costs may be prohibitive to small organizations looking for a comprehensive low-cost and low-maintenance solution. Other vendors, including Sun Microsystems itself, offer sophisticated performance monitoring and management tools for Solaris, but unfortunately, sophistication comes at the price of complexity and high acquisition and maintenance costs [2].

As a low-cost alternative to commercial performance-monitoring tools, free software enthusiasts have developed quite a few performance-monitoring applications for UNIX in general and Solaris in particular. Most of these free utilities concentrate on the monitoring aspects of performance management and do not include complex trend analysis and modeling capabilities. One example of such a free performance-monitoring application is a very popular utility by William LeFebvre, called "top" [3]. Top is a standalone program that continuously lists the processes with the highest CPU consumption percentage and displays other performance-related information, such as some CPU and virtual memory statistics. This handy tool makes it easy to quickly assess the overall state of the system and does not require complex setup or maintenance. However, its functionality is very limited, and it does not provide for customization.

Proctool by Walter Nielsen and Morgan Herrington is another freely available performance-monitoring utility, one that tops the list of favorites for some UNIX systems administrators [4]. Proctool was originally inspired by "top", but over the years it evolved into a more sophisticated application with a real Motif-based GUI and capabilities beyond those of "top".

Overall, most of the commercial as well as free performance-management systems do not possess enough flexibility to satisfy certain custom troubleshooting needs. As a rule, these tools are either too generic and all-encompassing or too simplistic and narrow in scope. A very common requirement, for instance, is the ability to set up custom alerts, which are triggered if a particular combination of performance measures meets or exceeds predefined thresholds. Some commercial and free tools allow for this kind of monitoring; however, this capability usually comes at the price of high complexity. Apart from making the lives of systems administrators miserable, complexity often implies excessive consumption of system resources by the performance-monitoring tool itself, which severely cripples the performance data-collection process. Even a relatively simple tool such as "top" has a resident set size of 1.5 MB on our dual-CPU Sun ES 450 system and often comes out at the top of its own process list, indicating the highest CPU utilization percentage. Other tools, such as BMC Patrol, are even worse. For portability's sake, these tools do not read the kernel performance statistics directly, but use programs such as iostat, netstat, and vmstat to collect the data, thus incurring the overhead of starting additional processes on the subject system. Another drawback is that most of these tools require root access to the system, which makes them simply unusable by those poor souls who are not granted administrative access to their computers.

Luckily, Adrian Cockcroft of Sun Microsystems and Richard Pettit of Resolute Software developed a revolutionary approach to solving performance-management problems. Instead of building a tool for performance monitoring, they came up with a toolkit, called SE (Symbolic Engine), consisting of a programming language interpreter, a few handy libraries, and a bunch of example scripts mimicking the functionality of the traditional vmstat, iostat, netstat, and other UNIX utilities [5]. The fundamental element of the SE Toolkit (www.setoolkit.com) is the SymbEL programming language, which provides a foundation for building custom performance management tools and utilities. The SE Toolkit is extremely versatile, efficient, and easy to use. For the most part, it does not require root privileges; the data collection algorithms employed by the SymbEL interpreter access the kernel statistical data directly, which allows for building very accurate performance monitors. The most attractive feature of the Toolkit is its unlimited flexibility -- any custom tool can be developed using SymbEL and the libraries that come with it. However, to take full advantage of SE's flexibility, one must first master the SymbEL programming language.

Solaris Performance Metrics Interfaces

As I mentioned previously, reading the kernel statistics in order to collect the performance data is an approach far superior to using existing Solaris programs such as netstat, iostat, sar, and others. Direct access to the kernel data eliminates the need to start additional processes for data collection purposes, thereby greatly reducing the overall resource consumption.

Solaris 2 exposes numerous interfaces for collecting various performance and status data. One of the oldest interfaces, available since Sun OS 4.x, is kvm, which stands for "kernel virtual memory". As the name implies, this interface provides a way of accessing information within the address space of an operating system. Besides reading the kernel virtual memory on a running system, kvm can be used to analyze a dump of a running kernel, which may be a result of a system crash.

Kernel virtual memory can be accessed by simply reading the /dev/kmem file, a character-special file that provides user-level access to the kernel memory image. However, the libkvm library provides a more robust, higher-level interface by encapsulating the direct access to /dev/kmem and /dev/ksyms and simplifying the process of reading the kernel data [6]. Many Solaris performance utilities, such as netstat, have utilized and continue to utilize the kvm interface despite its drawbacks, which include the need for root access (netstat is setuid) and the lack of thread safety.
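
To give a flavor of this older interface, here is a minimal, hypothetical libkvm sketch (not taken from any of the listings); it looks up the kernel symbol "maxusers", chosen purely for illustration, and reads its value out of kernel memory:

#include <stdio.h>
#include <fcntl.h>
#include <kvm.h>
#include <nlist.h>

/* Hypothetical example: read the kernel variable "maxusers" via libkvm.
   Requires read access to /dev/kmem, hence the root-access caveat above. */
int
main( void ) {
   kvm_t *kd;
   struct nlist nl[] = { { "maxusers" }, { NULL } };
   int maxusers;

   if ( !( kd = kvm_open( NULL, NULL, NULL, O_RDONLY, "kvm" ) ) )
      return -1;
   if ( kvm_nlist( kd, nl ) != 0 || nl[0].n_value == 0 ) {
      kvm_close( kd );
      return -1;
   }
   /* read sizeof(int) bytes at the symbol's kernel address */
   kvm_read( kd, nl[0].n_value, (char *)&maxusers, sizeof( maxusers ) );
   printf( "maxusers = %d\n", maxusers );
   kvm_close( kd );
   return 0;
}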

The Kstat (which stands for "kernel statistics") library [7] is a newer interface for performance metrics collection that eliminates some of the disadvantages of kvm. Kstat functions access data stored within user-level data structures. These are essentially copies of similar structures within the OS kernel, so the kernel memory image no longer needs to be scanned directly. Apart from the obvious gain in portability, this approach also allows for non-root access to a copy of the kernel data and solves the thread-safety problems. Because the Kstat data is stored in "user space", a user-level application may lock the Kstat structure, thus ensuring that none of the data changes while it is being accessed. The kvm interface, on the other hand, as a user-level library, has no mechanism for preventing the kernel data structures from being modified by kernel threads while a performance data collection operation is in progress.

The performance metrics accessible via the Kstat interface are stored in a linked list of structures, often referred to as the "Kstat chain". There are actually two chains -- one stored in user space (the user chain) and another stored in kernel space (the kernel chain). Whenever an application process issues a data collection request through the Kstat library, the library dispatches an ioctl request to a special loadable driver, designed to act as a middleman between the kernel and user space. The driver then locks the corresponding portion of the kernel Kstat chain and transfers the kernel copy of the data into user space. This mechanism prevents kernel threads from modifying the Kstat chain while the collection operation is in progress, thus ensuring the consistency of the data being read by the user-level process.

Each node of the Kstat chain (often called simply "Kstat") contains metrics that reflect the operations of a single functional component, such as a disk device or network interface. Each Kstat is generally identified by a unique "path" that consists of three distinct elements:

  • Module -- Uniquely identifies the functional area or subsystem. The module name for disk devices, for instance, may be "sd" or "ssd"; for network interfaces it is "tr" (token ring), "le" (lance ethernet), etc.
  • Instance number -- Uniquely identifies the instance within the module (i.e., disk instance number).
  • Name -- Uniquely identifies the functional component. For disk devices and network interfaces, the name is usually a combination of the module and the instance number, such as "sd1" or "hme0".

Each Kstat data structure contains a common header and a variable data portion. The header houses the Kstat identification information, such as its module, instance, name, and type along with the pointers to the data portion and the next Kstat within the chain. The data portion has variable structure and may be one of the following:

  • Raw -- A chunk of memory that can be cast to an appropriate C structure. An application dealing with raw Kstats must have prior knowledge of which C structure the data portion of the Kstat should be cast to (see the type-dispatch sketch after this list).
  • Named -- An array of name-value pairs.
  • Interrupt -- A C structure containing the information about interrupts.
  • IO -- A specific C structure containing the information about disk devices.
  • Timer -- An array of name-value pairs similar to the named type.
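
Any code that walks the chain must therefore dispatch on the ks_type field of the header. The following sketch (our own illustration, not one of the article's listings) shows such a dispatch for the raw and named types; it assumes kstat_read has already filled in the data portion:

#include <stdio.h>
#include <string.h>
#include <kstat.h>
#include <sys/sysinfo.h>   /* sysinfo_t -- layout of the raw unix.0.sysinfo Kstat */

void
print_kstat( kstat_t *pk ) {
   switch ( pk->ks_type ) {
   case KSTAT_TYPE_RAW:
      /* raw: cast ks_data to the C structure known to match this Kstat */
      if ( !strcmp( pk->ks_name, "sysinfo" ) ) {
         sysinfo_t *si = (sysinfo_t *)pk->ks_data;
         printf( "run queue total: %u\n", si->runque );
      }
      break;
   case KSTAT_TYPE_NAMED: {
      /* named: ks_data is an array of ks_ndata kstat_named_t pairs */
      kstat_named_t *kn = KSTAT_NAMED_PTR( pk );
      uint_t i;
      for ( i = 0; i < pk->ks_ndata; i++ )
         printf( "%s\n", kn[i].name );
      break;
   }
   default:   /* interrupt, IO, and timer handling omitted for brevity */
      break;
   }
}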

The Kstat library (libkstat) provides numerous functions for opening and closing Kstat chains, traversing the Kstat nodes and reading the performance data. The following little program, designed to list all available Kstats on a given system, provides an introductory example of the library usage:

 1  #include <stdio.h>
 2  #include <kstat.h>
 3   
 4  int
 5  main( int argc, char **argv ) {
 6     kstat_ctl_t *pc;
 7     kstat_t     *pk;
 8     if ( !( pc = kstat_open() ) ) {
 9        perror( "failed to open kstat" );
10        return -1;
11     }
12     printf( "%-10s%-5s%-16s\n", "module", "inst", "name" );
13     for( pk = pc->kc_chain; pk; pk = pk->ks_next )
14        printf( "%-10s%-5d%-16s\n",
15           pk->ks_module, pk->ks_instance, pk->ks_name );
16
17     kstat_close( pc );
18     return 0;
19 }
The program opens the Kstat chain (line 8) by calling the kstat_open function and iterates through the linked list of Kstat nodes (lines 13 through 15) printing the module, instance, and name fields for every Kstat. The kstat_open function returns a pointer to the Kstat control structure, which, among other things, contains the pointer to the head of the Kstat-linked list (kc_chain). As I mentioned previously, each Kstat contains the pointer to the next element of the list (ks_next), which makes it easy to traverse the chain from the beginning to the end. The program can be compiled with the following command (assuming that the source code of the example above is saved as lkstat.c):

cc -o lkstat lkstat.c -lkstat
Running the resulting lkstat binary on our Sun ES 450 system produces the following output:

module    inst name            
unix      0    kstat_headers   
unix      0    kstat_types     
unix      0    sysinfo         
unix      0    vminfo          
unix      0    vmhatstat       
...
ufs       0    inode_cache     
sd        21   sd21            
sd        3    sd3             
cpu_stat  1    cpu_stat1       
cpu_info  1    cpu_info1       
cpu_info  3    cpu_info3       
cpu_stat  3    cpu_stat3       
...
The Kstat library is widely used by Solaris 2 performance monitoring utilities -- most of the functionality exposed by the SE Toolkit, for instance, is based upon the capabilities of the Kstat library. Even simple programs, such as the well-known uptime, rely on Kstat to obtain their statistical information. Listing 1 shows a sample implementation of the uptime utility, designed to further demonstrate the versatility of the Kstat library.

This program accesses a single Kstat, "unix.0.system_misc", which contains some system usage information, such as the number of clock interrupts since boot time. First, the program obtains some system configuration information (the number of clock interrupts, or ticks, per second) using the sysconf(3C) library function at line 23. Then we open the Kstat chain using the kstat_open function and get a handle to the desired Kstat. This time, instead of iterating through the linked list of Kstats, we use the kstat_lookup function, which takes the module, instance, and name elements of the Kstat path and returns a pointer to the header of the "unix.0.system_misc" Kstat structure (line 28). At line 32, we issue the kstat_read call, which signals the Kstat driver to read the kernel data into the user chain. The "unix.0.system_misc" Kstat is of the named type, which can easily be checked by examining the ks_type field of the kstat_t structure. It is set to the value of 1 (which indicates the named type), so we use the kstat_data_lookup function to look up the values of the variables we're interested in.
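
The sequence of calls just described boils down to the following condensed sketch (ours, not the full Listing 1; error handling is minimal and the load-average part is omitted):

#include <stdio.h>
#include <unistd.h>   /* sysconf */
#include <kstat.h>

int
main( void ) {
   kstat_ctl_t   *pc;
   kstat_t       *pk;
   kstat_named_t *pn;
   long hz = sysconf( _SC_CLK_TCK );   /* clock ticks per second */

   if ( !( pc = kstat_open() ) )
      return -1;
   /* module "unix", instance 0, name "system_misc" */
   if ( !( pk = kstat_lookup( pc, "unix", 0, "system_misc" ) ) )
      return -1;
   if ( kstat_read( pc, pk, NULL ) == -1 )   /* pull kernel data into user chain */
      return -1;

   pn = (kstat_named_t *)kstat_data_lookup( pk, "clk_intr" );
   printf( "up %lu second(s)\n", pn->value.ul / hz );

   kstat_close( pc );
   return 0;
}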

At lines 36 through 43, we call the kstat_data_lookup function four times to obtain the value of the "clk_intr" variable (which contains the number of clock ticks since boot), and the values of the "avenrun_1min", "avenrun_5min", and "avenrun_15min" variables. These represent the average number of processes on the run queue within the last one, five, and fifteen minutes, respectively. The "avenrun" variables are used to calculate the system load average based on the formula borrowed from the source code of the "top" utility [3]. I believe that the native Solaris 2 uptime program utilizes the same formula, which simply converts the unsigned long number into a double and divides it by a scaling factor, FSCALE, taken from /usr/include/sys/param.h. We then use the value of the "clk_intr" variable to calculate the number of days, hours, minutes, and seconds since the last boot (lines 45 through 48). Finally, we print out the uptime information along with the load average figures and close the Kstat chain. When compiled and run on an ES 450, this program produces the following output:

Up: 204 day(s) 20 hour(s) 36 minute(s) 4 second(s), load average: 0.02, 0.01, 0.01
The output is quite similar to that of the standard uptime command; however, to save space, we excluded the code for calculating the number of users on the system. The number of users can easily be determined by reading the user and accounting information via the utmpx(4) interface.
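
For the curious, such a user count might look like this (a sketch using the standard utmpx(4) calls, not code from the listings):

#include <stdio.h>
#include <utmpx.h>

int
main( void ) {
   struct utmpx *ut;
   int users = 0;

   setutxent();                     /* rewind the utmpx database */
   while ( ( ut = getutxent() ) )
      if ( ut->ut_type == USER_PROCESS )
         users++;                   /* count active login sessions */
   endutxent();
   printf( "%d user(s)\n", users );
   return 0;
}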

Kstat and Perl

Although the Kstat programming model is fairly simple, it still requires extensive C programming skills, which may scare away even the most experienced systems administrators. One major flaw of the Kstat interface is the necessity to program around five different types of Kstats. This makes the process of reading the performance metrics inconsistent and may lead to obscure errors. This may not be an issue for small and simple programs, such as our uptime utility. However, developing an equivalent of vmstat, for instance, would require access to several Kstat structures of different types, which can easily lead to complex, convoluted, and impossible-to-debug code. The SymbEL programming language of the SE Toolkit [5] takes a much more consistent approach by allowing the developer to read the values of any Kstat variables in a uniform manner, regardless of the Kstat type. Unfortunately, this consistency comes at a price -- one would have to learn SymbEL.

As usual, CPAN [8] offers a Perl extension module that enables anybody with basic Perl programming skills to take full advantage of the Kstat interface. This module, called Solaris::Kstat [9], provides uniform access to Kstat data via a tied hash interface, so that any Kstat variable can be read using its module, instance, and name simply as hash keys.
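
For example, the "unix.0.system_misc" Kstat from the previous section can be read as follows (a minimal sketch; the constant 256 is the FSCALE value from /usr/include/sys/param.h):

#!/usr/bin/perl -w
# Sketch: Solaris::Kstat's tied hash -- module, instance, name, and
# variable name serve as nested hash keys.
use strict;
use Solaris::Kstat;

my $ks   = Solaris::Kstat->new();
my $misc = $ks->{unix}{0}{system_misc};

printf( "clock interrupts since boot: %d\n", $misc->{clk_intr} );
printf( "load average (1 min): %.2f\n", $misc->{avenrun_1min} / 256 );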

To demonstrate the advantages of the Perl-based approach and provide grounds for comparison, the uptime program was converted into a Perl script (Listing 2).

The first very noticeable difference is that the Perl script is almost half the size of the corresponding C version. Also, no function calls are required to navigate through the Kstat chain; instead, we simply read the values of the already familiar "clk_intr" and "avenrun" variables from a hash, using the module, instance, name, and variable names as hash keys (lines 9 through 15). The rest of the program remains the same, and it produces exactly the same output as the C version of uptime. Clearly, the Perl-based approach will appeal to administrators and developers in search of custom performance-monitoring utilities.

As mentioned previously, the main advantage of using the Kstat interface is the ability to quickly develop custom performance-monitoring scripts that check one or two very specific aspects of the system's behavior and can be tailored to the needs of a particular environment. While configuring database and file servers, for instance, one must make sure that the load is spread evenly across all available disk devices, hence the need to monitor for slow or overloaded disks. Inspired by one of the example scripts that come with the SE Toolkit [10], we created another program that detects disks whose response times and utilization percentages exceed given thresholds (Listing 3).

This program, called "slowdisk", takes three command-line parameters: -i, which specifies the sleep interval between taking snapshots of the performance metrics; -s, which sets the threshold for the service time; and -b, which specifies the threshold for the utilization percentage. Besides using the Solaris::Kstat module, this program loads the Solaris::MapDev extension, which is also part of the Solaris bundle by Alan Burlison [9]. Solaris::MapDev is designed to provide the mapping between the instance names used by the Kstat interface (e.g., "sd1") and conventional device names (e.g., "c0t0d0"). We use the get_inst_names function of Solaris::MapDev to obtain the instance names for all disk devices on the system (line 14). Since the function returns not only disk, but also tape, floppy, CD-ROM, and other instance names, we apply a grep filter to select only those that start with "sd" or "ssd" (i.e., internal or storage-array disks).
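
In condensed form, that filtering step looks roughly like this (a sketch with our own variable names, not an excerpt from Listing 3):

# Sketch: obtain all instance names, keep only "sd"/"ssd" disks.
use Solaris::MapDev qw(get_inst_names inst_to_dev);

my @disks = grep { /^s?sd\d+$/ } get_inst_names();
print inst_to_dev( $_ ), "\n" foreach @disks;   # e.g., "sd3" -> "c0t0d0"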

The performance metrics exposed via the Kstat interface are usually either running totals or instantaneous values. Thus, to assess the performance characteristics of a particular disk, we have to take periodic snapshots of these values and then calculate averages over a time interval. For this reason, we save the initial values of the performance metrics for each disk device using the foreach loop at lines 16 through 21. This loop invokes the Solaris::Kstat update function, which calls the kstat_chain_update function of libkstat. This is necessary to synchronize the user and kernel chains, because from time to time the kernel modifies its linked list of Kstats by adding new nodes or removing old ones. Once the chain is synchronized, we parse the disk instance name to obtain the module name ("sd" or "ssd") and the instance number needed to read the data from the Kstat hash (line 18). We then save the snapshot of the Kstat values for a given disk in a hash pointed to by the $prev hash reference, using the disk instance name as a key (line 19). Finally, we record the time of the snapshot by reading the value of the "clk_intr" variable.
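
Condensed, the snapshot step amounts to something like this (a sketch; $ks, $prev, and $tick are our names, not necessarily those of Listing 3):

# Sketch: resynchronize the user chain with the kernel chain, then
# copy one disk's Kstat values into an ordinary hash so that later
# updates do not overwrite the saved snapshot.
$ks->update();                                   # wraps kstat_chain_update()
my ( $mod, $inst ) = $disk =~ /^(s?sd)(\d+)$/;   # "sd3" -> ("sd", 3)
$prev->{$disk} = { %{ $ks->{$mod}{$inst}{$disk} } };
$tick = $ks->{unix}{0}{system_misc}{clk_intr};   # snapshot time, in ticks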

Having saved the initial state of our performance metrics, we enter the main loop at line 23. First, the execution of the program is suspended using the sleep function (line 24), with the interval controlled by the optional command-line argument -i. If the command-line argument is not supplied, the program sleeps for 5 seconds. When it wakes up, a new snapshot of the Kstat performance metrics is taken and recorded -- this time into a different hash, pointed to by the $curr hash reference (lines 25 through 31). Now that we have both current and previous snapshots of the performance metrics, we can compare them and calculate the average figures. The foreach loop at line 32 once again iterates over each disk device, first calculating the elapsed time between snapshots (line 33). At lines 34 through 37, we calculate the average number of completed reads per second ($rps) and writes per second ($wps), using the values of the "reads" and "writes" Kstat variables and the elapsed time between snapshots calculated in the previous step. We then figure out the high-resolution time interval ($hr_time) by calculating the difference between the current and the previous values of the "wlastupdate" Kstat variable, which contains the time of the last update to the wait queue (lines 39 through 41).

If the current and the previous values of "wlastupdate" are the same, we use a default high-resolution time interval of 1 ns (line 41). At lines 44 through 47, we compute the average busy wait ($avw) and busy run ($avr) queue lengths using the values of the "wlentime" and "rlentime" variables, which contain the sum of the queue lengths multiplied by the time at each length, for the wait and run queues respectively. Finally, we compute the average wait ($avwait) and average service ($avserv) times (line 49), using the previously calculated average busy wait and busy run queue lengths and the total number of completed reads/writes per second, calculated at line 42 as the sum of reads per second ($rps) and writes per second ($wps). Then we can calculate the total response, or residency, time ($svc_t at line 55) as the sum of the average wait and average service times, as well as the average run percent ($r_pct at line 53). The latter is the difference between the current and previous times spent running ("rtime"), divided by the high-resolution time interval and expressed as a percentage.
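
Stripped of the surrounding loop, the arithmetic just described amounts to the following (a sketch with our own variable names; $p and $c hold the previous and current snapshots of one disk's metrics, $p_tick and $c_tick the corresponding "clk_intr" readings, and $hz the tick rate from sysconf):

my $et   = ( $c_tick - $p_tick ) / $hz;                  # elapsed time, seconds
my $rps  = ( $c->{reads}  - $p->{reads}  ) / $et;        # reads per second
my $wps  = ( $c->{writes} - $p->{writes} ) / $et;        # writes per second
my $tps  = $rps + $wps;                                  # total IOs per second

my $hr_t = $c->{wlastupdate} - $p->{wlastupdate};        # high-res interval, ns
$hr_t    = 1.0 unless $hr_t;                             # default to 1 ns

my $avw  = ( $c->{wlentime} - $p->{wlentime} ) / $hr_t;  # avg wait-queue length
my $avr  = ( $c->{rlentime} - $p->{rlentime} ) / $hr_t;  # avg run-queue length

my $avwait = $tps ? $avw / $tps * 1000 : 0;              # avg wait time, ms
my $avserv = $tps ? $avr / $tps * 1000 : 0;              # avg service time, ms
my $svc_t  = $avwait + $avserv;                          # total response time, ms
my $r_pct  = ( $c->{rtime} - $p->{rtime} ) / $hr_t * 100;   # percent busy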

Finally, we compare the newly calculated values against the thresholds specified by the command-line arguments (line 59); print out the timestamp, disk name, and calculated values in case these thresholds are met or exceeded (lines 57 through 58); and then save the current snapshot values in the $prev hash for subsequent iterations (line 60). Note that we must convert the disk instance names into conventional device names, which is done by the inst_to_dev function of the Solaris::MapDev module at line 58. We also apply default threshold values in case the command-line arguments are omitted -- 50 ms for the total residency time and 20% for the average run percentage (line 59). These threshold values are exactly the same as those used by the virtual_adrian_lite.se script [10] of the SE Toolkit to detect slow disks [11, 12].

When run on the ES 450 while heavy compress jobs are hitting one of the file systems, the script yields the following results:

slow disk detected: 12:47:41     /dev/dsk/c0t0d0     51.11    73.16
slow disk detected: 12:50:37     /dev/dsk/c0t0d0     51.03    71.33
slow disk detected: 12:52:58     /dev/dsk/c0t0d0     51.32    72.10
slow disk detected: 12:54:08     /dev/dsk/c0t0d0     60.43    73.16
slow disk detected: 12:55:54     /dev/dsk/c0t0d0     60.82    68.77

Conclusion

Solaris::Kstat is one of the handiest and most exciting Perl modules that CPAN has to offer. It is not, however, without flaws -- its programming model, although fairly portable and easy to use, is relatively low level. To produce, for example, an equivalent of the vmstat utility, one would have to possess a deep knowledge of the system's internals, as the virtual memory metrics are spread across multiple unrelated Kstat nodes. Also, as noted, Solaris::Kstat programming requires the developer to perform all the computations needed to obtain the average figures used for performance monitoring, which can get very involved. The SE Toolkit, on the other hand, significantly simplifies programming by providing high-level wrappers, or classes, such as vmstat, iostat, netstat, and others, that encapsulate all the calculations needed to obtain complete virtual memory, IO, or network metrics. It is fairly trivial to create similar wrappers for the Solaris::Kstat module; however, that is left as an exercise for the reader.

Another problem is the lack of reliable documentation for Kstats. The Solaris man pages are sketchy and incomplete, and Adrian Cockcroft's Sun Performance and Tuning [2] remains the most comprehensive source of information. To gain an intimate familiarity with the subject, I strongly encourage readers to examine the source code of the example scripts and classes of the SE Toolkit. The SymbEL programming language of the SE Toolkit is quite similar to C, so most of the example scripts can be easily understood and deciphered.

References

1. BMC Software. PATROL for Performance Management and Prediction. http://www.bmc.com/products/esm/perfpred.html.

2. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998. pp 26-37.

3. William LeFebvre. UNIX Top. http://www.groupsys.com/top.

4. Walter Nielsen, Morgan Herrington. Proctool. ftp://opcom.sun.ca/pub/binaries/proctool.

5. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998. pp 449-556.

6. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998. pp 373-386.

7. kstat(3K) man page. Sun Solaris 2.

8. CPAN -- the Comprehensive Perl Archive Network. http://www.perl.com/CPAN-local.

9. Alan Burlison. CPAN Directory ABURLISON. Latest release: Solaris-0.05a.tar.gz 10/2/1999.

10. Adrian Cockcroft, Richard Pettit. virtual_adrian_lite.se. RICHPse/examples.

11. Adrian Cockcroft. System Performance Monitoring. Sun World Online, 09/05/1995.

12. Adrian Cockcroft. Clarifying Disk measurements and terminology. UNIX Insider, 09/01/1997.

Alexander Golomshtok is a professional consultant who, for the last decade, has been hanging around downtown New York developing large-scale software systems and infrastructure solutions for Wall Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.

Yefim Nodelman is a seasoned systems administrator with more than seven years of professional experience in supporting large UNIX and Windows installations.