Managing Solaris with Kstat
Alexander Golomshtok and Yefim Nodelman
The rapidly increasing demand for high-performance, mission-critical computer systems, and especially the proliferation of Internet-based business applications, has given birth to thousands of commercial and public-domain performance management solutions. Solaris, being a very mature operating environment with a significant install base, enjoys the attention of many performance management software vendors and independent developers. Available tools range from very simple standalone programs, designed to monitor just a few aspects of the system's behavior, to very complex distributed systems with built-in troubleshooting, trend analysis, and forecasting capabilities.
Perhaps the best-known commercial performance management solution for all UNIX flavors is BMC Patrol [1]. Patrol is a multi-tiered system, capable not only of monitoring various aspects of system performance, but also of advanced modeling and impact analysis. Although a good choice for enterprise-wide performance monitoring, Patrol may be overkill for those who just wish to quickly address particular performance concerns -- its distributed architecture, complexity, and especially its licensing costs may be prohibitive to small organizations looking for a comprehensive low-cost, low-maintenance solution. Other vendors, including Sun Microsystems itself, offer sophisticated performance monitoring and management tools for Solaris, but unfortunately, sophistication comes at the price of complexity and high acquisition and maintenance costs [2].
As a low-cost alternative to commercial performance-monitoring tools, free software enthusiasts have developed quite a few performance-monitoring applications for UNIX in general and Solaris in particular. Most of these free utilities concentrate on the monitoring aspects of performance management and do not include complex trend analysis and modeling capabilities. One example of such a free performance-monitoring application is the very popular utility by William LeFebvre, called "top" [3]. Top is a standalone program that continuously lists the processes with the highest CPU consumption and displays other performance-related information, such as CPU and virtual memory statistics. This handy tool makes it easy to quickly assess the overall state of the system and does not require complex setup or maintenance. However, its functionality is very limited, and it offers little in the way of customization.
Proctool by Walter Nielsen and Morgan Herrington is another freely available performance-monitoring utility, which tops the list of favorites for some UNIX systems administrators [4]. Proctool was originally inspired by "top", but over the years it evolved into a more sophisticated application with a full Motif-based GUI and capabilities beyond those of "top".
Overall, most of the commercial as well as free performance-management systems do not possess enough flexibility to satisfy certain custom troubleshooting needs. As a rule, these tools are either too generic and all-encompassing or too simplistic and narrow in scope. A very common requirement, for instance, is the ability to set up custom alerts, which are triggered when a particular combination of performance measures meets or exceeds predefined thresholds. Some commercial and free tools allow for this kind of monitoring; however, this capability usually comes at the price of high complexity. Apart from making the lives of systems administrators miserable, complexity often implies excessive consumption of system resources by the performance-monitoring tool itself, which severely cripples the performance data-collection process. Even a relatively simple tool such as "top" has a resident set size of 1.5 MB on our dual-CPU Sun ES 450 system and often comes out at the top of its own process list, indicating the highest CPU utilization percentage. Other tools, such as BMC Patrol, are even worse. For portability's sake, these tools do not read the kernel performance statistics directly, but use programs such as iostat, netstat, and vmstat to collect the data, thus incurring the overhead of starting additional processes on the subject system. Another drawback is that most of these tools require root access to the system, which makes them simply unusable for those poor people who are not granted administrative access to their computers.
Luckily, Adrian Cockcroft of Sun Microsystems and Richard Pettit of Resolute Software developed a revolutionary approach to solving performance-management problems. Instead of building a tool for performance monitoring, they came up with a toolkit, called SE (Symbolic Engine), consisting of a programming language interpreter, a few handy libraries, and a collection of example scripts mimicking the functionality of the traditional vmstat, iostat, netstat, and other UNIX utilities [5]. The fundamental element of the SE Toolkit (www.setoolkit.com) is the SymbEL programming language, which provides a foundation for building custom performance management tools and utilities. The SE Toolkit is extremely versatile, efficient, and easy to use. For the most part, it does not require root privileges; the data collection algorithms employed by the SymbEL interpreter access the kernel statistical data directly, which allows for building very accurate performance monitors. The most attractive feature of the Toolkit is its unlimited flexibility -- any custom tool can be developed using SymbEL and the libraries that come with it. However, to take full advantage of SE's flexibility, one must first master the SymbEL programming language.
Solaris Performance Metrics Interfaces
As I mentioned previously, reading the kernel statistics directly in order to collect performance data is an approach far superior to using existing Solaris programs such as netstat, iostat, sar, and others. Direct access to the kernel data eliminates the need to start additional processes for data collection purposes, thereby greatly reducing the overall resource consumption.
Solaris 2 exposes numerous interfaces for collecting various performance and status data. One of the oldest interfaces, available since SunOS 4.x, is kvm, which stands for "kernel virtual memory". As the name implies, this interface provides a way of accessing information within the address space of the operating system. Besides reading the kernel virtual memory on a running system, kvm can be used to analyze a kernel memory dump, such as one produced by a system crash.
Kernel virtual memory can be accessed by simply reading the /dev/kmem file, a character-special file that provides user-level access to the kernel memory image. However, the libkvm library provides a more robust, higher-level interface by encapsulating the direct access to /dev/kmem and /dev/ksyms and simplifying the process of reading the kernel data [6]. Many Solaris performance utilities, such as netstat, utilized and continue to utilize the kvm interface despite its drawbacks, which include the need for root access (netstat is setuid) and the lack of thread safety.
The Kstat (which stands for "kernel statistics") library [7] is a newer interface for performance metrics collection that eliminates some of the disadvantages of kvm. Kstat functions access data stored within user-level data structures. These are essentially copies of similar structures within the OS kernel, so the kernel memory image no longer needs to be scanned directly. Apart from the obvious portability gains, this approach also allows non-root access to a copy of the kernel data and solves the thread-safety problems. Because the Kstat data is stored in user space, a user-level application may lock the Kstat structure, thus ensuring that none of the data changes while it is being accessed. kvm, on the other hand, being a user-level library, has no mechanism for preventing the kernel data structures from being modified by kernel threads while a performance data collection operation is in progress.
The performance metrics accessible via the Kstat interface are stored in a linked list of structures, often referred to as the "Kstat chain". There are actually two chains -- one stored in user space (the user chain) and another stored in kernel space (the kernel chain). Whenever an application process issues a data collection request through the Kstat library, the library dispatches an ioctl request to a special loadable driver, designed to act as a middleman between the kernel and user space. The driver then locks the corresponding portion of the kernel Kstat chain and transfers the kernel copy of the data into user space. This mechanism prevents kernel threads from modifying the Kstat chain while the collection operation is in progress, thus ensuring the consistency of the data being read by the user-level process.
Each node of the Kstat chain (often called simply a "Kstat") contains metrics that reflect the operation of a single functional component, such as a disk device or network interface. Each Kstat is generally identified by a unique "path" that consists of three distinct elements:
- Module -- Uniquely identifies the functional area or subsystem. The module name for disk devices, for instance, may be "sd" or "ssd"; for network interfaces it is "tr" (Token Ring), "le" (Lance Ethernet), etc.
- Instance number -- Uniquely identifies the instance within the module (e.g., the disk instance number).
- Name -- Uniquely identifies the functional component. For disk devices and network interfaces, the name is usually a combination of the module and the instance number, such as "sd1" or "hme0".
Each Kstat data structure contains a common header and a variable data portion. The header houses the Kstat identification information, such as its module, instance, name, and type, along with pointers to the data portion and to the next Kstat within the chain. The data portion has a variable structure and may be one of the following:
- Raw -- A chunk of memory that can be cast to an appropriate C structure. An application dealing with raw Kstats must know in advance which C structure the data portion should be cast to.
- Named -- An array of name-value pairs.
- Interrupt -- A C structure containing information about interrupts.
- IO -- A specific C structure containing information about I/O devices, such as disks.
- Timer -- An array of name-value pairs similar to the named type.
The Kstat library (libkstat) provides numerous functions for opening and closing the Kstat chain, traversing the Kstat nodes, and reading the performance data. The following little program, designed to list all available Kstats on a given system, provides an introductory example of the library's usage:
 1  #include <stdio.h>
 2  #include <kstat.h>
 3
 4  int
 5  main( int argc, char **argv ) {
 6      kstat_ctl_t *pc;
 7      kstat_t *pk;
 8      if ( !( pc = kstat_open() ) ) {
 9          perror( "failed to open kstat" );
10          return -1;
11      }
12      printf( "%-10s%-5s%-16s\n", "module", "inst", "name" );
13      for ( pk = pc->kc_chain; pk; pk = pk->ks_next )
14          printf( "%-10s%-5d%-16s\n",
15                  pk->ks_module, pk->ks_instance, pk->ks_name );
16
17      kstat_close( pc );
18      return 0;
19  }
The program opens the Kstat chain (line 8) by calling the kstat_open function and iterates through the linked list of Kstat nodes (lines 13 through 15), printing the module, instance, and name fields for every Kstat. The kstat_open function returns a pointer to the Kstat control structure, which, among other things, contains the pointer to the head of the Kstat linked list (kc_chain). As I mentioned previously, each Kstat contains a pointer to the next element of the list (ks_next), which makes it easy to traverse the chain from beginning to end. The program can be compiled with the following command (assuming that the source code of the example above is saved as lkstat.c):
cc -o lkstat lkstat.c -lkstat
Running the resulting lkstat binary on our Sun ES 450 system
produces the following output:
module inst name
unix 0 kstat_headers
unix 0 kstat_types
unix 0 sysinfo
unix 0 vminfo
unix 0 vmhatstat
...
ufs 0 inode_cache
sd 21 sd21
sd 3 sd3
cpu_stat 1 cpu_stat1
cpu_info 1 cpu_info1
cpu_info 3 cpu_info3
cpu_stat 3 cpu_stat3
...
The Kstat library is widely used by Solaris 2 performance monitoring utilities -- most of the functionality exposed by the SE Toolkit, for instance, is based upon the capabilities of the Kstat library. Even simple programs, such as the well-known uptime, rely on Kstat to obtain their statistical information. Listing 1 shows a sample implementation of the uptime utility, designed to further demonstrate the versatility of the Kstat library.
This program accesses a single Kstat, "unix.0.system_misc", which contains some system usage information, such as the number of clock interrupts since boot time. First, the program obtains some system configuration information (the number of clock interrupts, or ticks, per second) using the sysconf(3C) library function at line 23. Then we open the Kstat chain using the kstat_open function and get a handle to the desired Kstat. This time, instead of iterating through the linked list of Kstats, we use the kstat_lookup function, which takes the module, instance, and name elements of the Kstat path and returns the pointer to the header of the "unix.0.system_misc" Kstat structure (line 28). At line 32, we issue the kstat_read call, which signals the Kstat driver to read the kernel data into the user chain. The "unix.0.system_misc" Kstat is of the named type, which can easily be checked by examining the ks_type field of the kstat_t structure. It is set to the value of 1 (which indicates the named type), so we use the kstat_data_lookup function to look up the values of the variables we are interested in.
At lines 36 through 43, we call the kstat_data_lookup function four times to obtain the value of the "clk_intr" variable (which contains the number of clock ticks since boot), and the values of the "avenrun_1min", "avenrun_5min", and "avenrun_15min" variables. These represent the average number of processes on the run queue within the last one, five, and fifteen minutes, respectively. The "avenrun" variables are used to calculate the system load average based on a formula borrowed from the source code of the "top" utility [3]. I believe that the native Solaris 2 uptime program uses the same formula, which simply converts the unsigned long number into a double and divides it by a scaling factor, FSCALE, taken from /usr/include/sys/param.h.
We then use the value of the "clk_intr" variable to calculate
the number of days, hours, minutes, and seconds since the last boot
(lines 45 through 48). Finally, we print out the uptime information
along with the load average figures and close the Kstat chain. When
compiled and run on an ES 450, this program produces the following
output:
Up: 204 day(s) 20 hour(s) 36 minute(s) 4 second(s), load average: 0.02, 0.01, 0.01
The output is quite similar to that of the standard uptime command; however, to save space, we excluded the code for calculating the number of users on the system. The number of users can easily be determined by reading the user and accounting information via the utmpx(4) interface.
Kstat and Perl
Although the Kstat programming model is fairly simple, it still requires extensive C programming skills, which may scare away even the most experienced systems administrators. One major flaw of the Kstat interface is the necessity to program around five different types of Kstats. This makes the process of reading the performance metrics inconsistent and may lead to obscure errors. This may not be an issue for small and simple programs, such as our uptime utility. However, developing an equivalent of vmstat, for instance, would require access to several Kstat structures of different types, which can easily lead to complex, convoluted, and impossible-to-debug code. The SymbEL programming language of the SE Toolkit [5] takes a much more consistent approach by allowing the developer to read the values of any Kstat variables in a uniform manner regardless of the Kstat type. Unfortunately, this consistency comes at a price -- one would have to learn SymbEL.
As usual, CPAN [8] offers a Perl extension module that enables anybody with basic Perl programming skills to take full advantage of the Kstat interface. This module, called Solaris::Kstat [9], provides uniform access to Kstat data via a tied hash interface, so that any Kstat variable can be read simply by using its module, instance, and name as hash keys.
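As a minimal sketch of what this access pattern looks like (reading the same "unix.0.system_misc" Kstat that our uptime example uses; error handling omitted):

#!/usr/bin/perl -w
use strict;
use Solaris::Kstat;

# Open the Kstat chain; the returned object behaves as a tied hash.
my $ks = Solaris::Kstat->new();

# Module, instance, name, and variable name are simply hash keys.
my $clk_intr = $ks->{unix}{0}{system_misc}{clk_intr};
print "clock interrupts since boot: $clk_intr\n";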
To demonstrate the advantages of the Perl-based approach and provide grounds for comparison, the uptime program was converted into a Perl script (Listing 2).
The first very noticeable difference is that the Perl script is almost half the size of the corresponding C version. Also, no function calls are required to navigate through the Kstat chain; instead, we simply read the values of the already familiar "clk_intr" and "avenrun" variables from a hash, using the module, instance, name, and variable names as hash keys (lines 9 through 15). The rest of the program remains the same, and it produces exactly the same output as the C version of uptime. Clearly, the Perl-based approach will appeal to administrators and developers in search of custom performance-monitoring utilities.
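For readers who do not have the listings at hand, the essence of the Perl version can be sketched as follows. This is our own minimal approximation, not Listing 2 itself, and it assumes that FSCALE is 256, the value defined in /usr/include/sys/param.h on our systems:

#!/usr/bin/perl -w
# Minimal sketch of a Kstat-based uptime.
use strict;
use POSIX qw(sysconf _SC_CLK_TCK);
use Solaris::Kstat;

my $ticks = sysconf( _SC_CLK_TCK );     # clock interrupts per second
my $ks    = Solaris::Kstat->new();
my $misc  = $ks->{unix}{0}{system_misc};

# Convert clock ticks since boot into days/hours/minutes/seconds.
my $secs = int( $misc->{clk_intr} / $ticks );
my ( $d, $h, $m, $s ) = ( int( $secs / 86400 ),
                          int( $secs % 86400 / 3600 ),
                          int( $secs % 3600 / 60 ),
                          $secs % 60 );

# The load averages are scaled integers; divide by FSCALE (256 here).
my @load = map { $misc->{"avenrun_${_}min"} / 256 } ( 1, 5, 15 );

printf "Up: %d day(s) %d hour(s) %d minute(s) %d second(s), " .
       "load average: %.2f, %.2f, %.2f\n", $d, $h, $m, $s, @load;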
As mentioned previously, the main advantage of using the Kstat interface is the ability to quickly develop custom performance-monitoring scripts that check one or two very specific aspects of the system's behavior and can be tailored to the needs of a particular environment. When configuring database and file servers, for instance, one has to make sure that the load is spread evenly across all available disk devices, hence the need to monitor for slow or overloaded disks. Inspired by one of the example scripts that come with the SE Toolkit [10], we created another program that detects disks whose response times and utilization percentages exceed given thresholds (Listing 3).
This program, called "slowdisk", takes three command-line parameters: -i, which specifies the sleep interval between snapshots of the performance metrics; -s, which sets the threshold for the service time; and -b, which specifies the threshold for the utilization percentage. Besides using the Solaris::Kstat module, this program loads the Solaris::MapDev extension, which is also part of the Solaris bundle by Alan Burlison [9]. Solaris::MapDev provides the mapping between the instance names used by the Kstat interface (e.g., "sd1") and conventional device names (e.g., "c0t0d0"). We use the get_inst_names function of Solaris::MapDev to obtain the instance names for all disk devices on the system (line 14). Since the function returns not only disk, but also tape, floppy, CD-ROM, and other instance names, we apply a grep filter to select only those that start with "sd" or "ssd" (i.e., internal or storage-array disks).
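A condensed sketch of this filtering step (our own approximation of Listing 3, assuming get_inst_names returns instance names such as "sd1", "st0", or "fd0"):

use Solaris::MapDev qw(get_inst_names);

# Keep only the "sd" and "ssd" instances (internal or storage-array
# disks); drop tapes, floppies, CD-ROMs, and other devices.
my @disks = grep { /^s?sd\d+$/ } get_inst_names();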
The performance metrics exposed via the Kstat interface are usually either running totals or instantaneous values. Thus, to assess the performance characteristics of a particular disk, we have to take periodic snapshots of these values and then calculate averages over a time interval. For this reason, we save the initial values of the performance metrics for each disk device using the foreach loop at lines 16 through 21. This loop invokes the Solaris::Kstat update function, which calls the kstat_chain_update function of libkstat. This is necessary to synchronize the user and kernel chains, because from time to time the kernel modifies its linked list of Kstats by adding new nodes or removing old ones. Once the chain is synchronized, we parse the disk instance name to obtain the module name ("sd" or "ssd") and the instance number needed to read the data from the Kstat hash (line 18). We then save the snapshot of the Kstat values for a given disk in a hash pointed to by the $prev hash reference, using the disk instance name as a key (line 19). Finally, we record the time of the snapshot by reading the value of the "clk_intr" variable.
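A sketch of this initialization step might look like the following (continuing the fragment above; the hash layout and the per-disk snapshot time are our assumptions about Listing 3, not a copy of it):

use Solaris::Kstat;

my $ks   = Solaris::Kstat->new();
my $prev = {};                     # snapshots from the previous pass

foreach my $disk ( @disks ) {
    # Synchronize the user chain with the kernel chain in case the
    # kernel has added or removed Kstat nodes.
    $ks->update();

    # Parse the instance name, e.g., "sd1" -> module "sd", instance 1.
    my ( $module, $instance ) = $disk =~ /^(\D+)(\d+)$/;

    # Save a copy of the IO Kstat name-value pairs for this disk,
    # keyed by the instance name...
    $prev->{$disk} = { %{ $ks->{$module}{$instance}{$disk} } };

    # ...and record the snapshot time in clock ticks.
    $prev->{$disk}{snap_time} = $ks->{unix}{0}{system_misc}{clk_intr};
}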
Having saved the initial state of our performance metrics, we enter the main loop at line 23. First, the execution of the program is suspended using the sleep function (line 24), with the interval controlled by the optional command-line argument -i. If this argument is not supplied, the program sleeps for 5 seconds. When it wakes up, a new snapshot of the Kstat performance metrics is taken and recorded -- this time into a different hash, pointed to by the $curr hash reference (lines 25 through 31). Now that we have both the current and the previous snapshots of the performance metrics, we can compare them and calculate the average figures. The foreach loop at line 32 once again iterates over each disk device, first calculating the elapsed time between snapshots (line 33). At lines 34 through 37, we calculate the average number of completed reads per second ($rps) and writes per second ($wps) using the values of the "reads" and "writes" Kstat variables and the elapsed time calculated in the previous step. We then figure out the high-resolution time interval ($hr_time) by calculating the difference between the current and the previous values of the "wlastupdate" Kstat variable, which contains the time of the last update to the wait queue (lines 39 through 41).
In case the current and the previous values of "wlastupdate" are the same, we use a default high-resolution time interval of 1 ns (line 41). At lines 44 through 47, we compute the average busy wait ($avw) and busy run ($avr) queue lengths using the values of the "wlentime" and "rlentime" variables, which contain the sums of the queue lengths multiplied by the time spent at each length, for the wait and run queues respectively. Finally, we compute the average wait ($avwait) and average service ($avserv) times (lines 49 onward), using the previously calculated average busy wait and busy run queue lengths and the total number of completed reads and writes per second, calculated at line 42 as the sum of reads per second ($rps) and writes per second ($wps). We can then calculate the total response, or residency, time ($svc_t at line 55) as the sum of the average wait and average service times, as well as the average run percentage ($r_pct at line 53), which is the difference between the current and the previous times spent running ("rtime"), divided by the high-resolution time interval and expressed as a percentage.
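Putting the arithmetic together, the per-disk calculations can be sketched as follows (a continuation of the fragments above with our own variable names; $curr holds the newer snapshot taken the same way as $prev, and $ticks is the sysconf(3C) clock-tick rate from the earlier uptime sketch):

foreach my $disk ( @disks ) {
    my ( $p, $c ) = ( $prev->{$disk}, $curr->{$disk} );

    # Elapsed wall-clock time between snapshots, in seconds.
    my $elapsed = ( $c->{snap_time} - $p->{snap_time} ) / $ticks;

    # Average completed reads and writes per second.
    my $rps = ( $c->{reads}  - $p->{reads} )  / $elapsed;
    my $wps = ( $c->{writes} - $p->{writes} ) / $elapsed;
    my $tps = $rps + $wps;

    # High-resolution interval from the wait-queue timestamps;
    # default to 1 ns if the timestamp has not moved.
    my $hr_time = ( $c->{wlastupdate} - $p->{wlastupdate} ) || 1;

    # Average wait and run queue lengths: the length-time products
    # divided by the interval.
    my $avw = ( $c->{wlentime} - $p->{wlentime} ) / $hr_time;
    my $avr = ( $c->{rlentime} - $p->{rlentime} ) / $hr_time;

    # Average wait and service times (in milliseconds), the total
    # response (residency) time, and the average run percentage.
    my $avwait = $tps ? $avw * 1000 / $tps : 0;
    my $avserv = $tps ? $avr * 1000 / $tps : 0;
    my $svc_t  = $avwait + $avserv;
    my $r_pct  = ( $c->{rtime} - $p->{rtime} ) / $hr_time * 100;

    # ... threshold check and output go here (sketched below) ...
}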
Finally, we compare the newly calculated values against the thresholds specified by the command-line arguments (line 59) and print out the timestamp, disk name, and calculated values in case these thresholds are met or exceeded (lines 57 through 58), then save the current snapshot values in the $prev hash for subsequent iterations (line 60). Note that we must convert the disk instance names into conventional device names, which is done by the inst_to_dev function of the Solaris::MapDev module at line 58. We also apply default threshold values in case the command-line arguments are omitted -- 50 ms for the total residency time and 20% for the average run percentage (line 59). These threshold values are exactly the same as those used by the virtual_adrian_lite.se script [10] of the SE Toolkit to detect slow disks [11, 12].
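The threshold check that sits at the bottom of the loop sketched above might then look like this (again our own approximation; $opt_s and $opt_b stand for the parsed -s and -b arguments, and the exact output formatting of Listing 3 may differ):

use POSIX qw(strftime);
use Solaris::MapDev qw(inst_to_dev);

# Defaults borrowed from virtual_adrian_lite.se: 50 ms residency
# time and a 20% average run percentage.
my $svc_threshold  = defined $opt_s ? $opt_s : 50;
my $busy_threshold = defined $opt_b ? $opt_b : 20;

if ( $svc_t >= $svc_threshold && $r_pct >= $busy_threshold ) {
    # inst_to_dev maps an instance name such as "sd1" to the
    # conventional device name.
    printf "slow disk detected: %s %s %.2f %.2f\n",
        strftime( "%H:%M:%S", localtime ),
        inst_to_dev( $disk ), $svc_t, $r_pct;
}

# The current snapshot becomes the previous one for the next pass.
$prev->{$disk} = $c;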
When run on the ES 450 while heavy compress jobs are hitting
one of the file systems, the script yields the following results:
slow disk detected: 12:47:41 /dev/dsk/c0t0d0 51.11 73.16
slow disk detected: 12:50:37 /dev/dsk/c0t0d0 51.03 71.33
slow disk detected: 12:52:58 /dev/dsk/c0t0d0 51.32 72.10
slow disk detected: 12:54:08 /dev/dsk/c0t0d0 60.43 73.16
slow disk detected: 12:55:54 /dev/dsk/c0t0d0 60.82 68.77
Conclusion
Solaris::Kstat is one of the handiest and most exciting Perl modules that CPAN has to offer. It is not, however, without flaws -- its programming model, although fairly portable and easy to use, is relatively low level. To produce, for example, an equivalent of the vmstat utility, one would have to possess a deep knowledge of the system's internals, as virtual memory metrics are spread across multiple unrelated Kstat nodes. Also, as noted, Solaris::Kstat programming requires the developer to perform all the computations needed to obtain the average figures used for performance monitoring, which can get very involved. The SE Toolkit, on the other hand, significantly simplifies programming by providing high-level wrappers, or classes, such as vmstat, iostat, netstat, and others, that encapsulate all the calculations needed to obtain complete virtual memory, IO, or network metrics. It is fairly trivial to create similar wrappers for the Solaris::Kstat module; however, that is left as an exercise to the reader.
Another problem is the lack of reliable documentation for Kstats. The Solaris man pages are sketchy and incomplete, and Adrian Cockcroft's Sun Performance and Tuning [2] remains the most comprehensive source of information. To gain an intimate familiarity with the subject, I strongly encourage readers to examine the source code of the example scripts and classes of the SE Toolkit. The SymbEL programming language of the SE Toolkit is quite similar to C, so most of the example scripts can easily be understood and deciphered.
References
1. BMC Software. PATROL for Performance Management and Prediction. http://www.bmc.com/products/esm/perfpred.html.
2. Adrian Cockcroft and Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998, pp. 26-37.
3. William LeFebvre. UNIX Top. http://www.groupsys.com/top.
4. Walter Nielsen and Morgan Herrington. Proctool. ftp://opcom.sun.ca/pub/binaries/proctool.
5. Adrian Cockcroft and Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998, pp. 449-556.
6. Adrian Cockcroft and Richard Pettit. Sun Performance and Tuning, 2nd edition. Sun Microsystems Press, 1998, pp. 373-386.
7. kstat(3K) man page. Sun Solaris 2.
8. CPAN -- the Comprehensive Perl Archive Network. www.perl.com/CPAN-local.
9. Alan Burlison. CPAN directory ABURLISON. Latest release: Solaris-0.05a.tar.gz, 10/2/1999.
10. Adrian Cockcroft and Richard Pettit. virtual_adrian_lite.se. RICHPse/examples.
11. Adrian Cockcroft. System Performance Monitoring. SunWorld Online, 09/05/1995.
12. Adrian Cockcroft. Clarifying Disk Measurements and Terminology. UNIX Insider, 09/01/1997.
Alexander Golomshtok is a professional consultant who, for the last decade, has been hanging around downtown New York developing large-scale software systems and infrastructure solutions for Wall Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.
Yefim Nodelman is a seasoned systems administrator with more
than seven years of professional experience in supporting large
UNIX and Windows installations.