Solaris 8 Performance Tuning
James C. McPherson
When talking about tuning the performance of a UNIX system, sys
admins generally work through a number of concepts, such as the
current perceived performance, the difference between this level
and a desired level of performance, the optimal level claimed by
the application vendor, how to measure performance differences,
and finally, what to tune and test.
The current perceived level of performance is based on what your
users tell you and what you observe in your regular system monitoring.
In a networked environment, if you have some users on the end of
a 100-Mbit full-duplex connection, and others who are still using
10-Mbit half-duplex, then the 10-Mbit users will always tell you
that the system is slow. However, you might have to wait for a serious
application overload or resource crunch before the 100-Mbit users
complain. There are also copious logfiles and measurable kernel
statistics to facilitate system-level monitoring. Some sites also
make use of tools such as CA Unicenter TNG, HP OpenView, or Sun's
SunMC to monitor via SNMP MIBs. However, these are tools that must
be used by competent systems and application administrators and
should not be used alone.
The desired level of performance is somewhat different. Benchmarks
are sometimes less than useful, but if you are using an application
or system from a large vendor, that vendor should have a dedicated
benchmarking group that can replicate your environment and let you
test realistic loads. You can trust a benchmark done in this manner
because it is specifically tailored to your company's requirements.
If you don't have the time or resources to organize a tailored
benchmark, you'll have to spend extra time doing it yourself.
Before you get started, determine what measurements will be valid
for your environment and how to obtain them. Also, work with your
users in regard to when you can test tuning and monitoring. Keep
the users informed of what you are doing so that they can recognize
changes and appreciate the value that you provide to them.
How to Measure Before-and-After Performance Differences
To tune effectively, you must first decide what you want to achieve.
Here are some tips to consider:
- Determine your criteria for success.
- Log everything, and run the Sun Explorer data gatherer every time
you make changes so that you have a snapshot of your system configuration
at each point.
- Analyze the logs carefully.
- Measure the changed performance against what you wish to achieve.
- Be methodical.
If you do not already keep a record of system and resource loads,
and have no method of monitoring the applications on your system,
start simply. You can measure CPU, memory, and swap utilization, disk
utilization (specifically your application filesystems), and network
utilization using standard system tools. Solaris 8 provides the
kstat(1M) utility to view various system properties in a variety of
output formats, and df -k is also useful. You can write scripts to
dump these statistics to a file on a periodic basis, or use tools such
as Apache and MRTG [1] to give you a regularly refreshed view. A little
bit of shell, awk, sed, or Perl will give you lots of numbers to graph.
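For example, a minimal snapshot script along the following lines (the
logfile path and sampling intervals are only placeholders, and you would
normally drive it from cron) is enough to start building a history:

#!/bin/sh
# perfsnap.sh -- append a timestamped snapshot of basic system
# statistics to a logfile; intended to be run from cron.
LOG=/var/tmp/perfsnap.log

{
  echo "==== `date` ===="
  # the second vmstat/iostat sample is the interesting one; the
  # first reports averages since boot
  vmstat 5 2
  iostat -xn 5 2
  netstat -i
  df -k
} >> $LOG 2>&1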
The most useful monitoring of performance requires frequent and
regular snapshots of your system for a period of time that you,
your application admins, and users decide is long enough. My rule
of thumb is to accrue not less than eight days of monitoring data
(preferably including a peak load period) so that you have obvious
peaks and troughs to look at. Another aspect of measurement is to
survey the users both before and after to determine how they believe
performance has changed in the course of the monitoring period.
If your performance tuning has resulted in differences that the
users cannot perceive, then I would say the performance hasn't
really been tuned. Useful questions to ask include:
- How long did it take to perform a standard task (measured in
seconds) at 9 a.m., 12 noon, 2 p.m., and 5 p.m. before you started tuning?
- How long does this task take now?
- Do you think that the performance has (improved, stayed the
same, become worse) over the past (X) days? (At least a seven-point
scale is recommended for this question).
You, the sys admin, also need to measure these standard tasks
yourself so that you have a frame of reference for your measurements.
What to Tune and When to Tune It
I/O and Buffers (Including VxVM and VxFS)
The most common target of performance tuning is obviously I/O,
and in Solaris there are several buffers and tunables that we can
work on. You are probably aware of the relative speeds of disks
versus RAM, and how this difference in speed has changed dramatically
over the past 20 years. This is particularly important when configuring
"virtual memory" for your server because of the way Solaris
implements paging [2]. A primary concern here is to avoid paging data
out to your slow disks (even a 10,000-rpm FC-AL disk is still slow) and
to ensure that the data stays in RAM until it is sent down the SCSI
or TCP stack to the next appropriate point.
How does one prevent paging? This question requires an appreciation
of "good" and "bad" paging. So-called "good"
paging happens when the system allocates or reclaims pages from
a process, whereas "bad" paging is when this allocation
relies upon a disk device and the system incurs a penalty for access.
So, we don't particularly care about the "good" paging,
because the intelligence built into the kernel's paging algorithms
ensures that it happens as infrequently as possible. The priority_paging
setting is necessary for Solaris 2.6 and Solaris 7. Neither Solaris
8 nor 9 requires it, because a modified version of priority_paging,
called the cyclic page cache, was integrated into the new kernel.
"Bad" paging is also known as swapping, and in extreme
circumstances can result in such a high I/O load that your I/O subsystem
is said to be thrashing.
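A simple way to watch for this from the command line is to keep an eye
on the page scanner and swap statistics; this is only a quick check,
not a substitute for proper monitoring:

# a sustained non-zero "sr" (scan rate) column, combined with
# significant "po" (page-out) activity, is the classic symptom of
# the "bad" paging described above
vmstat 5

# summarize allocated and reserved virtual swap
swap -s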
On the Solaris side, you must be aware of two tunables: ncsize
and lotsfree. If you are running Veritas
Volume Manager (VxVM), then you also need to be aware of vxio:vol_maxio
and friends. If you are running the Veritas FileSystem (VxFS), you must
carefully tune vxfs:vxfs_ninode and vxfs:vx_bc_bufhwm.
I will address each of these variables in turn.
ncsize
The "nc" in ncsize refers to the Name Cache,
and is used to set the size of the Directory Name Lookup Cache,
or dnlc for short. This is an optimization to give you better performance
from your filesystems because inodes are cached in memory and only
flushed out of the cache to disk if they have been idle for quite
a while. The default setting for ncsize depends on two other
variables, max_nprocs and maxusers, with the following
relationships:
ncsize = (4 * (max_nprocs+maxusers)) + 320
max_nprocs = 10 + (16 * maxusers)
maxusers = physmem - 2
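As a worked example using the relationships above, if maxusers sits at
its ceiling of 2048, then:

max_nprocs = 10 + (16 * 2048) = 32778
ncsize = (4 * (32778 + 2048)) + 320 = 139624

which is in the same region as the system-calculated default of 139488
quoted for the E420R later in this section.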
Setting maxusers to less than the maximum possible value of
2048 used to be a required configuration task going back to SunOS
4.x. However, Solaris 2.x does not require maxusers to be set,
and you can actually decrease the performance of your system by doing
so. If maxusers is set to the system default, you might not
have a very large dnlc. The recommendation is first to set maxusers
to 2048, reboot, and after about a week of average usage, examine
the kernel statistics on the dnlc (the kstat name is dnlcstats).
Under both Solaris 8 and 9, you can use the kstat(1M) utility to
retrieve them. When looking at these statistics, note the
value of misses/hits and see how close that ratio is to 0. The closer
it is to 0, the less you need to worry about tuning ncsize.
For example, my workstation (an Ultra 80) has been up for about 36.5
days and produces this output from kstat -m unix -n dnlcstats:
module: unix    instance: 0
name:   dnlcstats    class: misc
        crtime                 63.090334184
        ...
        hits                   141141001
        misses                 1953220
        negative_cache_hits    2070966
        pick_free              802468
        pick_heuristic         1102434
        pick_last              257563
        ...
        snaptime               3162130.15322328
In my case, the miss/hit ratio works out to be about 0.0138, meaning
that my dnlc efficiency is very good. On a major fileserver that I use,
the values are considerably different:
module: unix    instance: 0
name:   dnlcstats    class: misc
        crtime                 180.496237386
        ...
        hits                   313301011
        misses                 77458379
        negative_cache_hits    14429400
        pick_free              3664314
        pick_heuristic         51820181
        pick_last              24067557
        snaptime               6143454.64953917
Here, the ratio works out to be 0.2472, meaning that roughly one lookup
in five misses the dnlc on this server, so ncsize should be tuned
upwards. For the system above, I would recommend that the dnlc size be
doubled. It is currently set to the system-calculated default of 139488,
which for this E420R running in 64-bit mode takes up slightly more than
8.7 MB of kernel memory [3]. The system can easily afford double this
amount of kernel memory for the dnlc, given its workload.
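A quick way to calculate the miss/hit ratio without reading the raw
counters yourself is a small script like the following (a sketch only;
the script name is arbitrary):

#!/bin/sh
# dnlcratio.sh -- print the dnlc miss/hit ratio from kstat(1M);
# the closer the result is to 0, the less ncsize needs attention
kstat -p -m unix -n dnlcstats | awk '
  $1 ~ /:hits$/   { hits = $2 }
  $1 ~ /:misses$/ { misses = $2 }
  END { if (hits > 0) printf "miss/hit ratio: %.5f\n", misses / hits }'

If the ratio points to a dnlc that is too small, the new size goes into
/etc/system (for example, set ncsize=278976 to double the default quoted
above) and takes effect after a reboot.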
lotsfree
This variable is used as the boundary condition for when to invoke
the page scanner and make it look for pages to free. Typically,
this is set to 1/64th the number of physical pages in your system,
which works reasonably well. However, as the Solaris Tunable Kernel
Parameters Guide indicates, if your system load is such that it cannot cope
with sudden sharp increases in demand for memory, then you should
seriously consider increasing this value. A general rule I've
seen recommended is to set lotsfree to be 1/16th of the physmem
value, rather than the default 1/64th. This allows the page scanner
to activate in a more timely manner. When combined with priority
paging or the cyclic page cache of Solaris 8, you should see that
the memory load curve of your system is much smoother than before.
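To make the arithmetic concrete, here is an illustrative /etc/system
entry (the value assumes a 1-GB UltraSPARC system with 8-KB pages;
lotsfree is specified in pages, so substitute your own physmem figure):

* illustrative only: 1/16th of physmem on a 1-GB system
* (131072 pages of 8 KB, divided by 16)
set lotsfree=8192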
Veritas FileSystem Variables -- vxfs:vxfs_ninode and
vxfs:vx_bc_bufhwm
These two variables are particularly important to tune, and they
have an interdependency with ncsize that is related to their
usage of kernel memory. When you use the Veritas FileSystem (VxFS),
you must tune ncsize to be between 50% and 80% of the value
of vxfs:vxfs_ninode in order to achieve good performance.
Good performance in this context means that the performance curve
of CPU cycles and memory used stays relatively close to the performance
curve algorithm that Veritas builds into the product.
The upper limit on the amount of kernel memory that VxFS will
allocate for its cache is set with vxfs:vx_bc_bufhwm --
the Buffer Cache's Buffer High Water Mark. Once the allocated
amount reaches this limit, VxFS inodes are flushed from the cache.
The vxfs:vxfs_ninode variable is the limit on the number
of VxFS inode structures held in memory. This number is usually
determined to be 125% of the value of ncsize for reasons
of CPU cycle efficiency, and if you have your dnlc set by way of
setting maxusers to 2048, then ncsize/vxfs:vxfs_ninode
is approximately 0.78. This is very close to the 80% figure mentioned
in the VxFS installation guide.
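As an illustration only, the corresponding /etc/system entries might
look like the following; the absolute numbers are simply the E420R
ncsize default quoted earlier and 125% of it, not recommendations for
your system:

* illustrative values: keep ncsize between 50% and 80% of vxfs_ninode
set ncsize=139488
set vxfs:vxfs_ninode=174360
* cap the VxFS buffer cache as well; consult the VxFS documentation
* for the units and a value suited to your memory configuration
* set vxfs:vx_bc_bufhwm=<value>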
Veritas Volume Manager Variables
Many sites use Veritas Volume Manager (VxVM) to provide data protection
and enhance their systems' performance. Mirrored filesystems,
striped data filesystems, and RAID-5 filesystems are all features
that VxVM provides. If you use VxVM as well as VxFS, then you should
look at vxio:vol_maxio. This variable controls the maximum
size of I/O requests that are sent down the SCSI chain without breaking
the request up. Veritas recommends that this tunable not exceed
20% of kernel memory or physical memory (whichever is smaller),
and that you match this tunable to the size of your widest stripe.
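An illustrative /etc/system entry, assuming the value is expressed in
512-byte sectors as the VxVM documentation describes, might be:

* illustrative only: allow up to 1 MB per request (2048 sectors
* of 512 bytes); match this to your widest stripe
set vxio:vol_maxio=2048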
Apart from this tunable, there are no others that must be tuned,
and you should only really look at tuning the variables specified
in the Veritas Volume Manager Administrator Guide if you are directed
to by an appropriate technical contact within Veritas or a Veritas
partner. This is because Veritas, like Sun with Solaris, has spent
a lot of time and effort in making sure that the self-tuning algorithms
work well for the vast majority of systems. If you are fortunate
enough to have the sort of large installation where serious VxVM
tuning is necessary, then you should engage Veritas or Sun Professional
Services to analyze and tune your configuration.
Shared Memory, Semaphores, Message Queues
Applications such as RDBMS engines (Oracle, Sybase, Informix,
DB2, etc.), middleware like MQSeries and Tuxedo, and some backup
packages (Veritas NetBackup) make heavy use of shared memory, semaphore
sets, and message queues in order to maximise their performance.
For configuring these settings, you should always start with the
application vendor's recommendations, and find out from the
vendor how they log a deficiency in these settings. Veritas NetBackup,
for example, will dump messages into its logfiles such as:
  waited for empty buffer X times
  waited for full buffer Y times
which, in conjunction with the NetBackup Troubleshooting Guide, will
indicate whether you need more semaphores or message queues.
Remember that the kernel will not allow more than 25% of the dedicated
kernel space (segkp) to be allocated for the shared memory, message
queue, and semaphore structures. So, if you experiment with large
values on a machine without much physical memory, you may see different
values when you check with the sysdef(1M) utility after rebooting.
For my workstation (see the "System Specification File"
sidebar), the kernel memory usage under Solaris 8 (64-bit mode)
for the example semaphore settings is approximately 15,898 KB, for
the message queues approximately 1,038 KB, and for the shared memory
itself, approximately 18 KB. You might be wondering why the overhead
for shared memory (shmsys) settings is so small compared with the
semaphore (semsys) and message queue (msgsys). That is because the
only variable used for administrative purposes with shmsys is shminfo_shmmni.
This is the maximum number of shmid_ds structures in the
system, each of which is 88 bytes.
The rule of thumb for shared memory, semaphores, and message queues
is to tune these in conjunction with your application vendor, because
the vendor will have concrete customer data and can advise you appropriately.
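By way of illustration only (these figures are not recommendations,
and your application vendor's numbers take precedence), a set of
/etc/system entries for these facilities typically looks something
like this:

* illustrative IPC settings -- start from your vendor's figures
set shmsys:shminfo_shmmax=4294967295
set shmsys:shminfo_shmmni=100
set semsys:seminfo_semmni=100
set semsys:seminfo_semmsl=256
set semsys:seminfo_semmns=1024
set msgsys:msginfo_msgmni=50

After the reboot, compare these requests with what the kernel actually
accepted by checking the IPC sections of the sysdef(1M) output, as
noted above.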
SCSI Tunables
sd_max_throttle and sd_io_time
Two very common tuning targets are those that relate to the SCSI
disk (sd) driver module: sd_max_throttle and sd_io_time.
You commonly see these set (or need to set them) if you are using
EMC, IBM, or Hitachi storage attached to your Solaris system. The
sd_io_time variable is the limiter on how long an I/O can
be outstanding before an error condition is returned. The Solaris
default is 60 seconds (0x3c), but this is often set to 31 seconds
(0x1f). The variable sd_max_throttle provides the limit on
how many outstanding I/Os the system can handle at any one time
and is commonly referred to as the "queue depth." A common
setting for this (the default is 256, or 0x100) is 25, which is the
mandated setting from JNI Corporation for use with their fcaw driver [4]
and EMC [5] or Hitachi [6] storage.
The general recommendation for both these variables, however,
is that if you are not required to set them by your storage vendor,
then leave them unset in your /etc/system file and let Solaris handle
the settings for you. This is because these variables cannot be
set on a per-LUN or per-instance basis: any change made to these
two variables affects the entire SCSI subsystem. If you set them
to values greater than or much less than your storage subsystem
can handle, you run a very real risk of having a badly performing
system for disk, tape, and memory operations. Another important
aspect of tuning the sd driver is that Sun's disk storage attaches
using either the sd driver or, when attached via the Fibre Channel
(FC) protocol, the ssd driver. This allows the two stacks to be tuned
separately, although that could well change in a later version
of Solaris.
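For reference, the /etc/system entries involved look like this; the
values shown are the vendor-mandated figures mentioned above, and
should only be used if your storage vendor requires them:

* only set these if your storage vendor requires specific values
set sd:sd_max_throttle=25
set sd:sd_io_time=0x1f
* FC-attached Sun storage uses the ssd driver, which has its own
* equivalents (ssd_max_throttle and ssd_io_time)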
maxphys
The maxphys setting, often seen in conjunction with JNI and Emulex
HBAs, is the upper limit on the largest chunk of data that can be
sent down the SCSI path for any single request. There are no real
issues with increasing the value of this variable to 8 MB (in /etc/system,
set maxphys=8388608), as long as your I/O subsystem can handle
it. All current Fibre Channel adapters are capable of supporting
this, as are most Ultra/Wide SCSI HBAs, such as those from Sun,
Adaptec, QLogic, and Tekram. It is possible (although I have not
yet tested it) to set this variable in an (E)IDE-based system, such
as a PC running Solaris for Intel, a Sun Ultra 5/10/Blade 100, or
the lower end Netra systems. With the current range of (E)IDE disks
and at least an ATA-66 interface, the system should be able to support
this value for maxphys.
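If you want to confirm what the running kernel is actually using, one
quick check (a sketch, run as root) is to ask the kernel debugger for
the value in decimal:

# print the in-kernel value of maxphys
echo "maxphys/D" | mdb -k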
Networking Tunables
At a former employer, I worked closely with the DBAs on a particular
system running a financial management application. After monitoring
the system for several weeks, the first step in tuning was to get
a 100-Mbit switched interface activated. When this 100FDX interface
was connected, we noticed an immediate improvement in system performance:
less swap in use due to less buffering of data, fewer "wasted"
cpu cycles, and a much happier group of users. The change was so
significant that we put off implementing the rest of the tuning
while we analyzed whether our plan was still relevant. We did not
need to do any specific configuration on our server because the
switch handled it.
If you need to force your interface, there are several simple
ways to do this. To begin, you must be absolutely certain that your
Ethernet interface is cabled into a port on a 100-Mbit Ethernet switch;
otherwise, you will not get any response from your network connection.
Let's look at the two most common methods: an rc script and
editing /etc/system.
The boot-time rc script allows you to specify which instance of
the interface you want to change the settings for. In the example
below, I am setting the properties for my qfe3 interface:
#!/bin/sh
#
# script to force the interface properties for qfe3
# script name is /etc/rc2.d/S50ndd_qfe3
#
ndd -set /dev/qfe instance 3
# force OFF 100Mb half duplex
ndd -set /dev/qfe adv_100hdx_cap 0
# force OFF 100Mb T4
ndd -set /dev/qfe adv_100T4_cap 0
# force ON 100Mb full duplex
ndd -set /dev/qfe adv_100fdx_cap 1
# force OFF autonegotiation (FORCE mode)
ndd -set /dev/qfe adv_autoneg_cap 0
# end of script
Here's the /etc/system modification method:
set qfe:adv_100hdx_cap=0
set qfe:adv_100T4_cap=0
set qfe:adv_100fdx_cap=1
set qfe:adv_autoneg_cap=0
set qfe:adv_10hdx_cap=0
This method sets all of your qfe interfaces to operate at 100FDX,
and if that's what your switch is configured to do, then you
are ready to reboot and enjoy the benefits.
Of the two methods, I recommend using a boot-time rc script. It's
easier to maintain if you write it correctly (because it uses the
Bourne shell), and you don't have to test by way of a reboot,
because a quick unplumb/plumb of the interface followed by ndd(1M)
allows you to make changes while your system is up and running.
The /etc/system method is also useful, but given that you cannot
really tune each instance separately using this method, it may not
be appropriate for every site.
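Whichever method you choose, it is worth checking what the interface
actually ended up with; the following sketch queries the qfe3 interface
used in the earlier example (link_speed reports 0 for 10 Mbit and 1 for
100 Mbit, and link_mode reports 0 for half duplex and 1 for full duplex):

#!/bin/sh
# report the current link state of qfe3 without rebooting
ndd -set /dev/qfe instance 3
ndd -get /dev/qfe link_status   # 1 = link up, 0 = link down
ndd -get /dev/qfe link_speed    # 1 = 100 Mbit, 0 = 10 Mbit
ndd -get /dev/qfe link_mode     # 1 = full duplex, 0 = half duplex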
References
Solaris Internals. Richard McDougall and Jim Mauro. Sun
Microsystems Press, 2000. ISBN 0-13-022496-0.
Solaris Tunable Kernel Parameters Guide. Sun Microsystems.
Sun Microsystems Press, 2000. Part number 806-4015.
Sun Performance and Tuning: Java and the Internet, 2nd Edition.
Adrian Cockcroft and Richard Pettit. Sun Microsystems Press, 1998.
ISBN 0-13-095249-4.
1. http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
2. For more information on the implementation, read Solaris
Internals and Sun Performance and Tuning.
3. The Tunable Parameters guide says that each dnlc entry takes 36 bytes
in 32-bit mode and 64 bytes in 64-bit mode.
4. http://www.jni.com/drivers
5. http://www.emc.com
6. http://www.hds.com and http://www.sun.com/storage/highend for
the Sun StorEdge 9910.
James McPherson is a CPR Engineer for Sun Microsystems based
in Sydney, Australia. When not wandering around the Solaris kernel,
he and his wife like to race their NS14 dinghy around Sydney Harbour.