Solaris™ 8 Performance Tuning
James C. McPherson
              When talking about tuning the performance of a UNIX system, sys 
              admins generally work through a number of concepts, such as the 
              current perceived performance, the difference between this level 
              and a desired level of performance, the optimal level claimed by 
              the application vendor, how to measure performance differences, 
              and finally, what to tune and test.
              The current perceived level of performance is based on what your 
              users tell you and what you observe in your regular system monitoring. 
              In a networked environment, if you have some users on the end of 
              a 100-Mbit full-duplex connection, and others who are still using 
              10-Mbit half-duplex, then the 10-Mbit users will always tell you 
              that the system is slow. However, you might have to wait for a serious 
              application overload or resource crunch before the 100-Mbit users 
              complain. There are also copious logfiles and measurable kernel 
              statistics to facilitate system-level monitoring. Some sites also 
make use of tools such as CA Unicenter TNG, HP OpenView, or Sun's
SunMC to monitor via SNMP MIBs. However, these tools must be driven
by competent system and application administrators and should not
be relied on alone.
              The desired level of performance is somewhat different. Benchmarks 
are sometimes less than useful, but if you are using an application
or system from a large vendor, that vendor should have a dedicated
benchmarking group that will replicate your environment and allow you to test
              realistic loads. You can trust a benchmark done in this manner because 
              it is specifically tailored to your company's requirements. 
              If you don't have the time or resources to organize a tailored 
              benchmark, you'll have to spend extra time doing it yourself. 
              Before you get started, determine what measurements will be valid 
              for your environment and how to obtain them. Also, work with your 
              users in regard to when you can test tuning and monitoring. Keep 
              the users informed of what you are doing so that they can recognize 
              changes and appreciate the value that you provide to them.
How to Measure Before-and-After Performance Differences
              To tune effectively, you must first decide what you want to achieve. 
              Here are some tips to consider:
              
              
             
- Determine your criteria for success.
- Log everything, and run the Sun Explorer data gatherer every time you make changes, so that you keep a snapshot of your system configuration at each point (a simple snapshot sketch follows this list).
- Analyze the logs carefully.
- Measure the changed performance against what you wish to achieve.
- Be methodical.
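If Explorer is not available, even a crude snapshot script covers the essentials. The following is only a sketch, using nothing beyond standard Solaris commands (prtconf, prtdiag, sysdef, cp); the /var/tmp/tunelog directory is an arbitrary choice:

#!/bin/sh
# Sketch only: keep a dated snapshot of the configuration before and
# after each tuning change.
SNAPDIR=/var/tmp/tunelog/`date +%Y%m%d.%H%M`
mkdir -p $SNAPDIR
cp /etc/system $SNAPDIR/system
/usr/sbin/prtconf -v                     > $SNAPDIR/prtconf.out 2>&1
/usr/platform/`uname -i`/sbin/prtdiag -v > $SNAPDIR/prtdiag.out 2>&1
sysdef                                   > $SNAPDIR/sysdef.out  2>&1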
              If you do not already keep a record of system and resource loads, 
              and have no method of monitoring the applications on your system, 
start simply. You can measure CPU/memory/swap utilization, disk
utilization (specifically your application filesystems), and network
utilization using standard system tools. Solaris 8 allows you to
use the kstat(1M) utility to view various system properties
in a variety of output formats. df -k is also useful. You
can write scripts to dump these statistics to a file on a periodic
basis, or use tools such as Apache and MRTG [1] to give you a regularly
refreshed view. A little bit of shell, awk, sed, or Perl will give
you lots of numbers to graph.
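As a starting point, a collector along the following lines is enough. It is only a sketch, using standard tools (vmstat, iostat, netstat, df, swap) and an arbitrary log directory, and is meant to be run from cron at whatever interval your site agrees on:

#!/bin/sh
# Minimal statistics collector -- a sketch, not a finished tool.
# Run from cron (e.g., every 10 minutes); graph the results later
# with awk/Perl or feed them to MRTG.
LOGDIR=/var/adm/perflogs            # arbitrary location
STAMP=`date '+%Y%m%d %H:%M:%S'`
mkdir -p $LOGDIR
{ echo "== $STAMP"; vmstat 5 2; }        >> $LOGDIR/vmstat.log
{ echo "== $STAMP"; iostat -xn 5 2; }    >> $LOGDIR/iostat.log
{ echo "== $STAMP"; netstat -i; }        >> $LOGDIR/netstat.log
{ echo "== $STAMP"; df -k; }             >> $LOGDIR/df.log
{ echo "== $STAMP"; /usr/sbin/swap -s; } >> $LOGDIR/swap.log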
              The most useful monitoring of performance requires frequent and 
              regular snapshots of your system for a period of time that you, 
              your application admins, and users decide is long enough. My rule 
              of thumb is to accrue not less than eight days of monitoring data 
              (preferably including a peak load period) so that you have obvious 
              peaks and troughs to look at. Another aspect of measurement is to 
              survey the users both before and after to determine how they believe 
              performance has changed in the course of the monitoring period. 
              If your performance tuning has resulted in differences that the 
              users cannot perceive, then I would say the performance hasn't 
              really been tuned. Useful questions to ask include:
              
              
             
- How long did it take to perform a standard task (measured in seconds) at 9 a.m., 12 noon, 2 p.m., and 5 p.m. before you started tuning?
- How long does this task take now?
- Do you think that the performance has improved, stayed the same, or become worse over the past X days? (At least a seven-point scale is recommended for this question.)
              You, the sys admin, also need to measure these standard tasks 
              yourself so that you have a frame of reference for your measurements.
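One way to get comparable numbers is to time a representative task from cron at the agreed-upon hours. The sketch below uses /usr/bin/time and a placeholder task (a recursive ls of a hypothetical application filesystem, /export/app); substitute whatever your users actually do:

#!/bin/sh
# Sketch: time a representative user task and append the result to a
# log.  Schedule this from cron at 9:00, 12:00, 14:00, and 17:00.
LOG=/var/adm/perflogs/task_times.log
echo "`date '+%Y%m%d %H:%M:%S'`" >> $LOG
/usr/bin/time ls -lR /export/app > /dev/null 2>> $LOG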
              What to Tune and When to Tune It
              I/O and Buffers (Including VxVM and VxFS)
              The most common target of performance tuning is obviously I/O, 
              and in Solaris there are several buffers and tunables that we can 
work on. You are probably aware of the relative speeds of disks
versus RAM, and how this difference in speed has changed dramatically
over the past 20 years. This is particularly important when configuring
"virtual memory" for your server because of the way Solaris
implements paging [2]. A primary concern here is to avoid paging data
to your slow disk (even a 10,000-rpm FC-AL disk is still slow) and
to ensure that the data stays in RAM until it is sent down the SCSI
or TCP stack to the next appropriate point.
              How does one prevent paging? This question requires an appreciation 
              of "good" and "bad" paging. So-called "good" 
              paging happens when the system allocates or reclaims pages from 
              a process, whereas "bad" paging is when this allocation 
              relies upon a disk device and the system incurs a penalty for access. 
So, we don't particularly care about the "good" paging,
because the intelligence built into the kernel's paging algorithms
ensures that it happens as infrequently as possible. The priority_paging
setting is necessary for Solaris 2.6 and Solaris 7. Neither Solaris
8 nor 9 requires it, because a modified version of priority_paging,
called the cyclic page cache, was integrated into the new kernel.
              "Bad" paging is also known as swapping, and in extreme 
              circumstances can result in such a high I/O load that your I/O subsystem 
              is said to be thrashing.
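Before adjusting anything, confirm that you really are seeing "bad" paging. A quick check with the standard tools might look like this (intervals are arbitrary):

#!/bin/sh
# Sketch: look for evidence of "bad" paging.  A scan rate (the sr
# column) that stays non-zero while free memory is low means the page
# scanner is working hard and application pages are likely going to disk.
vmstat 5 12
# sar -g reports page-out and scan activity over a similar interval.
sar -g 5 12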
              You must specifically be aware of two tunables: ncsize 
and lotsfree on the Solaris side. If you are running Veritas
Volume Manager (VxVM), then you need to be aware of vxio:vol_maxio
and friends. If you are running Veritas FileSystem (VxFS), you must
carefully tune vxfs:vxfs_ninode and vxfs:vx_bc_bufhwm.
I will address each of these variables in turn.
              ncsize
              The "nc" in ncsize refers to the Name Cache, 
              and is used to set the size of the Directory Name Lookup Cache, 
              or dnlc for short. This is an optimization to give you better performance 
              from your filesystems because inodes are cached in memory and only 
              flushed out of the cache to disk if they have been idle for quite 
              a while. The default setting for ncsize depends on two other 
              variables, max_nprocs and maxusers with the following 
              relationship:
              
             
ncsize      =  (4 * (max_nprocs+maxusers)) + 320
max_nprocs  =  10 + (16 * maxusers)
maxusers    =  physmem - 2
Setting maxusers to less than the maximum possible value of 
            2048 used to be a required configuration task going back to SunOS 
            4.x. However, Solaris 2.x does not require maxusers to be set, 
            and you can actually decrease the performance of your system by doing 
            so. If maxusers is set to the system default, you might not 
have a very large dnlc. The recommendation is first to set maxusers
to 2048, reboot, and, after about a week of average usage, examine
the kernel statistics on the dnlc (in the full kernel statistics dump,
scroll forward to the section starting with dnlcstats). Under both
Solaris 8 and 9, you can use the kstat(1M) utility. When looking at
these statistics, note the ratio of misses to hits and see how close
it is to 0. The closer it is to 0, the less you need to worry about
tuning ncsize. For example, my workstation (an Ultra 80) has been up
for about 36.5 days, and gives this output from kstat -m unix -n dnlcstats:
             
module: unix                        instance: 0
name:   dnlcstats                   class:    misc
        crtime                      63.090334184
        ...
        hits                        141141001
        misses                      1953220
        negative_cache_hits         2070966
        pick_free                   802468
        pick_heuristic              1102434
        pick_last                   257563
        ...
        snaptime                    3162130.15322328
In my case, this works out to be 0.01383, meaning that my dnlc efficiency 
            is very good. On a major fileserver that I use, the values are considerably 
            different:  
             
module: unix                        instance: 0
name:   dnlcstats                   class:    misc
        crtime                      180.496237386
        ...
        hits                        313301011
        misses                      77458379
        negative_cache_hits         14429400
        pick_free                   3664314
        pick_heuristic              51820181
        pick_last                   24067557
        snaptime                    6143454.64953917
Here, the ratio works out to be 0.2472, indicating only about 75%
dnlc efficiency for this server, so ncsize should be tuned upwards.
For the system above, I would recommend that the dnlc size be doubled.
It is currently set to the system-calculated default of 139488, which
for this E420R running in 64-bit mode takes up slightly more than
8.7 MB of kernel memory [3]. The system can easily afford double this
amount of kernel memory for the dnlc, given its workload.
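To avoid doing that arithmetic by hand, a short pipeline such as the following sketch (using nawk and the hits and misses fields shown in the output above) reports the ratio directly:

#!/bin/sh
# Sketch: report the dnlc miss/hit ratio straight from kstat(1M).
kstat -m unix -n dnlcstats | nawk '
        $1 == "hits"   { hits   = $2 }
        $1 == "misses" { misses = $2 }
        END { printf("dnlc miss/hit ratio: %.5f\n", misses / hits) }'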
lotsfree
This variable is used as the boundary condition for when to invoke
              the page scanner and make it look for pages to free. Typically, 
              this is set to 1/64th the number of physical pages in your system, 
              which works reasonably well. However, as the Kernel Tunable Parameters 
              Manual indicates, if your system load is such that it cannot cope 
              with sudden sharp increases in demand for memory, then you should 
              seriously consider increasing this value. A general rule I've 
              seen recommended is to set lotsfree to be 1/16th of the physmem 
              value, rather than the default 1/64th. This allows the page scanner 
              to activate in a more timely manner. When combined with priority 
              paging or the cyclic page cache of Solaris 8, you should see that 
              the memory load curve of your system is much smoother than before.
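If you want to see what 1/16th of physical memory works out to before committing to a change, a sketch like the one below will do. It assumes prtconf(1M) reports a "Memory size: N Megabytes" line and that pagesize(1) gives the base page size; the real physmem value will be slightly lower once the kernel's own pages are subtracted.

#!/bin/sh
# Sketch: print the default (1/64) and suggested (1/16) lotsfree
# values for this system, in pages.
PAGESIZE=`pagesize`
MEM_MB=`prtconf 2>/dev/null | nawk '/^Memory size/ {print $3}'`
PAGES_PER_MB=`expr 1048576 / $PAGESIZE`
PHYS_PAGES=`expr $MEM_MB \* $PAGES_PER_MB`
echo "physical pages:       $PHYS_PAGES"
echo "default lotsfree:     `expr $PHYS_PAGES / 64`"
echo "candidate lotsfree:   `expr $PHYS_PAGES / 16`"
# To apply the new value, add a line like the following to /etc/system
# and reboot (the number is an example only):
#   set lotsfree=<candidate value>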
              Veritas FileSystem Variables -- vxfs:vxfs_ninode and 
              vxfs:vx_bc_bufhwm
These two variables are particularly important to tune; they
have an interdependency with ncsize that is related to their
usage of kernel memory. When you use Veritas FileSystem (VxFS),
you must tune ncsize to between 50% and 80% of the value
of vxfs:vxfs_ninode in order to achieve good performance.
Good performance in this context means that the performance curve
of CPU cycles and memory used stays relatively close to the curve
that Veritas builds into the product.
              The upper limit on the amount of kernel memory that VxFS will 
              allocate for its cache is set with vxfs:vx_bc_bufhwm -- 
              the Buffer Cache's Buffer High Water Mark. Once the allocated 
              amount reaches this limit, VxFS inodes are flushed from the cache. 
              The vxfs:vxfs_ninode variable is the limit on the number 
              of VxFS inode structures held in memory. This number is usually 
              determined to be 125% of the value of ncsize for reasons 
              of CPU cycle efficiency, and if you have your dnlc set by way of 
              setting maxusers to 2048, then ncsize/vxfs:vxfs_ninode 
              is approximately 0.78. This is very close to the 80% figure mentioned 
              in the VxFS installation guide.
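As a concrete illustration of that relationship, the following sketch plugs maxusers=2048 into the formulas quoted earlier and then sizes vxfs:vxfs_ninode at roughly 125% of ncsize; the numbers are illustrative only, not a recommendation:

#!/bin/sh
# Sketch: derive ncsize from maxusers using the relationships above,
# then a candidate vxfs:vxfs_ninode at roughly 125% of ncsize.
MAXUSERS=2048                     # assumes maxusers=2048 in /etc/system
MAX_NPROCS=`expr 10 + 16 \* $MAXUSERS`
SUM=`expr $MAX_NPROCS + $MAXUSERS`
NCSIZE=`expr 4 \* $SUM + 320`
VXFS_NINODE=`expr $NCSIZE \* 125 / 100`
echo "ncsize           = $NCSIZE"
echo "vxfs:vxfs_ninode = $VXFS_NINODE"
echo $NCSIZE $VXFS_NINODE | nawk '{printf("ncsize/ninode    = %.2f\n", $1/$2)}'
# The matching /etc/system entries would look like (values illustrative):
#   set ncsize=<ncsize>
#   set vxfs:vxfs_ninode=<ninode>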
              Veritas Volume Manager Variables
              Many sites use Veritas Volume Manager (VxVM) to provide data protection 
              and enhance their systems' performance. Mirrored filesystems, 
              striped data filesystems, and RAID-5 filesystems are all features 
              that VxVM provides. If you use VxVM as well as VxFS, then you should 
look at vxio:vol_maxio. This variable controls the maximum
size of I/O requests that are sent down the SCSI chain without
being broken up. Veritas recommends that this tunable not exceed
20% of kernel memory or physical memory (whichever is smaller),
and that you match this tunable to the size of your widest stripe.
              Apart from this tunable, there are no others that must be tuned, 
              and you should only really look at tuning the variables specified 
              in the Veritas Volume Manager Administrator Guide if you are directed 
              to by an appropriate technical contact within Veritas or a Veritas 
              partner. This is because Veritas, like Sun with Solaris, has spent 
              a lot of time and effort in making sure that the self-tuning algorithms 
              work well for the vast majority of systems. If you are fortunate 
              enough to have the sort of large installation where serious VxVM 
              tuning is necessary, then you should engage Veritas or Sun Professional 
              Services to analyze and tune your configuration.
              Shared Memory, Semaphores, Message Queues
Applications such as RDBMS engines (Oracle, Sybase, Informix,
DB2, etc.), middleware like MQSeries and Tuxedo, and some backup
packages (Veritas NetBackup) make heavy use of shared memory, semaphore
sets, and message queues in order to maximize their performance.
              For configuring these settings, you should always start with the 
              application vendor's recommendations, and find out from the 
              vendor how they log a deficiency in these settings. Veritas NetBackup, 
              for example, will dump messages in logfiles like:
              
             
waited for empty buffer X times
waited for full buffer Y times
which, in conjunction with the NetBackup Troubleshooting Guide, will
indicate whether you need more semaphores or message queues.

Remember that the kernel will not allow more than 25% of the dedicated
              kernel space (segkp) to be allocated for the shared memory, message 
              queue, and semaphore structures. So, if you experiment with large 
              values on a machine without much physical memory, you may see different 
              values when you check with the sysdef(1M) utility after rebooting. 
              For my workstation (see the "System Specification File" 
              sidebar), the kernel memory usage under Solaris 8 (64-bit mode) 
for the example semaphore settings is approximately 15898 KB, for
the message queues approximately 1038 KB, and for the shared memory
itself, approximately 18 KB. You might be wondering why the overhead
              for shared memory (shmsys) settings is so small compared with the 
              semaphore (semsys) and message queue (msgsys). That is because the 
              only variable used for administrative purposes with shmsys is shminfo_shmmni. 
              This is the maximum number of shmid_ds structures in the 
              system, each of which is 88 bytes.
              The rule of thumb for shared memory, semaphores, and message queues 
              is to tune these in conjunction with your application vendor, because 
              the vendor will have concrete customer data and can advise you appropriately.
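Once you have applied the vendor's figures in /etc/system and rebooted, sysdef(1M) shows what the kernel actually accepted. The sketch below simply narrows sysdef output to the IPC sections; the commented /etc/system lines are example values only, not recommendations:

#!/bin/sh
# Sketch: show the effective IPC (shared memory, semaphore, message
# queue) limits after a reboot.
sysdef | egrep 'IPC|SHM|SEM|MSG'
# Example /etc/system entries -- illustrative values only; always use
# the figures your application vendor documents:
#   set shmsys:shminfo_shmmax=0xffffffff
#   set shmsys:shminfo_shmmni=100
#   set semsys:seminfo_semmni=100
#   set semsys:seminfo_semmsl=256
#   set msgsys:msginfo_msgmni=50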
SCSI Tunables
              sd_max_throttle and sd_io_time
              Two very common tuning targets are those that relate to the SCSI 
              disk (sd) driver module: sd_max_throttle and sd_io_time. 
              You commonly see these set (or need to set them) if you are using 
              EMC, IBM, or Hitachi storage attached to your Solaris system. The 
              sd_io_time variable is the limiter on how long an I/O can 
              be outstanding before an error condition is returned. The Solaris 
              default is 60 seconds (0x3c), but this is often set to 31 seconds 
              (0x1f). The variable sd_max_throttle provides the limit on 
              how many outstanding I/Os the system can handle at any one time 
              and is commonly referred to as the "queue depth." A common 
setting for this (the default is 256, or 0x100) is 25, which is the
mandated setting from JNI Corporation for use with their fcaw driver [4]
and EMC [5] or Hitachi storage [6].
              The general recommendation for both these variables, however, 
              is that if you are not required to set them by your storage vendor, 
              then leave them unset in your /etc/system file and let Solaris handle 
              the settings for you. This is because these variables cannot be 
              set on a per-LUN or per-instance basis: any change made to these 
              two variables affects the entire SCSI subsystem. If you set them 
              to values greater than or much less than your storage subsystem 
              can handle, you run a very real risk of having a badly performing 
system for disk, tape, and memory operations. Another important
aspect of tuning the sd driver is that Sun's disk storage attaches
using either the sd driver or, if attached via the Fibre Channel
(FC) protocol, the ssd driver. This allows those stacks to be tuned
separately, but that could well change in a later version of Solaris.
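For reference, the /etc/system form of these settings looks like the fragment below, using the example values quoted above; add lines like these only if your storage vendor explicitly requires them:

* /etc/system fragment -- example only.  These apply to every sd
* instance on the system, so set them solely at your storage
* vendor's direction.
set sd:sd_max_throttle=25
set sd:sd_io_time=0x1f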
              maxphys
              The maxphys setting, often seen in conjunction with JNI and Emulex 
              HBAs, is the upper limit on the largest chunk of data that can be 
              sent down the SCSI path for any single request. There are no real 
issues with increasing the value of this variable to 8 MB (in /etc/system,
set maxphys=8388608), as long as your I/O subsystem can handle
it. All current Fibre Channel adapters are capable of supporting
              this, as are most ultra/wide SCSI HBAs, such as those from Sun, 
              Adaptec, QLogic, and Tekram. It is possible (although I have not 
              yet tested it) to set this variable in an (E)IDE-based system, such 
              as a PC running Solaris for Intel, a Sun Ultra 5/10/Blade 100, or 
              the lower end Netra systems. With the current range of (E)IDE disks 
              and at least an ATA-66 interface, the system should be able to support 
              this value for maxphys.
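A quick way to confirm the running value before and after editing /etc/system is to read it from the live kernel with mdb(1); a sketch (requires root):

#!/bin/sh
# Sketch: print the current value of maxphys from the running kernel.
echo "maxphys/D" | mdb -k
# After adding "set maxphys=8388608" to /etc/system and rebooting,
# the value reported here should change accordingly.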
              Networking Tunables
              At a former employer, I worked closely with the DBAs on a particular 
system running a financial management application. After we had
monitored the system for several weeks, the first step in tuning was
to get a 100-Mbit switched interface activated. When this 100FDX interface
              was connected, we noticed an immediate improvement in system performance: 
less swap in use due to less buffering of data, fewer "wasted"
CPU cycles, and a much happier group of users. The change was so
              significant that we put off implementing the rest of the tuning 
              while we analyzed whether our plan was still relevant. We did not 
              need to do any specific configuration on our server because the 
              switch handled it.
              If you need to force your interface, there are several simple 
ways to do this. To begin, you must be absolutely certain that your
Ethernet interface is cabled to a 100-Mbit Ethernet switch; otherwise,
you will not get any response from your network connection.
              Let's look at the two most common methods: an rc script and 
              editing /etc/system.
              The boot-time rc script allows you to specify which instance of 
              the interface you want to change the settings for. In the example 
              below, I am setting the properties for my qfe3 interface:
              
             
#!/bin/sh
#
# script to force the interface properties for qfe3
# script name is /etc/rc2.d/S50ndd_qfe3
#
ndd -set /dev/qfe instance 3
# force OFF 100Mb half duplex
ndd -set /dev/qfe adv_100hdx_cap 0         
# force OFF 100Mb T4
ndd -set /dev/qfe adv_100T4_cap 0          
# force ON 100Mb full duplex
ndd -set /dev/qfe adv_100fdx_cap 1         
# force OFF autonegotiation (FORCE mode)
ndd -set /dev/qfe adv_autoneg_cap 0        
# end of script
Here's the /etc/system modification method:  
             
set qfe:adv_100hdx_cap=0
set qfe:adv_100T4_cap=0
set qfe:adv_100fdx_cap=1
set qfe:adv_autoneg_cap=0
set qfe:adv_10hdx_cap=0
This method sets all of your qfe interfaces to operate at 100FDX, 
            and if that's what your switch is configured to do, then you 
are ready to reboot and enjoy the benefits.

Of the two methods, I recommend using a boot-time rc script. It's
easier to maintain if you write it correctly (because it uses the
Bourne shell), and you don't have to worry about testing by
way of a reboot, because a quick unplumb/plumb followed by ndd(1M)
allows you to make changes while your system is up and running.
The /etc/system method is also useful, but given that you cannot
really tune each instance separately using this method, it may not
be appropriate for every site.
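For completeness, here is a sketch of that live unplumb/plumb sequence for qfe3. The IP address and netmask are placeholders, and expect a brief outage while the interface is reconfigured:

#!/bin/sh
# Sketch: force qfe3 to 100FDX without a reboot.  Substitute your own
# IP address and netmask -- the ones below are examples only.
ifconfig qfe3 down
ifconfig qfe3 unplumb
ndd -set /dev/qfe instance 3
ndd -set /dev/qfe adv_100hdx_cap 0
ndd -set /dev/qfe adv_100T4_cap 0
ndd -set /dev/qfe adv_100fdx_cap 1
ndd -set /dev/qfe adv_autoneg_cap 0
ifconfig qfe3 plumb 192.168.1.10 netmask 255.255.255.0 up
# check the negotiated result
ndd -set /dev/qfe instance 3
ndd -get /dev/qfe link_speed     # 1 = 100Mb, 0 = 10Mb
ndd -get /dev/qfe link_mode      # 1 = full duplex, 0 = half duplex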
              References
              Solaris Internals. Richard McDougall and Jim Mauro. Sun 
              Microsystems Press, 2000. ISBN 0-13-022496-0
              Solaris Tunable Kernel Parameters Guide. Sun Microsystems. 
              Sun Microsystems Press, 2000. p/n 806-4015
              Sun Performance and Tuning: Java and the Internet. Adrian 
              Cockcroft and Richard Pettit. Sun Microsystems Press, 1998. 2nd 
              Edition. ISBN 0-13-095249-4
[1] http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
[2] For more information on the implementation, read Solaris Internals and Sun Performance and Tuning.
[3] The Tunable Parameters guide says that each dnlc entry takes 36 bytes in 32-bit mode, and 64 bytes in 64-bit mode.
[4] http://www.jni.com/drivers
[5] http://www.emc.com
[6] http://www.hds.com and http://www.sun.com/storage/highend for the Sun StorEdge 9910.
              James McPherson is a CPR Engineer for Sun Microsystems based 
              in Sydney, Australia. When not wandering around the Solaris kernel, 
              he and his wife like to race their NS14 dinghy around Sydney Harbour.