Solaris 8 Performance Tuning
James C. McPherson
When talking about tuning the performance of a UNIX system, sys
admins generally work through a number of concepts, such as the
current perceived performance, the difference between this level
and a desired level of performance, the optimal level claimed by
the application vendor, how to measure performance differences,
and finally, what to tune and test.
The current perceived level of performance is based on what your
users tell you and what you observe in your regular system monitoring.
In a networked environment, if you have some users on the end of
a 100-Mbit full-duplex connection, and others who are still using
10-Mbit half-duplex, then the 10-Mbit users will always tell you
that the system is slow. However, you might have to wait for a serious
application overload or resource crunch before the 100-Mbit users
complain. There are also copious logfiles and measurable kernel
statistics to facilitate system-level monitoring. Some sites also
make use of tools such as CA Unicenter TNG, HP OpenView, or Sun's
SunMC to monitor via SNMP MIBs. However, these are tools that must
be used by competent systems and application administrators and
should not be used alone.
The desired level of performance is somewhat different. Benchmarks
are sometimes less than useful, but if you are using an application
or system from a large vendor, that vendor should have a dedicated
benchmarking group that can replicate your environment and let you
test realistic loads. You can trust a benchmark done in this manner
because it is specifically tailored to your company's requirements.
If you don't have the time or resources to organize a tailored
benchmark, you'll have to spend extra time doing it yourself.
Before you get started, determine what measurements will be valid
for your environment and how to obtain them. Also, work with your
users in regard to when you can test tuning and monitoring. Keep
the users informed of what you are doing so that they can recognize
changes and appreciate the value that you provide to them.
How to Measure Before-and-After Performance Differences
To tune effectively, you must first decide what you want to achieve.
Here are some tips to consider:
- Determine your criteria for success.
- Log everything, and run the Sun Explorer data gatherer every time
you make changes so that you have a snapshot of your system configuration
at each point.
- Analyze the logs carefully.
- Measure the changed performance against what you wish to achieve.
- Be methodical.
If you do not already keep a record of system and resource loads,
and have no method of monitoring the applications on your system,
start simply. You can measure CPU, memory, and swap utilization, disk
utilization (specifically your application filesystems), and network
utilization using standard system tools. Solaris 8 provides the
kstat(1M) utility to view various system properties in a variety of
output formats, and df -k is also useful. You can write scripts to
dump these statistics to a file on a periodic basis, or use tools such
as Apache and MRTG [1] to give you a regularly refreshed view. A little
bit of shell, awk, sed, or Perl will give you lots of numbers to graph.
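For example, a minimal snapshot script along the following lines (the
logfile path and sampling intervals are only placeholders, and you would
normally drive it from cron) is enough to start building a history:

#!/bin/sh
# perfsnap.sh -- append a timestamped snapshot of basic system
# statistics to a logfile; intended to be run from cron.
LOG=/var/tmp/perfsnap.log

{
  echo "==== `date` ===="
  # the second vmstat/iostat sample is the interesting one; the
  # first reports averages since boot
  vmstat 5 2
  iostat -xn 5 2
  netstat -i
  df -k
} >> $LOG 2>&1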
The most useful monitoring of performance requires frequent and
regular snapshots of your system for a period of time that you,
your application admins, and users decide is long enough. My rule
of thumb is to accrue not less than eight days of monitoring data
(preferably including a peak load period) so that you have obvious
peaks and troughs to look at. Another aspect of measurement is to
survey the users both before and after to determine how they believe
performance has changed in the course of the monitoring period.
If your performance tuning has resulted in differences that the
users cannot perceive, then I would say the performance hasn't
really been tuned. Useful questions to ask include:
- How long did it take to perform a standard task (measured in
seconds) at 9 a.m., 12 noon, 2 p.m., and 5 p.m. before you started tuning?
- How long does this task take now?
- Do you think that the performance has (improved, stayed the
same, become worse) over the past (X) days? (At least a seven-point
scale is recommended for this question).
You, the sys admin, also need to measure these standard tasks
yourself so that you have a frame of reference for your measurements.
What to Tune and When to Tune It
I/O and Buffers (Including VxVM and VxFS)
The most common target of performance tuning is obviously I/O,
and in Solaris there are several buffers and tunables that we can
work on. You are probably aware of the relative speeds of disks
versus RAM, and how this difference in speed has changed dramatically
over the past 20 years. This is particularly important when configuring
"virtual memory" for your server because of the way Solaris
implements paging [2]. A primary concern here is to avoid paging data
out to your slow disks (even a 10,000-rpm FC-AL disk is still slow) and
to ensure that the data stays in RAM until it is sent down the SCSI
or TCP stack to the next appropriate point.
How does one prevent paging? This question requires an appreciation
of "good" and "bad" paging. So-called "good"
paging happens when the system allocates or reclaims pages from
a process, whereas "bad" paging is when this allocation
relies upon a disk device and the system incurs a penalty for access.
So, we don't particularly care about the "good" paging,
because the intelligence built into the kernel's paging algorithms
ensures that it happens as infrequently as possible. The priority_paging
setting is necessary for Solaris 2.6 and Solaris 7. Neither Solaris
8 nor 9 requires it, because a modified version of priority_paging,
called the cyclic page cache, was integrated into the new kernel.
"Bad" paging is also known as swapping, and in extreme
circumstances can result in such a high I/O load that your I/O subsystem
is said to be thrashing.
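A simple way to watch for this from the command line is to keep an eye
on the page scanner and swap statistics; this is only a quick check,
not a substitute for proper monitoring:

# a sustained non-zero "sr" (scan rate) column, combined with
# significant "po" (page-out) activity, is the classic symptom of
# the "bad" paging described above
vmstat 5

# summarize allocated and reserved virtual swap
swap -s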
On the Solaris side, you must be aware of two tunables: ncsize
and lotsfree. If you are running Veritas
Volume Manager (VxVM), then you also need to be aware of vxio:vol_maxio
and friends. If you are running the Veritas FileSystem (VxFS), you must
carefully tune vxfs:vxfs_ninode and vxfs:vx_bc_bufhwm.
I will address each of these variables in turn.
ncsize
The "nc" in ncsize refers to the Name Cache,
and is used to set the size of the Directory Name Lookup Cache,
or dnlc for short. This is an optimization to give you better performance
from your filesystems because inodes are cached in memory and only
flushed out of the cache to disk if they have been idle for quite
a while. The default setting for ncsize depends on two other
variables, max_nprocs and maxusers, with the following
relationships:
ncsize = (4 * (max_nprocs+maxusers)) + 320
max_nprocs = 10 + (16 * maxusers)
maxusers = physmem - 2
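As a worked example using the relationships above, if maxusers sits at
its ceiling of 2048, then:

max_nprocs = 10 + (16 * 2048) = 32778
ncsize = (4 * (32778 + 2048)) + 320 = 139624

which is in the same region as the system-calculated default of 139488
quoted for the E420R later in this section.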
Setting maxusers to less than the maximum possible value of
2048 used to be a required configuration task going back to SunOS
4.x. However, Solaris 2.x does not require maxusers to be set,
and you can actually decrease the performance of your system by doing
so. If maxusers is set to the system default, you might not
have a very large dnlc. The recommendation is first to set maxusers
to 2048, reboot, and after about a week of average usage, examine
the kernel statistics on the dnlc (the kstat name is dnlcstats).
Under both Solaris 8 and 9, you can use the kstat(1M) utility to
retrieve them. When looking at these statistics, note the
value of misses/hits and see how close that ratio is to 0. The closer
it is to 0, the less you need to worry about tuning ncsize.
For example, my workstation (an Ultra 80) has been up for about 36.5
days and produces this output from kstat -m unix -n dnlcstats:
module: unix    instance: 0
name:   dnlcstats    class: misc
        crtime                 63.090334184
        ...
        hits                   141141001
        misses                 1953220
        negative_cache_hits    2070966
        pick_free              802468
        pick_heuristic         1102434
        pick_last              257563
        ...
        snaptime               3162130.15322328
In my case, the miss/hit ratio works out to be about 0.0138, meaning
that my dnlc efficiency is very good. On a major fileserver that I use,
the values are considerably different:
module: unix    instance: 0
name:   dnlcstats    class: misc
        crtime                 180.496237386
        ...
        hits                   313301011
        misses                 77458379
        negative_cache_hits    14429400
        pick_free              3664314
        pick_heuristic         51820181
        pick_last              24067557
        snaptime               6143454.64953917
Here, the ratio works out to be 0.2472, meaning that roughly one lookup
in five misses the dnlc on this server, so ncsize should be tuned
upwards. For the system above, I would recommend that the dnlc size be
doubled. It is currently set to the system-calculated default of 139488,
which for this E420R running in 64-bit mode takes up slightly more than
8.7 MB of kernel memory [3]. The system can easily afford double this
amount of kernel memory for the dnlc, given its workload.
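A quick way to calculate the miss/hit ratio without reading the raw
counters yourself is a small script like the following (a sketch only;
the script name is arbitrary):

#!/bin/sh
# dnlcratio.sh -- print the dnlc miss/hit ratio from kstat(1M);
# the closer the result is to 0, the less ncsize needs attention
kstat -p -m unix -n dnlcstats | awk '
  $1 ~ /:hits$/   { hits = $2 }
  $1 ~ /:misses$/ { misses = $2 }
  END { if (hits > 0) printf "miss/hit ratio: %.5f\n", misses / hits }'

If the ratio points to a dnlc that is too small, the new size goes into
/etc/system (for example, set ncsize=278976 to double the default quoted
above) and takes effect after a reboot.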
lotsfree
This variable is used as the boundary condition for when to invoke
the page scanner and make it look for pages to free. Typically,
this is set to 1/64th the number of physical pages in your system,
which works reasonably well. However, as the Solaris Tunable Kernel
Parameters Guide indicates, if your system load is such that it cannot cope
with sudden sharp increases in demand for memory, then you should
seriously consider increasing this value. A general rule I've
seen recommended is to set lotsfree to be 1/16th of the physmem
value, rather than the default 1/64th. This allows the page scanner
to activate in a more timely manner. When combined with priority
paging or the cyclic page cache of Solaris 8, you should see that
the memory load curve of your system is much smoother than before.
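To make the arithmetic concrete, here is an illustrative /etc/system
entry (the value assumes a 1-GB UltraSPARC system with 8-KB pages;
lotsfree is specified in pages, so substitute your own physmem figure):

* illustrative only: 1/16th of physmem on a 1-GB system
* (131072 pages of 8 KB, divided by 16)
set lotsfree=8192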
Veritas FileSystem Variables -- vxfs:vxfs_ninode and
vxfs:vx_bc_bufhwm
These two variables are particularly important to tune, and they
have an interdependency with ncsize that is related to their
usage of kernel memory. When you use the Veritas FileSystem (VxFS),
you must tune ncsize to be between 50% and 80% of the value
of vxfs:vxfs_ninode in order to achieve good performance.
Good performance in this context means that the performance curve
of CPU cycles and memory used stays relatively close to the performance
curve algorithm that Veritas builds into the product.
The upper limit on the amount of kernel memory that VxFS will
allocate for its cache is set with vxfs:vx_bc_bufhwm --
the Buffer Cache's Buffer High Water Mark. Once the allocated
amount reaches this limit, VxFS inodes are flushed from the cache.
The vxfs:vxfs_ninode variable is the limit on the number
of VxFS inode structures held in memory. This number is usually
determined to be 125% of the value of ncsize for reasons
of CPU cycle efficiency, and if you have your dnlc set by way of
setting maxusers to 2048, then ncsize/vxfs:vxfs_ninode
is approximately 0.78. This is very close to the 80% figure mentioned
in the VxFS installation guide.
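As an illustration only, the corresponding /etc/system entries might
look like the following; the absolute numbers are simply the E420R
ncsize default quoted earlier and 125% of it, not recommendations for
your system:

* illustrative values: keep ncsize between 50% and 80% of vxfs_ninode
set ncsize=139488
set vxfs:vxfs_ninode=174360
* cap the VxFS buffer cache as well; consult the VxFS documentation
* for the units and a value suited to your memory configuration
* set vxfs:vx_bc_bufhwm=<value>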
Veritas Volume Manager Variables
Many sites use Veritas Volume Manager (VxVM) to provide data protection
and enhance their systems' performance. Mirrored filesystems,
striped data filesystems, and RAID-5 filesystems are all features
that VxVM provides. If you use VxVM as well as VxFS, then you should
look at vxio:vol_maxio. This variable controls the maximum
size of I/O requests that are sent down the SCSI chain without breaking
the request up. Veritas recommends that this tunable not exceed
20% of kernel memory or physical memory (whichever is smaller),
and that you match this tunable to the size of your widest stripe.
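An illustrative /etc/system entry, assuming the value is expressed in
512-byte sectors as the VxVM documentation describes, might be:

* illustrative only: allow up to 1 MB per request (2048 sectors
* of 512 bytes); match this to your widest stripe
set vxio:vol_maxio=2048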
Apart from this tunable, there are no others that must be tuned,
and you should only really look at tuning the variables specified
in the Veritas Volume Manager Administrator Guide if you are directed
to by an appropriate technical contact within Veritas or a Veritas
partner. This is because Veritas, like Sun with Solaris, has spent
a lot of time and effort in making sure that the self-tuning algorithms
work well for the vast majority of systems. If you are fortunate
enough to have the sort of large installation where serious VxVM
tuning is necessary, then you should engage Veritas or Sun Professional
Services to analyze and tune your configuration.
Shared Memory, Semaphores, Message Queues
Applications such as RDBMS engines (Oracle, Sybase, Informix,
DB2, etc.), middleware like MQSeries and Tuxedo, and some backup
packages (Veritas NetBackup) make heavy use of shared memory, semaphore
sets, and message queues in order to maximise their performance.
For configuring these settings, you should always start with the
application vendor's recommendations, and find out from the
vendor how they log a deficiency in these settings. Veritas NetBackup,
for example, will dump messages into its logfiles such as:
  waited for empty buffer X times
  waited for full buffer Y times
which, in conjunction with the NetBackup Troubleshooting Guide, will
indicate whether you need more semaphores or message queues.
Remember that the kernel will not allow more than 25% of the dedicated
kernel space (segkp) to be allocated for the shared memory, message
queue, and semaphore structures. So, if you experiment with large
values on a machine without much physical memory, you may see different
values when you check with the sysdef(1M) utility after rebooting.
For my workstation (see the "System Specification File"
sidebar), the kernel memory usage under Solaris 8 (64-bit mode)
for the example semaphore settings is approximately 15,898 KB, for
the message queues approximately 1,038 KB, and for the shared memory
itself, approximately 18 KB. You might be wondering why the overhead
for shared memory (shmsys) settings is so small compared with the
semaphore (semsys) and message queue (msgsys). That is because the
only variable used for administrative purposes with shmsys is shminfo_shmmni.
This is the maximum number of shmid_ds structures in the
system, each of which is 88 bytes.
The rule of thumb for shared memory, semaphores, and message queues
is to tune these in conjunction with your application vendor, because
the vendor will have concrete customer data and can advise you appropriately.
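By way of illustration only (these figures are not recommendations,
and your application vendor's numbers take precedence), a set of
/etc/system entries for these facilities typically looks something
like this:

* illustrative IPC settings -- start from your vendor's figures
set shmsys:shminfo_shmmax=4294967295
set shmsys:shminfo_shmmni=100
set semsys:seminfo_semmni=100
set semsys:seminfo_semmsl=256
set semsys:seminfo_semmns=1024
set msgsys:msginfo_msgmni=50

After the reboot, compare these requests with what the kernel actually
accepted by checking the IPC sections of the sysdef(1M) output, as
noted above.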
SCSI Tunables
sd_max_throttle and sd_io_time
Two very common tuning targets are those that relate to the SCSI
disk (sd) driver module: sd_max_throttle and sd_io_time.
You commonly see these set (or need to set them) if you are using
EMC, IBM, or Hitachi storage attached to your Solaris system. The
sd_io_time variable is the limiter on how long an I/O can
be outstanding before an error condition is returned. The Solaris
default is 60 seconds (0x3c), but this is often set to 31 seconds
(0x1f). The variable sd_max_throttle provides the limit on
how many outstanding I/Os the system can handle at any one time
and is commonly referred to as the "queue depth." A common
setting for this (the default is 256, or 0x100) is 25, which is the
mandated setting from JNI Corporation for use with their fcaw driver [4]
and EMC [5] or Hitachi [6] storage.
The general recommendation for both these variables, however,
is that if you are not required to set them by your storage vendor,
then leave them unset in your /etc/system file and let Solaris handle
the settings for you. This is because these variables cannot be
set on a per-LUN or per-instance basis: any change made to these
two variables affects the entire SCSI subsystem. If you set them
to values greater than or much less than your storage subsystem
can handle, you run a very real risk of having a badly performing
system for disk, tape, and memory operations. Another important
aspect of tuning the sd driver is that Sun's disk storage attaches
using either the sd driver or, when attached via the Fibre Channel
(FC) protocol, the ssd driver. This allows the two stacks to be tuned
separately, although that could well change in a later version
of Solaris.
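For reference, the /etc/system entries involved look like this; the
values shown are the vendor-mandated figures mentioned above, and
should only be used if your storage vendor requires them:

* only set these if your storage vendor requires specific values
set sd:sd_max_throttle=25
set sd:sd_io_time=0x1f
* FC-attached Sun storage uses the ssd driver, which has its own
* equivalents (ssd_max_throttle and ssd_io_time)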
maxphys
The maxphys setting, often seen in conjunction with JNI and Emulex
HBAs, is the upper limit on the largest chunk of data that can be
sent down the SCSI path for any single request. There are no real
issues with increasing the value of this variable to 8 MB (in /etc/system,
set maxphys=8388608), as long as your I/O subsystem can handle
it. All current Fibre Channel adapters are capable of supporting
this, as are most Ultra/Wide SCSI HBAs, such as those from Sun,
Adaptec, QLogic, and Tekram. It is possible (although I have not
yet tested it) to set this variable in an (E)IDE-based system, such
as a PC running Solaris for Intel, a Sun Ultra 5/10/Blade 100, or
the lower end Netra systems. With the current range of (E)IDE disks
and at least an ATA-66 interface, the system should be able to support
this value for maxphys.
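If you want to confirm what the running kernel is actually using, one
quick check (a sketch, run as root) is to ask the kernel debugger for
the value in decimal:

# print the in-kernel value of maxphys
echo "maxphys/D" | mdb -k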
Networking Tunables
At a former employer, I worked closely with the DBAs on a particular
system running a financial management application. After monitoring
the system for several weeks, the first step in tuning was to get
a 100-Mbit switched interface activated. When this 100FDX interface
was connected, we noticed an immediate improvement in system performance:
less swap in use due to less buffering of data, fewer "wasted"
cpu cycles, and a much happier group of users. The change was so
significant that we put off implementing the rest of the tuning
while we analyzed whether our plan was still relevant. We did not
need to do any specific configuration on our server because the
switch handled it.
If you need to force your interface, there are several simple
ways to do this. To begin, you must be absolutely certain that your
Ethernet interface is cabled into a port on a 100-Mbit Ethernet switch;
otherwise, you will not get any response from your network connection.
Let's look at the two most common methods: an rc script and
editing /etc/system.
The boot-time rc script allows you to specify which instance of
the interface you want to change the settings for. In the example
below, I am setting the properties for my qfe3 interface:
#!/bin/sh
#
# script to force the interface properties for qfe3
# script name is /etc/rc2.d/S50ndd_qfe3
#
ndd -set /dev/qfe instance 3
# force OFF 100Mb half duplex
ndd -set /dev/qfe adv_100hdx_cap 0
# force OFF 100Mb T4
ndd -set /dev/qfe adv_100T4_cap 0
# force ON 100Mb full duplex
ndd -set /dev/qfe adv_100fdx_cap 1
# force OFF autonegotiation (FORCE mode)
ndd -set /dev/qfe adv_autoneg_cap 0
# end of script
Here's the /etc/system modification method:
set qfe:adv_100hdx_cap=0
set qfe:adv_100T4_cap=0
set qfe:adv_100fdx_cap=1
set qfe:adv_autoneg_cap=0
set qfe:adv_10hdx_cap=0
This method sets all of your qfe interfaces to operate at 100FDX,
and if that's what your switch is configured to do, then you
are ready to reboot and enjoy the benefits.
Of the two methods, I recommend using a boot-time rc script. It's
easier to maintain if you write it correctly (because it uses the
Bourne shell), and you don't have to test by way of a reboot,
because a quick unplumb/plumb of the interface followed by ndd(1M)
allows you to make changes while your system is up and running.
The /etc/system method is also useful, but given that you cannot
really tune each instance separately using this method, it may not
be appropriate for every site.
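Whichever method you choose, it is worth checking what the interface
actually ended up with; the following sketch queries the qfe3 interface
used in the earlier example (link_speed reports 0 for 10 Mbit and 1 for
100 Mbit, and link_mode reports 0 for half duplex and 1 for full duplex):

#!/bin/sh
# report the current link state of qfe3 without rebooting
ndd -set /dev/qfe instance 3
ndd -get /dev/qfe link_status   # 1 = link up, 0 = link down
ndd -get /dev/qfe link_speed    # 1 = 100 Mbit, 0 = 10 Mbit
ndd -get /dev/qfe link_mode     # 1 = full duplex, 0 = half duplex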
References
Solaris Internals. Richard McDougall and Jim Mauro. Sun
Microsystems Press, 2000. ISBN 0-13-022496-0.
Solaris Tunable Kernel Parameters Guide. Sun Microsystems.
Sun Microsystems Press, 2000. Part number 806-4015.
Sun Performance and Tuning: Java and the Internet, 2nd Edition.
Adrian Cockcroft and Richard Pettit. Sun Microsystems Press, 1998.
ISBN 0-13-095249-4.
1. http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
2. For more information on the implementation, read Solaris
Internals and Sun Performance and Tuning.
3. The Tunable Parameters guide says that each dnlc entry takes 36 bytes
in 32-bit mode and 64 bytes in 64-bit mode.
4. http://www.jni.com/drivers
5. http://www.emc.com
6. http://www.hds.com and http://www.sun.com/storage/highend for
the Sun StorEdge 9910.
James McPherson is a CPR Engineer for Sun Microsystems based
in Sydney, Australia. When not wandering around the Solaris kernel,
he and his wife like to race their NS14 dinghy around Sydney Harbour.