Large Solaris™ Servers
Solaris will always attempt to choose good default configuration
values, but sometimes a simple heuristic cannot find the best value.
The optimum configuration may depend upon an important trade-off,
such as speed vs. space efficiency. In this article, I'll guide
you through a medley of performance-related considerations for large
Solaris servers and also cover some of the algorithms and heuristics
surrounding the most important tunables.
1. Sizing Swap
Sizing swap properly can be a trade-off since having either too
much or too little can cause problems. Neither situation prints
a simple message like "Swap too small" to the console.
You may see "not enough memory" messages, but you never
see "too much memory", even though this can happen.
Having swap too large will at best waste disks, and can easily
lead to hard-pageout. Adjustment depends on the application and
particular installation. Having too little swap can make the system
run out of virtual memory before it runs out of physical memory.
One problem is that as memory density grows, ever larger amounts
of memory become too cheap to worry about. Looking at "wasted"
memory should be done with a threshold set in cost, rather than
in size. What looks like a wasted 100 MB of memory on one hand may
also be considered a cheap insurance policy against performance problems.
Because of the implementation of swap in Solaris, the usual rule
of sizing a swap device at two to three times the size of real memory
is no longer accurate, and these values should each be reduced by
one. Thus, sizing swap at one to two times real memory is a good
starting point.
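The current swap configuration and its usage can be checked with
the swap command before and after any resizing:

```shell
# List configured swap devices/files and their free space.
swap -l

# Summarize virtual swap: allocated, reserved, used, available.
swap -s
```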
2. The iowait Statistic
The "iowait" statistic can be misleading on large Solaris
machines. In the past, certain systems required the CPU that initiated
an I/O to service that I/O interrupt, and this CPU was not available
for other processes on some systems until this I/O request was satisfied.
Thus, I/O-wait time came to be considered one of the states of a
processor -- a processor was either busy in Userland, busy in
kernel code, sitting in I/O-wait, or idle and therefore free to
run other processes.
Under Solaris there is no need for a particular CPU to service
the I/Os that it initiated. Any processor can service the I/O interrupt
and, more importantly, as soon as the I/O is initiated, the CPU
is free to go back to the run queue and pick up another process
(i.e., the CPU is idle). Any other CPU can then process the I/O
interrupt whenever it arrives. Therefore, my argument is that we
should have only three states for a processor: user, kernel, and
idle.
Adrian Cockcroft points out that the only reason we still have
%wio time reported is as some virtually meaningless artifact
that's hard to get rid of. Some people still wonder, "After
my I/O was initiated, how long until the interrupt was serviced?"
At best it's an attempt to measure the behavior of the I/O
hardware, but it certainly does not reflect a CPU state. Having
just a few outstanding I/O operations means that any idle CPUs are
potentially able to service the interrupts when they come. However,
only one CPU will handle them when they come, and there is no real
way to correlate the number of processes that are in the queue waiting
for the I/Os.
I can take an SMP with 20 processors, run a single-threaded I/O-bound
process and sar will show about 95% I/O-wait time for the
entire system. It gives the incorrect impression that one thread
issuing I/O requests in a serial manner can tie up 20 processors.
Thus, it's best to ignore the %wio statistic. (There
is a proposed fix for the statistic, but it will still be a while
before that becomes standard.) I don't have enough experience
with non-SPARC MP systems to say whether this statistic is broken
on all platforms, but Adrian said, "I spoke to an HP performance
engineer, and he agreed that it was broken on all MP systems, including
theirs."
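To see the effect described above, compare the system-wide view
from sar with the per-CPU view from mpstat while an I/O-bound job
runs; both are standard Solaris utilities:

```shell
# sar folds idle-but-waiting time into a single system-wide %wio...
sar -u 5 3

# ...while mpstat breaks out usr/sys/wt/idl per CPU, showing that
# only one thread's worth of work is actually outstanding.
mpstat 5 3
```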
3. Using Logging mount Options
Using logging mount options is especially important on any filesystem
where fsck is impractical. Running fsck has limitations in both
the time spent and the impact it has in the case of a failure. Failures
can be costly because they mean recent changes to files may be lost
or even whole files may not be saved in "lost+found".
Logging filesystems treat metadata updates as atomic operations
with an intent, operation, and confirm sequence. This makes the
consistency check trivial; thus it doesn't take much time.
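For example, a UFS filesystem can be mounted with logging enabled
via /etc/vfstab (the device and mount-point names here are
hypothetical):

# /etc/vfstab entry with the logging mount option:
# device to mount   device to fsck      mount point   FS type  pass  at boot  options
/dev/dsk/c1t1d0s6   /dev/rdsk/c1t1d0s6  /export/data  ufs      2     yes      logging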
4. Boosting Blocks Per Inode in Big Filesystems
UNIX filesystems are general purpose, so there are many trade-offs
made upon initializing a filesystem. For large systems running databases,
some benefits can be realized by tuning the newfs parameters.
The defaults are set statically in older versions of Solaris, while
Solaris 8 chooses a value between 2048 and 8192. This is fine for general-purpose
use (e.g., average file size of around 2K to 8K, somewhat dynamic
in size, and capacity demands that grow and shrink regularly). When
creating a filesystem for a modern database computer or fileserver,
change the default to 500K or higher. This greatly reduces the time
to complete the newfs.
There are several options to mkfs and newfs that
are no longer used, are still documented, but have no effect on
modern disks. (One parameter that is ignored is rotdelay.)
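A sketch of such a newfs invocation for a database filesystem (the
device name is hypothetical; -i sets the bytes-per-inode ratio):

```shell
# One inode per ~500 KB instead of one per 2-8 KB; far fewer
# inodes to initialize also makes newfs itself finish much faster.
newfs -i 512000 /dev/rdsk/c2t1d0s6
```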
5. Tuning fsflush
Depending on the amount of memory devoted to filesystem buffering,
the system process, fsflush, may become overloaded. Files
not opened with O_DSYNC have a policy of updating via fsflush.
The job of fsflush is to periodically check all buffers according
to two variables that can be set in the /etc/system file --
autoup and tune_t_fsflushr. See Listing 1.
In kernel versions before 2.8, this scan is the only way for the
system to recover used memory. Creating free memory is especially
important when many processes are consuming memory via malloc,
read(2), or mmap(2)-type reads. The default of 30
seconds for autoup makes the fsflush process too costly
on a big memory system.
To cool off fsflush, you can increase the cycle time by
increasing the autoup value, which will hit "diminishing
returns" above a few minutes. The impact of fsflush
is system-wide, so it is best to make it run for shorter periods
as well. By setting tune_t_fsflushr to a smaller number of
seconds, each pass does proportionally less scanning; running five
times more often than the default means one fifth as much work per pass.
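The two settings above can be sketched in /etc/system (the values
are illustrative, not recommendations; comments in /etc/system
start with an asterisk):

* Stretch the full fsflush cycle from the default 30 seconds to
* 240 seconds, and wake fsflush every second (default 5) so each
* pass scans only a small fraction of memory.
set autoup=240
set tune_t_fsflushr=1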
6. Setting noatime on mount
Sometimes we run applications on filesystems that don't need
the OS-level metadata part of the filesystem. The last accessed
time is simply not going to be used on these systems, but the default
mount option has access time update enabled, which means
every access to the file also has to update this value. This can
make a big hit on performance. Since Solaris 7, the mount
command has an option to relax the update policy on access time.
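For example (hypothetical device and mount point), the relaxed
policy can be requested for a single mount or permanently in
/etc/vfstab:

# One-off mount:
#   mount -o noatime /dev/dsk/c1t0d0s6 /db01
# Or permanently in /etc/vfstab:
/dev/dsk/c1t0d0s6  /dev/rdsk/c1t0d0s6  /db01  ufs  2  yes  noatime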
7. Tuning maxcontig
UFS under the Solaris operating environment uses an extent-like
feature called "clustering". A default setting is not
optimal for all filesystems since it is application-dependent. Many
small files accessed in a random pattern do not need extents, and
performance can suffer for both reads and writes when using extents.
Larger files can benefit from prefetch on reads and improved
allocation units when using extents in writes.
For reads, the extent-like feature is essentially read-ahead.
Tuning the fetchahead algorithm is done simply and dynamically
with the tunefs(1m) command. The value changed is maxcontig,
which sets the number of filesystem blocks read in read-ahead. When
a process reads more than one filesystem block, the kernel will
schedule the reads to fill the rest of maxcontig * filesystem
block-size bytes. One 8K (or smaller) random read on a file will
not trigger fetchahead. fetchahead also will not occur
on files being read with mmap.
The kernel will attempt to automatically detect whether an application
is doing small random or large sequential I/O. This often works
fine, but the definition of "small" or "large"
depends more on system application criteria than device characteristics.
Tuning maxcontig can obtain optimal performance.
With Solaris 2.5.1, the default maxcontig was always 7.
After that, the default changed to a device-specific value, so it
depends on the device and other related values.
Since hardware RAID (A3500 and T3) devices do prefetch,
the benefit of doing it at the filesystem level is diminished, so
it is best to disable or reduce maxcontig for these devices.
When running a transaction-processing database, usually the database
will not use any fetch-ahead blocks, so all blocks pulled in for
the read-ahead will just be adding wasted activity on the disks.
In this case, set maxcontig to 1. Conversely, for large files
accessed in a roughly sequential manner, a boosted maxcontig
with boosted maxphys can improve the transfer efficiency
by increasing the amount of data read once the disk heads are positioned
within a cylinder. The best value will be to have it as large as
a cylinder, around 1 MB, which translates to a maxcontig
of 128 with 8K filesystem blocks.
Values for maxcontig should be chosen based on the likelihood
of getting good data out of blocks near each other on the disk (or
volume) vs. the wasted I/O and (especially) memory to hold those
blocks if they are never used.
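Both cases above can be sketched with tunefs, which changes
maxcontig dynamically on a live filesystem (device names are
hypothetical; -a sets maxcontig):

```shell
# Transaction-processing database on hardware RAID: disable
# filesystem-level read-ahead.
tunefs -a 1 /dev/rdsk/c2t0d0s6

# Large, roughly sequential files: cluster about 1 MB per I/O
# (128 blocks at an 8K filesystem block size).
tunefs -a 128 /dev/rdsk/c3t0d0s6
```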
8. Routing in the Kernel
Using more than one network interface card (NIC) will, by default,
turn on routing in the kernel between those two subnets. By simply
having the file /etc/notrouter, the initialization scripts don't
let this happen:

touch /etc/notrouter
9. Dispatch Table Tuning
The dispatch table is used by the kernel scheduler to make choices
about which process runs when the CPUs are too busy for every process
to have exclusive use of a CPU. From the classic point of view,
a CPU-intensive process is automatically pushed to lower priority,
which is achieved by noting that it is still running when its time
quanta expires. Interactive performance is enhanced by floating
processes to a higher priority when they sleep (waiting for I/O)
before their quanta expires. This classic view is how the default
Sun "Time-share" dispatch table works.
On the other hand, for big database servers, this can result in
a problem where a CPU-intensive portion of the database holds a
data-structure lock as it is running at low priority. To avoid this,
we use higher values for the time quanta, which may be known as
the "Starfire" dispatch table or the "Cray"
dispatch table. It is automatically used in certain cases on E10000
and may be useful on other machines as well. This feature has been
implemented slightly differently on different versions of the operating
environment. In 2.5.1 and 2.6, the E10K has the big quanta table
loaded by default. In 2.7, the E10K has a special initialization
script called "S99bigquanta", which loads on boot. In
the 2.8 kernels, a platform-specific module is loaded, but a script
can also be used.
Any user can execute the command dispadmin -c TS -g to
see what the quanta are. The big quanta tables run from 400 ms to
340 ms, while the normal quanta run from 200 ms to 20 ms. I often make a
new rc3.d script to switch between a fast quanta and a big quanta
via the stop and start argument. This could be useful for a system
using batch at night and interactive during the day.
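A minimal sketch of such an rc3.d script (the table file names are
hypothetical; the tables themselves would be captured beforehand
with dispadmin -g and edited):

```shell
#!/bin/sh
# Hypothetical /etc/rc3.d/S99quanta: switch TS dispatch tables.
# "start" loads a big-quanta table for batch work; "stop"
# restores the previously saved default table.
case "$1" in
start)
        dispadmin -c TS -s /etc/dispadmin.bigquanta ;;
stop)
        dispadmin -c TS -s /etc/dispadmin.default ;;
*)
        echo "Usage: $0 {start|stop}" ;;
esac
```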
10. Fast-Write Enabled Disks on Database Logfiles
As databases scale up, the pressure on the log devices can become
an important issue. The database has to hold up a transaction while
the I/O is in progress. If the log is going to plain-mirrored devices,
the maximum throughput will include considerable wait time as the
disk rotates to the heads. Use fast-write enabled disks on database
logfiles.
11. Console Messages
Don't try to troubleshoot a problem without looking at the
console messages. Many drivers post error information through the
console.
12. The Fastest Filesystem
tmpfs is not always the fastest filesystem. The tmpfs filesystem
type has been optimized for small-file and metadata-intensive access.
Make a real ufs mount if you want fast sequential reads.
12 MB/s is typical for /tmp, while 150 MB/s is easily obtained
on re-reads of cached ufs data.
13. Reboots After a Panic
Reboots due to panic can cause performance problems. Because of
the activity early in the boot to do the savecore, the memory
freelist may become fragmented. This can cause the shared memory,
ISM, not to get full allocation in 4-MB pages. One solution is to
start the database before the savecore or to umount
or direct mount the filesystem where the crash is kept.
14. Disk and Volume Management History
Several plausible methods exist for saving disk and volume management
history. At a minimum, save the current configuration often:
alias newv='mv /etc/vxprint.out /etc/vxprint.out.old;
date > /etc/vxprint.out ; vxprint -thr >> /etc/vxprint.out'
format << EOT > /etc/disks.format.out
15. Copies of Crucial Files
Consider the size of some files versus the critical nature of
customizations when saving crucial files. I keep spare or "before"
versions of /etc/system, partition maps of disks, vfstab,
etc. right on the server in the etc directory:
prtvtoc /dev/rdsk/c1t0d0s2 > /etc/vtoc.ddisk
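A saved partition map can later be pushed back onto a replacement
disk with fmthard (same hypothetical disk as above; slice 2 is the
conventional whole-disk slice):

```shell
# Restore the VTOC captured earlier with prtvtoc.
fmthard -s /etc/vtoc.ddisk /dev/rdsk/c1t0d0s2
```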
16. Full Install on Big Server
There's no point in trying to save a few hundred megabytes
only to discover that you need some package that's only installed
on the "server" option. Do the full install on the big
server.
17. coreadm Policy
You may spend a lot of time finding core files if you haven't
set up a policy with coreadm. The recommendation is to reverse
the default action on saving core files by doing limit coredumpsize
0 in /etc/.login and ulimit -c 0 in /etc/profile. Under Solaris
8, the system policy is more easily managed with the coreadm
command.
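Under Solaris 8, a global core-file policy might be sketched like
this (the repository directory is hypothetical; %f and %p expand
to the executable name and process ID):

```shell
# Put every core file in one directory, named by program and PID,
# and turn the global core-file path on.
mkdir -p /var/cores
coreadm -g /var/cores/core.%f.%p -e global
```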
18. Tail /var/adm/messages
Tail /var/adm/messages when looking for a problem. The error log
delivery subsystem can be managed via the syslogd process.
Messages can be routed based on severity in addition to the default
of going to the logfile and console.
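Severity-based routing is configured in /etc/syslog.conf; a
hypothetical line forwarding serious messages to a central host
(the loghost name is an assumption) could look like:

# Forward alert-level and higher messages to a loghost.
# (Tabs are required between selector and action.)
*.alert	@loghost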
19. Pesky mount Issues
If a mount fails after doing a umount -f, you can reinitialize
the mount point, making a new one: rmdir mntpoint; mkdir
mntpoint
20. The contents File
Use the /var/sadm/install/contents file to be comfortable with
package name and utility name relationships.
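For example, to find which package delivered a given utility (the
path shown is illustrative):

```shell
# Map a utility back to its package: the contents file lists
# every installed path along with its owning package(s).
grep '/usr/sbin/vxprint' /var/sadm/install/contents
```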
21. mkfile Speed
Don't expect mkfile to be fast if you use a small
maxcontig. The blocksize that mkfile uses is 8192.
This is too small for working on very large files without help from
maxcontig. With no option to increase the blocksize, you
must switch to a different utility for large file initialization.
dd works fine like this:

dd if=/dev/zero of=/misc/junk.3 bs=1024k count=100

Compare its elapsed time against mkfile with timex:

timex mkfile 100m /misc/junk/junk.2
Another solution is to temporarily change maxcontig to 32 or
more when using programs that write sequential data in small blocks.
22. Print man Pages
Make pretty print man pages by setting the environment variable
TCAT to /usr/lib/lp/postscript/dpost, then:
man -t whatever | lp
23. Keeping Records
Install and use a permanent recorder of overall system performance
by extending iopass to make permanent records of other performance
and accounting information. Simply make a cron job like this:
0 0 * * * ( IFS="";trap 'kill 0' 0; vmstat 2|& while read -p \
line; do X=`date +%Y.%m.%d.%H:%M:%S`; echo $X $line; \
done) >> /tmp/vmstat.`date +%m%d`
Remember, as with any word-processing software, save early and
save often. I would like to thank Sleepy, Grumpy, Sneezy, and most
of all, Prince.
Bob Larson is an Engineer in the Strategic Applications
Engineering group within Sun Microsystems, which focuses on enterprise
server performance and architecture. He has over 12 years' experience
in performance tuning, application and kernel development, and benchmarks.