Large Solaris™ Servers
Solaris will always attempt to choose good default configuration
values, but sometimes a simple heuristic cannot find the best value.
The optimum configuration may depend upon an important trade-off,
such as speed vs. space efficiency. In this article, I'll guide
you through a medley of performance-related considerations for large
Solaris servers and also cover some of the algorithms and heuristics
surrounding the most important tunables.
1. Sizing Swap
Sizing swap properly can be a trade-off since having either too
much or too little can cause problems. Neither situation prints
a simple message like "Swap too small" to the console.
You may see "not enough memory" messages, but you never
see "too much memory", even though this can happen.
Having swap too large will at best waste disks, and can easily
lead to hard-pageout. Adjustment depends on the application and
particular installation. Having too little swap can make the system
run out of virtual memory before it runs out of physical memory.
One problem is that as memory density grows, ever larger amounts
of memory become too cheap to worry about. Looking at "wasted"
memory should be done with a threshold set in cost, rather than
in size. What looks like a wasted 100 MB of memory on one hand may
also be considered a cheap insurance policy against performance problems.
Because of the implementation of swap in Solaris, the usual rule
of sizing a swap device at two to three times the size of real memory
is no longer accurate, and these values should each be reduced by
one. Thus, sizing swap at one to two times real memory is a good
starting point.
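The current swap configuration and its usage can be checked with
the swap command before and after any resizing:

```shell
# List configured swap devices/files and their free space.
swap -l

# Summarize virtual swap: allocated, reserved, used, available.
swap -s
```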
2. The iowait Statistic
The "iowait" statistic can be misleading on large Solaris
machines. In the past, certain systems required the CPU that initiated
an I/O to service that I/O interrupt, and this CPU was not available
for other processes on some systems until this I/O request was satisfied.
Thus, I/O-wait time came to be considered one of the states of a
processor -- a processor was either busy in Userland, busy in
kernel code, sitting in I/O-wait, or idle and therefore free to
run other processes.
Under Solaris there is no need for a particular CPU to service
the I/Os that it initiated. Any processor can service the I/O interrupt
and, more importantly, as soon as the I/O is initiated, the CPU
is free to go back to the run queue and pick up another process
(i.e., the CPU is idle). Any other CPU can then process the I/O
interrupt whenever it arrives. Therefore, my argument is that we
should have only three states for a processor: user, kernel, and
idle.
Adrian Cockcroft points out that the only reason we still have
%wio time reported is as some virtually meaningless artifact
that's hard to get rid of. Some people still wonder, "After
my I/O was initiated, how long until the interrupt was serviced?"
At best it's an attempt to measure the behavior of the I/O
hardware, but it certainly does not reflect a CPU state. Having
just a few outstanding I/O operations means that any idle CPUs are
potentially able to service the interrupts when they come. However,
only one CPU will handle them when they come, and there is no real
way to correlate the number of processes that are in the queue waiting
for the I/Os.
I can take an SMP with 20 processors, run a single-threaded I/O-bound
process and sar will show about 95% I/O-wait time for the
entire system. It gives the incorrect impression that one thread
issuing I/O requests in a serial manner can tie up 20 processors.
Thus, it's best to ignore the %wio statistic. (There
is a proposed fix for the statistic, but it will still be a while
before that becomes standard.) I don't have enough experience
with non-SPARC MP systems to say whether this statistic is broken
on all platforms, but Adrian said, "I spoke to an HP performance
engineer, and he agreed that it was broken on all MP systems, including
theirs."
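To see the effect described above, compare the system-wide view
from sar with the per-CPU view from mpstat while an I/O-bound job
runs; both are standard Solaris utilities:

```shell
# sar folds idle-but-waiting time into a single system-wide %wio...
sar -u 5 3

# ...while mpstat breaks out usr/sys/wt/idl per CPU, showing that
# only one thread's worth of work is actually outstanding.
mpstat 5 3
```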
3. Using Logging mount Options
Using logging mount options is especially important on any filesystem
where fsck is impractical. Running fsck has limitations in both
the time spent and the impact it has in the case of a failure. Failures
can be costly because they mean recent changes to files may be lost
or even whole files may not be saved in "lost+found".
Logging filesystems treat metadata updates as atomic operations
with an intent, operation, and confirm sequence. This makes the
consistency check trivial; thus it doesn't take much time.
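For example, a UFS filesystem can be mounted with logging enabled
via /etc/vfstab (the device and mount-point names here are
hypothetical):

# /etc/vfstab entry with the logging mount option:
# device to mount   device to fsck      mount point   FS type  pass  at boot  options
/dev/dsk/c1t1d0s6   /dev/rdsk/c1t1d0s6  /export/data  ufs      2     yes      logging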
4. Boosting Blocks Per Inode in Big Filesystems
UNIX filesystems are general purpose, so there are many trade-offs
made upon initializing a filesystem. For large systems running databases,
some benefits can be realized by tuning the newfs parameters.
The defaults are set statically in older versions of Solaris, while
Solaris 8 chooses a value between 2048 and 8192. This is fine for general-purpose
use (e.g., average file size of around 2K to 8K, somewhat dynamic
in size, and capacity demands that grow and shrink regularly). When
creating a filesystem for a modern database computer or fileserver,
change the default to 500K or higher. This greatly reduces the time
to complete the newfs.
There are several options to mkfs and newfs that
are no longer used, are still documented, but have no effect on
modern disks. (One parameter that is ignored is rotdelay.)
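A sketch of such a newfs invocation for a database filesystem (the
device name is hypothetical; -i sets the bytes-per-inode ratio):

```shell
# One inode per ~500 KB instead of one per 2-8 KB; far fewer
# inodes to initialize also makes newfs itself finish much faster.
newfs -i 512000 /dev/rdsk/c2t1d0s6
```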
5. Tuning fsflush
Depending on the amount of memory devoted to filesystem buffering,
the system process, fsflush, may become overloaded. Files
not opened with O_DSYNC have a policy of updating via fsflush.
The job of fsflush is to periodically check all buffers according
to two variables that can be set in the /etc/system file --
autoup and tune_t_fsflushr. See Listing 1.
In kernel versions before 2.8, this scan is the only way for the
system to recover used memory. Creating free memory is especially
important when many processes are consuming memory via malloc,
read(2), or mmap(2)-type reads. The default of 30
seconds for autoup makes the fsflush process too costly
on a big memory system.
To cool off fsflush, you can increase the cycle time by
increasing the autoup value, which will hit "diminishing
returns" above a few minutes. The impact of fsflush
is system-wide, so it is best to make it run for shorter periods
as well. By setting tune_t_fsflushr to a smaller number of
seconds, each pass does proportionally less scanning; running five
times more often than the default means one fifth as much work per pass.
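The two settings above can be sketched in /etc/system (the values
are illustrative, not recommendations; comments in /etc/system
start with an asterisk):

* Stretch the full fsflush cycle from the default 30 seconds to
* 240 seconds, and wake fsflush every second (default 5) so each
* pass scans only a small fraction of memory.
set autoup=240
set tune_t_fsflushr=1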
6. Setting noatime on mount
Sometimes we run applications on filesystems that don't need
the OS-level metadata part of the filesystem. The last accessed
time is simply not going to be used on these systems, but the default
mount option has access time update enabled, which means
every access to the file also has to update this value. This can
make a big hit on performance. Since Solaris 7, the mount
command has an option to relax the update policy on access time.
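For example (hypothetical device and mount point), the relaxed
policy can be requested for a single mount or permanently in
/etc/vfstab:

# One-off mount:
#   mount -o noatime /dev/dsk/c1t0d0s6 /db01
# Or permanently in /etc/vfstab:
/dev/dsk/c1t0d0s6  /dev/rdsk/c1t0d0s6  /db01  ufs  2  yes  noatime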
7. Tuning maxcontig
UFS under the Solaris operating environment uses an extent-like
feature called "clustering". A default setting is not
optimal for all filesystems since it is application-dependent. Many
small files accessed in a random pattern do not need extents, and
performance can suffer for both reads and writes when using extents.
Larger files can benefit from prefetch on reads and improved
allocation units when using extents in writes.
For reads, the extent-like feature is essentially read-ahead.
Tuning the fetchahead algorithm is done simply and dynamically
with the tunefs(1m) command. The value changed is maxcontig,
which sets the number of filesystem blocks read in read-ahead. When
a process reads more than one filesystem block, the kernel will
schedule the reads to fill the rest of maxcontig * filesystem
block-size bytes. One 8K (or smaller) random read on a file will
not trigger fetchahead. fetchahead also will not occur
on files being read with mmap.
The kernel will attempt to automatically detect whether an application
is doing small random or large sequential I/O. This often works
fine, but the definition of "small" or "large"
depends more on system application criteria than device characteristics.
Tuning maxcontig can obtain optimal performance.
With Solaris 2.5.1, the default maxcontig was always 7.
After that, the default changed to a device-specific value, so it
depends on the device and other related values.
Since hardware RAID (A3500 and T3) devices do prefetch,
the benefit of doing it at the filesystem level is diminished, so
it is best to disable or reduce maxcontig for these devices.
When running a transaction-processing database, usually the database
will not use any fetch-ahead blocks, so all blocks pulled in for
the read-ahead will just be adding wasted activity on the disks.
In this case, set maxcontig to 1. Conversely, for large files
accessed in a roughly sequential manner, a boosted maxcontig
with boosted maxphys can improve the transfer efficiency
by increasing the amount of data read once the disk heads are positioned
within a cylinder. The best value will be to have it as large as
a cylinder, around 1 MB, which translates to a maxcontig
of 128 with 8K filesystem blocks.
Values for maxcontig should be chosen based on the likelihood
of getting good data out of blocks near each other on the disk (or
volume) vs. the wasted I/O and (especially) memory to hold those
blocks if they are never used.
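Both cases above can be sketched with tunefs, which changes
maxcontig dynamically on a live filesystem (device names are
hypothetical; -a sets maxcontig):

```shell
# Transaction-processing database on hardware RAID: disable
# filesystem-level read-ahead.
tunefs -a 1 /dev/rdsk/c2t0d0s6

# Large, roughly sequential files: cluster about 1 MB per I/O
# (128 blocks at an 8K filesystem block size).
tunefs -a 128 /dev/rdsk/c3t0d0s6
```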
8. Routing in the Kernel
Using more than one network interface card (NIC) will, by default,
turn on routing in the kernel between those two subnets. By simply
having the file /etc/notrouter, the initialization scripts don't
let this happen:

touch /etc/notrouter
9. Dispatch Table Tuning
The dispatch table is used by the kernel scheduler to make choices
about which process runs when the CPUs are too busy for every process
to have exclusive use of a CPU. From the classic point of view,
a CPU-intensive process is automatically pushed to lower priority,
which is achieved by noting that it is still running when its time
quanta expires. Interactive performance is enhanced by floating
processes to a higher priority when they sleep (waiting for I/O)
before their quanta expires. This classic view is how the default
Sun "Time-share" dispatch table works.
On the other hand, for big database servers, this can result in
a problem where a CPU-intensive portion of the database holds a
data-structure lock as it is running at low priority. To avoid this,
we use higher values for the time quanta, which may be known as
the "Starfire" dispatch table or the "Cray"
dispatch table. It is automatically used in certain cases on E10000
and may be useful on other machines as well. This feature has been
implemented slightly differently on different versions of the operating
environment. In 2.5.1 and 2.6, the E10K has the big quanta table
loaded by default. In 2.7, the E10K has a special initialization
script called "S99bigquanta", which loads on boot. In
the 2.8 kernels, a platform-specific module is loaded, but a script
can also be used.
Any user can execute the command dispadmin -c TS -g to
see what the quanta are. The big quanta tables run from 400 ms to
340 ms, while the normal quanta run from 200 ms to 20 ms. I often make a
new rc3.d script to switch between a fast quanta and a big quanta
via the stop and start argument. This could be useful for a system
using batch at night and interactive during the day.
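A minimal sketch of such an rc3.d script (the table file names are
hypothetical; the tables themselves would be captured beforehand
with dispadmin -g and edited):

```shell
#!/bin/sh
# Hypothetical /etc/rc3.d/S99quanta: switch TS dispatch tables.
# "start" loads a big-quanta table for batch work; "stop"
# restores the previously saved default table.
case "$1" in
start)
        dispadmin -c TS -s /etc/dispadmin.bigquanta ;;
stop)
        dispadmin -c TS -s /etc/dispadmin.default ;;
*)
        echo "Usage: $0 {start|stop}" ;;
esac
```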
10. Fast-Write Enabled Disks on Database Logfiles
As databases scale up, the pressure on the log devices can become
an important issue. The database has to hold up a transaction while
the I/O is in progress. If the log is going to plain-mirrored devices,
the maximum throughput will include considerable wait time as the
disk rotates to the heads. Use fast-write enabled disks on database
logfiles.
11. Console Messages
Don't try to troubleshoot a problem without looking at the
console messages. Many drivers post error information through the
console.
12. The Fastest Filesystem
tmpfs is not always the fastest filesystem. The tmpfs filesystem
type has been optimized for small-file and metadata-intensive access.
Make a real ufs mount if you want fast sequential reads.
12 MB/s is typical for /tmp, while 150 MB/s is easily obtained
on re-reads of cached ufs data.
13. Reboots After a Panic
Reboots due to panic can cause performance problems. Because of
the activity early in the boot to do the savecore, the memory
freelist may become fragmented. This can cause the shared memory,
ISM, not to get full allocation in 4-MB pages. One solution is to
start the database before the savecore or to umount
or direct mount the filesystem where the crash is kept.
14. Disk and Volume Management History
Several plausible methods exist for saving disk and volume management
history. At a minimum, save the current configuration often:
alias newv='mv /etc/vxprint.out /etc/vxprint.out.old;
date > /etc/vxprint.out ; vxprint -thr >> /etc/vxprint.out'
format << EOT > /etc/disks.format.out
15. Copies of Crucial Files
Consider the size of some files versus the critical nature of
customizations when saving crucial files. I keep spare or "before"
versions of /etc/system, partition maps of disks, vfstab,
etc. right on the server in the etc directory:
prtvtoc /dev/rdsk/c1t0d0s2 > /etc/vtoc.ddisk
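A saved partition map can later be pushed back onto a replacement
disk with fmthard (same hypothetical disk as above; slice 2 is the
conventional whole-disk slice):

```shell
# Restore the VTOC captured earlier with prtvtoc.
fmthard -s /etc/vtoc.ddisk /dev/rdsk/c1t0d0s2
```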
16. Full Install on Big Server
There's no point in trying to save a few hundred megabytes
only to discover that you need some package that's only installed
on the "server" option. Do the full install on the big
server.
17. coreadm Policy
You may spend a lot of time finding core files if you haven't
set up a policy with coreadm. The recommendation is to reverse
the default action on saving core files by doing limit coredumpsize
0 in /etc/.login and ulimit -c 0 in /etc/profile. Under Solaris
8, the system policy is more easily managed with the coreadm
command.
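Under Solaris 8, a global core-file policy might be sketched like
this (the repository directory is hypothetical; %f and %p expand
to the executable name and process ID):

```shell
# Put every core file in one directory, named by program and PID,
# and turn the global core-file path on.
mkdir -p /var/cores
coreadm -g /var/cores/core.%f.%p -e global
```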
18. Tail /var/adm/messages
Tail /var/adm/messages when looking for a problem. The error log
delivery subsystem can be managed via the syslogd process.
Messages can be routed based on severity in addition to the default
of going to the logfile and console.
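Severity-based routing is configured in /etc/syslog.conf; a
hypothetical line forwarding serious messages to a central host
(the loghost name is an assumption) could look like:

# Forward alert-level and higher messages to a loghost.
# (Tabs are required between selector and action.)
*.alert	@loghost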
19. Pesky mount Issues
If a mount fails after doing a umount -f, you can reinitialize
the mount point, making a new one: rmdir mntpoint; mkdir
mntpoint
20. The contents File
Use the /var/sadm/install/contents file to be comfortable with
package name and utility name relationships.
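For example, to find which package delivered a given utility (the
path shown is illustrative):

```shell
# Map a utility back to its package: the contents file lists
# every installed path along with its owning package(s).
grep '/usr/sbin/vxprint' /var/sadm/install/contents
```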
21. mkfile Speed
Don't expect mkfile to be fast if you use a small
maxcontig. The blocksize that mkfile uses is 8192.
This is too small for working on very large files without help from
maxcontig. With no option to increase the blocksize, you
must switch to a different utility for large file initialization.
dd works fine like this:

dd if=/dev/zero of=/misc/junk.3 bs=1024k count=100

Compare its elapsed time against mkfile with timex:

timex mkfile 100m /misc/junk/junk.2
Another solution is to temporarily change maxcontig to 32 or
more when using programs that write sequential data in small blocks.
22. Print man Pages
Make pretty print man pages by setting the environment variable
TCAT to /usr/lib/lp/postscript/dpost, then:
man -t whatever | lp
23. Keeping Records
Install and use a permanent recorder of overall system performance
by extending iopass to make permanent records of other performance
and accounting information. Simply make a cron job like this:
0 0 * * * ( IFS="";trap 'kill 0' 0; vmstat 2|& while read -p \
line; do X=`date +%Y.%m.%d.%H:%M:%S`; echo $X $line; \
done) >> /tmp/vmstat.`date +%m%d`
Remember, as with any word-processing software, save early and
save often. I would like to thank Sleepy, Grumpy, Sneezy, and most
of all, Prince.
Bob Larson is an Engineer in the Strategic Applications
Engineering group within Sun Microsystems, which focuses on enterprise
server performance and architecture. He has over 12 years' experience
in performance tuning, application and kernel development, and benchmarks.