Volume Management and System Tuning
Henry Newman
This month, I'll explain how and why to tune volume managers.
In my previous columns, I've examined how application I/O through the
C library has a great impact on how much work the system has to do.
In my most recent column, I specifically covered how application I/O
made through system calls impacts the system. The extra work caused by
applications can be reduced or eliminated depending on how the volume
managers are configured. By the same token, a poorly configured volume
can add to I/O overhead in the system and significantly reduce system
performance.
In both cases, the volume manager is a significant part of the equation.
So this month, I will discuss aspects of tuning the system for volume
managers.
System Tuning
One of the keys to tuning a system is gaining an understanding
of the I/O path through the system. Many operating systems, volume
managers, and file systems have limitations on the size of the I/O
requests that can be made to the system. On Solaris, for example,
by default, the largest physical request that can be made is 128
KB. This can be changed by adding the following line to /etc/system:
set maxphys= "value in decimal, octal or hexadecimal"
To set maxphys to 8 MB, you add to /etc/system:
set maxphys=8388608
or
set maxphys=0x800000
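Remember that changes to /etc/system take effect only after a reboot.
As a quick sanity check afterward, you can print the value the running
kernel is actually using; this is just a sketch, assuming mdb is
available (Solaris 8 and later), and it prints maxphys in decimal:
echo "maxphys/D" | mdb -k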
In Solaris, each of the device drivers relating to SCSI I/O, namely
sd (the SCSI device driver), ssd (the Fibre Channel device driver),
and st (the tape device driver), supports an upper limit of 1 MB on
transfer size, even if maxphys in /etc/system is set to a value larger
than 1 MB. Each of these device drivers must be changed if you want
to make I/O requests over 1 MB to devices used by that driver.
Therefore, without changes to the device driver, the largest transfer
that an application can make to or from the system will be 1 MB, even
if you have made changes in /etc/system. Changes to the drivers are
more difficult than changes to /etc/system: mistakes in the
configuration of device drivers can prevent you from seeing the
devices. In Solaris, you make the change for each driver by editing
its configuration file, /kernel/drv/"driver name".conf, where the
driver name is sd, ssd, or st. For each driver type that you want to
change, add the following line:
"driver name"_max_xfer_size="value"
For example, to change the sd driver to allow transfers as large as
16,777,216 bytes (16 MB), add:
sd_max_xfer_size=0x1000000;
For Solaris 8, this change is placed as the last line, after the lines
listing the SCSI targets, but for Solaris 9 it must be the first line
in the file. It is extremely important for the case to be correct: you
cannot use capital letters, including a capital X in the hexadecimal
value. Using capital letters will prevent you from using all devices
for that driver. Also, the line must end with a semicolon.
Equally important to note is that some Fibre Channel HBAs do not
support requests greater than 8 MB. This is usually based on the
DMA size allowed within the Fibre Channel chipset. It is important
to test any changes to ensure that they work as expected. Making
requests larger than the HBA allows can cause the channel to hang
so that no further I/O can be done on that channel. I know that's not
how it should work, but that's how it does work. After
the file system has been configured and the tuning changes have
been made, you can use the dd(1M) and sar(1) commands
to check the average size of an I/O request. This is a final test
to ensure that the system is correctly configured. Run the following
two commands at the same time to verify the system performance:
dd if=/dev/zero of=path_of_filesystem/any_file bs=8192K count=64 &
sar -d 1 5
The dd(1M) command performs I/O with a block size (bs) of 8192
kilobytes (8,388,608 bytes), copying 64 records, or blocks. The sar
-d command shows the resulting disk activity, as shown in the
following example:
# sar -d 1 5
SunOS gandalf 5.8 Generic_108528-07 sun4u 04/16/02
16:59:36   device   %busy   avque   r+w/s   blks/s   avwait   avserv
16:59:37   sd2          0     0.0       4   131076      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
16:59:38   sd2         24     1.8       3    98304     57.9     27.3
           sd3          0     0.0       0        0      0.0      0.0
16:59:39   sd2          1     0.0       4   131184      0.0      6.7
           sd3          0     0.0       0        0      0.0      0.0
16:59:40   sd2          0     0.0       5   163846      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
16:59:41   sd2          0     0.0       3    98406      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
Average    sd2          5     0.4     3.8   124562     53.1     25.6
           sd3          0     0.0       0        0      0.0      0.0
#
To determine the average size of an I/O request, divide the number
of blocks transferred per second (blks/s) by the number of read/write
operations per second (r+w/s). Because sar reports 512-byte blocks,
the resulting value should be approximately 16384 blocks (8192 KB * 2
blocks per KB). (In this example, the average I/O request is about
16390.)
If the average I/O request size is much smaller than this amount,
there might be a kernel error causing a problem. Look for errors
in /var/adm/messages, especially anything relating to maxphys and
"xx"_max_xfer_size.
Tape tuning can be very problematic, so before you change the
st driver to make large requests for tape, you must do the
following:
1. Know that your tape drive can accept a block size over 1 MB, since
requests up to 1 MB are already possible with just the change to
/etc/system.
2. Ensure that the application you are using makes large requests
to the tape drive. Most applications do not make large requests
to the drive, so making changes to st makes no difference.
3. Test before you implement. Ronald Reagan once said, "Trust, but
verify." These are good watchwords for making system-critical changes.
Even if you believe the answers to steps 1 and 2 are correct, run a
test to see whether you are really making large requests, using the
example above or the tape variant sketched below.
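As a rough sketch of such a tape test, assuming a no-rewind tape
device at /dev/rmt/0n (your device name will likely differ), the
following writes 512 MB using 2-MB requests; if the drive or driver
will not accept the larger block size, the write will fail or errors
will show up in /var/adm/messages:
dd if=/dev/zero of=/dev/rmt/0n bs=2048k count=256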
Other operating systems have equivalents to maxphys; on SGI IRIX, for
example, the value is called maxdmasz, and IRIX does not have the
driver limitations that Solaris has. AIX has limitations based on the
file system, but not on the kernel. Linux is likewise limited by the
file system implementation and not by the kernel.
Volume Manager Configuration
The way to obtain optimal performance is to match the application
I/O, the volume manager stripe width, the file system allocations,
and the RAID allocation (which I will be addressing over the next
few columns). This is almost impossible in all but very specific,
controlled application environments. Therefore, tuning for most
application environments becomes a tradeoff: higher performance for
some applications and lower performance for others.
Volume Manager Settings
For file systems that require a volume manager, these are the
two most important areas to consider:
1. The stripe size of the volume.
2. The tunables that control request sizes.
Setting Stripe Values
The most important volume manager setting is the stripe size for
the volume. Stripe values should be set based on the underlying
hardware architecture. Thus, you must understand:
1. The number and type of devices that make up the volume.
a. Things like RAID stripe width (which I will discuss in detail
in a future column).
b. The number of devices on the Fibre Channel or SCSI bus.
2. The speed of each device in the volume.
Setting the stripe value to a large number could improve performance
if files are large, but if files are small, this could hurt performance.
Take the following example:
Number of devices in the stripe group = 4
Average File size = 32KB
If you make your stripe value 2 MB, you would place on average
64 files on each device before writing on the next device. This
might be a good thing if you accessed them sequentially. However,
if you accessed them randomly in groups of 32, you could have performance
problems because all 32 files could sit on a single device. Having
them spread across different devices could improve performance. Often, the
longest time doing I/O to any RAID and/or disk is not the data transfer
but the seek and latency time. (This, too, will be discussed in
great detail in a later column.)
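To tie this to a volume manager command, here is a hedged sketch of
creating a four-column striped volume with a 2-MB stripe unit using
VxVM's vxassist; the disk group name datadg, the volume name datavol,
and the 100-GB size are placeholders for illustration:
vxassist -g datadg make datavol 100g layout=stripe ncol=4 stripeunit=2m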
Additionally, some volume managers, such as Veritas VxVM, have
tunables that limit the size of requests to the volume manager.
In VxVM, the value is called vxio:vol_maxio. By default,
the largest request that can be made to the volume manager without
the system breaking up the I/O requests is 256 KB. To change this
value for Solaris, add the following to /etc/system:
* set the largest request to a volume, before it is broken up, to 16 MB
set vxio:vol_maxio=32768
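As with maxphys, you can check the value the loaded vxio module is
actually using; this sketch assumes mdb is available and the vxio
driver is loaded:
echo 'vxio`vol_maxio/D' | mdb -k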
The value is specified in 512-byte sectors, so the largest value that
can be set is 65536, or 32 MB. Even if you set
maxphys, /kernel/drv/sd.conf, and vxio:vol_maxio
to 32 MB, the largest request that can be made to a SCSI driver currently
is 32 MB - 32 KB - 1 byte. A standards committee is looking at changing
this, along with the largest LUN size supported, which is now 2 TB.
Some vendors (IBM with AIX, for example) have rewritten their drivers
to support larger LUNs.
A number of tradeoffs must be considered when setting volume
manager stripe settings:
1. The number of devices in the volume.
2. The size of the I/O requests from the application(s).
3. Issues with fragmentation and the I/O request size.
Number of Devices
The number of devices within the volume is important to understand.
Considering that most vendors, as well as the standard, support LUNs
of at most 2 TB, and many support only 2-TB file systems, the number
of underlying devices might be as few as eleven 180-GB disk drives.
It is still
important to understand how many devices will be in the stripe group.
For example, having 10 devices with a 32-KB stripe element means
that a full stripe of data is 10*32 KB. Making that stripe size
2 MB instead of 32 KB means that a full stripe is now 20 MB. Having
applications that can write full stripes or, even better, multiple
stripes of data to allow file system readahead will significantly
improve I/O performance. And, if available RAID readahead algorithms
come into play, that's even better. Be aware that there is
a downside, as I will discuss.
Size of I/O Requests
Now for the downside. If you are making small I/O requests and
want to ensure that you write a full stripe, your I/O performance
will drop dramatically as the size of the request drops. Let's
say your application requests 256-KB I/Os (Oracle can be tuned to
make requests this size), and you have 8 devices and a 32-KB stripe
element so you can write full stripes of data. The performance, in
both wall-clock time and CPU overhead, of 32-KB I/Os compared with
256-KB I/Os is hugely different. A number of factors play into this,
including the types of devices, the transport medium (SCSI or FC),
and the file system, but you might easily see a 50% reduction in
megabytes per second and a 20% increase in system overhead. So, as you can see,
it is difficult to provide generalizable tuning suggestions. Also,
to achieve good I/O performance, you must understand the whole I/O
path.
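A rough way to see this effect on your own hardware is to time the
same total amount of data written with the two request sizes. This is
only a sketch; the file system path is a placeholder, and file system
caching will color the results:
timex dd if=/dev/zero of=path_of_filesystem/test_file bs=32k count=65536
timex dd if=/dev/zero of=path_of_filesystem/test_file bs=256k count=8192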
Device Fragmentation
Using small stripe elements can cause the underlying device to
become fragmented if files are not preallocated within the file
system. The tradeoff of using large stripe values instead is that,
for large I/O requests, you do not get the advantage of multiple
devices reading or writing at the same time. In this day and age, a single Fibre Channel RAID
device provides sufficient performance for almost all applications.
So, one of the things I have been suggesting to customers is to
make very large stripe values, instead of making smaller stripe
values to allow multiple devices to participate in delivering the
I/O. For example, many databases use a 2-GB file size and, for those
sites, I have suggested using 2-GB stripe values. That way, each
database file is round-robined to a different device.
Conclusions
Although there are only a few system tuning parameters and a few
volume manager tunables, understanding how they work is critical to
efficient use of the hardware, and thereby to reducing expenditures
on new hardware. Understanding what the applications are doing to
your system is the key to setting the volume stripe size. Equally
important are the file system settings and the RAID tunables, which
will be covered in the next few months.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.