Volume Management and System Tuning
Henry Newman
This month, I'll explain how and why to tune volume managers.
In my previous columns, I've examined how application I/O through the
C library has a great impact on how much work the system has to do.
In my most recent column, I specifically covered how application I/O
made through system calls impacts the system. The extra work caused by
applications can be reduced or eliminated depending on how the volume
managers are configured. By the same token, a poorly configured volume
can add to I/O overhead in the system and significantly reduce system
performance.
In both cases, the volume manager is a significant part of the equation.
So this month, I will discuss aspects of tuning the system for volume
managers.
System Tuning
One of the keys to tuning a system is gaining an understanding
of the I/O path through the system. Many operating systems, volume
managers, and file systems have limitations on the size of the I/O
requests that can be made to the system. On Solaris, for example,
by default, the largest physical request that can be made is 128
KB. This can be changed by adding the following line to /etc/system:
set maxphys= "value in decimal, octal or hexadecimal"
To set maxphys to 8 MB, you add to /etc/system:
set maxphys=8388608
or
set maxphys=0x800000
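Remember that changes to /etc/system take effect only after a reboot.
As a quick sanity check afterward, you can print the value the running
kernel is actually using; this is just a sketch, assuming mdb is
available (Solaris 8 and later), and it prints maxphys in decimal:
echo "maxphys/D" | mdb -k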
In Solaris, each of the device drivers relating to SCSI I/O, namely
sd (the SCSI device driver), ssd (the Fibre Channel device driver),
and st (the tape device driver), supports an upper limit of 1 MB on
transfer size, even if maxphys in /etc/system is set to a value larger
than 1 MB. Each of these device drivers must be changed if you want
to make I/O requests over 1 MB to devices used by that driver.
Therefore, without changes to the device driver, the largest transfer
that an application can make to or from the system will be 1 MB, even
if you have made changes in /etc/system. Changes to the drivers are
more difficult than changes to /etc/system: mistakes in the
configuration of device drivers can prevent you from seeing the
devices. In Solaris, you make the change for each driver by editing
its configuration file, /kernel/drv/"driver name".conf, where the
driver name is sd, ssd, or st. For each driver type that you want to
change, add the following line:
"driver name"_max_xfer_size="value"
For example, to change the sd driver to allow transfers as large as
16,777,216 bytes (16 MB), add:
sd_max_xfer_size=0x1000000;
For Solaris 8, this change is placed as the last line, after the lines
listing the SCSI targets, but for Solaris 9 it must be the first line
in the file. It is extremely important for the case to be correct: you
cannot use capital letters, including a capital X in the hexadecimal
value. Using capital letters will prevent you from using all devices
for that driver. Also, the line must end with a semicolon.
Equally important to note is that some Fibre Channel HBAs do not
support requests greater than 8 MB. This is usually based on the
DMA size allowed within the Fibre Channel chipset. It is important
to test any changes to ensure that they work as expected. Making
requests larger than the HBA allows can cause the channel to hang
so that no further I/O can be done on that channel. I know that's not
how it should work, but that's how it does work. After
the file system has been configured and the tuning changes have
been made, you can use the dd(1M) and sar(1) commands
to check the average size of an I/O request. This is a final test
to ensure that the system is correctly configured. Run the following
two commands at the same time to verify the system performance:
dd if=/dev/zero of=path_of_filesystem/any_file bs=8192K count=64 &
sar -d 1 5
The dd(1M) command performs I/O with a block size (bs) of 8192
kilobytes (8,388,608 bytes), copying 64 records, or blocks. The sar
-d command shows the resulting disk activity, as shown in the
following example:
# sar -d 1 5
SunOS gandalf 5.8 Generic_108528-07 sun4u 04/16/02
16:59:36   device   %busy   avque   r+w/s   blks/s   avwait   avserv
16:59:37   sd2          0     0.0       4   131076      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
16:59:38   sd2         24     1.8       3    98304     57.9     27.3
           sd3          0     0.0       0        0      0.0      0.0
16:59:39   sd2          1     0.0       4   131184      0.0      6.7
           sd3          0     0.0       0        0      0.0      0.0
16:59:40   sd2          0     0.0       5   163846      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
16:59:41   sd2          0     0.0       3    98406      0.0      0.0
           sd3          0     0.0       0        0      0.0      0.0
Average    sd2          5     0.4     3.8   124562     53.1     25.6
           sd3          0     0.0       0        0      0.0      0.0
#
To determine the average size of an I/O request, divide the number
of blocks transferred per second (blks/s) by the number of read/write
operations per second (r+w/s). Because sar reports 512-byte blocks,
the resulting value should be approximately 16384 blocks (8192 KB * 2
blocks per KB). (In this example, the average I/O request is about
16390.)
If the average I/O request size is much smaller than this amount,
there might be a kernel error causing a problem. Look for errors
in /var/adm/messages, especially anything relating to maxphys and
"xx"_max_xfer_size.
Tape tuning can be very problematic, so before you change the
st driver to make large requests for tape, you must do the
following:
1. Know that your tape drive can accept a block size over 1 MB, since
requests up to 1 MB are already possible with just the change to
/etc/system.
2. Ensure that the application you are using makes large requests
to the tape drive. Most applications do not make large requests
to the drive, so making changes to st makes no difference.
3. Test before you implement. Ronald Reagan once said, "Trust, but
verify." These are good watchwords for making system-critical changes.
Even if you believe the answers to steps 1 and 2 are correct, run a
test to see whether you are really making large requests, using the
example above or the tape variant sketched below.
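As a rough sketch of such a tape test, assuming a no-rewind tape
device at /dev/rmt/0n (your device name will likely differ), the
following writes 512 MB using 2-MB requests; if the drive or driver
will not accept the larger block size, the write will fail or errors
will show up in /var/adm/messages:
dd if=/dev/zero of=/dev/rmt/0n bs=2048k count=256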
Other operating systems have equivalents to maxphys; on SGI IRIX, for
example, the value is called maxdmasz, and IRIX does not have the
driver limitations that Solaris has. AIX has limitations based on the
file system, but not on the kernel. Linux is likewise limited by the
file system implementation and not by the kernel.
Volume Manager Configuration
The way to obtain optimal performance is to match the application
I/O, the volume manager stripe width, the file system allocations,
and the RAID allocation (which I will be addressing over the next
few columns). This is almost impossible in all but very specific,
controlled application environments. Therefore, tuning for most
application environments becomes a tradeoff: higher performance for
some applications and lower performance for others.
Volume Manager Settings
For file systems that require a volume manager, these are the
two most important areas to consider:
1. The stripe size of the volume.
2. The tunables that control request sizes.
Setting Stripe Values
The most important volume manager setting is the stripe size for
the volume. Stripe values should be set based on the underlying
hardware architecture. Thus, you must understand:
1. The number and type of devices that make up the volume.
a. Things like RAID stripe width (which I will discuss in detail
in a future column).
b. The number of devices on the Fibre Channel or SCSI bus.
2. The speed of each device in the volume.
Setting the stripe value to a large number could improve performance
if files are large, but if files are small, this could hurt performance.
Take the following example:
Number of devices in the stripe group = 4
Average File size = 32KB
If you make your stripe value 2 MB, you would place on average
64 files on each device before writing on the next device. This
might be a good thing if you accessed them sequentially. However,
if you accessed them randomly in groups of 32, you could have performance
problems because all 32 files could sit on a single device. Having
them spread across different devices could improve performance. Often, the
longest time doing I/O to any RAID and/or disk is not the data transfer
but the seek and latency time. (This, too, will be discussed in
great detail in a later column.)
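To tie this to a volume manager command, here is a hedged sketch of
creating a four-column striped volume with a 2-MB stripe unit using
VxVM's vxassist; the disk group name datadg, the volume name datavol,
and the 100-GB size are placeholders for illustration:
vxassist -g datadg make datavol 100g layout=stripe ncol=4 stripeunit=2m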
Additionally, some volume managers, such as Veritas VxVM, have
tunables that limit the size of requests to the volume manager.
In VxVM, the value is called vxio:vol_maxio. By default,
the largest request that can be made to the volume manager without
the system breaking up the I/O requests is 256 KB. To change this
value for Solaris, add the following to /etc/system:
* set the largest request to a volume, before it is broken up, to 16 MB
set vxio:vol_maxio=32768
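As with maxphys, you can check the value the loaded vxio module is
actually using; this sketch assumes mdb is available and the vxio
driver is loaded:
echo 'vxio`vol_maxio/D' | mdb -k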
The value is specified in 512-byte sectors, so the largest value that
can be set is 65536, or 32 MB. Even if you set
maxphys, /kernel/drv/sd.conf, and vxio:vol_maxio
to 32 MB, the largest request that can be made to a SCSI driver currently
is 32 MB - 32 KB - 1 byte. A standards committee is looking at changing
this, along with the largest LUN size supported, which is now 2 TB.
Some vendors (IBM with AIX, for example) have rewritten their drivers
to support larger LUNs.
A number of tradeoffs must be considered when setting volume
manager stripe settings:
1. The number of devices in the volume.
2. The size of the I/O requests from the application(s).
3. Issues with fragmentation and the I/O request size.
Number of Devices
The number of devices within the volume is important to understand.
Considering that most vendors, as well as the standard, support LUNs
of at most 2 TB, and many support only 2-TB file systems, the number
of underlying devices might be as few as eleven 180-GB disk drives.
It is still
important to understand how many devices will be in the stripe group.
For example, having 10 devices with a 32-KB stripe element means
that a full stripe of data is 10*32 KB. Making that stripe size
2 MB instead of 32 KB means that a full stripe is now 20 MB. Having
applications that can write full stripes or, even better, multiple
stripes of data to allow file system readahead will significantly
improve I/O performance. And, if available RAID readahead algorithms
come into play, that's even better. Be aware that there is
a downside, as I will discuss.
Size of I/O Requests
Now for the downside. If you are making small I/O requests and
want to ensure that you write a full stripe, your I/O performance
will drop dramatically as the size of the request drops. Let's
say your application requests 256-KB I/Os (Oracle can be tuned to
make requests this size), and you have 8 devices and a 32-KB stripe
element so you can write full stripes of data. The performance, in
both wall-clock time and CPU overhead, of 32-KB I/Os compared with
256-KB I/Os is hugely different. A number of factors play into this,
including the types of devices, the transport medium (SCSI or FC),
and the file system, but you might easily see a 50% reduction in
megabytes per second and a 20% increase in system overhead. So, as you can see,
it is difficult to provide generalizable tuning suggestions. Also,
to achieve good I/O performance, you must understand the whole I/O
path.
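A rough way to see this effect on your own hardware is to time the
same total amount of data written with the two request sizes. This is
only a sketch; the file system path is a placeholder, and file system
caching will color the results:
timex dd if=/dev/zero of=path_of_filesystem/test_file bs=32k count=65536
timex dd if=/dev/zero of=path_of_filesystem/test_file bs=256k count=8192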
Device Fragmentation
Using small stripe elements can cause the underlying device to
become fragmented if files are not preallocated within the file
system. The tradeoff of using large stripe values instead is that,
for large I/O requests, you do not get the advantage of multiple
devices reading or writing at the same time. In this day and age, a single Fibre Channel RAID
device provides sufficient performance for almost all applications.
So, one of the things I have been suggesting to customers is to
make very large stripe values, instead of making smaller stripe
values to allow multiple devices to participate in delivering the
I/O. For example, many databases use a 2-GB file size and, for those
sites, I have suggested using 2-GB stripe values. That way, each
database file is round-robined to a different device.
Conclusions
Although there are only a few system tuning parameters and a few
volume manager tunables, understanding how they work is critical to
efficient use of the hardware, and thereby to reducing expenditures
on new hardware. Understanding what the applications are doing to
your system is the key to setting the volume stripe size. Equally
important are the file system settings and the RAID tunables, which
will be covered in the next few months.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.