Serial Storage Architecture Management

Miles Purdy

Serial Storage Architecture, or SSA, is one of the least understood technologies I manage. SSA is a fast, high-capacity hard disk storage solution available for most UNIX-based and PC-based computers, although it may be most predominant on RS/6000's (a.k.a. pSeries). Many people think that SSA is an IBM proprietary standard; in fact, it is an open standard. I will explain the technology and show examples of how to manage SSA on RS/6000's. I will also describe some benefits of SSA and provide some performance and tuning tips and management scripts.

The Technology Overview

SSA technology is all about loops and devices talking to each other. (See Figure 1.) Every SSA device is called a node. The nodes in SSA adapters are called initiators because they issue commands, and the nodes in SSA disks are called targets because they respond to commands. To create an SSA loop, you start with an SSA adapter -- a PCI card. Each adapter has two loops (A and B) and two physical ports per loop (A1, A2 and B1, B2). Copper cables are commonly used to connect the adapter to an I/O drawer (a.k.a. enclosure). Enclosures contain a maximum of 16 hard drives. You start the loop by cabling from port A1 on the SSA adapter to port 1 on the I/O drawer. If the I/O drawer is full of disks, you complete the loop by cabling from port 16 on the I/O drawer back to port A2 on the adapter. If the drawer is only partly populated, its automatic bypass cards close wherever they detect a disk on either side, closing the loop internally.

The SSA adapters are amazing pieces of technology. The new Advanced SerialRAID Plus adapter (feature code 6230 from IBM) supports RAID 0, 1, 5, and 1+0. It has both a read and a write cache. The fast-write cache, as it is called, buffers many small writes and sends full stripes to the array, greatly improving performance, especially for RAID 5. If the power fails, the adapter has a battery to maintain the fast-write cache, and the pending write operations are committed once power returns. If two hosts, each with an adapter, share common disks and use the fast-write cache -- called two-way fast-write cache -- each adapter maintains a copy of the other's fast-write cache. If one system fails, the adapter in the other host can then commit the write operations. The new adapter supports transfer rates of 40 MB/s per port, for a total of 160 MB/s per adapter.

Copper cables are used to connect SSA devices, although fiber optic extenders are available that can stretch a link up to 10 km. The cables may connect an adapter to an I/O drawer, an adapter to an adapter, or an I/O drawer to an I/O drawer.

I/O drawers are used for large configurations, although a standalone desk-side unit is also available. The I/O drawers are rack mounted, fitting into a standard 19" rack. Each rack can hold up to six I/O drawers. As I mentioned, each I/O drawer can hold up to 16 disks, and each adapter loop can support 48 disks -- 96 disks in total, per adapter. However, for performance, 32 disks per loop is probably best. The disks in the I/O drawer are grouped into four groups of four. You can add disks one at a time, but at this level of storage you should be adding them in multiples of four. In fact, when buying a new drawer, always try to get 16 disks. If you are using more than 4 disks (i.e., 8 or 16 in a single loop), there is no need to cable between ports 4 and 5, for example. The I/O drawer will automatically detect the presence of disks in slots 4 and 5 and close the loop with the automatic bypass card. Early models required a 6" cable between the ports. The I/O drawers feature redundant power and cooling.

The model D40 I/O drawer supports disk capacities of 9.1, 18.2, and 36.4 GB, at speeds of 7200 and 10200 rpm. Any combination of disk capacities and speeds is possible, but not recommended. If you're using RAID, all the disks must be the same or performance will suffer and capacity will be lost. With 96 disks per adapter at 36.4 GB each, a single adapter can access, at most, 3494 GB. The largest standalone RS/6000's (i.e., model S80) can have a maximum of 26 SSA adapters per system, for a maximum capacity of 90844 GB per system (although I've never heard of this being done).

Data travels between devices on the loop, not just between disks and adapters. Thus, there can be many data transfers occurring at the same time, which is called spatial reuse. Data travels in either direction on the loop, thus preventing a single break from isolating any disks. This is one of SSA's biggest benefits. Besides providing fault tolerance, this allows the systems administrator to dynamically add more disks to the system without taking anything offline. By breaking the loop in only one place, you can quickly and safely add another I/O drawer into a processing system. The loops are also full duplex.

Managing SSA

There are many AIX commands to display and manage storage; some of these commands, with SSA examples, are shown in Tables 1-4. For those not familiar with AIX, AIX has two kinds of devices for interfacing with SSA disks: hdisks and pdisks. AIX defines a pdisk device for every physical SSA disk. If you're using just a bunch of disks (JBOD), there will be one logical disk device (hdisk) for every physical disk device; if the disks are configured into RAID arrays, there will be one logical disk device for every RAID array. AIX then defines a container, called a volume group, to manage groups of logical disks. Most operations on storage, such as allocating space, work with logical disks or volume groups.
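
For example, the pdisk-to-hdisk mapping and the volume group layout can be inspected with a few standard AIX commands (the disk and volume group names below are only illustrative):

lsdev -Cc pdisk            # list the physical SSA disks (pdisks)
lsdev -Cc disk             # list the logical disks (hdisks)
ssaxlate -l pdisk0         # show which hdisk this pdisk belongs to
ssaxlate -l hdisk4         # show the pdisks behind an hdisk (several means a RAID array)
lspv                       # list hdisks and the volume groups they belong to
lsvg -l ssavg16Prod1       # list the logical volumes in a volume group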

Our Environment

Our environment consists solely of IBM RS/6000's: an SP2 frame with Silver nodes, two S80s, an F50 for the control workstation, a Magstar L32 tape library, and an SP Switch. One S80 is the production Sybase database server (hostname UNXP); the other is the data warehouse (hostname UNXM) using Sybase IQ, Sybase ASE, and Cognos's Power Mart. One of the Silver nodes is our development environment (hostname UNXD), one is the Tivoli Storage Manager (TSM) server (hostname UNXR), and two are for system test. All of the nodes have some SSA disk attached to them, and we use only SSA for external disks.

The two production servers share four model D40's with 16 x 9.1-GB disks each, and one model 020 with 16 x 9.1-GB disks. Each server has two Advanced SerialRAID Plus SSA adapters (feature code 6230) with the 32-MB fast-write cache option, the 128-MB cache option, and microcode level A400. I recommend the fast-write cache, especially if you're using RAID 5. The 128-MB cache option is recommended for two-way fast-write cache operations.

The development environment contains one recently purchased model D40 enclosure with 16 x 18.2-GB disks, which consolidated three model 020 enclosures that were then moved to system test. The system test environment has 36 x 4.5-GB and 4 x 9.1-GB disks. The TSM server has 8 x 9.1-GB disks and 8 x 18.2-GB disks. The F50 is one of the few RS/6000's that support internal SSA disks, thus not requiring an SSA I/O drawer. It has 10 x 9.1-GB SSA disks.

Currently, all of the external disk drives are server attached, with no storage area network (SAN), yet. Managing a homogeneous environment has not created the problems that a SAN would solve. Although SSA disks can be used to boot the system, I prefer internal SCSI disks for the operating system (i.e., rootvg). All of the RAID levels that are used on all of the machines are implemented in the adapter hardware.

The Production Database Server

The production database server's disks are configured to balance two competing requirements: performance and availability. The disks are configured into two RAID 1+0 arrays and one RAID 1 array. The RAID 1 array has two disks, by definition, with one disk mirroring the other. RAID 1+0 provides both performance (striping) and availability (mirroring). To further increase availability, hot spares (redundant standby disks) are used. RAID 1+0 first makes a copy (mirror) of every disk, and when data is written to the disks, it is striped across all disks in the primary pool. The RAID adapter then automatically makes a copy to the secondary pool. Across the two RAID 1+0 arrays, each primary and secondary pool currently has seven disks plus one hot spare, for a total of 32 disks. The hot spares are configured into preferred pools, meaning that each of the four groups of seven disks has a hot spare specifically assigned to it. This was done to prevent a hot spare from another enclosure from taking over for a failing disk.

The first RAID 1+0 array is physically in the P001 I/O drawer. Disks 1 to 7 are the primary pool, and disks 10 to 16 are the secondary pool. The other RAID 1+0 array is in the P002 I/O drawer. Disks 8 and 9 in each I/O drawer are the hot spares. The RAID 1 array is disks 1 and 16 in the CLR1 I/O drawer. (See Figures 2 and 3.)

Ideally, I would configure the primary pool of the first RAID 1+0 array and the primary pool of the second RAID 1+0 array to be in one I/O drawer (i.e., P001), and the secondary pool of the first RAID 1+0 array and the secondary pool of the second RAID 1+0 to be in the other I/O drawer (i.e., P002). (See Figure 5.) One I/O drawer should be located away from the server, perhaps on another floor, or in another building. If one enclosure fails, I can still have half of each array. Note that RAID arrays cannot cross loops. Therefore, although the primary and secondary pools are in different enclosures, they must be in the same loop. (See Figure 4 for the wiring diagram, and then compare Figure 2 to Figure 4, and Figure 3 to Figure 5.)

There are currently two volume groups -- ssavg16Prod1 and ssavg16Prod2. ssavg16Prod1 has one RAID 1+0 array and the RAID 1 array. ssavg16Prod2 has the other RAID 1+0 array.

After running with only two RAID 1+0 arrays for a couple of months, I noticed a performance problem on one array. It turned out the production Sybase server's tempdb was generating enough reads and writes to saturate the array. I added a RAID 1 array on a different SSA adapter loop just to hold tempdb. The array is disks 1 and 16 in the CLR1 I/O drawer. IBM recommends that half of the disks in a RAID array be closest to one port and the other half closest to the other port in the loop. Implementing this solved the performance problem.

The Data Warehouse

The data warehouse's disks are configured differently from the production database server's. The data warehouse contains mostly read-only data that is regenerated every couple of days. The data is copied from the production database server, so little critical permanent data resides there. The data warehouse also has larger storage requirements than the production database server. To this end, some of the storage is configured as RAID 1+0, and some of it as RAID 5.

The data warehouse has 46 x 9.1-GB disks. The 16 disks in the MIS1 enclosure comprise one RAID 1+0 array. Disks 2-14 in the CLR1 enclosure comprise another RAID 1+0 array. Finally, disks 1-8 and 9-16 comprise two RAID 5 arrays in the MIS enclosure. (See Figure 2.)

Managing storage on the data warehouse is much easier since I reduced the number of volume groups. I previously had four volume groups -- one for each RAID array. I have since migrated all the arrays into one volume group. Once all the arrays were in one volume group, it became very easy to move data around between the different types of RAID arrays. For example, if we see poor performance on one of the RAID 5 arrays for a particular logical volume (LV), we can now migrate that LV to a RAID 1+0 array while the data is being accessed. (This is a feature of the logical volume manager and not SSA directly.) The reverse is also true -- lesser used LVs have been moved to the RAID 5 arrays.
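
For example, moving a busy logical volume from a RAID 5 hdisk to a RAID 1+0 hdisk in the same volume group is a single online command; this is just a sketch with made-up hdisk and LV names:

lspv -l hdisk2                          # see which LVs live on the RAID 5 array
migratepv -l lv_reports hdisk2 hdisk3   # move one LV to the RAID 1+0 array, while it stays in use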

Characteristics of Production Disk Arrangement

Multiple Hosts

In the production system, there are two hosts -- UNXM and UNXP. Any applications that need to be shared are installed on the internal SCSI disks of both hosts. The data to be shared is on the external SSA disks. SSA simplifies sharing data because the disks can be cabled to up to eight hosts. IBM's High Availability/Cluster Multi-Processing (HACMP) software is installed as a cluster on these two nodes. HACMP monitors the physical resources, and in the event of a complete system failure, the other system takes over.

There are some special considerations for attaching SSA disks to more than one host:

  • All the disks and adapters must still form loops, and a loop cannot span more than one pair of ports.
  • In most cases, only one host may have a volume group online at a time.
  • There are special rules for the number of adapters in a loop based on the adapter type, whether you are using RAID, and whether the fast-write cache is used. See the adapter's user manual.
  • Volume groups should not be configured to automatically varyon at system boot, especially if you're using HACMP (see the sketch after this list).
  • If you are using the fast-write cache in a multi-initiator, multi-host loop, the overall throughput of each adapter will be less, because some adapter cycles are required to synchronize with the fast-write cache of the other adapter.
  • Put the disks closest to the host that will normally be using them.
  • Currently, you cannot use RAID 0 arrays in a multi-initiator loop.
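
The following sketch shows what manually moving a shared volume group between two hosts looks like; HACMP automates these steps, and the filesystem and hdisk names here are hypothetical:

chvg -a n ssavg16Prod1            # on both hosts: never varyon automatically at boot

umount /proddata                  # on the host giving up the volume group
varyoffvg ssavg16Prod1

importvg -y ssavg16Prod1 hdisk4   # on the takeover host, first time only
varyonvg ssavg16Prod1             # on the takeover host
mount /proddata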

Each SSA loop has two paths (ports) to each group of disks in the loop, and data can travel bi-directionally around the loops. Thus, a single break in the loop, such as a disk or host failing, will not affect the system.

Redundancy

RAID 1+0 automatically allows for N/2 disks to fail, as long as two copies of the same disk don't fail. With our hot spares, N/2 + 2 disks can fail in each array, as long as two copies of the same disk don't fail before the hot spare disk takes over and the array is rebuilt. RAID 5 allows for one disk to fail -- two with a hot spare, provided the array rebuilds onto the spare before the second failure -- and in a RAID 1 array, one disk can fail.

TSM Server

The disks on the Tivoli Storage Manager (TSM) server host are configured mostly for speed. TSM is our backup software. There is no redundancy except for the TSM database and log. I use one RAID 1 array for the TSM database and log. RAID 0 is used for the TSM disk storage pool and a filesystem that the Sybase database backups reside on. This filesystem is NFS exported and mounted on all the other hosts. This allows all the hosts to easily back up and load our databases. For example, the production database backs up to this directory, then we load the database dump into the development environment from this directory.

Development Server

The development environment is configured to maximize the amount of disk storage while providing some redundancy. To accomplish this, I use RAID 5. For write operations, RAID 5 is at least twice as slow as RAID 1+0 (or non-RAID) disks. We really notice this when we load our production database into the test environment.

How to Configure Volume Groups for Large RAID Arrays

The first problem I encountered when we started using many large SSA RAID arrays was that they would not fit into our existing volume groups. Because SSA is a high-capacity storage system, you will probably need to modify your volume groups or create new ones to contain the large disks.

We started out using many small disks, and thus the physical partition (PP) size of our volume groups (VGs) was 4 MB. A physical partition in AIX is the smallest unit of disk space that can be allocated under the logical volume manager (LVM). Don't confuse this with filesystem space -- filesystems are built on top of logical volumes. In versions of AIX prior to 4.3, there was a limitation of 1016 PPs per physical volume (PV), or logical disk. There was also a limit of 32 physical volumes per volume group.

In AIX 4.3.1 and above, the only limitation is 32512 physical partitions per volume group. The physical partition size must be a power of two, from 1 MB to 1024 MB (see the sidebar "Creating Large Volume Groups").
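
For illustration (the names and sizes are hypothetical, not our settings), a new volume group with 64-MB physical partitions, or a higher partitions-per-disk factor on an existing one, is created like this:

mkvg -s 64 -y ssavgbig hdisk4   # 64-MB PPs keep the PP count per large RAID hdisk low
chvg -t 2 ssavg16Prod1          # allow 2032 PPs per disk, at the cost of a 16-disk limit for the VG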

Performance and Tuning

The first thing you need when trying to increase computer speed is a yardstick -- a measurable event. For us, it is our backups because they make heavy use of the SSA disks. We store all of our performance and tuning data in a Sybase database and use our data warehousing tools to analyze the data.

One of the most important events that we monitor (partly for speed, mostly to ensure they work) is the production database backups. The production Sybase database server backs up to an NFS-mounted filesystem over the SP Switch.

For the backups, the data must be read on the client side, sent to the NFS mount over the SP Switch, and written to the hard drives on the NFS server. This involves tuning the client hard drives, the client's SSA adapters, the client itself, NFS, the NFS server (UNXR), the NFS server's SSA adapters, and the NFS server's hard drives -- an almost impossible task that I'm sure I haven't gotten right yet. The client-side disks are RAID arrays -- RAID 1, 5, or 1+0. The NFS server's disks are RAID 0 arrays.

I use a script that I wrote called iostat_logging (see Listing 3) to monitor disk performance. This dumps the output of the iostat command into a database table every 15 minutes. This is very useful to gauge the disk's performance and view trends over time.
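
Listing 3 has the real script, which loads the numbers into Sybase; the core idea is only a cron-driven capture of iostat output with a timestamp, roughly like this simplified sketch (the paths are made up):

#!/bin/ksh
# iostat_logging (sketch): append one timestamped sample per disk per run.
# Run from cron every 15 minutes; a separate job loads the file into the database.
LOG=/var/adm/iostat_log/$(date +%Y%m%d).log
STAMP=$(date '+%Y-%m-%d %H:%M')

# Two reports, 60 seconds apart; only the second (interval) report is kept,
# because the first report shows averages since boot.
iostat -d 60 2 | awk -v ts="$STAMP" '
    /^Disks:/               { report++; next }
    report == 2 && NF == 6  { print ts, $0 }
' >> "$LOG"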

The production server can send data to the filesystems at about 6-8 MB/s. Also, this is how we get our database backups to tape; TSM will back up the database dumps as regular files.

It took about 135 minutes (about 4 MB/s) to load the production database into the test environment, and it took only 54 minutes (about 11 MB/s) to load the exact same dump into the data warehouse. The database is on two RAID 1+0 arrays in the data warehouse. See the sidebar "Simple Rules for Tuning SSA Devices" for SSA tuning recommendations.

Simple Recommendation for RDBMS

Had we not been using SSA disks, maintaining and expanding our production database server would have been more difficult. In the past five years, we have upgraded the hardware three times. Each time, all that was required to migrate the database was to unplug the SSA disks on one host and plug them into the new system. The production database server grows at a rate of about 10 GB per year, and SSA has allowed us to easily expand with it. The fast-write cache allows database transactions to commit faster, thereby increasing the speed and throughput of the database server. Finally, having the databases attached to multiple hosts allows for almost 100% uptime.

Systems administrators often wonder how they should configure their hard drives to work best with databases. I have been using the following recommendations:

  • Commonly joined tables and tables and their indexes should reside on different disks or arrays and, if possible, different adapters.
  • Use many smaller drives before fewer larger drives.
  • Use the correct RAID level. Different objects should be on different types of RAID. Don't put your log on a RAID 5 array, for example. I use only RAID 1+0 and RAID 1 for our production database.

Tuning the Host

To get the most out of an SSA disk subsystem, the performance and tuning exercise must include tuning the host and, in particular, the virtual memory manager (VMM). The VMM in AIX manages virtual memory, which includes real memory and swap space. The command to check and change the VMM's settings is vmtune. It is found in /usr/samples/kernel/vmtune (if you have installed bos.adt.samples). See the sidebar "Parameters" for parameters that I have changed.
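
As an illustration only (the numbers below are placeholders, not the values from the "Parameters" sidebar), vmtune with no arguments reports the current settings, and its flags change them:

/usr/samples/kernel/vmtune               # show all current VMM settings
/usr/samples/kernel/vmtune -p 10 -P 40   # example: lower minperm/maxperm so file pages
                                         # are stolen before a database's working pages

Because vmtune changes do not survive a reboot, rerun the command from an rc script or /etc/inittab to make them permanent.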

Tuning AIO Servers

AIX uses asynchronous input/output (AIO) servers to perform asynchronous I/O to the disk subsystem. Using AIO can greatly improve SSA performance because the server doesn't have to wait for the I/O to complete. For example, to check or change your settings:

root@unxm:/>smitty aio

Then select "Change / Show Characteristics of Asynchronous I/O". The minimum number of servers is the number of AIO servers started at system boot; the maximum number of servers is the most that will ever run on that machine. When you start tuning, set the maximum to the number of disks performing AIO times 10, and set the minimum to the maximum divided by 2.

To find out how many AIO servers you currently have running, use:

pstat -a | grep -i aios | wc -l

If this number equals your minimum value, you may be able to reduce the number of AIO servers. If it equals your maximum, you may need to increase the number of AIO servers. If it is between your minimum and maximum, no changes should be required.
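
The same attributes can also be changed from the command line; here is a sketch with made-up numbers for a host with 16 disks doing AIO (16 x 10 = 160 maximum, 80 minimum):

chdev -l aio0 -a minservers=80 -a maxservers=160 -P   # -P: record the change, applied at the next reboot
lsattr -El aio0                                       # verify the current settings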

Other Tuning Considerations

Depending on your own situation, tuning the following may improve SSA or overall system performance (a brief sketch follows the list):

  • Sync daemon -- syncd
  • I/O pacing -- high and low water marks
  • schedtune -- /usr/samples/kernel/schedtune
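
For example (these are commonly used starting values, not recommendations specific to our systems), I/O pacing is enabled by setting the high and low water marks on sys0, and the syncd interval is the argument syncd is started with at boot:

chdev -l sys0 -a maxpout=33 -a minpout=24   # enable I/O pacing (0/0 = disabled, the default)
lsattr -El sys0 -a maxpout -a minpout       # check the current water marks
grep syncd /sbin/rc.boot                    # the default "syncd 60" flushes dirty pages every 60 seconds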

Production Database Server Stats

Whenever our production database is backed up, we automatically record statistics from the dump. The graph in Figure 6 shows the aggregate throughput of data from the production database server's hard drives to the NFS mount. The most noticeable feature in Figure 6 is the huge drop in performance from July to September of 2000. Unfortunately, I haven't been able to pinpoint what changed and completely reverse it. If we had not tracked the statistics, we never would have noticed the decrease in performance.

TSM Server Stats

Figures 7 and 8 show examples of the graphs that PowerPlay produces from the iostat_logging data collected on UNXR. They show the Kbytes/s of the hdisks every 15 minutes for the month of February. The values are averaged over every day in the month -- either the average for the 15-minute period or the average for the hour.

In the figures, the ADSM database/log peaks at 0700 and 0000 are when the ADSM database is backed up to tape. The NFS mount is busy from 0100-0230, when the production Sybase database does its backups to the NFS-mounted filesystem. Then the filesystem is backed up to tape between 0300 and 0500. The large peak at 1900 is when UNXM's and UNXD's Sybase databases do their backups. The sustained use of the NFS mount in the morning is the production database being loaded into various test environments. The array is fast enough for the clients, sustaining about 10 MB/s; the bottleneck is probably NFS. Notice that the peak transfer rate is about 22 MB/s, and the sustained rate for the 19th hour was 14 MB/s. The ADSM storage pool is busy at various times during the day when it is backed up to tape, or when nodes' filesystems are backed up or restored.

Scripts

I use several Korn shell and Perl scripts to manage storage and have included some of them here:

List Disks (lst_disks) -- lst_disks gives an hdisk-by-hdisk report (an hdisk may be a RAID array) of physical volumes, volume groups, and logical volumes. With it, you can easily find out which disk an LV is on, how big it is, its intra-disk distribution, and the mount point if it is a filesystem. It will also show how much space is left on a logical disk and the physical partition size of the volume group. (See Listing 1.)
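
Listing 1 has the script itself; the information it aggregates comes from ordinary LVM queries along these lines (the names are examples):

lspv hdisk4            # PP size, used and free PPs for one logical disk
lspv -l hdisk4         # the LVs on that disk, their size, distribution, and mount point
lsvg ssavg16Prod1      # PP size and free space for the whole volume group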

Enclosure Report (encl_to_vg) -- With the new SSA drawers, model D40, you can now identify the drawers with a four-character string of your choosing. This has made physically identifying individual disks much easier than with previous SSA drawers. I wrote this script to give a report of each physical disk in a given drawer. It is also very handy for showing the pdisk-to-hdisk relationship, and it easily shows when an hdisk is a RAID array. In a roundabout way, it also shows which disks may be hot spares. (If you want detailed information on RAID arrays, read the file /usr/ssa/ssaraid/ssaraid.README.) Also included are the microcode level and type of each disk, which is great for making sure that all disks in a RAID array are the same. (See Listing 2.)
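
Listing 2 is the complete script; two of its building blocks are handy on their own when you need to check a disk's microcode level or physically find it in a drawer (the pdisk name is an example):

lscfg -vl pdisk3           # vital product data, including the microcode (ROS) level
ssaidentify -l pdisk3 -y   # flash the identify light on that disk
ssaidentify -l pdisk3 -n   # turn the light back off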

Log iostat information to a Sybase database (run_iostat_logging) -- Because we already had the data warehouse environment set up, I thought it would be neat to store the disk I/O statistics. Thus, I wrote a script that stores iostat information in a Sybase database. (See Listing 3.)

Miles df report (mi_df) -- Because I was fed up with trying to read the standard df command's output, I wrote a little Perl script to fancy up the output. (See Listing 4.)

Final Thoughts

Imagine that your production server is out of disk space, and your 24x7 environment doesn't allow for down time. By breaking your current SSA loop in only one place, you can add another I/O drawer and 16 new 18.2-GB disks. You configure the disks and bring a new RAID array online. If you're using a logical volume manager, you can even spread the load between the two arrays, all while the server and disks stay online and the data can still be read and written. You have just saved the company and your hide.

References

What is SCSI -- http://whatis.techtarget.com/WhatIs_Definition_Page/0,4152,214242,00.html

SSA Support sites -- http://www.hursley.ibm.com/ssa and http://www.ibm.com

"Storage Capacity Planning". Zwieback, Dave D. 1999. Sys Admin. August 1999.

Monitoring and Managing IBM SSA Disk Subsystem -- http://www.redbooks.ibm.com

Advanced SerialRAID Plus Adapter Planning Guide. 2nd Ed., October 2000. IBM.

Waters, F. 1996. AIX Performance and Tuning. New York: Prentice Hall.

"Serial Storage Architecture." Judd, Murfet and Palmer 1996 -- http://www.research.ibm.com/journal/rd/judd/judd.html

PCI Adapter Placement Guide. 10th Ed., February 2000. IBM.

Miles Purdy has a computer science degree from the University of Manitoba and works for the Canadian federal government's department of agriculture (Agriculture and Agri-Food Canada, Farm Income Programs Directorate). He is a system manager, managing a mid-sized RS/6000 environment. His department provides farm financial programs, including income stabilization and disaster assistance, to approximately 250,000 Canadian farmers. He can be contacted at: purdym@fipd.gc.ca.