Serial Storage Architecture Management
Miles Purdy
Serial Storage Architecture, or SSA, is one of the least understood
technologies I manage. SSA is a fast, high-capacity hard disk storage
solution available for most UNIX-based and PC-based computers, although
it is probably most prevalent on RS/6000's (a.k.a. pSeries). Many
people think that SSA is an IBM proprietary standard; in fact, it
is an open standard. I will explain the technology and show examples
of how to manage SSA on RS/6000's. I will also describe some
benefits of SSA and provide some performance and tuning tips and
management scripts.
The Technology Overview
SSA technology is all about loops and devices talking to each
other. (See Figure 1.) Every SSA device is called a node. The nodes
in SSA adapters are called initiators because they issue commands,
and the nodes in SSA disks are called targets because they respond
to commands. To create an SSA loop, you start with an SSA adapter
-- a PCI card. Each adapter has two loops (A and B) and two
physical ports per loop (A1, A2, and B1, B2). Copper cables are
commonly used to connect the adapter to an I/O drawer (a.k.a. enclosure).
Enclosures contain a maximum of 16 hard drives. You can start the
loop by cabling from port A1 on the SSA adapter to port 1 on the
I/O drawer. If the I/O drawer is full of disks, you can complete
the loop by cabling from port 16 on the I/O drawer back to port
A2 on the adapter, thereby forming a complete loop. Within the drawer,
the automatic bypass cards close when they detect a disk on either
side of them, keeping the loop unbroken.
The SSA adapters are amazing pieces of technology. The new Advanced
Serial RAIDPlus adapter (feature code 6230 from IBM) supports RAID
0, 1, 5, and 1+0. It has both read and write cache. The fast-write
cache, as it is called, buffers many smaller writes into blocks
and sends full stripes to the array, greatly improving performance,
especially for RAID 5. If the power fails, the adapter has a battery
to maintain the fast-write cache, which will commit the write operations
once the power returns. If you have two hosts, each with an adapter
sharing common disks and using the fast-write cache, called two-way
fast write cache, each host can maintain a copy of the other's
fast-write cache. If one system fails, the adapter in the other
host can then commit the write operations. The new adapter supports
transfer rates of 40 MB/s per port, for a total bandwidth of 160 MB/s.
Copper cables are used to connect SSA devices, although fiber
optic extenders are available for distances of up to 10 km. The cables
may connect an adapter to an I/O drawer, an adapter to an adapter,
or an I/O drawer to an I/O drawer.
I/O drawers are used for large configurations, although a standalone
desk-side unit is also available. The I/O drawers are rack mounted,
fitting into a standard 19" rack. Each rack can hold up to
six I/O drawers. As I mentioned, each I/O drawer can hold up to
16 disks, and each adapter loop can support 48 disks -- 96 disks
in total, per adapter. However, for performance, 32 disks per loop
is probably best. The disks in the I/O drawer are grouped into four
groups of four. You can add disks one at a time, but at this level
of storage you should add them in multiples of four. In fact,
when buying a new drawer, always try to get 16 disks. If you are
using more than four disks (i.e., 8 or 16 in a single loop), there is
no need to cable between ports 4 and 5, for example; the I/O drawer
will automatically detect the presence of disks in slots 4 and 5
and close the loop with the automatic bypass card. Early models
required a 6" cable between the ports. The I/O drawers feature
redundant power and cooling.
The model D40 I/O drawer supports disk capacities of 9.1, 18.2,
and 36.4 GB, at speeds of 7200 and 10200 rpm. Any combination of
disk capacities and speeds is possible but not recommended. If
you're using RAID, all the disks must be the same or performance
will suffer and capacity will be lost. With 96 disks per adapter
at 36.4 GB each, a single adapter can access, at most, 3494 GB.
The largest standalone RS/6000's (i.e., model S80) can have
a maximum of 26 SSA adapters per system, for a maximum capacity
of 90844 GB per system (although I've never heard of this being
done).
Data travels between devices on the loop, not just between disks
and adapters. Thus, there can be many data transfers occurring at
the same time, which is called spatial reuse. Data travels in either
direction on the loop, thus preventing a single break from isolating
any disks. This is one of SSA's biggest benefits. Besides providing
fault tolerance, this allows the systems administrator to dynamically
add more disks to the system without taking anything offline. By
breaking the loop in only one place, you can quickly and safely
add another I/O drawer into a processing system. The loops are also
full duplex.
Managing SSA
There are many AIX commands to display and manage storage technology;
some of these commands with SSA examples are shown in Tables 1-4.
For those not familiar with AIX, AIX has two kinds of devices for
interfacing with SSA disks: hdisks and pdisks. AIX defines a pdisk
device for every physical SSA disk. If you're using just a bunch of
disks (JBOD), there will be one logical disk device (hdisk) for every
physical disk device; if the disks are configured into RAID arrays,
there will be one logical disk device for every RAID array. AIX
then defines a container, called a volume group, to manage groups
of logical disks. Most operations on storage, such as allocating
space, work with logical disks or volume groups.
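If you want to explore these relationships on your own system, a few standard AIX commands will show them (the device names below are only examples; your numbering will differ):
lsdev -Cc pdisk        # list the physical SSA disks (pdisks)
lsdev -Cc disk         # list the logical disks (hdisks)
ssaxlate -l hdisk4     # show the pdisk(s) behind an hdisk; a RAID array hdisk lists several
ssaxlate -l pdisk12    # show which hdisk a given pdisk belongs to
lspv                   # show which volume group, if any, owns each hdisk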
Our Environment
Our environment consists solely of IBM RS/6000's: an SP2
frame with Silver nodes, two S80s, an F50 for the control workstation,
a Magstar L32 tape library, and an SP Switch. One S80 is the production
Sybase database server (hostname UNXP); the other is the data warehouse
(hostname UNXM) using Sybase IQ, Sybase ASE, and Cognos's Power
Mart. One of the Silver nodes is our development environment (hostname
UNXD), one is the Tivoli Storage Manager (TSM) server (hostname
UNXR), and two are for system test. All of the nodes have some SSA
disk attached to them, and we use only SSA for external disks.
The two production servers share four model D40's with 16
x 9.1-GB disks each, and one model 020 with 16 x 9.1-GB disks. Each
server has two Advanced Serial RAIDPlus SSA adapters (feature code
6230) with the 32-MB fast-write cache option, 128-MB cache option,
and microcode level A400. I recommend the fast-write cache, especially
if you're using RAID 5. The 128-MB cache option is recommended
for two-way fast-write cache operations.
The development environment contains one recently purchased model
D40 enclosure with 16 x 18.2-GB disks, which consolidated three
model 020 enclosures that were then moved to system test. The system
test environment has 36 x 4.5-GB and 4 x 9.1-GB disks. The TSM server
has 8 x 9.1-GB disks and 8 x 18.2-GB disks. The F50 is one of the
few RS/6000's that support internal SSA disks, thus not requiring
an SSA I/O drawer. It has 10 x 9.1-GB SSA disks.
Currently, all of the external disk drives are server attached,
with no storage area network (SAN), yet. Managing a homogeneous
environment has not created the problems that a SAN would solve.
Although SSA disks can be used to boot the system, I prefer internal
SCSI disks for the operating system (i.e., rootvg). All of
the RAID levels that are used on all of the machines are implemented
in the adapter hardware.
The Production Database Server
The production database server's disks are configured to
balance two competing requirements: performance and availability. The
disks are configured into two RAID 1+0 arrays and one RAID 1 array.
The RAID 1 array has two disks, by definition, with one disk mirroring
the other. RAID 1+0 provides both performance (striping) and availability
(mirroring). To further increase availability, hot spares (or redundant
disks) are used. RAID 1+0 first makes a copy (mirror) of every disk,
and when data is written to the disks, it is striped across all
disks (in the primary pool). The RAID adapter then automatically
makes a copy to the secondary pool. The primary and secondary pools
currently have seven disks each, plus one hot spare disk each, for
a total of 32 disks. The hot spares are configured into preferred
pools, meaning that each of the four groups of seven disks has a
hot spare specifically assigned to it. This was done to prevent
a hot spare from another enclosure from taking over for a failing
disk.
The first RAID 1+0 array is physically in the P001 I/O drawer.
Disks 1 to 7 are the primary pool, and disks 10 to 16 are the secondary
pool. The other RAID 1+0 array is in the P002 I/O drawer. Disks
8 and 9 in each I/O drawer are the hot spares. The RAID 1 array
is disks 1 and 16 in the CLR1 I/O drawer. (See Figures 2 and 3.)
Ideally, I would configure the primary pool of the first RAID
1+0 array and the primary pool of the second RAID 1+0 array to be
in one I/O drawer (i.e., P001), and the secondary pool of the first
RAID 1+0 array and the secondary pool of the second RAID 1+0 to
be in the other I/O drawer (i.e., P002). (See Figure 5.) One I/O
drawer should be located away from the server, perhaps on another
floor, or in another building. If one enclosure fails, I can still
have half of each array. Note that RAID arrays cannot cross loops.
Therefore, although the primary and secondary pools are in different
enclosures, they must be in the same loop. (See Figure 4 for the
wiring diagram, and then compare Figure 2 to Figure 4, and Figure
3 to Figure 5.)
There are currently two volume groups -- ssavg16Prod1
and ssavg16Prod2. ssavg16Prod1 has one RAID 1+0 array
and the RAID 1 array. ssavg16Prod2 has the other RAID 1+0
array.
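To verify how the arrays map to the volume groups, the standard LVM listing commands work well (a quick sketch; only the volume group names are from our configuration):
lsvg -o                   # volume groups that are currently varied on
lsvg -p ssavg16Prod1      # the hdisks (here, RAID arrays) in the volume group
lsvg -l ssavg16Prod1      # the logical volumes defined in the volume group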
After running with only two RAID 1+0 arrays for a couple of months,
I noticed a performance problem on one array. It turned out the
production Sybase server's tempdb was generating enough
reads and writes to saturate the array. I added a RAID 1 array on
a different SSA Adapter loop just to hold tempdb. The array
is disks 1 and 16 in the CLR1 I/O drawer. IBM recommends that half
of the disks in a RAID array be closest to one port in the loop and
the other half closest to the other port. Implementing this has
solved the performance problem.
The Data Warehouse
The data warehouse's disks are configured differently from
the production database server's. The data warehouse contains mostly
read-only data that is regenerated every couple of days. The data
is copied from the production database server, so there is little
critical permanent data residing there. The data warehouse also
has larger storage requirements than that of the production database
server. To this end, some of the storage is configured as RAID 1+0,
and some of it is RAID 5.
The data warehouse has 46 x 9.1-GB disks. The 16 disks in the
MIS1 enclosure comprise one RAID 1+0 array. Disks 2-14 in the CLR1
enclosure comprise another RAID 1+0 array. Finally, disks 1-8 and
9-16 comprise two RAID 5 arrays in the MIS enclosure. (See Figure
2.)
Managing storage on the data warehouse is much easier since I
reduced the number of volume groups. I previously had four volume
groups -- one for each RAID array. I have since migrated all
the arrays into one volume group. Once all the arrays were in one
volume group, it became very easy to move data around between the
different types of RAID arrays. For example, if we see poor performance
on one of the RAID 5 arrays for a particular logical volume (LV),
we can now migrate that LV to a RAID 1+0 array while the data is
being accessed. (This is a feature of the logical volume manager
and not SSA directly.) The reverse is also true -- lesser-used
LVs have been moved to the RAID 5 arrays.
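The move itself is an ordinary LVM operation rather than anything SSA-specific. A minimal sketch, assuming a hypothetical logical volume lv_report living on the RAID 5 array hdisk7 and a RAID 1+0 array at hdisk4:
migratepv -l lv_report hdisk7 hdisk4   # move one LV between arrays while it stays online
lslv -l lv_report                      # confirm which physical volume(s) now hold the LV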
Characteristics of Production Disk Arrangement
Multiple Hosts
In the production system, there are two hosts -- UNXM and
UNXP. Any applications that need to be shared are installed on the
internal SCSI disks of both hosts. The data to be shared is on the
external SSA disks. SSA simplifies sharing data because the disks
can be cabled up to eight hosts. IBM's High Availability/Cluster
Multi-Processing (HACMP) software is installed as a cluster on these
two nodes. HACMP monitors the physical resources and, in the event
of a complete system failure, the other system takes over.
There are some special considerations for attaching SSA disks
to more than one host:
- All the disks and adapters still must form loops on one pair
of ports. Loops cannot cross ports.
- In most cases, only one host may have a volume group online
at a time.
- There are special rules for the number of adapters in a loop
based on the adapter type, whether you are using RAID, and whether
the fast-write cache is used. See the adapter's user manual.
- Volume groups should not be configured to automatically varyon
at system boot, especially if you're using HACMP (see the example
after this list).
- If you are using the fast-write cache in a multi-initiator,
multi-host loop, the overall throughput of each adapter will be
less, because some adapter cycles are required to synchronize
with the fast-write cache of the other adapter.
- Put the disks closest to the host that will normally be using
them.
- Currently, you cannot use RAID 0 arrays in a multi-initiator
loop.
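For example, to keep a shared volume group from varying on automatically at boot (as noted in the list above), you can turn the auto-varyon flag off. The volume group name here is one of ours; any shared volume group would be handled the same way:
chvg -a n ssavg16Prod1             # do not varyon this volume group automatically at boot
lsvg ssavg16Prod1 | grep -i auto   # verify that the AUTO ON field now reads "no"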
Each SSA loop has two paths (ports) to each group of disks in
the loop, and data can travel bi-directionally around the loops.
Thus, a single break in the loop, such as a disk or host failing,
will not affect the system.
Redundancy
RAID 1+0 automatically allows for N/2 disks to fail, as long as
two copies of the same disk don't fail. With our hot spares,
N/2 + 2 disks can fail in each array, as long as two copies of the
same disk don't fail before the hot spare disk takes over and
the array is rebuilt. RAID 5 allows for one disk to fail, two with
a hot spare; in a RAID 1 array, one disk can fail.
TSM Server
The disks on the Tivoli Storage Manager (TSM) server host are
configured mostly for speed. TSM is our backup software. There is
no redundancy except for the TSM database and log. I use one RAID
1 array for the TSM database and log. RAID 0 is used for the TSM
disk storage pool and a filesystem that the Sybase database backups
reside on. This filesystem is NFS exported and mounted on all the
other hosts. This allows all the hosts to easily back up and load
our databases. For example, the production database backs up to
this directory, then we load the database dump into the development
environment from this directory.
Development Server
The development environment is configured to maximize the amount
of the disk storage while providing some redundancy. To accomplish
this, I use RAID 5. RAID 5 write operations are at least twice
as slow as writes to RAID 1+0 (or non-RAID) disks. We
really notice this when we load our production database into the
test environment.
How to Configure Volume Groups for Large RAID Arrays
The first problem I encountered when we started using many large
SSA RAID arrays was that they would not fit into our existing volume
groups. Because SSA is a high-capacity storage system, you will
probably need to modify your volume groups or create new ones to
contain the large disks.
We started out using many small disks, and thus the physical partition
(PP) size of our volume groups (VGs) was 4 MB. A physical partition
in AIX is the smallest unit of disk space that can be allocated
under the logical volume manager (LVM). Don't get this confused
with file system space -- filesystems are built on top of logical
volumes. In versions of AIX prior to 4.3, there was a limitation
of 1016 PPs per physical volume (PV), or logical disk. There was
also a limit of 32 physical volumes per volume group.
In AIX 4.3.1 and above, the only limitation is 32512 physical
partitions per volume group. The physical partition size must be a
power of two (2^n MB), where 1 <= 2^n <= 1024 (see sidebar "Creating
Large Volume Groups").
Performance and Tuning
The first thing you need when trying to increase computer speed
is a yardstick -- a measurable event. For us, it is our backups
because they make heavy use of the SSA disks. We store all of our
performance and tuning data in a Sybase database and use our data
warehousing tools to analyze the data.
One of the most important events that we monitor (partly for speed,
mostly to ensure they work) is the production database backups.
The production Sybase database server backs up to an NFS-mounted
filesystem over the SP Switch.
For the backups, the data must be read on the client side, sent
to the NFS mount over the SP switch, and written to the hard drives
on the NFS server. This involves tuning the client hard drives,
the client's SSA adapters, the client, NFS, the NFS server
(UNXR), the NFS server's SSA adapters, and the NFS server's hard
drives -- an almost impossible task that I'm sure I haven't
gotten right yet. The client-side disks are RAID arrays -- RAID
1, 5, or 1+0. The NFS server's disks are RAID 0 arrays.
I use a script that I wrote called iostat_logging (see
Listing 3) to monitor disk performance. This dumps the output of
the iostat command into a database table every 15 minutes.
This is very useful to gauge the disks' performance and view
trends over time.
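Listing 3 has the real script; the core idea, sketched roughly here with a flat file standing in for the Sybase table, is simply to timestamp iostat output on a regular schedule (for example, from cron every 15 minutes):
#!/bin/ksh
# Rough sketch of iostat_logging: capture disk statistics and append
# them, timestamped, to a log file. The real script loads the rows
# into a Sybase table instead of a file.
LOG=/var/adm/iostat.log          # example path
TS=$(date '+%Y-%m-%d %H:%M')
# Two reports: the first is cumulative since boot, the second covers
# the 60-second sample interval.
iostat -d 60 2 | awk -v ts="$TS" 'NF > 0 { print ts, $0 }' >> $LOG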
The production server can send data to the filesystems at about
6-8 MB/s. Also, this is how we get our database backups to tape;
TSM will back up the database dumps as regular files.
It took about 135 minutes (about 4 MB/s) to load the production
database into the test environment, and it took only 54 minutes
(about 11 MB/s) to load the exact same dump into the data warehouse.
The database is on two RAID 1+0 arrays in the data warehouse. See
the sidebar "Simple Rules for Tuning SSA Devices" for
SSA tuning recommendations.
Simple Recommendations for RDBMS
Had we not been using SSA disks, maintaining and expanding our
production database server would have been more difficult. In the
past five years, we have upgraded the hardware three times. Each
time, all that was required to migrate the database was to unplug
the SSA disks on one host and plug them into the new system. The
production database server grows at a rate of about 10 GB per year,
and SSA has allowed us to easily expand with it. The fast-write
cache allows database transactions to commit faster, thereby increasing
the speed and throughput of the database server. Finally, having
the databases attached to multiple hosts allows for almost 100%
uptime.
Systems administrators often wonder how they should configure
their hard drives to work best with databases. I have been using
the following recommendations:
- Commonly joined tables, as well as tables and their indexes, should
reside on different disks or arrays and, if possible, on different
adapters.
- Use many smaller drives rather than fewer larger drives.
- Use the correct RAID level. Different objects should be on
different types of RAID. Don't put your log on a RAID 5 array,
for example. I use only RAID 1+0 and RAID 1 for our production
database.
Tuning the Host
To get the most out of an SSA disk subsystem, the performance
and tuning exercise must include tuning the host and, in particular,
the virtual memory manager (VMM). The VMM in AIX manages virtual
memory, which includes real memory and swap space. The command to
check and change the VMM's settings is vmtune. It is
found in /usr/samples/kernel/vmtune (if you have installed
bos.adt.samples). See the sidebar "Parameters"
for parameters that I have changed.
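The sidebar lists the actual values I use; the mechanics look like this (the numbers shown are examples only, not recommendations):
/usr/samples/kernel/vmtune               # with no arguments, displays the current settings
/usr/samples/kernel/vmtune -p 10 -P 40   # example: lower minperm/maxperm so the file cache
                                         # competes less with the database for real memory
# vmtune changes do not survive a reboot; rerun them from an rc script.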
Tuning AIO Servers
AIX uses asynchronous input/output (AIO) servers to perform asynchronous
I/O to the disk subsystem. Using AIO can greatly improve SSA performance
because the server doesn't have to wait for the I/O to complete.
For example, to check or change your settings:
root@unxm:/>smitty aio
Select Change / Show Characteristics of Asynchronous I/O
The minimum number of servers is the number of AIO servers that get
started at system boot; the maximum number of servers is the maximum
number of servers that will run on that machine. As a starting point
for tuning, set the maximum to ten times the number of disks performing
AIO, and set the minimum to half the maximum.
To find out how many AIO servers you currently have running, use:
pstat -a | grep -i aios | wc -l
If this number equals your minimum value, you may be able to reduce
the number of AIO servers. If it equals your maximum, you may need
to increase the number of AIO servers. If it is between your minimum
and maximum, no changes should be required.
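The same settings can also be checked and changed from the command line; for instance, for a host with 40 disks doing AIO, the rule of thumb above works out as follows (example values):
lsattr -El aio0                                        # show the current minservers/maxservers
chdev -l aio0 -a maxservers=400 -a minservers=200 -P   # 40 disks x 10, and half of that;
                                                       # -P applies the change at the next boot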
Other Tuning Considerations
Depending on your own situation, tuning the following may improve
SSA or overall system performance:
- Sync daemon -- syncd
- I/O pacing -- high and low water marks
- schedtune -- /usr/samples/kernel/schedtune
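As one example, I/O pacing is controlled by the high- and low-water marks on the sys0 device; the values below are often given as a starting point in IBM documentation, not a recommendation for your workload:
lsattr -El sys0 -a maxpout -a minpout    # 0/0 means I/O pacing is disabled
chdev -l sys0 -a maxpout=33 -a minpout=24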
Production Database Server Stats
Whenever our production database is backed up, we automatically
record statistics from the dump. The graph in Figure 6 shows the
aggregate throughput of data from the production database server's
hard drives to the NFS mount. The most noticeable feature in Figure
6 is the huge drop in performance from July to September of 2000.
Unfortunately, I haven't been able to pinpoint what changed
and completely reverse it. If we had not tracked the statistics,
we never would have noticed the decrease in performance.
TSM Server Stats
Figures 7 and 8 are examples of graphs that PowerPlay produces from
the iostat_logging data on UNXR. They show the Kbytes/s of
the hdisks every 15 minutes for the month of February. The values
are averages over every day in the month -- either the average
for the 15-minute period or the average for the hour.
In the figures, the ADSM database/log peak at 0700 and 0000 is
when the ADSM database is backed up to tape. The NFS mount is busy
from 0100-0230 when the production Sybase database does its backups
to the NFS mounted filesystem. Then the filesystem is backed up
to tape between 0300 and 0500. The large peak at 1900 is when UNXM's
and UNXD's Sybase database do their backups. The sustained
use of the NFS mount in the morning is the production database being
loaded into various test environments. The array is fast enough
for the clients, sustaining about 10 MB/s. The bottleneck is probably
NFS. Notice the peak transfer rate is about 22 MB/s, and the sustained
rate for the 19th hour was 14 MB/s. The ADSM storage pool is busy
at various times during the day when it is backed up to tape, or
when nodes' filesystems are backed up or restored.
Scripts
I use several Korn shell and Perl scripts to manage storage and
have included some of them here:
List Disks (lst_disks) -- lst_disks gives an
hdisk-by-hdisk (which may be a RAID array) report of physical volumes,
volume groups, and logical volumes. With it, you can easily find
out which disk an LV is on, how big it is, its intra-disk distribution,
and the mount point if it is a filesystem. It will show how much
space is left on a logical disk and the physical partition size
of the volume group. (See Listing 1.)
Enclosure Report (encl_to_vg) -- With the new SSA drawers,
model D40, you can now identify the drawers with a four-character
string of your choosing. This has made physically identifying individual
disks much easier than with previous SSA drawers. I wrote this script
to give a report of each physical disk in a given drawer. It is
also very handy for showing the pdisk-to-hdisk relationship, and
easily shows when an hdisk is a RAID array. In a roundabout way,
it also shows what disks may be hot spares. (If you want to get
detailed information on RAID arrays, read the file /usr/ssa/ssaraid/ssaraid.README.)
Also included are the microcode level and the type of each disk.
This is great for making sure that all disks in a
RAID array are the same. (See Listing 2.)
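Listing 2 has the full script; a few of the underlying commands give a feel for how it works (the disk names are examples):
ssaxlate -l pdisk3        # map a physical disk to its hdisk (or RAID array hdisk)
lscfg -vl pdisk3          # vital product data, including the microcode (ROS) level
ssaidentify -l pdisk3 -y  # flash the disk's amber light to find it in the drawer
ssaidentify -l pdisk3 -n  # turn the light off again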
Log iostat information to a Sybase database (run_iostat_logging)
-- Because we already had the data warehouse environment set
up, I thought it would be neat to store the disk I/O statistics.
Thus, I wrote a script that stores iostat information in a Sybase
database. (See Listing 3.)
Miles df report (mi_df) -- Because I was fed
up with trying to read the standard df command's output,
I wrote a little Perl script to fancy up the output. (See Listing
4.)
Final Thoughts
Imagine that your production server is out of disk space, and
your 24x7 environment doesn't allow for down time. By breaking
your current SSA loop in only one place, you can add another I/O
drawer with 16 new 18.2-GB disks. You configure the disks and bring
a new RAID array online. If you're using a logical volume manager,
you can even spread the load between the two arrays -- all while
the server and disks remain online and the data can still be read
and written. You have just saved the company and your hide.
References
What is SCSI -- http://whatis.techtarget.com/WhatIs_Definition_Page/0,4152,214242,00.html
SSA support sites -- http://www.hursley.ibm.com/ssa and http://www.ibm.com
Zwieback, Dave D. 1999. "Storage Capacity Planning." Sys Admin, August 1999.
Monitoring and Managing IBM SSA Disk Subsystem -- http://www.redbooks.ibm.com
IBM. Advanced SerialRAID Plus Adapter Planning Guide. 2nd ed., October 2000.
Waters, F. 1996. AIX Performance and Tuning. New York: Prentice Hall.
Judd, Murfet, and Palmer. 1996. "Serial Storage Architecture." http://www.research.ibm.com/journal/rd/judd/judd.html
IBM. PCI Adapter Placement Guide. 10th ed., February 2000.
Miles Purdy has a computer science degree from the University
of Manitoba and works for the Canadian Federal Government's
department of Agriculture (Agriculture and Agri-Food Canada, Farm
Income Programs Directorate division). He is a system manager, managing
a mid-sized RS/6000 environment. His department provides farm financial
programs to approximately 250,000 Canadian farmers and provides income
stabilization and disaster assistance. He can be contacted at: purdym@fipd.gc.ca.