Storage Consolidation -- Part 2 -- Product Selection
Peter Baer Galvin
Last month, the Solaris Companion covered the reasons to consider
storage consolidation. A storage evaluation can reveal short- and
long-term savings in cost as well as management time. The needs-analysis
phase leads to a design for a consolidated storage facility. This
month, I'll discuss product selection. There is quite a bit
to consider about storage arrays, SAN switches, and tape drives
and libraries. The wrong choices among those options could cause
serious pain in your SAN down the road.
Product Selection
Our theoretical design from last month calls for a storage SAN
with network attached storage (NAS) interfaces. This design allows
for servers needing maximum performance and dedicated storage to
attach to the storage via SAN switches. Hosts needing to share file
systems, or that are not capable of SAN attachment, can take advantage
of the SAN through the NAS heads. All of the storage is thus consolidated
into a set of RAID arrays that can be allocated to SAN-attached
or NAS-using servers. Further, backups use the SAN switching infrastructure
to allow hosts to copy their data from their allocated storage to
the tape libraries without using the TCP/IP network to carry the
data. Finally, replication of data is done from the production site
to the disaster recovery site via storage array-based or host-based
replication.
As it turns out, the design is one of the easier steps of a storage
consolidation project. It is based on existing storage and future
storage needs. Product selection, on the other hand, is all about
details that demand attention.
Array Selection
The design provides the requirements for storage arrays. Selecting
an array that meets those requirements may seem easy, but when intangibles
such as future maintenance costs and disk drive evolution are weighed,
the choice gets more complicated. Add to that the tangibles of cost
and scalability, and the choices narrow further.
Performance analysis of array choices leads to another quandary.
There are no industry-standard storage benchmarks. Some companies
choose to believe vendor performance information, which of course
conflicts with claims by competing vendors. However, quite a bit
can be learned from the details of the storage array architecture.
Table 1 shows some technical array criteria, with choices sorted
in order of preference.
The choice of spindle size and RAID level is a tricky one, but
the more choices provided by the array, the more you can tune the
array to suit your needs. For example, to maximize performance of
a database, 36-GB drives with RAID 1+0 or 0+1 would be best. For
a data warehouse, 144-GB drives with RAID 5 might be the best choice.
An array that provides a choice of disk sizes and a choice of RAID
levels and allows for them to be intermixed within the array will
provide the best solution. Note that disks can vary in more than
just size. Seek times, latencies, transfer rates, interface (e.g.,
dual-active fibre vs. dual fibre with one active), cost, and even
MTBF can vary along with drive size, even within a single array. All of
those aspects must be considered when comparing arrays.
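To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The spindle counts, drive sizes, and the classic write-penalty rule of thumb are illustrative assumptions, not figures from any particular array:

def raid10_usable(drives, size_gb):
    # RAID 1+0: half of the spindles hold mirror copies
    return drives // 2 * size_gb

def raid5_usable(drives, size_gb):
    # RAID 5: one spindle's worth of capacity holds parity
    return (drives - 1) * size_gb

def write_penalty(raid_level):
    # back-end I/Os generated per host write (classic rule of thumb)
    return {"1+0": 2, "5": 4}[raid_level]

# Database-style group: eight 36-GB spindles, mirrored and striped.
print(raid10_usable(8, 36), "GB usable,", write_penalty("1+0"), "back-end I/Os per write")

# Data-warehouse group: eight 144-GB spindles, parity protected.
print(raid5_usable(8, 144), "GB usable,", write_penalty("5"), "back-end I/Os per write")

With these assumed numbers, the mirrored group yields 144 GB of write-friendly space while the parity group yields 1008 GB at a higher per-write cost, which is exactly the database versus data-warehouse trade-off described above.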
The "paper-based" comparison is a useful exercise, but
it can only tell so much. How manageable is the array? How well
does the software work? Does the actual performance bear out what
is on paper? These questions can only be answered via hands-on experience.
If the choice is down to two or three competing technologies, and
time, effort, and facilities are available, consider performing a
"bake-off" between them. A bake-off can be an effective
means of judging not only performance, but also manageability and
general livability.
Unfortunately, performing such a bake-off is never as easy as
it seems. Benchmarking requires load-generation machines, loaned
or rented SAN infrastructure to test, and space and effort to set
up the test environment. It also requires a controlled environment
and knowledge of systems and storage. For example, if a host is
not rebooted and storage re-initialized between tests, false readings
can result from the use of already-cached data or fragmented file
systems. So, while a bake-off is the most effective tool for judging
performance, consider the time, effort, and cost it can incur.
On the financial side, array (and other technology) costs go beyond
the purchase price. How much is maintenance? How much is software
that you may want to add down the road? How much is additional cache?
Consider expandability as well. If the array can "only"
go to 72-GB drives, how long will it be useful in your environment?
Of course, the cost of downtime should be factored in as well. What
are the single points of failure? What claims does the manufacturer
make for the average reliability of the device? Planned downtime
also has a cost in sys admin time and effort, if not also in money
lost to the company from lost business or lost worker productivity. When
is planned downtime needed (e.g., for LUN creation, configuration
changes, firmware upgrades)? How often will it be needed, and at
what impact?
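A rough multi-year tally helps keep those questions honest. Every figure in the sketch below is a placeholder to be replaced with real quotes and your own downtime estimates:

purchase        = 250_000   # array purchase price (placeholder)
annual_maint    = 30_000    # yearly hardware/software maintenance (placeholder)
later_software  = 40_000    # replication/snapshot license added later (placeholder)
cache_upgrade   = 15_000    # additional cache down the road (placeholder)
planned_outages = 4         # planned downtime events per year (placeholder)
outage_cost     = 5_000     # estimated cost per planned outage (placeholder)

years = 3
total = (purchase
         + annual_maint * years
         + later_software
         + cache_upgrade
         + planned_outages * outage_cost * years)
print(f"{years}-year cost of ownership: ${total:,}")   # $455,000 with these placeholders

Even with placeholder numbers, the recurring items quickly rival the purchase price, which is why they belong in the comparison from the start.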
NAS Selection
Most of the above points pertain to all technology selection,
not just arrays. But for selecting network-attached storage, there
are some other factors to consider.
Currently, most NAS solutions include the compute component and
the disk component. These all-in-one appliances excel in functionality
and ease of management, but are contrary to storage consolidation
efforts. If separate NAS servers are included in a facility, then
they bring another set of spindles to design, configure, manage,
capacity plan, and maintain. And what happens if the disk capacity
of an all-in-one NAS is reached? Another NAS appliance will need
to be added, further fragmenting storage. A recent trend is toward
NAS heads that attach to a SAN rather than including their own spindles.
These devices present network interfaces on the front end and fibre
connections to the SAN on the back end.
Here are some questions to consider when choosing a NAS solution:
- Is it highly available (including at the disk tray and cache
level)?
- How does it perform? The SPEC SFS benchmark results at http://www.spec.org
are a good place to start.
- How many network interfaces are there, and of what type?
- How many fibre interfaces are there?
- What network file systems are supported (SMB and NFS are the
minimum)?
- What naming services are supported (LDAP, NIS, NIS+, DNS, Active
Directory, etc.)?
Switch Selection
Switches are literally at the heart of a SAN. While some of their
features can be quantified, others remain intangible.
Obvious points to consider include how many ports are available
in a given model. 2 Gb is becoming the norm for SAN infrastructure
components, so be sure the switch has 2-Gb ports, auto-senses between
1 Gb and 2 Gb, or can be easily upgraded. Supportability (and support
from the vendors of the other products that are being evaluated)
is an important factor, but more on that next month.
Unfortunately, it is hard to get a grasp on performance differences
between switches. Again, there are vendor claims and counter claims,
but the truth remains elusive, and there are no benchmarks. Looking
into the technical details of the switch and its backplane, or testing
it out under controlled benchmarking conditions is the only recourse.
Tape Drive and Tape Library Selection
Tape libraries can be used for backup, data interchange, and hierarchical
storage management (HSM). Tape drives and tape libraries come in many variations.
Some vendors have created worksheets into which backup parameters
are entered and out of which tape and library suggestions are generated.
Tape libraries can be evaluated based on the number of drives
they can hold and the number of tape cartridge slots they can control.
Also, some number of cleaning cartridges can be loaded into a library
for drive maintenance. Library robotic methods vary, and vendors
will be glad to debate which is best. Floor space should be a consideration,
as some libraries can be huge. "Pass through" is a method
used to automatically move cartridges between multiple robots. "Mailslots"
are externally accessible slots that can be used to import and export
cartridges to and from the library. For instance, a 10-mailslot library allows
10 tapes to be added or removed at once. Large libraries include
some method (scanner or camera) of recognizing cartridge identifiers,
so the library knows exactly which tape is in a particular slot
or drive.
A cautionary note about sharing libraries: since a library has
but one robot control port, only one program is allowed to control
it. If you want to share a library between two programs (say for
backup and HSM), software is needed to play the intermediary. Be
sure to factor this into your SAN design plans.
The choice of drives includes consideration of speeds-and-feeds,
tape capacity, compression availability and effectiveness, and size
of the drive and tapes (how many can fit into a library). Current
common tape-drive choices include LTO, SDLT, and AIT. Some drives
have special features that could drive the decision their way (e.g.,
9840 drives have a WORM mode). Most libraries allow use of multiple
drives and cartridges, so a hybrid approach is quite feasible.
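To see how speeds-and-feeds translate into drive and cartridge counts, a rough sizing calculation is useful. The data volume, backup window, drive rate, compression ratio, and cartridge capacity below are illustrative assumptions:

import math

data_tb        = 6.0     # data to back up per night (assumed)
window_hours   = 8.0     # backup window (assumed)
drive_mb_s     = 30.0    # native drive throughput, MB/s (assumed)
compression    = 1.5     # average compression ratio (assumed)
tape_native_gb = 100.0   # native cartridge capacity, GB (assumed)

effective_mb_s = drive_mb_s * compression
data_mb        = data_tb * 1024 * 1024
per_drive_mb   = window_hours * 3600 * effective_mb_s

drives = math.ceil(data_mb / per_drive_mb)
tapes  = math.ceil(data_tb * 1024 / (tape_native_gb * compression))

print(f"{drives} drives and {tapes} cartridges for a full backup in the window")

With these assumptions, five drives and roughly 41 cartridges are needed; the library's drive bays and slot count then follow from numbers like these rather than from guesswork.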
Another important aspect is the drive's interface. Some libraries
include bridges, so they may have fibre externally to hosts (or
a SAN switch) and SCSI internally to the drives. Bridging is a workable
solution, but native fibre externally and internally reduces failure
points. Check for vendor recommendations on how to attach drives
to hosts. For example, three drives may be allowed per host-SCSI
connection, while only one fibre drive per host connection is supported.
With the advent of SANs, there are now new ways to architect backup/HSM
solutions. Just as a SAN switch can allow hosts to connect to arrays,
the same (or dedicated) switches can allow hosts to attach to tape
drives within libraries. Backup/HSM data can move from the SAN storage,
through the host, to the library, and never hit the network. Dedicated
backup servers are now optional. Software such as Veritas NetBackup
can manage not only the backups but also tape drive allocation.
For example, two hosts could attach to a library via a SAN switch.
Backup software can allocate some number of drives to one server,
perform the backup, release the drives, and allocate them to the
other server. Or both servers could run backups concurrently, each
using half of the drives. For restores, all of the drives could be
allocated to one host to minimize restoration time.
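A quick timing sketch shows why dynamic drive allocation matters. The drive throughput and per-host data volume are illustrative assumptions, and the calculation assumes a host can stream to all of its allocated drives in parallel:

drive_mb_s   = 45.0              # effective (compressed) throughput per drive, MB/s (assumed)
host_data_mb = 2 * 1024 * 1024   # 2 TB of data per host (assumed)

def hours(data_mb, drives):
    # time to move data_mb through the given number of drives in parallel
    return data_mb / (drives * drive_mb_s * 3600)

# Backups: two hosts run concurrently, each allocated two of the four drives.
print(f"concurrent backup: {hours(host_data_mb, 2):.1f} hours per host")

# Restore: all four drives allocated to the single host being restored.
print(f"restore with all four drives: {hours(host_data_mb, 4):.1f} hours")

Under these assumptions, each host finishes its backup in about six and a half hours, and a restore that borrows every drive completes in roughly half that time.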
This "Backup SAN" involves choosing tapes and tape libraries,
and devising methods for backups, restores, and HSM storage. It
can be as complex, and as rewarding, as the storage SAN itself. If you are contemplating
or executing a storage consolidation, be sure to consider a tape
consolidation at the same time.
Replication
Replication of data between sites has the added complication that
it can be accomplished via hardware or software. The choice of product
here is very needs-dependent. It also depends greatly on the choices
made in the areas above (or can drive the choices above). For example,
if an array vendor has an array-to-array replication product that
meets your needs, it could override other areas where the array
is a weaker choice than its competitors.
With replication, there are several factors to consider:
- What distance does the data need to travel? Some solutions
are limited to a small radius, or require expense and complexity
to go beyond that radius.
- How much data needs to move at peak time? The required bandwidth has
a huge impact on replication design (see the sketch following this list).
- Does replication have to occur over dedicated connections (storage
or WAN), or can it use the Internet?
- Is the replication synchronous or asynchronous? Synchronous
replication assures that the data reaches the remote site before
the application gets its transaction acknowledged. Asynchronous
replication can span greater distances, but some data can be lost
if the primary site fails, with the impact depending on the applications involved.
- Does the replication implement "write ordering"?
If write order is preserved, and asynchronous mode is used, then
data may be lost should the primary site fail, but the data will
be consistent. For example, a database would receive its transactions
in order, but some transactions may be lost. This is better than
out-of-order writes in which the database may be very unhappy
with the state of its replicated data.
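As a starting point for the peak-data question above, here is a rough feasibility check. The change rate, link speed, and efficiency factor are illustrative assumptions:

peak_change_gb_hr = 40.0    # data modified at the primary site per hour (assumed)
link_mbit_s       = 100.0   # bandwidth of the replication link (assumed)
link_efficiency   = 0.7     # protocol and replication overhead factor (assumed)

usable_mb_s = link_mbit_s / 8 * link_efficiency
needed_mb_s = peak_change_gb_hr * 1024 / 3600

print(f"need {needed_mb_s:.1f} MB/s at peak, link delivers {usable_mb_s:.1f} MB/s usable")
if needed_mb_s > usable_mb_s:
    print("the replica will fall behind at peak; size the link up or accept the exposure")

With these assumed numbers, the peak change rate outruns the link, so either the link must be upsized or the business must accept a larger window of potential data loss.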
Conclusions and Next Month
Design and product selection are important, but the rubber meets
the road in implementation. Next month, I'll conclude the storage
consolidation project by describing the implementation of the facility.
Implementation is full of details, and that is where the devil lies.
Several of those demons will be described, along with maps of the routes
that avoid them.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column,
and Pete's Super Systems, the systems management column for
Unix Insider (http://www.unixinsider.com). Peter is
coauthor of the Operating System Concepts and Applied
Operating System Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.