Building a SAN Backup Solution
Greg Schuweiler
A little more than a year ago, I began searching for a solution to a problem I had yet to personally experience. Many organizations, especially those in the e-business environment, had information that was doubling, and in some cases tripling, in a single year. The systems administrators in those environments watched their backup windows disappear. Cold backups of databases no longer existed. Restoring a critical system after a disaster would be time consuming and complicated.
I work on a team of three UNIX systems administrators at a premier medical clinic, hospital, and research institute located in the Midwest. Our information was not growing at the rates mentioned above, but databases that we had considered static in size started growing. More applications were being developed in-house or purchased to run on UNIX servers. We started looking at upgrading our Fibre Channel Arbitrated Loop (FC-AL) attached RAID arrays, and during that process we also decided to evaluate our backup infrastructure. To do this, we started a UNIX Infrastructure Backup project, which began with an evaluation of all of our current backup processes, both manual and automatic. The next step was to design a system that would:
grow along with disk storage needs
move backups off the public Ethernet
have a centralized point of control
allow rapid disaster recovery
keep the person on-call from being paged in the early morning hours.
After we reviewed what we wanted, or thought we needed, we put together a request for proposals.
About two-thirds of the systems we maintain were being backed up across the network to a shared tape library using OpenVision; the rest were being backed up to a variety of locally attached tape drives: DLT 4000, 8 mm, and 4 mm. One HP 9000 is a large database server with a SCSI-attached StorageTek STK9710 and is also a Legato server. Another large database server is a Sun E4000 with a SCSI-attached ELT 3500 and is one of the Legato clients. The systems were located in three different data centers in three different buildings. One of the data centers is separated from the other two by a distance of 1300 meters, and those two are separated from each other by 470 meters. Backups of some of the databases were starting to exceed twelve hours and to span multiple tapes. Although we personally did not have to manage the tapes, the tape librarians were having difficulty finding and storing tapes in off-site storage. The machines being backed up over the network were split about evenly between 10- and 100-Mbit/sec Ethernet drops. We have a heterogeneous environment consisting of Hewlett-Packard, IBM, SCO, and Sun systems, with HP-UX and Solaris being the predominant platforms.
We investigated both Gigabit Ethernet and a Storage Area Network (SAN). Gigabit Ethernet would let us keep the paradigm of centralized backups to a backup server; the speed would be much faster than our current infrastructure, and we were more familiar with Ethernet than we were with a SAN. Keeping a separate Ethernet infrastructure would keep backup traffic off the public network, but the TCP/IP overhead on the backup server would be high. Some of the increased speed provided by Gigabit Ethernet would be eaten up by this overhead, leaving the actual bandwidth to the tape device less than expected. We decided instead to build a completely separate SAN for our backup infrastructure. We saw some attractive benefits to implementing a LAN-free, SAN-based backup infrastructure. The many-to-many connectivity of Fibre Channel would allow us not only to share tape libraries, but also to share individual tape drives among multiple servers. Spreading the cost of the tape libraries and tape drives across multiple servers would decrease the total cost of backup per server.
Another very important aspect of the SAN for backups was the ability to move data great distances for disaster recovery. Instead of using carbon-based robots to move tapes between data center vaults, we planned to duplicate tapes between data centers automatically. With tapes duplicated between libraries in different data centers, the loss of a data center would not cost us our backups, and it would also prevent having the wrong tape in the wrong place when needed. The SAN would also allow two systems in different data centers to participate in a highly available cluster, providing us with restore and backup continuation in the event of a lost data center. Traffic disruption on the LAN would be minimized by moving the backup traffic to the SAN (some metadata would still travel on the LAN). Our backup windows would be reduced by using 100-MByte/sec Fibre Channel instead of 10/100-Mbit/sec Ethernet networks; even the faster Ethernet tops out at about 12.5 MBytes/sec in theory.
The SAN would also allow centralization and simplification of the management of our backup resources. Part of our decision was also based on the fact that Fibre Channel was not new to us: we were already using it in some highly available clusters for mirroring RAID peripherals between data centers. At the time, all of our SAN environments were isolated loops of two or three machines running as highly available clusters between our three primary data centers. All of these loops used Fibre Channel technology, but they were really proprietary installations, and mixing hardware between vendors was discouraged.
Our first task was to find software that would operate in a multi-vendor SAN. The software obviously would need to work with Fibre Channel-attached storage, both disk and tape. As mentioned earlier, we needed not only to share tape libraries, but also to share the drives in those libraries between systems. Without the ability to share the tape drives on a SAN, there was no reason to continue with the project. We also had to be able to configure the master server software into a highly available package for failover purposes.
Operating-system client support for virtually all UNIX derivatives, including Linux, and coverage for Oracle and Sybase were required. We were looking for a non-proprietary tape format, and we felt that GNU tar would be best. This meant that we might not get as much data on a tape as a proprietary utility would, but we would be able to pull a tape or set of tapes out of a silo and test the restore on an isolated machine, or ship tapes to a vendor when required. An easy-to-use GUI would be a nice feature for manager show-and-tell and for allowing selected users to perform their own restores, but the ability to run the server and client software from the command line was essential. We also wanted to be able to access all logs of backups, media, errors, etc., with Perl, so that we could create our own reports and display them on a UNIX System intranet Web site. Client installation had to be very simple, preferably done from the master server. We were already using one StorageTek STK9710, so the software would need to support it. We also wanted software that was originally written for UNIX. Finally, we wanted quality technical support, a systems engineer available during implementation, and access to a well-informed project engineer for the software. We were very willing to sign non-disclosures; our overall goal was to make the project a success. We decided to use Veritas NetBackup 3.2 Beta for the initial testing.
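To give a flavor of the kind of Perl reporting we had in mind, the sketch below turns a nightly summary log into a simple HTML page for the intranet site. It is only a sketch: the log location, the record format (client, exit status, kilobytes written), and the Web directory are assumptions for illustration, not the actual NetBackup log layout.

  #!/usr/bin/perl -w
  # Sketch only: turn a nightly backup summary into an HTML table.
  # The input format (client status kbytes) and both paths are
  # hypothetical; adjust them to the logs your backup software writes.
  use strict;

  my $log  = "/var/adm/backup/nightly_summary";     # assumed summary log
  my $html = "/opt/www/htdocs/backup_report.html";  # assumed intranet doc root

  open(LOG, "< $log")  or die "cannot open $log: $!\n";
  open(OUT, "> $html") or die "cannot write $html: $!\n";

  print OUT "<html><head><title>Nightly Backup Report</title></head><body>\n";
  print OUT "<h1>Nightly Backup Report</h1>\n<table border=1>\n";
  print OUT "<tr><th>Client</th><th>Status</th><th>KBytes</th></tr>\n";

  my ($ok, $failed) = (0, 0);
  while (<LOG>) {
      chomp;
      next if /^\s*(#|$)/;                  # skip comments and blank lines
      my ($client, $status, $kbytes) = split;
      $status == 0 ? $ok++ : $failed++;
      print OUT "<tr><td>$client</td><td>$status</td><td>$kbytes</td></tr>\n";
  }
  print OUT "</table>\n<p>$ok succeeded, $failed failed.</p>\n</body></html>\n";

  close LOG;
  close OUT;

Run from cron after the nightly schedules finish, a report like this keeps the "did everything back up?" question answerable from a browser rather than from a login to the master server.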
Hardware selection was easier because we had already worked with some of the vendors on other projects and were selecting certain pieces of hardware to upgrade a disk-based SAN. Sun E250s were selected for the backup servers because of the attractive cost and excellent performance of these little boxes. Each was equipped with dual processors, 512 MB of RAM, and two 9-GB disk drives. System software was Solaris 2.6, Veritas File System, Veritas Volume Manager, and a host of home-grown and public utilities and tools we use on each of our systems for monitoring, reporting, and so on. Each machine was also equipped with three JNI 6410 host bus adapters (HBAs), one for the backup SAN and two for the disk SAN, plus an additional 100-Mbit Ethernet card for a heartbeat network. Since one of the E250s would be a failover for the other, they were placed in different data centers, and their internal drives held only the system software and a bootable mirror.
In all cases, our application software and associated databases are placed on SAN-attached RAID 5 storage. For the backup project, NetBackup and its databases were placed on a single LUN mirrored between two SAN-attached disk storage arrays, with failover handled by Veritas Cluster Server 1.1.2. The initial testing and installation targeted an arbitrated loop, with Gadzoox Gibraltar hubs used to connect the associated parts. An HP K420 and a Sun E450 were selected as the test boxes. The HP used an HP HBA, while the E250 and E450 used a Sun, then an Emulex, and finally a JNI HBA. A StorageTek STK9710 with eight Quantum DLT 7000s was selected for the tape library, and four StorageTek Fibre Channel/SCSI bridges were used to make the jump from FC-AL to SCSI: one bridge was used for the robotics, two tape drives were daisy-chained to each of the remaining bridges, and two drives were left idle for initial testing. All of this was connected into a Fibre Channel loop with 50-micron cable.
Once the hardware was connected and powered up, but prior to loading NetBackup, the HP 9000 claimed six Quantum DLT 7000 tape drives without any configuration issues. On the HP, we were able to use the tar and mt commands to write to, read from, and generally fiddle with all six of the tape drives. Configuration on the Sun E250 proved to be time consuming, partly because the Sun drivers at the time were not multi-LUN drivers and would not work with tapes. The Emulex card worked in the beginning, but when some questions arose early in the testing, Emulex technical support refused to answer them unless we paid for a one-year support contract. JNI, however, provided us with excellent technical support and a direct phone number to one of their top engineers. With the JNI card and drivers, the Sun E250 saw all six Quantum DLT 7000 drives and responded to the tar and mt commands just as the HP did. In fact, I could tar files to drive 0 in the tape library from the Sun, rewind the tape, and list a table of contents from the HP, and vice versa. HP-UX required a couple of kernel patches to handle the wide SCSI to the tape and then achieved a backup rate of 4.5 MB/sec, whereas Solaris 2.6 achieved 5.8 MB/sec. The next step was to load NetBackup on the Sun E250 and continue with the testing of the software.
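The cross-host check was done by hand with tar and mt, but it boils down to something like the following sketch, run from the Sun side. The hostname and device files are examples only (a no-rewind Solaris device such as /dev/rmt/0n and an HP-UX device such as /dev/rmt/0mn), and it assumes rsh access between the test boxes.

  #!/usr/bin/perl -w
  # Sketch of the cross-host tape sanity check: write from the Sun,
  # rewind, then read the table of contents back from the HP.
  # Hostname and device paths are examples only.
  use strict;

  my $sun_tape = "/dev/rmt/0n";    # assumed no-rewind device on the Sun
  my $hp_host  = "hpk420";         # hypothetical HP-UX test box
  my $hp_tape  = "/dev/rmt/0mn";   # assumed no-rewind device on the HP

  # Write a small archive to drive 0 from this host, then rewind it.
  system("tar cvf $sun_tape /etc/hosts /etc/group") == 0
      or die "tar write failed\n";
  system("mt -f $sun_tape rewind") == 0
      or die "mt rewind failed\n";

  # Read the table of contents of the same drive from the other host.
  system("rsh $hp_host tar tvf $hp_tape") == 0
      or die "remote tar tvf failed\n";

A check this simple proves that both hosts really are talking to the same physical drive through the bridges, which is the whole point of sharing drives on the loop.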
The initial installation and configuration of NetBackup was straightforward. I configured only four of the tape drives to see how contention would be handled; all four were configured with the Shared Storage Option on the E250, the E450, and the K420. The E450 and K420 were each placed in their own class; I then added four additional classes covering six systems for network backups, set up the schedules, and let everything run. During the next thirty days, various problems were encountered, most of which were corrected by firmware updates to various pieces of hardware in the test environment. A large percentage of the remaining problems were due to poor fibre installation. I also discovered that there are situations in which HP-UX and Solaris should not be on the same Fibre Channel loop.
We have been using NetBackup, the StorageTek STK9710, the E250, and the Fibre Channel/SCSI bridges in production for a few months now. The only Fibre Channel in use is between the E250, one Gadzoox hub, and the bridges; all backups are still being done across the Ethernet. As mentioned earlier, a large percentage of our problems came from poor fibre installation, and we also discovered that it is best to separate Solaris and HP-UX. To resolve these two issues, new fibre is being installed with factory-attached connectors, and Ancor 16-port switches will be used to isolate the three separate HP FC-AL loops, with the Sun systems connected directly to the switches.
When we started our research into building the SAN for backups, many vendors were rushing to market with new Fibre Channel products and putting them together into networks they called Storage Area Networks. What we found is that no single vendor in today's marketplace has all the pieces to put together a complete working SAN for storage, particularly tape storage, and I believe it is unlikely that a single vendor will be able to provide a complete SAN for some time yet. Part of the reason is that no single standard currently exists; instead, a loose set of standards is being tightened as vendors work with each other, with the standards committees, and with their customers. This leads to the biggest obstacle we found in implementing SAN technology today: the lack of interoperability between the pieces of a SAN from various vendors.
It may seem like a lot of money was spent without achieving the desired results. I know that if we were backing up either all HPs or all Suns, the backup SAN would be a total success, and with the addition of Fibre Channel switches, I am confident that we will soon be using the full capabilities of Fibre Channel to back up all of the systems. These systems will be backed up to one STK9710 during the evenings, and the tapes will be duplicated during the day to a remote STK9710. Because reports are generated automatically from NetBackup, a glance at either my email or the UNIX intranet Web site is all it takes to see whether there have been any problems.
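As an example of the glance-at-email part, a small cron job along the lines of the sketch below could mail the previous night's status to the on-call administrator each morning. It assumes the bperror command and its -U -backstat -hoursago options as shipped with our NetBackup release, and the mail alias is hypothetical.

  #!/usr/bin/perl -w
  # Sketch: mail a summary of the last 24 hours of backups to on-call.
  # Assumes NetBackup's bperror command and options; the mail alias
  # is an example only.
  use strict;

  my $bperror = "/usr/openv/netbackup/bin/admincmd/bperror";
  my @report  = `$bperror -U -backstat -hoursago 24`;
  die "bperror failed: $?\n" if $?;

  open(MAIL, "| mailx -s 'Nightly backup status' oncall-unix\@example.com")
      or die "cannot run mailx: $!\n";
  print MAIL @report;
  close MAIL;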
About the Author
Greg (shoe) Schuweiler has worked in the friendly Midwest for the last ten years as a consultant, an embedded software designer, an Oracle DBA, and under a host of other strange titles. He has worked as a UNIX system administrator for the last four years. He can be reached at gshoe@xadd.org.