W. Curtis Preston
Is your commercial backup product prepared for multi-terabyte systems? The commercial backup product industry has come a long way in the past 5 years, and most products have solved the problems that have arisen over time. The particular problem that I explore in this article has been solved once, but is being solved again. The question used to be "How do you get 100's of gigabytes of data spread across numerous systems to some sort of centralized storage device?" Since that question has been answered by a number of products, now the question is "How do you get 100's of gigabytes (or even terabytes) that are all on one system backed up within the backup window?" Luckily, there are some products that can do this. In fact, watching the Legato and Veritas Web sites during the past year was liking watching a virtual race to break the sound barrier. First Legato cited 500 GB/hr, then Veritas cited 900 GB/hr, and now Legato is quoting 1.5 TB/hr.
Superfast Tape Drives
Tape drives have gotten a lot faster. Tape drives can now be purchased that can write faster than the SCSI buses to which they are attached. They can use Ultra-Wide SCSI, large buffers, and can write multiple streams of data to the drive at once. The end result is that there are a few mid-priced drives that can write at 8-10 MB/s. There are even drives that could write at almost 40 MB/s if the data could get there that fast. The difficulty with all these drives is that they must be supplied with an incoming data rate equivalent to their maximum throughput. If they don't receive data at this rate, they will write at a slower rate than anticipated.
All modern drives are meant to write at their optimum speed. If you supply a 10 MB/s drive with an incoming data rate of 10 MB/s, it will write at 10 MB/s. (This is referred to as "streaming" the drive.) However, once you drop below 10 MB/s, the drive will suffer from repositioning overhead. The result is that it will actually write at less than the speed of the incoming data.
Suppose you are using a single-threaded backup process such as dump or tar. That process will encounter four possible bottlenecks before getting to the drive. The first bottleneck is the disk itself. The disk may not be capable of supplying data at the optimal rate. The next bottleneck is the SCSI bus and system to which the disk is connected. It may be busy satisfying other I/O requests, and may not be able to transfer the data at that rate. The next bottleneck is the network. Even on a 100bT network, you won't be able to stream more than one drive that can write at 10MB/s. The final possible bottleneck is the system to which the tape drive is connected. Any one of those four possible bottlenecks could slow the data rate to less than 10 MB/s. If that happens, the drive will not be able to keep up and will become the new bottleneck.
You Need More Than One Thread
Most commercial backup products are not single-threaded. Thus, they are able to read data from several systems and several disks at one time and supply all of these threads to the tape drive simultaneously. The data from these different sources is then "interleaved" onto the tape. The tape drive is thereby supplied with a constant stream of data, so it will write at its maximum speed.
A systems administrator learning about this feature for the first time might become worried, because they don't like the fact that their data is spread out all over the tape. This feature, however, is really the only way to stream a tape drive that can read or write at speeds faster than a disk or a network. Therefore, most commercial backup products work this way. If you want to be able to stream your tape drives, the best you can do is pick one that acts in a way that you dislike the least.
This interleaving of data is also referred to as "parallelism" depending on which vendor you are talking to. There are several processes called data agents (DA for short) reading from disks at one time. Those processes feed their data streams to the parallelism agent (PA) that does the actual job of interleaving the separate streams together into one stream, which is then written to the tape drive.
Interleaving can be done in one of two ways. The first way is to leave the file intact. Each DA would read a file, and then tell the PA that it was ready to send this file. The PA would then read the entire file from the DA and write it to tape as one contiguous section. The PA would then look to see which DA had the next available file ready. This works fine for small files coming from systems that are all the same speed. However, if a DA on a relatively slow system has a large file, it will occupy the PA's time during the transmission of that entire file. Thus, you may not be streaming the tape all the time. This type of file-level parallelism is illustrated in Figure 1.
The second (and most common method) of interleaving or parallelism is "Block-level" or "Record-Level" parallelism (Figure 2). With this method, each file is split into multiple blocks of data. The blocks are then written to the tape in a non-contiguous fashion, interleaved with other blocks of data from other files or systems. It is sometimes called "record-level" parallelism because each block of data ends up being a record of data on the tape. This is the most efficient way, because it means that the data needed by the parallelism agent can be read from any disk on any system that is being backed up by this backup definition. Unlike file-level parallelism described above, no single system or disk can cause a bottleneck.
Note in Figure 2 that each file is split into multiple blocks, labeled "Blk 1," "Blk 2," etc. File one is a large file, which took up many blocks. It was coming from a slow system, which was not able to supply data very fast. File three, on the other hand, came from a pretty fast system. That explains why blocks two and three are together. This also shows why block-level parallelism is so efficient. Faster disk and faster systems will be allowed to stream the data as fast as they want to the tape, while slower systems might just get a block in every so often. The tape, meanwhile, will be streaming along.
Getting a VLDB or VLFS on Tape
This scenario is slightly different from the one described above. The concept of "many to one" involves dealing with the fact that you may have thousands of clients, but only a few tape drives. How do you get five or ten systems to share the same device, so that the devices are constantly streaming and backups are completed in a timely manner?
A situation that calls for a "one to many" type of backup is a single system that could not possibly make it to a single tape drive in one night. Even with a drive that is capable of SCSI speeds of 10 MB/s, that is only 36 GB per hour, or 288 GB in an 8-hour period - assuming the drive is streaming the whole time. What if your system is 2 TB? Also, what if your drive is only capable of 5 MB/s? That would result in only 144 GB in an 8-hour period. There are quite a few 5 MB/s drives backing up systems over 144 GB out there. (By the way, unless you want to spend more than $100,000 per drive, don't start shopping for one of the super-fast drives.)
There's another scenario that calls for a "one to many" type of backup. Many products are able to handle the above scenario, but what if it is a single large file system? Due to modern advancements in file systems, you can have file systems that are a terabyte or more! So now, not only do you need to back up 2 TB of data in one night, but it all resides in one file system!
There is no way to back up a 2 TB system to a 10 MB/s drive in one night. If you had a drive (and bus) capable of supporting 80 MB/s, you could do it (of course that drive would cost about one million dollars) but currently the only way to get that kind of throughput is to use multiple drives simultaneously on several SCSI channels. To make that happen, your backup software must be able to send the one system to many devices simultaneously. Many products are able to do this, but what is important is how they do it.
Do It Yourself Plan
There are two ways to accomplish the "one to many" requirement. The first and most common method is to configure multiple backup definitions, each of which is a subset of the entire system. For example, you would have one backup definition that backs up everything, excluding /home1 and /home2. A second backup definition would back up just /home1 and /home2. You then tell the software to back them up at the same time. That way, you will be using two different devices at the same time - theoretically cutting your backup time in half.
There are a number of problems with this method. The first problem is contention for the backup catalog. In this case, you have two independent processes with no knowledge of each other. Both of them will try to read from and write to the backup catalog for the same host at the same time. Depending on the catalog architecture, this may cause no problems at all, or it may cause a deadlock condition in your catalog resulting in database corruption.
The second problem with this method is that it does not adequately leverage your hardware resources. Assume you were able to define two backup definitions that were exactly the same size. On full backup day, both tape drives will be used exactly the same amount of time, and you have effectively cut your backup time in half. However, most backups that you will perform are incremental, not full backups. Since each file system has a separate purpose, the size of their incremental backups will vary greatly. On any given night, one backup definition may need to back up 5 GB, and the other may need to back up 20 GB. The 5 GB will finish in one fourth the time of the 20 GB, leaving its tape drive unused for the remainder of the backup. If the 20 GB incremental backup were allowed to "share" the other drive, it would finish much faster. However, in this configuration, that would never happen.
The third problem is the ever-changing nature of system configurations. Inevitably, /home2 will grow to such a size that you will create a /home3. If you defined the backup definitions as described above (using the "everything except" method), then the "everything except /home1 and /home2" definition will become larger, and its backup will take much longer. Eventually you will have to define another backup definition to balance the load even further.
The last problem with this method is the human factor. Suppose you don't define the backup definitions as described above. Suppose you have one definition that backs up /, /user, /var. Then you have another that backs up /home1 and /home2. What happens when you add /home3? If someone forgets to tell the person responsible for backups, it won't get backed up at all. WARNING: Never define your backups this way. If you have to split up your backup definitions for whatever reason, you should always have an "everything except" backup definition. That way, when you add a new file system it will automatically get backed up in this backup definition.
There is also the chance for syntax errors. Many backup programs allow you to exclude directories by using regular expressions. Depending on your experience level, though, regular expressions can be very cryptic, confusing, and easily misused. For example, suppose you wanted to exclude /home1 with a regular expression. If your backup product uses regular expressions like grep, you could also exclude /home10 accidentally. (The proper way to exclude only /home1 would be to use the regular expression /home1$).
The "human factor" makes this method a real pain. You must constantly monitor your backup definitions, adjusting them so that they are all the appropriate size. Some will get bigger, and some will get smaller, but they all have to be watched. The more systems there are to monitor, the bigger this pain becomes.
A More Excellent Way
Backup products have come a long way since dump. With dump, you essentially had to say, "backup this file system to that tape drive." With regard to performance, there is more than one problem with this scenario. The first problem is that it is single threaded. As mentioned in the "many to one" scenario, it is highly unlikely that a single-threaded backup process is going to come close to streaming a modern tape drive. Second, it is not intelligent. What if your file system is very large, and you need it to go to two tape drives simultaneously?
This is one of the areas where major advances have been made in the past few years. There are now a few backup products that can be set up as follows. First, you tell the software where your backup devices are. You then tell it the names of your clients. The software then utilizes "many to one" and "one to many" technology to fully utilize each device and ensure that your backups complete as quickly as possible.
Figures 3 through 5 show a logical representation of a combination of the "one to many" and "many to one" concepts. Each disk on the left is meant to logically represent data residing on a single system, although it may be spread across several drives. Each tape on the right is meant to represent one backup device, such as a DLT or 8-mm tape drive. As this is a logical representation, this is not meant to say that these are network tape drives. These could be network tape drives mounted on the backup server or tape drives locally mounted on the backup clients. The "Backup System Control" is a logical representation of the backup system that controls all of these data paths. The data paths, or "threads" are represented by the curved lines (A-Z and 1-6). Notice that there are four clients, each of which is capable of supporting eight threads. There are five backup devices, each of which can support four threads.
What is a Thread?
A thread is a single stream of data coming from a client and going to a backup device. When using dump, there is one thread reading from the disk, and one thread writing to a tape. When using a commercial backup utility, there may be several threads reading from your system at once. The more advanced utilities typically will not spawn more than one thread per physical disk. Some may spawn more than one thread from a logical disk, if the disk is a logical volume composed of several physical volumes. Also, since a single thread may not stream a tape drive at its full speed, the backup product may also send multiple threads to a tape drive at once, "interleaving" them into a constant stream of data fed to the tape drive. This article is trying to explain backup utilities that are able to dynamically allocate threads to the backup devices that need them.
In this scenario, the backup system has been configured to start up to eight threads per backup client and send up to four threads to each backup device. At the beginning of the backup, as illustrated by Figure 3, the backup system attempts to start eight threads per client. (For simplicity's sake, this example assumes that each client has enough physical disks to support that many threads.) The first four threads (A-D) are routed automatically by the backup system to the backup device "Tape A." Threads E-H are sent to "Tape B," I-L to "Tape C," M-P to "Tape D," and threads Q-T are sent to "Tape E."
Since each of the five tape drives can only support four threads apiece, a maximum of 20 threads can run at once. That means that threads U-6 must wait. (When a thread must wait, there is no data traffic for that thread, but they remain in the drawing for illustration purposes.) The system will continue backing up in this way until one of the threads has completed the disk or file system.
Assume for this example that threads B, C, E, and I were backing up file systems that were smaller than the rest, or that this was an incremental backup, so those file systems had very little changed data. In "Time Index 2," (Figure 4) those threads would be completed, leaving Tape A with only two threads, and Tapes B and C with only three. Since they are capable of supporting four threads, the backup system will attempt to find more threads for them. Threads U-X will be started and routed by the backup system to the drives in need of more threads. All four tape drives once again have four threads apiece.
In Figure 5, "Time Index 3," threads A-T have all finished. That leaves two tape drives with no threads, and three tape drives with less than four threads. Once again, the backup system will activate threads Y-6, as represented by the bold lines, and route them to the tape drives in need of threads. However, there are only 12 possible threads at this time, so tape drives D and E will not be needed.
This dynamic method of combining the "many to one" and "one to many" concepts varies slighty, depending on the backup product. This method is very efficient and ensures that all backup devices are spinning at their optimum speed. The only advancement that I could suggest is to allow the number of threads going to a device to be dynamic as well. For example, suppose the backup product knows that a given device is supposed to support 10 MB/s. It would make sure that it had enough threads to do so, or perhaps one extra. Some tape drives might only need one or two threads, if the threads are coming from a very fast system with a good link to the tape drives.
There are only three products that I know of supporting dynamic parallelism: Legato NetWorker, HP's Omniback, and SyncSort's Backup Express. If the size of your data is going to approach anywhere near the numbers mentioned in the beginning of this article, you should be looking at one of these products. Be aware that that all parallelism negatively affects restore speed - some methods more than another. Even so, dynamic parallelism is probably the secret to surviving the ever-increasing size of data in your environment.
About the Author
W. Curtis Preston is a senior consultant at Collective Technologies (http://www.colltech.com), specializing in designing and implementing enterprise backup systems for Fortune 500 companies. This article, his second for SysAdmin, is actually an excerpt from his upcoming O'Reilly book, currently titled Backup & Recovery Handbook. Curtis also speaks about backup and recovery around the world and may be reached at: email@example.com.