Comparing Technologies for Long-Term, High-Capacity Archives

Packey P. Velleca

Introduction

Every information manager needs to provide, to some extent, backups or long-term archives of user data. Many commercial solutions are available for creating long-term archives, and they vary widely in cost and efficiency. The problem of selecting an archive technology is compounded by the differing requirements of individual sites, so that no one system is the most effective for all sites.

This article is intended to aid in determining the most cost-effective, commercial technology for creating long-term (5+ years), high-capacity (100-1,000 Gigabytes) archives. The article is organized in three parts: Part 1 discusses the relative merits of ten technologies, and presents a tool to aid in determining your system requirements. Part 2 describes a cost analysis model, along with its assumptions and limitations, used to evaluate the different technologies with respect to the chosen requirements. Part 3 walks through a real example, and presents a summary of the results projected by the model.

By examining your specific requirements and running an analysis with the model presented here, you will be better able to choose the most cost-effective system for your site.

Part 1: Technologies and Requirements

Technologies Considered

In this section I briefly discuss 10 readily available, easily integrated, UNIX-based technologies, with respect to technique, capacity, throughput, and features. Wherever capacity and throughput are cited throughout this article, they refer to uncompressed (native) mode, as not all technologies support hardware compression. This provides a consistent framework for technical evaluation. Compression will be considered in the section on cost effectiveness.

Also note that throughput numbers here are those reported by drive manufacturers, and typically represent the maximum sustained user data read/write rate. The overhead associated with writing user data, such as writing filemarks, will reduce overall effective throughput. This study will assume for the sake of simplicity that user data consists of one large file that is nearly the same size as the capacity of the media. In this way, the overhead associated with writing filemarks is minimized, and optimum read/write rates can be compared.

1. 1/2" Digital Linear Tape

Digital Linear Tape (DLT) gets its name from the fact that it writes data in tracks laid parallel to the direction of tape travel. Two tracks are written at once, and when the END-OF-MEDIA (EOM) is reached, the transport reverses the tape direction and writing continues back toward BEGINNING-OF-MEDIA (BOM). The Model 2000 drives currently can store 10.0 Gb native on a 4x4x1-inch cartridge, with a throughput of about 1.25 Mb/s sustained. Model 4000 drives can store 20 Gb native per cartridge and have a throughput of about 1.5 Mb/s. Systems are available in capacity from a single drive, single tape, to multi-tape libraries containing up to 5, 7, 14, 28, 36, 48, 50, 60, 360, 480, 900 (!) tapes and 1-20 drives (50 Gb to 9 Tb with Model 2000 drives). DLT drives support hardware compression. The Model 2000 tapes are readable by the Model 4000 drives, but not vice versa. DLT can have a relatively long tape and head life because when reading/writing stops, there is no relative motion between the head and the tape.

2. 8mm Tape

This technology uses helical scanning to write tracks that are diagonal with respect to tape motion, resulting in high track densities, and thus high data density. Older Model 8205 drives can store 3.5 Gb at 0.26 Mb/s. Current Model 8505 drives can store 7.0 Gb native on a tape, with a throughput of about 0.5 Mb/s sustained. Future drives (4Q '95?) may be capable of storing 20.0 Gb native per tape at 6.0 Mb/s, but these are not available at this time. Cartridge size is about 4.8x3.3x0.6 inches. Systems are available in capacity from a single drive, single tape, to multi-tape libraries containing up to 10, 20, 40, 48, 60, 80, 120 tapes and 1-6 drives (70 Gb to 840 Gb with Model 8505 drives). 8mm drives generally support hardware compression. Head and tape life are comparatively short due to the higher rate of relative motion of the read/write head to the tape, even when no read/writes are being performed.

3. 4mm Tape

This drive is similar in recording technique to 8mm, with a smaller cartridge of 3x2x0.5 inches. It has the same form factor and data format as Digital Audio Tape (DAT). The current DDS-2 drives are capable of storing 4Gb native at 0.5 Mb/s sustained. Systems are available in capacity from a single drive, single tape, to multi-tape libraries containing up to 10, 20, 40, 48, 60 tapes and 1-6 drives (40 Gb to 240 Gb with Model DDS-2 drives). There are proposed standards for DDS-3 and DDS-4 formats, with respectively larger capacities and throughputs, but these drives are unavailable at this time. 4mm drives generally support hardware compression.

4. Magneto-Optical Disc

Magneto-Optical (MO) technology uses a laser head to write/read data on write-once-read-many (WORM) or write-many-read-many (WMRM) 5.5x6x0.4-inch discs. Shelf life of this media is very high, exceeding 30 years. Whether the media are WORM or WMRM is often a matter of a software switch, since rewritable media can act as WORM. Discs in this study are 5.25 inches, and range in capacity from 1.1 to 2.0 Gb. Throughput is generally around 2.0 Mb/s sustained. Systems are available in capacity from a single drive, single disc, to multidisc libraries containing up to 16, 32, 48, 60, 88, 144, 180, 250, 500, 1000 discs and 1-8 drives (21 Gb to 1,370 Gb). These systems can be mounted as filesystems. Most MO drives have 1.3 Gb capacity, and can be formatted to either 512 or 1,024 bytes per sector, which is useful for different filesystems. Note that smaller sector sizes have more overhead, and thus slightly less usable capacity, than larger sector sizes. The Hitachi Model 152 drives are unusual because they can store over 2.0 Gb per disc. MO drives do not support hardware compression.

5. Compact Disc

Compact Disc Recordable (CD-R) technology uses the 5x5x0.1-inch disc and a laser head to create 0.68 Gb WORM discs. Throughput can reach 0.6 Mb/s. Systems are available in capacity from a single drive, single disc, to multidisc libraries containing 50 or 100 discs and 1 or more drives (34 Gb to 68 Gb). These systems can also be mounted as read-only filesystems. Some systems permit the jukeboxes to be daisy-chained, allowing up to 2,200 discs per system. Most drives are capable of reading multiple formats: ISO-9660, Whitebook, Yellowbook, Orangebook, Greenbook, and Redbook. CD-R drives generally do not support hardware compression.

6. 19mm (DD-1, DD-2) Tape

19mm tape systems use a magnetic head to record data on 8.1x1.3-inch cartridges of widths up to about 14 inches, with a throughput from 0.6 to 50.0 Mb/s sustained. Recording methods vary from helical scan to transverse scan, depending on the manufacturer. Transverse recording has tracks laid out perpendicular to tape travel, resulting in short, densely packed tracks. Capacity for a single cartridge varies from 25 to 790 Gb, depending on the tape length. Many of these drives have variable recording speeds, so that different record rates are supported. Systems are available in capacity from a single drive, single tape, to multitape libraries containing up to 7 tapes and 1 drive (up to 1,155 Gb). These drives are often used in military and space programs, and are very expensive. Some units can create logical partitions on tape, allowing near-direct access to any partition by using tape indexing.

These drives also typically have a higher Bit Error Rate (BER) -- usually from 10E-10 to 10E-15 -- than the other technologies covered here. 19mm drives generally do not support hardware compression.

7. 1/2" (VHS-Style) Tape

This technology uses a VHS-style transport to digitally record data on a 7.3x4x1-inch VHS-style cartridge. The recording method is helical scan, as with the 8mm format. Cartridge capacity varies from 14.5 to 21.1 Gb native, depending on tape length. Throughput is about 2.0 Mb/s sustained. Systems are available in capacity from a single drive, single tape, to multitape libraries containing up to 48 or 600 tapes and 1-6 drives (1,013 Gb to 12.7 Tb). These drives also typically have a higher Bit Error Rate (BER) -- usually about 10E-13 -- than the other technologies covered here. These systems can have near-direct file access via tape indexing, and do not have to be rewound to allow access to a file. VHS-style drives support hardware compression. Unlike the other technologies considered, which use metal-particle tapes, VHS-style tapes generally use metal-oxide coatings.

8. 1/2" (IBM 3590) Tape

The IBM 3590 uses a 16-track interleaved serpentine recording method, much like DLT. These drives currently can store 10.0 Gb native on a XxYxZ-inch cartridge, with a throughput of about 9.0 Mb/s sustained. Systems are available in capacity as multitape libraries containing up to 10 tapes and 1 drive (100 Gb). These drives can also read 3480/3490/3490E cartridges, and they support hardware compression.

9. 1/2" (Beta-style) Tape

This technology uses the professional BetaCam-style transport to digitally record data magnetically on a BetaCam-style XxYxZ-inch or XxYxZ cartridge. The recording method is helical scan, as with the 8mm format. Cartridge capacity varies from 12.0 to 42.0 Gb native, depending on tape length. Throughput is about 12.0 Mb/s sustained. Systems are available in capacity from a single drive, single tape, to multitape libraries containing up to 25, 35, 50, or 70 tapes and 1 drive (600 Gb to 14,700 Gb). These drives do not support hardware compression.

10. Magnetic Disk

Traditional Winchester removable magnetic disks configured in an array (e.g., RAID 0) can provide very high sustained throughput (20 Mb/s and higher) and moderately high capacity, depending on the configuration. These systems can be mounted as filesystems. The media on these systems have a relatively short shelf life of five to seven years due to the complexity of the drive. This approach is less than optimal for long-term archiving because the media are actually removable hard disk drives: they are written to once and put on a shelf, and new drives must be purchased and installed for each archive. For this reason, the cost of operation is 100X higher than the most cost-effective technology evaluated here. With the ready availability of high-speed, high-capacity tape drives, there is no reason to use magnetic disks for long-term storage.

Media Considerations

For magnetic tape, there are generally two types of coatings: metal-oxide (e.g., ferric oxide) and metal-particle. Metal-oxide coatings contain particles made up of iron (Fe) and oxygen (O), whereas metal-particle coatings contain particles made only of iron. Because metal-particle coatings pack more iron atoms per unit volume, they can store more magnetic energy per unit volume. However, metal particles are unstable when exposed to oxygen, so the tape must be coated with a thin protective layer that slightly reduces its magnetic storage capacity.

Tape life expectancy is a function of many variables, including temperature, humidity, level of usage, and head and transport condition. Generally, metal-oxide tapes will have a longer life expectancy than metal-particle ones, as metal particles oxidize slowly over time, reducing retentivity. Table 1 summarizes an estimation of life expectancies performed by the National Media Laboratory [1], shown by tape type as a function of temperature and humidity.

Compression Considerations

Many drives use some form of Lempel-Ziv compression implemented in hardware. This adaptive algorithm replaces long strings of data with corresponding (and much shorter) codewords from a dictionary. It is called adaptive because the dictionary is built from the data being compressed. Typical compression ratios range from 1.7:1 for binary data to 6.9:1 for bitmaps and some image types. ASCII data and databases average about 3.4:1. Using compression increases the effective single-cartridge capacity and throughput rate, thus increasing the overall cost effectiveness of the system.
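
To make the effect concrete, here is a minimal sketch of how compression scales the native figures. The 8mm Model 8505 numbers (7.0 Gb, 0.5 Mb/s) are taken from the discussion above; the 2:1 ratio is an assumption about the data, since actual ratios vary widely.

    # Estimate effective capacity and throughput under hardware
    # compression. Native figures are the 8mm Model 8505 numbers
    # quoted above; the 2:1 compression ratio is an assumed,
    # typical figure -- actual ratios depend entirely on the data.

    def effective_capacity(native_gb, ratio):
        """Usable Gb per cartridge with compression enabled."""
        return native_gb * ratio

    def effective_throughput(native_mb_s, ratio):
        """Sustained user-data rate in Mb/s with compression enabled."""
        return native_mb_s * ratio

    ratio = 2.0                                   # assumed 2:1 compression
    print(effective_capacity(7.0, ratio))         # 14.0 Gb per 8mm cartridge
    print(effective_throughput(0.5, ratio))       # 1.0 Mb/s sustained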

Relative Merits

Each technology can be ranked relative to the others according to several variables that affect cost, throughput, capacity, and data availability. Table 2 compares the technologies with respect to these and other variables. This chart can help you decide how realistic your requirements are with respect to your budget.

Part 2: Cost Analysis Model

Description of the Model

Once you understand the storage technologies available, the next task is to identify the requirements for your archival system and create a model to determine the most cost-effective system that meets those requirements. Capacity, throughput, and total cost are the most important aspects for large data sets. The idea is to maximize capacity and throughput and minimize cost. Total cost is the most complex aspect to quantify, as it is a function of many variables that may differ from site to site. This model identifies the most important aspects of total cost, and attempts to create a useful framework for comparing the costs of operating different systems.

Constants used within the model (e.g., Administrative Manhour Cost, Archive Baby-sitting Time) are not necessarily actuals; instead, they help create a framework by which different technologies can be compared relative to one another. The dollar figures represented by these costs may not match your actuals, but they provide a close approximation.

Assumptions

The following assumptions underlie the working model. The model attempts to represent all the costs associated with purchasing, initializing, writing, storing, and restoring regular archives of a data set.

  • The data set generated twice each month is 300.0 Gb. This means that twice each month, 300.0 Gb of new user data is generated, and needs to be archived. This helps model the materials and manpower needed to create an archive.

  • The data set is restored from archive media once each month. This helps model the manpower needed to restore user data.

  • System prices are approximations of the retail price for a single quantity. All prices listed here were actual vendor quotes, but do not necessarily reflect the cost you will pay. As it turns out, the model shows that cumulative storage cost per gigabyte is relatively insensitive to moderate fluctuations in Initial System Cost.

  • System throughputs are matched as closely to 2.0 Mb/s as possible. Most drive technologies have very dissimilar sustained read/write throughputs, but because most systems can use a library with a media changer/picker and multiple drives, it is simple to create systems with closely matched throughput even though one drive may be 4X faster than another (see the sizing sketch following this list). Keep in mind that a four-drive system may use more SCSI IDs than a one-drive system, and may require special application software.

  • System capacities are as close to 300.0 Gb as possible -- that is, a 900 Gb system and a 250 Gb system were not used for comparison in this model unless there was no other configuration available. Most technologies allow libraries to be daisy-chained, or expanded, and each was configured to support 300 Gb as efficiently as possible.
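
The sizing arithmetic behind these two assumptions is simple. Here is a minimal sketch, assuming drives can be striped in parallel (as the model does) and that library slots are the only capacity limit; the example figures are the 8mm Model 8505 numbers from Part 1.

    import math

    # Size a library to the model's targets: ~2.0 Mb/s sustained
    # throughput and a 300 Gb data set.

    TARGET_RATE_MB_S = 2.0
    DATA_SET_GB = 300.0

    def drives_needed(drive_rate_mb_s):
        """Drives to stripe in parallel to reach the target rate."""
        return math.ceil(TARGET_RATE_MB_S / drive_rate_mb_s)

    def slots_needed(cartridge_gb):
        """Cartridges (library slots) to hold one full data set."""
        return math.ceil(DATA_SET_GB / cartridge_gb)

    # Example: 8mm Model 8505 (0.5 Mb/s, 7.0 Gb native)
    print(drives_needed(0.5))    # 4 drives
    print(slots_needed(7.0))     # 43 cartridges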

Factors That Incur Cost

The following variables have short-term and long-term costs, and were considered the most important in this model (a sketch following this list shows how they combine):

Initial System Cost -- The retail cost of the system hardware and any bundled application or device software, exclusive of media, hardware maintenance, and software support.

Single Media Cost -- The retail cost for one cartridge or disc, in moderate quantity.

Compression Factor Used -- For the purpose of this analysis, a 2:1 compression ratio. This ratio was chosen for those systems that support it, as it is realistically attainable with typical data sets. The use of hardware compression within the system can significantly reduce the Media Cost per Archive, Extended Media Storage Cost, Extended Baby-sitting Cost, and Extended Manpower Media Loading Cost.

Extended Media Storage Cost -- The cost of long-term storage of media, expressed as a function of the Number of Media per Archive, Single Media Volume, and the Physical Space Cost. The more media required per archive (i.e., the lower the Media Data Density), the more physical storage space will be needed.

Extended Baby-sitting Cost -- The manpower cost of attending a backup and restore. This is expressed as a function of the Time to Write One Archive, the Administrative Manhour Cost, Archive Baby-sitting Time, and Effective System Throughput.

Extended Manpower Media Loading Cost -- The manpower cost of handling media for backup and restore. This is a function of the Number of Media per Archive, the time to load the media into the system (including Media Formatting Time, if applicable), and the Administrative Manhour Cost. Media have to be unwrapped, labeled, loaded, unloaded, and stacked for each backup and restore.

Extended Total Cost per Gb -- The sum of all the costs associated with creating an archive, divided by the size of the archive, in Gb. This metric is used in the final analysis for determining the most cost-effective archive technology.
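
The factors above reduce to straightforward arithmetic. What follows is a minimal sketch of the model's major cost terms, not the author's actual spreadsheet; every constant in it (manhour rate, space cost, baby-sitting fraction, per-media handling time) is an illustrative assumption.

    import math

    # Simplified rendering of the model's major cost factors. Every
    # constant below is an illustrative assumption, not a figure
    # from this article.

    MANHOUR_COST = 25.0        # Administrative Manhour Cost, $/hr (assumed)
    SPACE_COST = 2.0           # Physical Space Cost, $/cu ft/yr (assumed)
    BABYSIT_MIN_PER_HR = 10.0  # Archive Baby-sitting Time, min/hr (assumed)
    LOAD_MIN_PER_MEDIA = 3.0   # handling time per media, minutes (assumed)

    def media_per_archive(data_set_gb, eff_capacity_gb):
        """Number of Media per Archive (always round up)."""
        return math.ceil(data_set_gb / eff_capacity_gb)

    def write_hours(data_set_gb, eff_rate_mb_s):
        """Time to Write One Archive, in hours."""
        return data_set_gb * 1024.0 / (eff_rate_mb_s * 3600.0)

    def babysit_cost(hours):
        """Baby-sitting cost for one backup or restore."""
        return hours * (BABYSIT_MIN_PER_HR / 60.0) * MANHOUR_COST

    def loading_cost(n_media):
        """Manpower cost to unwrap, label, load, unload, and stack."""
        return n_media * (LOAD_MIN_PER_MEDIA / 60.0) * MANHOUR_COST

    def storage_cost_per_year(n_media, media_vol_cu_in):
        """Yearly cost to store one archive's media (1728 cu in/cu ft)."""
        return n_media * media_vol_cu_in / 1728.0 * SPACE_COST

    def cost_per_gb(total_cost, data_set_gb, n_archives):
        """Extended Total Cost per Gb -- the final comparison metric."""
        return total_cost / (data_set_gb * n_archives)

In the full model, these per-operation costs are multiplied out over the Archives/Restores per Month and the Time of Extended Cost, then divided by the total data written to yield the comparison metric.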

The following variables are important and have associated long-term costs, but were non-deterministic owing to a lack of reliable information. The costs associated with them were therefore assumed to be the same for each technology.

System Unscheduled Maintenance -- The cost to repair failed LRUs (line replaceable units) within the system. This is a function of MTBF and MTTR, material cost, effective downtime cost, and manpower rate. Most vendors of COTS systems do not provide reliable MTBF/MTTR data per MIL-STD-217, or any other industry standard, making comparisons useless. This cost can be very significant over time, depending on the design and construction quality of the system. For example, the read/write assembly life cost factor was not considered due to the unavailability of accurate, consistent vendor data. It is a factor for tape technologies, but not for disks. It is known that new tapes are much more abrasive than older, burnished tapes; this can affect head wear by as much as 100 percent.

Single File Restoration -- The cost associated with restoring a single file from the entire archive. Since the software that will be used to create the archive, and thus the method of file access, is not known, this cost cannot be reliably computed. Instead, this study naively assumes that the entire archive will be restored, and this cost is included in the Baby-sitting Cost (above).

Also considered, but found to be negligible in terms of long-term and short-term cost, were:

System Space Cost -- The cost of housing the system. This is a function of system physical size and anticipated facility space cost, such as housing, cooling, etc.

System Power Usage Cost -- The cost of providing power to the system. This is a function of average power consumption and power rates.

Extended System Maintenance Cost -- The cost of operating and maintaining the system over time. This cost is a function of the Time to Write One Archive, the number of Archives/Restores per Month, the Machine Clean Rate, the Materials Cost per Cleaning, the Manhours per Cleaning, and the Administrative Manhour Cost.

The following factors have an associated cost, but are relatively less important than the other factors. Also, because these costs can be more complex to model, they were assumed to be the same for each technology.

Media Shelf Life -- It is assumed the media will be stored in a temperature-controlled facility. In this case, both metal-oxide and metal-particle tape-coating technologies can approach a shelf life of 15-30 years. This analysis assumes all media will have approximately the same shelf life. There is also a cost associated with re-recording data when the media reaches its life expectancy, but this is difficult to determine because of the long lead times; in 10-15 years, an entirely new technology will most likely have replaced the current one.

Media Life -- For rewritable media, this trade study assumes that all media will be written to only once, but may be read many times. It also assumes that the number of reads is much less than the vendor's expected media life. The worst case here is 4mm tape, with a media life of 1,500 writes/reads.

Bit Error Rate -- The BER for 8mm, 4mm, and DLT is 10E5 better than the other technologies. However, no cost has been associated with BER for this analysis.

Improved Technology -- This refers to cost avoidance related to providing throughput and capacity upgrades. Many COTS systems have very good upgrade paths, such as 8mm with the forthcoming Mammoth drive. Some systems have modularly expandable libraries, while others do not. But overall, most systems have at least a 2X capability for throughput and capacity expansion, so this factor is considered equal among technologies.

Finally, the other constants and variables used in this model:

Data Set Size -- The size, in Gb, of one data set to be written or restored per operation.

Time of Extended Cost -- The number of years to run the model.

Archives/Restores Per Month -- The number of data sets written/restored each month.

Administrative Manhour Cost -- The cost per hour for a system administrator or operator.

Physical Space Cost -- The cost per cubic foot per year for an environmentally controlled facility for long-term storage of media.

Archive Baby-sitting Time -- A constant expressing the number of minutes per hour, on average, an operator would have to devote to the system during a backup or restore, for the duration of the operation.

Single Media Native Capacity -- The native (uncompressed) storage capacity, in Gb, of a single media.

Support Compression? -- Whether or not the system is capable of providing hardware-level data compression.

Effective Media Capacity -- The amount of user data a single media can store. If compression is turned on in the model, this is a function of Single Media Native Capacity and Compression Factor Used.

System Media Capacity -- The number of media a system can use at one time.

System Total Data Capacity -- The largest data set size the system can archive or restore at one time.

Effective Media Cost per Gb -- A function of Single Media Cost and Effective Media Capacity.

Number of Media per Archive -- The number of media required to write one data set. This is a function of Data Set Size and Effective Media Capacity.

Media Cost per Archive -- The cost for media to write one data set. This is a function of Number of Media per Archive and Single Media Cost.

Extended Media Cost -- The long-term cost of purchasing media. This is a function of Media Cost per Archive and the number of Archives per Month.

Single Media Volume -- The size in cubic inches of a single media, with carrier or cover, if applicable. This size represents the volume that will be used when the media is placed in long-term storage.

Media Data Density -- The number of Gb that can be stored per cubic inch. This is a function of Single Media Volume and Effective Media Capacity.

Extended Media Storage Space -- The physical space required to store the media. This is a function of Number of Media per Archive, Single Media Volume, and the number of Archives per Month.

Extended Media Storage Cost -- The cost associated with the Extended Media Storage Space and the Physical Space Cost.

Native System Throughput -- The sustained throughput, in Mb per second, for reading and writing. This is vendor supplied. For systems with more than one drive, it assumes that some means is employed to use all drives in parallel (for example, striping via hardware or software).

Effective System Throughput -- The throughput of the system as a function of Native System Throughput and Compression Factor Used, if applicable. Using compression effectively increases throughput.

Time to Write One Data Set -- The time in hours to write one Data Set. This is a function of the Data Set Size and the Effective System Throughput.
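For example, at an effective 2.0 Mb/s, a 300 Gb data set takes (300 x 1024 Mb) / (2.0 Mb/s x 3600 s/hr), or about 42.7 hours, to write.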

System Volume -- The physical size of the system, in cubic feet.

System Power Usage -- The number of watts the system uses in a typical read or write operation.

Manhours per Cleaning -- The time it takes an operator to clean the system drives, if applicable.

Hours Between Cleaning -- The number of operational hours between preventive maintenance cleanings.

Materials Cost per Cleaning -- The cost of the materials needed to perform the system cleaning.

Media Formatting Time -- The time required for an operator to format media in preparation for writing the data set.

Extended Total Cost -- The cost of operating the system for the long term. This is the sum of all costs in the model.

Extended Total Cost per Data Set -- The total cost for each data set. This is a function of Extended Total Cost and the number of archives performed.

Part 3: An Example Analysis

Example Requirement Definition

This section presents an example of an analysis with the model optimized for finding the most cost-effective system. Cost effectiveness is defined as the lowest cost per Gb for a specified period of operation. Other models can easily be defined by concentrating on other aspects of interest, such as throughput or capacity.

The model was run for 1-, 2-, 5-, 8-, 10-, and 15-year periods, with and without compression. Data Set Size was 300 Gb; Archives per Month was 2; Restores per Month was 1; and Compression Factor Used was toggled between 1:1 and 2:1. Table 3 presents a sample spreadsheet for Compression Factor Used set to 1:1 and Time of Extended Cost set to 1 year. The spreadsheet was actually run twelve times in order to fill out Table 4; a sketch of that driver loop follows.
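
For readers who prefer to mechanize the twelve runs rather than re-key a spreadsheet, the driver loop is simple. The run_model() function below is a hypothetical, deliberately simplified stand-in for the spreadsheet's formulas, with assumed constants; only the looping structure mirrors the analysis.

    # Run the model across the same grid used to fill out Table 4:
    # six periods, compression off (1:1) and on (2:1).

    def run_model(data_set_gb, archives_per_month, restores_per_month,
                  compression_factor, years):
        """Hypothetical stand-in for the article's spreadsheet.

        Amortizes an assumed Initial System Cost over all data written
        and adds an assumed per-Gb media cost (halved by 2:1
        compression). The real model also counts storage, manpower,
        and restore costs; this stub ignores restores_per_month.
        """
        system_cost = 50000.0                         # assumed, $
        media_cost_per_gb = 3.0 / compression_factor  # assumed, $/Gb
        total_gb = data_set_gb * archives_per_month * 12 * years
        return system_cost / total_gb + media_cost_per_gb

    for years in [1, 2, 5, 8, 10, 15]:
        for factor in [1.0, 2.0]:                     # 1:1 and 2:1
            dollars_per_gb = run_model(data_set_gb=300.0,
                                       archives_per_month=2,
                                       restores_per_month=1,
                                       compression_factor=factor,
                                       years=years)
            print("%2d yr, %d:1 -> $%.2f/Gb" % (years, factor,
                                                dollars_per_gb))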

Example Results

Based on the requirements for the data set described above, which system offers the lowest cost of operation per Gb depends on how long you anticipate keeping the system in operation. If you plan to discontinue operation after a few years, it would be prudent to purchase a system with a low Initial System Cost. For long-term operations, a more expensive system can be justified because the high initial cost is offset by the low cost of operation. Table 4 summarizes the cost effectiveness of each system as a function of data compression for a number of different years. Quantities are in units of dollars per Gb ($/Gb).

Conclusion

Which system meets the requirements listed here most cost-effectively is a function of how long you intend to operate the system. If you only need to archive your data for a few years, then 8mm is by far the most cost-effective choice. It has a very low initial system cost, and the lowest media cost.

However, if you intend to operate the system for more than five years, the most cost-effective system is the 19mm, or possibly the VHS. VHS is more economical if you archive only one or two data sets per month, but if you are generating about five or more data sets a month, then 19mm is the clear choice. 19mm has a very high initial system cost, but its throughput rate, single-cartridge capacity, and data density are the best on the market for this application.

References

[1] VanBogart, Dr. John W. D. NML Media Stability Studies. National Media Laboratory: July, 1994.

About the Author

Packey Velleca currently works on a realtime telemetry processing system as a system developer. He has worked as a system administrator, and has published several articles in Sys Admin. He graduated from FIT with a BSEE in 1988, and can be reached at pvelleca@rsa.hisd.harris.com.

