Comparing Technologies for Long-Term, High-Capacity Archives
Packey P. Velleca
Introduction
Every information manager needs to provide backups or
long-term archives
of user data to some extent. There are many commercial
solutions
available for creating long-term archives, and they
vary widely in cost
and efficiency. The problem of selecting an archive
technology is
compounded by the differing requirements of individual
sites, so that no
one system is the most effective for all sites.
This article is intended to aid in determining the most
cost-effective,
commercial technology for creating long-term (5+ years),
high-capacity
(100-1,000 Gigabytes) archives. The article is organized
in three parts:
Part 1 discusses the relative merits of ten technologies,
and presents a
tool to aid in determining your system requirements.
Part 2 describes a
cost analysis model, along with its assumptions and
limitations, used to
evaluate the different technologies with respect to
the chosen
requirements. Part 3 walks through a real example, and
presents a
summary of the results projected by the model.
By examining your specific requirements and running
an analysis with the
model presented here, you will be better able to choose
the most
cost-effective system for your site.
Part 1: Technologies and Requirements
Technologies Considered
In this section I briefly discuss 10 readily available,
easily
integrated, UNIX-based technologies, with respect to
technique,
capacity, throughput, and features. Wherever capacity
and throughput are
cited throughout this article, it will be in reference
to the
uncompressed (native) mode, as not all technologies
support hardware
compression. This provides a consistent framework for
technical
evaluation. Compression will be considered in the section
on cost
effectiveness.
Also note that throughput numbers here are those reported
by drive
manufacturers, and typically represent the maximum sustained
user data
read/write rate. The overhead associated with writing
user data, such as
writing filemarks, will reduce overall effective throughput.
This study
will assume for the sake of simplicity that user data
consists of one
large file that is nearly the same size as the capacity
of the media. In
this way, the overhead associated with writing filemarks
is minimized,
and optimum read/write rates can be compared.
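To make this assumption concrete, the following Python sketch computes the time to write a data set under a simple fixed-penalty model. The drive figures come from Part 1; the 2-second filemark penalty is a hypothetical placeholder, not a vendor number.

    GB = 1000.0  # Mb per Gb, matching the decimal units used in this article

    def write_hours(data_set_gb, drive_mbps, n_files=1, filemark_secs=2.0):
        # Hours to write a data set, charging a fixed penalty per filemark.
        transfer = (data_set_gb * GB) / drive_mbps   # seconds of raw transfer
        overhead = n_files * filemark_secs           # filemark-writing overhead
        return (transfer + overhead) / 3600.0

    # One large 10.0 Gb file on a DLT 2000 at 1.25 Mb/s: overhead is negligible.
    print(round(write_hours(10.0, 1.25), 2))         # ~2.22 hours
    # The same 10.0 Gb written as 10,000 small files: overhead dominates.
    print(round(write_hours(10.0, 1.25, 10000), 2))  # ~7.78 hours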
1. 1/2" Digital Linear Tape
Digital Linear Tape (DLT) gets its name from the fact
that it writes
data in tracks laid parallel to the direction of tape
travel. Two tracks
are written at once, and when the END-OF-MEDIA (EOM)
is reached, the
transport reverses the tape direction and writing continues
back toward
BEGINNING-OF-MEDIA (BOM). The Model 2000 drives currently
can store 10.0
Gb native on a 4x4x1-inch cartridge, with a throughput
of about 1.25
Mb/s sustained. Model 4000 drives can store 20 Gb native
per cartridge
and have a throughput of about 1.5 Mb/s. Systems are
available in
capacity from a single drive, single tape, to multi-tape
libraries
containing up to 5, 7, 14, 28, 36, 48, 50, 60, 360,
480, 900 (!) tapes
and 1-20 drives (50 Gb to 9 Tb with Model 2000 drives).
DLT drives
support hardware compression. The Model 2000 tapes are
readable by the
Model 4000 drives, but not vice versa. DLT can have
a relatively long
tape and head life because when reading/writing stops,
there is no
relative motion between the head and the tape.
2. 8mm Tape
This technology uses helical scanning to write tracks
that are diagonal
with respect to tape motion, resulting in high track
densities, and thus
high data density. Older Model 8205 drives can store
3.5 Gb at 0.26
Mb/s. Current Model 8505 drives can store 7.0 Gb native
on a tape, with
a throughput of about 0.5 Mb/s sustained. Future drives
(4Q '95?) may be
capable of storing 20.0 Gb native per tape at 6.0 Mb/s,
but these are
not available at this time. Cartridge size is about
4.8x3.3x0.6 inches.
Systems are available in capacity from a single drive,
single tape, to
multi-tape libraries containing up to 10, 20, 40, 48,
60, 80, 120 tapes
and 1-6 drives (70 Gb to 840 Gb with Model 8505 drives).
8mm drives
generally support hardware compression. Head and tape
life are
comparatively short due to the higher rate of relative
motion of the
read/write head against the tape, even when no reads or writes are being performed.
3. 4mm Tape
This drive is similar in recording technique to 8mm,
with a smaller
cartridge of 3x2x0.5 inches. It has the same form factor
and data format
as Digital Audio Tape (DAT). The current DDS-2 drives
are capable of
storing 4Gb native at 0.5 Mb/s sustained. Systems are
available in
capacity from a single drive, single tape, to multi-tape
libraries
containing up to 10, 20, 40, 48, 60 tapes and 1-6 drives
(40 Gb to 240
Gb with Model DDS-2 drives). There are proposed standards
for DDS-3 and
DDS-4 formats, with respectively larger capacities and
throughputs, but
these drives are unavailable at this time. 4mm drives
generally support
hardware compression.
4. Magneto-Optical Disc
Magneto-Optical (MO) technology uses a laser head to
write/read data on
write-once-read-many (WORM) or write-many-read-many (WMRM) 5.5x6x0.4-inch discs. Shelf life of this media is very high -- exceeding 30 years. Whether the
media are WORM or WMRM is often a matter of a software
switch, since
rewritable media can act as WORM. Discs in this study
are 5.25 inches,
and range in capacity from 1.1 to 2.0 Gb. Throughput
is generally around
2.0 Mb/s sustained. Systems are available in capacity
from a single
drive, single disc, to multidisc libraries containing
up to 16, 32, 48,
60, 88, 144, 180, 250, 500, 1000 discs and 1-8 drives
(21 Gb to 1,370
Gb). These systems can be mounted as filesystems. Most
MO drives have
1.3 Gb capacity, and can be formatted to either 512 or 1024 bytes per sector, to suit different filesystems.
Note that
smaller sector sizes have more overhead, and thus slightly
less usable
capacity than larger sector sizes. The Hitachi Model
152 drives are
unusual because they can store over 2.0 Gb per disc.
MO drives do not
support hardware compression.
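The sector-size tradeoff can be illustrated with a rough Python calculation. The 64-byte per-sector overhead below is purely hypothetical, chosen only to show the direction of the effect; the real overhead is fixed by the media format.

    RAW_BYTES = 1.3e9          # nominal raw capacity of a 1.3 Gb disc
    OVERHEAD_PER_SECTOR = 64   # hypothetical header/ECC bytes per sector

    for sector in (512, 1024):
        n_sectors = RAW_BYTES / (sector + OVERHEAD_PER_SECTOR)
        usable = n_sectors * sector
        print(sector, "bytes/sector ->", round(usable / 1e9, 3), "Gb usable")

With these assumed numbers, the 512-byte format yields about 1.156 Gb usable against 1.224 Gb for the 1024-byte format.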
5. Compact Disc
Compact Disc Recordable (CD-R) technology uses the 5x5x0.1-inch
disc and
a laser head to create 0.68 Gb WORM discs. Throughput
can reach 0.6
Mb/s. Systems are available in capacity from a single
drive, single
disc, to multidisc libraries containing 50 or 100 discs
and 1 or more
drives (34 Gb to 68 Gb). These systems can also be mounted
as read-only
filesystems. Some systems permit the jukeboxes to be
daisy-chained,
allowing up to 2,200 discs per system. Most drives are
capable of
reading multiple formats: ISO-9660, Whitebook, Yellowbook, Orangebook, Greenbook, and Redbook. CD-R drives generally do not
support hardware
compression.
6. 19mm (DD-1, DD-2) Tape
19mm tape systems use a magnetic head to record data
on 8.1x1.3-inch cartridges with widths up to about 14 inches, with a throughput
from 0.6
to 50.0 Mb/s sustained. Recording methods vary from
helical scan to
transverse scan, depending on the manufacturer. Transverse
recording has
tracks laid out perpendicular to tape travel, resulting
in short, densely packed tracks. Capacity for a single cartridge
varies from 25 to
790 Gb, depending on the tape length. Many of these
drives support variable recording rates. Systems
are available in capacity from a single drive, single
tape, to multitape
libraries containing up to 7 tapes and 1 drive (up to
1,155 Gb). These
drives are often used in military and space programs,
and are very
expensive. Some units have the ability to create logical
partitions on
tape, allowing near-direct access to any partition by
using tape
indexing.
These drives also typically have a higher Bit Error
Rate (BER) -- usually from 10E-10 to 10E-15 -- than the other technologies
covered here. 19mm
drives generally do not support hardware compression.
7. 1/2" (VHS-Style) Tape
This technology uses a VHS-style transport to digitally
record data on a
7.3x4x1-inch VHS-style cartridge. Recording method is
helical scan, as
with the 8mm format. Cartridge capacity varies from
14.5 to 21.1 Gb
native, depending on tape length. Throughput is about
2.0 Mb/s
sustained. Systems are available in capacity from a
single drive, single
tape, to multitape libraries containing up to 48 and
600 tapes and 1-6
drives (1,013 Gb to 12.7 Tb). These drives also typically
have a higher
Bit Error Rate (BER) -- usually about 10E-13 -- than
the other
technologies covered here. These systems can have near-direct
file
access via tape indexing, and do not have to be rewound
to allow access
to a file. VHS-style drives support hardware compression.
Unlike the
other technologies considered, which use metal-particle
tapes, VHS-style
tapes are generally made from metal-oxides.
8. 1/2" (IBM 3590) Tape
The IBM 3590 uses a 16-track interleaved serpentine
recording method,
much like DLT. These drives currently can store 10.0
Gb native on a
XxYxZ-inch cartridge, with a throughput of about 9.0
Mb/s sustained.
Systems are available in capacity as multitape libraries
containing up
to 10 tapes and 1 drive (100 Gb). These drives can also read
3480/3490/3490E cartridges, and they support hardware
compression.
9. 1/2" (Beta-style) Tape
This technology uses the professional BetaCam-style
transport to
digitally record data magnetically on a BetaCam-style
XxYxZ-inch or
XxYxZ cartridge. Recording method is helical scan, as
with the 8mm
format. Cartridge capacity varies from 12.0 to 42.0
Gb native, depending
on tape length. Throughput is about 12.0 Mb/s sustained.
Systems are
available in capacity from a single drive, single tape,
to multitape
libraries containing up to 25, 35, 50, 70 tapes and
1 drive (600 Gb to 14,700 Gb). These drives do not support hardware compression.
10. Magnetic Disk
Traditional Winchester removable magnetic disks configured
in an array
(e.g., RAID 0) can provide very high sustained throughput
(20 Mb/s and higher) and moderately high capacity, depending
on the
configuration. These systems can be mounted as filesystems.
The media on
these systems have a relatively short shelf life of
five to seven years
due to the complexity of the drive. This system is less
than optimal for
long-term archiving because the media are actually removable
hard disk
drives. They are written to once and put on a shelf,
and new drives must
be purchased and installed. For this reason, the cost
of operation is
100X higher than the most cost-effective technology
evaluated here. With
the ready availability of high-speed, high-capacity
tape drives, there
is no reason to use magnetic disks for long-term storage.
Media Considerations
For magnetic tape, there are generally two types of
coatings:
metal-oxide (e.g., ferric oxides) and metal-particle.
Metal-oxide
coatings contain particles made up of iron (Fe) and
oxygen (O), where
metal-particle coatings contain particles made only
of iron. Because
there are more iron atoms per unit volume, more magnetic
energy can be
stored per unit volume. However, metal-particles are
unstable when
exposed to oxygen, so tape must be coated with a thin
protective layer
that slightly reduces its magnetic storage capacity.
Tape life expectancy is a function of many variables,
including:
temperature, humidity, level of usage, and head and
transport condition.
Generally, metal-oxide tapes will have a longer life
expectancy than
metal-particle ones, as metal particles oxidize slowly over time,
thus reducing retentivity. Table 1 summarizes an estimation
of life
expectancies performed by the National Media Laboratory
[1], shown by
tape type as a function of temperature and humidity.
Compression Considerations
Many drives use some form of Lempel-Ziv compression
implemented in
hardware. This adaptive algorithm replaces long strings
of data with
corresponding (and much shorter) codewords from a dictionary.
It is
called adaptive because the dictionary is built from
the data being
compressed. Typical compression ratios range from 1.7:1
for binary data
to 6.9:1 for bitmaps and some image types. ASCII
data and databases
average about 3.4:1. Using compression increases the
effective single
cartridge capacity and throughput rate, thus increasing
the overall cost
effectiveness of the system.
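Because compression scales both capacity and throughput by the same factor, its effect on the model is easy to state. Here is a minimal Python sketch using the DLT 4000 figures from Part 1 and the 2:1 ratio assumed in Part 2.

    def effective(native_capacity_gb, native_mbps, ratio):
        # Compression scales capacity and throughput by the same factor.
        return native_capacity_gb * ratio, native_mbps * ratio

    cap, rate = effective(20.0, 1.5, 2.0)
    print(cap, "Gb effective at", rate, "Mb/s")   # 40.0 Gb at 3.0 Mb/s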
Relative Merits
Each technology can be ranked relative to the others
according to
several variables that affect cost, throughput, capacity,
and data
availability. Table 2 depicts these technologies with
respect to these
and other variables. This chart can help you decide
how realistic your
requirements are with respect to your budget.
Part 2: Cost Analysis Model
Description of the Model
Once you understand the storage technologies available,
the next task is
to identify the requirements for your archival system
and create a model
to determine the most cost effective system that meets
those
requirements. Capacity, throughput, and total cost are
the most
important aspects for large data sets. The idea is to
maximize capacity
and throughput and minimize cost. Total cost is the
most complex aspect
to quantify, as it is a function of many different variables
that may
not be the same from site to site. This model identifies
the most
important aspects of total cost, and attempts to create
a useful
framework for comparing the costs of operating different
systems.
Constants used within the model (e.g., Administrative
Manpower Cost,
Archive Baby-sitting Time, etc.) are not necessarily
actuals; instead,
they help create a framework by which different technologies
can be
compared relative to one another. The dollar figures
represented by these costs may not match your actual costs, but they provide a close approximation.
Assumptions
The following assumptions underlie the working model.
The model attempts
to represent all the costs associated with purchasing,
initializing,
writing, storing, and restoring regular archives of
a data set.
The data set generated twice each month is 300.0 Gb.
This means that
twice each month, 300.0 Gb of new user data is generated,
and needs to
be archived. This helps model the materials and manpower
needed to
create an archive.
The data set is restored from archive media once each
month. This helps
model the manpower needed to restore user data.
System prices are approximations of the retail price
for a single
quantity. All prices listed here were actual vendor quotes, but they do not necessarily reflect the price you will pay. As
it turns out, the
model shows that cumulative storage cost per gigabyte
is relatively
insensitive to moderate fluctuations in Initial System
Cost.
System throughputs are matched as closely to 2.0 Mb/s
as possible.
Most drive technologies have very dissimilar sustained
read/write
throughputs, but because most systems can use a library
system with a
media changer/picker and multiple drives, it is simple
to create systems
with closely matched throughput even though one drive
may be 4X faster
than another. Keep in mind that a four-drive system uses more SCSI IDs than a one-drive system, and may require special application software.
System capacities are as close to 300.0 Gb as possible -- that is, a 900 Gb system and a 250 Gb system were not used for comparison
in this
model unless there was no other configuration available.
Most
technologies allow libraries to be daisy-chained, or
expanded, and each
was configured to support 300 Gb as efficiently as possible.
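As a sketch of how such configurations were chosen, the following Python fragment sizes a library against the model's 2.0 Mb/s and 300 Gb targets, using drive figures from Part 1; it simply rounds up to whole drives and whole tapes.

    import math

    def size_system(target_gb, target_mbps, media_gb, drive_mbps):
        drives = math.ceil(target_mbps / drive_mbps)  # drives run in parallel
        tapes = math.ceil(target_gb / media_gb)       # whole tapes per archive
        return drives, tapes

    # 8mm Model 8505: 7.0 Gb per tape at 0.5 Mb/s sustained.
    print(size_system(300.0, 2.0, 7.0, 0.5))    # (4 drives, 43 tapes)
    # DLT Model 2000: 10.0 Gb per tape at 1.25 Mb/s sustained.
    print(size_system(300.0, 2.0, 10.0, 1.25))  # (2 drives, 30 tapes)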
Factors That Incur Cost
The following variables have short-term and long-term
costs, and were
considered to be the most important in this model:
Initial System Cost -- The retail cost of the system
hardware and any
bundled application or device software; exclusive of
media, hardware
maintenance and software support.
Single Media Cost -- The retail cost for one cartridge
or disc, in
moderate quantity.
Compression Factor Used -- For the purpose of this analysis,
a 2:1
compression ratio. This ratio was chosen for those systems
that support
it, as it is realistically attainable with typical data
sets. The use of
hardware compression within the system can significantly
reduce the
Media Cost per Archive, Extended Media Storage Cost,
Extended
Baby-sitting Cost, and Extended Manpower Media Loading
Cost.
Extended Media Storage Cost -- The cost of long-term
storage of media,
expressed as a function of the Number of Media per Archive,
Single Media
Volume, and the Physical Space Cost. The more media
required per archive
(i.e., the lower the Media Data Density), the more physical
storage
space will be needed.
Extended Baby-sitting Cost -- The manpower cost of attending
a backup and
restore. This is expressed as a function of the Time
to Write One
Archive, the Administrative Manhour Cost, Archive Baby-sitting Time, and Effective System Throughput.
Extended Manpower Media Loading Cost -- The manpower
cost of handling
media for backup and restore. This is a function of
the Number of Media
per Archive, the time to load the media into the system
(including Media
Formatting Time, if applicable) and Administrative Manhour Cost. Media
have to be unwrapped, labeled, loaded, unloaded and
stacked for each
backup and restore.
Extended Total Cost per Gb -- The sum of all the costs
associated with
creating an archive, divided by the size of the archive,
in Gb. This
metric is used in the final analysis for determining
the most
cost-effective archive technology.
The following variables are important and have associated
long-term
cost, but were non-deterministic owing to lack of reliable information. The costs associated with them were therefore assumed to be the same for each technology.
System Unscheduled Maintenance -- The cost to repair failed line-replaceable units (LRUs) within
the system. This is a function of MTBF and MTTR, material
cost,
effective downtime cost, and manpower rate. Most vendors
of COTS systems
do not provide reliable MTBF/MTTR data per MIL-STD-217,
or any other
industry standard, therefore making comparisons useless.
This cost can
be very significant over time, depending on the design
and construction
quality of the system. For example, the read/write assembly
life cost
factor was not considered due to unavailability of accurate,
consistent
vendor data. It is a factor for tape technologies, but
not for disks. It
is known that new tapes are much more abrasive than
older, burnished
tapes. This can affect head wear by as much as 100 percent.
Single File Restoration -- The cost associated with
restoring a single
file from the entire archive. Since the software that
will be used to
create the archive is not known, and thus the method
of file access,
this cost cannot be reliably computed. Instead, this
study naively
assumes that the entire archive will be restored, and
this cost is
included in the Baby-sitting Cost (above).
Also considered, but discovered to be negligible in
terms of long-term
and short-term cost were:
System Space Cost -- The cost of housing the system.
This is a function
of system physical size and anticipated facility space
cost, such as
housing, cooling, etc.
System Power Usage Cost -- The cost of providing power
to the system.
This is a function of average power consumption and
power rates.
Extended System Maintenance Cost -- The cost of operating
and maintaining
the system over time. This cost is a function of the
Time to Write One
Archive, the number of Archives/Restores per Month,
the Machine Clean
Rate, the Materials Cost per Cleaning, the Manhours
per Cleaning, and
the Administrative Manhour Cost.
The following factors have an associated cost, but have
relatively less
importance than the other factors. Also, because these
costs can be more
complex to model, they were assumed to be the same for each technology.
Media Shelf Life -- It is assumed the media will be
stored in a
temperature-controlled facility. In this case, both
metal-oxide and
metal-particle tape-coating technologies can approach
a shelf life of
15-30 years. This analysis assumes all media will have
approximately the
same shelf life. Also, there is a cost associated with
re-recording data
when the media has reached its life expectancy, but
this is difficult to
determine due to the long lead times. In 10-15 years,
an entirely new
technology will most likely replace the current one.
Media Life -- For rewritable media, this trade study assumes that all media will be written only once, but may be read many times.
It also
assumes that the number of reads is much less than the
vendor expected
media life. The worst case here is 4mm tape, with a
media life of 1500
writes/reads.
Bit Error Rate -- The BER for 8mm, 4mm, and DLT is about 10E5 times better than that of the other technologies. However, no cost has been associated with BER for this analysis; the sketch following this list shows the scale of the effect.
Improved Technology -- This refers to cost-avoidance
related to providing
throughput and capacity upgrades. Many COTS systems
have very good
upgrade paths, such as 8mm with the forthcoming Mammoth
drive. Some
systems have modularly expandable libraries, while others
do not. But
overall, most systems have at least a 2X capability
for throughput and
capacity expansion, so this factor is considered equal
among
technologies.
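The scale of the BER effect can be estimated with one line of arithmetic: expected raw bit errors are roughly the number of bits read times the BER. A small Python sketch, using the 300 Gb data set from Part 2:

    ARCHIVE_GB = 300.0
    bits = ARCHIVE_GB * 1e9 * 8   # total bits in one archive

    for ber in (1e-10, 1e-13, 1e-15):
        print("BER %.0e -> %.2f expected bit errors per restore" % (ber, bits * ber))

At 10E-10 that is roughly 240 raw bit errors per full restore; at 10E-15 it is effectively zero.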
Finally, the other constants and variables used in this model (a sketch showing how they combine follows the list):
Data Set Size -- The size, in Gb, of one data set to
be written or
restored per operation.
Time of Extended Cost -- The number of years to run
the model.
Archives/Restores Per Month -- The number of data sets
written/restored
each month.
Administrative Manhour Cost -- The cost per hour for a system
administrator or operator.
Physical Space Cost -- The cost per cubic foot per year
for an
environmentally controlled facility for long-term storage
of media.
Archive Baby-sitting Time -- A constant expressing the
number of minutes
per hour, on average, an operator would have to devote
to the system
during a backup or restore, for the duration of the
operation.
Single Media Native Capacity -- The native (uncompressed)
storage
capacity in Gb of a single media.
Support Compression? -- Whether or not the system is
capable of providing
hardware-level data compression.
Effective Media Capacity -- The amount of user data
a single media can
store. If compression is turned on in the model, this
is a function of
Single Media Native Capacity and Compression Factor
Used.
System Media Capacity -- The number of media a system
can use at one
time.
System Total Data Capacity -- The largest data set size
the system can
archive or restore at one time.
Effective Media Cost per Gb -- A function of Single
Media Cost and
Effective Media Capacity.
Number of Media per Archive -- The number of media required
to write one
data set. This is a function of Data Set Size and Effective
Media
Capacity.
Media Cost per Archive -- The cost for media to write
one data set. This
is a function of Number of Media per Archive and Single
Media Cost.
Extended Media Cost -- The long-term cost of purchasing
media. This is a
function of Media Cost per Archive and the number of
Archives per Month.
Single Media Volume -- The size in cubic inches of a
single media, with
carrier or cover, if applicable. This size represents
the volume that
will be used when the media is placed in long-term storage.
Media Data Density -- The number of Gb that can be stored
per cubic inch.
This is a function of Single Media Volume and Effective Media Capacity.
Extended Media Storage Space -- The physical space required
to store the
media. This is a function of Number of Media per Archive,
Single Media
Volume, and the number of Archives per Month.
Extended Media Storage Cost -- The cost associated with
the Extended
Media Storage Space and the Physical Space Cost.
Native System Throughput -- The sustained throughput
in Mb per second for
reading and writing. This is vendor supplied. For systems
with more than
one drive, it assumes that some means is employed to
use all drives in
parallel (for example, striping via hardware or software).
Effective System Throughput -- The throughput of the
system as a function
of Native System Throughput and Compression Factor Used,
if applicable.
Using compression effectively increases throughput.
Time to Write One Data Set -- The time in hours to write
one Data Set.
This is a function of the Data Set Size and the Effective
System
Throughput.
System Volume -- The physical size of the system, in
cubic feet.
System Power Usage -- The number of watts the system
uses in a typical
read or write operation.
Manhours per Cleaning -- The time it takes an operator
to clean the
system drives, if applicable.
Hours Between Cleaning -- The number of operational
hours between
preventive maintenance cleanings.
Materials Cost per Cleaning -- The cost of the materials
needed to
perform the system cleaning.
Media Formatting Time -- The time required for an operator
to format
media in preparation for writing the data set.
Extended Total Cost -- The cost of operating the system
for the long
term. This is the sum of all costs in the model.
Extended Total Cost per Data Set -- The total cost for
each data set.
This is a function of Extended Total Cost and the number
of archives
performed.
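As a rough illustration of how these variables combine, here is a minimal Python sketch of the model's core arithmetic. The manpower, baby-sitting, loading, and storage constants are placeholders, not the article's actual spreadsheet values; restores, cleaning, and formatting are omitted for brevity.

    import math

    def extended_cost_per_gb(
            years,                    # Time of Extended Cost
            system_cost,              # Initial System Cost ($)
            media_cost,               # Single Media Cost ($)
            native_gb,                # Single Media Native Capacity (Gb)
            native_mbps,              # Native System Throughput (Mb/s)
            media_vol_in3,            # Single Media Volume (cubic inches)
            compression=1.0,          # Compression Factor Used
            data_set_gb=300.0,        # Data Set Size (Gb)
            archives_per_month=2,     # Archives per Month
            manhour=50.0,             # Administrative Manhour Cost (placeholder)
            babysit_min_per_hr=10.0,  # Archive Baby-sitting Time (placeholder)
            load_min_per_media=2.0,   # handling minutes per media (placeholder)
            space_cost_ft3_yr=5.0):   # Physical Space Cost (placeholder)
        eff_gb = native_gb * compression            # Effective Media Capacity
        eff_mbps = native_mbps * compression        # Effective System Throughput
        n_media = math.ceil(data_set_gb / eff_gb)   # Number of Media per Archive
        write_hrs = data_set_gb * 1000.0 / eff_mbps / 3600.0  # Time to Write One Data Set
        archives = years * 12 * archives_per_month
        media_ext = archives * n_media * media_cost           # Extended Media Cost
        babysit_ext = archives * write_hrs * (babysit_min_per_hr / 60.0) * manhour
        loading_ext = archives * n_media * (load_min_per_media / 60.0) * manhour
        shelf_ft3 = archives * n_media * media_vol_in3 / 1728.0  # storage space, cubic feet
        storage_ext = shelf_ft3 * space_cost_ft3_yr * (years / 2.0)  # rough average shelf residence
        total = system_cost + media_ext + babysit_ext + loading_ext + storage_ext
        return total / (archives * data_set_gb)     # Extended Total Cost per Gb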
Part 3: An Example Analysis
Example Requirement Definition
This section presents an example of an analysis with
the model optimized
for finding the most cost-effective system. Cost effectiveness
is
defined as the lowest cost per Gb for a specified period
of operation.
Other models can easily be defined by concentrating
on other aspects of
interest, such as throughput or capacity.
The model was run for 1-, 2-, 5-, 8-, 10-, and 15-year
periods, with and
without compression. Data Set Size was 300 Gb; Archives
per Month was 2;
Restores per Month was 1; and Compression Factor Used
was toggled
between 1 and 2. Table 3 presents a sample spreadsheet
for Compression
Factor Used set to 1:1 and Time of Extended Cost set
to 1 year. The
spreadsheet was actually run twelve times in order to
fill out Table 4.
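Continuing the sketch from Part 2, a dozen calls to extended_cost_per_gb reproduce the shape of such a run. The $40,000 system price, $40 cartridge price, and two-drive (2.5 Mb/s) DLT 2000 configuration are placeholder assumptions, not the quotes used for Table 4.

    for years in (1, 2, 5, 8, 10, 15):
        for comp in (1.0, 2.0):
            cpg = extended_cost_per_gb(
                years, system_cost=40000.0, media_cost=40.0,
                native_gb=10.0, native_mbps=2.5,   # two DLT 2000 drives in parallel
                media_vol_in3=16.0, compression=comp)
            print("%2d yr, %.0f:1 -> $%.2f/Gb" % (years, comp, cpg))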
Example Results
Based on the requirements for the Data Set described
above, which system
offers the lowest cost of operation per Gb depends on
how long you
anticipate maintaining the system in operation. If you
plan to
discontinue operation after a few years, it would be
prudent to purchase
a system with a low Initial System Cost. For long-term
operations, a
more expensive system can be justified because the high
initial cost is
offset by the low cost of operation. Table 4 summarizes
the cost
effectiveness of each system as a function of data compression
for a
number of different years. Quantities are in units of dollars per Gb ($/Gb).
Conclusion
Which system meets the requirements listed here most
cost effectively is
a function of how long you intend to operate the system.
If you only
need to archive your data for a few years, then 8mm
is by far the most
cost-effective choice. It has a very low initial system
cost, and the
lowest media cost.
However, if you intend to operate the system for more
than five years,
the most cost-effective system is the 19mm, or possibly
the VHS. VHS is
more economical if you archive only one or two data
sets per month, but
if you are generating about five or more data sets a
month, then 19mm is
the clear choice. 19mm has a very high initial system
cost, but its
throughput rate, single cartridge capacity, and data
density are the
best on the market for this application.
References
[1] VanBogart, Dr. John W. D. NML Media Stability Studies. National Media Laboratory, July 1994.
About the Author
Packey Velleca currently works on a realtime telemetry
processing system
as a system developer. He has worked as a system administrator,
and has
published several articles in Sys Admin. He graduated
from FIT with a
BSEE in 1988, and can be reached at pvelleca@rsa.hisd.harris.com.