Comparing Technologies for Long-Term, High-Capacity Archives
Packey P. Velleca
Every information manager needs to provide backups or archives
of user data to some extent. There are many commercial systems
available for creating long-term archives, and they
vary widely in cost
and efficiency. The problem of selecting an archive system is
compounded by the differing requirements of individual
sites, so that no
one system is the most effective for all sites.
This article is intended to aid in determining the most cost-effective
commercial technology for creating long-term (5+ years),
high-capacity (100-1,000 Gigabytes) archives. The article is organized
in three parts:
Part 1 discusses the relative merits of ten technologies,
and presents a
tool to aid in determining your system requirements.
Part 2 describes a
cost analysis model, along with its assumptions and
limitations, used to
evaluate the different technologies with respect to those
requirements. Part 3 walks through a real example, and presents a
summary of the results projected by the model.
By examining your specific requirements and running
an analysis with the
model presented here, you will be better able to choose the most
cost-effective system for your site.
Part 1: Technologies and Requirements
In this section I briefly discuss 10 readily available,
integrated, UNIX-based technologies, with respect to
capacity, throughput, and features. Wherever capacity
and throughput are
cited throughout this article, they refer to
uncompressed (native) mode, as not all technologies support
compression. This provides a consistent framework for
evaluation. Compression is considered separately, later in the article.
Also note that throughput numbers here are those reported by the
manufacturers, and typically represent the maximum sustained
read/write rate. The overhead associated with writing
user data, such as
writing filemarks, will reduce overall effective throughput. I
will assume for the sake of simplicity that user data
consists of one
large file that is nearly the same size as the capacity
of the media. In
this way, the overhead associated with writing filemarks is minimized
and optimum read/write rates can be compared.
1. 1/2" Digital Linear Tape
Digital Linear Tape (DLT) gets its name from the fact
that it writes
data in tracks laid parallel to the direction of tape
travel. Two tracks
are written at once, and when the END-OF-MEDIA (EOM)
is reached, the
transport reverses the tape direction and writing continues toward the
BEGINNING-OF-MEDIA (BOM). The Model 2000 drives currently
can store 10.0
Gb native on a 4x4x1-inch cartridge, with a throughput
of about 1.25
Mb/s sustained. Model 4000 drives can store 20 Gb native
and have a throughput of about 1.5 Mb/s. Systems are available in
capacity from a single drive, single tape, to multi-tape libraries
containing up to 5, 7, 14, 28, 36, 48, 50, 60, 360,
480, or 900 (!) tapes
and 1-20 drives (50 Gb to 9 Tb with Model 2000 drives). DLT drives
support hardware compression. The Model 2000 tapes are
readable by the
Model 4000 drives, but not vice versa. DLT can have
a relatively long
tape and head life because when reading/writing stops,
there is no
relative motion between the head and the tape.
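The capacity range quoted for DLT libraries is simply the cartridge
count multiplied by the native cartridge capacity. A minimal Python
sketch of that arithmetic follows; the function name is my own
illustration, and 1 Tb is taken to be 1,000 Gb.

```python
def library_capacity_gb(num_cartridges, native_gb_per_cartridge):
    """Native (uncompressed) capacity of a tape library, in Gb."""
    return num_cartridges * native_gb_per_cartridge


# DLT Model 2000 figure from the text: 10.0 Gb native per cartridge.
smallest = library_capacity_gb(5, 10.0)    # 5-tape library
largest = library_capacity_gb(900, 10.0)   # 900-tape library

print(f"{smallest:.0f} Gb to {largest / 1000:.0f} Tb")
```

The same calculation applies to the other library-based technologies
described in this section, using each one's native cartridge capacity.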
2. 8mm Tape
This technology uses helical scanning to write tracks
that are diagonal
with respect to tape motion, resulting in high track
densities, and thus
high data density. Older Model 8205 drives can store
3.5 Gb at 0.26
Mb/s. Current Model 8505 drives can store 7.0 Gb native
on a tape, with
a throughput of about 0.5 Mb/s sustained. Future drives
(4Q '95?) may be
capable of storing 20.0 Gb native per tape at 6.0 Mb/s,
but these are
not available at this time. Cartridge size is about
Systems are available in capacity from a single drive,
single tape, to
multi-tape libraries containing up to 10, 20, 40, 48,
60, 80, 120 tapes
and 1-6 drives (70 Gb to 840 Gb with Model 8505 drives). 8mm drives
generally support hardware compression. Head and tape life is
comparatively short due to the higher rate of relative
motion of the
read/write head to the tape, even when no read/writes are occurring.
3. 4mm Tape
This drive is similar in recording technique to 8mm,
with a smaller
cartridge of 3x2x0.5 inches. It has the same form factor
and data format
as Digital Audio Tape (DAT). The current DDS-2 drives
are capable of
storing 4 Gb native at 0.5 Mb/s sustained. Systems are available in
capacity from a single drive, single tape, to multi-tape libraries
containing up to 10, 20, 40, 48, or 60 tapes and 1-6 drives
(40 Gb to 240
Gb with Model DDS-2 drives). There are proposed standards
for DDS-3 and
DDS-4 formats, with respectively larger capacities and throughputs, but
these drives are unavailable at this time. 4mm drives
generally support hardware compression.
4. Magneto-Optical Disc
Magneto-Optical (MO) technology uses a laser head to
write/read data on
write-once-read-many (WORM) or write-many (WMRM) 5.5x6x0.4-inch cartridges.
Shelf life of this media is very high, exceeding 30
years. Whether the
media are WORM or WMRM is often a matter of a software setting, since
rewritable media can act as WORM. Discs in this study
are 5.25 inches,
and range in capacity from 1.1 to 2.0 Gb. Throughput
is generally around
2.0 Mb/s sustained. Systems are available in capacity
from a single
drive, single disc, to multidisc libraries containing
up to 16, 32, 48,
60, 88, 144, 180, 250, 500, 1000 discs and 1-8 drives
(21 Gb to 1,370
Gb). These systems can be mounted as filesystems. Most
MO drives have
1.3 Gb capacity, and can be formatted to either 512
bytes per sector or
1024 bytes/sector, which is useful for different filesystems. However,
smaller sector sizes have more overhead, and thus slightly less
capacity than larger sector sizes. The Hitachi Model
152 drives are
unusual because they can store over 2.0 Gb per disc.
MO drives do not
support hardware compression.
5. Compact Disc
Compact Disc Recordable (CD-R) technology uses a 5x5x0.1-inch disc and
a laser head to create 0.68 Gb WORM discs. Throughput
can reach 0.6
Mb/s. Systems are available in capacity from a single drive, single
disc, to multidisc libraries containing 50 or 100 discs
and 1 or more
drives (34 Gb to 68 Gb). These systems can also be mounted as
filesystems. Some systems permit the jukeboxes to be chained together,
allowing up to 2,200 discs per system. Most drives are capable of
reading multiple formats: ISO-9660, Whitebook, Yellowbook,
Greenbook, and Redbook. CD-R drives generally do not
support hardware compression.
6. 19mm (DD-1, DD-2) Tape
19mm tape systems use a magnetic head to record data on
cartridges of widths up to about 14 inches, with a throughput of up
to 50.0 Mb/s sustained. Recording methods vary from
helical scan to
transverse scan, depending on the manufacturer. Transverse scan writes
tracks laid out perpendicular to tape travel, resulting in
densely packed tracks. Capacity for a single cartridge
varies from 25 to
790 Gb, depending on the tape length. Many of these
drives have variable
recording speeds, so that different record rates are possible. Systems
are available in capacity from a single drive, single
tape, to multitape
libraries containing up to 7 tapes and 1 drive (up to
1,155 Gb). These
drives are often used in military and space programs,
and are very
expensive. Some units have the ability to create logical partitions on a
tape, allowing near-direct access to any partition.
These drives also typically have a higher Bit Error
Rate (BER) -- usually
from 10E-10 to 10E-15 -- than the other technologies
covered here. 19mm
drives generally do not support hardware compression.
7. 1/2" (VHS-Style) Tape
This technology uses a VHS-style transport to digitally
record data on a
7.3x4x1-inch VHS-style cartridge. Recording method is
helical scan, as
with the 8mm format. Cartridge capacity varies from
14.5 to 21.1 Gb
native, depending on tape length. Throughput is about
sustained. Systems are available in capacity from a
single drive, single
tape, to multitape libraries containing up to 48 or
600 tapes and 1-6
drives (1,013 Gb to 12.7 Tb). These drives also typically
have a higher
Bit Error Rate (BER) -- usually about 10E-13 -- than the other
technologies covered here. These systems can have near-direct
access via tape indexing, and do not have to be rewound
to allow access
to a file. VHS-style drives support hardware compression. Unlike the
other technologies considered, which use metal-particle tapes, VHS-style
tapes are generally made from metal-oxides.
8. 1/2" (IBM 3590) Tape
The IBM 3590 uses a 16-track interleaved serpentine recording method,
much like DLT. These drives currently can store 10.0
Gb native on a
XxYxZ-inch cartridge, with a throughput of about 9.0 Mb/s sustained.
Systems are available in capacity as multitape libraries containing up
to 10 tapes and 1 drive (100 Gb). These drives can read
3480/3490/3490E cartridges, and they support hardware compression.
9. 1/2" (Beta-style) Tape
This technology uses the professional BetaCam-style transport to
digitally record data magnetically on a BetaCam-style
XxYxZ cartridge. Recording method is helical scan, as
with the 8mm
format. Cartridge capacity varies from 12.0 to 42.0
Gb native, depending
on tape length. Throughput is about 12.0 Mb/s sustained. Systems are
available in capacity from a single drive, single tape, to multitape
libraries containing up to 25, 35, 50, or 70 tapes and
1 drive (600 Gb to
14,700 Gb). These drives do not support hardware compression.
10. Removable Magnetic Disk
Traditional Winchester removable magnetic disks configured
in an array
(e.g., RAID 0) can provide very high sustained throughput
(20 Mb/s
and higher) and moderately high capacity, depending on
configuration. These systems can be mounted as filesystems.
The media on
these systems have a relatively short shelf life of
five to seven years
due to the complexity of the drive. This system is less
than optimal for
long-term archiving because the media are actually removable
drives. They are written-to once and put on a shelf,
and new drives must
be purchased and installed. For this reason, the cost
of operation is
100X higher than the most cost-effective technology
evaluated here. With
the ready availability of high-speed, high-capacity
tape drives, there
is no reason to use magnetic disks for long-term storage.
For magnetic tape, there are generally two types of coatings:
metal-oxide (e.g., ferric oxides) and metal-particle. Metal-oxide
coatings contain particles made up of iron (Fe) and
oxygen (O), whereas
metal-particle coatings contain particles made only
of iron. Because
there are more iron atoms per unit volume, more magnetic
energy can be
stored per unit volume. However, metal-particles oxidize when
exposed to oxygen, so tape must be coated with a thin protective layer
that slightly reduces its magnetic storage capacity.
Tape life expectancy is a function of many variables,
temperature, humidity, level of usage, and head and
Generally, metal-oxide tapes will have a longer life than
metal-particle ones, as metal-particles will oxidize
slowly over time,
thus reducing retentivity. Table 1 summarizes an estimation of life
expectancies performed by the National Media Laboratory, shown by
tape type as a function of temperature and humidity.
Many drives use some form of Lempel-Ziv compression implemented in
hardware. This adaptive algorithm replaces long strings
of data with
corresponding (and much shorter) codewords from a dictionary. It is
called adaptive because the dictionary is built from
the data being
compressed. Typical compression ratios range from 1.7:1
for binary data
to 6.9:1 for bitmaps and some image types. ASCII
data and databases
average about 3.4:1. Using compression increases the effective
cartridge capacity and throughput rate, thus increasing
the overall cost
effectiveness of the system.
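To make the adaptive dictionary idea concrete, here is a minimal
software sketch of LZW, one well-known member of the Lempel-Ziv family.
It is an illustration only: the function names are my own, and the
algorithms implemented in drive hardware differ in detail from vendor
to vendor.

```python
def lzw_compress(data):
    """Toy LZW encoder: turns bytes into a list of integer codewords."""
    # The dictionary starts with one entry per possible byte value and
    # grows as new strings are seen in the data (hence "adaptive").
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate            # keep extending the match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuilds the same dictionary from the codewords alone."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Codeword refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW codeword")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


sample = b"ABABABABABABABAB" * 8
codes = lzw_compress(sample)
assert lzw_decompress(codes) == sample
print(len(sample), "bytes in,", len(codes), "codewords out")
```

Highly repetitive input such as this sample compresses far better than
typical user data, which is why the ratios quoted above vary so much
by data type.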
Each technology can be ranked relative to the others according to
several variables that affect cost, throughput, capacity, and
availability. Table 2 depicts these technologies with
respect to these
and other variables. This chart can help you decide
how realistic your
requirements are with respect to your budget.
Part 2: Cost Analysis Model
Description of the Model
Once you understand the storage technologies available,
the next task is
to identify the requirements for your archival system
and create a model
to determine the most cost-effective system that meets those
requirements. Capacity, throughput, and total cost are the most
important aspects for large data sets. The idea is to maximize capacity
and throughput and minimize cost. Total cost is the
most complex aspect
to quantify, as it is a function of many different variables that may
not be the same from site to site. This model identifies the most
important aspects of total cost, and attempts to create a
framework for comparing the costs of operating different systems.
Constants used within the model (e.g., Administrative Manhour Cost,
Archive Baby-sitting Time, etc.) are not necessarily accurate for every
site, but they help create a framework by which different technologies
can be compared relative to one another. The dollar figures used for
these costs may not be your actuals, but they provide a consistent
basis for comparison.
The following assumptions underlie the working model.
The model attempts
to represent all the costs associated with purchasing,
writing, storing, and restoring regular archives of
a data set.
The data set generated twice each month is 300.0 Gb.
This means that
twice each month, 300.0 Gb of new user data is generated,
and needs to
be archived. This helps model the materials and manpower needed to
create an archive.
The data set is restored from archive media once each
month. This helps
model the manpower needed to restore user data.
System prices are approximations of the retail price
for a single
quantity. All prices listed here were actuals quoted
from a vendor, but
do not necessarily reflect the price you would pay. As
it turns out, the
model shows that cumulative storage cost per gigabyte is relatively
insensitive to moderate fluctuations in Initial System Cost.
System throughputs are matched as closely as possible to 2.0 Mb/sec.
Most drive technologies have very dissimilar sustained
throughputs, but because most systems can use a library
system with a
media changer/picker and multiple drives, it is simple
to create systems
with closely matched throughput even though one drive
may be 4X faster
than another. Keep in mind that a four drive system
may utilize more
SCSI IDs than a one drive system, and require special
System capacities are as close to 300.0 Gb as possible;
that is, a
900 Gb system and a 250 Gb system were not used for comparison in the
model unless there was no other configuration available. Some
technologies allow libraries to be daisy-chained, or
expanded, and each
was configured to support 300 Gb as efficiently as possible.
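The throughput-matching step can be sketched as a small calculation.
The Python below is an illustration under two assumptions of mine: that
drives in a library are striped in parallel and scale linearly, and
that the drive count is rounded up so the target is met or exceeded.
The native rates are the vendor figures quoted in Part 1.

```python
import math

TARGET_MB_PER_S = 2.0  # throughput the model tries to match


def drives_needed(target_mb_per_s, drive_mb_per_s):
    """Smallest number of parallel drives that reaches the target rate."""
    return math.ceil(target_mb_per_s / drive_mb_per_s)


# Native sustained rates (Mb/s) quoted earlier in this article.
native_rates = {
    "DLT Model 2000": 1.25,
    "8mm Model 8505": 0.5,
    "4mm DDS-2": 0.5,
    "Magneto-optical": 2.0,
}

for name, rate in native_rates.items():
    count = drives_needed(TARGET_MB_PER_S, rate)
    print(f"{name}: {count} drive(s), {count * rate:.2f} Mb/s aggregate")
```

Rounding up is only one reasonable policy; a site that prefers the
closest match rather than a guaranteed minimum could round to the
nearest whole drive instead.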
Factors That Incur Cost
The following variables have short-term and long-term
costs, and were
considered to be the most important in this model:
Initial System Cost -- The retail cost of the system
hardware and any
bundled application or device software; exclusive of
maintenance and software support.
Single Media Cost -- The retail cost for one cartridge
or disc, in
single quantity.
Compression Factor Used -- For the purpose of this analysis, a 2:1
compression ratio. This ratio was chosen for those systems that support
it, as it is realistically attainable with typical data
sets. The use of
hardware compression within the system can significantly reduce
Media Cost per Archive, Extended Media Storage Cost, Extended
Baby-sitting Cost, and Extended Manpower Media Loading Cost.
Extended Media Storage Cost -- The cost of long-term
storage of media,
expressed as a function of the Number of Media per Archive, Single Media
Volume, and the Physical Space Cost. The more media
required per archive
(i.e., the lower the Media Data Density), the more physical
space will be needed.
Extended Baby-sitting Cost -- The manpower cost of attending
a backup and
restore. This is expressed as a function of the Time
to Write One
Archive, the Administrative Manhour Cost, Archive Babysitting Time, and
Effective System Throughput.
Extended Manpower Media Loading Cost -- The manpower
cost of handling
media for backup and restore. This is a function of
the Number of Media
per Archive, the time to load the media into the system (including Media
Formatting Time, if applicable) and Administrative Manhour Cost. Media
have to be unwrapped, labeled, loaded, unloaded and
stacked for each
backup and restore.
Extended Total Cost per Gb -- The sum of all the costs of
creating an archive, divided by the size of the archive,
in Gb. This
metric is used in the final analysis for determining the most
cost-effective archive technology.
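The two manpower factors above can be sketched as simple functions.
The article describes what each cost depends on but not the exact
formulas, so the Python below is one plausible reading rather than the
spreadsheet itself. The dollar rate, minutes, and media count in the
example are hypothetical placeholders, and a restore is assumed to take
as long as a write.

```python
MONTHS_PER_YEAR = 12


def extended_babysitting_cost(hours_per_operation, babysit_minutes_per_hour,
                              manhour_cost, operations_per_month, years):
    """Manpower cost of attending backups and restores over the period."""
    # Only a fraction of each elapsed hour needs an operator's attention.
    attended_hours = hours_per_operation * babysit_minutes_per_hour / 60.0
    operations = operations_per_month * MONTHS_PER_YEAR * years
    return attended_hours * manhour_cost * operations


def extended_media_loading_cost(media_per_archive, minutes_per_media,
                                manhour_cost, operations_per_month, years):
    """Manpower cost of handling media (load, unload, format, label)."""
    hours_per_operation = media_per_archive * minutes_per_media / 60.0
    operations = operations_per_month * MONTHS_PER_YEAR * years
    return hours_per_operation * manhour_cost * operations


# Hypothetical inputs: 40-hour operations, 6 attended minutes per hour,
# $50 per manhour, 30 media per archive at 2 minutes each, and the
# article's workload of 2 archives plus 1 restore per month, for 1 year.
babysitting = extended_babysitting_cost(40.0, 6.0, 50.0, 3, 1)
loading = extended_media_loading_cost(30, 2.0, 50.0, 3, 1)
print(f"babysitting ${babysitting:,.0f}, media loading ${loading:,.0f}")
```

Both costs scale with the number of operations, which is why they
matter more as the Time of Extended Cost grows.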
The following variables are important and have associated
cost, but were non-deterministic owing to lack of reliable data.
The costs associated with them were therefore assumed
to be the same for
all systems.
System Unscheduled Maintenance -- The cost to repair
failed LRUs within
the system. This is a function of MTBF and MTTR, material cost,
effective downtime cost, and manpower rate. Most vendors
of COTS systems
do not provide reliable MTBF/MTTR data per MIL-STD-217,
or any other
industry standard, therefore making comparisons useless.
This cost can
be very significant over time, depending on the design and
quality of the system. For example, the read/write assembly wear
factor was not considered due to unavailability of accurate, reliable
vendor data. It is a factor for tape technologies, but
not for disks. It
is known that new tapes are much more abrasive than used
tapes. This can affect head wear by as much as 100 percent.
Single File Restoration -- The cost associated with
restoring a single
file from the entire archive. Since the software that
will be used to
create the archive is not known, and thus the method
of file access,
this cost cannot be reliably computed. Instead, this model
assumes that the entire archive will be restored, and
this cost is
included in the Baby-sitting Cost (above).
Also considered, but discovered to be negligible in
terms of long-term
and short-term cost were:
System Space Cost -- The cost of housing the system.
This is a function
of system physical size and anticipated facility space
cost, such as
housing, cooling, etc.
System Power Usage Cost -- The cost of providing power
to the system.
This is a function of average power consumption and the cost of
electricity.
Extended System Maintenance Cost -- The cost of operating
the system over time. This cost is a function of the
Time to Write One
Archive, the number of Archives/Restores per Month,
the Machine Clean
Rate, the Materials Cost per Cleaning, the Manhours
per Cleaning, and
the Administrative Manhour Cost.
The following factors have an associated cost, but have less
importance than the other factors. Also, because these
costs can be more
complex to model, they were assumed to be the same for all systems.
Media Shelf Life -- It is assumed the media will be
stored in a
temperature-controlled facility. In this case, both metal-oxide and
metal-particle tape-coating technologies can approach
a shelf life of
15-30 years. This analysis assumes all media will have the
same shelf life. Also, there is a cost associated with replacing media
when the media has reached its life expectancy, but
this is difficult to
determine due to the long lead times. In 10-15 years,
an entirely new
technology will most likely replace the current one.
Media Life -- For rewritable media, this analysis assumes
that all media
will only be written to once, but may be read many times. It also
assumes that the number of reads is much less than the
media life. The worst case here is 4mm tape, with a
media life of 1500
Bit Error Rate -- The BER for 8mm, 4mm, and DLT is 10E5
better than the
other technologies. However, no cost has been associated
with BER for
this analysis.
Improved Technology -- This refers to cost-avoidance
related to providing
throughput and capacity upgrades. Many COTS systems
have very good
upgrade paths, such as 8mm with the forthcoming Mammoth drives. Some
systems have modularly expandable libraries, while others
do not. But
overall, most systems have at least a 2X capability
for throughput and
capacity expansion, so this factor is considered equal for all systems.
Finally, the other constants and variables used in this model are:
Data Set Size -- The size, in Gb, of one data set to
be written or
restored per operation.
Time of Extended Cost -- The number of years over which costs are
accumulated.
Archives/Restores Per Month -- The number of data sets archived or
restored each month.
Administrative Manhour Cost -- The cost per hour for a system
administrator or operator.
Physical Space Cost -- The cost per cubic foot per year of an
environmentally controlled facility for long-term storage of media.
Archive Babysitting Time -- A constant expressing the
number of minutes
per hour, on average, an operator would have to devote
to the system
during a backup or restore, for the duration of the operation.
Single Media Native Capacity -- The native (uncompressed)
capacity in Gb of a single media.
Support Compression? -- Whether or not the system is
capable of providing
hardware-level data compression.
Effective Media Capacity -- The amount of user data
a single media can
store. If compression is turned on in the model, this
is a function of
Single Media Native Capacity and Compression Factor Used.
System Media Capacity -- The number of media a system
can use at one time.
System Total Data Capacity -- The largest data set size
the system can
archive or restore at one time.
Effective Media Cost per Gb -- A function of Single
Media Cost and
Effective Media Capacity.
Number of Media per Archive -- The number of media required
to write one
data set. This is a function of Data Set Size and Effective Media
Capacity.
Media Cost per Archive -- The cost for media to write
one data set. This
is a function of Number of Media per Archive and Single Media Cost.
Extended Media Cost -- The long-term cost of purchasing
media. This is a
function of Media Cost per Archive and the number of
Archives per Month.
Single Media Volume -- The size in cubic inches of a
single media, with
carrier or cover, if applicable. This size represents
the volume that
will be used when the media is placed in long-term storage.
Media Data Density -- The number of Gb that can be stored
per cubic inch.
This is a function of Single Media Volume and Effective Media
Capacity.
Extended Media Storage Space -- The physical space required
to store the
media. This is a function of Number of Media per Archive, Single Media
Volume, and the number of Archives per Month.
Extended Media Storage Cost -- The cost associated with long-term
storage of media. This is a function of Extended
Media Storage Space and the Physical Space Cost.
Native System Throughput -- The sustained throughput
in Mb per second for
reading and writing. This is vendor supplied. For systems
with more than
one drive, it assumes that some means is employed to
use all drives in
parallel (for example, striping via hardware or software).
Effective System Throughput -- The throughput of the
system as a function
of Native System Throughput and Compression Factor Used, if applicable.
Using compression effectively increases throughput.
Time to Write One Data Set -- The time in hours to write
one Data Set.
This is a function of the Data Set Size and the Effective System
Throughput.
System Volume -- The physical size of the system, in
System Power Usage -- The number of watts the system
uses in a typical
read or write operation.
Manhours per Cleaning -- The time it takes an operator
to clean the
system drives, if applicable.
Hours Between Cleaning -- The number of operational hours between
preventive maintenance cleanings.
Materials Cost per Cleaning -- The cost of the materials needed to
perform the system cleaning.
Media Formatting Time -- The time required for an operator to format
media in preparation for writing the data set.
Extended Total Cost -- The cost of operating the system
for the long
term. This is the sum of all costs in the model.
Extended Total Cost per Data Set -- The total cost for
each data set.
This is a function of Extended Total Cost and the number of data sets
archived.
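Several of the definitions above chain together: capacity and
compression determine the number of media, which in turn drives media
cost, while throughput and compression determine write time. The Python
below is a minimal sketch of that chain as I read it, not the actual
spreadsheet. It assumes 1 Gb = 1,000 Mb and whole media per archive,
and the $10 cartridge price is a hypothetical placeholder.

```python
import math

MONTHS_PER_YEAR = 12


def effective_media_capacity(native_gb, compression_factor=1.0):
    """User data (Gb) one media can hold at the given compression."""
    return native_gb * compression_factor


def media_per_archive(data_set_gb, effective_capacity_gb):
    """Whole media needed to write one data set."""
    return math.ceil(data_set_gb / effective_capacity_gb)


def media_cost_per_archive(num_media, single_media_cost):
    return num_media * single_media_cost


def extended_media_cost(cost_per_archive, archives_per_month, years):
    """Long-term cost of purchasing media."""
    return cost_per_archive * archives_per_month * MONTHS_PER_YEAR * years


def media_data_density(effective_capacity_gb, volume_cubic_inches):
    """Gb stored per cubic inch of shelf space."""
    return effective_capacity_gb / volume_cubic_inches


def time_to_write_hours(data_set_gb, native_mb_per_s, compression_factor=1.0):
    """Hours to write one data set at the effective system throughput."""
    effective_mb_per_s = native_mb_per_s * compression_factor
    return data_set_gb * 1000.0 / effective_mb_per_s / 3600.0


# 8mm Model 8505 (7.0 Gb native) and the article's 300 Gb data set.
native_count = media_per_archive(300.0, effective_media_capacity(7.0))
compressed_count = media_per_archive(300.0, effective_media_capacity(7.0, 2.0))
yearly_media = extended_media_cost(
    media_cost_per_archive(native_count, 10.0),  # $10 is hypothetical
    archives_per_month=2, years=1)

# DLT Model 2000: 10.0 Gb native in a 4x4x1-inch (16 cubic inch) cartridge.
dlt_density = media_data_density(10.0, 4 * 4 * 1)

hours = time_to_write_hours(300.0, 2.0)  # system matched to 2.0 Mb/s
print(native_count, compressed_count, yearly_media, dlt_density,
      round(hours, 1))
```

Because the media count is rounded up to whole cartridges, a 2:1
compression factor does not always cut the media count exactly in half.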
Part 3: An Example Analysis
Example Requirement Definition
This section presents an example of an analysis with
the model optimized
for finding the most cost-effective system. Cost effectiveness is
defined as the lowest cost per Gb for a specified period of time.
Other models can easily be defined by concentrating
on other aspects of
interest, such as throughput or capacity.
The model was run for 1-, 2-, 5-, 8-, 10- and 15-year
periods, with and
without compression. Data Set Size was 300 Gb; Archives
per Month was 2;
Restores per Month was 1; and Compression Factor Used was varied
between 1 and 2. Table 3 presents a sample spreadsheet with Compression
Factor Used set to 1:1 and Time of Extended Cost set
to 1 year. The
spreadsheet was actually run twelve times in order to
fill out Table 4.
Based on the requirements for the Data Set described
above, which system
offers the lowest cost of operation per Gb depends on
how long you
anticipate maintaining the system in operation. If you plan to
discontinue operation after a few years, it would be
prudent to purchase
a system with a low Initial System Cost. For long-term operation, a
more expensive system can be justified because the high
initial cost is
offset by the low cost of operation. Table 4 summarizes the cost
effectiveness of each system as a function of data compression and the
number of years of operation. Quantities are in units of
dollars per Gb.
Which system meets the requirements listed here most
cost effectively is
a function of how long you intend to operate the system.
If you only
need to archive your data for a few years, then 8mm
is by far the most
cost-effective choice. It has a very low initial system
cost, and the
lowest media cost.
However, if you intend to operate the system for more
than five years,
the most cost-effective system is the 19mm, or possibly
the VHS. VHS is
more economical if you archive only one or two data
sets per month, but
if you are generating about five or more data sets a
month, then 19mm is
the clear choice. 19mm has a very high initial system
cost, but its
throughput rate, single cartridge capacity, and data
density are the
best on the market for this application.
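The trade-off between a low initial cost and a low operating cost can
be illustrated with a break-even calculation. The Python below is a
simplified sketch: it treats operating cost as a constant annual figure
(in the full model, storage cost grows as media accumulate), and the
two systems and their dollar figures are hypothetical, not values from
Table 3 or Table 4.

```python
GB_PER_YEAR = 300.0 * 2 * 12  # 300 Gb data set, archived twice a month


def cumulative_cost(initial_cost, annual_cost, years):
    """Total spend after the given number of years."""
    return initial_cost + annual_cost * years


def cost_per_gb(initial_cost, annual_cost, years, gb_per_year=GB_PER_YEAR):
    """Cumulative cost divided by the data archived so far."""
    return cumulative_cost(initial_cost, annual_cost, years) / (gb_per_year * years)


def break_even_years(initial_low, annual_low, initial_high, annual_high):
    """Years after which the higher-priced system becomes cheaper overall."""
    if annual_low <= annual_high:
        raise ValueError("the expensive system must be cheaper to operate")
    return (initial_high - initial_low) / (annual_low - annual_high)


# Hypothetical systems: (Initial System Cost, annual operating cost).
cheap_system = (5000.0, 12000.0)
costly_system = (60000.0, 2000.0)

crossover = break_even_years(*cheap_system, *costly_system)
print(f"break-even after {crossover} years")
for years in (1, 5, 10):
    print(years,
          round(cost_per_gb(*cheap_system, years), 2),
          round(cost_per_gb(*costly_system, years), 2))
```

With these placeholder figures the cheaper system wins for short
periods of operation and the more expensive one wins beyond the
break-even point, which mirrors the pattern described above.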
VanBogart, Dr. John W. D. NML Media Stability Studies. National
Media Laboratory: July, 1994.
About the Author
Packey Velleca currently works on a realtime telemetry system
as a system developer. He has worked as a system administrator, and has
published several articles in Sys Admin. He graduated
from FIT with a
BSEE in 1988, and can be reached at firstname.lastname@example.org.