Where Did You Get That Tape?
R. King Ables
Sending tapes to and receiving tapes from other UNIX
machines is a
fairly simple task. If a tape is created using tar(1)
or cpio(1),
virtually any other UNIX machine will be able to read
it because almost
all versions of UNIX have one (if not both) of these
tape utilities.
But what do you do if you get a tape from a non-UNIX
site? Programs
on your UNIX box cannot understand an EBCDIC tape from
an IBM system.
What if you need to send data to a site that has no
UNIX hosts? An
IBM mainframe cannot make sense of a tar tape.
Commands like tar write files onto a tape in a specific
format
and include operating-system-dependent information about
each file
(like date of creation, owner, protection, etc.). The
tar command
keeps the user from having to know exactly what the
data on the tape
looks like and makes moving data between UNIX machines
very easy,
since other sites can read the tape with their own version
of tar.
But a computer that doesn't have tar cannot use the
data on
the tape as easily.
Since no single tape utility runs on every known machine
and operating
system (because most are so dependent on operating system
information),
there is no single format that can be used to make a
tape easily readable
to any computer. The most generic way to prepare a tape
for another
computer system is to write only the data in the file
onto the tape
(that is, to exclude information about the file such
as its name,
creation date, or protection). But to do this, the user
must be concerned
with things like blocking factors, record lengths, and
character codes.
If you get a tape from a non-UNIX site, it may very
well contain only
data (that is, it may not be in tar or cpio format or
any other format that would allow you just to pull files
off the tape).
You need to know (or be able to figure out) what the
layout of the
data is in order to construct a set of commands to read
the data off
the tape.
Tape Attributes
There are several attributes that make a tape unique.
Understanding
the different possible formats and layouts of a magnetic
tape will
help you figure out how best to read data from a particular
tape.
Physical Type
The most obvious tape attribute is its physical type.
Most tapes that
come from non-UNIX sites will be 9-track tapes. These
are 1/2-inch-wide
tapes rolled onto large reels (like the ones you see
spinning on tape
drives in bad 1960s movies). More modern UNIX sites
now use cartridge
tapes (which look much like audio cassettes but about
twice as large),
8mm, or 4mm tapes. While everything I discuss here would
also apply
to cartridge, 8mm, or 4mm tapes, you will most likely
receive information
from non-UNIX sites on 9-track tapes.
Density
Density refers to the number of frames written per inch
of tape and
determines how much information will fit on the tape.
The more densely
the information is recorded on the tape, the more you
can fit on the
tape. But (just as with audio and video recordings)
the more dense
your recording, the lower the quality (and with computer
tape, the
higher the chance of having an error occur when reading
the tape).
The three most common density settings for 9-track tapes
are 800,
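1600, and 6250 bpi (nominally bits per inch along each track; because
a 9-track tape records one byte plus a parity bit across its width in
each frame, the figure is effectively bytes per inch). 1600 bpi used to be the most common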
density,
but as tape technology improved, 6250 bpi became more
and more popular
(since more information can be put on the tape at 6250
bpi), and now
most tapes are certified to work at the higher density.
Density is usually set on the tape drive itself or by
specifying a
density on the command line when writing a tape. Many
UNIX systems
use different device names for the same tape drive with
different
densities. When reading a tape, most tape drives can
sense the density
and adjust to read the data at the appropriate setting.
Character Code
A further attribute of a tape is its character coding.
The two most
commonly used codes are ASCII and EBCDIC. EBCDIC is
mainly used by
IBM mainframes; ASCII is the standard character code
used by most
PCs and UNIX machines, among others. Knowing which code
was used will
also help you figure out what is on a tape.
A tape should never be written in the internal display
code of a machine
(if different from EBCDIC or ASCII), and internally
represented numbers
(binary values of floating point numbers or even integers)
should
never be written directly to tape. The internal representation
of
this information is different for different processors
and operating
systems. If numeric data is to be sent to another machine,
the data
should be written out in a human-readable form (in character
representations
of the values the way you would print the values) and
then read back
in and converted to numeric values.
Blocksize
Data is written onto a tape in groups of bytes called
blocks. Sometimes
tape blocks are referred to as "records."
This was an unfortunate
choice since the text data in a tape block is sometimes
said to be
made up of records (see LRECL below), which are really
just lines
in the file. In reading documentation from different
vendors, you
may find the word "record" used to describe
either a tape
record (which I will call a block) or a data record
(which is a single
line of text inside a file on a tape). You must determine
the meaning
from the context.
Separating the blocks of data on a tape are empty spaces
known as
interblock gaps (sometimes called interrecord gaps).
These spaces
allow the tape drive to stay synchronized, read the
correct amount
of data, and recover if it encounters an error in a
block. Blocks
can be just about any size, but each block should be
the same size
as the others in a file (variable-length blocks greatly
complicate
the situation). The larger the blocksize, the more efficiently
the
data can be written on the tape (since there will be
fewer interblock
gaps and, thus, less tape used). However, if a block
contains an error,
all of the data contained in that block will be lost.
So the risk
of larger blocks is that if an error occurs, more data
is lost.
Note that some sites bill for resources and may charge
per block read
or written, so judicious use of blocksize can make a
big difference
in the charges for processing a tape.
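The effect of blocksize on tape usage can be estimated with a little arithmetic. The sketch below is only an illustration: the function name tape_inches is hypothetical, the density is treated as bytes per inch, and the 0.6-inch interblock gap is an assumed figure for a 1600 bpi drive, so check your drive's documentation for the real value.

```shell
#!/bin/sh
# tape_inches BYTES BLOCKSIZE DENSITY GAP
# Estimate the inches of tape needed to hold BYTES of data written in
# blocks of BLOCKSIZE bytes at DENSITY bytes per inch, with GAP inches
# of interblock gap after each block (GAP is an assumed value).
tape_inches() {
    awk -v bytes="$1" -v bs="$2" -v bpi="$3" -v gap="$4" 'BEGIN {
        blocks = int((bytes + bs - 1) / bs)      # round up to whole blocks
        printf "%.1f\n", bytes / bpi + blocks * gap
    }'
}

# Compare three blocksizes for the same 1285120-byte file.
for bs in 80 5120 51200; do
    printf '%6d-byte blocks: %s inches\n' "$bs" \
        "$(tape_inches 1285120 "$bs" 1600 0.6)"
done
```

With these assumed figures, 80-byte blocks need roughly 10,400 inches of tape, while 51200-byte blocks need roughly 820, because almost all of the tape in the first case is gap.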
Logical Record Length (LRECL)
Typically, each block contains some number of data records.
In the
IBM mainframe world, a "record" refers to
what the rest of
the world calls a "line" of a text file. This
is where we
get the quasi-acronym "LRECL," meaning "logical
record
length," which is the number of characters in each
line of text
contained in a block.
No end-of-line characters are included in the data,
because different
operating systems use different characters or combinations
of characters
to represent end-of-line. Records (lines of text) are
padded with
spaces so that they are all the same length. If records
are all the
same length, no end-of-line notations are necessary.
Usually a block
is made up of some whole number of padded records. Records
should
not be split across block boundaries.
A typical blocksize for a text file is 5120 bytes consisting
of 64
records with a record length of 80 characters (80x64=5120).
This is known as a fixed-length record because every
record is the
same length (short lines are padded, long lines are
truncated). If
every block in the file is 5120 bytes (except, perhaps,
the last one),
then the file is said to be a fixed-record, fixed-block
length file.
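Padding and truncation can be tried on an ordinary file or pipe, without a tape drive. This is a minimal sketch: the function name fixed_records is hypothetical, and a record length of 8 is used only to keep the output short where a real tape would more likely use 80.

```shell
#!/bin/sh
# fixed_records LRECL
# Read text lines on standard input and write them as fixed-length
# records: the newline is dropped, short lines are padded with spaces,
# and long lines are truncated to LRECL characters.
fixed_records() {
    dd cbs="$1" conv=block 2>/dev/null
}

# Two lines in, two 8-character records out (16 bytes, no newlines).
printf 'short\na much longer line\n' | fixed_records 8
echo
```

The first record comes out as "short" followed by three spaces, and the second is cut off after "a much l".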
Label
A tape may or may not have a standard label on it. The
label is used
by the machine that wrote the tape to keep information
about the tape
such as a volume serial number or volume name. Many
UNIX machines
do not use tape labels, so the labels simply appear
as another file
on the tape. In most cases you will simply skip over
the label when
you're reading such a tape. Rarely does the label contain
information
that will be useful on a different kind of computer.
UNIX Utilities to Read These Tapes
A number of UNIX commands can be used to examine
tapes and read
the information contained therein. First, though, there
are a few
common-sense steps that can be helpful.
Ask the person sending the tape to include a copy of
the run of the
job or the commands used to write the tape (even if
you don't know
the operating system, command-line arguments like LRECL=80
might give
you some clues). Ideally, the sender should also provide
a description
of the data and maybe a printout of the first and last
few pages,
so that you have something against which to validate
your results.
If you write a tape to send to someone else, the recipient
will certainly
appreciate getting this kind of information from you.
In the section that follows, I describe the UNIX commands
useful for
examining foreign tapes and discuss what these commands
can do and
how you will most often use them. You should also read
the manual
pages on your local system for a full description of the
commands and
how they will work on your specific version of UNIX.
UNIX Tape Devices
UNIX often uses separate names in the /dev directory
for tape
devices with different characteristics. There are several
ways to
write to a tape drive and each method is represented
by a different
device name.
To do buffered writes to a tape (where you want the
operating system
to do your blocking for you), use the regular tape device.
This is
something like /dev/mt0 on a Berkeley (BSD) UNIX machine
or
/dev/rmt/c0s0 under System V Release 4 (SVR4). "mt"
generally refers to a 9-track tape drive, though if
there is no 9-track
drive, it may simply refer to the default tape drive.
The unit number
may be used to refer to one of several tape drives or
it may be used
to specify a density for the tape. You must consult
the documentation
for your version of UNIX to determine the proper device
names.
To do unbuffered writes to a tape, where you will be
controlling the
blocking (which is generally preferred), use the "raw"
tape
device. This is something like /dev/rmt0 for BSD or
/dev/rmt/c0s0r
for SVR4.
When you access a tape device and close it (that is,
the command you
are using terminates), the tape device driver rewinds
the tape. If
you have a single file on the tape that you are reading
time after
time, this is the behavior you want. But if you have
a tape containing
multiple files, you want to be positioned at the end
of the file you
just read (and at the beginning of the next one) when
your command
finishes, so that you can then process the next file.
In this case,
you should use the "no rewind" device when
accessing the tape.
This is generally denoted by adding an n to the device
name,
like /dev/nrmt0 (BSD) or /dev/rmt/c0s0nr (SVR4). Again, consult
your local documentation for the exact name.
Moving Around on the Tape
A tape may contain zero or more files. A file is made
up of zero or
more blocks of data. All blocks should be the same size
except the
last block of the file, which may be a short block (it
is not necessary
to pad the last block).
Multiple files are separated by end-of-file (EOF) marks.
The end of
the tape is generally represented by two sequential
EOF marks (that
is, an empty file).
The Berkeley UNIX mt(1) command is used to move the
tape forward
or backward on file marks. Solaris and AIX also have
the mt command.
SVR4 uses the tapecntl(1) command to perform this task.
If you want to read the third file of a tape, for example,
you'll
use one of these commands to space forward twice (skipping
the first
two files and setting the pointer to the beginning of
the third file).
On a Berkeley UNIX system, the command
$ mt -f /dev/nrmt0 fsf 2
skips two files (fsf stands for "forward skip
file"), leaving the tape at the beginning of the
third file. In
SVR4, the command is:
$ tapecntl -p 2 /dev/rmt/c0s0nr
Note that if you are moving around on the tape, you
must specify the
"no rewind" device to the mt and tapecntl
commands
to prevent the tape from rewinding after the command
completes (which
is the default).
Scan of Files and Blocksizes
Once you have the tape set up and you can access it,
you need to figure
out the data format (or verify the data format if you're
lucky enough
to have been sent information about the tape along with
the tape).
The tcopy(1) command prints out information about the
files
found on a tape.
$ tcopy /dev/rmt0
file 1: records 1 to 25: size 51200
file 1: record 26: size 5120
file 1: eof after 26 records: 1285120 bytes
file 2: records 1 to 77: size 51200
file 2: record 78: size 10240
file 2: eof after 78 records: 3952640 bytes
file 3: records 1 to 31: size 51200
file 3: record 32: size 25600
file 3: eof after 32 records: 1612800 bytes
eot
total length: 6850560 bytes
$
Note that tcopy uses "records" to mean
tape records, which I call "blocks."
The tcopy output says that there are three files on
the tape
(end-of-tape was encountered after the third file) and
that all files
have the same blocksize (51200). Assuming for the moment
that the
files have an LRECL of 80, you can calculate 640 lines
per block (51200/80). Thus the first file has 16064 lines
(640 per block for the first 25 blocks and 64 lines in the
last short block), the second file contains
49408 lines, and the third file contains 20160 lines.
This can be
This can be
verified by dividing 80 into the total byte count for
each file. At
this point, however, that LRECL value is only a guess.
You
can be fairly sure, though, that the LRECL is an exact
divisor
of 51200, since it is bad practice to break records
across blocks.
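The arithmetic above is easy to script. The following sketch lists the record lengths in a chosen range that divide every block size in the tcopy listing exactly, then prints the line counts that follow from a guess of 80. The function name lrecl_candidates and the search range of 60 to 160 are assumptions made for illustration.

```shell
#!/bin/sh
# lrecl_candidates MIN MAX SIZE...
# Print each record length from MIN to MAX that divides every
# given block SIZE with no remainder.
lrecl_candidates() {
    min=$1; max=$2; shift 2
    out=""
    n=$min
    while [ "$n" -le "$max" ]; do
        ok=1
        for size in "$@"; do
            if [ $((size % n)) -ne 0 ]; then ok=0; fi
        done
        if [ "$ok" -eq 1 ]; then out="$out${out:+ }$n"; fi
        n=$((n + 1))
    done
    printf '%s\n' "$out"
}

# Full block size plus the three short last blocks from the listing.
lrecl_candidates 60 160 51200 5120 10240 25600

# Lines per file if the LRECL really is 80.
echo $((1285120 / 80)) $((3952640 / 80)) $((1612800 / 80))
```

The short list of candidates is what you would then test against a dump of the actual data.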
Examining Data on Tape
Given some information about the tape, the next step
is to read off
the first few blocks of each file and see if you can
make sense out
of the data. Two different commands will help with this.
The dd(1) command is used to read raw data from a device.
The
od(1) command is used to display the contents in various
formats.
So to read the first block and display it, use:
$ dd if=/dev/rmt0 ibs=51200 count=1 | od -c
0000000 @ @ @ @ @ @ 342 344 302 331 326 344 343 311 325 305
0000020 @ 311 325 304 305 347 M 311 325 361 k 311 325 362 k 311
0000040 304 311 324 305 325 k 311 346 326 331 304 k 311 327 326 342
...
(for the sake of brevity, I won't list the entire block
here).
Clearly this makes no sense whatsoever. However, the
at-signs (@) at the beginning are interesting. An EBCDIC
space has the same byte value (octal 100) as an ASCII at-sign, so
EBCDIC spaces show up as at-signs when the data is displayed as ASCII
(which is how od -c interprets it). So we can have dd convert
from EBCDIC to ASCII by adding an argument:
$ dd if=/dev/rmt0 ibs=51200 conv=ascii count=1 | od -c
0000000 S U B R O U T I N E
0000020 I N D E X ( I N 1 , I N 2 , I
0000040 D I M E N , I W O R D , I P O S
...
This does indeed look like the first line of some old
FORTRAN source code. If at this point the output still
looked like
nonsense, you could use other flags to the od command
to print
out a hexadecimal or octal dump of the data on the tape
so you could
check for other formats. If you request an octal or
hexadecimal dump,
be sure not to use the "conv" argument on
dd;
otherwise, dd will convert the data and you will not
see the
values that are really on the tape.
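If you want to see what a known string should look like in EBCDIC before you go hunting for it in a dump, dd can do the conversion in the other direction. This is a small sketch that needs no tape drive; the function name ebcdic_octal is hypothetical.

```shell
#!/bin/sh
# ebcdic_octal STRING
# Convert STRING from ASCII to EBCDIC and print the resulting bytes
# as octal values separated by single spaces.
ebcdic_octal() {
    # Word splitting of the od output is intentional here.
    set -- $(printf '%s' "$1" | dd conv=ebcdic 2>/dev/null | od -An -v -to1)
    echo "$*"
}

# A space followed by the first three letters of SUBROUTINE.
ebcdic_octal ' SUB'

# Octal 100 printed as an ASCII character is the at-sign.
printf '\100\n'
```

The values printed for the space and for S, U, and B are the same ones that appear at the start of the unconverted dump.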
Looking at more of the dump, you can also verify that
the lines are
padded out to column 80 with spaces. This being the
case (and expected
with FORTRAN source code), the original hypothesis of
LRECL=80
is true and you now know enough to read the tape. It
is a fixed-record
length, fixed-block length, EBCDIC tape.
You could use mt (or tapecntl) to skip to the second
and third file and verify their format, but since tcopy
indicated
that the blocksizes were the same, it is probably safe
to assume the
other files are in the same format. Had the blocksizes
been different,
it would have been necessary to go through this same
process again
to figure out the LRECL for those files.
To read off the complete file index.for (the name could
have
been sent with the tape or could be assigned at the
receiving end
based on that first line), we would use the following
dd command:
$ dd if=/dev/rmt0 of=index.for ibs=51200 cbs=80 conv=ascii,unblock
This would produce a "normal" UNIX file that
can be edited or compiled or used in any way you wish.
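When a tape holds several files in the same format, the same dd command can be repeated against the "no rewind" device. The sketch below assumes that the device leaves the tape at the start of the next file after each read, which may not hold on every system; the function name read_tape_files and the output file prefix are hypothetical.

```shell
#!/bin/sh
# read_tape_files DEVICE COUNT BLOCKSIZE LRECL PREFIX
# Read COUNT consecutive fixed-record EBCDIC files from DEVICE and
# write them as ordinary text files PREFIX1.txt, PREFIX2.txt, ...
# DEVICE is expected to be a "no rewind" tape device.
read_tape_files() {
    dev=$1; count=$2; bs=$3; lrecl=$4; prefix=$5
    i=1
    while [ "$i" -le "$count" ]; do
        dd if="$dev" of="$prefix$i.txt" ibs="$bs" cbs="$lrecl" \
            conv=ascii,unblock 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Example for the three-file tape above (device name is an assumption):
#   read_tape_files /dev/nrmt0 3 51200 80 tapefile
#   mt -f /dev/nrmt0 rewind
```

Check the first few lines of each output file before trusting the rest, since nothing here verifies that every file really shares the same LRECL.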
Writing Data to a Tape for Someone Else
As I mentioned earlier, if you are writing a tape for
someone, the
most important thing you can do is give them a detailed
description
of what is on the tape along with a snapshot of at least
the first
block or so of data. To write a very generic format
tape of a file,
use a command that is somewhat the opposite of the command
used to
read the tape:
$ dd if=test.c of=/dev/rmt0 obs=51200 cbs=80 conv=block
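This writes the file test.c out to the tape in
ASCII with a fixed record length of 80 characters and 640
records per block (the same layout as the example tape, though in ASCII
rather than EBCDIC; adding ebcdic to the conv list, as in conv=ebcdic,block,
would write EBCDIC instead). You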
may adjust the
arguments to the dd command to write out the specific
sizes
as required by the data you are writing on the tape.
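Because lines longer than the record length are truncated when they are blocked, it is worth checking the file before writing the tape. This is a minimal sketch; the function name count_long_lines is hypothetical.

```shell
#!/bin/sh
# count_long_lines FILE LRECL
# Print how many lines of FILE are longer than LRECL characters
# and would therefore be truncated when written as fixed records.
count_long_lines() {
    awk -v max="$2" 'length($0) > max { n++ } END { print n + 0 }' "$1"
}

# Example (file name taken from the dd command above):
#   count_long_lines test.c 80
```

A result of 0 means every line fits; anything else means you should either shorten those lines or choose a larger record length and tell the recipient what it is.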
About the Author
R. King Ables has been a UNIX user since 1980 and has
been managing systems
or developing system management and networking tools
since 1983. He is
currently doing system and network management development
for HaL Computer
Systems in Austin, TX.