Where Did You Get That Tape?

R. King Ables

Sending tapes to and receiving tapes from other UNIX machines is a fairly simple task. If a tape is created using tar(1) or cpio(1), virtually any other UNIX machine will be able to read it because almost all versions of UNIX have one (if not both) of these tape utilities.

But what do you do if you get a tape from a non-UNIX site? Programs on your UNIX box cannot understand an EBCDIC tape from an IBM system. What if you need to send data to a site that has no UNIX hosts? An IBM mainframe cannot make sense of a tar tape.

Commands like tar write files onto a tape in a specific format and include operating-system-dependent information about each file (like date of creation, owner, protection, etc.). The tar command keeps the user from having to know exactly what the data on the tape looks like and makes moving data between UNIX machines very easy, since other sites can read the tape with their own version of tar. But a computer that doesn't have tar cannot use the data on the tape as easily.

Since no single tape utility runs on every known machine and operating system (because most are so dependent on operating system information), there is no single format that can be used to make a tape easily readable to any computer. The most generic way to prepare a tape for another computer system is to write only the data in the file onto the tape (that is, to exclude information about the file such as its name, creation date, or protection). But to do this, the user must be concerned with things like blocking factors, record lengths, and character codes.

If you get a tape from a non-UNIX site, it may very well contain only data (that is, it may not be in tar or cpio format or any other format that would allow you just to pull files off the tape). You need to know (or be able to figure out) what the layout of the data is in order to construct a set of commands to read the data off the tape.

Tape Attributes

There are several attributes that make a tape unique. Understanding the different possible formats and layouts of a magnetic tape will help you figure out how best to read data from a particular tape.

Physical Type

The most obvious tape attribute is its physical type. Most tapes that come from non-UNIX sites will be 9-track tapes. These are 1/2-inch-wide tapes rolled onto large reels (like the ones you see spinning on tape drives in bad 1960s movies). More modern UNIX sites now use cartridge tapes (which look much like audio cassettes but about twice as large), 8mm, or 4mm tapes. While everything I discuss here would also apply to cartridge, 8mm, or 4mm tapes, you will most likely receive information from non-UNIX sites on 9-track tapes.

Density

Density refers to the number of frames written per inch of tape and determines how much information will fit on the tape. The more densely the information is recorded on the tape, the more you can fit on the tape. But (just as with audio and video recordings) the more dense your recording, the lower the quality (and with computer tape, the higher the chance of having an error occur when reading the tape).

The three most common density settings for 9-track tapes are 800, 1600, and 6250 bpi (nominally bits per inch, though in practice the figure is closer to bytes per inch, since each frame across the nine tracks holds one byte plus a parity bit). 1600 bpi used to be the most common density, but as tape technology improved, 6250 bpi became more and more popular (since more information can be put on the tape at 6250 bpi), and now most tapes are certified to work at the higher density.

Density is usually set on the tape drive itself or by specifying a density on the command line when writing a tape. Many UNIX systems use different device names for the same tape drive with different densities. When reading a tape, most tape drives can sense the density and adjust to read the data at the appropriate setting.

Character Code

A further attribute of a tape is its character coding. The two most commonly used codes are ASCII and EBCDIC. EBCDIC is mainly used by IBM mainframes; ASCII is the standard character code used by most PCs and UNIX machines, among others. Knowing which code was used will also help you figure out what is on a tape.

A tape should never be written in the internal display code of a machine (if different from EBCDIC or ASCII), and internally represented numbers (binary values of floating point numbers or even integers) should never be written directly to tape. The internal representation of this information is different for different processors and operating systems. If numeric data is to be sent to another machine, the data should be written out in a human-readable form (in character representations of the values the way you would print the values) and then read back in and converted to numeric values.
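
To illustrate why internal binary values travel poorly, here is a minimal Python sketch. Python and the sample value are choices made for this illustration only. It packs the same integer the way a big-endian and a little-endian processor would store it, then shows that the character form survives the trip unchanged.

```python
import struct

value = 1993  # hypothetical sample value

# The same 32-bit integer as two different processors would store it.
big_endian = struct.pack(">i", value)
little_endian = struct.pack("<i", value)

# Reading big-endian bytes on a little-endian machine yields another number.
misread = struct.unpack("<i", big_endian)[0]

# Character form: digits in a fixed-width field, just as you would print them.
as_text = "%10d" % value
recovered = int(as_text)

print(big_endian.hex(), little_endian.hex())
print(misread, repr(as_text), recovered)
```

The text field costs more bytes than the binary value, but the receiving machine needs no knowledge of the sender's processor to read it.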

Blocksize

Data is written onto a tape in groups of bytes called blocks. Sometimes tape blocks are referred to as "records." This was an unfortunate choice since the text data in a tape block is sometimes said to be made up of records (see LRECL below), which are really just lines in the file. In reading documentation from different vendors, you may find the word "record" used to describe either a tape record (which I will call a block) or a data record (which is a single line of text inside a file on a tape). You must determine the meaning from the context.

Separating the blocks of data on a tape are empty spaces known as interblock gaps (sometimes called interrecord gaps). These spaces allow the tape drive to stay synchronized, read the correct amount of data, and recover if it encounters an error in a block. Blocks can be just about any size, but each block should be the same size as the others in a file (variable-length blocks greatly complicate the situation). The larger the blocksize, the more efficiently the data can be written on the tape (since there will be fewer interblock gaps and, thus, less tape used). However, if a block contains an error, all of the data contained in that block will be lost. So the risk of larger blocks is that if an error occurs, more data is lost.
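
The effect of blocksize on tape usage can be estimated with a short calculation. The sketch below is only a rough model. It assumes one byte per frame (so that density can be treated as bytes per inch) and a 0.6-inch interblock gap, a nominal figure often quoted for 1600 bpi drives. The function name and the sample file size are hypothetical.

```python
import math

def tape_inches(total_bytes, blocksize, density_bpi=1600, gap_inches=0.6):
    """Estimate the inches of tape a file occupies at a given blocksize.

    Assumes one byte per frame and one interblock gap after every block.
    The default gap length is an assumed nominal value; real drives vary.
    """
    blocks = math.ceil(total_bytes / blocksize)
    data_inches = total_bytes / density_bpi
    return data_inches + blocks * gap_inches

file_size = 1024000  # bytes, a hypothetical file

for blocksize in (80, 5120, 51200):
    inches = tape_inches(file_size, blocksize)
    print("blocksize %6d: about %7.1f feet" % (blocksize, inches / 12))
```

Under these assumptions, writing one 80-byte record per block spends far more tape on gaps than on data, while the two larger blocksizes differ much less from each other.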

Note that some sites bill for resources and may charge per block read or written, so judicious use of blocksize can make a big difference in the charges for processing a tape.

Logical Record Length (LRECL)

Typically, each block contains some number of data records. In the IBM mainframe world, a "record" refers to what the rest of the world calls a "line" of a text file. This is where we get the quasi-acronym "LRECL," meaning "logical record length," which is the number of characters in each line of text contained in a block.

No end-of-line characters are included in the data, because different operating systems use different characters or combinations of characters to represent end-of-line. Records (lines of text) are padded with spaces so that they are all the same length. If records are all the same length, no end-of-line notations are necessary. Usually a block is made up of some whole number of padded records. Records should not be split across block boundaries.

A typical blocksize for a text file is 5120 bytes consisting of 64 records with a record length of 80 characters (80x64=5120). This is known as a fixed-length record because every record is the same length (short lines are padded, long lines are truncated). If every block in the file is 5120 bytes (except, perhaps, the last one), then the file is said to be a fixed-record, fixed-block length file.
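
The padding and grouping just described can be sketched in a few lines of Python. This is an illustration of the layout, not of how any particular tape utility is implemented. The function name and the sample lines are hypothetical.

```python
def block_records(lines, lrecl=80, records_per_block=64):
    """Pad or truncate each line to lrecl characters, then group into blocks."""
    # Drop the newline, truncate long lines, pad short ones with spaces.
    records = [line.rstrip("\n")[:lrecl].ljust(lrecl) for line in lines]
    data = "".join(records)
    blocksize = lrecl * records_per_block
    # Every block is full except, perhaps, the last one.
    return [data[i:i + blocksize] for i in range(0, len(data), blocksize)]

lines = ["      PROGRAM HELLO\n", "      END\n"] * 40  # 80 hypothetical lines
blocks = block_records(lines)
print(len(blocks), [len(b) for b in blocks])
```

With 80 input lines, the first block holds 64 records and the remaining 16 records form a short final block.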

Label

A tape may or may not have a standard label on it. The label is used by the machine that wrote the tape to keep information about the tape such as a volume serial number or volume name. Many UNIX machines do not use tape labels, so the labels simply appear as another file on the tape. In most cases you will simply skip over the label when you're reading such a tape. Rarely does the label contain information that will be useful on a different kind of computer.

UNIX Utilities to Read These Tapes

A number of UNIX commands can be used to examine tapes and read the information contained therein. First, though, there are a few common-sense steps that can be helpful.

Ask the person sending the tape to include a copy of the run of the job or the commands used to write the tape (even if you don't know the operating system, command-line arguments like LRECL=80 might give you some clues). Ideally, the sender should also provide a description of the data and maybe a printout of the first and last few pages, so that you have something against which to validate your results. If you write a tape to send to someone else, the recipient will certainly appreciate getting this kind of information from you.

In the section that follows, I describe the UNIX commands useful for examining foreign tapes and discuss what these commands can do and how you will most often use them. You should also read the manual pages on your local system for a full description of the commands and how they will work on your specific version of UNIX.

UNIX Tape Devices

UNIX often uses separate names in the /dev directory for tape devices with different characteristics. There are several ways to write to a tape drive and each method is represented by a different device name.

To do buffered writes to a tape (where you want the operating system to do your blocking for you), use the regular tape device. This is something like /dev/mt0 on a Berkeley (BSD) UNIX machine or /dev/rmt/c0s0 under System V Release 4 (SVR4). "mt" generally refers to a 9-track tape drive, though if there is no 9-track drive, it may simply refer to the default tape drive. The unit number may be used to refer to one of several tape drives or it may be used to specify a density for the tape. You must consult the documentation for your version of UNIX to determine the proper device names.

To do unbuffered writes to a tape, where you will be controlling the blocking (which is generally preferred), use the "raw" tape device. This is something like /dev/rmt0 for BSD or /dev/rmt/c0s0r for SVR4.

When you access a tape device and close it (that is, the command you are using terminates), the tape device driver rewinds the tape. If you have a single file on the tape that you are reading time after time, this is the behavior you want. But if you have a tape containing multiple files, you want to be positioned at the end of the file you just read (and at the beginning of the next one) when your command finishes, so that you can then process the next file. In this case, you should use the "no rewind" device when accessing the tape. This is generally denoted by adding an n to the device name, like /dev/nrmt0 (BSD) or /dev/rmt/c0s0nr (SVR4). Again, consult your local documentation for the exact name.

Moving around on the Tape

A tape may contain zero or more files. A file is made up of zero or more blocks of data. All blocks should be the same size except the last block of the file, which may be a short block (it is not necessary to pad the last block).

Multiple files are separated by end-of-file (EOF) marks. The end of the tape is generally represented by two sequential EOF marks (that is, an empty file).

The Berkeley UNIX mt(1) command is used to move the tape forward or backward on file marks. Solaris and AIX also have the mt command. SVR4 uses the tapecntl(1) command to perform this task.

If you want to read the third file of a tape, for example, you'll use one of these commands to space forward twice (skipping the first two files and setting the pointer to the beginning of the third file).

On a Berkeley UNIX system, the command

$ mt -f /dev/nrmt0 fsf 2

skips two files (fsf stands for "forward space file"), leaving the tape at the beginning of the third file. In SVR4, the command is:

$ tapecntl -p 2 /dev/rmt/c0s0nr

Note that if you are moving around on the tape, you must specify the "no rewind" device to the mt and tapecntl commands to prevent the tape from rewinding after the command completes (which is the default).

Scan of Files and Blocksizes

Once you have the tape set up and you can access it, you need to figure out the data format (or verify the data format if you're lucky enough to have been sent information about the tape along with the tape).

The tcopy(1) command prints out information about the files found on a tape.

$ tcopy /dev/rmt0
file 1: records 1 to 25: size 51200
file 1: record 26: size 5120
file 1: eof after 26 records: 1285120 bytes
file 2: records 1 to 77: size 51200
file 2: record 78: size 10240
file 2: eof after 78 records: 3952640 bytes
file 3: records 1 to 31: size 51200
file 3: record 32: size 25600
file 3: eof after 32 records: 1612800 bytes
eot
total length: 6850560 bytes
$

Note that tcopy uses "records" to mean tape records, which I call "blocks."

The tcopy output says that there are three files on the tape (end-of-tape was encountered after the third file) and that all files have the same blocksize (51200). Assuming for the moment that the files have an LRECL of 80, you can calculate 640 lines per block. Thus, the first file has 16064 lines (640 per block for the first 25 blocks and 64 lines in the last short block), the second file contains 49408 lines, and the third file contains 20160 lines. This can be verified by dividing the total byte count for each file by 80. At this point, however, that LRECL value is only a guess. You can be fairly sure, though, that the LRECL is some exact divisor of 51200, since it is bad practice to break records across blocks.
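
The arithmetic above is easy to check mechanically. The following Python sketch uses the block counts and sizes transcribed from the tcopy listing. The variable names and the upper limit of 256 on candidate record lengths are arbitrary choices for illustration.

```python
# (block count, block size) runs transcribed from the tcopy listing.
files = {
    1: [(25, 51200), (1, 5120)],
    2: [(77, 51200), (1, 10240)],
    3: [(31, 51200), (1, 25600)],
}

guessed_lrecl = 80  # still only a hypothesis at this point

def total_bytes(runs):
    """Sum the bytes in a list of (block count, block size) runs."""
    return sum(count * size for count, size in runs)

for number, runs in files.items():
    size = total_bytes(runs)
    lines, leftover = divmod(size, guessed_lrecl)
    print("file %d: %d bytes, %d lines, %d bytes left over"
          % (number, size, lines, leftover))

# If records are never split across blocks, the record length must
# divide every observed block size exactly.
block_sizes = {size for runs in files.values() for _, size in runs}
candidates = [n for n in range(1, 257)
              if all(size % n == 0 for size in block_sizes)]
print(candidates)
```

The candidate list narrows the search but does not settle it. Several lengths besides 80 divide all of the block sizes, so the data itself still has to be examined.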

Examining Data on Tape

Given some information about the tape, the next step is to read off the first few blocks of each file and see if you can make sense out of the data. Two different commands will help with this.

The dd(1) command is used to read raw data from a device. The od(1) command is used to display the contents in various formats. So to read the first block and display it, use:

$ dd if=/dev/rmt0 ibs=51200 count=1 | od -c
0000000  @  @  @  @  @  @ 342 344 302 331 326 344 343 311 325 305
0000020  @ 311 325 304 305 347  M 311 325 361  k 311 325 362  k 311
0000040 304 311 324 305 325  k 311 346 326 331 304  k 311 327 326 342
...

(for the sake of brevity, I won't list the entire block here).

Clearly this makes no sense whatsoever. However, the at-signs (@) at the beginning are interesting. An EBCDIC space is the byte value octal 100 (hexadecimal 40), which happens to be the at-sign in ASCII, and od -c interprets every byte as ASCII. A run of at-signs is therefore a strong hint that the tape is EBCDIC. We can have dd convert from EBCDIC to ASCII by adding an argument:

$ dd if=/dev/rmt0 ibs=51200 conv=ascii count=1 | od -c
0000000              S  U  B  R  O  U  T  I  N  E
0000020    I  N  D  E  X  (  I  N  1  ,  I  N  2  ,  I
0000040  D  I  M  E  N  ,  I  W  O  R  D  ,  I  P  O  S
...

This does indeed look like the first line of some old FORTRAN source code. If at this point the output still looked like nonsense, you could use other flags to the od command to print out a hexadecimal or octal dump of the data on the tape so you could check for other formats. If you request an octal or hexadecimal dump, be sure not to use the "conv" argument on dd; otherwise, dd will convert the data and you will not see the values that are really on the tape.
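
The conversion that dd performs can be reproduced by hand as a cross-check. The Python sketch below takes the first 48 bytes from the unconverted od -c dump (with @, M, and k written back as their octal values 100, 115, and 153) and decodes them. It assumes the cp037 code page, which is only one of several EBCDIC variants, and the translation table in your dd may differ for some punctuation characters.

```python
# First 48 bytes of the block, transcribed from the od -c dump.
octal_dump = """
100 100 100 100 100 100 342 344 302 331 326 344 343 311 325 305
100 311 325 304 305 347 115 311 325 361 153 311 325 362 153 311
304 311 324 305 325 153 311 346 326 331 304 153 311 327 326 342
"""

raw = bytes(int(token, 8) for token in octal_dump.split())

# Taken one byte per character, EBCDIC spaces (0x40) appear as at-signs.
as_ascii = raw.decode("latin-1")

# cp037 is an assumed EBCDIC code page, chosen for illustration.
as_text = raw.decode("cp037")

print(repr(as_ascii[:6]))
print(as_text)
```

The decoded text matches the converted dump: six leading spaces followed by the start of a FORTRAN SUBROUTINE statement.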

Looking at more of the dump, you can also verify that the lines are padded out to column 80 with spaces. This being the case (and expected with FORTRAN source code), the original hypothesis of LRECL=80 is confirmed and you now know enough to read the tape. It is a fixed-record length, fixed-block length, EBCDIC tape.

You could use mt (or tapecntl) to skip to the second and third file and verify their format, but since tcopy indicated that the blocksizes were the same, it is probably safe to assume the other files are in the same format. Had the blocksizes been different, it would have been necessary to go through this same process again to figure out the LRECL for those files.

To read off the complete file index.for (the name could have been sent with the tape or could be assigned at the receiving end based on that first line), we would use the following dd command:

$ dd if=/dev/rmt0 of=index.for ibs=51200 cbs=80 conv=ascii,unblock

This would produce a "normal" UNIX file that can be edited or compiled or used in any way you wish.
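
For readers who want to see what the cbs=80 and conv=ascii,unblock arguments accomplish, the following Python sketch performs the equivalent steps on data already in memory. It is an illustration rather than a description of how dd is implemented. The function name, the cp037 code page, and the sample records are all assumptions.

```python
def unblock(data, lrecl=80, encoding="cp037"):
    """Turn fixed-length records into newline-terminated text lines."""
    lines = []
    for start in range(0, len(data), lrecl):
        record = data[start:start + lrecl].decode(encoding)
        # Drop the space padding and add the UNIX end-of-line character.
        lines.append(record.rstrip(" ") + "\n")
    return "".join(lines)

# Two hypothetical 80-byte EBCDIC records standing in for tape data.
sample = ("      SUBROUTINE INDEX".ljust(80)
          + "      END".ljust(80)).encode("cp037")

print(unblock(sample), end="")
```

Note that stripping trailing spaces also removes any spaces that were genuinely part of the original line, which is rarely a problem for source code.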

Writing Data to a Tape for Someone Else

As I mentioned earlier, if you are writing a tape for someone, the most important thing you can do is give them a detailed description of what is on the tape along with a snapshot of at least the first block or so of data. To write a very generic format tape of a file, use a command that is somewhat the opposite of the command used to read the tape:

$ dd if=test.c of=/dev/rmt0 obs=51200 cbs=80 conv=block

This writes the file test.c out to the tape in ASCII in a fixed-record length of 80 characters per record and 640 records per block (just like the example tape). You may adjust the arguments to the dd command to write out the specific sizes as required by the data you are writing on the tape.

About the Author

R. King Ables has been a UNIX user since 1980 and has been managing systems or developing system management and networking tools since 1983. He is currently doing system and network management development for HaL Computer Systems in Austin, TX.