Reading Beyond a Bad Header with tar
Recently a colleague returned to the US carrying an Exabyte tape
containing several hundred megabytes of software and data
representing a year's work here. Less than halfway through the tape,
tar encountered an unreadable header and refused to read anything
beyond it. The files had already been deleted from the disk, but I
did have a copy of his Exabyte tape here. Sure enough, I got the same
error and couldn't read past it. I went to work on the problem and
came up with the software described in this article. We were able to
read all the files on that tape.
The solution consists basically of three steps:
1. Find the bytes where the bad data appears on the tape;
2. Read the tape, skipping those bytes;
3. Run regular UNIX tar x on the result.
The first step is done in a UNIX command like this:
dd ibs=10240 if=/dev/rst0 | tartt > file
Note that the ibs block size should be the same as the block size the
tape was written with; for tar, the default is 10240 (or 20b). The
second and third steps are done in a UNIX pipe:
dd ibs=10240 if=/dev/rst0 | passPart file | tar xvf -
The passPart program takes as input the results of step 1. It is a
simple filter which skips over the bad bytes.
Step 1 is the heart of the solution: it performs the function of tar
t, but in addition to each file name, it lists the start byte number
of each file's information on the tape. Also, if it finds a bad
header, it reports the start and end byte of that header, searches
for the next valid header, and continues from there. Because of its
similarity to tar t, I call this program tartt. It is shown in
Listing 1. Figure 1 shows an example of its output. The passPart.c
program is presented in Listing 2. I separate the functions of tartt
and passPart because tartt can be used by itself to list the complete
contents of a tape that has a bad segment in it.
In this article, I first describe tartt (with line numbers keyed to
Listing 1) and its output, then discuss passPart.c, which is a
relatively simple program. Throughout, a "tape" refers to one "tar
file" and a "file" refers to one individual file that was archived on
that tape: tar t takes one tape as input, and gives a list of files
as output.
Lines 13-20 of tartt constitute the main program and show the basic
flow of this software. It searches for a valid header and reads it,
writes one line of information based on that header, and skips over
the number of blocks determined by that header. findHeader() calls
readHeader() repeatedly until a valid header is found.
writeInformation() simply writes a line of information to stdout, and
skipOverData() skips over the input data to the point where the next
valid header is expected to be. findHeader() and readHeader() do most
of the work of this software.
Lines 26-41 are straight from the online man 5 tar documentation
-- they describe the layout of the data in a tar header,
which is 512 bytes long and appears before each file.
Thus, for example, a 1000-byte file takes 512x3 = 512+1000+24 = 1536
bytes on the tape: 512 for the tar header, 1000 for the file, and 24
to round up to the next multiple of 512.
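
The arithmetic above can be written as a short C sketch. The names
dataBlocks() and bytesOnTape() are hypothetical and are not taken from
Listing 1; the sketch only assumes the 512-byte header and 512-byte
padding described in the text.

```c
#define TAR_BLOCK 512L

/* Number of 512-byte blocks needed to hold fileSize bytes of data. */
long dataBlocks(long fileSize)
{
    return (fileSize + TAR_BLOCK - 1) / TAR_BLOCK;
}

/* Total bytes one archived file occupies on the tape:
   one header block plus the data rounded up to whole blocks. */
long bytesOnTape(long fileSize)
{
    return TAR_BLOCK * (1 + dataBlocks(fileSize));
}
```

For the 1000-byte file in the example, dataBlocks(1000) is 2 and
bytesOnTape(1000) is 1536.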
Lines 49-103 fill that structure with data from the header. This
section of code also does validity checking on the header, looking
for a zero-length name, nonprintable or blank characters in the name,
too long a filename, or an improperly zero-filled filename field.
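
As an illustration of the kind of checks listed above, here is a
minimal sketch of a name-field test. It is not the code from Listing
1: the function name nameLooksValid() is hypothetical, the 100-byte
field size is the standard tar name field, and treating a name that
fills the whole field as "too long" is an assumption.

```c
#include <ctype.h>

#define NAME_LEN 100   /* size of the name field in a tar header */

/* Returns 1 if the name field looks plausible, 0 otherwise. */
int nameLooksValid(const char name[NAME_LEN])
{
    int i, len = 0;

    /* Measure the name up to the first NUL byte. */
    while (len < NAME_LEN && name[len] != '\0')
        len++;
    if (len == 0)
        return 0;               /* zero-length name */
    if (len == NAME_LEN)
        return 0;               /* assumption: no room for a NUL means too long */

    for (i = 0; i < len; i++) {
        if (!isgraph((unsigned char)name[i]))
            return 0;           /* nonprintable or blank character */
    }
    /* Everything after the name must be zero fill. */
    for (i = len; i < NAME_LEN; i++) {
        if (name[i] != '\0')
            return 0;           /* improperly zero-filled field */
    }
    return 1;
}
```

Rejecting blanks is stricter than the tar format requires, but it
matches the checks described in the text and makes it less likely
that random data is accepted as a header.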
Lines 105-136 check the validity of the checksum of the header. This
module was written more by experience and by looking at valid tar
headers than by looking at man 5 tar, where the information was
insufficient for writing this module. For example, the checksum
should be read by %7o, not %8o as the online documentation implies
(though does not clearly state).
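
For readers without the listing at hand, the following is a sketch of
one common way to verify a tar header checksum; it is not the code
from Listing 1. It relies on the standard header layout, in which the
8-byte chksum field starts at offset 148 and is counted as eight
blanks when the sum is computed. The function names are hypothetical,
and the stored value is parsed with strtoul() rather than the %7o
format mentioned above.

```c
#include <stdlib.h>
#include <string.h>

#define HEADER_LEN 512
#define CHKSUM_OFF 148   /* offset of the chksum field in the header */
#define CHKSUM_LEN 8

/* Sum of all header bytes, counting the chksum field as eight blanks. */
unsigned long computeChecksum(const unsigned char hdr[HEADER_LEN])
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < HEADER_LEN; i++) {
        if (i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN)
            sum += ' ';
        else
            sum += hdr[i];
    }
    return sum;
}

/* Returns 1 if the stored octal checksum matches the computed one. */
int checksumMatches(const unsigned char hdr[HEADER_LEN])
{
    char field[CHKSUM_LEN + 1];
    char *end;
    unsigned long stored;

    /* Copy the field so that it is certain to be NUL-terminated. */
    memcpy(field, hdr + CHKSUM_OFF, CHKSUM_LEN);
    field[CHKSUM_LEN] = '\0';

    stored = strtoul(field, &end, 8);
    if (end == field)
        return 0;               /* no octal digits at all */
    return stored == computeChecksum(hdr);
}
```

A single substituted byte anywhere in the header changes the computed
sum, which is how the kind of error described in this article shows up.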
If no error is found, nBlocks remains, as it was set earlier, the
number of 512-byte blocks that the file takes on the tape, which
determines where the next tar header is expected to be found. If an
error is found, a line beginning with the word "HEADER" is printed on
stdout, and a "continue" statement is executed; this forces
readHeader() to be called again and again until a valid header is
found. Only when a valid header is found does findHeader() return.
Lines 138-143 print a line on stdout consisting of the byte number,
starting sequentially from 0 at the start of the tape, where the
valid header starts, the byte number where the next header should
start, and the name of the file.
Lines 145-154 simply skip the next nBlocks 512-byte blocks of input
data, where nBlocks is the number of 512-byte blocks that the file is
expected to occupy on the tape, according to that file's header on
the tape.
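
A minimal sketch of such a skip routine follows. It is not the code
from Listing 1, and the name skipBlocks() is hypothetical. It reads
and discards the data instead of seeking, on the assumption that the
input arrives through a pipe from dd and therefore cannot be
repositioned.

```c
#include <stdio.h>

#define TAR_BLOCK 512

/* Read and discard nBlocks 512-byte blocks from in.
   Returns the number of bytes actually skipped, which is less than
   nBlocks * 512 only if the input ends early. */
long skipBlocks(FILE *in, long nBlocks)
{
    char buf[TAR_BLOCK];
    long skipped = 0;
    long i;

    for (i = 0; i < nBlocks; i++) {
        size_t got = fread(buf, 1, sizeof buf, in);
        skipped += (long)got;
        if (got < sizeof buf)
            break;              /* EOF or read error: stop early */
    }
    return skipped;
}
```

Returning the byte count lets the caller keep its running byte
number in step with what was really consumed.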
When tar t is run on a small example with a corrupted header,
the output is
4 tar: directory checksum error (3370 != 3250)
This means that there was a checksum error on the fourth file, and
there's no way of knowing what was past it. When tartt is run on the
same example, the output is as shown in Figure 1.
The first three lines show the same information as tar t, but with
byte numbers. The fourth line reports the checksum error. At that
point, tar t gave up, but tartt doesn't.
In lines 5 and 6, tartt searches for a valid header, and finds one at
line 7, byte 39936. Thus, there is a bad header and possibly a file
from byte 38400 through 39935 of that tape.
Lines 7, 8, and 9 show files that tar t completely missed. In this
example, that's only three files, but in my case it was thousands of
files and hundreds of megabytes: months of work.
Lines 10 and 11 show the two null headers that, according to the tar
specification, signify the end of the tape (the tar file, to be
specific). tartt ignores these, just in case there might be some
valid data past the null headers. It reads until it can read no more:
at the EOF marker on the tape, which stops the reading at the device
driver level.
The information generated by tartt and shown in Figure 1 suggests
that if it were possible to skip over bytes 38400 through 39935, it
should also be possible to run tar x to extract all the files on the
tape with no problem. That is precisely what passPart.c, shown in
Listing 2, does: it looks at the output of tartt, determines which
parts of the corrupted tape to block and which to pass, and filters
the tape data accordingly.
passPart.c uses two input streams, one for reading the tartt output
and stdin for reading the corrupted tape; and two output streams,
stdout for tar x to read and stderr for debugging output.
Lines 3-15 show the simple main program, which calls two functions:
one to read the file specified in argv and make the list of bytes to
skip, and one to pass the appropriate bytes to stdout. Lines 10-13
are just for debugging output (note that it must be directed to
stderr, not stdout), to be sure that the list is made properly (this
simple program does no error checking).
The makeSkipList() module is written in readable, though slow, style
because the output of tartt is usually not long: a few thousand lines
at most. Basically, it just looks at the first letter of each output
line to determine whether the line describes a good header or a bad
header. From this information it creates a list of integers
corresponding to the byte numbers to skip: start contains the first
byte to skip, and end contains the first byte to not skip.
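
The idea can be sketched as follows. This is not the code from
Listing 2, and it rests on an assumed line format, since Figure 1 is
not reproduced here: a bad-header line is assumed to begin with the
word HEADER and to carry the byte number of that header as the first
number on the line, while a good line is assumed to begin with the
byte number where its header starts. The names makeSkipListSketch(),
firstNumber(), and maxSkips are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Return the first decimal number on the line, or -1 if there is none. */
static long firstNumber(const char *line)
{
    while (*line != '\0' && !isdigit((unsigned char)*line))
        line++;
    return *line ? strtol(line, NULL, 10) : -1;
}

/* Build skip ranges [start[i], end[i]) from a tartt-style listing.
   Returns the number of ranges stored (at most maxSkips). */
int makeSkipListSketch(FILE *list, long start[], long end[], int maxSkips)
{
    char line[1024];
    int n = 0;
    int inBadRun = 0;   /* 1 while inside a run of bad-header lines */

    while (fgets(line, sizeof line, list) != NULL) {
        long byte = firstNumber(line);
        if (byte < 0)
            continue;                   /* no byte number: ignore the line */
        if (line[0] == 'H') {           /* assumed marker for a bad header */
            if (!inBadRun && n < maxSkips) {
                start[n] = byte;        /* first byte to skip */
                inBadRun = 1;
            }
        } else if (inBadRun) {          /* first good header after a bad run */
            end[n++] = byte;            /* first byte not to skip */
            inBadRun = 0;
        }
    }
    /* A bad run with no good header after it is left out of the list. */
    return n;
}
```

With the example discussed earlier, bad headers from byte 38400 and a
valid header at byte 39936, this would produce one range with start
38400 and end 39936.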
The passBytes() module is written to be fast, because the amount of
data it must process is typically hundreds of megabytes. It is
straightforward: it passes the stdin stream to stdout, or blocks it,
depending on the byte number from the list of start and end points.
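
A minimal sketch of such a filter is shown here; it is not the code
from Listing 2. The name passBytesSketch() and its parameters are
hypothetical, and it assumes the skip ranges are sorted, do not
overlap, and follow the convention described above (start is the
first byte to skip, end is the first byte not to skip). It works on
whole buffers rather than single bytes to keep the copy fast.

```c
#include <stdio.h>

/* Copy in to out, dropping the bytes in each range [start[i], end[i]).
   Returns the number of bytes written. */
long passBytesSketch(FILE *in, FILE *out,
                     const long start[], const long end[], int nSkips)
{
    char buf[10240];
    long pos = 0;        /* byte number of buf[0] within the input */
    long written = 0;
    int k = 0;           /* index of the next skip range */
    size_t got;

    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        long off = 0;
        while (off < (long)got) {
            long here = pos + off;
            long left = (long)got - off;
            long run;

            while (k < nSkips && end[k] <= here)
                k++;                        /* this range is behind us */
            if (k < nSkips && start[k] <= here) {
                run = end[k] - here;        /* inside a skip range: drop */
                if (run > left)
                    run = left;
            } else {
                run = left;                 /* pass up to the next range */
                if (k < nSkips && start[k] - here < run)
                    run = start[k] - here;
                written += (long)fwrite(buf + off, 1, (size_t)run, out);
            }
            off += run;
        }
        pos += (long)got;
    }
    return written;
}
```

In a filter like passPart, in would be stdin and out would be stdout,
so that the result can be piped straight into tar xvf -.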
The software described here is a relatively simple set of tools to
recover all files from a tar-format tape that contains bad headers.
It does not, of course, catch all types of errors -- for example, if
a few bytes, not a multiple of 512, are mistakenly inserted or
deleted from the tape, these tools cannot recover the files that
follow.
However, the code could be rewritten to do just that, by having it
read and verify a header based on a moving window of length 512
bytes, shifting one byte for each iteration. This would be slow and,
in my experience, this is usually unnecessary: most errors are due to
substitution, not deletion or insertion. A moving window also has its
own problems: if the tape contains a file which is itself a
tar-format archive, the "moving window" algorithm will mistake the
headers inside that archive for headers on the tape. And if it does
hundreds of millions of comparisons, the chance of a nonsensical
header mistakenly passing the readHeader() and findHeader() tests
increases.
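
The moving-window idea might look something like the sketch below. It
is an illustration only, not part of Listing 1 or Listing 2: the
function names are hypothetical, the only tests applied are a
non-empty name and a matching checksum (using the standard chksum
field at offset 148), and it searches a buffer already in memory
rather than a tape stream.

```c
#include <stdlib.h>
#include <string.h>

#define HEADER_LEN 512
#define CHKSUM_OFF 148   /* offset of the chksum field in the header */
#define CHKSUM_LEN 8

/* 1 if the 512 bytes at w hold a checksum field matching their sum. */
static int windowChecksumOk(const unsigned char *w)
{
    char field[CHKSUM_LEN + 1];
    char *endp;
    unsigned long stored, sum = 0;
    int i;

    for (i = 0; i < HEADER_LEN; i++)
        sum += (i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN) ? ' ' : w[i];

    memcpy(field, w + CHKSUM_OFF, CHKSUM_LEN);
    field[CHKSUM_LEN] = '\0';
    stored = strtoul(field, &endp, 8);
    return endp != field && stored == sum;
}

/* Slide a 512-byte window over buf one byte at a time.
   Returns the offset of the first plausible header, or -1 if none. */
long findHeaderByWindow(const unsigned char *buf, long len)
{
    long off;

    for (off = 0; off + HEADER_LEN <= len; off++) {
        if (buf[off] != '\0' && windowChecksumOk(buf + off))
            return off;
    }
    return -1;
}
```

Each window costs 512 additions, so a search over hundreds of
megabytes is expensive, and with only these two tests a false match
is possible, as noted above.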
Now what about the files that were skipped because of
their bad headers?
I will leave that as an exercise for the reader.
About the Author
Ben Reaves received a BSEE degree from the University of
California in 1981 and an MSEE from Stanford University
in 1983. He
was a Research Engineer and System Administrator for
Laboratory in Santa Barbara, California from 1985 to
1987 and now
works on location at Matsushita Electrical Industrial
Research Laboratory near Osaka, Japan.