Reading Beyond a Bad Header with tar
Ben Reaves
Introduction
Recently a colleague returned to the US carrying an
Exabyte tape containing
several hundred megabytes of software and data representing
over one
year's work here. Less than halfway through the tape,
tar found
an unreadable header and refused to read anything beyond
that.
The files had already been deleted from the disk, but
I did have a
copy of his Exabyte tape here. Sure enough, I got the
same error and
couldn't read past it. I went to work on the problem
and came up with
the software described in this article. We were able
to read all files
on that tape.
The Solution
The solution consists of three steps:
1. Find the bytes where the bad data appears on
the tape;
2. Read the tape, skipping those bytes;
3. Run regular UNIX tar x on the result.
The first step is done in a UNIX command like this:
dd ibs=10240 if=/dev/rst0 | tartt > file
Note that the ibs block size should match the block size the tape
was written with; for tar the default is 10240 bytes (20 blocks of
512 bytes, or 20b). The second and third steps are done in a UNIX
pipe like this:
dd ibs=10240 if=/dev/rst0 | passPart file | tar xvf -
passPart takes as its argument the file produced in step 1.
It is a simple filter that skips over the bad bytes.
Step 1 is the heart of the solution: it performs the
function of tar
t, but in addition to each file name, it lists the start
and end
byte number of each file's information on the tape.
Also, if it finds
a bad header, it reports the start and end byte of that
header, searches
for the next valid header, and continues from there.
Because of its
similarity to tar t, I call this program tartt. It is
shown in Listing 1. Figure 1 shows an
example of its
output. The passPart.c
program is presented in Listing 2. I separate the functions
of tartt
and passPart because tartt can be used by itself to
list the complete contents of a tape that has a bad
segment in it.
In this article, I first describe tartt (with line numbers
keyed to Listing 1) and its output, then discuss passPart.c,
which is a relatively simple program. Throughout, a "tape"
refers to one "tar file" and a "file" refers to one individual
file that was archived on that tape: thus tar t takes one tape
as input and gives a list of files as output.
tartt
Lines 13-20 of tartt constitute the main program and
show the
basic flow of this software. It searches for a valid
header and reads
it, writes one line of information based on that header,
then skips
over the number of blocks determined by that header.
findHeader()
calls readHeader() repeatedly until a valid header is
found;
writeInformation() simply writes a line of information
on stdout,
and skipOverData() skips over the input data to the place where
the next valid header is expected. findHeader() and readHeader()
do most of the work of this software.
Lines 26-41 are taken straight from the online man 5 tar
documentation. They describe the layout of the data in a tar
header, which is 512 bytes long and appears before each file.
Thus, for example, a 1000-byte file takes
3 x 512 = 512 + 1000 + 24 = 1536 bytes of tape: 512 for the tar
header, 1000 for the file, and 24 to round up to the next
multiple of 512.
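The block arithmetic can be sketched in C as follows. This is a
minimal illustration, not the code from Listing 1: the union follows
the old-style header layout as commonly documented in tar(5), and the
helper names data_blocks(), tape_bytes(), and header_size() are
hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define TBLOCK 512   /* size of a tar header and of each data block */
#define NAMSIZ 100   /* size of the name and linkname fields */

/* Old-style tar header, padded out to one full 512-byte block. */
union hblock {
    char dummy[TBLOCK];
    struct header {
        char name[NAMSIZ];
        char mode[8];
        char uid[8];
        char gid[8];
        char size[12];      /* file size in bytes, as octal text */
        char mtime[12];
        char chksum[8];
        char linkflag;
        char linkname[NAMSIZ];
    } dbuf;
};

/* Number of 512-byte blocks needed to hold `size` bytes of file data. */
long data_blocks(long size)
{
    return (size + TBLOCK - 1) / TBLOCK;
}

/* Total bytes on tape for one file: header block plus padded data. */
long tape_bytes(long size)
{
    return TBLOCK + data_blocks(size) * TBLOCK;
}

/* Decode the octal size field of a header into a byte count. */
long header_size(const union hblock *h)
{
    char buf[sizeof h->dbuf.size + 1];

    memcpy(buf, h->dbuf.size, sizeof h->dbuf.size);
    buf[sizeof h->dbuf.size] = '\0';    /* field may lack a NUL */
    return strtol(buf, NULL, 8);
}
```

With these helpers, tape_bytes(1000) gives the 1536 bytes worked out
above, and an empty file still costs one 512-byte header block.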
Lines 49-103 fill that structure with data from the
header. This section
of code also does validity checking on the header, looking
for end-of-file,
zero-length name, nonprintable or blank characters in
the filename,
too long a filename, or an improperly zero-filled filename
field.
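The filename checks listed above might look roughly like this. It is
only a sketch under assumptions: the function name name_is_valid() is
hypothetical, and the exact tests and their order in Listing 1 may
differ.

```c
#include <ctype.h>
#include <string.h>

#define NAMSIZ 100   /* size of the name field in a tar header */

/*
 * Return 1 if the 100-byte name field looks like a plausible
 * filename, 0 otherwise.
 */
int name_is_valid(const char name[NAMSIZ])
{
    const char *nul = memchr(name, '\0', NAMSIZ);
    size_t len, i;

    if (nul == NULL)
        return 0;                   /* no NUL: filename too long */
    len = (size_t)(nul - name);
    if (len == 0)
        return 0;                   /* zero-length name */
    for (i = 0; i < len; i++)
        if (!isgraph((unsigned char)name[i]))
            return 0;               /* nonprintable or blank */
    for (i = len; i < NAMSIZ; i++)
        if (name[i] != '\0')
            return 0;               /* field not zero-filled */
    return 1;
}
```

Rejecting blanks is a heuristic for spotting garbage rather than a
rule of the tar format: a legitimate filename containing a space
would also be rejected by this sketch.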
Lines 105-136 check the validity of the checksum of the header.
This module was written more from experience and from looking at
valid tar headers than from man 5 tar, where the information was
insufficient for writing this module. For example, the checksum
should be read with %7o, not %8o as the online documentation
implies (though does not clearly state).
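A checksum test along these lines can be sketched as below. The
sketch is not the code from Listing 1. It assumes the usual tar
convention that the checksum is the sum of all 512 header bytes with
the 8-byte checksum field counted as spaces, and that bytes are
summed as unsigned values; the field offset of 148 follows from the
field sizes 100+8+8+8+12+12. The function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define TBLOCK     512
#define CHKSUM_OFF 148   /* offset of the chksum field in the header */
#define CHKSUM_LEN 8

/* Sum all header bytes, counting the chksum field as 8 spaces. */
unsigned int compute_checksum(const unsigned char block[TBLOCK])
{
    unsigned int sum = 0;
    int i;

    for (i = 0; i < TBLOCK; i++) {
        if (i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN)
            sum += ' ';
        else
            sum += block[i];
    }
    return sum;
}

/* Return 1 if the stored checksum matches the computed one. */
int checksum_is_valid(const unsigned char block[TBLOCK])
{
    char field[CHKSUM_LEN + 1];
    unsigned int stored;

    memcpy(field, block + CHKSUM_OFF, CHKSUM_LEN);
    field[CHKSUM_LEN] = '\0';
    /* Read at most 7 octal digits, as the article recommends. */
    if (sscanf(field, "%7o", &stored) != 1)
        return 0;
    return stored == compute_checksum(block);
}
```

An all-zero block has no digits in its checksum field, so the sscanf()
call fails and the block is reported as invalid.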
If no error is found, nBlocks keeps the value set in readHeader():
the number of 512-byte blocks that the file takes on the tape,
which determines where the next tar header is expected to be found.
If an error is found, a line beginning with the word "HEADER" is
printed on stdout and a "continue" statement is executed, so that
readHeader() is called again and again until a valid header is
found. Only when a valid header is found does findHeader() return.
Lines 138-143 print a line on stdout consisting of three items:
the byte number where the valid header starts (counted from 0 at
the start of the tape), the byte number where the next header
should start, and the name of the file.
Lines 145-154 simply skip the next nBlocks 512-byte blocks
of input data, where nBlocks is the number of 512-byte
blocks
that the file is expected to occupy on the tape, according
to the
file's header on the tape.
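Because the input arrives through a pipe from dd, it cannot be
repositioned with a seek; skipping means reading and discarding. A
minimal sketch of that idea follows. The name skip_blocks() and its
return value are assumptions for illustration, not taken from
Listing 1.

```c
#include <stdio.h>

#define TBLOCK 512

/*
 * Read and discard up to nblocks 512-byte blocks from a stream that
 * may not be seekable (for example, a pipe). Returns the number of
 * bytes actually skipped, which is less than nblocks * 512 if the
 * input ends early.
 */
long skip_blocks(FILE *in, long nblocks)
{
    char buf[TBLOCK];
    long skipped = 0;

    while (nblocks-- > 0) {
        size_t n = fread(buf, 1, TBLOCK, in);

        skipped += (long)n;
        if (n < TBLOCK)
            break;              /* end of input or read error */
    }
    return skipped;
}
```

Returning the byte count lets the caller keep its running tape
position correct even when the input ends partway through a file.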
When tar t is run on a small example with a corrupted
header,
the output is
1 /h/ben/work/
2 /h/ben/work/fullmeeting.txt
3 /h/ben/work/nc
4 tar: directory checksum error (3370 != 3250)
This means that there was a checksum error on the fourth
file, and there's no way of knowing what was past it.
When tartt
is run on the same example, the output is as shown in
Figure 1.
The first three lines show the same information as tar
t does,
with byte numbers. The fourth line reports the checksum
error. At
that point, tar t gave up, but tartt doesn't.
In lines 5 and 6, tartt searches for a valid header,
and finally
finds one at line 7, byte 39936. Thus, there is a bad
header and possibly
a file from byte 38400 through 39935 of that tape.
Lines 7, 8, and 9 show files that tar t completely missed.
In this example, that's only three files, but in my
colleague's case,
it was thousands of files, hundreds of megabytes: months
of work to
regenerate.
Lines 10 and 11 show the two null headers that, according
to the tar
specification, signify the end of the tape (the tar
archive
file, to be specific). tartt ignores these, just in case there
might be some valid data past the null headers. It keeps reading
until it reaches the EOF marker on the tape, which stops the
reading at the device driver level.
passPart.c
The information generated by tartt and shown in Figure
1 suggests
that if it were possible to skip over bytes 38400 through
39935, it
should also be possible to run tar x to extract all
files from
the tape with no problem. That is precisely what passPart.c,
in Listing 2, does: it looks at the output of tartt,
decides
which parts of the corrupted tape to block and which
to pass, and
passes them.
passPart.c uses two input streams: the file containing the tartt
output, and stdin for reading the corrupted tape. It also uses two
output streams: stdout, which tar x reads, and stderr for debug
output.
Lines 3-15 show the simple main program, which calls
two subroutines:
one to read the file specified in argv[1] and make the
list
of bytes to skip, and one to pass the appropriate bytes
from stdin
to stdout. Lines 10-13 just print debugging output, used to check
that the list is made properly (this simple version does no error
checking). Note that this output must be directed to stderr, not
stdout, because stdout carries the data that tar x reads.
The makeSkipList() module is written in readable, though
perhaps
slow, style because the output of tartt is usually not
too
long: a few thousand lines at most. Basically, it just
looks at the
first letter of each output line to determine whether
it describes
a good header or a bad header. From this information
it creates a
list of integers corresponding to the byte numbers to
skip: start
contains the first byte to skip, and end contains the
first
byte to not skip.
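The idea can be sketched as follows. The exact line format of the
tartt output is not reproduced in this text, so the sketch assumes a
hypothetical one: good lines look like "<start> <next> <name>" and
bad lines look like "HEADER <start> <end>", where <end> is the first
byte not to skip. The struct and function names are also assumptions;
adjust them to the real output of Listing 1.

```c
#include <stdio.h>

/* One range of bytes to drop: start is the first byte to skip,
 * end is the first byte NOT to skip. */
struct skip {
    long start;
    long end;
};

/*
 * Build a skip list from tartt-style output.
 * ASSUMED (hypothetical) line formats:
 *   good header:  "<start> <next> <name>"
 *   bad header:   "HEADER <start> <end> ..."
 * Returns the number of ranges stored in list (at most max).
 */
int make_skip_list(FILE *fp, struct skip *list, int max)
{
    char line[1024];
    long s, e;
    int n = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (line[0] != 'H')
            continue;               /* good header: nothing to skip */
        if (sscanf(line, "HEADER %ld %ld", &s, &e) != 2)
            continue;               /* not in the assumed format */
        if (n > 0 && list[n - 1].end == s) {
            list[n - 1].end = e;    /* extend an adjacent bad range */
        } else if (n < max) {
            list[n].start = s;
            list[n].end = e;
            n++;
        }
    }
    return n;
}
```

Merging adjacent bad blocks keeps the list short, so three
consecutive bad 512-byte blocks become a single range to skip.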
The passBytes() module is written to be fast, because
the amount
of data it must process is typically hundreds of megabytes.
Its function
is straightforward: it passes the stdin stream to stdout,
or blocks it, depending on the byte number from the
list of start
and end points.
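A filter of this kind can be sketched as below. It is not the code
from Listing 2: the names are hypothetical, the buffer size of 10240
is chosen only to match the tar default block size, and the list is
assumed to be sorted by start with no overlapping ranges. Working on
whole buffers rather than single bytes is what keeps it fast.

```c
#include <stdio.h>

/* start is the first byte to skip, end is the first byte to pass. */
struct skip {
    long start;
    long end;
};

/*
 * Copy in to out, dropping every byte whose position falls in one of
 * the n ranges in list (sorted, non-overlapping). Returns the number
 * of bytes passed, or -1 on a write error.
 */
long pass_bytes(FILE *in, FILE *out, const struct skip *list, int n)
{
    char buf[10240];
    long pos = 0;       /* stream position of buf[0] */
    long passed = 0;
    int i = 0;          /* index of the next relevant range */
    size_t got;

    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        size_t off = 0;

        while (off < got) {
            long here = pos + (long)off;
            size_t chunk = got - off;

            while (i < n && list[i].end <= here)
                i++;                        /* range is behind us */
            if (i < n && list[i].start <= here) {
                /* Inside a skip range: drop up to its end. */
                long left = list[i].end - here;

                if ((size_t)left < chunk)
                    chunk = (size_t)left;
            } else {
                /* Outside: pass up to the start of the next range. */
                if (i < n && list[i].start - here < (long)chunk)
                    chunk = (size_t)(list[i].start - here);
                if (fwrite(buf + off, 1, chunk, out) != chunk)
                    return -1;
                passed += (long)chunk;
            }
            off += chunk;
        }
        pos += (long)got;
    }
    return passed;
}
```

A range that straddles two buffers is handled naturally, because the
position check is repeated at the start of each new buffer.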
Conclusion
The software described here is a relatively simple set
of tools to
recover all files from a tar-format tape that contains
bad
headers. It does not, of course, catch all types of header errors
-- for example, if a few bytes (not a multiple of 512) have been
mistakenly inserted into or deleted from the tape, these tools
cannot recover the tape.
However, the code could be rewritten to do just that, by having it
read and verify a header based on a moving window of length 512
bytes, shifting one byte for each iteration. This would be
extremely slow and, in my experience, is usually unnecessary: most
errors are due to substitution, not deletion or insertion. A moving
window also has its own problems. If the tape contains a file which
is itself a tar-format archive, the "moving window" algorithm will
become confused. And if it does hundreds of millions of
comparisons, the chance of a nonsensical header mistakenly passing
the readHeader() and findHeader() tests increases.
Now what about the files that were skipped because of
their bad headers?
I will leave that as an exercise for the reader.
About the Author
Ben Reaves received a BSEE degree from the University
of Southern
California in 1981 and an MSEE from Stanford University
in 1983. He
was a Research Engineer and System Administrator for
Speech Technology
Laboratory in Santa Barbara, California from 1985 to
1987 and now
works on location at Matsushita Electric Industrial
Company's Central
Research Laboratory near Osaka, Japan.