Article

Reading beyond a bad Header with tar

Ben Reaves

Introduction

Recently a colleague returned to the US carrying an Exabyte tape containing several hundred megabytes of software and data representing over one year's work here. Less than halfway through the tape, tar found an unreadable header and refused to read anything beyond that.

The files had already been deleted from the disk, but I did have a copy of his Exabyte tape here. Sure enough, I got the same error and couldn't read past it. I went to work on the problem and came up with the software described in this article. We were able to read all files on that tape.

The Solution

The solution consists basically of three steps:

1. Find the bytes where the bad data appears on the tape;

2. Read the tape, skipping those bytes;

3. Run regular UNIX tar x on the result.

The first step is done in a UNIX command like this:

dd ibs=10240 if=/dev/rst0 | tartt > file

Note that the ibs block size should be the same as the block size the tape was written with; for tar the default is 10240 (or 20b). The second and third steps are done in a UNIX pipe like this:

dd ibs=10240 if=/dev/rst0 | passPart file | tar xvf -

The passPart takes as input the results of step 1. It is a simple filter which skips over the bad bytes.

Step 1 is the heart of the solution: it performs the function of tar t, but in addition to each file name, it lists the start and end byte number of each file's information on the tape. Also, if it finds a bad header, it reports the start and end byte of that header, searches for the next valid header, and continues from there. Because of its similarity to tar t, I call this program tartt. It is shown in Listing 1. Figure 1 shows an example of its output. The passPart.c program is presented in Listing 2. I separate the functions of tartt and passPart because tartt can be used by itself to list the complete contents of a tape that has a bad segment in it.

In this article, I first describe tartt (with line numbers keyed to Listing 1) and its output, then discuss passPart.c, which is a relatively simple program. In this article, a "tape" refers to one "tar file" and a "file" refers to one individual file that was archived on that tape: thus tar t takes one tape as input, and gives a list of files as output.

tartt

Lines 13-20 of tartt constitute the main program and show the basic flow of this software. It searches for a valid header and reads it, writes one line of information based on that header, then skips over the number of blocks determined by that header. findHeader() calls readHeader() repeatedly until a valid header is found; writeInformation() simply writes a line of information on stdout, and skipOverData() skips over the input data to the place that the next valid header is expected to be. findHeader() and readHeader() do most of the work of this software.

Lines 26-41 are straight from the online man 5 tar documentation -- they describe the layout of the data in a tar header, which is 512 bytes long and appears before each file. Thus, for example, a 1000-byte file takes 512x3 = 512+1000+24 = 1536 bytes of tape: 512 for the tar header, 1000 for the file, and 24 to round up to the next multiple of 512.

Lines 49-103 fill that structure with data from the header. This section of code also does validity checking on the header, looking for end-of-file, zero-length name, nonprintable or blank characters in the filename, too long a filename, or an improperly zero-filled filename field.

Lines 105-136 check the validity of the checksum of the header. This module was written more by experience and by looking at valid tar headers than by looking at man 5 tar, where the information was insufficient for writing this module. For example, the checksum should be read by %7o, not %8o as the online documentation implies (though does not clearly state).

If no error is found, nBlocks remains, as it was set in readHeader(), the number of 512-byte blocks that the file takes on the tape, where the next tar header is expected to be found. If an error is found, a line beginning with the word "HEADER" is printed on stdout, and a "continue" statement is executed; this forces readHeader() to be called again and again until a valid header is found. Only when a valid header is found does findHeader() return.

Lines 138-143 print a line on stdout consisting of the byte number, starting sequentially from 0 at the start of the tape, where the valid header starts, the byte number where the next header should start, and the name of the file.

Lines 145-154 simply skip the next nBlocks 512-byte-blocks of input data, where nBlocks is the number of 512-byte blocks that the file is expected to occupy on the tape, according to the file's header on the tape.

When tar t is run on a small example with a corrupted header, the output is

1 /h/ben/work/
2 /h/ben/work/fullmeeting.txt
3 /h/ben/work/nc
4 tar: directory checksum error (3370 != 3250)

This means that there was a checksum error on the fourth file, and there's no way of knowing what was past it. When tartt is run on the same example, the output is as shown in Figure 1.

The first three lines show the same information as tar t does, with byte numbers. The fourth line reports the checksum error. At that point, tar t gave up, but tartt doesn't.

In lines 5 and 6, tartt searches for a valid header, and finally finds one at line 7, byte 39936. Thus, there is a bad header and possibly a file from byte 38400 through 39935 of that tape.

Lines 7, 8, and 9 show files that tar t completely missed. In this example, that's only three files, but in my colleague's case, it was thousands of files, hundreds of megabytes: months of work to regenerate.

Lines 10 and 11 show the two null headers that, according to the tar specification, signify the end of the tape (the tar archive file, to be specific). tartt ignores these, just in case there might be some valid data past the null headers. It reads until it can read no more: at the EOF marker on the tape, which stops the reading at the device driver level.

passPart.c

The information generated by tartt and shown in Figure 1 suggests that if it were possible to skip over bytes 38400 through 39935, it should also be possible to run tar x to extract all files from the tape with no problem. That is precisely what passPart.c, in Listing 2, does: it looks at the output of tartt, decides which parts of the corrupted tape to block and which to pass, and passes them.

passPart.c uses two input streams, one for reading the tartt output and stdin for reading the corrupted tape; and two output streams, one for tar x to read and stderr for debug output.

Lines 3-15 show the simple main program, which calls two subroutines: one to read the file specified in argv[1] and make the list of bytes to skip, and one to pass the appropriate bytes from stdin to stdout. Lines 10-13 are just for debugging output -- note that it must be directed to stderr, not stdout, to be sure that the list is made properly (this simple version does no error checking).

The makeSkipList() module is written in readable, though perhaps slow, style because the output of tartt is usually not too long: a few thousand lines at most. Basically, it just looks at the first letter of each output line to determine whether it describes a good header or a bad header. From this information it creates a list of integers corresponding to the byte numbers to skip: start contains the first byte to skip, and end contains the first byte to not skip.

The passBytes() module is written to be fast, because the amount of data it must process is typically hundreds of megabytes. Its function is straightforward: it passes the stdin stream to stdout, or blocks it, depending on the byte number from the list of start and end points.

Conclusion

The software described here is a relatively simple set of tools to recover all files from a tar-format tape that contains bad headers. It does not, of course, catch all types of header errors -- for example, if a few bytes, not a multiple of 512, have been mistakenly inserted or deleted from the tape, these tools cannot recover it.

However, the code could be rewritten to do just that, by having it read and verify a header based on a moving window of length 512 bytes, shifting one byte for each iteration. This would be extremely slow and, in my experience, this is usually unnecessary: most errors are due to substitution, not deletion or insertion. A moving window has its own problems: if the tape contains a file which is itself a tar-format archive, the "moving window" algorithm will become confused. And if it does hundreds of millions of comparisons, the chance of a nonsensical header mistakenly passing the readHeader() and findHeader() tests increases.

Now what about the files that were skipped because of their bad headers? I will leave that as an exercise for the reader.

About the Author

Ben Reaves received a BSEE degree from the University of Southern California in 1981 and an MSEE from Stanford University in 1983. He was a Research Engineer and System Administrator for Speech Technology Laboratory in Santa Barbara, California from 1985 to 1987 and now works on location at Matsushita Electrical Industrial Company's Central Research Laboratory near Osaka, Japan.