Large File System Backup Tool
W. Curtis Preston
For this edition of lost+found, I thought I'd talk about
a solution to a problem that plagues both large and small sites.
Since the solution requires a script, this column will be a mix
of backup and scripting information. I hope you find it useful.
Whether you've got a small site with just one or two backup drives,
or a large site with a 10-TB filesystem, you might find yourself
suffering from a problem that I have finally come up with a solution
for. (Although, I admit, large sites probably suffer from this
problem more than small sites do.)
The Problem
Have you ever needed to back up a filesystem that was too large
for one backup job? Perhaps it was an 8-GB filesystem, and you happen
to have two 5-GB tape drives, but you don't know how to get
the backup to start on one drive and move over to the other. Perhaps
you've taken advantage of 64-bit operating systems, and have
started using terabyte-sized filesystems. How do you back up a terabyte-sized
mount point?
If you are like the shop described in the first example, the solution
is not easy. Most backup programs (including dump, tar,
and cpio) do not have any way to start a backup on one tape
drive and then move it to another. They only understand the concept
of swapping tapes, and even that is usually meant to be done by
a human. If you had a stacker and were good with expect,
you could probably write a script that would watch for the tape
prompt, swap the tape, then notify the backup program. However,
how do you do that with two tape drives? (I suppose a real hacker
could monkey around with symbolic links to device files, but that
would be really naughty.) The only real solution is to divide the
mount point into multiple backup lists, each of which would go to
a separate tape drive.
If you are more like the second shop listed above, you've
got a real problem. What you need to do is known in backup circles
as multi-streaming. However, most backup and recovery software only
multi-streams on the mount-point level. That means that each mount
point becomes a backup stream. This works well, as long as you've
got enough time and bandwidth for each stream. However, what if
you have a terabyte-sized mount point?
The only product that I know of that automatically handles this
situation is SyncSort's Backup Express. It will multi-stream within
a filesystem. Where most backup software products assign a given
filesystem to a drive, Backup Express can make that assignment dynamically
for each file. That means that you can automatically back up a single
mount point to multiple tape drives. It's a really nice feature
if you need it.
But what about the rest of us out there who have really large
mount points and need to stream the data to multiple tape drives
(or backup streams)? The only answer is to do what has already been
mentioned -- split the mount point up into multiple backup lists.
But how do you do that? I hope you're not going to say manually!
Creating the backup lists manually would violate my modus
operandi: All backup lists should be created automatically. Never
require the administrator to manually configure the lists of what
gets backed up. This is why I like NetWorker's "All saveset,"
or NetBackup's "ALL_LOCAL_DRIVES" feature.
The Solution
What if we had a script that could automatically divide a mount
point into multiple pieces? Here are several areas where I think
this would be really helpful:
- Large database servers, where the DBAs and SAs have created
one huge mount point for all databases on a machine.
- Large NAS servers (especially Network Appliance boxes), which
allow you to easily configure terabyte-sized filesystems.
Many people back these up via NFS. Although these NetApps can
support over 100 MB/s sustained throughput during backups, they're
not going to do it with one backup stream!
- Any environment where they have disk drives bigger than their
tape drives!
A friend of mine and I were talking about this problem, and he
volunteered to undertake it as a project. The result is the script
that is included with this article (Listing 1). Many thanks to Hal
Skelly for his work on this script. Hal feels that the script still
needs some cleanup, and that it should have built-in options for other
backup products. (Right now, it creates an output file designed
for NetBackup.) However, we thought we'd get your feedback
before doing much more work on it.
To use the script, simply enter:
chopit -c streamcount -d rootdir [-v]
where streamcount is the number of streams into which to divide the
filesystem, and rootdir is the mount point to be split up.
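For example, to divide a hypothetical /db01 mount point into four
streams:
chopit -c 4 -d /db01
This prints the streams to your screen and writes the NEW_STREAM
lists to ./includes, as described below.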
How does the script work? Read on.
The Code
The code was written in Perl, developed under Solaris 2.6, and
has been tested under Red Hat Linux 6.2. Under Linux, it will successfully
split up both ext2 and DOS filesystems. The code has extensive comments,
and the rest of this article should serve as a high-level view of
the procedures.
The basic program flow is to take the command-line specification
of a filesystem and the number of streams into which to divide it.
The program then does a depth-first search, the byproduct of which
is a hash whose keys are directory names and whose values are the
sizes of the corresponding directory trees. The program then builds
streams (lists of directories), each of which comprises contents
less than or equal to the total size of the input directory divided
by the number of streams. Please note that the actual number of
streams may be one greater than the number requested.
The reason for this is that if you have an 11-GB filesystem and
request two streams, each stream should be less than or equal to
5.5 GB. If the streams are constructed such that each is 5.45 GB
and the final directory to place in a stream is 0.1 GB, adding it
to either stream makes that stream bigger than 5.5 GB, so a new stream
would be created. Finally, the program prints each stream with an
introductory line of NEW_STREAM, as required by the structure
of a NetBackup include file.
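To make the arithmetic concrete, here is the 11-GB example as a few
lines of Perl (the variable names are mine, not the script's):
my $total   = 11.0;               # GB in the filesystem
my $streams = 2;                  # requested stream count
my $target  = $total / $streams;  # 5.5 GB maximum per stream
# If two streams already hold 5.45 GB each, a 0.1-GB directory
# cannot be added to either without exceeding $target, so a
# third stream must be created.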
Functions
The following section describes the various functions of the script
and how they work.
FINDDEPTH/WANTED -- This was created using guidelines
from pp. 439-440 of the second edition of Programming Perl
by Larry Wall, Tom Christiansen, and Randal L. Schwartz. The finddepth
function of the File::Find module requires a function specifying
the criteria and actions of the depth-first search of a specified
directory. This function is the wanted function. In our program,
the wanted function looks at the current entry; if the entry is a
directory, it adds the size of the directory entry itself (i.e.,
the size shown by ls -ld) to the directory's total and rolls that
total up into its parent. Because the search is recursive and
depth-first, a directory will already have the sizes of its
subdirectories added to it by the time it is visited. I'll express
this in pseudo-code:
- If the current entry is a directory, add its directory entry
size to itself and add this new size to the value of the parent
of the current directory.
- If the current entry is not a directory (a file, link, pipe,
etc.), add its size to the current directory.
The results are put in a hash where the key is the directory (or
subdirectory) name, and the value is the size of all files and
directories below it.
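Here is a minimal sketch of that pass in Perl, assuming a global
%dirs hash as just described. The variable names are my own, and
this is not necessarily the exact code from Listing 1:
use strict;
use File::Find;

our %dirs;    # key: directory name; value: size of its tree so far

sub wanted {
    my $size = (-s $File::Find::name) || 0;   # size of this entry
    if (-d $File::Find::name) {
        # finddepth visits children first, so %dirs already holds
        # this directory's subtree total; add its own entry size,
        # then roll the total up into its parent.
        $dirs{$File::Find::name} += $size;
        $dirs{$File::Find::dir}  += $dirs{$File::Find::name}
            unless $File::Find::name eq $File::Find::dir;  # skip the root
    } else {
        # A file, link, pipe, etc.: charge it to its directory.
        $dirs{$File::Find::dir} += $size;
    }
}

finddepth(\&wanted, '/db01');   # hypothetical mount point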
STRMSIZE -- Given a comma-separated list of directories,
add up their sizes and return that value.
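A corresponding sketch of strmsize, under the same assumptions
(directory entries are keys in %dirs; a bare file falls back to its
own size):
sub strmsize {
    my ($list) = @_;
    my $total  = 0;
    for my $entry (split /,/, $list) {
        $total += exists $dirs{$entry} ? $dirs{$entry} : ((-s $entry) || 0);
    }
    return $total;
}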
BUILDSTREAMS -- Create an array of lists of directories
that comprise separate streams. The index of the array is the stream
number. This too is a recursive function.
Start with either a file or a directory entry. If the size of this
element will fit into an existing stream, add it to that stream.
Otherwise, if the size will fit into a new stream, create a new
array element (one stream, i.e., one list of directories and files).
If it will not fit into an existing stream or a new stream and is
a directory, descend one level into the directory and call
buildstreams on each of its constituents. If it is an individual
file that is bigger than a single stream, exit (e.g., an 11-GB
filesystem is to be broken into two streams, but there is a file
that is, by itself, larger than 5.5 GB).
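In sketch form, again with assumed names (@STREAMS holds one
comma-separated list per stream; $target is the total size divided
by the requested stream count; %dirs and strmsize are sketched
above):
our @STREAMS;   # each element is one comma-separated stream list
our $target;    # total size / requested number of streams

sub buildstreams {
    my ($entry) = @_;
    my $size = exists $dirs{$entry} ? $dirs{$entry} : ((-s $entry) || 0);

    # First choice: append to any existing stream with room to spare.
    for my $stream (@STREAMS) {
        if (strmsize($stream) + $size <= $target) {
            $stream .= ",$entry";   # $stream is aliased to the array element
            return;
        }
    }
    # Second choice: the entry fits in a brand-new stream by itself.
    if ($size <= $target) {
        push @STREAMS, $entry;
        return;
    }
    # Too big for any single stream: descend a level if it is a
    # directory; a single oversized file cannot be split, so give up.
    if (-d $entry) {
        opendir(my $dh, $entry) or die "opendir $entry: $!";
        my @children = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
        closedir($dh);
        buildstreams("$entry/$_") for @children;
    } else {
        die "$entry is larger than one whole stream; giving up\n";
    }
}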
PRINTSTREAMS -- Use the global array STREAMS
and print out a list of directories. The comma-separated lists that
are the array elements are printed both to STDOUT and, in the
format of an includes file, to ./includes. For example:
NEW_STREAM
Dir 1
Dir 2
NEW_STREAM
Verybigfile
The printstreams function also prints out (only to STDOUT)
the size of each stream and the grand total of all of the streams
(which should equal the size of the root of the starting directory
specified on the command line).
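A sketch of this output pass, with the same caveat that the real
script's formatting may differ:
sub printstreams {
    open(my $inc, '>', './includes') or die "./includes: $!";
    my $grand = 0;
    for my $i (0 .. $#STREAMS) {
        my @entries = split /,/, $STREAMS[$i];
        my $size    = strmsize($STREAMS[$i]);
        $grand += $size;
        print "Stream $i ($size bytes):\n";   # sizes go to STDOUT only
        print "  $_\n" for @entries;
        print $inc "NEW_STREAM\n";            # directive for NetBackup
        print $inc "$_\n" for @entries;
    }
    print "Grand total: $grand bytes\n";
    close($inc);
}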
Precis
My friend Hal has a much more formal programming background than
I do. Therefore, he offers this section that formally describes
the purpose and flow of the script.
chopit
Input: A directory name and desired number (N) of streams into
which to divide this directory.
Output: A list of directories comprising at most (N+1) streams,
written both to STDOUT and to ./includes. The latter is
in the format of a NetBackup includes directives list.
Structure
1. Get command-line options.
2. Produce a hash of directory sizes keyed by directory name (subroutine
'wanted').
3. Build the streams (subroutine 'buildstreams').
4. Print out the list of streams, both to STDOUT and as NEW_STREAM
directives to ./includes.
5. Exit.
Functions
Finddepth -- (\&wanted, starting-directory),
from the standard Perl library module File::Find.
Wanted -- If the current entry is a directory, add its own
size to the hash 'dirs', and add that value to the parent
of the current entry. If the current entry is a file, add its size
to the hash entry for the current directory name.
Strmsize -- Return the total size of a list of comma-separated
directory and file entries.
Buildstreams -- Create an array (into a global variable
named STREAMS[]) of lists of directories and files. Each
array element is one stream. Input is the name of a directory or
file, but it is initially called with the starting directory. If
the size of the entry will fit into an existing stream, add it to
the list. Otherwise, if it will fit into a new stream, create a
new stream and add it there. If it is too big for an existing stream
or a new stream, then descend one level into the directory and run
buildstreams on all of its entries.
Printstreams -- Print out the elements of STREAMS
both to STDOUT and to ./includes. The latter is in
the format of NEW_STREAM directives. The output to STDOUT
will include size information on each stream as well as its composition.
Summary
I hope that this script proves useful to you. I know I've
wished for a script like it for some time, and am glad to finally
have it in my tool bag. Please feel free to drop me a note with
your suggestions for the script. We'll be adding it to our
list of backup and recovery tools at:
http://backups.sourceforge.net.
Note that this is a new undertaking. We plan to move all our tools
to this location to facilitate development of, use of, and discussion
about these free tools. If you'd like to join the team, drop
me a note.
W. Curtis Preston has specialized in storage for over eight
years, and has designed and implemented storage systems for several
Fortune 100 companies. He is the owner of Storage Designs, the Webmaster
of Backup Central (http://www.backupcentral.com),
and the author of two books on storage. He may be reached at curtis@backupcentral.com.
(Portions of some articles may be excerpted from Curtis's books.)