Article

File Management Tools

Marty Leisner

Copying, moving, and maintaining files are basic functions every computer user, but are particularly critical for system administrators. Common though these functions may be, they are not error proof (how many times have you accidentally overwritten a file, then had to haul out your backup tapes?). This article elaborates on a number of file management tools and techniques I've learned from the "school of hard knocks," techniques that help you deal with files in an organized way. Specifically, this article will address:

using the update and backup options

linking whole directory trees

managing distributions

New Features in Common Tools

The GNU fileutils packages present useful enhancements to a number of tools (especially mv, cp and ln). The most current copy is fileutils-3.12.tar.gz, available on any gnu archive (the main one is on prep.ai.mit.edu in /pub/gnu). In addition to file moving utilities, the package has enhanced copies of chmod, chown, dd, df, du, mknod, touch, rm, and rmdir, as well as a highly enhanced and useful version of ls (with its plethora of options).

GNU cp and mv have uniform features accessed via the following options:

backup -- Save previous data under a unique name, don't clobber it.

long -- Use the long word form of the option. Most options have a simple one-letter description and a corresponding long word. They usually (but not always) occur in matched pairs. For example:

--f, --force -- remove existing destinations, never prompt

--i, --interactive -- prompt before overwrite

The long options help shell scripts become self-documenting.

help -- Display a mini man page online. This is much quicker than doing a man. The discussion of the long option above is an example.

verbose -- Echo commands as they are executed.

interactive -- Ask for confirmation before executing destructive commands.

update -- Perform the operation only if the source is newer than the destination.

provide version -- Display a version interactively. This is useful when multiple versions are installed on various places of a system.

The update and backup facilities are new to these utilities (I've never seen them on other versions of UNIX), and they are very handy.

Backups and Version Control

In these utilities, version control amounts to the ability to make copies before doing anything destructive. This means that when you are confident you've done the right thing, you can remove the files by hand, which is much easier than finding you did something wrong and having to search through your backup tapes (and if you don't have recent enough backups, the file is lost). cp, ln, and mv offer version control via the following options:

numbered -- If the file is called foo, rename the old file to foo.~#~ where # is the newest number. Examples are: foo.~1~, bar.~23~.

simple -- Don't bother with a number, just add an extension. The default is ~. If the backup file previously exists, it is overwritten.

existing -- Make numbered backups of existing numbered backups; otherwise, use simple.

You set these options through the environment variable VERSION_CONTROL. However, you can override the environment setting by specifying the option as the command line. cp, ln, and mv also have a backup option (triggered by -b or -backup). The backup option lets you save the old copy, rather than overwrite it.

The simple bash session below shows what backups can do:

leisner@gnu$ ls
leisner@gnu$ >foo
leisner@gnu$ ls
foo
leisner@gnu$ cp -b foo bar; ls
bar  foo
leisner@gnu$ cp -b foo bar; ls
bar      bar.~1~  foo
leisner@gnu$ cp -b foo bar; ls
bar      bar.~1~  bar.~2~ foo

The options used here cause the backups to be automatically incremented. Then when you're ready, you can delete them ( rm *~). This backup scheme works somewhat like the Emacs backup (though Emacs appears to be more flexible).

You can also choose to do simple (non-numbered) backups, by specifying the extension with the -S option. For example, to use .old as the suffix:

leisner@gnu$ ls
bar  foo
leisner@gnu$ cp -b -V simple -S .old foo bar
leisner@gnu$ ls
bar      bar.old  foo

As noted above, these strategies will be familiar to people who have used the backup capability on Emacs.

The update option allows you to do use the cp command in a unique way (along with backups)., as in Figure 1. The update option won't copy foo over bar (since bar is newer than foo), but can copy bar over foo. The copy operation in Figure 1 produces a numbered backup instead of destruction (foo.~1~ was the old foo). When you use cp interactively, you typically won't use the long options; instead you would abbreviate the commands in Figure 1 to:

cp -ubv -V t

As mentioned above, instead of specifying the type of backup you want each time, you can set the type in the environment variable VERSION_CONTROL, as, for example:

export VERSION_CONTROL=numbered

and alias in your shell:

cp to 'cp -b'
mv to 'mv -b'
ln to 'ln -b'

Setting the backup type in the environment variable means you must make a conscious decision not to produce backups (this is preferable to having to make a conscious decision to do backups, because you normally discover you needed a backup when it's too late). In bash, I enter:

\cp foo /dev/null

to get the unaliased version (which doesn't do backups).

The environment variable VERSION_CONTROL is useful with other programs (when a program supports this feature and you don't know about it, you get a pleasant surprise). Two other tools I've stumbled across which support this strategy are GNU patch and GNU indent.

You may also want to define in your environment:

CP='cp -b'
MV='mv -b'
LN='mv -b'

Many configuration scripts will pull these out of your environment instead of using a default. Then, when you automatically generate a makefile, it will be correct. I have found this useful, since it means I don't have to think about creating backups. I only need to worry about backups when I want to eliminate file clutter or recover some space.

I use cp in place of the install program. While install allows permissions and ownership to be changed and files to be set setuid/setgid, I'd rather execute the command manually as root if I need to. Using cp lets me control what is happening to my system -- if something goes wrong, it's much easier to recover. Using GNU cp, in most makefiles I write

INSTALL=cp -b -u

which updates and does a backup (using VERSION_CONTROL in the environment for the default). However, this approach will not work if you use the install options for changing ownerships and permissions. (I've modified install to accept a -b option; it should be available in a future release.)

In normal mode cp and mv are destructive -- if a destination file already exists, it will be blindly overwritten (unless the file cannot be written by the user). Using non-destructive backups means you won't make many mistakes you'll be sorry for. You can alias cp and mv to include the backup option.

Making Multiple Symbolic Links

The syntax of the ln command allows you to make only one symbolic link at a time. But what if you want to make lots of symbolic links? One way to do it in sh is:

cd <destination>
source=<path of source>
for i in <file1> <file2> <file3>
do
ln -s $source/$i
done

An easier way to do it, with GNU cp, is:

cd <destination>
cp -s <source>/{file1,file2,file3}

or some variation on this, using the -s (make symbolic links) option. Note that the source can be a relative or an absolute path, and can span file systems. If you want to copy all files from $HOME/working, you can use:

cp -sv ~/working/* .

and see each link being made. The file arguments (the source) must be an absolute or a relative path name from the destination. I often find it easy to just go to the place I want to copy from and enter:

cp -s `pwd`/* ~/new-place

Distributions often have lots of files and subdirectories, arranged in a tree. Instead of physically copying the tree, you may prefer to make links to it. There are good reasons for this approach:

changes in the master are immediately reflected in the child

space is limited

the user does not need write access to the master.

To make links of a directory tree, you can use the tool lndir, available in the MIT X11 distribution. You create a shadow tree with:

lndir <source directory> <destination directory>

or (if you are already in the destination directory)

lndir <source directory>

To shadow the X11 tree from /usr/X11R6/src to ~/src/X11R6 you would enter:

mkdir ~/src/X11R6
cd ~/src/X11R6
lndir /usr/X11R6/src .

This strategy is useful when you are dealing with zipped archives on a CD-ROM. Rather than copy and unzip files in a particular subdirectory, you can shadow the parts of interest on the CD-ROM to a hard disk and then proceed as follows:

mkdir <directory>
cd <directory>
lndir /cdrom/<directory>
for i in *.zip
do
dir=$(basename $i .zip)
mkdir $dir
unzip $i -d $dir
done

I found the lndir script very handy, but slow and inflexible. It is slow because it executes a mkdir for every new directory, and an individual ln for every file to link. I wanted more speed and more flexibility, so I wrote my own version. My version of lndir performs four to five times faster, taking advantage of GNU cp's ability to do cp -s to create multiple symbolic links at a time.

There is a perl version of lndir that executes about an order of magnitude faster than mine (or 50 times faster than the standard MIT version). If I had just wanted to duplicate the functionality of lndir, I would simply have used the perl version. However, I wanted to be able to do the following:

filter files to avoid (using a regular expression)

specify directories to avoid (MIT X11 lndir just eliminated RCS, SCCS and CVS)

provide more verbose output.

Listing 1 is a bash script (lndir.sh) which duplicates the functionality of lndir but adds flexibility. Also included here are the man pages (Listing 2) for lndir.sh.

Dealing with Distributions

Many programs consist of large distributions that require placement of binaries, miscellaneous files, info files, and man pages in specific places in a file system hierarchy (the tree). A directory plan that accommodates most such requirements looks like this:

bin -- for binaries

lib -- for libraries and auxiliary files

man -- for manpages

info -- for TEXinfo info pages

src -- for source code

html -- for hypertext files

dist -- for distribution files

etc -- for other auxiliary files (could be in lib)

Using a standard tree simplifies figuring out where to put files and understanding the system.

When you install a new package of software, it is a good idea not to immediately discard the old software: what if the new version doesn't work? However, the normal procedure is to overwrite the old file with the new file, which means that if something goes wrong and you need to restore the older version, you have to resort to your backup tape.

One way around this is to bypass the standard install procedures and install by hand. I often leave several binaries of a commonly used program online, appending a version number onto each (e.g., gdb-4.9, gdb-4.12, gdb-4.13). A symbolic link from the file gdb will point to the commonly used version (e.g., gdb points to gdb-4.12). If you are adventurous, you might do an alias for gdb (e.g., alias gdb to gdb-4.13), you might have a symbolic link in a personal bin directory to the version being used (e.g., ~/bin/gdb to /usr/gnu/bin/gdb-4.13).

These are very useful strategies since they help ensure that important tools don't disappear. I keep several versions of common tools available on line. If the most current one doesn't work, I try another one.

I also like to leave my newest binaries, with symbols and source code, online, so that I can easily debug the program. However, I don't need to leave all binaries in this state, so I run strip on older binaries I don't care to debug. I compress binaries that aren't used often. This is easier than using backups: if something is online, it takes just a moment to uncompress it.

Another useful trick is to make a different directory for each substantial package (it's up to you to define what "substantial" means). For my purposes, a commonly used package with a dozen programs and man pages would count as substantional. I put each package in its own directory with a version number. Thus, when I configured GNU fileutils-3.12, I did:

mkdir /usr/gnu/fileutils-3.12
./configure --prefix=/usr/gnu/fileutils-3.12

Then I made and ran a

make; make install

to put the programs and documentation in /usr/gnu/fileutils-3.12. I have another copy of fileutils as /usr/gnu/fileutils-3.9. Most users don't have to know about the version number, so /usr/gnu/fileutils is a link to the version being used. Going one step further, it isn't even necessary to know about /usr/gnu/fileutils; in /usr/gnu/bin you can make symbolic links to all the programs:

leisner@gemini$ ls -l /usr/gnu/bin/{ls,ln,cp,mv}
lrwxrwxrwx 1 root     daemon  19 Jun  5  1994 /usr/gnu/bin/cp -> ../fileutils/bin/cp*
lrwxrwxrwx 1 root     daemon  19 Jun  5  1994 /usr/gnu/bin/ln -> ../fileutils/bin/ln*
lrwxrwxrwx 1 root     daemon  19 Jun  5  1994 /usr/gnu/bin/ls -> ../fileutils/bin/ls*
lrwxrwxrwx 1 root     daemon  19 Jun  5  1994 /usr/gnu/bin/mv -> ../fileutils/bin/mv*

and the man pages:

leisner@gemini$ ls -l /usr/gnu/man/man1/{ls,ln,cp,mv}.1
lrwxrwxrwx 1 leisner  sdsp    29 Dec  9 12:39 /usr/gnu/man/man1/cp.1 -> ../../fileutils/man/man1/cp.1
lrwxrwxrwx 1 leisner  sdsp    29 Dec  9 12:39 /usr/gnu/man/man1/ln.1 -> ../../fileutils/man/man1/ln.1
lrwxrwxrwx 1 leisner  sdsp    29 Dec  9 12:39 /usr/gnu/man/man1/ls.1 -> ../../fileutils/man/man1/ls.1
lrwxrwxrwx 1 leisner  sdsp    29 Dec  9 12:39 /usr/gnu/man/man1/mv.1 -> ../../fileutils/man/man1/mv.1

This strategy makes it easy to upgrade packages (since ../fileutils is another symlink) and reduces the burden on users, who just need to know about single paths (/usr/gnu/bin, /usr/gnu/man). Only the system administrators need to know about the physical architecture.

Figuring Out Your Paths

Now, instead of a catchall for every program you install, you have much more order and control (and more files to keep track of). How do you manage your path? A good way is to let your login shell figure this out for you. There are three PATHs which are important:

PATH -- for executables

MANPATH -- for manpages

INFOPATH -- for info files.

Listing 3 and Listing 4 show examples for csh and bash (and probably other Bourne-like shells as well). The idea is to have an array of possible paths. Each path may have a component with bin (for PATH), man (for MANPATH) and/or info (for INFOPATH). This lets you look and see which components are there and add them to your paths as appropriate. This is far better than the "kitchen sink" approach (if I define everything imaginable, eventually I'll hit what I'm looking for). This approach lets me use common login files on different domains -- my paths autoconfigure according to which components are present.

Remember to define the environment variable once when you log in. If you define it for each invocation of the shell, you will slow the system down (path hashing is expensive) and create confusion (if you change an environment variable and spawn another shell, you'll get back the old environment variable).

In the C shell example in Listing 3, notice that I set the PATH in one operation at the end. I discovered that each time you set the PATH, the shell does a search of the path, and you are better off doing this only one time.

About the Author

Marty Leisner has been programming in C and UNIX for a dozen years. You can reach him at leisner@sdsp.mc.xerox.com.