Managing Disk Space

Marty Leisner

Managing disk space efficiently entails making rational decisions about storage. Backups are useful for protecting against catastrophe, but backups aren't always very useful in daily work. Anything online is at your fingertips and easy to work with, anything on tape is harder to access, and therefore is often ignored. Whenever possible, it's better to leave files you and your users may need in the future online. This article explores several tactics for maximizing your disk utilization through efficient storage techniques; it also presents some scripts that help implement those tactics.

The article contains a great deal of benchmark information to help you make logical choices. All the benchmarks were done on a DX4/100 machine running Linux 1.2.8. Size statistics are hardware independent; bytes are the same on all platforms. The absolute timing information is hardware dependent (faster computers will run faster), but the relative times should be more meaningful to you: what's faster on my setup should be faster on all setups. I do a lot of benchmarking on two distributions: gawk-2.15.6 and gdb-4.14. Why? Because:

  • they're fairly large (but gdb is an order of magnitude bigger than gawk)

  • they're representative of the kinds of files system administrators often deal with

  • I had them handy

  • you can freely get them to compare your results with mine

    Eliminate Stale Files

Your first step should be to determine which files are of use and which are not. If distributions of software you've built are still on your system, check their makefiles for one of the following flavors of clean target:

    clean
    realclean
    distclean
    nuke
    clobber

    Most makefiles include one of the above; the choice depends on the flavor and desires of the makefile author. After using clean, you can see how much space you saved and how much space is consumed. Then you are ready for other tactics covered here.
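For example, to see what a full clean buys you on a source tree you've already built (a minimal sketch; the directory name is just the gdb distribution used for the benchmarks above):

du -s gdb-4.14                    # disk usage before cleaning
(cd gdb-4.14 && make distclean)   # or clean/realclean, whichever the makefile provides
du -s gdb-4.14                    # disk usage after cleaning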

    Identifying Files

    Disks accumulate files. Often, it's impossible to tell from the filename what a given file may be, especially when you're dealing with binaries from multiple architectures on NFS-mounted filesystems. You can use ls(1) to help you identify what files are there. According to the GNU fileutils info page, here's what ls(1) can do:

    `-F'
`--classify'
    Append a character to each filename indicating the file type.
    Also, for regular files that are executable, append `*'. The file
    type indicators are `/' for directories, `@' for symbolic links,
    `|' for FIFOs, `=' for sockets,  and nothing for regular files.

Most versions of ls include -F, and many users enable it in their Bourne-type shell initialization files:

    alias ls='ls -F'

    Invoking ls -F will give you some hint as to what the names mean. Without identification, a directory listing looks like:

    % ls
    autoscan  bitcount.d  env       foo                strerror.c    xdos
    awk       bsdlpq      false     libXpm.so.4.5      tee           xsession
    bash      dired       fdformat  libmalloc.so.1.0   who

    With this information we get:

    % ls -F
    autoscan* bitcount.d  env*      foo/               strerror.c    xdos@
    awk@      bsdlpq*     false*    libXpm.so.4.5*     tee*          xsession*
    bash*     dired*      fdformat* libmalloc.so.1.0*  who*

    You can see immediately which files are executable, which are symbolic links, and which are directories.

Another important tool is the file program, which has been in existence since Version 6 in 1975. It applies heuristics and a table of "magic numbers" to attempt to guess what a file is. Sometimes it's wrong, but it is right most of the time, and it often answers the nagging question of "what is that?" Running file on a variety of machines, I found some surprising differences.

    On SunOS 4.1.3:

    leisner@gnu$ /bin/file *
    autoscan:         executable /usr/gnu/bin/perl script
    awk:              symbolic link to /usr/gnu/bin/gawk
    bash:             sparc demand paged dynamically linked
    executable not stripped
    bitcount.d:       ascii text
    bsdlpq:           sparc demand paged dynamically linked
    set-uid executable not stripped
    dired:            commands text
    env:              data
    false:            executable shell script
    fdformat:         data
    foo:              directory
    libXpm.so.4.5:    data
    libmalloc.so.1.0: sparc demand paged shared library executable
    not stripped
    strerror.c:       c-shell commands
    tee:              data
    who:              data
    xdos:             symbolic link to /usr/bin/xdos
    xsession:         data

    On IBM AIX Version 4:

    % file *
    autoscan:         shell script
    awk:              symbolic link to /usr/gnu/bin/gawk.
    bash:             data or International Language text
    bitcount.d:       ascii text
    bsdlpq:           data or International Language text
    dired:            vax bsd demand paged executable - version 25600
    env:              vax bsd demand paged executable - version 25600
    false:            shell script  - sh (default shell)
    fdformat:         data or International Language text
    foo:              directory
    libXpm.so.4.5:    data or International Language text
    libmalloc.so.1.0: data or International Language text
    strerror.c:       English text
    tee:              data or International Language text
    who:              data or International Language text
    xdos:             symbolic link to /usr/bin/xdos.
    xsession:         data or International Language text

    Using Solaris 2.3:

    leisner@solar2$ file *
    autoscan:         executable /usr/gnu/bin/perl script
    awk:              Sun demand paged SPARC executable dynamically
    linked
    bash:             Sun demand paged SPARC executable dynamically
    linked
    bitcount.d:       ascii text
    bsdlpq:           Sun demand paged SPARC executable dynamically
    linked
    dired:            commands text
    env:              data
    false:            executable shell script
    fdformat:         ELF 32-bit LSB, dynamically linked, stripped
    foo:              directory
    libXpm.so.4.5:    ELF 32-bit LSB, dynamically linked
    libmalloc.so.1.0: Sun demand paged SPARC executable dynamically
    linked
    strerror.c:       English text
    tee:              ELF 32-bit MSB executable SPARC Version 1,
    dynamically linked, stripped
    who:              ELF 32-bit MSB executable SPARC Version 1,
    dynamically linked, not stripped
    xdos:             symbolic link to /usr/bin/xdos
    xsession:         ELF 32-bit LSB, dynamically linked, stripped

    Using the freeware file version 3.15 (started by Ian Darwin, currently maintained by Mark Moraes and Christos Zoulas):

    leisner@gnu$ file *
    autoscan:         a /usr/gnu/bin/perl script text
    awk:              symbolic link to /usr/gnu/bin/gawk
    bash:             sparc demand paged dynamically linked
    executable not stripped
    bitcount.d:       ascii text
    bsdlpq:           setuid sparc demand paged dynamically linked
    executable not stripped
    dired:            Linux/i386 demand-paged executable (ZMAGIC)
    env:              Linux/i386 demand-paged executable (ZMAGIC) not
    stripped
    false:            Bourne Shell script text
    fdformat:         ELF 32-bit LSB executable i386 (386 and up)
    Version 1
    foo:              directory
    libXpm.so.4.5:    ELF 32-bit LSB dynamic lib i386 (386 and up)
    Version 1
    libmalloc.so.1.0: sparc demand paged shared library not stripped
    strerror.c:       C or REXX program text
    tee:              ELF 32-bit MSB executable SPARC Version 1
    who:              ELF 32-bit MSB executable SPARC Version 1
    xdos:             broken symbolic link to /usr/bin/xdos
    xsession:         ELF 32-bit LSB executable i386 (386 and up)
    Version 1

With the freeware file, all the guesses were correct and reasonable (since I generated the files, I can confirm the accuracy of the guesses). The freeware version also knows about lots of binary formats; the vendor versions of file only know about the native machine format and call all foreign formats "data." A current version of the freeware file is available at ftp://tesla.ee.cornell.edu/pub/file-X.YY.tar.gz, where X and YY are version numbers.

    Shrink Binaries by Stripping

After a binary is installed, you may be able to shrink it. If the program was compiled with the -g option, the symbol table will be very large (much larger than the program itself). If the program wasn't linked with the -s (strip) option, there is still symbol information in the binary (typically on the order of 10 percent of its size). That residual symbol information is enough to let you figure out which routine a core dump occurred in, but not enough to run the debugger at the source level.

Table 1 shows examples of stripped programs (on Linux 1.2.8), using the ELF tools based on gcc 2.6.3 and compiling with -g -O. As the table shows, stripping will reduce the size of debuggable binaries by about 80 percent. If you use strip, however, debugging and analyzing core dumps becomes impossible. If you want to debug, you need to be able to regenerate the binary; to analyze core dumps, you need to be able to regenerate the binary exactly.
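As a minimal sketch (the binary name is just an example; strip modifies the file in place):

ls -l ./gawk     # built with -g, so the symbol table dominates the file
strip ./gawk     # discard symbol and debugging information
ls -l ./gawk     # typically around 80 percent smaller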

    Use Shared Libraries

Most modern UNIX systems support the concept of shared libraries. Shared libraries are much like shared text, in the sense that only one copy needs to be kept in core and multiple programs share that single copy. There is minor overhead at program startup (since dynamic relocations need to be performed), but the binaries are much smaller on disk, and the executable image is more compact.

    How many programs use printf? Would it help if all the programs used the same memory resident copy of printf? What if I want to compile printf with symbols for debugging? Shared libraries address these issues.

    Shared libraries offer significant space savings. Table 2 shows some examples. To help understand the space savings, in these examples all the programs are compiled with -O and are stripped.
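To check whether an existing binary is dynamically linked, and which shared libraries it pulls in, most systems provide ldd (the path here is just an example, and the output format varies by platform):

ldd /bin/bash    # lists the shared libraries the binary will load at run time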

    If you program with libraries and if a number of programs use the libraries, it may make sense to learn how to construct shared libraries (once constructed, they're very easy to use). Another major advantage to using shared libraries is that you can install updates transparently (applications that use the shared libraries don't have to be recompiled to be updated). This helps enormously in system administration. But be aware that the interface between the programs and the library has to remain the same for this to work. If there are subtle changes, you may be in for a head-scratching experience.
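Here is a rough ELF-style sketch of building and using a shared library; the source and library names are hypothetical, and the exact flags vary between compilers and platforms:

gcc -fPIC -O -c strerror.c bitcount.c                  # compile position-independent objects
gcc -shared -o libmisc.so.1.0 strerror.o bitcount.o    # link them into a shared library
ln -s libmisc.so.1.0 libmisc.so                        # unversioned name so -lmisc resolves
gcc -O -o myprog myprog.c -L. -lmisc                   # link an application against it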

    Have One Copy of Source

    If you are supporting a number of diverse machines, it's confusing and inefficient to have multiple copies of the source floating around. Via NFS, you can easily have the source code on one host, and execute a make on another host with another architecture. Most GNU configure scripts support a -srcdir option, which allows you to specify where the source actually resides. But you must be careful -- if you configure within the directory where the source is, you can't reconfigure elsewhere: in this case, you should do a make distclean. Some configure scripts don't properly support the -srcdir option; if you have problems, try to execute the make in the source directory.

    -srcdir makes it much easier to support multiple machines from one source tree. You can even write-protect the sources so they won't change, via a chmod -R -w command. Alternatively, you can export a read-only NFS filesystem, so you won't accidentally write on distributed sources.
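A sketch of such a build, assuming an autoconf-generated configure script and hypothetical paths; configure records where the sources live, and the objects land in the writable build directory:

mkdir -p /build/sparc/gawk && cd /build/sparc/gawk
/src/gawk-2.15.6/configure        # or pass the source location explicitly with --srcdir
make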

    Some distributions come in tree form on a CD-ROM. When a distribution is on CD-ROM (for example, O'Reilly sells the BSD 4.4 and X11 trees), you can leave the source there and use a tool like lndir (which I discussed in the article, "File Management Tools," in the March/April 1995 issue of Sys Admin) to make links to a hard disk. You can then build off the hard disk, getting read-only files from the CD-ROM via symbolic links.
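A sketch of the lndir approach (the mount point and directory names are hypothetical):

mkdir /usr/src/xc-build && cd /usr/src/xc-build
lndir /cdrom/xc        # fill this tree with symlinks pointing back at the CD-ROM

Builds then run in the link tree, writing objects to the hard disk while reading sources from the CD-ROM.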

    BSD 4.4 supports a layered filesystem, which takes the place of a link tree. You can structure a filesystem such that you have a read-only filesystem (normally a CD-ROM) under a hard disk. You thus have the appearance of a writable CD-ROM. I've never seen it, but it sounds clever.

    File Compression

File compression is another useful tactic for keeping information online and minimizing disk usage. Compression reduces the size of the file, and uncompressing is much quicker than recovering from tape. GNU gzip has two major advantages over compress: size and speed. When I first saw this, I found it hard to believe, since compression is so commonplace and compress(1) has been standard for years (the LZW algorithm it uses is patented by Unisys). But gzip

  • compresses better,

  • uncompresses faster,

  • compresses more slowly,

  • is covered by the GNU license.

The fact that gzip compresses more slowly is not very important, since a file is typically compressed only once but may be uncompressed many times. Table 3 shows relative performance for gzip and compress.

gzip has a significant advantage in size over compress. Moreover, gzip gives you a choice of nine levels of compression, trading time against space. Table 4 shows time/space results for selected levels of gzip. Notice that all the levels compress significantly better than compress does. The default compression, level 6, takes about half the time of maximum compression (level 9).

    While the sizes at the different compression levels in the examples in Table 4 vary by less than 20 percent, the user time differs by 500 percent. You trade off a large amount of processing for a small amount of compression. The default level (level 6) is a reasonable tradeoff.
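To reproduce this kind of comparison on your own data (the archive name is a placeholder):

time gzip -1 -c gdb-4.14.tar > gdb.1.gz    # fastest, least compression
time gzip -6 -c gdb-4.14.tar > gdb.6.gz    # the default
time gzip -9 -c gdb-4.14.tar > gdb.9.gz    # smallest output, slowest
ls -l gdb.?.gz                             # compare the resulting sizes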

    Size isn't the only important factor: how fast decompression occurs is equally significant. Table 5 compares compression and decompression speeds for gzip and compress. gzipped files are extracted faster than compressed files. In addition, gunzip(1) can handle both gzipped and compressed files. It does a quicker job of extracting compressed files than uncompress(1). The examples in Table 5 are run to stdout, going into /dev/null.
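The decompression timings are just as easy to reproduce (again, the archive names are placeholders, and zcat here is gzip's zcat):

time zcat gdb-4.14.tar.gz > /dev/null           # gzipped archive
time zcat gdb-4.14.tar.Z > /dev/null            # gzip's zcat also reads compress(1) files
time uncompress -c gdb-4.14.tar.Z > /dev/null   # traditional uncompress, for comparison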

There is no performance penalty for unzipping maximally gzipped files. In fact, I've seen a small performance penalty for unzipping minimally compressed files. And there is a significant advantage in using gunzip to deal with compressed files.

    Compressing Trees

    Compressing trees is also a good tactic for saving space, and it's very easy to uncompress trees when you need them. The dates are preserved, which is very important. Just do:

    gzip -r <path> to compress
gunzip -r <path> to uncompress

    If you're root, the ownership is also preserved.

    As Table 6 shows, a compressed tree incurs a small space penalty compared to compressed tar files. But the convenience factor is very high, since looking through compressed files goes very quickly if you know where to look (and uncompressing a tree takes about the same amount of work as untarring/uncompressing a tree). Also, you can browse compressed text files easily with zmore and zcat.
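For example, to poke around inside a compressed tree without uncompressing anything on disk (the file names are hypothetical):

zmore doc/gawk.texi.gz                 # page through a compressed file
zcat ChangeLog.gz | grep -i getline    # search a compressed file through a pipe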

    Compressing Executables

    gzip has a companion program called gzexe, which creates a compressed runnable script. The script is divided into two parts:

1) a control section (a small script that uncompresses the program into /tmp and runs it)

    2) the data (which is compressed, and follows the script).

gzexe should be used carefully, since every time the program is run it is uncompressed into /tmp. In addition, multiple simultaneous invocations of the same program won't share text (since each invocation creates a separate image). An alternative may be to gzip the executable and manually gunzip it when you need it. gzexe works best when applied to stripped binaries; applying it to unstripped binaries often results in larger files than you'd get by just stripping them. Also, note that very short runs (for example, just printing the version or help message) take much longer, since the executable needs to be decompressed first. Table 7 shows the effect of gzexe on the size of certain executables.
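Usage is straightforward (the path is just an example); gzexe keeps the original beside the compressed version until you remove it:

strip /usr/local/bin/gdb     # strip first; gzexe works best on stripped binaries
gzexe /usr/local/bin/gdb     # replace the binary with a self-uncompressing script
ls -l /usr/local/bin/gdb*    # the original is saved as gdb~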

    Compressing man Pages and Info Files

Info readers and man page readers that can deal with gzipped files can be very useful, since the documentation on your system can easily consume several megabytes. When I started to run Linux, I became aware of the possibility of compressing this information, since the documentation is enormous and it was going onto my personal hard disk. There are versions of man and info that deal with gzipped man pages and info files. Most vendors' versions of man(1) won't handle this; you'll need to get an enhanced version of man. One site for this is:

ftp://sunsite.unc.edu/pub/Linux/system/Manualpagers/man-1.4e.tar.gz

    In addition to dealing with compressed and gzipped man pages, enhanced man has a number of useful options not found in conventional versions. It was begun by John Eaton (jwe@che.utexas.edu), and has been supported and enhanced during the last five years by a number of maintainers. It's definitely worth taking a look.
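With such a reader installed, compressing the pages is a one-liner per directory (the path is an example; similar commands apply to the other man* and cat* subdirectories):

gzip -9 /usr/local/man/man1/*.1    # an enhanced man finds and unpacks foo.1.gz itself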

texinfo is a hypertext format which is the standard for the GNU project. A texinfo master is created, which can deliver either a printed manual, via TeX, or an online reference, via info (generated by a program called makeinfo). The hypertext format is a big advantage over standard man. If you're interested, ftp to prep.ai.mit.edu and get texinfo-3.6.tar.gz. The info program is included, along with other tools to develop texinfo documents. Be aware that many tools supply both an info document and an outdated man page; if the info document is much newer than the man page, the man page shouldn't be trusted.

The Tk tools (tkman and tkinfo) can both work with compressed files. They use John Ousterhout's wish (a windowing shell), built with the Tcl/Tk toolkit. If you don't already have wish, tkman and tkinfo are excellent examples of applications to start with once you do install it.

    Dealing with Compressed tar Archives

A standard way of passing information around UNIX systems is as compressed tar archives. Once an archive is compressed, it is never necessary to keep an uncompressed copy on disk. To work on foo.tar.gz, just execute:

    zcat foo.tar.gz | tar -xf -

    or if you are using GNU tar

    tar -xzf foo.tar.gz

There is a good reason to work this way: uncompressed tar files can be space hogs. In fact, you may not have space for both the uncompressed tar file and whatever it contains. This approach leaves the archive compressed and uses pipes to extract from it. An even more extreme example is listing the first few entries in a tar archive to make sure of what you have; doing this through pipes is relatively fast, while uncompressing the whole archive first can be very time-consuming.
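For example, to peek at the first few entries of an archive without extracting anything (with GNU tar, tar -tzf foo.tar.gz does the same thing):

zcat foo.tar.gz | tar -tvf - | head    # list the beginning of the archive's table of contents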

    Convert Compressed Files into gzip Files

A tool called znew converts compressed files into gzipped files. If you have a tree of compressed files to convert, run this little pipeline:

    find . -name '*.Z' | xargs znew

    An example of the effect of znew is in Table 8. Notice that converting text from compressed format to gzip format results in a space saving of about one-third.

    zip versus gzip

The most popular DOS compression program is zip. There are a number of other compression programs that are less common (arc and zoo) and that I won't discuss here (except to say that there are UNIX implementations of all of them). zip can use the same type of encoding as gzip on UNIX (along with a number of other algorithms). UNIX tends to break up compression and archival tools and use pipes to connect them (while DOS-based tools bundle them together). The major advantage of a gzipped tar archive is that compression is applied to one large file, which gives a smaller result (compression works better on one large datastream than on lots of small datastreams); zip/unzip compresses each file separately.
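The difference is easy to see on a directory tree (foo here is a placeholder):

zip -qr foo.zip foo                     # each file is compressed separately
tar -cf - foo | gzip -9 > foo.tar.gz    # one compressed stream over the whole tree
ls -l foo.zip foo.tar.gz                # the tar.gz is usually noticeably smaller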

The Info-ZIP group has put together a freeware set of zip manipulation tools (available at ftp://ftp.uu.net/pub/archiving/zip) which run on VMS, OS/2, MS-DOS, NT, AmigaDOS, Atari TOS, Macintosh, and almost every known flavor of UNIX. The major advantages of this unzipper over others are portability and consistency of user interface across platforms. If you aren't concerned about portability, stick with gzip/tar. But if you are, take advantage of zip/unzip.

    The prune Script

A little script called prune (Listing 1) allows you to implement the above recommendations. It's also safe to use for removing backup files whose names end in ~; if you have ever done

    rm * ~

    instead of

    rm *~

    you remember the grief it caused. prune can:

  • act recursively

  • erase backup files (ending with ~)

  • gzip with level -9 compression

  • convert compressed files to gzip files

  • overwrite existing files

    prune works with the GNU findutils and bash (available at prep.ai.mit.edu:/pub/gnu). I initially wrote a much more complicated bash script, and thought I should convert it to perl. But I thought about the problem, and decided to really simplify it, using find and xargs to run the commands. The only shortcoming is that the tool doesn't compare the dates between compressed and uncompressed files with the same basename. In practice, it's effective, since most install scripts leave man pages and info files uncompressed, and specifying the overwrite option allows you to replace a previously compressed file with a newer file.
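A much-simplified sketch of that find/xargs approach follows; it is not the real Listing 1 (which also handles the overwrite option and is more careful about what it touches), just an illustration of the idea:

#!/bin/sh
# prune-sketch: simplified illustration of prune's find/xargs approach
dir=${1:-.}
find "$dir" -name '*~' -print | xargs -r rm -f     # erase editor backup files
find "$dir" -name '*.Z' -print | xargs -r znew     # convert compress(1) files to gzip
find "$dir" -type f ! -name '*.gz' ! -name '*.Z' ! -name '*~' -print | xargs -r gzip -9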

    Holes

Most modern UNIX filesystems can handle sparse files efficiently, that is, files that contain a large amount of empty space. An example of this is a core dump. To get the actual disk space used by a sparse file, compare the size reported by ls -l with the blocks reported by du:

    leisner@gemini$ sleep 1d
    ^\Quit (core dumped)
    bash2 leisner@gemini$ ls -l core
    
-rw-r--r--   1 leisner  staff     8421808 Mar 18 15:18 core
    bash2 leisner@gemini$ du -s core
    104     core

As you can see, the sizes are radically different. And if you copy the file, or compress and then uncompress it, the holes are filled in and the file takes up its full size. For example:

    leisner@gemini$ sleep 1d
^\Quit (core dumped)
    leisner@gemini$ ls -l core
-rw-r--r--  1 leisner  sdsp  8421808 Mar 18 15:23 core
    leisner@gemini$ du -s core
    104     core
    leisner@gemini$ gzip core
    leisner@gemini$ ls -l core.gz
-rw-r--r--  1 leisner  sdsp  15797 Mar 18 15:23 core.gz
    leisner@gemini$ gunzip core
    leisner@gemini$ du -s core
    8240    core

Taking the holes out of a core dump file increases its disk usage (in this example) by nearly two orders of magnitude, and 8MB files can quickly use up your disk space. Once the holes have been removed from a file, I don't know of a standard way to put them back in, so I wrote a little program (holify.c -- Listing 2) which can rectify this problem.

    leisner@gnu$ sleep 1d
    ^\Quit (core dumped)
    leisner@gnu$ du -s core
    104     core
    leisner@gnu$ gzip -c core >core.sleep.gz
    leisner@gnu$ zcat core.sleep.gz | holify >core2
    Wrote 49152 real bytes,  79725107 virtual bytes
    leisner@gnu$ du -s core2 core
    80      core2
    104     core
    leisner@gnu$ cmp core core2
    leisner@gnu$

    Conclusion

Erasing files is a very drastic way of dealing with disk space shortages. Dynamic linking has size advantages over static linking, and debugging symbols make roughly an order of magnitude of difference (the size of the debugging information far outweighs the size of the program). Compression tools can also help greatly, and it takes much less time to compress and uncompress files than to find an old copy on a backup tape. The benchmarks presented here show a number of advantages of gzip over compress; if you're still using compress, you should look into gzip. The X11 font architecture uses compressed font files; using gzipped font files (which would involve a change in the decompression code) would both increase performance and save space. All of the above can help you keep information available while paying a minimal storage price.

    About the Author

Marty Leisner has a B.S.C.S. from Cornell University. He started using UNIX on a DEC PDP-11, and has been writing in C for over a decade, primarily doing real-time embedded applications.


     


