Managing Disk Space
Marty Leisner
Managing disk space efficiently entails making rational
decisions about
storage. Backups are useful for protecting against catastrophe,
but
backups aren't always very useful in daily work. Anything online is at
your fingertips and easy to work with; anything on tape is harder to
access and is therefore often ignored. Whenever possible,
it's better
to leave files you and your users may need in the future
online. This
article explores several tactics for maximizing your
disk utilization
through efficient storage techniques; it also presents
some scripts that
help implement those tactics.
The article contains a great deal of benchmark information
to help you
make logical choices. All the benchmarks were done on
a DX4/100 machine
running Linux 1.2.8. Size statistics are hardware independent; bytes are
the same on all platforms. The absolute timing information
is hardware
dependent (faster computers will run faster). The relative
times should
be more meaningful to you: what's faster on my setup
should be faster on
all setups. I do a lot of benchmarking on two distributions:
gawk-2.15.6
and gdb-4.14. Why? Because:
they're fairly large (but gdb is an order of magnitude
bigger than gawk)
they're representative of the type of files system administrators
often deal with
I had them handy
you can freely get them to compare your results with
mine
Eliminate Stale Files
Your first step should be to determine which files are
of use and which
are not. If you still have distributions of software that you've built
sitting on your system, you can check the makefiles for one of the following
flavors of clean:
clean
realclean
distclean
nuke
clobber
Most makefiles include one of the above; the choice
depends on the
flavor and desires of the makefile author. After using clean, you can
see how much space you saved and how much is still consumed. Then you
are ready for the other tactics covered here.
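For example, here is a rough way to see what a clean pass buys you across
everything under a source directory. This is only a sketch: it assumes the
distributions live under /usr/src and that each makefile has a clean target
(substitute distclean, clobber, or whatever your makefiles use):

du -sk /usr/src                      # space consumed before cleaning
for dir in /usr/src/*/
do
        (cd "$dir" && make clean)    # subshell, so the cd doesn't stick
done
du -sk /usr/src                      # space consumed after cleaning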
Identifying Files
Disks accumulate files. Often, it's impossible to tell
from the filename
what a given file may be, especially when you're dealing
with binaries
from multiple architectures on NFS-mounted filesystems.
You can use
ls(1) to help you identify what files are there. According
to the GNU
fileutils info page, here's what ls(1) can do:
`-F'
`--classify'
Append a character to each filename indicating the file type.
Also, for regular files that are executable, append `*'. The file
type indicators are `/' for directories, `@' for symbolic links,
`|' for FIFOs, `=' for sockets, and nothing for regular files.
Most versions of ls include -F, and many users use it in their
Bourne-type shell initialization files:
alias ls='ls -F'
Invoking ls -F will give you some hint as to what the
names mean.
Without identification, a directory listing looks like:
% ls
autoscan bitcount.d env foo strerror.c xdos
awk bsdlpq false libXpm.so.4.5 tee xsession
bash dired fdformat libmalloc.so.1.0 who
With this information we get:
% ls -F
autoscan* bitcount.d env* foo/ strerror.c xdos@
awk@ bsdlpq* false* libXpm.so.4.5* tee* xsession*
bash* dired* fdformat* libmalloc.so.1.0* who*
You can see immediately which files are executable,
which are symbolic
links, and which are directories.
Another important tool is the file program, which has been in existence
since Version 6 in 1975. It applies heuristics and a
table of "magic
numbers" to attempt to guess what the file is.
Sometimes it's wrong, but
it is right most of the time, and it often solves the
nagging question
of "what is that?" Running file on a variety
of machines, I found some
surprising differences.
On SunOS 4.1.3:
leisner@gnu$ /bin/file *
autoscan: executable /usr/gnu/bin/perl script
awk: symbolic link to /usr/gnu/bin/gawk
bash: sparc demand paged dynamically linked
executable not stripped
bitcount.d: ascii text
bsdlpq: sparc demand paged dynamically linked
set-uid executable not stripped
dired: commands text
env: data
false: executable shell script
fdformat: data
foo: directory
libXpm.so.4.5: data
libmalloc.so.1.0: sparc demand paged shared library executable
not stripped
strerror.c: c-shell commands
tee: data
who: data
xdos: symbolic link to /usr/bin/xdos
xsession: data
On IBM AIX Version 4:
% file *
autoscan: shell script
awk: symbolic link to /usr/gnu/bin/gawk.
bash: data or International Language text
bitcount.d: ascii text
bsdlpq: data or International Language text
dired: vax bsd demand paged executable - version 25600
env: vax bsd demand paged executable - version 25600
false: shell script - sh (default shell)
fdformat: data or International Language text
foo: directory
libXpm.so.4.5: data or International Language text
libmalloc.so.1.0: data or International Language text
strerror.c: English text
tee: data or International Language text
who: data or International Language text
xdos: symbolic link to /usr/bin/xdos.
xsession: data or International Language text
Using Solaris 2.3:
leisner@solar2$ file *
autoscan: executable /usr/gnu/bin/perl script
awk: Sun demand paged SPARC executable dynamically
linked
bash: Sun demand paged SPARC executable dynamically
linked
bitcount.d: ascii text
bsdlpq: Sun demand paged SPARC executable dynamically
linked
dired: commands text
env: data
false: executable shell script
fdformat: ELF 32-bit LSB, dynamically linked, stripped
foo: directory
libXpm.so.4.5: ELF 32-bit LSB, dynamically linked
libmalloc.so.1.0: Sun demand paged SPARC executable dynamically
linked
strerror.c: English text
tee: ELF 32-bit MSB executable SPARC Version 1,
dynamically linked, stripped
who: ELF 32-bit MSB executable SPARC Version 1,
dynamically linked, not stripped
xdos: symbolic link to /usr/bin/xdos
xsession: ELF 32-bit LSB, dynamically linked, stripped
Using the freeware file version 3.15 (started by Ian Darwin, currently
maintained by Mark Moraes and Christos Zoulas):
leisner@gnu$ file *
autoscan: a /usr/gnu/bin/perl script text
awk: symbolic link to /usr/gnu/bin/gawk
bash: sparc demand paged dynamically linked
executable not stripped
bitcount.d: ascii text
bsdlpq: setuid sparc demand paged dynamically linked
executable not stripped
dired: Linux/i386 demand-paged executable (ZMAGIC)
env: Linux/i386 demand-paged executable (ZMAGIC) not
stripped
false: Bourne Shell script text
fdformat: ELF 32-bit LSB executable i386 (386 and up)
Version 1
foo: directory
libXpm.so.4.5: ELF 32-bit LSB dynamic lib i386 (386 and up)
Version 1
libmalloc.so.1.0: sparc demand paged shared library not stripped
strerror.c: C or REXX program text
tee: ELF 32-bit MSB executable SPARC Version 1
who: ELF 32-bit MSB executable SPARC Version 1
xdos: broken symbolic link to /usr/bin/xdos
xsession: ELF 32-bit LSB executable i386 (386 and up)
Version 1
With the freeware file, all the guesses were correct
and reasonable
(since I generated the files, I can confirm the accuracy
of the
guesses). The freeware version also knows about lots of binary formats;
the vendor versions only know about the native machine format and call
all foreign formats "data." A current version of the freeware file is
available at ftp://tesla.ee.cornell.edu/pub/file-X.YY.tar.gz, where X
and YY are version numbers.
Shrink Binaries by Stripping
After a binary is installed, you may be able to shrink
it. If the
program was compiled with the -g option, the symbol
table will be very
large (much larger than the program). If the program
wasn't linked with
the -s (strip) option, there is still symbol information
available in
the binary (typically on the order of 10 percent). This residual symbol
information is enough to let you figure out which routine a core dump
occurred in (but not enough to run the debugger on it).
Table 1 shows examples of stripped programs (on Linux
1.2.8), using the
ELF tools based on gcc 2.6.3 and using -g -O. As the table shows,
stripping will reduce the size of debuggable binaries by 80 percent. If
you use strip, however, debugging and analyzing core
dumps becomes
impossible. If you want to debug, you need to be able
to regenerate the
binary. Also, to analyze core dumps, you need to be
able to regenerate
the binary exactly.
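To see the effect on one of your own binaries, strip a scratch copy and
compare sizes. A sketch (the path to gawk here is just an example; point it
at any unstripped binary you have):

cp /usr/local/bin/gawk /tmp/gawk.copy   # work on a copy, not the installed binary
ls -l /tmp/gawk.copy                    # size with symbols
strip /tmp/gawk.copy
ls -l /tmp/gawk.copy                    # size without symbols
rm /tmp/gawk.copy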
Use Shared Libraries
Most modern UNIX systems support the concept of shared
libraries. Shared
libraries are much like shared text, in the sense that only one copy needs
to be kept in core and multiple programs share that single copy. There is
minor overhead on program startup (since dynamic relocations
need to be
performed). But the binaries are much smaller on disk,
and the
executable image is more compact.
How many programs use printf? Would it help if all the
programs used the
same memory resident copy of printf? What if I want
to compile printf
with symbols for debugging? Shared libraries address
these issues.
Shared libraries offer significant space savings. Table
2 shows some
examples. To help understand the space savings, in these
examples all
the programs are compiled with -O and are stripped.
If you program with libraries and if a number of programs
use the
libraries, it may make sense to learn how to construct
shared libraries
(once constructed, they're very easy to use). Another
major advantage to
using shared libraries is that you can install updates
transparently
(applications that use the shared libraries don't have
to be recompiled
to be updated). This helps enormously in system administration.
But be
aware that the interface between the programs and the
library has to
remain the same for this to work. If there are subtle
changes, you may
be in for a head-scratching experience.
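Here is a minimal sketch of what constructing an ELF shared library looks
like with gcc. The -fPIC and -shared flags are the usual incantation; the
library name (libmisc) and source files are hypothetical, and a production
library would normally also set a soname and be registered with ldconfig:

gcc -fPIC -O -c strerror.c xmalloc.c         # position-independent objects
gcc -shared -o libmisc.so.1.0 strerror.o xmalloc.o
ln -s libmisc.so.1.0 libmisc.so              # name the link editor searches for
gcc -o myprog myprog.c -L. -lmisc            # link against the shared library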
Have One Copy of Source
If you are supporting a number of diverse machines,
it's confusing and
inefficient to have multiple copies of the source floating
around. Via
NFS, you can easily have the source code on one host,
and execute a make
on another host with another architecture. Most GNU
configure scripts
support a -srcdir option, which allows you to specify
where the source
actually resides. But you must be careful -- if you
configure within the
directory where the source is, you can't reconfigure
elsewhere: in this
case, you should do a make distclean. Some configure
scripts don't
properly support the -srcdir option; if you have problems,
try to
execute the make in the source directory.
-srcdir makes it much easier to support multiple machines
from one
source tree. You can even write-protect the sources
so they won't
change, via a chmod -R -w command. Alternatively, you
can export a
read-only NFS filesystem, so you won't accidentally
write on distributed
sources.
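For instance, to build gawk for a second architecture from a single
read-only NFS tree, run configure from a per-architecture build directory.
The paths here are hypothetical; many configure scripts will infer the
source directory from the path used to invoke them, or you can pass
-srcdir explicitly as described above:

chmod -R a-w /nfs/src/gawk-2.15.6            # write-protect the shared sources
mkdir /build/sparc
cd /build/sparc
/nfs/src/gawk-2.15.6/configure -srcdir=/nfs/src/gawk-2.15.6
make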
Some distributions come in tree form on a CD-ROM. When
a distribution is
on CD-ROM (for example, O'Reilly sells the BSD 4.4 and
X11 trees), you
can leave the source there and use a tool like lndir
(which I discussed
in the article, "File Management Tools," in
the March/April 1995 issue
of Sys Admin) to make links to a hard disk. You can
then build off the
hard disk, getting read-only files from the CD-ROM via
symbolic links.
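A sketch of the lndir approach, assuming the CD-ROM is mounted on /cdrom
and the shadow tree lives on the hard disk:

mkdir /usr/src/X11-shadow
cd /usr/src/X11-shadow
lndir /cdrom/X11             # populate the tree with symlinks back to the CD-ROM
make                         # objects land on the hard disk, sources stay read-only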
BSD 4.4 supports a layered filesystem, which takes the
place of a link
tree. You can structure a filesystem such that you have
a read-only
filesystem (normally a CD-ROM) under a hard disk. You
thus have the
appearance of a writable CD-ROM. I've never seen it,
but it sounds
clever.
File Compression
File compression is another useful tactic for keeping
information online
and minimizing disk usage. Compression reduces the size
of the file, and
uncompressing is much quicker than recovering from tape.
GNU gzip has two major advantages over compress: size and speed. When I first saw
this, I found it hard to believe, since compression
is so commonplace.
compress(1) has been standard for years and is patented
(by Unisys). But
gzip
compresses better,
uncompresses faster,
compresses more slowly,
is covered by the GNU license.
The fact that gzip compresses more slowly is not very important, since a
file only needs to be compressed once but, once compressed, may be
uncompressed many times. Table 3 shows relative performance
for gzip and
compress.
gzip has a significant advantage in size over compress.
Moreover, gzip
gives you a choice of nine different levels of compression,
with a
tradeoff of time/space. Table 4 shows time/space results
for selected
levels of gzip. Notice that all the levels of compression
are
significantly better than for compress. The default compression level, 6,
takes about half the time of maximum compression (level 9).
While the sizes at the different compression levels
in the examples in
Table 4 vary by less than 20 percent, the user time
differs by 500
percent. You trade off a large amount of processing
for a small amount
of compression. The default level (level 6) is a reasonable
tradeoff.
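You can reproduce this kind of comparison on your own data. A sketch, using
one of the tar files mentioned earlier as the guinea pig:

for level in 1 6 9
do
        echo "gzip -$level:"
        time gzip -$level -c gawk-2.15.6.tar | wc -c    # compressed size in bytes
done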
Size isn't the only important factor: how fast decompression
occurs is
equally significant. Table 5 compares compression and
decompression
speeds for gzip and compress. gzipped files are extracted
faster than
compressed files. In addition, gunzip(1) can handle
both gzipped and
compressed files. It does a quicker job of extracting
compressed files
than uncompress(1). The examples in Table 5 are run
to stdout, going
into /dev/null.
There is no performance penalty for unzipping maximally
gzipped files.
In fact, I've seen a small performance penalty for unzipping
minimally
compressed files. And there is a significant advantage in using gunzip to
deal with compressed files.
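To check decompression speed on your own files, run the extractors to
/dev/null as in Table 5. A sketch, assuming foo.Z and foo.gz hold the same
data:

time uncompress -c foo.Z  > /dev/null     # classic uncompress
time gunzip -c     foo.Z  > /dev/null     # gunzip reading a compress(1) file
time gunzip -c     foo.gz > /dev/null     # gunzip reading a gzipped file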
Compressing Trees
Compressing trees is also a good tactic for saving space,
and it's very
easy to uncompress trees when you need them. The dates
are preserved,
which is very important. Just do:
gzip -r <path> to compress
gunzip -r <path> to uncompress
If you're root, the ownership is also preserved.
As Table 6 shows, a compressed tree incurs a small space
penalty
compared to compressed tar files. But the convenience
factor is very
high, since looking through compressed files goes very
quickly if you
know where to look (and uncompressing a tree takes about
the same amount
of work as untarring/uncompressing a tree). Also, you
can browse
compressed text files easily with zmore and zcat.
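The gzip distribution's helper scripts let you look inside a compressed tree
without touching it; for example (the file names are just illustrations):

zmore gawk-2.15.6/ChangeLog.gz          # page through one compressed file
zgrep -l getopt gawk-2.15.6/*.c.gz      # find which compressed sources mention getopt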
Compressing Executables
gzip has a companion program called gzexe, which creates
a compressed
runnable script. The script is divided into two parts:
1) a control section (a small script which uncompresses the program
into /tmp and runs it)
2) the data (which is compressed, and follows the script).
gzexe should be used carefully, since every time the program is run it
uncompresses into /tmp. In addition, running multiple copies of the same
program at the same time won't share text (since each invocation makes a
separate image). An alternative may be to gzip the executable,
and
manually gunzip it when you need it. gzexe works best
when applied to
stripped binaries; applying it to unstripped binaries
often results in
larger files than you'd get by just stripping them.
Also, note that very
short runs (i.e., get version/help) take much longer,
since the
executable needs to be decompressed first. Table 7 shows
the effect of
gzexe on the size of certain executables.
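Using gzexe is a one-line affair, and it keeps the original so you can back
out. A rough example (bigprog is a placeholder name):

strip bigprog            # gzexe pays off most on an already-stripped binary
gzexe bigprog            # bigprog becomes a self-uncompressing script
ls -l bigprog bigprog~   # the untouched original is saved with a ~ suffix
gzexe -d bigprog         # undo it if the startup cost gets annoying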
Compressing man Pages and Info Files
Info readers and man page readers that can deal with
zipped information
can be very useful, since the documentation on your
system can easily
consume several megabytes. When I started to run Linux, I became aware of
the possibility of zipping this information, since the documentation is
enormous and it was my own personal hard disk being consumed. There are
copies of man and
info that deal with gzipped man pages and info files.
Most vendors' versions of man(1) won't handle this; you'll need to get an
enhanced version of man. One site for this is:
ftp://sunsite.unc.edu/pub/Linux/system/Manualpagers/man-1.4e.tar.gz
In addition to dealing with compressed and gzipped man
pages, enhanced
man has a number of useful options not found in conventional
versions.
It was begun by John Eaton (jwe@che.utexas.edu), and
has been supported
and enhanced during the last five years by a number
of maintainers. It's
definitely worth taking a look.
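Once you have a man(1) that reads gzipped pages, compressing an entire man
tree is a single pipeline. A sketch, assuming the pages live under /usr/man:

find /usr/man -type f -name '*.[0-9n]*' -print | xargs gzip -9v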
texinfo is a hypertext format which is the standard for the GNU project.
A texinfo master is created, which can deliver either a printed manual,
via TeX, or an online reference, via info (generated by a program called
makeinfo). The hypertext format is a big advantage over standard man.
If you're interested, ftp to prep.ai.mit.edu and get
texinfo-3.6.tar.gz.
The info program is included, along with other tools
to develop texinfo
documents. Be aware that many tools supply both an info
document and an
outdated man page; if the info document is much newer
than the man page,
the man page shouldn't be trusted.
The Tk tools (tkman and tkinfo) can both work with compressed files.
They use John Ousterhout's wish (a windowing shell), built with the
Tcl/Tk toolkit. If you don't already have wish, tkman
and tkinfo are
excellent examples of applications to start with once
you do install it.
Dealing with Compressed tar Archives
A standard way of passing information around UNIX systems
is through
compressed tar archives. Once an archive is compressed, it is never
necessary to uncompress it. To work with foo.tar.gz, just execute:
zcat foo.tar.gz | tar -xf -
or, if you are using GNU tar:
tar -xzf foo.tar.gz
There is a good reason to work this way. Uncompressed
tar files can be
space hogs. In fact, you may not have space for the
uncompressed tar
file and whatever it contains. This approach leaves the file compressed
and uses pipes to extract the tar file.
An even more
extreme example would be listing the first few entries
in a tar archive
so you're sure of what you have; doing this in pipes
is relatively fast,
while having to uncompress the whole archive can be
very time-consuming.
Convert Compressed Files into gzip Files
A tool called znew converts compressed files into gzipped files. If you
have a tree of compressed files to convert, run this
little pipeline:
find . -name '*.Z' | xargs znew
An example of the effect of znew is in Table 8. Notice
that converting
text from compressed format to gzip format results in
a space saving of
about one-third.
zip versus gzip
The most popular DOS compression program is zip. There
are a number of
other compression programs, which are less common (arc
and zoo) and
which I won't discuss here (except to say that there
are implementations
for all of them on UNIX). zip can use the same type
of encoding as gzip
on UNIX (along with a number of other algorithms). UNIX tends to break up
compression and archival tools and use pipes to connect them (while
DOS-based tools bundle them together). The major advantage
of a gzipped
tar archive is that it applies compression to one large
file, which
gives a smaller result (compression works better on
one large datastream
than on lots of small datastreams). zip/unzip compresses
each file
separately.
The Info-ZIP group has put together a freeware set of
zip manipulation
tools (available at ftp://ftp.uu.net/pub/archiving/zip) which run on
VMS, OS/2, MS-DOS, NT, AmigaDOS, Atari TOS, Macintosh, and almost every
known flavor of UNIX. The major advantages to using
this unzipper rather
than others are portability and consistency of user
interfaces across
platforms. If you aren't concerned about portability,
stick with
gzip/tar. But if you are, take advantage of unzip/zip.
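The per-file versus single-stream difference is easy to demonstrate on a
source tree; a quick sketch (exact sizes will vary, but the gzipped tar
archive usually wins):

zip -qr gawk.zip gawk-2.15.6                     # each file compressed separately
tar -cf - gawk-2.15.6 | gzip -9 > gawk.tar.gz    # one compressed stream
ls -l gawk.zip gawk.tar.gz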
The prune Script
A little script called prune (Listing 1) allows you
to implement the
above recommendations. It's also safe to use for removing
backup files.
Be careful when using backups which end in ~; if you
have ever done
rm * ~
instead of
rm *~
you remember the grief it caused. prune can:
act recursively
erase backup files (ending with ~)
gzip with level -9 compression
convert compressed files to gzip files
overwrite existing files
prune works with the GNU findutils and bash (available
at
prep.ai.mit.edu:/pub/gnu). I initially wrote a much
more complicated
bash script, and thought I should convert it to perl.
But I thought
about the problem, and decided to really simplify it,
using find and
xargs to run the commands. The only shortcoming is that
the tool doesn't
compare the dates between compressed and uncompressed
files with the
same basename. In practice, it's effective, since most
install scripts
leave man pages and info files uncompressed, and specifying
the
overwrite option allows you to replace a previously
compressed file with
a newer file.
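Listing 1 has the details, but the heart of the approach is nothing more
than a few find/xargs pipelines along these lines (a simplified
illustration, not the listing itself):

# prune, in miniature: $1 is the directory to sweep
find "$1" -name '*~' -print | xargs -r rm -f           # erase editor backup files
find "$1" -name '*.Z' -print | xargs -r znew           # convert compress to gzip format
find "$1" -type f ! -name '*.gz' -print | xargs -r gzip -9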
Holes
Most modern UNIX filesystems can handle sparse files efficiently -- that
is, files which contain a large amount of empty space.
An example of
this is a core dump. To get the actual disk space used
by a sparse file,
do:
leisner@gemini$ sleep 1d
^\Quit (core dumped)
leisner@gemini$ ls -l core
-rw-r--r--   1 leisner  staff    8421808 Mar 18 15:18 core
leisner@gemini$ du -s core
104     core
As you can see, the sizes are radically different. And if you move the
file to another filesystem or otherwise rewrite it (as the gzip/gunzip
round trip below does), you'll make it take its full size on disk. For example:
leisner@gemini$ sleep 1d
^\Quit (core dumped)
leisner@gemini$ ls -l core
-rw-r--r--   1 leisner  sdsp     8421808 Mar 18 15:23 core
leisner@gemini$ du -s core
104     core
leisner@gemini$ gzip core
leisner@gemini$ ls -l core.gz
-rw-r--r--   1 leisner  sdsp       15797 Mar 18 15:23 core.gz
leisner@gemini$ gunzip core
leisner@gemini$ du -s core
8240    core
Taking the holes out of a core dump file increases its disk usage (in this
example) by nearly two orders of magnitude. 8MB files can quickly use up
your disk space. Once the holes have been removed from a file, I don't know
of a standard way to put them back in, so I wrote a little program
(holify.c -- Listing 2) which can rectify this problem.
leisner@gnu$ sleep 1d
^\Quit (core dumped)
leisner@gnu$ du -s core
104 core
leisner@gnu$ gzip -c core >core.sleep.gz
leisner@gnu$ zcat core.sleep.gz | holify >core2
Wrote 49152 real bytes, 79725107 virtual bytes
leisner@gnu$ du -s core2 core
80 core2
104 core
leisner@gnu$ cmp core core2
leisner@gnu$
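GNU find can also help you spot files that are already sparse (and so
deserve careful handling), by comparing the allocated blocks with the
nominal size; a sketch:

# report files whose apparent size is more than double the space actually allocated
find . -type f -printf '%s %b %p\n' |
        awk '$1 > 2 * 512 * $2 { printf "%10d bytes in %6d blocks  %s\n", $1, $2, $3 }'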
Conclusion
Erasing files is a very drastic way of dealing with disk space shortages.
Dynamic linking has size advantages over
static linking.
Further, debugging symbols show approximately an order
of magnitude
difference (the size of the debugging information far
outweighs the size
of the program). Compression tools can also help greatly,
and it takes
much less time to compress/uncompress files than to
find an old copy on
a backup tape. The benchmarks presented here show a number of advantages
of gzip over compress; if you're still using compress, you should look
into gzip. The X11 font architecture uses compressed
font files; using
gzipped font files (which involves a change in the decompression
code)
would both increase performance and save space. All
of the above can
help you keep information available while paying a minimal
storage
price.
About the Author
Marty Leisner has a B.S.C.S. from Cornell University.
He started using
UNIX on a DEC PDP/11, and has been writing in C for
over a decade,
primarily doing real-time embedded applications.