Setting up File Systems and Partitions
John Caywood
You can treat file system maintenance as drudgery, or
you can raise
it to a new level of skill by turning it into storage
engineering.
Engineering requires making design tradeoffs based on
initial cost,
intended use, efficiency, reliability, and maintenance.
When you take
the OS vendor's defaults for file system partitioning,
then adapt
your backup strategy to what you've been given, you're
doing maintenance.
When you allocate file systems to different disks to
equalize disk
activity, to mount binary-only file systems read-only,
and to integrate
partition size decisions into your backup strategies,
you're doing
storage engineering.
This article looks at the structure of UNIX file systems
from the
perspective of tuning for intended use, then examines
the issues involved
in determining how many file systems to create, where
to put them,
and how to create them with the optimum sizes. The article
outlines
the repartitioning process, but the actual commands
and parameters
must come from your own man pages. Finally, I present
some metrics
you can apply to measure how well you've engineered
your storage.
File System and Inode Structure
Historically, UNIX files have not had names, and UNIX
directories
have not contained files. Instead, UNIX relies on the
inode.
The inode contains all system information about the
file, except its
name. A directory is merely a particular kind of file.
Like all files,
a directory starts with an inode, in this case, an inode
whose data
blocks contain a flat list of filenames and inode numbers
("serial
numbers" in POSIX terminology).
A file system has typically contained a boot block,
a super block,
a number of contiguous blocks filled with inodes, and
data blocks
in what was left over. The first available inode of
the boot partition,
usually block 2, becomes the inode whose name is "/"
and the
file system begins. The data blocks pointed to by the
first inode
list names and inodes for the other common directories:
etc,
usr, tmp, and so forth.
All partitions on each disk are laid out the same, except
that the
boot block is only used on the partition the ROM says
has the bootstrap
loader.
The Fast File System
The Berkeley Fast File System (FFS) added some complexities
to this
simple scheme, to achieve a 10X improvement in disk
throughput.
The FFS, or some variant of it, is now used by nearly
all UNIX vendors.
One of the FFS optimizations relevant to storage engineering
is the
grouping of adjacent physical blocks into larger virtual
blocks. Recognizing
that disk subsystems are most efficient when transferring
large amounts
of data, the FFS forces the disk subsystem to transfer
8 (or 4, or
16, or more) physical blocks at a time -- whether or
not
all the physical blocks are actually asked for. Hence,
a "block"
is no longer really a block; the disk subsystem might
transfer 8 rotationally-adjacent
blocks at a time, which makes much better use of the
disk's bandwidth
than would transferring just the one block that was
actually requested.
This would be inefficient if disk accesses were truly
random, but
they're not. Consider how grep, awk, sed,
etc., behave: they read files sequentially, which means
they ask for
one block, process it, then in the next millisecond
ask for the next
block. With the FFS, the next physical block is already
in memory.
A fast file system is thus composed of inodes and "blocks,"
in which a "block" contains 1, 2, or 4 "fragments."
A fragment is the smallest unit of allocation; it may
contain 1 or
more physical blocks.
Suppose you set up a file system to contain 8 physical
blocks per
"block," and 4 fragments per "block".
Assuming a physical
block size of 512 bytes, a single "block"
could then contain
4 very small files (files with 1024 or fewer bytes),
with each file
taking up 1 fragment. Or, the 8 adjacent physical blocks
could contain
4K adjacent bytes within one file, making for an efficient
data transfer.
An inefficiency occurs when a small, fragment-sized
file grows. If
the "block" is full, the existing data has
to be copied to
a "block" with 2 or more free fragments. In
practice, this
inefficiency is small compared to the enormous increase
in efficiency
gained by transferring more physical blocks at a time.
Mount Points
It is common practice to mount "/", /usr,
and maybe
/home from different partitions. In fact, it's also
possible
just to mount /dev/sd0c or /dev/dsk/c0t0d0s2 on "/"
-- UNIX can be run with just one physical disk partition
for the
whole file system.
At the opposite extreme, every directory is a potential
mount point,
and no names are magic. Root becomes / only because
it is the root
inode of the boot partition. Instead of mounting /usr
and
/home from separate partitions, you could mount /usr/lib
and /home/george/news in separate partitions, with all
other
files living in the root partition.
The description of the FFS above suggests one reason
for mounting
other directories than /: if all directories are mounted
on one file
system, you've lost the ability to tune the file system
to the way
the file system is used. Mounting multiple directories
has other advantages
as well:
directories that change infrequently can be backed up
less frequently; if you use a file-system-oriented backup
program
like dump or ufsdump, you back up by file system,
not by directory
file systems with critical files can be mounted read-only
to prevent accidental or intentional corruption
if local software and application software are kept
on a separate file system from the operating system
software, then
the local software can be unmounted during an upgrade
and remounted
afterwards, eliminating the need for restoration from
tape
smaller "chunks" of directories can be moved
to new file sytems when new disks are added to a system
inodes per file system can be reduced when you create
a partition to be used for large files
You can go beyond the traditional mount points (root,
/usr,
swap, and home). Some good candidates are
-- /local (or /usr/local) for local software;
-- /tmp, so that accounting doesn't get turned off when
someone does a large compile;
-- spooling space (/usr/spool, /var/spool), for
the same reason;
-- third-party application installation directories;
-- an anonymous ftp directory, so that an unwelcome
gift of files
in /pub/incoming doesn't prevent your users from doing
their
work;
-- /export/swap, swap space for diskless workstations,
which needs only one inode per swap file (plus one for
the root),
rather than the thousands of inodes that are the default.
You can make tradeoffs among 4 characteristics: space
utilization,
transfer speed, ease of maintenance, and security. Figure 1
summarizes
these tradeoffs. To optimize for data transfer speed,
set up a file
system with large blocks and few fragments. To optimize
for space
efficiency (that is, minimum wasted space within a partition),
choose
smaller blocks and more fragments. To use read-only
mounts to protect
critical files (like the executables the OS vendor supplies
in /usr),
to increase flexibility in your backup schedule, and
to setup firewalls
against errant processes, choose fewer, smaller file
systems. See
the sidebar, "Many versus Few Partitions,"
for a more detailed
analysis.
Size Decisions
The benefits of tailoring the size of a partition to
its use are clear.
The tough part is deciding what should go where, and
how much space
to leave for each chunk. Here are some rules of thumb:
Put /tmp and /var (or /usr/spool) into their own
partitions. If your OS puts /tmp into swap space (e.g.,
Solaris 2.x), you can leave it there or put it in a
separate physical
partition. If /tmp is mounted on the swap partition,
you lose
swap space as /tmp grows, and you lose temporary space
as
swap usage grows. If you run large programs and need
large amounts
of temp space, a separate partition for /tmp may be
preferable.
Keep OS vendor binaries in /usr in their own partition,
separate from all local binaries and third-party applications.
The vendor binaries don't grow (at least, not till the
next OS upgrade),
so you can size the partition very close to the actual
use. You can
enhance security by mounting /usr read-only; not even
root
can write into a read-only mount. Read-only mounts are
a lot of trouble
when you have to make changes -- that's why only vendor
binaries
belong in a read-only /usr.
Consider every other chunk of 50M bytes or more as
a candidate for a separate partition, where a "chunk"
is
the du -s output for a directory and all its subdirectories.
Put related packages into the same partition, making
up your own relations
as you see fit.
Use a deep -- rather than a broad -- directory
structure when you install a package. Only directories
are mount
points, and every directory is a potential mount point,
so you maximize
your mounting options when you arrange files in more
directories.
You should expect to repartition your disks at least
yearly because
file systems are dynamic, and disk needs change as patterns
of usage
change. With the price of disk storage falling monthly,
you're also
probably installing new disks more frequently than you
may have done
five years ago. Unless you buy a separate disk for each
package you
install, you'll need to move data from one partition
to another when
you install a new disk, just to make good use of the
new space.
You might consider repartitioning at your next OS upgrade.
If you
can wait that long, this is an excellent time.
How to Repartition
As a first step, lay out the new partition sizes on
paper (you might
use graph paper, letting 1 block stand for 1M byte,
or 10M, or 20M.
Sketch in proportional sizes for each disk and each
partition to help
you visualize your space needs. If you use ink for the
disk boundaries
and pencil for the partitions, you can juggle size needs
to disk capacities
with a minimum of redrawing.
Except for the root and /usr partitions, here's how
to proceed:
1. Go to single-user mode.
2. Backup all affected partitions (level 0 dump).
3. Repartition according to OS instructions. The program
is usually called format, and it requires you to fill
in the
starting and ending cylinders (or starting cylinder
and length) of
each disk partition.
4. Reboot single-user.
5. Make new file systems on all partitions affected
by the changes in step 3. (See newfs or mkfs man pages,
and tuning comments below.)
6. Mount the new file systems. Edit /etc/vfstab
(/etc/fstab) to make the new mount points permanent.
7. Restore from backups. The kernel automatically routes
the files to the correct partitions.
8. Backup again (level 0). If you don't backup level
0 after a restore, restores from later incrementals
may fail.
9. Go to multi-user mode.
Though tedious and time-consuming, these steps are not
complicated.
The only serious error you can make is to have two partitions
overlap,
but judicious use of a calculator will prevent that.
Resizing Root and /usr
Many systems won't boot, even to single-user, without
both root and
/usr partitions, so moving or changing the size of root
or
/usr is more difficult. Resizing root also means reinstalling
the boot block after you've made a new file system.
To change the size of root, you must boot from a different
device.
On some systems, you can use the OS distribution medium
(CD-ROM or
tape) for this step. For other systems, you'll have
to duplicate the
root partition on another disk, make that partition
bootable (using
installboot or a similar program), reboot from the new
disk,
then backup-repartition-restore your first root.
For most systems, you must also run installboot after
making
a new file system on root. Run installboot after restoring
the files but before making a new level 0 backup.
The steps required to move or enlarge /usr are similar.
If
you can boot from the distribution medium, do so. Otherwise,
duplicate
/usr on another partition, edit /etc/vfstab, reboot
single-user with the duplicate of /usr, backup-repartition-restore
your first /usr, re-edit vfstab, and reboot.
If you've never moved root or /usr before, you might
want
to practice on a spare system first. If you're not blessed
with a
spare system, take it step-by-step and have hardcopy
of the man pages
next to you before you start.
Block and Fragment Tuning
With an understanding of file system structure and a
knowledge
of how your file systems will be used, you can begin
to see opportunities
for tuning. The easiest time to do this is when you
make the file
system.
As an example, imagine a user with hundreds of 2Mb data
files. You
can store the data in a separate file system set up
to have, say,
16 physical blocks per "block", and maybe
only 1 or 2 fragments
per "block." With a "block" of 16
physical blocks
and 2 fragments, the minimum allocation is 8 blocks
-- 4K bytes.
That would be wasteful for a user with a 17-byte file
(4096 - 17 =
4079 wasted bytes), but it means very efficient disk
transfers if
most of the files in that file system are 2M bytes or
more.
So there's one obvious file system tuning rule: file
systems with
large files benefit from large "blocks" and
(relatively) fewer
fragments. Databases are an obvious candidate for directories
mounted
on file systems with larger blocks.
Inode Tuning
The quantity of inodes allocated to a file system can
also be adjusted
when the file system is created. For a given file system
size, say
500M bytes, a file system with 1000 files will need
fewer inodes than
a file system with 5000 files, because you need one
inode for each
file you create, plus one for each directory, plus one
more for the
mount point.
Inode blocks do not store data, just kernel information
about the
file -- they're file system overhead. In the first case
in the
paragraph above, you need one inode per 500K bytes of
data; in the
second, one inode per 100 bytes of data. If you can
fit four inodes
into one physical block (128 bytes per inode -- see
inode.h
for the exact figure), then you will need 1000 / 4 =
250 blocks
full of inodes (plus spares for directories) in the
first case, but
1250 blocks full of inodes (plus spares) in the second.
That's an
extra 500K in inodes in a 500M partition -- 10 percent,
or 5M bytes,
in file system overhead.
In practice, the differences aren't quite so dramatic.
See for yourself
on your own files systems: compare df -k to df -e
(or df to df -i on BSD systems) to see how many inodes
are unused on your file systems. Systems with BSD parentage
usually
provide a newfs command to invokes mkfs with parameters
to specify how many inodes to allocate per k-bytes of
data space.
Measures of Quality
The quality of system administration is tough to measure.
A useful
approach is to imagine what would happen if you did
a perfect job.
So, if you could set up your file systems absolutely
perfectly, here's
how they might look:
Each file system would have exactly the right number
of inodes, that is, there would be no inodes left if
the file system
were to reach 110 percent of capacity.
File systems containing mostly large files would have
large blocks and few fragments.
A file accidentally deleted by a user could be instantly
retrieved from backup tapes.
Disk activity would be equally divided among all disks.
Performance monitors would show all disks with the same
rate of data
transfer.
Static file systems would be 100% full; see Table 1
for some other targets.
By applying your knowledge of the basic file system
structure to the
needs of your users and to your own needs for system
management, you
can engineer the use of your disks to trade off speed
for more space,
or security for simplicity. The tools to do so are already
present.
You need only apply your own good judgment, then find
the time to
do the job.
About the Author
John Caywood received B.S. and M.S. degrees in computer
science from
Old Dominion University, Norfolk, Va., and he taught
computer science
there for three years. He is currently employed by InfiNet,
an
Internet access provider in Norfolk, VA. He can be reached
at caywood@infi.net.
|