Cover V03, I04

Setting up File Systems and Partitions

John Caywood

You can treat file system maintenance as drudgery, or you can raise it to a new level of skill by turning it into storage engineering.

Engineering requires making design tradeoffs based on initial cost, intended use, efficiency, reliability, and maintenance. When you take the OS vendor's defaults for file system partitioning, then adapt your backup strategy to what you've been given, you're doing maintenance. When you allocate file systems to different disks to equalize disk activity, mount binary-only file systems read-only, and fold partition size decisions into your backup strategy, you're doing storage engineering.

This article looks at the structure of UNIX file systems from the perspective of tuning for intended use, then examines the issues involved in determining how many file systems to create, where to put them, and how to create them with the optimum sizes. The article outlines the repartitioning process, but the actual commands and parameters must come from your own man pages. Finally, I present some metrics you can apply to measure how well you've engineered your storage.

File System and Inode Structure

Historically, UNIX files have not had names, and UNIX directories have not contained files. Instead, UNIX relies on the inode.

The inode contains all system information about the file, except its name. A directory is merely a particular kind of file. Like all files, a directory starts with an inode, in this case, an inode whose data blocks contain a flat list of filenames and inode numbers ("serial numbers" in POSIX terminology).
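
You can see the name/inode split for yourself. A throwaway demo (the filenames here are illustrative): a hard link gives one inode a second name, and ls -i shows that both names point at the same inode number.

```shell
cd "$(mktemp -d)"                          # work in a scratch directory
touch report
ln report report.bak                       # hard link: a second name, same inode
ls -i report report.bak                    # both names print the same inode number
ino1=$(ls -i report | awk '{print $1}')
ino2=$(ls -i report.bak | awk '{print $1}')
[ "$ino1" = "$ino2" ] && echo "one inode, two names"
```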

A file system has typically contained a boot block, a super block, a number of contiguous blocks filled with inodes, and data blocks in what was left over. The first available inode of the boot partition, usually inode 2, becomes the inode for the directory named "/", and the file system begins there. The data blocks pointed to by that first inode list names and inode numbers for the other common directories: etc, usr, tmp, and so forth.

All partitions on each disk are laid out the same, except that the boot block is only used on the partition the ROM says has the bootstrap loader.

The Fast File System

The Berkeley Fast File System (FFS) added some complexities to this simple scheme, to achieve a 10X improvement in disk throughput. The FFS, or some variant of it, is now used by nearly all UNIX vendors.

One of the FFS optimizations relevant to storage engineering is the grouping of adjacent physical blocks into larger virtual blocks. Recognizing that disk subsystems are most efficient when transferring large amounts of data, the FFS forces the disk subsystem to transfer 8 (or 4, or 16, or more) physical blocks at a time -- whether or not all of them were actually asked for. Hence, a "block" is no longer really a block; the disk subsystem might transfer 8 rotationally adjacent blocks at once, which makes much better use of the disk's bandwidth than transferring just the one block that was requested.

This would be inefficient if disk accesses were truly random, but they're not. Consider how grep, awk, sed, etc., behave: they read files sequentially, which means they ask for one block, process it, then in the next millisecond ask for the next block. With the FFS, the next physical block is already in memory.

A fast file system is thus composed of inodes and "blocks," in which a "block" contains 1, 2, or 4 "fragments." A fragment is the smallest unit of allocation; it may contain 1 or more physical blocks.

Suppose you set up a file system to contain 8 physical blocks per "block," and 4 fragments per "block". Assuming a physical block size of 512 bytes, a single "block" could then contain 4 very small files (files with 1024 or fewer bytes), with each file taking up 1 fragment. Or, the 8 adjacent physical blocks could contain 4K adjacent bytes within one file, making for an efficient data transfer.

An inefficiency occurs when a small, fragment-sized file grows. If the "block" is full, the existing data has to be copied to a "block" with 2 or more free fragments. In practice, this inefficiency is small compared to the enormous increase in efficiency gained by transferring more physical blocks at a time.
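
The arithmetic behind these sizes is worth pinning down. A quick sketch, assuming the 512-byte physical blocks used above:

```shell
phys=512               # bytes per physical block
per_block=8            # physical blocks per FFS "block"
frags=4                # fragments per "block"
block_size=$((phys * per_block))       # bytes per "block"
frag_size=$((block_size / frags))      # smallest unit of allocation
echo "block=${block_size} fragment=${frag_size}"
```

So each "block" moves 4K bytes per transfer, while a tiny file still costs only one 1024-byte fragment.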

Mount Points

It is common practice to mount "/", /usr, and maybe /home from different partitions. In fact, it's also possible just to mount /dev/sd0c or /dev/dsk/c0t0d0s2 on "/" -- UNIX can be run with just one physical disk partition for the whole file system.

At the opposite extreme, every directory is a potential mount point, and no names are magic. Root becomes / only because it is the root inode of the boot partition. Instead of mounting /usr and /home from separate partitions, you could mount /usr/lib and /home/george/news in separate partitions, with all other files living in the root partition.

The description of the FFS above suggests one reason for mounting directories other than /: if every directory lives on one file system, you lose the ability to tune each file system to the way it is used. Mounting multiple directories has other advantages as well:

  • directories that change infrequently can be backed up less frequently; if you use a file-system-oriented backup program like dump or ufsdump, you back up by file system, not by directory

  • file systems with critical files can be mounted read-only to prevent accidental or intentional corruption

  • if local software and application software are kept on a separate file system from the operating system software, then the local software can be unmounted during an upgrade and remounted afterwards, eliminating the need for restoration from tape

  • smaller "chunks" of directories can be moved to new file systems when new disks are added to a system

  • inodes per file system can be reduced when you create a partition to be used for large files

    You can go beyond the traditional mount points (root, /usr, swap, and home). Some good candidates are

    -- /local (or /usr/local) for local software;

    -- /tmp, so that accounting doesn't get turned off when someone does a large compile;

    -- spooling space (/usr/spool, /var/spool), for the same reason;

    -- third-party application installation directories;

    -- an anonymous ftp directory, so that an unwelcome gift of files in /pub/incoming doesn't prevent your users from doing their work;

    -- /export/swap, swap space for diskless workstations, which needs only one inode per swap file (plus one for the root), rather than the thousands of inodes that are the default.
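
    Put together, a mount table along these lines captures several of the points above. This is a sketch only: the device names are made up, and whether your system uses /etc/fstab or /etc/vfstab (and its exact column layout) varies by vendor.

```
/dev/sd0a  /             4.2  rw  1 1
/dev/sd0g  /usr          4.2  ro  1 2   # vendor binaries only, read-only
/dev/sd1a  /local        4.2  rw  1 3   # local and third-party software
/dev/sd1g  /var/spool    4.2  rw  1 4   # spooling, walled off from /
/dev/sd2a  /export/swap  4.2  rw  1 5   # built with very few inodes
```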

    You can make tradeoffs among four characteristics: space utilization, transfer speed, ease of maintenance, and security. Figure 1 summarizes these tradeoffs. To optimize for data transfer speed, set up a file system with large blocks and few fragments. To optimize for space efficiency (that is, minimum wasted space within a partition), choose smaller blocks and more fragments. To use read-only mounts to protect critical files (like the executables the OS vendor supplies in /usr), to increase flexibility in your backup schedule, and to set up firewalls against errant processes, choose more, smaller file systems. See the sidebar, "Many versus Few Partitions," for a more detailed analysis.

Size Decisions

    The benefits of tailoring the size of a partition to its use are clear. The tough part is deciding what should go where, and how much space to leave for each chunk. Here are some rules of thumb:

  • Put /tmp and /var (or /usr/spool) into their own partitions. If your OS puts /tmp into swap space (e.g., Solaris 2.x), you can leave it there or put it in a separate physical partition. If /tmp is mounted on the swap partition, you lose swap space as /tmp grows, and you lose temporary space as swap usage grows. If you run large programs and need large amounts of temp space, a separate partition for /tmp may be preferable.

  • Keep OS vendor binaries in /usr in their own partition, separate from all local binaries and third-party applications. The vendor binaries don't grow (at least, not till the next OS upgrade), so you can size the partition very close to the actual use. You can enhance security by mounting /usr read-only; not even root can write into a read-only mount. Read-only mounts are a lot of trouble when you have to make changes -- that's why only vendor binaries belong in a read-only /usr.

  • Consider every other chunk of 50M bytes or more as a candidate for a separate partition, where a "chunk" is the du -s output for a directory and all its subdirectories. Put related packages into the same partition, making up your own relations as you see fit.

  • Use a deep -- rather than a broad -- directory structure when you install a package. Only directories are mount points, and every directory is a potential mount point, so you maximize your mounting options when you arrange files in more directories.

    You should expect to repartition your disks at least yearly because file systems are dynamic, and disk needs change as patterns of usage change. With the price of disk storage falling monthly, you're also probably installing new disks more frequently than you may have done five years ago. Unless you buy a separate disk for each package you install, you'll need to move data from one partition to another when you install a new disk, just to make good use of the new space.
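
    Finding those 50M-byte chunks is a one-liner with du. The sketch below builds a synthetic directory tree so the numbers are reproducible; against a real system you would point du at /usr, /opt, or wherever your packages live, with the threshold set to 51200 (50M bytes in K bytes).

```shell
d=$(mktemp -d)
mkdir -p "$d/pkgA" "$d/pkgB"
dd if=/dev/zero of="$d/pkgA/data" bs=1024 count=200 2>/dev/null   # ~200K bytes
dd if=/dev/zero of="$d/pkgB/data" bs=1024 count=10  2>/dev/null   # ~10K bytes
# Flag any subtree at or above the threshold (100K here; use 51200 for 50M).
out=$(du -sk "$d"/* | awk '$1 >= 100 { print $2 }')
echo "$out"
```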

    Your next OS upgrade is an excellent time to repartition, if you can wait that long.

How to Repartition

    As a first step, lay out the new partition sizes on paper (you might use graph paper, letting 1 block stand for 1M byte, or 10M, or 20M). Sketch in proportional sizes for each disk and each partition to help you visualize your space needs. If you use ink for the disk boundaries and pencil for the partitions, you can juggle size needs to disk capacities with a minimum of redrawing.

    Except for the root and /usr partitions, here's how to proceed:

    1. Go to single-user mode.

    2. Back up all affected partitions (level 0 dump).

    3. Repartition according to OS instructions. The program is usually called format, and it requires you to fill in the starting and ending cylinders (or starting cylinder and length) of each disk partition.

    4. Reboot single-user.

    5. Make new file systems on all partitions affected by the changes in step 3. (See newfs or mkfs man pages, and tuning comments below.)

    6. Mount the new file systems. Edit /etc/vfstab (/etc/fstab) to make the new mount points permanent.

    7. Restore from backups. With the new file systems mounted, the restored files land in the correct partitions.

    8. Back up again (level 0). If you don't make a new level 0 backup after a restore, restores from later incrementals may fail.

    9. Go to multi-user mode.

    Though tedious and time-consuming, these steps are not complicated. The only serious error you can make is to have two partitions overlap, but judicious use of a calculator will prevent that.
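
    That overlap check is easy to script as well. A sketch, with made-up cylinder numbers in the form "partition first-cylinder last-cylinder"; leave the whole-disk partition (c, or s2) out of the list, since it overlaps everything by design.

```shell
# Proposed partition table: name, first cylinder, last cylinder (inclusive).
table='a 0 100
g 101 800
h 801 1500'
echo "$table" | sort -n -k2 | awk '
    NR > 1 && $2 <= prev_end { print "overlap: " prev " and " $1; bad = 1 }
    { prev = $1; prev_end = $3 }
    END { exit bad }' && echo "no overlaps"
```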

Resizing Root and /usr

    Many systems won't boot, even to single-user, without both root and /usr partitions, so moving or changing the size of root or /usr is more difficult. Resizing root also means reinstalling the boot block after you've made a new file system.

    To change the size of root, you must boot from a different device. On some systems, you can use the OS distribution medium (CD-ROM or tape) for this step. For other systems, you'll have to duplicate the root partition on another disk, make that partition bootable (using installboot or a similar program), reboot from the new disk, then backup-repartition-restore your first root.

    For most systems, you must also run installboot after making a new file system on root. Run installboot after restoring the files but before making a new level 0 backup.

    The steps required to move or enlarge /usr are similar. If you can boot from the distribution medium, do so. Otherwise, duplicate /usr on another partition, edit /etc/vfstab, reboot single-user with the duplicate of /usr, backup-repartition-restore your first /usr, re-edit vfstab, and reboot.

    If you've never moved root or /usr before, you might want to practice on a spare system first. If you're not blessed with a spare system, take it step-by-step and have hardcopy of the man pages next to you before you start.

Block and Fragment Tuning

    With an understanding of file system structure and a knowledge of how your file systems will be used, you can begin to see opportunities for tuning. The easiest time to do this is when you make the file system.

    As an example, imagine a user with hundreds of 2M-byte data files. You can store the data in a separate file system set up to have, say, 16 physical blocks per "block", and maybe only 1 or 2 fragments per "block." With a "block" of 16 physical blocks and 2 fragments, the minimum allocation is 8 physical blocks -- 4K bytes. That would be wasteful for a user with a 17-byte file (4096 - 17 = 4079 wasted bytes), but it means very efficient disk transfers if most of the files in that file system are 2M bytes or more.
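
    The fragment arithmetic in that example, spelled out:

```shell
phys=512              # bytes per physical block
per_block=16          # physical blocks per "block"
frags=2               # fragments per "block"
frag_size=$((phys * per_block / frags))   # minimum allocation: 4096 bytes
file_size=17
echo "wasted in a ${file_size}-byte file: $((frag_size - file_size)) bytes"
```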

    So there's one obvious file system tuning rule: file systems with large files benefit from large "blocks" and (relatively) fewer fragments. Databases are an obvious candidate for directories mounted on file systems with larger blocks.

Inode Tuning

    The quantity of inodes allocated to a file system can also be adjusted when the file system is created. For a given file system size, say 500M bytes, a file system with 1000 files will need fewer inodes than a file system with 5000 files, because you need one inode for each file you create, plus one for each directory, plus one more for the mount point.

    Inode blocks do not store data, just kernel information about the file -- they're file system overhead. In the first case in the paragraph above, you need one inode per 500K bytes of data; in the second, one inode per 100K bytes of data. If you can fit four inodes into one physical block (128 bytes per inode -- see inode.h for the exact figure), then you will need 1000 / 4 = 250 blocks full of inodes (plus spares for directories) in the first case, but 1250 blocks full of inodes (plus spares) in the second. That's an extra 1000 blocks, about 500K bytes, of pure overhead in a 500M-byte partition.
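
    The same figures as a quick calculation, using the 128-byte inode size assumed above:

```shell
inode_bytes=128        # check your own inode.h for the real figure
block_bytes=512
per_block=$((block_bytes / inode_bytes))          # 4 inodes per physical block
for files in 1000 5000; do
    echo "$files files: $((files / per_block)) inode blocks"
done
```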

    In practice, the default allocation is far more generous than either case. See for yourself on your own file systems: compare df -k to df -e (or df to df -i on BSD systems) to see how many inodes go unused on your file systems. Systems with BSD parentage usually provide a newfs command that invokes mkfs with parameters specifying how many inodes to allocate per K bytes of data space.

Measures of Quality

    The quality of system administration is tough to measure. A useful approach is to imagine what would happen if you did a perfect job. So, if you could set up your file systems absolutely perfectly, here's how they might look:

  • Each file system would have exactly the right number of inodes, that is, there would be no inodes left if the file system were to reach 110 percent of capacity.

  • File systems containing mostly large files would have large blocks and few fragments.

  • A file accidentally deleted by a user could be instantly retrieved from backup tapes.

  • Disk activity would be equally divided among all disks. Performance monitors would show all disks with the same rate of data transfer.

  • Static file systems would be 100% full; see Table 1 for some other targets.
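
    One way to watch those targets is to filter df output for file systems outside their target range. The df output below is canned so the sketch is self-contained; in practice you would pipe in real df -k output, and the 90 percent threshold is an assumption to adjust per file system.

```shell
sample='Filesystem  kbytes  used    avail  capacity  Mounted on
/dev/sd0a   30000   29400   600    98%       /
/dev/sd0g   200000  120000  80000  60%       /usr'
# Skip the header; print any file system fuller than the threshold.
echo "$sample" | awk 'NR > 1 && $5 + 0 > 90 { print $6, "is", $5, "full" }'
```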

    By applying your knowledge of the basic file system structure to the needs of your users and to your own needs for system management, you can engineer the use of your disks to trade off speed for more space, or security for simplicity. The tools to do so are already present. You need only apply your own good judgment, then find the time to do the job.

About the Author

    John Caywood received B.S. and M.S. degrees in computer science from Old Dominion University, Norfolk, Va., and he taught computer science there for three years. He is currently employed by InfiNet, an Internet access provider in Norfolk, Va. He can be reached at caywood@infi.net.
