aug2002.tar

Volume Management and File Systems Usage and Implementation

Henry Newman

Current file systems trace their roots from the UFS file system, which was proposed in 1965. By the early 1970s, the UNIX file system was up and running. Since then, not much has changed in file systems and there have only been incremental hardware changes. I think the file system and volume manager are the most critical components in achieving I/O performance from both the OS and underlying hardware. Even the best file system and volume manager can be configured so that the performance is poor. Therefore, my next couple of columns will cover file system and volume management, in addition to file system configuration and tuning.

File System Basics

The purpose of file systems (FS) is to maintain a view of the storage so we can create files. This is done so that users can create, delete, open, close, read, write, and extent files on the device(s). File systems can also be used to maintain security over files.

Volume Manager Basics

The original goal of the UNIX volume management (VM), which was developed in the late 1980s, was to group disk devices together so that file systems larger than a single device could be created, and to achieve high performance by striping devices. Since file systems could only mkfs a single device, volume managers provided that single device, and features were added to support software RAID.

Standard VM Inner Workings (Striping)

Most file systems require a VM to group disk and/or RAID devices together. Striping spreads the data across the devices based on the stripe size set within the volume manager. Note that some volume managers support concatenation, which starts with the first device and then only writes to the next device when the first device becomes full. The idea behind striping is to spread the data across multiple devices to improve performance and allow multiple I/O disk-head seeks to occur simultaneously. Figure 1 shows what happens with standard striping for allocation of multiple files writing at the same time, and shows what happens when one of those files is removed.

File Systems that Maintain Their Topology

Some file systems maintain and understand the device topology without a volume manager. These file systems support both striping and round-robin allocation. Round-robin allocation means that each device is used individually. In most cases, each file open moves to the next device. In some file systems, it could be that each directory created moves to the next device. Figure 2 shows an example of round-robin allocation, which is very different from striping. I will show how round-robin allocation has some important implications for performance.

File Allocation Comparison

One reason that volume managers do not provide a round-robin allocation method is because of the interaction between the volume manager and the file system. Every file system must allocate space and maintain consistency, which is one of the main purposes of the file system. There are multiple types of file system allocation, but the real issue is that a volume manager presents a single set address range for the block devices in the file system for the file system to allocate from. The volume manager then translates the address to each of the devices. It is difficult, but not impossible, for the volume manager to pass all of the underlying device topology to a file system. Also, most file systems designed with volume managers do not have an interface to understand the underlying volume topology. Other file systems that control their own topology can easily use round-robin allocation, because the file systems understand the underlying topology.

How Volume Managers and File Systems Work

It is important to fully understand how volume managers and file systems work internally to choose the best file system for the application. By understanding the inner workings, you will have a much better idea of what the tunable parameters mean and how to improve performance with those tunables and your available hardware.

Space Allocation

Each file system supports a method of representing space for files within the file system. The two most common methods are:

1. Extents

2. Indirect blocks

Extents

If the file system is using extents-based allocation, space is allocated within the inode for block address locations for the data for the file. In most file systems, 15 extents addresses for the data are used in the base inode, and the last address in the inode is linked to another inode, where an additional 15 extent addresses are available. Examples of extent-based file systems are Veritas VxFS, Sun QFS, SGI xfs, and Linux extfs-3.

Indirect Blocks

With indirect block allocation, the last block of space allocated for a file points to the next allocation. Thus, the base inode space is allocated for the data, and the last block of the allocated data space points to the next allocated space. Indirect blocks were originally designed for the UFS file system in 1970. At that time, disk drives were so slow that a seek was not required to go back to the inode and allocate more space. File systems that support indirect block allocation are Sun UFS, IBM JFS, and MS FAT-32.

Performance Comparison

Indirect block allocation and read/write performance can be painfully slow compared to the extent-based allocation method. For example, consider an application doing random reads and writes. To find the block address for the record, a file allocated with indirect blocks must read all of the data areas of the files for the record in question prior to reading the record. With extent-based allocation, the file system can simply read the inodes in question, which will make an enormous difference in performance. I am unaware of any new file systems using indirect blocks for space allocation because of the huge performance penalties for random I/O. Even for sequential I/O, the performance for indirect blocks is generally less than extent-based file systems.

Free Space Allocation and Representation Methods

Each file system uses an algorithm to find and allocate free space within the file system. Most file systems use binary tree (btree) allocation to represent free space, but some file systems use bitmaps to represent free space. Each method of free space representation has advantages and disadvantages.

Bitmap Representation

The use of bitmap representation is less common. This method is used where each bit in the map represents a single allocation unit such as 1,024 bytes, 512 KB, or even hundreds of megabytes. Therefore, a single bit could represent a great deal of space.

Binary Tree (Btree) Representation

A btree is basically a sorted list of all the free allocations and used space for the file system. It is important to understand how the space is allocated from the tree. Some good reading on btree allocation, which is used in most file systems to find free space, can be found at:

http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/trees.html
http://cs-people.bu.edu/elenam/cs112-labs/lab9/lab9.html 
http://www.cs.usask.ca/research/research_groups/aries/projects/applets/tutorials/trees/bintree/

Free Space Allocation

With each allocation type (btree or bitmap), free space must be found and allocated with the representation method. These allocation algorithms find the free space based on their internal search algorithms. The two most common methods used are first fit and best fit.

First Fit

The first fit method tries to find the first space within the file system that matches the allocation size requested by the file being allocated. In some file systems, the first fit method is used to find the space closest to the last allocation of the file being extended, thereby allowing the allocation to be sequential block addresses allocated for the file within the file system.

Best Fit

The best fit method tries to find the best place in the file system for the allocation of the data. This method is used to try to reduce total file system fragmentation. This method always takes more CPU cycles than first fit, because the whole file system must be searched for the best allocation. (Note that in systems using round-robin allocation only, the device on which the initial allocation was made must be searched.) This method works to reduce fragmentation, especially when files cannot be pre-allocated (for file systems that support this method) or for large allocations, such as multiple megabytes. Most vendors do not support this method, and most allocations in file systems are not large because the overhead would be huge. The old Cray NC1FS supports this method by using hardware vector registers to quickly perform the search.

Conclusions

In general, btrees do a better job for small allocations, but then the files will be fragmented. Bitmaps are better for larger allocations, but the time for allocation can be much greater to search the map for free space. Both of these methods can be optimized for operational requirements with tunable parameters in the file system and volume manager and with RAID configuration and allocations. Some file systems have a hybrid approach to allocation -- these are generally proprietary.

Summary

In this article, I've provided a basic overview of how file systems and volume managers work internally. My next column will apply this information to specific tuning options and to understanding file system and volume manager tradeoffs and choices. After reviewing the file system and volume manager tuning options, I will then review HBA and RAID, thereby illustrating how I/O works in the system. I hope this information will help you make the best possible choices when configuring or tuning hardware and software systems.

Henry Newman has worked in the IT industry for more than 20 years. Originally at Cray Research and now with a consulting organization, he has provided expertise in systems architecture and performance analysis to customers in government, scientific research, and industry around the world. His focus is high-performance computing, storage and networking for UNIX systems, and he previously authored a monthly column about storage for Server/Workstation Expert magazine. He may be reached at: hsn@hsnewman.com.