Volume Management and File Systems Usage and Implementation
Henry Newman
Current file systems trace their roots from the UFS file system,
which was proposed in 1965. By the early 1970s, the UNIX file system
was up and running. Since then, not much has changed in file systems
and there have only been incremental hardware changes. I think the
file system and volume manager are the most critical components
in achieving I/O performance from both the OS and underlying hardware.
Even the best file system and volume manager can be configured so
that the performance is poor. Therefore, my next couple of columns
will cover file system and volume management, in addition to file
system configuration and tuning.
File System Basics
The purpose of file systems (FS) is to maintain a view of the
storage so we can create files. This is done so that users can create,
delete, open, close, read, write, and extent files on the device(s).
File systems can also be used to maintain security over files.
Volume Manager Basics
The original goal of the UNIX volume management (VM), which was
developed in the late 1980s, was to group disk devices together
so that file systems larger than a single device could be created,
and to achieve high performance by striping devices. Since file
systems could only mkfs a single device, volume managers
provided that single device, and features were added to support
software RAID.
Standard VM Inner Workings (Striping)
Most file systems require a VM to group disk and/or RAID devices
together. Striping spreads the data across the devices based on
the stripe size set within the volume manager. Note that some volume
managers support concatenation, which starts with the first device
and then only writes to the next device when the first device becomes
full. The idea behind striping is to spread the data across multiple
devices to improve performance and allow multiple I/O disk-head
seeks to occur simultaneously. Figure 1 shows what happens with
standard striping for allocation of multiple files writing at the
same time, and shows what happens when one of those files is removed.
File Systems that Maintain Their Topology
Some file systems maintain and understand the device topology
without a volume manager. These file systems support both striping
and round-robin allocation. Round-robin allocation means that each
device is used individually. In most cases, each file open moves
to the next device. In some file systems, it could be that each
directory created moves to the next device. Figure 2 shows an example
of round-robin allocation, which is very different from striping.
I will show how round-robin allocation has some important implications
for performance.
File Allocation Comparison
One reason that volume managers do not provide a round-robin allocation
method is because of the interaction between the volume manager
and the file system. Every file system must allocate space and maintain
consistency, which is one of the main purposes of the file system.
There are multiple types of file system allocation, but the real
issue is that a volume manager presents a single set address range
for the block devices in the file system for the file system to
allocate from. The volume manager then translates the address to
each of the devices. It is difficult, but not impossible, for the
volume manager to pass all of the underlying device topology to
a file system. Also, most file systems designed with volume managers
do not have an interface to understand the underlying volume topology.
Other file systems that control their own topology can easily use
round-robin allocation, because the file systems understand the
underlying topology.
How Volume Managers and File Systems Work
It is important to fully understand how volume managers and file
systems work internally to choose the best file system for the application.
By understanding the inner workings, you will have a much better
idea of what the tunable parameters mean and how to improve performance
with those tunables and your available hardware.
Space Allocation
Each file system supports a method of representing space for files
within the file system. The two most common methods are:
1. Extents
2. Indirect blocks
Extents
If the file system is using extents-based allocation, space is
allocated within the inode for block address locations for the data
for the file. In most file systems, 15 extents addresses for the
data are used in the base inode, and the last address in the inode
is linked to another inode, where an additional 15 extent addresses
are available. Examples of extent-based file systems are Veritas
VxFS, Sun QFS, SGI xfs, and Linux extfs-3.
Indirect Blocks
With indirect block allocation, the last block of space allocated
for a file points to the next allocation. Thus, the base inode space
is allocated for the data, and the last block of the allocated data
space points to the next allocated space. Indirect blocks were originally
designed for the UFS file system in 1970. At that time, disk drives
were so slow that a seek was not required to go back to the inode
and allocate more space. File systems that support indirect block
allocation are Sun UFS, IBM JFS, and MS FAT-32.
Performance Comparison
Indirect block allocation and read/write performance can be painfully
slow compared to the extent-based allocation method. For example,
consider an application doing random reads and writes. To find the
block address for the record, a file allocated with indirect blocks
must read all of the data areas of the files for the record in question
prior to reading the record. With extent-based allocation, the file
system can simply read the inodes in question, which will make an
enormous difference in performance. I am unaware of any new file
systems using indirect blocks for space allocation because of the
huge performance penalties for random I/O. Even for sequential I/O,
the performance for indirect blocks is generally less than extent-based
file systems.
Free Space Allocation and Representation Methods
Each file system uses an algorithm to find and allocate free space
within the file system. Most file systems use binary tree (btree)
allocation to represent free space, but some file systems use bitmaps
to represent free space. Each method of free space representation
has advantages and disadvantages.
Bitmap Representation
The use of bitmap representation is less common. This method is
used where each bit in the map represents a single allocation unit
such as 1,024 bytes, 512 KB, or even hundreds of megabytes. Therefore,
a single bit could represent a great deal of space.
Binary Tree (Btree) Representation
A btree is basically a sorted list of all the free allocations
and used space for the file system. It is important to understand
how the space is allocated from the tree. Some good reading on btree
allocation, which is used in most file systems to find free space,
can be found at:
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/trees.html
http://cs-people.bu.edu/elenam/cs112-labs/lab9/lab9.html
http://www.cs.usask.ca/research/research_groups/aries/projects/applets/tutorials/trees/bintree/
Free Space Allocation
With each allocation type (btree or bitmap), free space must be
found and allocated with the representation method. These allocation
algorithms find the free space based on their internal search algorithms.
The two most common methods used are first fit and best fit.
First Fit
The first fit method tries to find the first space within the
file system that matches the allocation size requested by the file
being allocated. In some file systems, the first fit method is used
to find the space closest to the last allocation of the file being
extended, thereby allowing the allocation to be sequential block
addresses allocated for the file within the file system.
Best Fit
The best fit method tries to find the best place in the file system
for the allocation of the data. This method is used to try to reduce
total file system fragmentation. This method always takes more CPU
cycles than first fit, because the whole file system must be searched
for the best allocation. (Note that in systems using round-robin
allocation only, the device on which the initial allocation was
made must be searched.) This method works to reduce fragmentation,
especially when files cannot be pre-allocated (for file systems
that support this method) or for large allocations, such as multiple
megabytes. Most vendors do not support this method, and most allocations
in file systems are not large because the overhead would be huge.
The old Cray NC1FS supports this method by using hardware vector
registers to quickly perform the search.
Conclusions
In general, btrees do a better job for small allocations, but
then the files will be fragmented. Bitmaps are better for larger
allocations, but the time for allocation can be much greater to
search the map for free space. Both of these methods can be optimized
for operational requirements with tunable parameters in the file
system and volume manager and with RAID configuration and allocations.
Some file systems have a hybrid approach to allocation -- these
are generally proprietary.
Summary
In this article, I've provided a basic overview of how file
systems and volume managers work internally. My next column will
apply this information to specific tuning options and to understanding
file system and volume manager tradeoffs and choices. After reviewing
the file system and volume manager tuning options, I will then review
HBA and RAID, thereby illustrating how I/O works in the system.
I hope this information will help you make the best possible choices
when configuring or tuning hardware and software systems.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.
|