Using fsck to Check ufs File Systems
A file system is a mechanism for locating files on a
recording medium, the device implementing a compatible
storage structure (e.g., random access disks with fixed block sizes
and sequential access tapes with variable block sizes).
storage can be considered as an organized "pool"
of data storage
resources. There are several common types of UNIX file systems,
but this article discusses only the UNIX ufs file system.
presented are applicable to all UNIX file systems being
File systems are generally logically independent of
the required underlying
physical storage devices, although their operation is affected
by these storage devices (for example, errors affecting
the devices commonly affect the file system structure
or the files
maintained by the file system). However, not all file system errors
can be related to an error reported by the underlying storage device
(some other storage device, or source of error, may
be the root cause).
For simplicity, my discussion will be limited to storage devices connected
to the host computer via the SCSI (Small Computer System Interface)
bus. These devices support an addressable linear array of
logical blocks, each block containing 512 bytes of data (other
block sizes are 1024 and 2048). Focusing on SCSI systems simplifies
to some extent problem determination and error recovery/restart procedures,
since these procedures are designed to support the selected
devices. In conjunction with the discussion of using fsck to
check ufs file systems, I present a system administrator's script
for checking multiple file systems.
SCSI Disk Error Model
A very high percentage of SCSI disk errors occur when
data is retrieved
from the medium. Errors that occur when storing data
are usually not
detected until the data is retrieved, although if the
device is unable
to locate the proper block on a write operation it can
signal a write error.
Even though the majority of accesses to a disk are data retrieval
operations, complete media testing entails testing both storage and
retrieval operations. The device that supports defect
linking around defective blocks. Typically, this service is performed
either in response to the execution of a disk maintenance command
or at the direction of a disk device driver in response
to a "hard
error" reported by the device.
A "hard error" exists: 1) where the device
signals that it
is unable to recover data (e.g., too many errors); or
2) where successive
unsuccessful attempts to correctly retrieve data (errors
along with the data) are halted by the device driver.
The disk driver may have previously requested that the
reassign a "bad" block during an initialization
not, it may issue a separate command to the device to
known "bad" block. It may also choose to do
a hole in the linear array and presenting an opportunity
utility to attempt a patch of the linear array.
Successive unsuccessful attempts to retrieve data followed
by a single
successful data retrieval can occur before the device
the data retrieval operations. Each such unsuccessful
attempt is termed
a "soft error." Excessive "soft errors" are indicative
of future problems and should be corrected during preventive maintenance.
The reassignment of a data block may result in data
loss even though it is
performed without additional errors. This usually happens when the
device is unable to correctly recover data from the "bad"
block and transfer it to the patch block.
A device driver typically stores whatever data has been retrieved
from a "bad" block and writes it to the patch block upon
completion of the reassignment. Where no data is retrieved, a block
of ZEROs is written to the patch block. Little else
can be done to
recover data directly; instead, restoring files from backups
should be considered.
Device maintenance commands such as format can be used to
"patch" the linear array of logical blocks.
you can perform
surface analysis to detect and reassign "bad"
blocks and should
do this periodically since the recording media degrade over time.
ufs File System
A ufs file system can be represented simply as a large data structure
composed of sequences of one or more smaller data structures (each referred
to as a cylinder group), each composed of the following:
-- Cylinder Group Map
-- Storage Blocks
The "super-block" contains information on
the size and status
of the file system, the label (obtained from block 0
of a SCSI disk),
and the cylinder group. Multiple "super-blocks" are maintained
and used to repair ones that are bad.
Inodes contain all information about a file except its name (which is kept
in a directory). Typically, one inode is created for
every 2048 bytes
of available storage (this can be altered when the file system is
built; refer to the mkfs user command). An inode contains
information on the:
-- file type (regular, directory, block, character,
symbolic link or FIFO/pipe),
-- file permissions,
-- number of hard links,
-- user-id and group-id,
-- number of bytes,
-- first 12 disk block addresses,
-- three indirect pointers to additional disk block addresses,
-- file time-related data.
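The figures above lend themselves to a quick back-of-the-envelope calculation. The following Bourne shell sketch estimates how many inodes a file system of a given size would receive at the quoted density of one inode per 2048 bytes, and how much file data the 12 direct block addresses can reach. The 8192-byte block size and the function names are assumptions chosen for illustration; they do not come from the article.

```sh
#!/bin/sh
# Rough inode arithmetic. Block size and function names are
# illustrative assumptions.

BYTES_PER_INODE=2048   # inode density quoted in the text
DIRECT_ADDRS=12        # direct block addresses held in an inode

# inode_estimate <file system size in bytes>
inode_estimate() {
    echo $(( $1 / BYTES_PER_INODE ))
}

# direct_coverage <block size in bytes>
# Bytes of file data reachable without using an indirect pointer.
direct_coverage() {
    echo $(( DIRECT_ADDRS * $1 ))
}

inode_estimate 104857600    # a 100 MB file system
direct_coverage 8192        # assumed 8 KB storage blocks
```

Files larger than the direct coverage need the indirect pointers, so damage to an indirect block can affect far more data than damage to a single storage block.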
The majority of blocks in a cylinder group (1-to-32
are allocated to storage blocks. The UNIX user command
fsck is used
to quickly check the super-blocks and the inodes for
file system inconsistencies:
1) during the "install" phase where the operating system is
loaded onto the system disk and configured for normal operation,
2) during a "bring-up" of the operating system,
3) when adding a new disk to the system,
4) when analyzing problems associated with a file system supported
by a specific physical disk,
5) during repair procedures prior to returning a disk
device to normal service,
6) during preventive and predictive maintenance procedures as
a disk probing tool that looks for file system-related errors.
fsck Check Sequence
fsck sequences through the following phases:
Check Blocks and Sizes (file system inode list),
Check Pathnames (directory entries),
Check Connectivity (one directory for each inode and
multiple links make sense),
Check Reference Counts, and Cylinder Groups (link count
and alterations made previously), and
Check the Free List (blocks are allocated to an inode
or the free block list).
(Refer to Thomas and Farrow [5] for a more detailed discussion of
the check sequence. The information presented here has been drawn
from the references listed at the end of this article.)
An underlying presumption regarding the use of fsck is that
files with corrupted inodes should be replaced from backups
and that the user is able to keep track of the files requiring
action. The following key factors also affect the use of fsck:
-- mounted (active) file systems may change while
being examined by fsck,
-- any change that occurs in a file system while
running fsck can produce inconsistencies,
-- inconsistencies may be minor enough to result
in an automatic repair action by fsck,
-- major inconsistencies require user intervention,
-- false inconsistencies are treated as though they
were actual inconsistencies.
To avoid related problems, fsck does not work on mounted file
systems, the exception being the root (/) file system. fsck
can be run on root while in single-user mode.
However, there are two interfaces to a block storage device (block
and character, or raw), and it is possible to run fsck on
a mounted file system through the raw interface. Doing
so makes the
check vulnerable to the problems that could arise if fsck
were run on an "active" or mounted file system. I
strongly recommend that fsck be run on unmounted file systems
and on the root file system when in single-user mode.
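A script can refuse to start a check on a file system that is still mounted. The sketch below looks for the block device name in the first field of a mount table. The default table path /etc/mnttab and both function names are assumptions; other systems keep the table elsewhere, and the caller must pass the block device name (not the raw one), since that is what the table records.

```sh
#!/bin/sh
# is_mounted <block device> [mount table]
# Succeeds if the device appears as the first field of the mount table.
# /etc/mnttab is an assumed default location.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/etc/mnttab}"
}

# safe_to_check <block device> [mount table]
# Returns 0 if the device is not mounted, 1 if it is, and 2 if the
# mount table cannot be read (so a missing table is never mistaken
# for "not mounted").
safe_to_check() {
    table="${2:-/etc/mnttab}"
    if [ ! -r "$table" ]; then
        echo "cannot read mount table $table" >&2
        return 2
    fi
    if is_mounted "$1" "$table"; then
        echo "$1 is mounted; unmount it before running fsck" >&2
        return 1
    fi
    return 0
}
```

The root file system is the one case this test cannot settle, since it is always mounted; it still has to be checked from single-user mode.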
The "-y" Option
The "-y" option allows fsck to assume yes
as a response
to all queries about repair actions. This option should
not be used
on a file system that contains important user data.
As I explain later
(in the "Repairs" section), there are cases where fsck
should not be allowed to perform a repair action.
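One cautious alternative to "-y" is to make a first pass with "-n", which answers no to every repair question, so that the proposed repairs can be reviewed before anything is changed. The sketch below wraps that first pass. The function name and the FSCK override variable are hypothetical, added only so the wrapper can be exercised without touching a real device.

```sh
#!/bin/sh
# preview_check <raw device>
# Runs a no-modify pass: -n answers "no" to every repair question.
# FSCK is a hypothetical override for dry runs; it defaults to fsck.
preview_check() {
    ${FSCK:-fsck} -n "$1"
}
```

A typical sequence would be to save the output of preview_check, read through the proposed repairs, and only then run fsck interactively on the same device.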
The install phase yields a fully operational, configured
system upon successful completion. It requires the identification
of at least one operable disk (the system disk) upon
which the necessary
number of supported, operable file systems (determined
can be built. This in turn requires that the host computer
supporting the system disk be operable (this can often
at a lower level of functionality than the install phase,
cases a PROM monitor supports communication with attached
At all points in the install phase, system disk-related errors can
result in failures and aborts that require user analysis and
proper responses. Since install processes rarely attempt a full
verification of the underlying hardware, the user must rely
upon past history to select an appropriate error recovery procedure
(i.e., the presumption is that the install would be performed
on a fully operational system).
A successful install process does not guarantee that
the system disk
is completely operational. The install process transfers data to
the system disk and prepares it for normal operation,
which requires many storage operations that are not
followed by the
retrieval operations that would normally detect disk-related errors.
Errors encountered after a successful install may actually date
back to the time of the install.
Since fsck may be the only diagnostic tool used by the install
process to check system disk operability, errors reported
in an install
should result in a more detailed test of the host computer and its
storage devices (some computer system vendors provide
system operability test packages). You should not, under any circumstances,
attempt to continue the install process.
A full destructive surface analysis of the media before
an operating system on a SCSI disk or after subsequent
errors have been encountered will detect most latent
errors that could cause "install" phase and
This will also require that the user re-install the
The following are useful guidelines for the install
If an error is reported by the system disk or any other
hardware component, analyze the problem to see if a
service call is
necessary (can the error be corrected via simple user
If the install has been corrupted by the reporting of
an error, try re-starting the install process.
If fsck has reported errors, the likely suspect
is the underlying storage device, since it is allegedly
a new file system has just been built on it, and little
time has passed
for data loss or corruption to occur.
If the install has not completed or initial checks have
not performed successfully, take the system down to
and use fsck as a diagnostic probe.
Consider any errors that occur during an install phase
to be indicative of an abnormal condition.
Presuming the successful completion of an install, the system will
at some time be booted and a "bring-up" phase entered. In
this phase the function of fsck is again a very quick check
of file system inconsistencies. If minor problems are found, fsck
may be able to correct them and continue. If not, the bring-up
process may require that fsck be run independently,
i.e., some repair actions
will be necessary.
fsck is important here because the system may have been shut
down incorrectly, may have encountered a "panic" condition,
or may have shut down due to errors or equipment modifications. The earlier
statements regarding fsck apply here as well, but there is
one notable additional factor -- recorded past history
in the system log.
Problems reported by fsck during a bring-up phase should prompt
the user to scan the system log for device-related errors.
Both "soft" and "hard" errors should be examined,
since soft errors can evolve into hard errors.
Where difficult errors are present, especially those involving the
system disk, a suggested approach is to boot an operating system
over the net or from a local CDROM device, or integrate
into the host
computer a known good disk containing an operating system, and use
it to perform further checking and data recovery.
fsck is not a particularly useful tool for problem determination.
Errors reported by fsck require a good deal of interpretation
based upon intimate knowledge of the file system and
storage device. Such errors should always be analyzed
with the system log, past history, experience, and a
facility. The proper next step is rarely obvious from
provided by fsck.
Preventive and predictive maintenance procedures should
to probe for file system inconsistencies. Such procedures
perform full media "non-destructive" testing
to ensure that
no blocks are accessible that could cause a future "hard"
or "soft" error. "Destructive" media
be undertaken only after full data recovery has been
It is possible for maintenance procedures themselves
errors (even system "panic" conditions) that
have no relation
to a recorded storage device error (e.g., data corruption
and the maintenance procedure is processing garbage).
not dissuade you from developing and performing them,
but should help
you to see the importance of executing them in sufficiently
environments (e.g., full backups performed).
Repairs
fsck has a repair feature that can correct minor problems
but can also, if used inappropriately, do major damage
to a file system.
fsck is presumed to be an expert on the file system and to
have the proper repair action selected before requesting permission
to proceed. The "-y" option, for those who
have complete faith
in its ability, causes it to automatically repair any
errors it encounters.
In a number of cases, however, fsck's suggested repair is
not appropriate. In most of these cases, fsck will be asking
for permission to remove a file or clear an inode; authorizing such
repairs without investigating can result in data loss. The following
cases have been identified empirically.
SORRY: NO lost+found DIRECTORY or NO SPACE in lost+found
Clearly there is a problem with the lost+found directory that requires
immediate attention. fsck should be terminated and other action
taken to resolve problems.
If no such directory exists, create one and then rerun fsck.
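Creating the directory can be sketched as follows. The long-standing practice on older UNIX systems is to create the directory and then pre-allocate directory slots by creating and removing a batch of empty files, so that fsck does not have to allocate new blocks on a damaged file system. Some fsck implementations create or grow lost+found themselves, in which case only the mkdir matters. The function name and the default of 64 slots are assumptions.

```sh
#!/bin/sh
# make_lost_found <mount point> [slot count]
# Creates <mount point>/lost+found and pre-allocates directory slots by
# creating and then removing empty files. The default of 64 slots is an
# arbitrary assumption.
make_lost_found() {
    dir="$1/lost+found"
    slots="${2:-64}"
    mkdir "$dir" || return 1
    i=0
    while [ "$i" -lt "$slots" ]; do
        : > "$dir/slot.$i"          # create an empty placeholder file
        i=$((i + 1))
    done
    rm -f "$dir"/slot.*             # leave the directory empty
}
```

The file system has to be mounted for this step and unmounted again before fsck is rerun.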
If there is no space left in the directory, and the
number of files
in the directory is large, rerunning fsck is inappropriate
(instead, find out why so many files were placed in the directory).
Where large files exist in the directory, attempt to determine whether
the large files are valid and are not file fragments.
An example would
be a copy operation that could not complete successfully, with the
result that only a portion of a file remains in the directory.
DUP TABLE OVERFLOW
The DUP table stores a list of inodes with duplicate
message occurs when the table runs out of space. Such
should put the user on notice that unusual conditions
and the potential for data loss is high. The recommended
is to "write down the inode numbers of inodes with
found after this point, and don't REMOVE any of the
to the inodes or CLEAR these inodes."
An alternate suggestion, given that the overflow represents an unusual
condition, is to stop at this point and attempt to recover whatever
files are accessible before proceeding. The recovery can
simply be a raw disk copy (using the dd user command).
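A raw copy with dd might look like the sketch below. The conv=noerror,sync operands tell dd to keep going after a read error and to pad each short or failed read to the full block size, so that offsets in the copy still line up with the source. The function name is hypothetical, and the 512-byte block size simply mirrors the logical block size discussed earlier; a larger size copies faster but loses more data around each unreadable block.

```sh
#!/bin/sh
# raw_copy <source device or file> <destination file>
# Block-for-block copy that continues past read errors.
#   noerror  do not stop when a read fails
#   sync     pad short or failed reads with NULs to a full block
raw_copy() {
    dd if="$1" of="$2" bs=512 conv=noerror,sync
}
```

The destination should be on a different, known good device with enough free space to hold the entire source partition.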
Read, Write or Seek Errors
These messages indicate that a "hard" error has occurred
and fsck is asking if it should continue in the presence of
such errors. Only if you are very familiar with the storage
devices and with what fsck is attempting to accomplish should
you attempt to continue. A possible alternative is to
perform a "non-destructive"
surface analysis of the attached storage device.
The surface analysis should reassign a "bad"
block if it encounters
a "hard" error (refer to the "SCSI Disk Error Model"
section earlier). Since data may be lost in the procedure,
record all reassigned blocks, rerun fsck, and mark
all files affected by the reassignment. Upon completion,
consider restoring files from recent backups.
PARTIALLY ALLOCATED INODE I = 14
Legal inode types are given in reference . A partially
inode is one that has a type of 0, but some information
the mode word. This often indicates a block containing
occurrence of many of these suggests that the file system
to have widespread damage, including corrupted files.
A good practice here is to record the inode numbers
so that, if they
become needed in phase 2, the filenames linked to the
inodes can be
found. Data recovery after completion, as well as close monitoring of
the file system during preventive and predictive maintenance procedures,
is also recommended. If actual data corruption is found, the storage
device should be re-evaluated and the file system rebuilt before returning
the storage device to service.
LINK COUNT TABLE OVERFLOW
This message indicates that there is no more room to store inodes
that have a zero link count and will recur for all subsequent inodes
with zero link counts. If this is the only error reported, you may
allow fsck to continue, but if multiple errors are reported,
fsck should be terminated. Upon completion or termination,
a file recovery procedure should be performed.
EXCESSIVE BAD BLOCKS I=13
Ten bad blocks have been detected while checking this inode.
Something is seriously wrong, and it is time to do something other
than continue with fsck.
Initialization errors are worth mentioning because they
can be triggered
by recent problems with devices that have been performing
over some period of time (storage devices can fail suddenly).
fsck is initiated, perhaps in response to an error message,
the following error can be quite surprising:
Cannot stat <device name>
Here fsck cannot obtain information on the file
system supported by <device name>. It is possible
file system does not exist, cannot be opened due to
has been removed from the device tree by the device
the device no longer responds to commands). The appropriate response
is to immediately initiate a problem determination procedure on the
underlying storage device.
Minor problems automatically detected and corrected by fsck
rarely result in data loss or corruption. However, the likelihood of
data loss or corruption increases up to a certainty
if the user chooses
to continue running fsck in the presence of clearly
major errors. By placing complete faith in the ability of fsck
to detect and correct errors (the "-y" option),
you lose control
over the repair process and may remain unaware of the
major problems that require immediate attention, e.g.,
Administrator's Bourne Shell Script
When a system administrator has to check multiple file systems,
a tool for performing the checks and highlighting the results is
handy. The mfsck script (Listing 1) allows you to specify
a large number of disks and partitions/slices to check
It's important that you avoid overloading the OS when
doing this by
initiating too many processes.
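One simple way to cap the number of simultaneous checks is to start them in fixed-size batches and wait for each batch to finish before starting the next. The sketch below illustrates that idea only; it is not taken from Listing 1, and the function name and argument order are assumptions.

```sh
#!/bin/sh
# run_batched <max parallel> <command> <arg>...
# Runs "<command> <arg>" once per argument, at most <max parallel> at a
# time. <command> is deliberately left unquoted so that it may carry
# options (for example "fsck -n").
run_batched() {
    max="$1"; cmd="$2"; shift 2
    running=0
    for arg in "$@"; do
        $cmd "$arg" &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait            # block until the whole batch has completed
            running=0
        fi
    done
    wait                    # collect the final, possibly partial, batch
}
```

Batching is coarser than keeping exactly N jobs busy at all times, but it needs nothing beyond the shell's wait built-in. It also helps to put partitions of the same physical disk in different batches so that they do not compete for the same drive.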
mfsck supports a very simple interface which is displayed
if no arguments are provided:
mfsck <task file>
<task file> is the name of a file containing the disks and the file systems to check.
The syntax of the file <task file> is
# comment line
<logical device> <partition> ... <partition>
c0t5d0 0 1 6
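A task file in this format is easy to expand into one line per file system. The sketch below is one possible reading of the syntax shown above (it is not the code from Listing 1): comment lines and blank lines are skipped, and each remaining line yields one device/partition pair per listed partition. The function name is hypothetical.

```sh
#!/bin/sh
# expand_tasks <task file>
# Prints one "<logical device> <partition>" pair per line, skipping
# comment lines (leading '#') and blank lines.
expand_tasks() {
    while read -r dev parts || [ -n "$dev" ]; do
        case "$dev" in
            ''|'#'*) continue ;;
        esac
        for p in $parts; do
            echo "$dev $p"
        done
    done < "$1"
}
```

How each pair maps to a device node depends on the system. On systems that name raw disk slices /dev/rdsk/<device>s<partition>, the pair "c0t5d0 6" would correspond to /dev/rdsk/c0t5d0s6, but that naming is an assumption here, not something the task file format itself defines.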
The <task file> below contains a single entry
and points to a nonexistent disk:
# cat task.file
Executing the script with this <task file> yields
the output in Figure 1.
Using a single-entry <task file> that contains
a valid disk
# cat task.file
yields results like those in Figure 2.
The fsck operations are performed in the background and the
script checks for completion. The script can be broken
up into smaller
files, as originally designed, and further simplified.
form is intended to be simple and straightforward, facilitating
Multiple <task file> specifications can be built
of future requirements. Log files generated by the script
can be removed
or saved at the user's option. Where errors are encountered, the log
files should be scanned to determine where the error(s) occurred.
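Scanning the logs for the messages discussed in this article can be automated with grep. In the sketch below the pattern list is taken from the message headings used earlier; the exact wording printed by a particular fsck version may differ, and the function name is hypothetical. Listing /dev/null first is a portable way to make grep prefix every match with its file name, even when only one log is given.

```sh
#!/bin/sh
# scan_fsck_log <log file>...
# Prints every line matching one of the messages discussed earlier,
# prefixed by file name and line number. Returns 0 if anything was
# found and 1 if the logs look clean.
scan_fsck_log() {
    grep -n -E \
        -e 'lost\+found' \
        -e 'DUP TABLE OVERFLOW' \
        -e 'PARTIALLY ALLOCATED INODE' \
        -e 'LINK COUNT TABLE OVERFLOW' \
        -e 'EXCESSIVE BAD BLOCKS' \
        -e 'Cannot stat' \
        /dev/null "$@"
}
```

A clean scan only means that none of these particular messages appeared; the system log should still be checked for device errors.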
fsck is a tool for checking file systems that requires a high
level of skill and experience to use effectively. fsck can
be considered, in most cases, as an expert in selecting
the best "next
step" in a repair process. However, it has a very
in some cases confusing, user interface. Moreover, it
some cases that the user record data to be used at a
fsck also has a high potential for data loss and corruption,
since checking is limited to the file system. It can
be used effectively
as an indicator of possible current damage, but provides
in the development of appropriate error recovery procedures.
Even given its limitations, however, when combined with an
understanding of the underlying storage devices and
of the file system, fsck remains a very useful tool in the
system administrator's toolbox.
References
1. UNIX Software Operation. UNIX System V Release
4 System Administrator's Guide. Englewood Cliffs, N.J.:
2. Bach, Maurice J. The Design of the UNIX Operating
System. Englewood Cliffs, N.J.: Prentice-Hall, 1986.
3. Nemeth, Evi, Garth Snyder, and Scott Seebass.
UNIX System Administration Handbook. Englewood Cliffs,
4. Fiedler, David, and Bruce H. Hunter. UNIX
System Administration. Indianapolis, IN: Hayden Books,
5. Thomas, Rebecca, and Rik Farrow. UNIX Administration
Guide for System V. Englewood Cliffs, N.J.: Prentice-Hall,
About the Author
Tom Clark has been working with UNIX since 1984 as
developer. He is currently working as a system software
for Sun Microsystems SMCC System Software Quality Assurance
in Mountain View, CA. Tom has a B.S.E.E. from the University
Mexico, an M.S.E.E. from Wichita State University, an
Engineering degree from the University of Southern California,
a B.S.L. and J.D. from Peninsula University. He can
be reached at