When the bard said, "There are more things in heaven
and earth...
than are dreamt of in your philosophy," he could
have been speaking
of objects in the UNIX filesystem. To be able to answer
questions
from your users and handle problems that arise during
software installations,
you, as system administrator, need to understand all
the filesystem
objects in the UNIX heavens. This article covers a collection
of filesystem
topics that I find most beginning (and some experienced)
system administrators
don't know about or understand well.
Two Sys Admin articles by Emmett Dulaney -- "Understanding
Filesystems" (March/April 1995) and "When
inodes Go Bad"
(July/August 1994) -- explain the underlying structure
of the UNIX
filesystem. This underlying structure (superblocks and
inodes) is
generally invisible to the user. The present article
takes a look
at the unfamiliar, but visible, objects in the filesystem
and at some
unusual behaviors of familiar objects: files, directories, and
links; devices; named pipes; and tmpfs filesystems.
Although most of what I say applies to System V Release
4 in general and even to such oddball *nixes as the
Coherent 4.2 system
I run at home, some of the commands and details are
specific to Sun
Solaris 2.4. I encourage you to see how these commands
and examples
work on your system by trying them out as you read.
Files, Directories, and Links
The logical concept of a file under UNIX is a string
of bytes pointed
to by a directory entry. Most of the time this concept
serves quite
well, but occasionally files exhibit behavior that makes
it clear
this concept is an oversimplification.
Sparse Files
The data portion of a file is not actually one big string
of bytes.
If you open a file, position out to byte 9,999 (using
something like
the lseek system call) and write one byte, the resulting
file
does not use up 10,000 bytes of disk space. Only the
data you write
takes up disk space. If you write only one byte, the
file is reported
as the same size in blocks, regardless of the offset
of that byte
in the file. (Just how many blocks a one-byte file takes
up depends
on invisible things I won't go into. Try the program
in Listing 1
and see what happens on your system.)
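Listing 1 is not reproduced here, but a minimal sketch of the same
experiment, assuming nothing beyond the standard open, lseek, write,
and fstat calls (the filename is arbitrary, and st_blocks is counted
in 512-byte units on most systems), might look like this:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    int fd;

    fd = open("sparse.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Seek to byte 9,999 without writing anything in between,
       then write a single byte at that offset. */
    if (lseek(fd, (off_t) 9999, SEEK_SET) == (off_t) -1) {
        perror("lseek");
        return 1;
    }
    if (write(fd, "x", 1) != 1) {
        perror("write");
        return 1;
    }

    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return 1;
    }
    /* The size in bytes is 10,000, but the blocks actually allocated
       are only those holding the one byte written (plus overhead). */
    printf("bytes: %ld  blocks: %ld\n",
           (long) st.st_size, (long) st.st_blocks);

    close(fd);
    return 0;
}
Compile it, run it, and compare its output with ls -ls of the
resulting file on your system.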
When you read from a sparse file it looks as if the
file truly has
all those bytes. You get 9,999 NULs and the one byte
you wrote. But
only the one byte is actually out there on the disk.
This is why system
files like quotas and lastlog show different block
and byte sizes in an ls -ls listing:
320 -rw-r--r-- 1 root root 1026720 Mar 23 12:35 quotas
176 -rw-r--r-- 1 root root 593320 Mar 24 19:39 lastlog
These files contain entries indexed on UID. When an
entry for, say,
UID 15432 is made, the offset into the file is calculated
and the
entry is written at that offset. This results in a file
full of "holes."
If you move such a sparse file (using the mv command),
it
remains a sparse file. A move (within the same filesystem)
is a change
to directory entries only and does not involve the data
portion of
the file. Moving a file is the same as making a second
hard link to
it, then removing the original hard link. Moving a file
to a different
filesystem acts like a copy followed by removal of the
original file
(because of the way directories work -- see the section
of this
article on links).
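To make the link-then-unlink picture concrete, here is a minimal
sketch of a same-filesystem "move" (the real mv uses the rename()
system call, which does this atomically, but the effect on the
directory entries and the untouched data blocks is the same):
#include <stdio.h>
#include <unistd.h>

/* Move a file within one filesystem by cloning its directory entry
   and then removing the original entry.  No file data is copied. */
int move_within_fs(const char *from, const char *to)
{
    if (link(from, to) != 0) {     /* second hard link: count goes to 2 */
        perror("link");
        return -1;
    }
    if (unlink(from) != 0) {       /* drop the old entry: count back to 1 */
        perror("unlink");
        return -1;
    }
    return 0;
}
If the destination already exists, link() fails; the real mv handles
that case, but the principle is the same.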
If you copy a sparse file (using cp or tar), it will
no longer be a sparse file! Copy the example file and
the copy will
use 10,000 bytes of disk to hold 9,999 NULs and one
final byte. Backup
utilities such as ufsdump work at a lower level and
do back up
and restore sparse files correctly. You can copy a directory
containing
sparse files like this:
ufsdump 0f - /directory | (cd /new/location; ufsrestore xvf -)
You must be root to use dump and restore
because they access the raw device. This enables them
to read and
write the actual blocks that make up a file. They can
also back up
named pipes and device special files.
Directories
Directories are fairly straightforward (unless you accidentally
use
vi on one), but here, too, there are some quirks. Have
you
ever copied a tree from one place to another and wondered
why the
size of the copy is a few blocks smaller than the original,
yet all
the file sizes are unchanged?
As you add entries to a directory it grows in size to
hold the entries,
but does not shrink again when entries are removed.
When you copy
a directory using cp or tar, a new directory is built,
without the unused blocks, so the directory takes less
space. (Again,
dump/restore makes an exact copy). You can see a directory
shrink by doing this:
cp -r /lost+found /var/tmp
You probably have to do this as root because lost+found
is usually not world-readable. Copy to /var/tmp because /tmp
already has a lost+found directory unless you are using tmpfs.
(On Solaris, /usr/tmp is a symbolic link to /var/tmp, which is why
the listing below shows /usr/tmp.)
The lost+found directory is always created with extra
room
for lots of entries so it does not have to be expanded
while fsck
is cleaning up a disk. When you copy lost+found, it
will shrink
to the minimum size for a directory, usually one block.
Here is ls
-ls before and after making a copy of the directory:
7 drwxr-xr-x 2 root 3232 Sun Feb 12 18:11 /lost+found
1 drwxr-xr-x 2 lisa 32 Sat Apr 1 16:46 /usr/tmp/lost+found
Links
You can think of a directory entry as being the head
pointer of the
linked list that is a file's data. When you make a "hard
link,"
e.g.,
ln file1 file2
you now see both file1 and file2 when
you do an ls -l, and they have identical entries except
for
the link count field, which is "2" for each
file:
-rw-r--r-- 2 lisa parent 509 Sat Mar 25 19:19 file1
-rw-r--r-- 2 lisa parent 509 Sat Mar 25 19:19 file2
The file has not been copied. In fact, the data portion
of the file has not been touched. What has happened,
essentially,
is that the directory entry has been cloned but with
a different file
name. Both directory entries point to the same data.
You can delete
either entry and the data is still there. If you delete
the entry
for file1 in this example, the effect is exactly the
same
as
mv file1 file2
As stated earlier, moving a file to a different filesystem
turns into a copy followed by removal of the original
file. A directory
entry contains an inode number, and inode numbers are unique only
within a filesystem.
A directory entry cannot be "cloned" to a
different filesystem;
there is no way for a directory entry to point to the
data portion
of a file on a different filesystem. So you cannot hard-link
a file
to a different filesystem. But you can make a soft or
symbolic link:
ln -s file1 file2
which looks like:
-rw------- 1 lees manager 70 Mar 25 19:31 file1
lrwxrwxrwx 1 lees manager 5 Mar 25 19:31 file2 -> file1
A soft link is not a cloned directory entry, but is
itself
a file, the content of which is a pathname. This pathname
may point
anywhere, or nowhere, or to itself! It has no direct
relation to the
data in the original file. If the original file is moved,
or removed,
the soft link remains and is now simply wrong.
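Because the link is itself a little file holding a pathname, you can
read that pathname without following it. A minimal sketch using
readlink() (the buffer size and output format are arbitrary):
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    char buf[1024];
    ssize_t n;

    if (argc != 2) {
        fprintf(stderr, "usage: %s symlink\n", argv[0]);
        return 1;
    }
    /* readlink() returns the stored pathname; it does not care
       whether that pathname currently points at anything. */
    n = readlink(argv[1], buf, sizeof(buf) - 1);
    if (n < 0) {
        perror("readlink");
        return 1;
    }
    buf[n] = '\0';
    printf("%s -> %s\n", argv[1], buf);
    return 0;
}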
Soft links use space outside a directory. No error checking
is done
when a soft link is created. Soft links require extra
disk accesses
to resolve. Nothing guarantees that a soft link will
remain valid.
Still, if you want to link two objects in different
filesystems, you
have no choice but to use a soft link.
Soft links are useful within a filesystem because they
preserve the
notion of an original file that is pointed to, or aliased.
If you
have installed SunOS binary compatibility, there is
a /dev
directory full of soft links, and links to links, such
as (ls -l /dev/rdsk/c0t3d0s0):
2 lrwxrwxrwx 1 root root 55 Mar 8 13:28 c0t3d0s0 ->
../../devices/sbus@1,f8000000/esp@0,800000/sd@3,0:a,raw
This could be a hard link, but then you could not tell
by looking at it what it is a link to. That is the great
drawback
to hard links. The link count field tells you there
are one or more
other hard links to the object, but gives no clue at
all as to what
they are.
Devices
A device special file ("device" for short)
is a filesystem
object created by the mknod command. When a device is
accessed,
control is given to a device driver. The device driver
knows how to
do the required operation (open, read, write,
ioctl, and so on) on the hardware associated with the
device,
if any. Devices are a way of providing a very similar
interface to
a wide range of physically very different objects.
The mknod command doesn't know anything about devices
or device
drivers. It creates a special file object of type character
(raw) or block (buffered), and gives it a name and a major
and
a minor number. The major number determines which device
driver is
invoked when the device file is accessed. The minor
number is just
handed to the device driver; it is used to choose the
unit, partition,
or whatever is appropriate. Here are the device special
files for
the c0t3d0s0 device:
0 brw-r----- 1 root sys 32, 24 Mar 8 12:29 sd@3,0:a
0 crw-r----- 1 root sys 32, 24 Mar 8 13:07 sd@3,0:a,raw
The "b" in the listing signifies the block
device, the "c" the character or raw device.
The major
number 32 is the SCSI driver. The minor number 24 identifies
the SCSI
unit and slice (partition). The pattern in minor numbers
should be
evident from a few more SCSI device files:
0 brw-r----- 1 root sys 32, 25 Mar 8 12:29 sd@3,0:b
0 brw-r----- 1 root sys 32, 26 Mar 8 12:29 sd@3,0:c
0 brw-r----- 1 root sys 32, 48 Mar 8 12:29 sd@6,0:a
0 crw-r----- 1 root sys 32, 48 Mar 8 12:29 sd@6,0:a,raw
Notice the minor number incrementing by one for each slice, with
eight minor numbers allotted to each SCSI target: target 3 starts
at 24 and target 6 at 48 (targets 4 and 5 are not present on this
controller). Under Solaris the minor numbers
may change when
you add a disk and do boot -r. This usually did not
happen
under SunOS, because the kernel config file defined
all the
devices that could reasonably exist and these device
files were created
whether or not the physical devices were present.
The device major number is an index into an operating
system table
of device drivers. Under SunOS you knew this number
because you had
to edit /sys/sun/conf.c and rebuild the kernel when
adding
a new device driver. Under Solaris the kernel is configured
at boot
time (though additional modules may be added via the
modload
command). The modinfo command tells you the major number
of
device drivers built into the operating system at boot
time or installed
with the modload command. These are a few of the loaded
modules
on my SPARCstation 2 running Solaris 2.4; the "Info"
field
is the major number:
Id Loadaddr Size Info Rev Module Name
12 ff137000 cb70 32 1 sd (SCSI Direct Access Disk Driver)
51 ff2d6000 7dfc 11 1 tmpfs (filesystem for tmpfs)
61 ff40f000 6fdd 36 1 fd (Floppy Driver)
71 ff654000 9968 33 1 st (SCSI Sequential Access Driver)
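You can pull the major and minor numbers of any device special file
out of its inode with a few lines of C. This is a sketch only; on
Solaris the major() and minor() macros come from <sys/mkdev.h>, while
other systems keep them in <sys/sysmacros.h>:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mkdev.h>   /* Solaris; elsewhere try <sys/sysmacros.h> */

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s device-file\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    /* For block and character special files, st_rdev holds the
       device number that encodes the major and minor numbers. */
    printf("%s: major %lu, minor %lu (%s device)\n",
           argv[1],
           (unsigned long) major(st.st_rdev),
           (unsigned long) minor(st.st_rdev),
           S_ISBLK(st.st_mode) ? "block" :
           S_ISCHR(st.st_mode) ? "character" : "not a");
    return 0;
}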
Sun some time ago started making physical devices smart
enough to identify themselves at boot time (the adapter
cards contain
FORTH routines in ROM that do this). When you boot -r
a tree
is built under /devices containing the device files
for all
the physical devices that identify themselves in this
manner.
If you are using "compatibility mode," a tree
of soft links
-- the links having the name of the SunOS device file
-- is
built under /dev to the Solaris device file. You can
use mknod
here, but the preferred approach is to make a soft link
to the appropriate
device special file in the /devices tree. If you forget
to
do boot -r the following set of commands does more or
less
the same thing:
/usr/sbin/drvconfig
/usr/sbin/devlinks
/usr/sbin/disks
/usr/sbin/ports
/usr/sbin/tapes
/usr/sbin/audlinks
/usr/ucb/ucblinks
Pipes
The mknod command can also make an object that is not
a device,
but a named pipe. We're all familiar with pipes, the
use of the "|"
character to connect the standard output of one process
to the standard
input of another process. A named pipe is similar, but
it is a permanent
object in the filesystem. The command
mknod my_fifo p
creates a named pipe named my_fifo. A named pipe
has ownership and permissions just like a file, so you
can restrict
its use with groups and file permissions. This named
pipe is read/write
for owner and for group parent:
prw-rw---- 1 lisa parent 0 Sun Mar 26 19:12 my_fifo
Named pipes have several advantages over "|"
pipes. The processes writing to and reading from the
named pipe do
not have to be related in any way. They can be run by
different users.
One can be foreground and one background. They can even
be on different
machines if the named pipe is an object in a mounted
filesystem! Root
privileges are not needed to create named pipes.
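Here is a minimal sketch of two unrelated processes talking through
my_fifo (the program name fifodemo and the message handling are
arbitrary). Run it with no arguments in one shell to read, and with
write and a message in another shell to write:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *path = "my_fifo";
    char buf[256];
    ssize_t n;
    int fd;

    /* Create the FIFO if it does not already exist
       (the same thing "mknod my_fifo p" does). */
    if (mkfifo(path, 0660) != 0 && errno != EEXIST) {
        perror("mkfifo");
        return 1;
    }

    if (argc >= 3 && strcmp(argv[1], "write") == 0) {
        fd = open(path, O_WRONLY);   /* blocks until a reader opens */
        if (fd < 0) { perror("open"); return 1; }
        write(fd, argv[2], strlen(argv[2]));
        write(fd, "\n", 1);
    } else {
        fd = open(path, O_RDONLY);   /* blocks until a writer opens */
        if (fd < 0) { perror("open"); return 1; }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, (size_t) n);
    }
    close(fd);
    return 0;
}
Because each side simply opens a name in the filesystem, the reader
and writer can belong to different users and need share nothing but
permission to open my_fifo.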
You will find named pipes in a number of places around
Solaris.
They are used in the Service Access Facility and in
the print spooling
system.
tmpfs Filesystems
Sun introduced the tmpfs filesystem quite some time
ago (in 4.1.something)
and it is now the default for /tmp when installing Solaris
2. This makes /tmp an in-memory filesystem, kind of
like a
RAM disk on a PC. What tmpfs actually does is put
/tmp in virtual memory, which means that files on /tmp
compete with programs for space in RAM and are paged
out to swap
space when memory is needed for something else.
Normally, all the data structures of a file (directory
entries, inodes,
data blocks) exist on disk. Many operations on a file
in a disk-based
filesystem require multiple accesses to disk to retrieve
or update
information in these data structures. In a tmpfs filesystem,
only file data is paged to disk. The associated information
always
stays in memory. So fewer, even zero, disk accesses
are needed to
use a file on a tmpfs filesystem.
But there are tradeoffs. Because the control information
is always
kept in memory, the number of files that can be created
is limited.
The limit is based on RAM size, not on the size of the
swap partition.
Heavily loaded systems with inadequate RAM may be pushed into
thrashing by the use of tmpfs filesystems. This is one
more
thing to be aware of in analyzing and tuning performance.
The best use of a tmpfs /tmp is for many small, short-lived
files, such as the intermediate and scratchpad files
written by compilers.
(You should set the TMPDIR environment variable to /tmp
to make certain that compilers use /tmp instead of /usr/tmp
for these files.) The worst use of a tmpfs /tmp is to
store
large files that are written once and read many times.
Use a disk-based
filesystem local to the workstation for these files.
One last caveat: Although the man page for tmpfs claims
that tmpfs "provides standard file operations and
semantics,"
the exercise of creating sparse files (Listing 1) shows
this is not
completely true. There are no holes in a file on a tmpfs
filesystem.
Also, the df command does not give useful information
for
a tmpfs filesystem.
Tricks
This is a collection of wisdom I have picked up
in the school of hard knocks.
Weird File Names
What do you do when one of your users manages to name
a file Ctrl-D
or embed a space in a filename? Once you realize that
something funny
is going on, you can use ls -b to show the problem filename,
with the control characters. Then it is a question of
figuring out
how to name the file to the rm command. Using single
quotes
solves many problems, such as:
rm 'a space'
rm 'back\slash'
You may be able to use wildcards:
rm a?space
rm back*
And if all else fails,
rm -ir dir
will go through all the files in a directory, even those
composed of completely untypeable names. (WARNING: Do
not, as root,
ever do something like
/bin/rm -rf .*
in a well-meaning attempt to remove all the files in
a directory. The wildcard ".*" includes "..",
which
is the parent directory of the one you are in. The syntax
"-r
.." means to remove everything in the parent directory.
I know
of a beginning sysadmin who did this (/bin/rm -rf .*)
to remove
a user directory and found to her horror that she was
removing all
user home directories on that filesystem! I believe
she has a job
doing something else now.)
Magic
How does the file command figure out the type of a file?
The secret
is in the file /etc/magic. This file defines file types
based
on the value of the initial bytes of a file. It is an
ASCII file,
so you can edit it and add new types or correct wrong
types (for example,
if the file command tells you a spreadsheet is a "pdp11/pre
System
V vax executable").
What's Going On Here?
How can you find out who is using a filesystem you need
to unmount?
With the fuser command
fuser -c /home/pixel/u40
The fuser command shows you the process ID of
all processes using the specified filesystem object.
You can then
use the ps command to find the users associated with
those
processes and ask them to do something else for a while
(see Figure 1).
You must be root to use the fuser command because
it needs
access to /dev/kmem.
A related command is truss. This command lets you trace
the
system calls made by a process (open, read, write,
and so on) and hence lets you find out what files the
process uses.
This command displays filenames as they are opened:
truss -t open command-string
and is even better than doing:
strings command | grep / | less
because you see filenames generated as the program runs,
and you get some feel for what the program is doing
as it accesses
the files. Any user can use the truss command, but it
will
not work with setuid or setgid programs, so it is
not a security hole. The man page for truss gives
detailed information on its options and examples of
its use. I find
truss particularly helpful in determining why a program
is
hanging.
Move It!
Finally, if you are not familiar with the dd command,
do cultivate an acquaintance. This is the way to move
blocks from
one device to another. It can do many manipulations
on what it is
moving, but I find it most useful for making exact copies
in situations
where cp cannot be used. For example,
rsh server dd if=/dev/nrst0 | cat > file.tar
reads a tar file from a tape on one machine and
puts it in a disk file on the current machine. The commands:
dd if=/dev/fd0 of=/tmp/floppy bs=36b count=80
(insert and format a new floppy)
dd if=/tmp/floppy of=/dev/fd0 bs=36b count=80
perform a sector-by-sector copy of a high density floppy
disk to a file, then copy it out to a floppy disk again.
This is a
way to make an exact copy of a floppy disk. (A high-density
floppy
disk has 18 512-byte sectors per track, 2 sides, and 80 cylinders,
so one cylinder is 36 blocks and the whole disk is 36 x 80 = 2,880
blocks, or 1,474,560 bytes; hence my choice of bs and count.
You may have to be
root to access the floppy drive, and details are system
dependent.
You still have to low-level format the floppy disk before
you write
to it. Use the fdformat command.)
About the Author
Lisa Lees has an M.S. in computer science and has worked
during
the past twenty years as a teacher, writer, programmer,
and system
administrator. Her love/hate relationship with UNIX
dates to 1985.
Lisa is a systems analyst with the Department of Computer
Science
and is the manager of the Pattern Recognition and Image
Processing
Lab at Michigan State University. She is a member of
ACM, AEGIS, CPSR,
and the USENIX System Administrators Guild. She can
be reached at:
lees@cps.msu.edu; http://www.cps.msu.edu/~lees.