Cover V04, I04
Article
Figure 1
Listing 1

jul95.tar


Unidentified File Objects

Lisa Lees

When the bard said, "There are more things in heaven and earth... than are dreamt of in your philosophy," he could have been speaking of objects in the UNIX filesystem. To be able to answer questions from your users and handle problems that arise during software installations, you, as system administrator, need to understand all the filesystem objects in the UNIX heavens. This article covers a collection of filesystem topics that I find most beginning (and some experienced) system administrators don't know about or understand well.

Two Sys Admin articles by Emmett Dulaney -- "Understanding Filesystems" (March/April 1995) and "When inodes Go Bad" (July/August 1994) -- explain the underlying structure of the UNIX filesystem. This underlying structure (superblocks and inodes) is generally invisible to the user. The present article takes a look at the unfamiliar, but visible, objects in the filesystem and at some unusual behaviors of familiar objects:

  • files
  • directories
  • links
  • named pipes
  • devices

    Although most of what I say applies to System V Release 4 in general and even to such oddball **ixes as the Coherent 4.2 system I run at home, some of the commands and details are specific to Sun Solaris 2.4. I encourage you to see how these commands and examples work on your system by trying them out as you read.

    Files, Directories, and Links

    The logical concept of a file under UNIX is a string of bytes pointed to by a directory entry. Most of the time this concept serves quite well, but occasionally files exhibit behavior that makes it clear this concept is an oversimplification.

    Sparse Files

    The data portion of a file is not actually one big string of bytes. If you open a file, position out to byte 9,999 (using something like the lseek system call) and write one byte, the resulting file does not use up 10,000 bytes of disk space. Only the data you write takes up disk space. If you write only one byte, the file is reported as the same size in blocks, regardless of the offset of that byte in the file. (Just how many blocks a one-byte file takes up depends on invisible things I won't go into. Try the program in Listing 1 and see what happens on your system.)

    When you read from a sparse file it looks as if the file truly has all those bytes. You get 9,999 NULs and the one byte you wrote. But only the one byte is actually out there on the disk. This is why system files like quotas and lastlog show different block and byte sizes in an ls -ls listing:

    320 -rw-r--r--  1 root  root  1026720 Mar 23 12:35 quotas
    176 -rw-r--r--  1 root  root   593320 Mar 24 19:39 lastlog

    These files contain entries indexed on UID. When an entry for, say, UID 15432 is made, the offset into the file is calculated and the entry is written at that offset. This results in a file full of "holes."

    If you move such a sparse file (using the mv command), it remains a sparse file. A move (within the same filesystem) is a change to directory entries only and does not involve the data portion of the file. Moving a file is the same as making a second hard link to it, then removing the original hard link. Moving a file to a different filesystem acts like a copy followed by removal of the original file (because of the way directories work -- see the section of this article on links.)

    If you copy a sparse file (using cp or tar), it will no longer be a sparse file! Copy the example file and the copy will use 10,000 bytes of disk to hold 9,999 NULs and one final byte. Backup utilities such as ufsdump work at a lower level and do backup and restore sparse files correctly. You can copy a directory containing sparse files like this:

    ufsdump 0f - /directory | (cd /new/location; ufsrestore xvf -)

    You must be root to use dump and restore because they access the raw device. This enables them to read and write the actual blocks that make up a file. They can also back up named pipes and device special files.

    Directories

    Directories are fairly straightforward (unless you accidentally use vi on one), but here, too, there are some quirks. Have you ever copied a tree from one place to another and wondered why the size of the copy is a few blocks smaller than the original, yet all the file sizes are unchanged?

    As you add entries to a directory it grows in size to hold the entries, but does not shrink again when entries are removed. When you copy a directory using cp or tar, a new directory is built, without the unused blocks, so the directory takes less space. (Again, dump/restore makes an exact copy). You can see a directory shrink by doing this:

    cp -r /lost+found /var/tmp

    You probably have to do this as root because lost+found is usually not world readable. Copy to /var/tmp because /tmp already has a lost+found directory if you are using tmpfs. The lost+found directory is always created with extra room for lots of entries so it does not have to be expanded while fsck is cleaning up a disk. When you copy lost+found, it will shrink to the minimum size for a directory, usually one block. Here is ls -ls before and after making a copy of the directory:

    7 drwxr-xr-x  2 root  3232 Sun Feb 12 18:11 /lost+found
    1 drwxr-xr-x  2 lisa    32 Sat Apr  1 16:46 /usr/tmp/lost+found

    Links

    You can think of a directory entry as being the head pointer of the linked list that is a file's data. When you make a "hard link," e.g.,

    ln file1 file2

    you now see both file1 and file2 when you do an ls -l, and they have identical entries except for the link count field, which is "2" for each file:

    -rw-r--r--  2 lisa  parent  509 Sat Mar 25 19:19 file1
    -rw-r--r--  2 lisa  parent  509 Sat Mar 25 19:19 file2

    The file has not been copied. In fact, the data portion of the file has not been touched. What has happened, essentially, is that the directory entry has been cloned but with a different file name. Both directory entries point to the same data. You can delete either entry and the data is still there. If you delete the entry for file1 in this example, the effect is exactly the same as

    mv file1 file2

    As stated earlier, moving a file to a different filesystem turns into a copy followed by removal of the original file. A directory entry contains inode numbers, which are unique only within a filesystem. A directory entry cannot be "cloned" to a different filesystem; there is no way for a directory entry to point to the data portion of a file on a different filesystem. So you cannot hard-link a file to a different filesystem. But you can make a soft or symbolic link:

    ln -s file1 file2

    which looks like:

    -rw-------   1 lees  manager  70 Mar 25 19:31 file1
    lrwxrwxrwx   1 lees  manager   5 Mar 25 19:31 file2 -> file1

    A soft link is not a cloned directory entry, but is itself a file, the content of which is a pathname. This pathname may point anywhere, or nowhere, or to itself! It has no direct relation to the data in the original file. If the original file is moved, or removed, the soft link remains and is now simply wrong.

    Soft links use space outside a directory. No error checking is done when a soft link is created. Soft links require extra disk accesses to resolve. Nothing guarantees that a soft link will remain valid. Still, if you want to link two objects in different filesystems, you have no choice but to use a soft link.

    Soft links are useful within a filesytem because they preserve the notion of an original file that is pointed to, or aliased. If you have installed SunOS binary compatibility, there is a /dev directory full of soft links, and links to links, such as (ls -l /dev/rdsk/c0t0d0s0):

    2 lrwxrwxrwx  1 root  root  55 Mar  8 13:28 c0t3d0s0 ->
    ../../devices/sbus@1,f8000000/esp@0,800000/sd@3,0:a,raw

    This could be a hard link, but then you could not tell by looking at it what it is a link to. That is the great drawback to hard links. The link count field tells you there are one or more other hard links to the object, but gives no clue at all as to what they are.

    Devices

    A device special file ("device" for short) is a filesystem object created by the mknod command. When a device is accessed, control is given to a device driver. The device driver knows how to do the required operation (open, read, write, ioctl, and so on) on the hardware associated with the device, if any. Devices are a way of providing a very similar interface to a wide range of physically very different objects.

    The mknod command doesn't know anything about devices or device drivers. It creates a special file object of type character (raw) or block (file), and gives it a name and a major and a minor number. The major number determines which device driver is invoked when the device file is accessed. The minor number is just handed to the device driver; it is used to choose the unit, partition, or whatever is appropriate. Here are the device special files for the c0t3d0s0 device:

    0 brw-r-----  1 root  sys  32, 24 Mar  8 12:29 sd@3,0:a
    0 crw-r-----  1 root  sys  32, 24 Mar  8 13:07 sd@3,0:a,raw

    The "b" in the listing signifies the block device, the "c" the character or raw device. The major number 32 is the SCSI driver. The minor number 24 identifies the SCSI unit and slice (partition). The pattern in minor numbers should be evident from a few more SCSI device files:

    0 brw-r-----  1 root  sys  32, 25 Mar  8 12:29 sd@3,0:b
    0 brw-r-----  1 root  sys  32, 26 Mar  8 12:29 sd@3,0:c
    
    0 brw-r-----  1 root  sys  32, 48 Mar  8 12:29 sd@6,0:a
    0 crw-r-----  1 root  sys  32, 48 Mar  8 12:29 sd@6,0:a,raw

    Notice the minor number incrementing for each new slice, then jumping by 24 for the next SCSI logical unit (next physical device on that controller). Under Solaris the minor numbers may change when you add a disk and do boot -r. This usually did not happen under SunOS, because the kernel config file defined all the devices that could reasonably exist and these device files were created whether or not the physical devices were present.

    The device major number is an index into an operating system table of device drivers. Under SunOS you knew this number because you had to edit /sys/sun/conf.c and rebuild the kernel when adding a new device driver. Under Solaris the kernel is configured at boot time (though additional modules may be added via the modload command). The modinfo command tells you the major number of device drivers built into the operating system at boot time or installed with the modload command. These are a few of the loaded modules on my SPARCstation 2 running Solaris 2.4; the "Info" field is the major number:

    Id Loadaddr  Size Info Rev Module Name
    12 ff137000  cb70  32   1  sd (SCSI Direct Access Disk Driver)
    51 ff2d6000  7dfc  11   1  tmpfs (filesystem for tmpfs)
    61 ff40f000  6fdd  36   1  fd (Floppy Driver)
    71 ff654000  9968  33   1  st (SCSI Sequential Access Driver)

    Sun some time ago started making physical devices smart enough to identify themselves at boot time (the adapter cards contain FORTH routines in ROM that do this). When you boot -r a tree is built under /devices containing the device files for all the physical devices that identify themselves in this manner.

    If you are using "compatibility mode," a tree of soft links -- the links having the name of the SunOS device file -- is built under /dev to the Solaris device file. You can use mknod here, but the preferred approach is to make a soft link to the appropriate device special file in the /devices tree. If you forget to do boot -r the following set of commands does more or less the same thing:

    /usr/sbin/drvconfig
    /usr/sbin/devlinks
    /usr/sbin/disks
    /usr/sbin/ports
    /usr/sbin/tapes
    /usr/sbin/audlinks
    /usr/ucb/ucblinks

    Pipes

    The mknod command can also make an object that is not a device, but a named pipe. We're all familiar with pipes, the use of the "|" character to connect the standard output of one process to the standard input of another process. A named pipe is similar, but it is a permanent object in the filesystem. The command

    mknod my_fifo p

    creates a named pipe named my_fifo. A named pipe has ownership and permissions just like a file, so you can restrict its use with groups and file permissions. This named pipe is read/write for owner and for group parent:

    prw-rw----  1 lisa  parent  0 Sun Mar 26 19:12 my_fifo

    Named pipes have several advantages over "|" pipes. The processes writing to and reading from the named pipe do not have to be related in any way. They can be run by different users. One can be foreground and one background. They can even be on different machines if the named pipe is an object in a mounted filesystem! Root privileges are not needed to create named pipes.

    You will find named pipes in a number of places around Solaris. They are used in the Service Access Facility and in the print spooling system.

    tmpfs Filesystems

    Sun introduced the tmpfs filesystem quite some time ago (in 4.1.something) and it is now the default for /tmp when installing Solaris 2. This makes /tmp an in-memory filesystem, kind of like a RAM disk on a PC. What tmpfs actually is doing is putting /tmp in virtual memory, which means that files on /tmp compete with programs for space in RAM and are paged out to swap space when memory is needed for something else.

    Normally, all the data structures of a file (directory entries, inodes, data blocks) exist on disk. Many operations on a file in a disk-based filesystem require multiple accesses to disk to retrieve or update information in these data structures. In a tmpfs filesystem, only file data is paged to disk. The associated information always stays in memory. So fewer, even zero, disk accesses are needed to use a file on a tmpfs filesystem.

    But there are tradeoffs. Because the control information is always kept in memory, the number of files that can be created is limited. The limit is based on RAM size, not on the size of the swap partition. Heavily loaded systems with inadequate RAM memory may be pushed into thrashing by the use of tmpfs filesytems. This is one more thing to be aware of in analyzing and tuning performance.

    The best use of a tmpfs /tmp is for many small, short-lived files, such as the intermediate and scratchpad files written by compilers. (You should set the TMPDIR environment variable to /tmp to make certain that compilers use /tmp instead of /usr/tmp for these files.) The worst use of a tmpfs /tmp is to store large files that are written once and read many times. Use a disk-based filesystem local to the workstation for these files.

    One last caveat: Although the man page for tmpfs claims that tmpfs "provides standard file operations and semantics," the exercise of creating sparse files (Listing 1) shows this is not completely true. There are no holes in a file on a tmpfs filesystem. Also, the df command does not give useful information for a tmpfs filesystem.

    Tricks

    This is a collection of wisdom I have picked up in the school of hard knocks.

    Weird File Names

    What do you do when one of your users manages to name a file Ctrl-D or embed a space in a filename? Once you realize that something funny is going on, you can use ls -b to show the problem filename, with the control characters. Then it is a question of figuring out how to name the file to the rm command. Using single quotes solves many problems, such as:

    rm 'a space'
    rm 'back\slash'

    You may be able to use wildcards:

    rm a?space
    rm back*

    And if all else fails,

    rm -ir dir

    will go through all the files in a directory, even those composed of completely untypeable names. (WARNING: Do not, as root, ever do something like

    /bin/rm -rf .*

    in a well-meaning attempt to remove all the files in a directory. The wildcard ".*" includes "..", which is the parent directory of the one you are in. The syntax "-r .." means to remove everything in the parent directory. I know of a beginning sysadmin who did this (/bin/rm -rf .*) to remove a user directory and found to her horror that she was removing all user home directories on that filesystem! I believe she has a job doing something else now.)

    Magic

    How does the file command figure out the type of a file? The secret is in the file /etc/magic. This file defines file types based on the value of the initial bytes of a file. It as an ASCII file, so you can edit it and add new types or correct wrong types (for example, if the file command tells you a spreadsheet is a "pdp11/pre System V vax executable").

    What's Going On Here?

    How can you find out who is using a filesystem you need to unmount? With the fuser command

    fuser -c /home/pixel/u40

    The fuser command shows you the process ID of all processes using the specified filesystem object. You can then use the ps command to find the users associated with those processes and ask them to do something else for a while (see Figure 1). You must be root to use the fuser command because it needs access to /dev/kmem.

    A related command is truss. This command lets you trace the system calls made by a process (open, read, write, and so on) and hence lets you find out what files the process uses. This command displays filenames as they are opened:

    truss -t open command-string

    and is even better than doing:

    strings command | grep / | less

    because you see filenames generated as the program runs, and you get some feel for what the program is doing as it accesses the files. Any user can use the truss command, but it will not work with setuid or setgid programs, so it is not a security hole. The man page for truss gives detailed information on its options and examples of its use. I find truss particularly helpful in determining why a program is hanging.

    Move It!

    Finally, if you are not familiar with the dd command, do cultivate an acquaintance. This is the way to move blocks from one device to another. It can do many manipulations on what it is moving, but I find it most useful for making exact copies in situations where cp cannot be used. For example,

    rsh server dd if=/dev/nrst0 | cat > file.tar

    reads a tar file from a tape on one machine and puts it in a disk file on the current machine. The commands:

    dd if=/dev/fd0 of=/tmp/floppy bs=36b count=80
    (insert and format a new floppy)
    dd if=/tmp/floppy of=/dev/fd0 bs=36b count=80

    perform a sector-by-sector copy of a high density floppy disk to a file, then copy it out to a floppy disk again. This is a way to make an exact copy of a floppy disk. (A high-density floppy disk has 18 512-byte blocks on a track, 80 tracks, 2 cylinders, hence my choice of bs and count. You may have to be root to access the floppy drive, and details are system dependent. You still have to low-level format the floppy disk before you write to it. Use the fdformat command.)

    About the Author

    Lisa Lees has an M.S. in computer science and has worked during the past twenty years as a teacher, writer, programmer, and system administrator. Her love/hate relationship with UNIX dates to 1985. Lisa is a systems analyst with the Department of Computer Science and is the manager of the Pattern Recognition and Image Processing Lab at Michigan State University. She is a member of ACM, AEGIS, CPSR, and the USENIX System Administrators Guild. She can be reached at: lees@cps.msu.edu; http://www.cps.msu.edu/~lees.


     



  •