Free Snapshots?
Peter Baer Galvin
Over the past few months, I've been covering new and useful
Solaris 8 features in the Solaris Companion. This month continues
the trend by looking at the new fssnap command, which provides
snapshot copies of the default UFS file system, much like commercial
file systems provide. But is it a winner like UFS logging and IP
multipathing? This month, I test fssnap, point out some useful
new reference services, and share feedback on a previous column.
Much Ado About Nothing
What a contrast there is between Sun and Microsoft. I recently
attended the Windows XP launch to see what the fuss was about. The
three-hour seminar by Microsoft employees conveyed these important
XP points:
- All previous operating systems by Microsoft had serious flaws,
including inefficient user interfaces and a lack of reliability.
- Look how cool our new screen savers are.
- Look at these games we include!
I thought about what the industry would do if Sun had a press
conference about a new release of Solaris and said similar things
about their operating system. In fact, Sun tends to do the opposite
of Microsoft publicity -- they have great engineers who busily
add lots of interesting features to Solaris, and then they leave
it up to the end users to find them, figure out how they work, and
figure out how to use them!
Snapshots, in Theory
On that note, my column this month explores another hidden Solaris
feature -- UFS snapshots. A snapshot can be thought of as a
cousin of RAID 1. Rather than performing a block-by-block copy of
a disk, and then performing all writes to both copies, a snapshot
takes a shortcut. The snapshot starts from an original disk (in
this case, actually a UFS partition) and instead of copying all
of the original blocks, it creates a copy of the metadata structures.
In essence, it has pointers to all the data blocks. Thus, a snap
is very fast to create.
The snapshot is placed within a file system or (theoretically)
on a raw device. The snapshot target is called the "backing
store". Changes to the snapped file system are then handled
specially. For every block (metadata or normal data) that is to
be written to the snapped file system, a copy of the original contents
is created and placed on the snapshot and then the write is allowed
to occur to the original file system. In this manner, the original
source file system is kept up to date and its snapshot copy has
the contents that the file system had when the snapshot occurred.
Why is this useful? In theory, there are many uses. Certainly
other products that include this snapshot feature (Network Appliance,
the Veritas File System) offer some great functionality.
- By having the snapshot copy mounted, a user could "cd
back in time" by accessing the snapshot copy rather than
the original file system. For example, if the system created a
snapshot of a file system every midnight, a user could look at
the system as it was yesterday, the day before he deleted
that important file. He could then copy the file from "yesterday"
to the current file system to restore it (see the sketch following this list).
- By allowing multiple snapshots of the same file system, it
is possible to have views of the file system as it was every day
in the past week, or every four hours in the past day, and so
on.
- By quiescing a database or other important application, taking
a snapshot, and continuing the application, a reliable, consistent
backup could be made of data that is constantly changing. The
snapshot copy is a view of the world at a time when the data was
on disk and not changing. The application can continue running,
users can continue their work, and the backup can be done without
interfering with their activities.
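To make the first of these concrete, here is a minimal sketch of recovering a file from a mounted snapshot. It assumes a snapshot of / already exists as /dev/fssnap/0 (created with the commands shown later in this column), and /important.conf is a hypothetical file to recover:
# mount -o ro /dev/fssnap/0 /mnt
# cp -p /mnt/important.conf /important.conf
# umount /mnt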
Because snapshots are fast and low overhead, they can be used
extensively without great concern for system performance or disk
use (although those aspects must also be considered).
Implementation
How does Sun's current implementation compare to the others
that are available? I tested UFS snapshots on a Solaris 8 7/01 release
Ultra 10, upgraded with the latest kernel jumbo patch and file system
patches. There is only one command for UFS snapshots, which certainly
keeps testing simple. fssnap performs UFS snapshots, provides
information about them, and manages and deletes them.
The basic command to create a snapshot is:
# fssnap -o backing-store=/snap /
/dev/fssnap/0
Here, backing-store is the file system on which to put the
snapshot, and the final argument, /, is the file system to snap.
The command returns a device name, which is an access point to the
snapshot file system. Of course, you can create multiple snapshots,
one per file system on the system:
# fssnap -o backing-store=/snap /opt
/dev/fssnap/1
The snapshot operation on a quiet file system took a few seconds.
The busier the file system, the longer the operation.
A snapshot can reside on any file system type, even NFS, which
allows you to snap to a remote server. Of course, the snap is only
useful when accessed from the original snapped server, where the
rest of the data blocks reside.
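As a sketch of the NFS case, the backing store could be a directory mounted from a remote server; the server name nfsserver and the paths here are hypothetical:
# mkdir -p /remote_snap
# mount nfsserver:/export/snapdir /remote_snap
# fssnap -o backing-store=/remote_snap /export/home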
Unfortunately, my testing revealed that an unmounted device cannot
currently be used as the backing store, contrary to the documentation.
There is a bug on sunsolve.sun.com against this problem,
so hopefully it will get solved with a patch or a future Solaris
release. There are several other errors in the documentation on
http://docs.sun.com, including incorrect arguments to several
commands. The examples in this column use the correct commands and
arguments.
Now we can check the status of a snapshot:
# fssnap -i /
Snapshot number : 0
Block Device : /dev/fssnap/0
Raw Device : /dev/rfssnap/0
Mount point : /
Device state : idle
Backing store path : /snap/snapshot0
Backing store size : 2624 KB
Maximum backing store size : Unlimited
Snapshot create time : Wed Oct 31 10:20:18 2001
Copy-on-write granularity : 32 KB
Note that there are several options on snapshot creation, including
limiting the maximum amount of disk space that the snap can take on
its backing store.
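For example, the maxsize option caps how much space the backing-store file may consume (the 500-MB figure here is purely illustrative):
# fssnap -o maxsize=500m,backing-store=/snap /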
From the system point of view, the snapshot looks a bit strange.
The disk use, at least at the initial snap, is minimal as would
be expected:
# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t0d0s0    4030518 1411914 2578299    36%    /
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
swap                  653232      16  653216     1%    /var/run
swap                  653904     688  653216     1%    /tmp
/dev/dsk/c0t1d0s7    5372014  262299 5055995     5%    /opt
/dev/dsk/c0t0d0s7    4211158 2463312 1705735    60%    /export/home
/dev/dsk/c0t1d0s0    1349190    3313 1291910     1%    /snap
However, an ls shows seemingly very large files:
# ls -l /snap
total 6624
drwx------ 2 root root 8192 Oct 31 10:19 lost+found
-rw------- 1 root other 4178771968 Oct 31 10:30 snapshot0
-rw------- 1 root other 5500942336 Oct 31 10:24 snapshot1
These files are "holey". Logically, they are the same size
as the snapped file system. As changes are made to the original, the
actual size of the snapshot grows as it holds the original versions
of each block. However, almost all of the blocks are empty at the
start, and so are left as "holes" in the file. The disk
use is thus only the metadata and blocks that have changed.
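A quick way to see this is to compare ls, which reports the logical file size, with du, which reports only the blocks actually allocated (assuming the backing store shown above):
# ls -l /snap/snapshot0
# du -k /snap/snapshot0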
The performance impact of a snapshot is that any write to a snapped
file system first has the original block copied to the snapshot, so
writes cost roughly twice what they do on a non-snapped file system.
This is similar to the overhead of RAID-1. Typically, RAID-1 writes
are done synchronously to both mirror devices. That is, the writes
must make it to both disks before the write operation is considered
to be complete. This extra overhead makes writes more expensive. It
is not clear whether fssnap copy-on-write operations are done
synchronously or asynchronously, although the former is likely.
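A rough way to observe the copy-on-write cost is to time the same write workload with and without an active snapshot. This is only a sketch; it assumes no snapshot of / currently exists, and the file size and paths are arbitrary:
# timex mkfile 100m /var/tmp/big ; rm /var/tmp/big
# fssnap -o backing-store=/snap /
# timex mkfile 100m /var/tmp/big ; rm /var/tmp/big
# fssnap -d /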
What can be done once a snapshot is created? Certainly a backup
can be made of the snapshot, solving the previously ever-present
"how to back up a live file system consistently" problem.
In fact, fssnap has built-in options to make it trivial to
use in conjunction with ufsdump:
# ufsdump 0ufN /dev/rmt/0 `fssnap -F ufs -o raw,bs=/snap,unlink /dev/rdsk/c0t0d0s0`
This command will snapshot the root partition, ufsdump it to
tape, and then unlink the snapshot so the snapshot file is removed
when the command finishes (or at least the file should be removed).
In testing, the unlink option does indeed unlink the snapshot
file, but the fssnap -d command is required to terminate the
use of the snapshot and actually free up the disk space. Thus, this
would be the full command:
# ufsdump 0ufN /dev/rmt/0 `fssnap -F ufs -o raw,bs=/snap,unlink /dev/rdsk/c0t0d0s0`
# fssnap -d /
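The tape written this way is an ordinary ufsdump archive, so it can be listed or restored with ufsrestore as usual, for example:
# ufsrestore tf /dev/rmt/0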
fssnap gets interesting when the snapshot itself is mounted
for access, as in:
# mount -o ro /dev/fssnap/0 /mnt
Now we can create a file in / and see that it does not appear
in /mnt:
# touch /foo
# ls -l /foo
-rw-r--r-- 1 root other 0 Nov 5 12:25 /foo
# ls -l /mnt/foo
/mnt/foo: No such file or directory
Unfortunately, there does not appear to be any method to promote the
snapshot to replace the current active file system. For example, if
a systems administrator was about to attempt something complicated,
such as a system upgrade, she could perform a snapshot first. If she
did not like the result, she could restore the system to the snapshot
version. (Of course, the "live upgrade" feature that is
just now rolling out as part of Solaris provides similar functionality.)
The backing store can be deleted manually once you are finished with the snapshot.
fssnap -d "deletes" the snap, but that is probably
the wrong terminology. Rather, it stops the use of the snapshot,
more like "detaching" it from the source file system.
To actually remove the snapshot, the snapshot file must also be
deleted via rm.
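Putting those two steps together, the full cleanup for the root snapshot created earlier would be:
# fssnap -d /
# rm /snap/snapshot0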
Alternatively, the unlink option can be specified when the
snap is created. This prevents a file system directory entry from
being made for the backing-store file. In essence, it is then like an open,
deleted file. Once the file is closed, the inode and its data are
automatically removed. Unlinked files are not visible in the file system
via ls and similar commands, making them harder to manage than normal
"linked" files.
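A short sketch of the unlink workflow: create the snapshot with the unlink option, confirm it with fssnap -i (ls will not show the backing-store file), and release the space with fssnap -d when done:
# fssnap -o unlink,backing-store=/snap /
# fssnap -i /
# fssnap -d /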
Apparently only one active snapshot of a file system can exist.
This limits the utility of UFS snapshots as a kind of safety
net for users or systems administrators. For instance, a snapshot
could be made once a night, but only one day's worth of old
data would then be available.
Another limitation is noted in the Sun documentation. If the backing-store
file runs out of space, the snapshot might delete itself, which
could cause backups and access to the snapshot to fail. Errors are
logged to /var/adm/messages, which should be checked for
possible snapshot errors.
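A simple way to check is to search the system log for fssnap-related entries, for example:
# grep fssnap /var/adm/messages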
Summary
On the whole, fssnap is a welcome addition to the UFS functionality.
Sun is obviously paying attention to its file systems and is adding
features to make UFS more competitive with commercial offerings.
There are a couple of limitations in the current implementation, however,
that restrict its usefulness mainly to creating consistent ufsdump backups.
I hope the functionality will increase over time.
Useful References
Book publishers are finally starting to take advantage of Internet
functionality to allow their books to be more widely used. Two examples
are Books 24X7 (http://www.books24X7.com) and O'Reilly's Safari
(http://safari.oreilly.com/). Books 24X7 provides unlimited access
to hundreds of books on line, with a sophisticated search engine,
live links from within books to Internet resources, and fully scanned
diagrams. Of course, it has a price to match those high-end features.
The Safari project, which includes O'Reilly, Addison Wesley,
New Riders, Prentice Hall, and other presses, provides access to
many books, but in a more limited fashion. It has a point system
in which you pay for a certain number of points per month, and that
gives you access to that many points worth of books in the month.
Every month, you can change which books you access with those points.
Both services are worth looking into if you enjoy ready access
to reference and "how-to" materials.
Letters
Thanks for the informative note from Boyd Fletcher:
Good article (Reliable Network with Solaris, November 2001:
http://www.samag.com/documents/s=1441/sam0111i/0111i.htm).
You are correct -- AP and IP Multipathing are mutually exclusive.
AP has been replaced as of Solaris 8 07/01 with MPXIO (IO multipathing)
for hard disks and IP multipathing for network cards. Both are easier
to configure, faster, and more reliable than AP. On the Serengeti
machines, AP is not available, so you have to use MPXIO and IPMP.
You might want to mention in a follow-up article that IPMP can wreak
havoc on your VLANs and may not even work if you are running with
port lockdowns based on MAC addresses.
Also, SAN StorEdge 3.0 at http://www.sun.com/storage/san/
is required for MPXIO. AP will still work with existing hardware,
just not on the SunFire line.
Boyd
Peter Baer Galvin is the Chief Technologist for Corporate Technologies,
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and previously
wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider. Peter
is coauthor of the Operating System Concepts and Applied
Operating System Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.