Cover V07, I12

Managing Documentation with RCS

Robert Kiesling

Although the Revision Control System (RCS) is commonly used to manage source code in software projects, the package is also useful in documentation projects, especially when managing documentation written by multiple authors through multiple revisions. Similarly, the file utilities tar and find, and the text processing utility awk, can be used singly or in conjunction with RCS to make distribution, archiving, and backup much easier. The project discussed in this article provides a good example of how RCS can be employed to manage conventional documents.

When I was asked to edit the anthology Linux: The Complete Reference, 6th Edition, the publishers had only the text of the previous editions, which had been written by approximately 30 authors. Some of the articles had been superseded, some were no longer being maintained by their authors (they were written by authors of the Linux Documentation Project, and some turnover was inevitable), and some were technically out of date. Additionally, the anthology was to include several book-length manuscripts, one of which was undergoing revision at the time. RCS was pressed into service to keep track of the revisions to this library of documents.

RCS, which was written by Walter F. Tichy and Paul Eggert, is not one program, but a suite of programs. It manages archives of revisions, checks documents in and out of the archives, displays the revision logs, and manages keyword searches. RCS is distributed by the Free Software Foundation (FSF) under the GNU General Public License (GPL). The rcsintro(1) manual page provides a quick introduction to the system.

Basic RCS Usage

The RCS suite of programs manages archives of text file revisions, tracks version releases, and controls access to the revisions. Within an archive, one revision is stored in full, and the others are stored as diff-style deltas that record the changes from one revision to the next. The ci program checks revisions into an archive; co checks versions out from an archive; rcs creates archives and manages their attributes; rlog prints the archives' revision logs; rcsmerge merges different sets of revisions into one; and rcsdiff is a front end for diff. The rcsintro(1) manual page provides an overview of the programs (see Figure 1).

The first step in organizing the huge amounts of text for the Linux book was to set up an RCS archive for each article. The individual archives incorporated the initial revision. In this case, the initial revisions were the text of the previous edition. By default, RCS stores its archives in a subdirectory, ./RCS, and this directory structure was maintained for ease of archiving and backup as described below.

Because only one person had access to the archive, non-strict locking was used. When many people have access to a set of documents, revision clashes can be prevented by using the RCS locking mechanism.

The shorter articles were written in SGML, and the book-length manuscripts were written in LaTeX 2e. Before editing, the SGML was translated into LaTeX. For every article that made it into the anthology, PostScript proofs of the edited article were provided to the author, and the author's corrections were incorporated into the draft that went to the printer. For each document, file.sgml, and its derived file.tex, an RCS archive was created and initialized with the commands:

rcs -i -U -tinitial.msg file.sgml
rcs -i -U -tinitial.msg file.tex

The file initial.msg contained a single line of text that identified the document as legacy text from the previous edition. The -U option sets locking to non-strict; by default, RCS archives use strict locking.

The archive file is created in the ./RCS subdirectory with the same name as the working file and a ",v" suffix. By default, read-only permissions are given to the archive, and only the person who created the archive can access it. For example:

-r--r--r--   1 kiesling users          91 Aug  3 10:54 file.sgml,v

If the -t command line option is not used, RCS prompts the user for a message.
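With several dozen documents, the archive creation step is easier to run as a loop. The following is a minimal sketch rather than the script used for the book: the function name init_archives and the wording of the description are made up for illustration, and the sketch assumes that the RCS programs are installed and that the documents use the .sgml and .tex suffixes.

```shell
#!/bin/sh
# Sketch: create a non-strict RCS archive for every SGML and LaTeX
# file in a directory. init_archives is a hypothetical helper name.

init_archives() {
    dir=$1
    (
        cd "$dir" || exit 1
        mkdir -p RCS                        # archives are kept in ./RCS
        # One-line description shared by all of the archives.
        echo "Legacy text from the previous edition." >initial.msg
        for f in *.sgml *.tex; do
            [ -f "$f" ] || continue         # the pattern matched nothing
            [ -f "RCS/$f,v" ] && continue   # the archive already exists
            rcs -q -i -U -tinitial.msg "$f" || exit 1
        done
    )
}
```

Calling init_archives with the name of the document directory (for example, "init_archives ." for the current directory) leaves one ",v" file in ./RCS for each document and skips any document that already has an archive.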

Before the working document has been checked in, the initial RCS archive entry, as printed by the rlog command, looks like this:

RCS file: RCS/file.sgml,v
Working file: file.sgml
head:
branch:
locks:
access list:
symbolic names:
keyword substitution: kv
total revisions: 0
description:
This is the initial revision.

If strict locking had been selected, the "locks:" entry would have read, "locks: strict".

At least one of the documents was edited by several people. When a document passed from one editor to another, it was necessary to control who could access the archive. RCS, by default, provides access to the archive's owner. Enabling other users to access the archive is done with the command:

rcs -asmith,jones file.sgml

The rlog output for the archive then looks like this:

RCS file: RCS/file.sgml,v
Working file: file.sgml
head:
branch:
locks:
access list:
smith
jones
symbolic names:
keyword substitution: kv
total revisions: 0
description:
This is the initial revision.

After an archive is created, the document itself must be checked in. This is done with the ci command. So, for each SGML and LaTeX file, the commands are simply:

ci -u file.sgml
ci -u file.tex

The -u flag on the command line maintains the archives in an unlocked state. If strict locking of a particular revision is desired, the -l flag can be used instead.

If we specify the -l flag, the archive entry looks like this.

RCS file: RCS/file.sgml,v
Working file: file.sgml
head: 1.1
branch:
locks: strict
access list:
smith
jones
symbolic names:
keyword substitution: kv
total revisions: 1;     selected revisions: 1
description:
This is the initial revision.
-----------------------------
revision 1.1
date: 1998/08/03 16:16:18;   author: kiesling;  state: Exp;
Initial revision
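A check-in can also be wrapped in a small shell function that reports the new head revision, the same value that rlog shows on its "head:" line. This is a sketch under stated assumptions, not the procedure used for the book: the name checkin_doc is hypothetical, and the function assumes that the archive uses non-strict locking and that it is run by the archive's owner, so that no lock is needed.

```shell
#!/bin/sh
# Sketch: check a changed working file into its archive and print the
# resulting head revision. checkin_doc is a hypothetical helper name.

checkin_doc() {
    file=$1
    msg=$2
    # Show the pending changes on stderr. rcsdiff exits with a nonzero
    # status when the files differ, so its status is ignored.
    rcsdiff -q "$file" >&2 || true
    # -u leaves an unlocked working copy in place after the check-in.
    ci -q -u -m"$msg" "$file" </dev/null || return 1
    # Report the head revision recorded in the archive.
    rlog -h "$file" | awk '/^head:/ { print $2 }'
}
```

A call such as checkin_doc file.sgml "Author's corrections" records the log message given as the second argument, so ci does not stop to prompt for one.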

The -u command line option also has the effect of checking in a working file and then checking it out again. By default the initial revision is 1.1, and each successive revision has its minor version number incremented by one (e.g., 1.2, 1.3, 1.4, and so on). A different revision number can be given as an argument to the ci command, as in:

ci -u2.1 file.sgml

This increments the revision number to 2.1 and results in the archive entry:

RCS file: RCS/file.sgml,v
Working file: file.sgml
head: 2.1
branch:
locks: strict
access list:
smith
jones
symbolic names:
keyword substitution: kv
total revisions: 2;     selected revisions: 2
description:
This is the initial revision.
-----------------------------
revision 2.1
date: 1998/08/03 16:23:49;  author: kiesling;  state: Exp;  lines: +1 -0
This revision has the major version number incremented.
-----------------------------
revision 1.1
date: 1998/08/03 16:16:18;  author: kiesling;  state: Exp;
Initial revision

Organizing and Searching by Key Word

RCS provides for key word searching in the text. The ident program can be used to search and categorize documents by key word. Although they were not used in the Linux: The Complete Reference project because of formatting considerations, key words can be included in RCS working files, where they are updated whenever the working file is checked out. The key words are delimited by dollar signs ("$"). Whenever a document is checked out of an archive, the date and time and the revision number are made current. The revision log can also be included.

For example, when a working file is checked out, the "$Id$" key word is expanded to the name of the archive file, the revision number, the date and time of the revision, the user ID of the author who checked it in, and the state of the revision. For a LaTeX file, if we create a one-line text file, called id.txt, with the string:

%% $Id$

it can be prepended to a LaTeX source file, as in the following example:

cat id.txt file.tex.orig >file.tex.out

Then the preamble of the LaTeX file might look like:

%% $Id$
\documentclass[12pt,twoside]{book}
\begin{document}

When the file is checked in to the RCS archive and checked out again, the $Id$ string is expanded, and the preamble looks like the following.

%% $Id: file.tex,v 2.2 1998/08/03 17:25:08 kiesling Exp $
\documentclass[12pt,twoside]{book}
\begin{document}

In TeX and LaTeX documents, dollar signs are used to delimit math mode input, so RCS strings used in this way should be commented out. The ident program prints the RCS key words that are embedded in a file.

> ident file.tex
$Id: file.tex,v 2.2 1998/08/03 17:25:08 kiesling Exp $
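For a directory full of LaTeX files, the step of prepending the commented key word can be scripted. The function below is a minimal sketch; the name add_id_line and the temporary file name are assumptions made for illustration. It skips any file that already contains an Id key word, so it can be run more than once.

```shell
#!/bin/sh
# Sketch: prepend a commented $Id$ line to a LaTeX file unless the
# file already contains one. add_id_line is a hypothetical helper.

add_id_line() {
    file=$1
    # Leave the file alone if an Id key word is already present.
    if grep -q '\$Id' "$file"; then
        return 0
    fi
    tmp=$file.tmp.$$
    # Single quotes keep the shell from expanding "$Id".
    { printf '%s\n' '%% $Id$'; cat "$file"; } >"$tmp" &&
        mv "$tmp" "$file"
}
```

A loop such as "for f in *.tex; do add_id_line "$f"; done" applies it to every LaTeX file in the current directory. The key word is expanded the next time each file is checked in and out.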

The co(1) manual page provides details on the RCS key words that can be embedded in documents. They include the path name of the archive file, the login name of the user who checked in the revision, the date and time the revision was checked in, and the revision log text. Checking out a document with co expands each key word where it appears in the document. For example, an expanded $Log$ entry might look like this.

%% $Log: file.tex,v $
%% Revision 1.3  1998/08/04 19:06:36  kiesling
%% Derived from the second draft of the author's SGML code.
%%

In an SGML file, the $Log$ key word (if it is enclosed in a comment) might look like this before expansion.

<!-- $Log$ >

After it is expanded, the revision log entry should appear as:

<!-- $Log: file.sgml,v $
<!-- Revision 1.2  1998/08/04 19:34:34  kiesling
<!-- Author's second revision.
<!-->

Unfortunately, SGML (at least using the LinuxDoc DTD) is particular about how comments are formatted. The proper form of a $Log$ entry, if it is to be enclosed by comments, is:

<!-- $Log: file.sgml,v $
Revision 1.2  1998/08/04 19:34:34  kiesling
Author's second revision.
-->

The following sed script will format the comment that contains the $Log$ key word correctly.

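#! /usr/bin/sed -f
# Strip the "<!-- " prefix from log lines, keeping the first character.
s/^<!-- \([^$]\)/\1/
# Turn the trailing "<!-->" into the closing "-->".
s/^<!-->/-->/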

If the script is placed in a file of its own (here called sgml-comment.sed) and given execute permissions, the standard output of the command, showing only the edited comment of the SGML file, will look like this.

> sgml-comment.sed <file.sgml
<!-- $Log: file.sgml,v $
Revision 1.2  1998/08/04 19:34:34  kiesling
Author's second revision.
-->
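The effect of the two substitutions can be checked without touching a real document. In this sketch, the first expression uses a back-reference so that the first character of each log line is kept. The function name fix_log_comment is made up for illustration, and the sample text is the expanded comment shown earlier.

```shell
#!/bin/sh
# Sketch: rewrite an expanded $Log$ comment into a single SGML comment.
# fix_log_comment is a hypothetical name; it filters stdin to stdout.

fix_log_comment() {
    # First expression: drop the "<!-- " prefix from the log lines, but
    # not from the line that carries the $Log key word itself.
    # Second expression: turn the trailing "<!-->" into "-->".
    sed -e 's/^<!-- \([^$]\)/\1/' -e 's/^<!-->/-->/'
}

fix_log_comment <<'EOF'
<!-- $Log: file.sgml,v $
<!-- Revision 1.2  1998/08/04 19:34:34  kiesling
<!-- Author's second revision.
<!-->
EOF
```

The line that holds the key word is left alone because the character after "<!-- " is a dollar sign, which the bracket expression excludes.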

Including RCS entries in source files, however, may not be desirable if the RCS data is to appear in the formatted output. This is because the output is dependent on the formatter, and results will be different if the output is post-processed by LaTeX or troff. Generally, RCS key words format well in HTML and plain ASCII text.

Archiving, Distribution, and Backup

For archival storage purposes, the default RCS directory structure, in which revisions are stored in a ./RCS subdirectory, means that no special measures are necessary when making backups. When distributing the documents to authors or outside peer reviewers, however, it is often not desirable to include the RCS archives in the distribution. Fortunately, tar provides an exclusion mechanism through its "X" flag. The flag indicates that the name of a file containing wildcard patterns of path names to exclude from the archive follows on the command line. Although tar's command line syntax is less than intuitive, file names generally follow the option switches in the order that the options are specified. For example, to make the file archive-file.tar.gz, the command is the following:

tar zcvfX archive-file.tar.gz exclude-file archive-directory

The file, exclude-file, is a plain text file that contains the single line:

*RCS*
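The exclusion step can be wrapped in a function that writes the pattern file itself. This is a sketch only: the name make_dist and the location of the temporary pattern file are assumptions, and the sketch relies on a tar that accepts the X flag as described above.

```shell
#!/bin/sh
# Sketch: build a distribution tarball of a document tree that leaves
# out the RCS archives. make_dist is a hypothetical helper name.

make_dist() {
    srcdir=$1
    tarball=$2
    exclude=${TMPDIR:-/tmp}/exclude.$$
    # One wildcard pattern per line; here, any path that contains RCS.
    echo '*RCS*' >"$exclude"
    tar zcfX "$tarball" "$exclude" "$srcdir"
    status=$?
    rm -f "$exclude"
    return $status
}
```

After "make_dist archive-directory archive-file.tar.gz", the command "tar ztf archive-file.tar.gz" lists the contents so that the result can be checked. Note that the pattern also excludes any ordinary file whose path happens to contain the letters RCS.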

Another file processing utility, find, can be used to clean up the auxiliary and temporary files generated during text processing in order to save space in the archive and to speed backup. Here is part of one shell script that was used for backup.

#!/bin/sh
find . -name "*~" -exec rm -f {} \; -print
find . -name "#*" -exec rm -f {} \; -print
find . -name "core" -exec rm -f {} \; -print
find . -name "*.dvi" -exec rm -f {} \; -print
find . -name "*.aux" -exec rm -f {} \; -print
find . -name "*.log" -exec rm -f {} \; -print
find . -name "*.out" -exec rm -f {} \; -print
find . -name "*.bak" -exec rm -f {} \; -print
find . -name "*.dj" -exec rm -f {} \; -print

Each find command searches the current directory and its subdirectories for files that match the argument to the -name search criterion. The command "-exec rm -f {}" executes "rm -f" on each match. The braces ("{}") expand to the name of the matching file. The escaped semicolon ("\;") marks the end of the command to be executed, and the -print action prints the name of each matching file to the standard output.
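The separate find commands can also be collapsed into a single pass over the directory tree. The version below is a sketch, not the script that was used for the book; the function name clean_tree is hypothetical, and the -type f test is an addition that restricts the matches to regular files.

```shell
#!/bin/sh
# Sketch: remove editor backups and formatter by-products in one pass.
# clean_tree is a hypothetical helper name.

clean_tree() {
    # -type f limits the match to regular files; -print lists each
    # file before -exec removes it.
    find "${1:-.}" -type f \( \
        -name '*~'    -o -name '#*'    -o -name 'core'  -o \
        -name '*.dvi' -o -name '*.aux' -o -name '*.log' -o \
        -name '*.out' -o -name '*.bak' -o -name '*.dj' \
    \) -print -exec rm -f {} \;
}
```

Called without an argument, clean_tree starts at the current directory. The parentheses group the -name tests joined by -o, so that -print and -exec apply to a match on any of the patterns.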

If we wanted to list the archived files belonging to the author, "Jones", the following command would be used:

find ./RCS -name "*,v" -exec awk '/Jones/ {found=1; exit} END {exit !found}' \{\} \; -print

The awk program exits with a zero status only if a line of the archive matches, so -print lists only the matching archives. The braces and semicolon are escaped with backslashes when the command is executed from the sh shell prompt. This prevents the shell from interpreting them itself.

Of course, the working file names do not have the ",v" suffix, and they are located in the current directory, not the ./RCS subdirectory. Again, sed comes to the rescue to edit the path names:

find ./RCS -name "*,v" -exec awk '/Jones/ {found=1; exit} END {exit !found}' \{\} \; -print | \
sed 's/,v//' |\
sed 's/.\/RCS\///'

The first sed command simply removes the ",v" suffix from the archive name. The second sed command removes the "./RCS/" prefix. The forward slashes in the path name must be escaped with backslashes, so they are not confused with the pattern delimiters.
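The search and the two path edits can be combined in one function. This is a sketch under stated assumptions: the name list_matching is made up, the search string is passed to awk as a variable rather than written into the program, and the two sed expressions are given to a single sed command.

```shell
#!/bin/sh
# Sketch: print the working-file names whose archives under ./RCS
# mention a given string. list_matching is a hypothetical name.

list_matching() {
    pattern=$1
    # awk exits with status 0 only if some line matched, so -print
    # fires only for the matching archives.
    find ./RCS -name '*,v' \
        -exec awk -v pat="$pattern" \
            '$0 ~ pat { found = 1; exit } END { exit !found }' {} \; \
        -print |
        sed -e 's/,v$//' -e 's/^\.\/RCS\///'
}
```

Run from the directory that holds the working files, "list_matching jones" prints one working file name per line.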

The files whose archives contain a string that is given on the command line can be archived with the following shell script.

#! /bin/sh
#
find ./RCS -name "*,v" -exec awk "/$1/{print ARGV[1]}" {} \; | \
sed 's/,v//' |\
sed 's/.\/RCS\///' |\
tar zcvfT $1.tar.gz -

If we name the shell script archive-string.sh and give it executable permission, it uses the following syntax:

archive-string.sh kiesling

to search for the archives that contain the name "kiesling". Taking the script line by line, the -name argument to find in the first line matches all archive files in the ./RCS subdirectory. The -exec statement:

awk "/$1/{print ARGV[1]}"

prints the names of only those archives that contain the string in the awk text pattern, rather than every file matched by find. Note that the $1 in the awk pattern is expanded by the shell to the script's first argument when the script is executed, while ARGV[1] is evaluated by awk and holds the file name supplied by find.

The two sed commands, as described above, edit the names of the matching archive files, so they match the working files which contain the author's name in the current directory.

The tar statement is straightforward. The T option specifies that tar take its list of input files from a file; in this case, the file is the standard input.
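Because the awk command prints the archive name once for every matching line, the same name can reach tar more than once. A small filter in front of tar removes the duplicates. The function below is a sketch; the name archive_from_list is hypothetical.

```shell
#!/bin/sh
# Sketch: archive a list of files read from standard input.
# archive_from_list is a hypothetical helper name.

archive_from_list() {
    tarball=$1
    # sort -u drops duplicate names, which the awk command can produce
    # when the string occurs on several lines of one archive.
    sort -u | tar zcfT "$tarball" -
}
```

For example, "printf '%s\n' file.sgml file.tex | archive_from_list kiesling.tar.gz" archives the two named files from the current directory.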

Conclusion

The task of cataloging authors' revisions, editing and formatting the documents, proofreading the output, archiving, and generating authors' proofs and final output is complex enough when only one document is involved. The complexity increases when multiple documents are involved, just as it does in software projects that have multiple source code modules. Utilities that automate the task can save hours of an editor's time. RCS and the text processing utilities are flexible enough that the techniques described above can be adapted easily to a project's specific requirements.

About the Author

Robert Kiesling is the editor of Linux: The Complete Reference, 6th Edition and a contributor to Linux Installation and Getting Started. He is also the maintainer of the Linux Frequently Asked Questions (FAQ) list. Comments should be directed to: kiesling@terracom.net.