Filter Library System

Dejan Muhamedagic

The significance of creating programs that can easily and naturally be made to cooperate cannot be stressed enough. This was neatly outlined at the 1983 ACM Turing Award presentation to Ken Thompson and Dennis Ritchie: "The genius of the UNIX system is its framework, which enables programmers to stand on the work of others." In part, the language of the award refers to the UNIX design philosophy of small programs that each perform one function well and can be combined into a pipeline. In general, programs that read their input on stdin and write their output to stdout are called filters.

My filter library system emerged as one way to handle the filter environment efficiently. I put it together out of sheer necessity, because my job at the time involved lots of text processing. Later, when faced with the problem of converting a database application consisting of tens of megabytes of code, I appreciated the value of this system even more. Now that so many different data formats have emerged, the filter environment is more complex, and it is the IT industry's job to make data available to ever-hungry users. One way to achieve a wider range of availability is data conversion.

The Data Conversion Process

The process of data conversion includes changing the data in question, catching errors, and keeping track of what actions have occurred. Some sort of verification is also necessary, but that is beyond this discussion.

To join the game, you must have tools that will do the conversion. If they come packaged in a monolithic application, then there is (usually) nothing else to worry about. On the other hand, if it is a UNIX-like filter or a homegrown set of programs, you may need a comfortable environment to handle the process. This is where my filter library system fits in.

First, my filter library system (Listing 1) provides a means to process files "in place." More often than not, UNIX filters are not equipped with such a function. A system administrator would need only a moment to type in a tiny script to handle the task, but users not so well acquainted with their shell would be faced with a big job. In some situations, such as converting many files spread across multiple directories, even the most experienced shell users would benefit more from a proper tool than from their skill.
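
The throwaway script in question is usually a loop of this kind (a minimal sketch; error handling and file names with embedded spaces are ignored):

for f in *.txt; do
    # filter each file into a temporary copy, then replace the original
    tr '[:upper:]' '[:lower:]' < "$f" > "$f.new" &&
    mv "$f.new" "$f"
done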

Second, if you change your mind or the filter has not done the job as expected, the filter library system can archive the files prior to processing them. In that case, an undo function is at your disposal to ease the way back. It also collects any errors that occur during the conversion in an error log and, finally, tracks the conversion steps in a log file.

This software also allows you to organize the filters, hence the "library" in its name. Filters can be organized on a system-wide basis as well as in privately owned libraries. A library may consist of one or more flat ASCII files, which implement a simple name-to-code mapping scheme. A functional diagram of the main blocks is shown in Figure 1. It would be easy to put the libraries in an NIS database, but although that would be a fancy feature, I felt the system was already quite usable and did not want to go to such great lengths.
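
A library file, then, might look like the following (the lower entry is discussed below; the other two entries are hypothetical illustrations of the same name-to-code format):

lower   cat @@ | tr '[:upper:]' '[:lower:]'
crlf    cat @@ | tr -d '\015'
detab   expand @@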

The Implementation

The filter library system (see Listing 1) is implemented as a single Korn shell script. Using a UNIX shell as a platform was a natural decision. Scripting in general is becoming more and more the method of choice for many programming tasks, replacing compiled languages. Even vendors of commercial flavors of UNIX have started bundling applications that ease system and network administration, written in various scripting languages ranging from shell to Tcl/Tk. Because these packages are primarily freeware, I expect they will have an increasing influence on the software industry.

It is interesting how little effort and code are needed to achieve some useful paradigms. The shell inside the shell was implemented with just a read command and a case construct. Parsing of both the command line and libraries is a bit naive but seemed to be enough for the task in question. Similarly, the library search path was almost too easy to conquer.
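
The core of that inner shell can be rendered in a few lines (a minimal sketch, assuming a do_command function that dispatches the parsed line; the name is hypothetical):

while print -n "filter> "; read cmd args; do
    case $cmd in
    "")        ;;                       # ignore empty input
    quit|exit) break ;;
    *)         do_command $cmd $args ;; # hypothetical dispatcher
    esac
done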

The data about to be changed is usually valuable (that is why it is being converted in the first place). Therefore, many security checks are designed into the script. One of the most important is data archival prior to a conversion step, and archives are always tested before any change takes place. Archives are stored on a per-user basis in a separate directory. These are standard compressed tar files named in a simple fashion: savennn.tgz, where nnn is a three-digit number. If a conversion step fails for any reason, the user can invoke the filter's undo function to retrieve the data from the last compressed archive. My design does not assume that the restore will be successful, so the archive is not removed after an undo; consequently, multilevel undo is not implemented. However, it can be approximated by removing the last archive manually after each undo.
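
The archive-then-verify step amounts to something like this (a minimal sketch; FLT_ARCDIR is a hypothetical name for the archive directory, and tar with gzip stand in for whatever archiver and compressor are configured):

# count existing savennn.tgz archives to pick the next number
n=$(ls "$FLT_ARCDIR" | grep -c '^save[0-9][0-9][0-9]\.tgz$')
arc=$FLT_ARCDIR/$(printf 'save%03d.tgz' $((n + 1)))
# archive the file list, then test the archive before any file is touched
tar cf - "$@" | gzip > "$arc" || exit 1
gzip -dc "$arc" | tar tf - > /dev/null || exit 1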

The Library

UNIX developers eventually acquire a multitude of useful scripts written in various languages. They are usually scattered all around the filesystem, and it is not always easy to remember the calling sequences or the task to which each of them is suited. The worst cases are the so-called one-liners, which never make it to a higher form (i.e., a file) but are doomed to be typed in repeatedly. Needless to say, as our minds cannot be networked, these scripts are reinvented many times. The library presented in this article provides a unified interface to this "tinyware" as well as to more complex programs.

Library entries are similar to Korn shell aliases. The example:

lower   cat @@ | tr '[:upper:]' '[:lower:]'

though simple, illustrates all aspects of the system presented here. Its function is to change the case of letters from upper to lower using the standard tr UNIX program; the @@ token marks the spot where the script substitutes the file being processed. If the user wants to apply it to all files in the current directory, the command line

$ filter lower *

or the equivalent

$ filter
filter> lower *

would do. The filter's user interface is not new and resembles that of the lpc or xauth program.

The very first action is to prepare the list of files, retaining only ordinary files with read and write permissions. After that, the script searches for the filter named lower in all libraries on the search path. Then everything on the list is archived and, finally, the files are processed. Activity records are appended to a file called filter.log and errors to filter.err. Users preparing their own conversion programs should make sure to print errors on stderr, as this is where the filter script will "wait" for them. Furthermore, care should be taken to return a meaningful exit status, because the script depends on it to report any failures.
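
The file-list check needs only a few lines (a minimal sketch, assuming the candidate names arrive as positional parameters):

list=
for f in "$@"; do
    if [[ -f $f && -r $f && -w $f ]]; then
        list="$list $f"                 # keep writable ordinary files
    else
        print -u2 "filter: skipping $f: not a writable ordinary file"
    fi
done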

The filter script can be used even without these libraries. For example, if the user wants to replace all occurrences of "Jones" with "Smith," the line

$ filter with sed 's/Jones/Smith/g' -- *

would suffice. The keyword with signals that literal shell code follows, and -- marks the beginning of the file specification.
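
Splitting such a command line is straightforward (a minimal sketch of the idea only; the actual parsing is in Listing 1):

code=
while [ $# -gt 0 ] && [ "$1" != -- ]; do
    code="$code $1"             # accumulate the literal shell code
    shift
done
[ "$1" = -- ] && shift          # what remains in "$@" is the file list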

This program does not know anything about directories. If a directory happens to be among the arguments, it will be rejected. However, this should not be a real limitation, because if the '-' character appears in place of any file, the script will fetch the list of files on standard input. Thus, you can use the powerful find program to provide the arguments:

$ find . -name '*.txt' | filter lower -

Unfortunately, it is not possible to use '-' at the filter prompt. Trying to parse shell code without proper tools like lex and yacc is error-prone, so I decided to leave that out.

Finally, when the list of files is deliberately left empty, the script will behave just like any other ordinary filter. This may be useful in testing new conversion programs or using the filter script with vi's ! command.
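
For example, with the lower entry from above on the search path, the script should act as an ordinary pipe stage:

$ print HELLO WORLD | filter lower
hello world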

The script can be controlled through shell environment variables that have reasonable default values. All of them have the FLT_ prefix and are handled with the filter env command. Further adjustments are possible by choosing different programs for archiving and compressing, but you must ensure that both programs accept stdin as input, as the tar and compress programs do. These are defined, along with the variables' default values, at the beginning of the script.
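
An override might then look like this (FLT_ZIP and FLT_UNZIP are hypothetical names used purely for illustration; the actual variable names and defaults are defined at the top of the script):

$ FLT_ZIP=gzip FLT_UNZIP="gzip -d" filter lower *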

Shell Recursion

One of the strong points of scripting languages is their ability to execute code prepared on the fly. In the Korn shell this is implemented with the eval built-in command, which I have exploited extensively. Coding the filter script was straightforward. The hard part was trying to make it as robust as possible.
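
At its simplest, the technique amounts to text substitution followed by eval (a minimal sketch; the variable names are hypothetical):

code="cat @@ | tr '[:upper:]' '[:lower:]'"   # a library entry's code
f=report.txt                                 # the file being processed
cmd=$(print -r -- "$code" | sed "s;@@;$f;")  # substitute the file name
eval "$cmd" > report.low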

Readers may find the code devoted to library handling particularly interesting. In an effort to make libraries more flexible, I provided a way to include files (see Listing 2). Perhaps the cpp preprocessor could have been used to implement such a feature, but I turned to the Korn shell again. The solution eventually ended in recursion and raised doubts about how /bin/ksh would behave. Luckily, the doubts turned out to be unfounded. The Digital UNIX implementation on an Alpha server went to the 27th level and then reported that too many files were open. Surprisingly, a PC running Linux sustained 200 levels, and there was no point in torturing it any further. It seems that the limit on the Alpha was imposed because that system is POSIX-compliant.
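
The recursive structure can be sketched as follows (a minimal sketch; readlib is a hypothetical name, and the include directive stands in for whatever syntax Listing 2 actually defines). Each nesting level keeps a file descriptor open, which is what eventually triggered the "too many files open" error:

readlib() {
    while read key rest; do
        case $key in
        include) readlib < "$rest" ;;       # recurse into the included file
        *)       print -r -- "$key $rest" ;;
        esac
    done
}
readlib < mainlib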

Conclusion

Data conversion is a sensitive business, and I took every possible step to catch errors and protect data. The Korn shell proved again to be a dependable and powerful tool, with one exception: the parsing of both the libraries and the arguments, which perhaps should have been entrusted to lex and yacc. Nevertheless, my experience did not justify the demand for separate scanning/parsing machinery. Further improvement is possible in the management of archive, error, and log files, which are handled manually at the moment. Also, less experienced users would benefit from a GUI, where a drag-and-drop paradigm could be used to process files. The Tcl/Tk package is readily available for such a task.

The filter script is of limited use unless equipped with proper filter libraries. Their richness depends on the programs and code that users possess. On the other hand, the possibility of including and sharing libraries may well make this system a comfortable and productive environment. I have used it successfully for four years and hope it will be useful to others as well.

About the Author

Dejan Muhamedagic has been involved in various UNIX systems and networking equipment maintenance and configuration for the past five years. He is currently with Yunix, a consulting company based in Belgrade. He can be reached at: muja@galeb.etf.bg.ac.yu.