Article

jun2001.tar

Fuzzy Text Searches with agrep and afind

Alex Golomshtok and Yefim Nodelman

In the past few years, the advents of storage and networking technology and the Internet have contributed to a dramatic increase in the amount of information accessible from the desktop. Users are now faced with the problem of information overload -- it can be difficult to locate the resources relevant to their needs. Under these conditions, the ability to effectively search through vast amounts of data becomes a primary requirement for any modern computing environment.

Most of the popular programming languages now are equipped with text-searching and pattern-matching facilities of varying degrees of sophistication. Most common are the "all-or-none" search facilities that incorporate either exact string matching or wildcard and regular expression matching algorithms. When the precise search string is known in advance, the exact string matching algorithms usually offer the best performance, although the speed of the search is proportional to the length of the search argument and the quality of the matching algorithm. For very short search patterns, for instance, with length of the pattern not exceeding four characters, the brute-force matching algorithms are usually the fastest. For longer search patterns, more sophisticated algorithms (e.g., Rabin-Karp Checksum [1] or Boyer-Moore [2] Algorithm) are known to provide superior performance.

Regular expressions [3] are often used when an exact search pattern is not known in advance. The power of regular expressions comes from the ability to express a search pattern in generic terms, potentially representing an infinite set of search arguments with a single expression. A pattern like "[A-Za-z]+", for instance, will match any string comprised by the upper and lower letters of the alphabet. In addition to expressing the patterns using generic "meta-characters", regular expression implementations often provide for anchors or ability to control the exact position of a match within the search text, and back-references that allow for easy text replacement and substitution. Regular expressions are the natural solution to many pattern-matching problems, such as looking for files and directories, monitoring and extracting the information from log files, filtering mail messages, etc.

Despite all their power, exact matching as well as regular expressions are usually an all-or-nothing proposition. These algorithms make it difficult, if not impossible, to account for typos, alternative spellings, spelling errors, and errors originated from data transmission. To solve these types of problems efficiently, you would have to employ an alternative technique known as approximate matching. Approximate string matching is fundamental to modern text processing and is heavily used by such computer applications as word processors or search and indexing engines.

Based on the needs of a particular application, the actual algorithm behind the approximate matching mechanism may vary. Phonetic algorithms such as soundex [4], for example, allow for matching strings that sound similarly. A typical soundex implementation is simply an indexing system that translates the strings into alphanumeric soundex codes by ignoring the vowels and assigning "weights" to the consonants. Since similar-sounding consonants are assigned the same weight (e.g., "B", "P", "F", and "V" are all assigned "1"), similar-sounding strings are likely to produce equal soundex codes. Soundex is a very popular technology, the most famous application of which was the initiative of the U.S. Bureau of the Census to create an index for individuals listed in the U.S. census records after 1880.

While being efficient and easy to implement, soundex is limited due to its phonetic nature; different natural languages require different implementations. The German soundex, for instance, usually uses six character codes as opposed to four-character English codes, and the consonants are grouped based on different rules.

Yet another family of string-matching algorithms, known as fuzzy matching algorithms, takes an entirely different approach of calculating various measures to assess the degree of proximity between strings. There are two major measures that reflect the degree of proximity -- k-mismatches and k-differences. The k-mismatches measure, also known as Hamming distance, expresses the difference between two strings by the number of characters that need to be changed to obtain one from the other, where "k" is the maximum number of differences allowed. For instance, "butter" and "ladder" have a Humming distance of four as four characters need to be replaced to obtain one from the other. The k-differences measure or Levenshtein distance [5] indicates how many "edits" (i.e., insertions, deletions, or substitutions) are necessary to transform one string into the other, where "k" is the maximum number of edits allowed. The Levenshtein distance between "GUMBO" and "GAMBOL", for example, is two because two edits are necessary.

There are numerous fuzzy matching algorithms that utilize k-mismatches (such as Baeza-Yates-Gonnet [6]), and k-differences (such as Wu-Manber [7]). These algorithms are efficient, versatile, and language independent, which makes them suitable for many text-processing tasks.

Perl and Fuzzy String Matching

Perhaps one of the most exciting features of Perl is its unsurpassed support for regular expressions that has been an intrinsic part of the language since its inception. One thing that Perl lacks, however, is built-in fuzzy or approximate text-matching facilities. This niche is filled with Perl extension modules, thanks to the enthusiasm of numerous Perl developers. Currently, the Comprehensive Perl Archive Network (CPAN) (http://www.perl.com/CPAN-local/CPAN.html) offers several modules that perform fuzzy string matching and calculate similarities between strings.

One of these modules, String::Approx by Jarkko Hietaniemi [8], allows for approximate string matching and substitutions using Wu-Manber k-differences algorithm. This module performs fuzzy matching based on a set of inputs that control such aspects of matching process as maximum degree of proximity, case sensitivity, and more.

Yet another extension, String::Similarity by Mark Lehmann [9], is designed to calculate the similarity index between two string arguments. This module is based on an O(ND) Difference Algorithm [10], which also calculates the k-differences measure for its input arguments.

These modules bring the benefits of fuzzy matching to Perl users while encapsulating the complexities of the underlying algorithms and exposing enough functionality to solve nearly any text searching and matching problem.

afind

To illustrate how approximate string matching may be applied to everyday computing tasks, we developed several useful utilities that extend the functionality of traditional UNIX programs such as find and grep. The afind script recursively descends the directory hierarchy searching for files, named such that the degree of similarity between the file name and a search string does not exceed a certain percentage. The similarity search limit percentage is controlled via a command-line parameter. The script outputs a list of files that match the criteria, sorted by similarity index (most similar first).

The script takes several command line options. Option -l specifies the similarity limit as a percentage -- a higher number will force the script to output closer matches. Option -h causes the script to print the usage string and exit immediately. Finally, option -v forces verbose output, which can be very useful for diagnostic purposes. The remaining two arguments are the name of the directory to use as a starting point for the search and a search pattern. The directory name can be omitted, in which case the current working directory will be used as a starting point.

The following is the complete source code listing for the afind script:

1  #!/usr/local/bin/perl
2
3  use Getopt::Std;
4  use File::Find;
5  use String::Similarity;
6   
7  $| = 1;
8   
9  $usage = qq(Usage: afind -lvh [directory] search_pattern\n);
10
11  die $usage
12    unless  getopts( qq(l:vh) );
13  die $usage
14     unless @ARGV;
15  die $usage
16     if $opt_h;
17
18  $limit   = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ? ( $opt_l/100.0 ) : 0.75;
19  $dir     = @ARGV > 1 ? shift : q(.);
20  $pattern = shift;
21
22  die qq(Directory $dir doesn't exist!\n)
23     unless -d $dir;
24
25  File::Find::find( \&wanted, $dir );
26
27  foreach $file ( sort { $b->{ sindex } <=> $a->{ sindex } } @files ) {
28     printf "%.2f\t%s\n",$file->{ sindex }, $file->{ name };
29  }
30  exit( 0 );
31
32  sub wanted {
33     my $sindex   = similarity( $pattern, $_, $limit );
34     print qq(Processing $File::Find::name ( $sindex )...\n)
35        if $opt_v && -t;
36     push( @files, { name => $File::Find::name, sindex => $sindex } )
37        if $sindex >= $limit;
38  }

Lines 3 through 5 load several Perl extension modules, used by the script Getopt::Std [11] that facilitates parsing the command-line arguments, File::Find [12] that provides the framework for traversing directory structures and, finally, String::Similarity [9] that provides the functionality for fuzzy pattern matching. Line 7 sets a special variable $| to a non-zero value, which turns off the IO buffering and forces the flush after every print statement on currently selected output channel. Line 9 sets up the variable that contains the usage string for afind; this string is output by lines 11 through 16 in case the script is invoked with wrong or insufficient command-line arguments. Lines 11 and 12 call the getopts routine from the Getopt::Std module in order to parse the command-line arguments. This routine takes a single argument -- a string that contains all allowed command-line options, in our case single character options l, v, and h. The switches that take arguments, such as l, are followed by colons. On completion, the getopts routine sets up variables for every command-line argument. For afind, three option variables will be set -- $opt_v, $opt_h, and $opt_l. Because $opt_v and $opt_h do not take arguments, they will be assigned a Boolean true or false value based on whether the command-line switches for these options are supplied. $opt_l, however, if set, will contain the value for the similarity limit parameter.

The getopts routine stops processing command-line arguments as soon as it encounters an argument that does not look like a command-line switch (i.e., does not have the dash as a first character). Therefore, after getopts returns, the @ARGV array is expected to contain at least one argument (the search string) because the starting directory is optional. Lines 13 and 14 validate this condition by checking the @ARGV array, which in scalar context evaluates to the number of remaining command-line arguments. Finally, lines 15 and 16 print the "usage" string and exit the program if the "help" option is set.

Lines 18 through 23 perform some rudimentary validity checking of input arguments. Line 18 ensures that the similarity limit value, if supplied, is a valid percentage (i.e., an integer between 0 and 100). If no valid value is supplied, the limit is set to a default value of 75% (0.75). Line 19 checks if the starting directory command-line parameter is present, otherwise the default starting point is set to the current working directory. Line 20 simply obtains the value of the command-line search pattern argument and lines 22 and 23 validate the starting directory.

Line 25 invokes the find routine from the File::Find module. This routine traverses the directory hierarchy recursively starting from the directory, identified by the second parameter, the $dir variable. For every directory entry encountered by File::Find, it invokes a user-supplied subroutine; reference to this subroutine is passed to File::Find as a first parameter.

This subroutine, called wanted (see lines 32 through 38), performs the bulk of the work. Every time it is called, File::Find sets the default input variable $_ to the base name of the directory entry being processed. The subroutine calls "similarity" function, which is part of the String::Similarity module, to calculate the similarity index between the search pattern, contained in the global variable $pattern and $_, which is set to the current file or directory name. The similarity index ($sindex), calculated by the similarity function, may take values between 0 and 1, where 1 means that two string arguments are identical. The third parameter to the similarity function ($limit) is optional; it sets the minimum similarity that the input string arguments must satisfy.

Based on this parameter, the similarity function stops analyzing its input arguments as soon as it determines that the result is likely to fall below the limit. In this case, the returned similarity index will be invalid but lower than the limit. Therefore, setting the limit may significantly boost the performance of the algorithm, especially when comparing long strings. Lines 34 and 35 of the wanted subroutine simply check whether the verbose output is requested and output the full name of the current file or directory, contained in the $File::Find::name variable, along with the calculated similarity index. This feature comes handy while debugging the script or checking the results of the similarity search.

Finally, lines 36 and 37 save the name and the similarity index of those files (and directories) that satisfy search criteria (i.e., have the similarity index that exceeds the search limit). File names and corresponding indexes are saved into an anonymous hash with two keys, name and sindex, and the reference to this hash is pushed into a global @files array.

Upon the completion of the File::Find::find function, the script iterates through the global @files array that contains one entry for every file or directory that satisfies the search criteria. The array is sorted in descending order by the similarity index using a user-defined anonymous sort subroutine [13]. Each iteration of the foreach loop at line 27 assigns the reference to a hash, containing the current file information to the variable $file. Then, line 28 outputs the full name of the file along with its similarity index. Running afind on the system with the following command:

afind /etc sesytem

produces the following output:

0.77 /etc/system

While changing the similarity limit from the default of 75% to 50%:

afind -l50 /etc sesytem

yields the following:

0.77 /etc/system
0.62 /etc/setmnt
0.57 /etc/lp/Systems
0.57 /etc/uucp/Systems

agrep The agrep script is an example of another application of an approximate string-matching technique to the well-known computing problems. This script works in a way similar to a traditional UNIX grep utility -- it searches input files for occurrences of a particular string, specified by a command-line argument.

The script requires at least two command-line arguments -- the search pattern and the name of the file to search for occurrences of the pattern. Multiple file names or filenames with wildcards may be supplied. Note that the agrep script relies on a shell to perform the wildcard expansion; if the wildcard arguments are quoted so that the expansion does not take place, the script will process these arguments as exact file names.

The behavior of the script and the amount of output can be controlled via three command-line switches: -l, which specifies the similarity search limit similar to the parameter of the afind script; -i, which causes the search algorithm to ignore the case of the letters while comparing; and -n, which directs the script to only output the names of the files that contain the search pattern instead of printing the actual matches (similar to the -l option of a traditional grep).

The following is a complete source code for the agrep script:

1  #!/usr/local/bin/perl
2
3  use Getopt::Std;
4  use String::Approx 'amatch';
5   
6  $| = 1;
7   
8  $usage = qq(Usage: agrep -linh search_pattern [files...]\n);
9   
10  die $usage
11    unless  getopts( qq(l:inh) );
12  die $usage
13     unless @ARGV;
14  die $usage
15     if $opt_h;
16
17  $limit   = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ?
18     q($opt_l%) : q(15%);
19  $pattern = shift;
20
21  foreach $file ( @ARGV ) {
22     next if -B $file;
23     open( _F, $file ) || do {
24        print STDERR qq(Can't open file $file - $!.\n);
25        next;
26     };
27     @lines = <_F>;
28     foreach $match ( amatch( $pattern,
29        [ $limit, $opt_i && qq(i) ], @lines ) ) {
30        print qq($file: ), $opt_n ? qq(\n) : $match;
31        last

32           if $opt_n;
33     }
34     close( _F );
35  }

Lines 3 through 15 are very similar to the corresponding code of the afind script with one exception -- instead of loading the String::Similarity [10] module, the agrep uses another Perl extension for approximate string matching, String::Approx [9]. This module exposes numerous functions, however, the agrep script will only use one of them -- amatch.

The similarity search limit parameter, taken by the String::Aprox module, is different from the corresponding parameter of the String::Similarity module. It is expressed either as an absolute number of allowed "edits" (i.e., k-differences measure) or a percentage value of the number of allowed "edits" relative to the length of the string (i.e., 10% means that every 10th character may be an "edit"). For the sake of consistency, agrep uses the percentage values to control the similarity limit with the default set at 15%.

Line 21 of the script iterates over the @ARGV array that contains the names of the files to be searched for the occurrences of the pattern. Every iteration of the loop assigns the name of the file from the @ARGV to the variable $file. Line 22 skips to the next iteration of the loop if the current file is a binary; for the sake of simplicity we excluded the processing for binary files from the script. Lines 23 through 27 open the current file for read access and load the file data into a global array @lines. Finally, the foreach loop at line 28 invokes the amatch function of the String::Approx module, collects the matches one-by-one into the $match variable, and prints them out.

The amatch function is the simplest of the functionality, exposed by the String::Approx. It takes a search pattern as its first parameter, then a number of optional modifiers and an optional array of input strings. The third parameter, the array of inputs, may be omitted, in which case, the amatch will assume that the default input $_ contains the string to be matched. In list context, this function returns all input lines that match the pattern; in scalar context, it simply returns a true or false value based on whether any of the inputs match the pattern.

The behavior of amatch can be controlled via a number of optional modifiers, passed as a second parameter. One of the modifiers is the similarity limit, discussed previously. As mentioned before, the limit here is a total number of "edits" -- insertions, deletions, and substitutions allowed in the search. This, however, can be controlled at a much finer level by supplying explicit numbers for each type of "edit". For instance, "I10%", "D20%", and "S5%" will allow for 10% of insertions, 20% of deletions, and 5% of substitutions, respectively. Another modifier, used by agrep, is the case-sensitivity modifier i, which turns off case-sensitive matching. Other modifiers are available. For instance, the exact positions within the input argument where the search should begin and end can be controlled via initial_position and final_position modifiers. Running agrep on the system:

agrep sysshm /etc/system

produces the following output:

/etc/system: *                exclude: sys/shmsys
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/shmsys

Bumping up the similarity limit from the default 15% to 25%:

agrep -l25 sysshm /etc/system

will yield the following:

/etc/system: *ident   "@(#)system    1.15    92/11/14 SMI" /* SV
/etc/system: * root device and root filesystem configuratio
/etc/system: *    the root filesystem) rather than at first
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/msgsys
/etc/system: forceload: sys/shmsys
/etc/system: forceload: sys/semsys
...

The output of agrep is not sorted by relevance or similarity index simply because the amatch function does not return the calculated edit distance. However, another function of the String::Approx module, adist, is specifically designed to calculate and return the edit distance between the input string and the pattern. Also, function aslice may be directed to return the calculated edit distance via optional modifier minimal_distance. This function is especially useful as it provides the starting position and the length of the match.

Summary

Approximate string matching is an extremely powerful technique, quickly gaining popularity within the modern software development community. More and more often we see the new software products, reaping the benefits of those fuzzy searching algorithms, which up until a few years ago were thought of as a subject for a Ph.D. in Computer Science rather than something that had any practical application. Languages like Perl make these algorithms easy to use, which leaves no excuse for not exploring the power of the approximate text matching capabilities.

We hope that this article stimulated your interest, and we encourage you to become familiar with Perl text-processing extension modules. Apart from the two modules discussed here, CPAN offers some more: stemming and inflection modules, such as Text::Stem and Lingua::EN::Inflect, which may be very useful for spell checking applications; and Text::Soundex, which encapsulates the soundex algorithm implementation and others. Altogether, these modules will enable you to produce powerful and user-friendly software applications.

References

1. R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31:249-260, 1987.

2. R. Boyer and J. Moore. A fast string-searching algorithm. Communications of the ACM, 20:762-772, 1977.

3. J. Friedl and A. Oram. Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly & Associates. 1997.

4. D. Knuth. The Art of Computer Programming: Sorting and Searching, Vol. 3. Addison-Wesley, 1998.

5. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl., 6:707-710, 1966.

6. G. Gonnet and R. Baeza-Yates. Handbook of Algorithms and Data Structures. Addison-Wesley, 1991.

7. S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35:83-91, 1992.

8. J. Hietaniemi. String::Approx, Latest Version 3.15, Feb, 2000. CPAN directory JHI.

9. M. Lehmann. String::Similarity, Latest Version 0.02, Nov 2000. CPAN directory MLEHMANN.

10. E. Myers. An O(ND) Difference Algorithm and its Variations. Algorithmica, Vol 1. No. 2. 251-266, 1986.

11. G.Sarathy. Getopt::Std, Latest Version - Perl 5.6.1, Jan 2001. CPAN directory GSAR.

12. G.Sarathy. File::Find, Latest Version - Perl 5.6.1, Nov 2000. CPAN directory GSAR.

13. L. Wall, T. Christiansen, R. L. Schwartz. Programming Perl. O'Reilly & Associates, 1996.

Alexander Golomshtok is a professional consultant who, for the last decade, has been hanging around downtown New York developing large-scale software systems and infrastructure solutions for Wall Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.

Yefim Nodelman is a seasoned systems administrator with seven years of professional experience in supporting UNIX and Windows NT ystems. He can be reached at: nodelman_yefim@jpmorgan.com.