Fuzzy
Text Searches with agrep and afind
Alex Golomshtok and Yefim Nodelman
In the past few years, the advents of storage and networking technology
and the Internet have contributed to a dramatic increase in the
amount of information accessible from the desktop. Users are now
faced with the problem of information overload -- it can be
difficult to locate the resources relevant to their needs. Under
these conditions, the ability to effectively search through vast
amounts of data becomes a primary requirement for any modern computing
environment.
Most of the popular programming languages now are equipped with
text-searching and pattern-matching facilities of varying degrees
of sophistication. Most common are the "all-or-none" search
facilities that incorporate either exact string matching or wildcard
and regular expression matching algorithms. When the precise search
string is known in advance, the exact string matching algorithms
usually offer the best performance, although the speed of the search
is proportional to the length of the search argument and the quality
of the matching algorithm. For very short search patterns, for instance,
with length of the pattern not exceeding four characters, the brute-force
matching algorithms are usually the fastest. For longer search patterns,
more sophisticated algorithms (e.g., Rabin-Karp Checksum [1] or
Boyer-Moore [2] Algorithm) are known to provide superior performance.
Regular expressions [3] are often used when an exact search pattern
is not known in advance. The power of regular expressions comes
from the ability to express a search pattern in generic terms, potentially
representing an infinite set of search arguments with a single expression.
A pattern like "[A-Za-z]+", for instance, will match any
string comprised by the upper and lower letters of the alphabet.
In addition to expressing the patterns using generic "meta-characters",
regular expression implementations often provide for anchors or
ability to control the exact position of a match within the search
text, and back-references that allow for easy text replacement and
substitution. Regular expressions are the natural solution to many
pattern-matching problems, such as looking for files and directories,
monitoring and extracting the information from log files, filtering
mail messages, etc.
Despite all their power, exact matching as well as regular expressions
are usually an all-or-nothing proposition. These algorithms make
it difficult, if not impossible, to account for typos, alternative
spellings, spelling errors, and errors originated from data transmission.
To solve these types of problems efficiently, you would have to
employ an alternative technique known as approximate matching. Approximate
string matching is fundamental to modern text processing and is
heavily used by such computer applications as word processors or
search and indexing engines.
Based on the needs of a particular application, the actual algorithm
behind the approximate matching mechanism may vary. Phonetic algorithms
such as soundex [4], for example, allow for matching strings that
sound similarly. A typical soundex implementation is simply an indexing
system that translates the strings into alphanumeric soundex codes
by ignoring the vowels and assigning "weights" to the
consonants. Since similar-sounding consonants are assigned the same
weight (e.g., "B", "P", "F", and "V"
are all assigned "1"), similar-sounding strings are likely
to produce equal soundex codes. Soundex is a very popular technology,
the most famous application of which was the initiative of the U.S.
Bureau of the Census to create an index for individuals listed in
the U.S. census records after 1880.
While being efficient and easy to implement, soundex is limited
due to its phonetic nature; different natural languages require
different implementations. The German soundex, for instance, usually
uses six character codes as opposed to four-character English codes,
and the consonants are grouped based on different rules.
Yet another family of string-matching algorithms, known as fuzzy
matching algorithms, takes an entirely different approach of calculating
various measures to assess the degree of proximity between strings.
There are two major measures that reflect the degree of proximity
-- k-mismatches and k-differences. The k-mismatches measure,
also known as Hamming distance, expresses the difference between
two strings by the number of characters that need to be changed
to obtain one from the other, where "k" is the maximum
number of differences allowed. For instance, "butter"
and "ladder" have a Humming distance of four as four characters
need to be replaced to obtain one from the other. The k-differences
measure or Levenshtein distance [5] indicates how many "edits"
(i.e., insertions, deletions, or substitutions) are necessary to
transform one string into the other, where "k" is the
maximum number of edits allowed. The Levenshtein distance between
"GUMBO" and "GAMBOL", for example, is two because
two edits are necessary.
There are numerous fuzzy matching algorithms that utilize k-mismatches
(such as Baeza-Yates-Gonnet [6]), and k-differences (such as Wu-Manber
[7]). These algorithms are efficient, versatile, and language independent,
which makes them suitable for many text-processing tasks.
Perl and Fuzzy String Matching
Perhaps one of the most exciting features of Perl is its unsurpassed
support for regular expressions that has been an intrinsic part
of the language since its inception. One thing that Perl lacks,
however, is built-in fuzzy or approximate text-matching facilities.
This niche is filled with Perl extension modules, thanks to the
enthusiasm of numerous Perl developers. Currently, the Comprehensive
Perl Archive Network (CPAN) (http://www.perl.com/CPAN-local/CPAN.html)
offers several modules that perform fuzzy string matching and calculate
similarities between strings.
One of these modules, String::Approx by Jarkko Hietaniemi
[8], allows for approximate string matching and substitutions using
Wu-Manber k-differences algorithm. This module performs fuzzy matching
based on a set of inputs that control such aspects of matching process
as maximum degree of proximity, case sensitivity, and more.
Yet another extension, String::Similarity by Mark Lehmann
[9], is designed to calculate the similarity index between two string
arguments. This module is based on an O(ND) Difference Algorithm
[10], which also calculates the k-differences measure for its input
arguments.
These modules bring the benefits of fuzzy matching to Perl users
while encapsulating the complexities of the underlying algorithms
and exposing enough functionality to solve nearly any text searching
and matching problem.
afind
To illustrate how approximate string matching may be applied to
everyday computing tasks, we developed several useful utilities
that extend the functionality of traditional UNIX programs such
as find and grep. The afind script recursively
descends the directory hierarchy searching for files, named such
that the degree of similarity between the file name and a search
string does not exceed a certain percentage. The similarity search
limit percentage is controlled via a command-line parameter. The
script outputs a list of files that match the criteria, sorted by
similarity index (most similar first).
The script takes several command line options. Option -l
specifies the similarity limit as a percentage -- a higher number
will force the script to output closer matches. Option -h
causes the script to print the usage string and exit immediately.
Finally, option -v forces verbose output, which can be very
useful for diagnostic purposes. The remaining two arguments are
the name of the directory to use as a starting point for the search
and a search pattern. The directory name can be omitted, in which
case the current working directory will be used as a starting point.
The following is the complete source code listing for the afind
script:
1 #!/usr/local/bin/perl
2
3 use Getopt::Std;
4 use File::Find;
5 use String::Similarity;
6
7 $| = 1;
8
9 $usage = qq(Usage: afind -lvh [directory] search_pattern\n);
10
11 die $usage
12 unless getopts( qq(l:vh) );
13 die $usage
14 unless @ARGV;
15 die $usage
16 if $opt_h;
17
18 $limit = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ? ( $opt_l/100.0 ) : 0.75;
19 $dir = @ARGV > 1 ? shift : q(.);
20 $pattern = shift;
21
22 die qq(Directory $dir doesn't exist!\n)
23 unless -d $dir;
24
25 File::Find::find( \&wanted, $dir );
26
27 foreach $file ( sort { $b->{ sindex } <=> $a->{ sindex } } @files ) {
28 printf "%.2f\t%s\n",$file->{ sindex }, $file->{ name };
29 }
30 exit( 0 );
31
32 sub wanted {
33 my $sindex = similarity( $pattern, $_, $limit );
34 print qq(Processing $File::Find::name ( $sindex )...\n)
35 if $opt_v && -t;
36 push( @files, { name => $File::Find::name, sindex => $sindex } )
37 if $sindex >= $limit;
38 }
Lines 3 through 5 load several Perl extension modules, used by the
script Getopt::Std [11] that facilitates parsing the command-line
arguments, File::Find [12] that provides the framework for
traversing directory structures and, finally, String::Similarity
[9] that provides the functionality for fuzzy pattern matching. Line
7 sets a special variable $| to a non-zero value, which turns
off the IO buffering and forces the flush after every print
statement on currently selected output channel. Line 9 sets up the
variable that contains the usage string for afind; this string
is output by lines 11 through 16 in case the script is invoked with
wrong or insufficient command-line arguments. Lines 11 and 12 call
the getopts routine from the Getopt::Std module in order
to parse the command-line arguments. This routine takes a single argument
-- a string that contains all allowed command-line options, in
our case single character options l, v, and h.
The switches that take arguments, such as l, are followed by
colons. On completion, the getopts routine sets up variables
for every command-line argument. For afind, three option variables
will be set -- $opt_v, $opt_h, and $opt_l.
Because $opt_v and $opt_h do not take arguments, they
will be assigned a Boolean true or false value based on whether the
command-line switches for these options are supplied. $opt_l,
however, if set, will contain the value for the similarity limit parameter.
The getopts routine stops processing command-line arguments
as soon as it encounters an argument that does not look like a command-line
switch (i.e., does not have the dash as a first character). Therefore,
after getopts returns, the @ARGV array is expected
to contain at least one argument (the search string) because the
starting directory is optional. Lines 13 and 14 validate this condition
by checking the @ARGV array, which in scalar context evaluates
to the number of remaining command-line arguments. Finally, lines
15 and 16 print the "usage" string and exit the program
if the "help" option is set.
Lines 18 through 23 perform some rudimentary validity checking
of input arguments. Line 18 ensures that the similarity limit value,
if supplied, is a valid percentage (i.e., an integer between 0 and
100). If no valid value is supplied, the limit is set to a default
value of 75% (0.75). Line 19 checks if the starting directory command-line
parameter is present, otherwise the default starting point is set
to the current working directory. Line 20 simply obtains the value
of the command-line search pattern argument and lines 22 and 23
validate the starting directory.
Line 25 invokes the find routine from the File::Find module.
This routine traverses the directory hierarchy recursively starting
from the directory, identified by the second parameter, the $dir
variable. For every directory entry encountered by File::Find,
it invokes a user-supplied subroutine; reference to this subroutine
is passed to File::Find as a first parameter.
This subroutine, called wanted (see lines 32 through 38),
performs the bulk of the work. Every time it is called, File::Find
sets the default input variable $_ to the base name of the
directory entry being processed. The subroutine calls "similarity"
function, which is part of the String::Similarity module,
to calculate the similarity index between the search pattern, contained
in the global variable $pattern and $_, which is set
to the current file or directory name. The similarity index ($sindex),
calculated by the similarity function, may take values between 0
and 1, where 1 means that two string arguments are identical. The
third parameter to the similarity function ($limit)
is optional; it sets the minimum similarity that the input string
arguments must satisfy.
Based on this parameter, the similarity function stops
analyzing its input arguments as soon as it determines that the
result is likely to fall below the limit. In this case, the returned
similarity index will be invalid but lower than the limit. Therefore,
setting the limit may significantly boost the performance of the
algorithm, especially when comparing long strings. Lines 34 and
35 of the wanted subroutine simply check whether the verbose
output is requested and output the full name of the current file
or directory, contained in the $File::Find::name variable,
along with the calculated similarity index. This feature comes handy
while debugging the script or checking the results of the similarity
search.
Finally, lines 36 and 37 save the name and the similarity index
of those files (and directories) that satisfy search criteria (i.e.,
have the similarity index that exceeds the search limit). File names
and corresponding indexes are saved into an anonymous hash with
two keys, name and sindex, and the reference to this
hash is pushed into a global @files array.
Upon the completion of the File::Find::find function, the
script iterates through the global @files array that contains
one entry for every file or directory that satisfies the search
criteria. The array is sorted in descending order by the similarity
index using a user-defined anonymous sort subroutine [13]. Each
iteration of the foreach loop at line 27 assigns the reference
to a hash, containing the current file information to the variable
$file. Then, line 28 outputs the full name of the file along
with its similarity index. Running afind on the system with
the following command:
afind /etc sesytem
produces the following output:
0.77 /etc/system
While changing the similarity limit from the default of 75% to 50%:
afind -l50 /etc sesytem
yields the following:
0.77 /etc/system
0.62 /etc/setmnt
0.57 /etc/lp/Systems
0.57 /etc/uucp/Systems
agrep The agrep script is an example of another application
of an approximate string-matching technique to the well-known computing
problems. This script works in a way similar to a traditional UNIX
grep utility -- it searches input files for occurrences
of a particular string, specified by a command-line argument.
The script requires at least two command-line arguments --
the search pattern and the name of the file to search for occurrences
of the pattern. Multiple file names or filenames with wildcards
may be supplied. Note that the agrep script relies on a shell
to perform the wildcard expansion; if the wildcard arguments are
quoted so that the expansion does not take place, the script will
process these arguments as exact file names.
The behavior of the script and the amount of output can be controlled
via three command-line switches: -l, which specifies the
similarity search limit similar to the parameter of the afind
script; -i, which causes the search algorithm to ignore the
case of the letters while comparing; and -n, which directs
the script to only output the names of the files that contain the
search pattern instead of printing the actual matches (similar to
the -l option of a traditional grep).
The following is a complete source code for the agrep script:
1 #!/usr/local/bin/perl
2
3 use Getopt::Std;
4 use String::Approx 'amatch';
5
6 $| = 1;
7
8 $usage = qq(Usage: agrep -linh search_pattern [files...]\n);
9
10 die $usage
11 unless getopts( qq(l:inh) );
12 die $usage
13 unless @ARGV;
14 die $usage
15 if $opt_h;
16
17 $limit = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ?
18 q($opt_l%) : q(15%);
19 $pattern = shift;
20
21 foreach $file ( @ARGV ) {
22 next if -B $file;
23 open( _F, $file ) || do {
24 print STDERR qq(Can't open file $file - $!.\n);
25 next;
26 };
27 @lines = <_F>;
28 foreach $match ( amatch( $pattern,
29 [ $limit, $opt_i && qq(i) ], @lines ) ) {
30 print qq($file: ), $opt_n ? qq(\n) : $match;
31 last
32 if $opt_n;
33 }
34 close( _F );
35 }
Lines 3 through 15 are very similar to the corresponding code of the
afind script with one exception -- instead of loading the
String::Similarity [10] module, the agrep uses another Perl
extension for approximate string matching, String::Approx [9].
This module exposes numerous functions, however, the agrep script
will only use one of them -- amatch.
The similarity search limit parameter, taken by the String::Aprox
module, is different from the corresponding parameter of the String::Similarity
module. It is expressed either as an absolute number of allowed
"edits" (i.e., k-differences measure) or a percentage
value of the number of allowed "edits" relative to the
length of the string (i.e., 10% means that every 10th character
may be an "edit"). For the sake of consistency, agrep
uses the percentage values to control the similarity limit with
the default set at 15%.
Line 21 of the script iterates over the @ARGV array that
contains the names of the files to be searched for the occurrences
of the pattern. Every iteration of the loop assigns the name of
the file from the @ARGV to the variable $file. Line
22 skips to the next iteration of the loop if the current file is
a binary; for the sake of simplicity we excluded the processing
for binary files from the script. Lines 23 through 27 open the current
file for read access and load the file data into a global array
@lines. Finally, the foreach loop at line 28 invokes
the amatch function of the String::Approx module,
collects the matches one-by-one into the $match variable,
and prints them out.
The amatch function is the simplest of the functionality,
exposed by the String::Approx. It takes a search pattern
as its first parameter, then a number of optional modifiers and
an optional array of input strings. The third parameter, the array
of inputs, may be omitted, in which case, the amatch will
assume that the default input $_ contains the string to be
matched. In list context, this function returns all input lines
that match the pattern; in scalar context, it simply returns a true
or false value based on whether any of the inputs match the pattern.
The behavior of amatch can be controlled via a number of
optional modifiers, passed as a second parameter. One of the modifiers
is the similarity limit, discussed previously. As mentioned before,
the limit here is a total number of "edits" -- insertions,
deletions, and substitutions allowed in the search. This, however,
can be controlled at a much finer level by supplying explicit numbers
for each type of "edit". For instance, "I10%",
"D20%", and "S5%" will allow for 10% of insertions,
20% of deletions, and 5% of substitutions, respectively. Another
modifier, used by agrep, is the case-sensitivity modifier
i, which turns off case-sensitive matching. Other modifiers
are available. For instance, the exact positions within the input
argument where the search should begin and end can be controlled
via initial_position and final_position modifiers.
Running agrep on the system:
agrep sysshm /etc/system
produces the following output:
/etc/system: * exclude: sys/shmsys
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/shmsys
Bumping up the similarity limit from the default 15% to 25%:
agrep -l25 sysshm /etc/system
will yield the following:
/etc/system: *ident "@(#)system 1.15 92/11/14 SMI" /* SV
/etc/system: * root device and root filesystem configuratio
/etc/system: * the root filesystem) rather than at first
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/msgsys
/etc/system: forceload: sys/shmsys
/etc/system: forceload: sys/semsys
...
The output of agrep is not sorted by relevance or similarity
index simply because the amatch function does not return the
calculated edit distance. However, another function of the String::Approx
module, adist, is specifically designed to calculate and return
the edit distance between the input string and the pattern. Also,
function aslice may be directed to return the calculated edit
distance via optional modifier minimal_distance. This function
is especially useful as it provides the starting position and the
length of the match.
Summary
Approximate string matching is an extremely powerful technique,
quickly gaining popularity within the modern software development
community. More and more often we see the new software products,
reaping the benefits of those fuzzy searching algorithms, which
up until a few years ago were thought of as a subject for a Ph.D.
in Computer Science rather than something that had any practical
application. Languages like Perl make these algorithms easy to use,
which leaves no excuse for not exploring the power of the approximate
text matching capabilities.
We hope that this article stimulated your interest, and we encourage
you to become familiar with Perl text-processing extension modules.
Apart from the two modules discussed here, CPAN offers some more:
stemming and inflection modules, such as Text::Stem and Lingua::EN::Inflect,
which may be very useful for spell checking applications; and Text::Soundex,
which encapsulates the soundex algorithm implementation and others.
Altogether, these modules will enable you to produce powerful and
user-friendly software applications.
References
1. R. Karp and M. Rabin. Efficient randomized pattern-matching
algorithms. IBM Journal of Research and Development, 31:249-260,
1987.
2. R. Boyer and J. Moore. A fast string-searching algorithm. Communications
of the ACM, 20:762-772, 1977.
3. J. Friedl and A. Oram. Mastering Regular Expressions: Powerful
Techniques for Perl and Other Tools. O'Reilly & Associates.
1997.
4. D. Knuth. The Art of Computer Programming: Sorting and Searching,
Vol. 3. Addison-Wesley, 1998.
5. V. I. Levenshtein. Binary codes capable of correcting deletions,
insertions and reversals. Sov. Phys. Dokl., 6:707-710, 1966.
6. G. Gonnet and R. Baeza-Yates. Handbook of Algorithms and
Data Structures. Addison-Wesley, 1991.
7. S. Wu and U. Manber. Fast text searching allowing errors. Communications
of the ACM, 35:83-91, 1992.
8. J. Hietaniemi. String::Approx, Latest Version 3.15,
Feb, 2000. CPAN directory JHI.
9. M. Lehmann. String::Similarity, Latest Version 0.02,
Nov 2000. CPAN directory MLEHMANN.
10. E. Myers. An O(ND) Difference Algorithm and its Variations.
Algorithmica, Vol 1. No. 2. 251-266, 1986.
11. G.Sarathy. Getopt::Std, Latest Version - Perl 5.6.1,
Jan 2001. CPAN directory GSAR.
12. G.Sarathy. File::Find, Latest Version - Perl 5.6.1,
Nov 2000. CPAN directory GSAR.
13. L. Wall, T. Christiansen, R. L. Schwartz. Programming Perl.
O'Reilly & Associates, 1996.
Alexander Golomshtok is a professional consultant who, for
the last decade, has been hanging around downtown New York developing
large-scale software systems and infrastructure solutions for Wall
Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.
Yefim Nodelman is a seasoned systems administrator with seven
years of professional experience in supporting UNIX and Windows
NT ystems. He can be reached at: nodelman_yefim@jpmorgan.com.
|