A Powerful Search Tool for ASCII Files
Koos Pol
Sometimes the main UNIX principle of combining small tools to
accomplish complex tasks just isn't enough. There are times
that you just need more. A striking example is the task of searching
through ASCII files. This can be any sort of file: C programs, error
logs, HTML files, etc. If you need to find a specific string, it
is usually sufficient to grep through the file and view the
results. Combined with find, you can get a long way. For
example, if you need to dig through your HTML files and find the
ones that have obsolete links to your old department, you may want
to run something like:
find /usr/local/web -name "*.html" -print |
  while read F; do echo "**** $F";
    grep "http://intranet.mycompany.com/olddepartment" "$F"; done
If your queries get a bit more complicated, you may still get by using
egrep instead of grep, but you will run out of steam
very soon. Besides, you really don't want to learn all the
egrep options when they can differ from one operating system to the next.
So what's the alternative? This is a perfect challenge for Perl
regular expressions: they can be extremely powerful and are the same
for all UNIXes that run Perl. So, what if we could rewrite the monster
above in something more attractive, such as:
find.pl http://intranet.mycompany.com/olddepartment "/usr/local/web/*.html"
We obviously need to combine find and Perl's regular expressions
in one script. If we can do that, we really have a powerful tool for
searching files. Here are a few more examples:
Look for all your HTML files with images on remote servers:
find.pl "<img src=\"((http)|(ftp))" "/usr/local/web/*.html"
You inherited a bunch of Perl scripts and you want a quick view of
all the subroutines used:
find.pl -v "sub\s+\w+\s*\{" "*.pl"
You are sifting through some C sources for a bug. It appears it has
to do with signals in combination with file operations:
find.pl "signal.*?FILE" "*.c"
This will produce a list of all lines containing the word "signal"
and a constant that has "FILE" in its name. However,
this list is too long to handle. If we ignore all the FILE_SELECTION
matches because they don't seem to be involved, it makes
the list much smaller:
-i "\$COPY\W.*\#.*status" "/usr/local/scripts/*.sh"
Note that I switched from double quotes to single quotes because
some shells really don't like the ! on the command line,
and I had to prevent the shell from interpreting it.
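One way to express the FILE_SELECTION exclusion described above is a negative lookahead, (?!...), which is also where a ! would enter the pattern. The sketch below assumes that form and runs it against made-up sample lines; it illustrates the idea and is not necessarily the exact pattern used for the search.

```perl
use strict;
use warnings;

# Assumed pattern: "signal", then a later "FILE" that is not
# immediately followed by "_SELECTION" (negative lookahead).
my $regex = 'signal.*?FILE(?!_SELECTION)';

# Hypothetical sample lines, for illustration only.
my @lines = (
    'signal(SIGIO, handler); flags = FILE_LOCKED;',
    'signal(SIGUSR1, cb);    mode  = FILE_SELECTION;',
    'signal(SIGHUP, reload);',
);

# Keep only the lines that match the pattern.
my @hits = grep { /$regex/ } @lines;
print "$_\n" for @hits;
```

Only the first sample line survives: the second contains FILE only as part of FILE_SELECTION, and the third contains no FILE at all. On the command line, a pattern like this would go inside single quotes so the shell leaves the ! alone.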
You may have noticed that we need a few command switches to display
only the file name or to display the matching lines as well. We
may also want to search case-insensitively. The Getopt::Std
module makes this very easy, and it is included in the standard
Perl distribution, so we don't need to worry much about that.
my %opt; # h=help, i=no case, v=view lines, l=line numbers
my $regex; # what we're looking for
my $filemask; # which files
my $dir; # location
my $filename; # name of file found
my $line; # matching line
my $linenr; # remembers the line number
my $nameonly; # only print the filename
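To show what Getopt::Std does with the switches, here is a minimal sketch. The command line is simulated by assigning to @ARGV, and the arguments are made up for the illustration; the switch string 'hlvi' is the one used later in the script.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (hypothetical arguments, for illustration).
@ARGV = ('-i', '-v', 'status', '/usr/local/scripts/*.sh');

my %opt;
getopts('hlvi', \%opt) or die "bad options\n";

# getopts removes the switches it recognizes; the rest stays in @ARGV.
print "case-insensitive: ", ($opt{i} ? "yes" : "no"), "\n";
print "view lines:       ", ($opt{v} ? "yes" : "no"), "\n";
print "remaining args:   @ARGV\n";
```

After the call, %opt holds a true value for each switch that was given, and @ARGV contains only the regex and the filemask.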
The start is easy. All we need is find to deliver us a list
of files. This list can then be read one-by-one:
open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "$!\n";
while (defined ($filename=<LIST>)) {
chomp $filename;
next if (-B $filename); # don't check binary files
open (FILE, "<$filename") or warn "Can't open $filename ($!)\n";
By now we have the first of our files opened for reading. We can scan
it to see if it matches our regex:
while (defined($line = <FILE>)) { # read as long as necessary
if ($line =~ /$regex/) { # we have a match
If we have reached this point, we have had a match. Now we can print
the filename and move on to the next file in the list. Let's
see what we have so far:
open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "$!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename); # don't check binary files
    open (FILE, "<$filename") or warn "Can't open $filename ($!)\n";
    while (defined($line = <FILE>)) { # read as long as necessary
        if ($line =~ /$regex/) { # we have a match
            print $filename,"\n";
            last;
        }
    }
}
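The pipe open above builds a shell command by string interpolation, so a directory or filemask containing a double quote or a $ could confuse the shell. A possible variation, assuming Perl 5.8 or later and a find program on the PATH, is the list form of pipe open, which bypasses the shell. The helper name find_files and the temporary demo files are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: run find without a shell and return the names
# of the matching files. Assumes a POSIX-style find is on the PATH.
sub find_files {
    my ($dir, $filemask) = @_;
    open(my $list, '-|', 'find', $dir, '-name', $filemask, '-print')
        or die "Can't run find: $!\n";
    chomp(my @files = <$list>);
    close($list);
    return @files;
}

# Demo in a throwaway directory holding two empty files.
my $demo = tempdir(CLEANUP => 1);
for my $name ('index.html', 'notes.txt') {
    open(my $fh, '>', "$demo/$name") or die "Can't create $name: $!\n";
    close($fh);
}

my @found = find_files($demo, '*.html');
print "$_\n" for @found;
```

Because each argument is handed to find as is, no extra quoting is needed around the directory or the filemask.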
You may have noticed I cheated on the $dir and $filemask.
Where did those come from? We used regular expressions! If the given
command-line parameter contains a "/", we know that
a directory is involved. We just cut the string on the last "/".
Everything before the "/" is the directory, and everything
after it is the actual filemask:
$dir = '.'; # use current dir if we don't get one
if ($filemask =~ m|/|) { # directory given
    $filemask =~ m|(^.*)/(.*$)|; # split the string up...
    ($dir,$filemask) = ($1,$2); # ...and save both parts
}
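The split can be tried in isolation with a small sketch. The helper name split_mask is hypothetical, and the handling of an argument such as "/*.conf" (where the part before the last slash is empty) is an addition that the snippet above does not have.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split shown above: returns the
# directory and the filemask for one command-line argument.
sub split_mask {
    my ($arg) = @_;
    # The greedy (.*) makes the split happen at the LAST slash.
    if ($arg =~ m|^(.*)/(.*)$|) {
        my ($dir, $mask) = ($1, $2);
        $dir = '/' if $dir eq '';   # assumption: "/*.conf" means the root
        return ($dir, $mask);
    }
    return ('.', $arg);             # no slash: use the current directory
}

for my $arg ('/usr/local/web/*.html', '*.pl') {
    my ($dir, $mask) = split_mask($arg);
    print "$arg => dir=$dir mask=$mask\n";
}
```

The first argument splits into /usr/local/web and *.html; the second has no slash, so the directory falls back to the current one.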
Of course, we want to search case-insensitively as well. Perl accommodates
this easily by extending the regular expression syntax. The prefix
that makes a regex case-insensitive is (?i). We'll
prepend it to the regex if there is a "-i"
on the command line:
$opt{i} && ($regex = '(?i)'.$regex); # case insensitive?
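A quick sketch of the effect, using a made-up pattern and input line: the same string is tested once as typed and once with the (?i) prefix added.

```perl
use strict;
use warnings;

my $regex = 'status';            # pattern as typed on the command line
my $line  = "EXIT STATUS: 0\n";  # hypothetical input line

my $plain = ($line =~ /$regex/) ? 1 : 0;   # case-sensitive: no match
$regex = '(?i)' . $regex;                  # what the -i switch does
my $nocase = ($line =~ /$regex/) ? 1 : 0;  # now it matches

print "plain=$plain nocase=$nocase\n";
```

Because the modifier travels inside the pattern string, the matching code itself does not need to know whether -i was given.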
(You can find all the details on these extensions in the perlre
man page.) There is one more hurdle -- if we want to view the
matching lines or their line numbers, then we need to modify the logic
a bit. When we have a match on our regex, then we need to continue
reading the file for more matches until the whole file is read. Only
then can we skip to the next file. By the way, let's agree on
one thing -- if we want to see the line numbers, it is reasonable
to display the lines as well, isn't it? There is not much use
in displaying the line numbers only:
$nameonly = !($opt{l} || $opt{v}); # only print the filename
We pick up at the point where we have a match on our regex. Instead
of just printing the filename, the loop becomes:
# read as long as necessary
NEXTLINE: while (defined($line = <FILE>)) {
$linenr++; # remember this line number
if ($line =~ /$regex/) { # we have a match
If we didn't get a "-v" or "-l"
on the command line, it is sufficient to print the filename. There
is no need to read the rest of the file.
if ($nameonly) { # only print the filename
print $filename,"\n";
last NEXTLINE;
If we do have a "-v" or "-l" on
the command line, we print a filename that is more visible in the
clutter of long screens full of text. Of course, we print the line
and, if requested, the line number also.
} else {
print "**** $filename *****\n";
print $opt{l} ? "$linenr: " : "", $line;
We now continue reading the rest of the file for other matches. If
we find them, we again print the lines or line numbers. We close off
by printing a new line as a separator between files:
while (defined($line = <FILE>)) { # read until EOF
$linenr++;
if ($line =~ /$regex/) { # more matches
print $opt{l} ? "$linenr: " : "", $line;
}
}
print "\n";
}
}
}
close (FILE);
}
close (LIST);
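The control flow above may be easier to follow when the per-file scan is pulled into a subroutine. The sketch below is a restructuring under assumptions, not the script's own layout: scan_handle is a hypothetical name, it returns the matches instead of printing them, and the demo reads from an in-memory string rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: scan an open filehandle and return the matches
# as [line number, line] pairs. With $nameonly set, stop at the first
# match, since one match is enough to list the file.
sub scan_handle {
    my ($fh, $regex, $nameonly) = @_;
    my ($linenr, @matches) = (0);
    while (defined(my $line = <$fh>)) {
        $linenr++;
        next unless $line =~ /$regex/;
        push @matches, [$linenr, $line];
        last if $nameonly;
    }
    return @matches;
}

# Demo on an in-memory "file" (made-up content).
my $text = "alpha\nsub one {\nbeta\nsub two {\n";
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!\n";
my @all = scan_handle($fh, 'sub\s+\w+', 0);
close($fh);

print "$_->[0]: $_->[1]" for @all;
```

The caller can then decide how to present the result: print only the filename when the list is non-empty, or print each line with or without its number.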
Let's give the lost user some helpful messages in case the program
gets the wrong parameters. When we stuff that into the script and
clean up some things here and there, then this is the final result:
#! /usr/bin/perl -w
use strict;
use Getopt::Std;
my %opt; # h=help, i=no case, v=view lines, l=line numbers
my $regex; # what we're looking for
my $filemask; # which files
my $dir; # location
my $filename; # name of file found
my $line; # matching line
my $linenr; # remembers the line number
my $nameonly; # print only the filename
sub usage1 {
    $0 =~ s|^.*/||; # strip off the path
    print "Usage: $0 [-h] [-i] [-l] [-v] regex filemask\n";
    exit;
}
sub usage2 {
    $0 =~ s|^.*/||; # strip off the path
    print <<EOF;
Usage: $0 [-h] [-i] [-l] [-v] regex filemask
$0 is a powerful text grepper. It combines find and Perl regular expressions.
Options:
-h help screen
-i case insensitive matching
-l display line numbers on matching lines
-v display the lines matching 'regex'
EOF
    exit;
}
getopts('hlvi',\%opt); # process the command line switches
usage2() if (defined $opt{h}); # "HELP ME...!"
if (defined($ARGV[0]) && defined($ARGV[1])) {
    ($regex, $filemask) = ($ARGV[0], $ARGV[1]);
} else {
    usage1();
}
usage1() if (! defined $regex) or (! defined $filemask);
$opt{v} = 1 if ($opt{l}); # -l without -v is useless
$opt{i} && ($regex = '(?i)'.$regex); # case insensitive?
$dir = '.'; # use current dir if we don't get one
if ($filemask =~ m|/|) { # directory given
    $filemask =~ m|(^.*)/(.*$)|; # split the string up...
    ($dir,$filemask) = ($1,$2); # ...and save both parts
}
open (LIST, "find \"$dir\" -name \"$filemask\" -print|")
    or die "Can't run find ($!)\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename); # don't check binary files
    open (FILE, "<$filename") or die "Can't open $filename ($!)\n";
    $linenr=0;
    # read as long as necessary
    NEXTLINE: while (defined($line = <FILE>)) {
        $linenr++; # remember this line number
        if ($line =~ /$regex/) { # we have a match
            if (!$opt{v}) { # just print the filename
                print $filename,"\n";
                last NEXTLINE;
            } else {
                print "**** $filename *****\n";
                print $opt{l} ? "$linenr: " : "", $line;
                while (defined($line = <FILE>)) { # read until EOF
                    $linenr++;
                    if ($line =~ /$regex/) { # more matches
                        print $opt{l} ? "$linenr: " : "", $line;
                    }
                }
                print "\n";
            }
        }
    }
    close (FILE);
}
close (LIST);
Koos Pol is a systems administrator with Compuware. He has broad
experience with UNIX, Windows, and OS/2 systems. His main responsibilities
are providing tools and database support on various platforms to a
group of developers. He also provides command-line interfaces and
Web interfaces to database backends. He can be reached at: koos_pol@nl.compuware.com.