A Powerful Search Tool for ASCII Files

Koos Pol

Sometimes the main UNIX principle of combining small tools to accomplish complex tasks just isn't enough. There are times when you simply need more. A striking example is the task of searching through ASCII files. This can be any sort of file: C programs, error logs, HTML files, etc. If you need to find a specific string, it is usually sufficient to grep through the file and view the results. Combined with find, grep can get you a long way. For example, if you need to dig through your HTML files and find the ones that have obsolete links to your old department, you may want to run something like:

find /usr/local/web -name "*.html" -print |
while read F; do echo "**** $F";
grep http://intranet.mycompany.com/olddepartment "$F"; done

If your queries get a bit more complicated, you may still get by using egrep instead of grep, but you will run out of steam very soon. Besides, you really don't want to learn all the egrep options when they can differ from one operating system to the next. So what's the alternative? This is a perfect challenge for Perl regular expressions: they can be extremely powerful, and they are the same on every UNIX that runs Perl. So, what if we could rewrite the monster above as something more attractive, such as:

find.pl http://intranet.mycompany.com/olddepartment "/usr/local/web/*.html"

We obviously need to combine find and Perl's regular expressions in one script. If we can do that, we really have a powerful tool for searching files. Here are a few more examples:

Look for all your HTML files with images on remote servers:

find.pl "<img src=\"((http)|(ftp))" "/usr/local/web/*.html"

You inherited a bunch of Perl scripts and you want a quick view of all the subroutines used:

find.pl -v "sub\s+\w+\s*\{" "*.pl"

You are sifting through some C sources for a bug. It appears it has to do with signals in combination with file operations:

find.pl -v "signal.*?FILE" "*.c"

This will produce a list of all lines containing the word "signal" and a constant that has "FILE" in its name. However, this list is too long to handle. If we ignore all the FILE_SELECTION matches because they don't seem to be involved, it makes the list much smaller:

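find.pl -v 'signal.*?FILE(?!_SELECTION)' "*.c"

Note that I changed from double quotes to single quotes because some shells really don't like the ! on the command line, and I had to prevent the shell from interpreting it.

Single quotes help with other characters, too. To list the shell scripts that use the $COPY variable on a line whose trailing comment mentions "status", in any case:

find.pl -i '\$COPY\W.*\#.*status' "/usr/local/scripts/*.sh"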
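To see what "FILE, but not FILE_SELECTION" means to Perl, here is a minimal sketch that uses a negative lookahead, (?!...), which is one way to express that exclusion. The three sample lines are made up for this illustration; only the FILE_SELECTION name comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my @lines = (
    'signal(SIGHUP, handler); /* uses FILE_SELECTION */',
    'if (signal_pending) rc = FILE_LOCKED;',
    'signal(SIGINT, SIG_IGN);',
);

# "signal", then as little as possible, then FILE,
# provided FILE is not immediately followed by _SELECTION.
my $regex = 'signal.*?FILE(?!_SELECTION)';

# Keep only the lines that match.
my @hits = grep { /$regex/ } @lines;
print "$_\n" for @hits;
```

Only the FILE_LOCKED line survives: the first line has no FILE other than the one in FILE_SELECTION, and the third has no FILE at all.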

You may have noticed that we need a few command-line switches to choose between displaying only the file name and displaying the matching lines as well. We may also want to search case-insensitively. With the Getopt::Std package, this is very easy. It is even included in the standard Perl distribution, so we don't need to worry much about that.

my %opt;        # h=help, i=no case, v=view lines, l=line numbers
my $regex;      # what we're looking for
my $filemask;   # which files
my $dir;        # location
my $filename;   # name of file found
my $line;       # matching line
my $linenr;     # remembers the line number
my $nameonly;   # only print the filename
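As a quick illustration of how Getopt::Std fills the %opt hash, here is a small sketch. The parse_args wrapper is a hypothetical helper (it is not part of find.pl) that lets us try getopts on a list of arguments without touching the real command line.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse switches from a list of arguments and
# return the option hash plus whatever is left (regex and filemask).
sub parse_args {
    local @ARGV = @_;               # getopts() works on @ARGV
    my %opt;
    getopts('hlvi', \%opt) or return;
    return (\%opt, @ARGV);          # switches have been removed from @ARGV
}

my ($opt, @rest) = parse_args('-i', '-v', 'sub\s+\w+', '*.pl');
print "switches: ", join(',', sort keys %$opt), "\n";
print "rest:     @rest\n";
```

Each switch that was given becomes a key in the hash, and getopts stops at the first argument that is not a switch, so the regex and the filemask are left behind for us.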

The start is easy. All we need is find to deliver us a list of files. This list can then be read one-by-one:

open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "$!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename); # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;               # skip files we can't read
    }

By now we have the first of our files opened for reading. We can scan it to see if it matches our regex:

    while (defined($line = <FILE>)) {    # read as long as necessary
        if ($line =~ /$regex/) {         # we have a match

If we have reached this point, we have had a match. Now we can print the filename and move on to the next file in the list. Let's see what we have so far:

open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "$!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename); # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;                            # skip files we can't read
    }
    while (defined($line = <FILE>)) {    # read as long as necessary
        if ($line =~ /$regex/) {         # we have a match
            print $filename,"\n";
            last;
        }
    }
    close (FILE);
}
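One caveat with the pipe above: the find command is handed to the shell as a single string, so a directory or filemask that contains a double quote, a backtick, or a $ may be interpreted by the shell. A possible alternative, sketched here and not what find.pl itself does, is the list form of open, which (assuming Perl 5.8 or later and a find program in the PATH) passes each argument to find untouched. The find_files name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the files below $dir whose names match
# the shell pattern $mask. The list form of open() runs find directly,
# so no shell ever sees (or reinterprets) the arguments.
sub find_files {
    my ($dir, $mask) = @_;
    open(my $list, '-|', 'find', $dir, '-name', $mask, '-print')
        or die "Can't run find: $!\n";
    chomp(my @files = <$list>);     # one filename per line
    close($list);
    return @files;
}

# List the Perl scripts below the current directory.
print "$_\n" for find_files('.', '*.pl');
```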

You may have noticed I cheated on the $dir and $filemask. Where did those come from? We used regular expressions! If the given command-line parameter contains a "/", we know that a directory is involved. We just cut the string on the last "/". Everything before the "/" is the directory, and everything after it is the actual filemask:

$dir = '.';    # use current dir if we don't get one
if ($filemask =~ m|/|) {          # directory given
    $filemask =~ m|(^.*)/(.*$)|;  # split the string up...
    ($dir,$filemask) = ($1,$2);   # ...and save both parts
}
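To check the split on a couple of inputs, the same logic can be wrapped in a small function. This is only a sketch: split_mask is a hypothetical name, and it uses m{...} delimiters so that the slash needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: split "dir/mask" on the last slash and
# fall back to the current directory when there is no slash.
sub split_mask {
    my ($filemask) = @_;
    my $dir = '.';
    if ($filemask =~ m{^(.*)/([^/]*)$}) {   # greedy .* finds the last "/"
        ($dir, $filemask) = ($1, $2);
    }
    return ($dir, $filemask);
}

for my $arg ('/usr/local/web/*.html', '*.pl') {
    my ($dir, $mask) = split_mask($arg);
    print "$arg -> dir=$dir mask=$mask\n";
}
```

The first argument splits into /usr/local/web and *.html; the second has no slash, so the directory stays at "." and the mask is left alone.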

Of course, we want to search case-insensitively as well. Perl accommodates this easily by extending the regular expression syntax. The prefix that makes a regex case-insensitive is (?i). We'll prepend it to the regex if there is a "-i" on the command line:

$opt{i} && ($regex = '(?i)'.$regex);  # case insensitive?
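A tiny experiment shows the effect of the prefix. The sample line is made up for this illustration; the point is only that the same string matches once (?i) is glued to the front of the regex.

```perl
use strict;
use warnings;

my $regex = 'href=';
my $line  = '<A HREF="http://intranet.mycompany.com/">home</A>';

my $before = ($line =~ /$regex/) ? 1 : 0;   # case-sensitive: no match
$regex = '(?i)' . $regex;                   # what the -i switch does
my $after  = ($line =~ /$regex/) ? 1 : 0;   # case-insensitive: match

print "before: $before, after: $after\n";
```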

(You can find all the details on these extensions in the perlre man page.) There is one more hurdle: if we want to view the matching lines or their line numbers, we need to modify the logic a bit. When we have a match on our regex, we need to continue reading the file for more matches until the whole file is read. Only then can we skip to the next file. By the way, let's agree on one thing: if we want to see the line numbers, it is reasonable to display the lines as well, isn't it? There is not much use in displaying the line numbers only. So we print only the filename when neither "-l" nor "-v" is given:

$nameonly = !($opt{l} || $opt{v}); # only print the filename

Let's pick up at the loop that reads the file. Instead of just printing the filename on a match, it becomes:

    $linenr = 0;                          # restart the count for each file
    # read as long as necessary
    NEXTLINE: while (defined($line = <FILE>)) {
        $linenr++;                        # remember this line number
        if ($line =~ /$regex/) {          # we have a match

If we didn't get a "-v" or "-l" on the command line, it is sufficient to print the filename. There is no need to read the rest of the file.

            if ($nameonly) {           # only print the filename
                print $filename,"\n";
                last NEXTLINE;

If we do have a "-v" or "-l" on the command line, we print a filename that is more visible in the clutter of long screens full of text. Of course, we print the line and, if requested, the line number also.

            } else {
                print "**** $filename *****\n";
                print $opt{l} ? "$linenr: " : "", $line;

We now continue reading the rest of the file for other matches. If we find them, we again print the lines or line numbers. We close off by printing a new line as a separator between files:

                while (defined($line = <FILE>)) {  # read until EOF
                    $linenr++;
                    if ($line =~ /$regex/) {      # more matches
                        print $opt{l} ? "$linenr: " : "", $line;
                    }
                }
                print "\n";
            }
        }
    }
    close (FILE);
}
close (LIST);
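The line-matching part can also be tried in isolation. The sketch below is a simplified variant, not the code above: it collects the matching lines from an already opened filehandle, and the matching_lines name and the sample text are made up for the illustration. It reads from an in-memory "file", which assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line read from $fh that matches
# $regex, prefixed with its line number when $numbered is true.
sub matching_lines {
    my ($fh, $regex, $numbered) = @_;
    my @out;
    my $linenr = 0;
    while (defined(my $line = <$fh>)) {
        $linenr++;                          # count every line, match or not
        push @out, ($numbered ? "$linenr: " : '') . $line
            if $line =~ /$regex/;
    }
    return @out;
}

# Try it on an in-memory "file" instead of a file on disk.
my $text = "sub first {\nmy \$x = 1;\n}\nsub second {\n}\n";
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!\n";
print matching_lines($fh, 'sub\s+\w+', 1);
close($fh);
```

This prints the two "sub" lines, numbered 1 and 4, much as find.pl would with "-l".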

Let's give the lost user some helpful messages in case the program gets the wrong parameters. When we stuff that into the script and clean up some things here and there, then this is the final result:

#! /usr/bin/perl -w
use strict;
use Getopt::Std;

my %opt;         # h=help, i=no case, v=view lines, l=line numbers
my $regex;       # what we're looking for
my $filemask;    # which files
my $dir;         # location
my $filename;    # name of file found
my $line;        # matching line
my $linenr;      # remembers the line number
my $nameonly;    # print only the filename

sub usage1 {
    $0 =~ s|^.*/||;  # strip off the path
    print "Usage: $0 [-h] [-i] [-l] [-v] regex filemask\n";
    exit
}
sub usage2 {
    $0 =~ s|^.*/||;  # strip off the path
    print <<EOF;
Usage: $0 [-h] [-i] [-l] [-v] regex filemask

$0 is a powerful text grepper. It combines find and Perl regular expressions.

Options:
     -h  help screen
     -i  case insensitive matching
     -l  display line numbers on matching lines
     -v  display the lines matching 'regex'
EOF
    exit
}

getopts('hlvi',\%opt);  # process the command line switches
usage2() if (defined $opt{h});  # "HELP ME...!"
if (defined($ARGV[0]) && defined($ARGV[1])) {
    ($regex, $filemask) = ($ARGV[0], $ARGV[1]);
} else {
    usage1();
}

usage1() if (! defined $regex) or (! defined $filemask);
$opt{v} = 1 if $opt{l};               # -l without -v is useless
$opt{i} && ($regex = '(?i)'.$regex);  # case insensitive?

$dir = '.';    # use current dir if we don't get one
if ($filemask =~ m|/|) {          # directory given
    $filemask =~ m|(^.*)/(.*$)|;  # split the string up...
    ($dir,$filemask) = ($1,$2);   # ...and save both parts
}
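open (LIST, "find \"$dir\" -name \"$filemask\" -print|")
    or die "Can't run find ($!)\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename);           # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;                         # skip files we can't read
    }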
    $linenr=0;
    # read as long as necessary
    NEXTLINE: while (defined($line = <FILE>)) {
        $linenr++;                    # remember this line number
        if ($line =~ /$regex/) {      # we have a match
            if (!$opt{v}) {           # just print the filename
                print $filename,"\n";
                last NEXTLINE;
            } else {
                print "**** $filename *****\n";
                print $opt{l} ? "$linenr: " : "", $line;
                while (defined($line = <FILE>)) {  # read until EOF
                    $linenr++;
                    if ($line =~ /$regex/) {       # more matches
                        print $opt{l} ? "$linenr: " : "", $line;
                    }
                }
                print "\n";
            }
        }
    }
    close (FILE);
}
close (LIST);

Koos Pol is a systems administrator with Compuware. He has broad experience in UNIX, Windows, and OS/2 systems. His main responsibilities are providing tools and database support on various platforms to a group of developers. He also provides command-line interfaces and Web interfaces to database backends. He can be reached at: koos_pol@nl.compuware.com.