Parsing Interesting Things
Randal L. Schwartz
Someone recently popped into one of the newsgroups I frequent
and asked how to parse an INI file. You might have seen those before,
with sections and keyword=value lines, like:
[login]
timeout=30
remote=yes
[password]
minlength=6
I think they started in the Microsoft world, since no sane UNIX hacker
would have come up with something like that. No, we come up with things
like .Xdefaults and sendmail.cf and termcap.
But the request seemed simple: parse the file and gather the information
into a hash for quick access, two levels deep, of course.
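As a sketch of the goal, the sample file above would end up in a structure shaped roughly like this. The variable name $config is only illustrative, not something from the original request:

```perl
use strict;
use warnings;

# Hypothetical target structure for the sample file:
# section name => { keyword => value }
my $config = {
    login    => { timeout => 30, remote => 'yes' },
    password => { minlength => 6 },
};

# Two-level lookup: section first, then keyword.
print "timeout is $config->{login}{timeout}\n";
```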
Now, I usually carry the banner here for "use the CPAN", and in
fact, there are numerous CPAN modules that parse INI files (too
many, I think). But let's take a different route here. Suppose we
were parsing a file that wasn't already CPANned to death. What tools
could we use?
Well, certainly Perl's regular expressions are pretty powerful
in the first place, and this task really wouldn't be that difficult
with hand-written code, but we can go a bit further and pull out
a nifty tool from the CPAN: the "madman of Perl" Damian Conway's
Parse::RecDescent. This module permits extremely complex
parsers to be built by specifying a nice hierarchical description
of the data (as a grammar), and a series of actions to be taken
as each portion of the data is returned. I find it very simple to
use, and whipped up a parser in no time.
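For comparison, here is roughly what the hand-written route might look like. This is only a sketch under my own assumptions: the helper name parse_ini_by_hand is made up, blank lines are tolerated, and a repeated keyword simply overwrites the earlier value.

```perl
use strict;
use warnings;

# Hypothetical helper: parse INI-style text line by line with plain regexes.
sub parse_ini_by_hand {
    my ($text) = @_;
    my (%config, $section);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                 # skip blank lines
        if ($line =~ /^\[(.*)\]$/) {              # [section] line
            $section = $1;
            $config{$section} ||= {};
        } elsif (defined $section and $line =~ /^(\w+)=(.*)$/) {
            $config{$section}{$1} = $2;           # keyword=value line
        } else {
            return;                               # anything else: reject
        }
    }
    return \%config;
}

my $by_hand = parse_ini_by_hand(
    "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n");
print "minlength is $by_hand->{password}{minlength}\n";
```

It works, but the shape of the file is buried in the control flow, which is exactly what a grammar makes explicit.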
The key to a useful grammar is getting the description of the data right,
and then deciding what to do with each piece once it has been recognized.
First, let's look at a file.
A file is a series of sections, so in the grammar language, that's
given as:
file: sections
Actually, a file is a bit more than that. If we just used that, the
grammar would accept any input that merely begins with valid sections,
no matter what junk follows. So, we need to anchor that:
file: sections /\z/
Which says, match sections, and when you're done matching sections,
match the end of the string. If you're not at the end of the string
when you are done matching sections, this isn't a file that we want.
And now, sections is zero or more sections, which we write
as:
sections: section(s?)
with the (s?) suffix meaning "zero or more". Very readable
so far. A section is a section marker (the square-bracket line) and
some definitions:
section: section_marker definitions
definitions: definition(s?)
And we've defined the definitions as well. So far, we've managed to
capture the essence of an INI-like file, but we've not actually matched
anything (except the end of string). That's because we've been constructing
"non-terminals". Grammar rules can also contain "terminals" (like
the end-of-string token above) to define specific things to match.
Let's start with a section marker:
section_marker: /\[.*\]/
There. A section marker is a square-bracketed thingy. And what's a
definition?
definition: key /=/ value
Yeah, it's a key and a value, separated by an equals. But what are
those? Why, more terminals!
key: /\w+/
value: /.*/
And already with just a few lines of code, we've defined most of the
grammar. But now we need to introduce a bit more knowledge about Parse::RecDescent.
Between each of the items of the rules, the generated parser will
be permitted to skip over the current skip string, which is "whitespace"
by default. This is fine for section markers: we don't mind any preceding
whitespace being tossed. But it's a pain if whitespace gets in-between
the key and the rest of the line. Fortunately, a directive lets us
change the skip string for the remainder of a rule:
definition: key <skip: ''> /=/ value
which means that the string '' (the empty string) is now the
skip string, meaning that the equals must be adjacent to the end of
the key, and the value starts immediately after the equals. Good!
We could stick all the rules above into a string $GRAMMAR,
and then create a parser $PARSER using these rules as:
use Parse::RecDescent;
my $PARSER = Parse::RecDescent->new($GRAMMAR)
    or die;
This $PARSER can then be used repeatedly to see whether a file
fits the specifications. To do that, we call the top-level rule (file)
as a method, passing it $INPUT, the contents of the file in
question:
if (defined(my $result = $PARSER->file($INPUT))) {
    print "It's a valid INI file!\n";
} else {
    print "No good.\n";
}
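Putting the pieces so far into one script might look like the sketch below. The helper looks_like_ini and the inline sample strings are my own stand-ins for a real file's contents; the grammar is just the rules developed above, with no actions yet.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The rules developed so far: enough to accept or reject input.
my $GRAMMAR = <<'END_GRAMMAR';
file: sections /\z/
sections: section(s?)
section: section_marker definitions
definitions: definition(s?)
section_marker: /\[.*\]/
definition: key <skip: ''> /=/ value
key: /\w+/
value: /.*/
END_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR)
    or die "Bad grammar!\n";

# Hypothetical helper: true if the whole text parses as a file.
sub looks_like_ini {
    my ($input) = @_;
    return defined $PARSER->file($input);
}

my $sample = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";
print looks_like_ini($sample)        ? "valid\n" : "No good.\n";
print looks_like_ini("timeout=30\n") ? "valid\n" : "No good.\n";
```

The second call hands over a definition with no section marker in front of it, so the end-of-string anchor has nothing to agree with and the parse is rejected.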
Now, if all we were doing was verifying well-formedness, that's enough.
But we wanted to also use the data as it was parsed. To do that, we
need to also know that every rule is like a subroutine call, and passes
back the last value evaluated. By default, that's the string matched by
the terminal (the whole match, even if the regex contains capturing
parentheses), or whatever value the last subrule returns. (For the
repetitions above, an arrayref is returned of all the matches, if any.)
However, we can include some Perl code enclosed in a block as the last
item of a rule, and then the value of that block will be the return value.
For example, we really don't want the brackets included in the
section marker, so we can capture just the name with parentheses and
return $1 from a block:
section_marker: /\[(.*)\]/ { $1 }
which says to perform the regex match, and if it succeeds, evaluate
the block. As long as the block doesn't return undef, it's
also considered a "match", and as the last thing in a rule, it's also
the overall value of the rule.
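To see exactly what a rule hands back, it can help to call that rule directly as a method on a tiny throwaway parser. The following is a minimal sketch; the variable names are mine.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# A one-rule grammar with an explicit action returning the captured name.
my $marker_parser = Parse::RecDescent->new(<<'END_GRAMMAR') or die "Bad grammar!\n";
section_marker: /\[(.*)\]/ { $1 }
END_GRAMMAR

# Any rule can be called as a method; it returns the rule's value or undef.
my $name = $marker_parser->section_marker("[login]");
print defined $name ? "section name: $name\n" : "no match\n";
```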
But what about the definitions? We want to note both the key and
the value, so we'll use some sort of Perl block at the end of the
rule. And we can return an arrayref of the two items just fine,
but we need to access the "value" of the key and value subrules
through the magical %item hash. The keys to this hash are
the names of the subrules. (Sorry for the overloading of the key/value
terms here.)
definition: key <skip: ''> /=/ value
    { [$item{key}, $item{value}] }
And now a definition is an arrayref, consisting of the found key,
and its found value. (If there's more than one item called "key",
then you must resort to positional syntax, but it's almost always
easier and clearer to just invent a new non-terminal name for that
particular slot.)
Similarly, a section needs the name of the section and all of
the definitions of that section.
section: section_marker definitions
    { [$item{section_marker}, $item{definitions}] }
Note that definitions will already be an arrayref of individual
definitions, which are themselves references to two-element arrays.
All this stacking is taken care of automatically by the parser built
by Parse::RecDescent!
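To make the nesting concrete, here is a hand-built copy of what one section's value would look like, assuming the section marker has already been reduced to the bare name. The parser would construct this for us; the literal is only for illustration.

```perl
use strict;
use warnings;

# One section, as returned by the section rule (built by hand here).
my $one_section = [
    'login',                     # value of section_marker
    [                            # value of definitions
        [ 'timeout', '30'  ],    # each definition: [key, value]
        [ 'remote',  'yes' ],
    ],
];

# Unpack it the same way the parser's caller would.
my ($marker, $definitions) = @$one_section;
for my $definition (@$definitions) {
    my ($key, $value) = @$definition;
    print "$marker.$key = $value\n";
}
```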
Finally, the fun part. A file wants to be all the sections. And
we could just punt and return that:
file: sections /\z/ { $item{sections} }
which will then be an arrayref pointing to a list of sections, each
section being an arrayref pointing to a list of definitions in that
section, each definition being an arrayref pointing to a key/value
tuple. But let's convert this into a hash for quick access:
file:
    sections /\z/
    { my %return;
      my $sections = $item{sections};
      for my $section (@$sections) {
        my ($section_marker, $definitions) = @$section;
        for my $definition (@$definitions) {
          my ($key, $value) = @$definition;
          for ($return{$section_marker}{$key}) {
            if (not defined $_) {
              $_ = $value;
            } elsif (not ref $_) {
              $_ = [$_, $value];
            } else {
              push @$_, $value;
            }
          }
        }
      }
      \%return;
    }
Wow. What was that? Well, first we define a hash to be returned (as
a hashref), and then walk the multiple levels of the arrayrefs of
arrayrefs of tuples. The interesting part starts in the middle, which
is merely aliasing $return{$section_marker}{$key} to $_
for the rest of the inner loop. If that value isn't defined, then
this is the first time we've seen a keyword under a given section,
so we stuff the value. If it's already defined, then we've seen the
same keyword twice. In this case, I decided to turn the value into
an arrayref, so that the values are individually extractable. And
finally, if it's already an arrayref, then we just push the latest
hit onto the end.
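The scalar-to-arrayref promotion is easier to see in isolation. Here is the same three-way test pulled out into a standalone helper; the name store_value and the sample keys are mine, not part of the grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: store $value under $hash->{$key}, promoting the
# slot to an arrayref when the same key shows up more than once.
sub store_value {
    my ($hash, $key, $value) = @_;
    for ($hash->{$key}) {            # alias the slot to $_
        if (not defined $_) {
            $_ = $value;             # first sighting: plain scalar
        } elsif (not ref $_) {
            $_ = [$_, $value];       # second sighting: promote to arrayref
        } else {
            push @$_, $value;        # third and later: append
        }
    }
}

my %login;
store_value(\%login, 'timeout', 30);
for my $server (qw(alpha beta gamma)) {
    store_value(\%login, 'server', $server);
}
print "timeout: $login{timeout}\n";
print "servers: @{ $login{server} }\n";
```

The price of this convenience is that callers must check ref before using a value, since a keyword seen once is a scalar and a keyword seen twice is not.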
The return value of calling the file method is now either
this hashref, or undef. So to get the "timeout" parameter
from the example INI file above, we'd say:
my $timeout = $result->{login}{timeout};
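Pulled together into a single script, the whole thing might look like the following sketch. The inline $INPUT string stands in for the contents of a real file, and the hash-building action is the same logic as above, just written a little more compactly.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $GRAMMAR = <<'END_GRAMMAR';
file: sections /\z/
    { my %return;
      for my $section (@{ $item{sections} }) {
          my ($section_marker, $definitions) = @$section;
          for my $definition (@$definitions) {
              my ($key, $value) = @$definition;
              for ($return{$section_marker}{$key}) {
                  if    (not defined $_) { $_ = $value }
                  elsif (not ref $_)     { $_ = [$_, $value] }
                  else                   { push @$_, $value }
              }
          }
      }
      \%return;
    }
sections: section(s?)
section: section_marker definitions
    { [$item{section_marker}, $item{definitions}] }
definitions: definition(s?)
section_marker: /\[(.*)\]/ { $1 }
definition: key <skip: ''> /=/ value
    { [$item{key}, $item{value}] }
key: /\w+/
value: /.*/
END_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR)
    or die "Bad grammar!\n";

# Stand-in for the contents of a real file.
my $INPUT = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";

my $result = $PARSER->file($INPUT);
die "No good.\n" unless defined $result;

my $timeout = $result->{login}{timeout};
print "timeout is $timeout\n";
```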
Because the names are case sensitive, we might want to force all the
section names and keys to lowercase, either with a few extra actions in
the grammar or while we are building the hash.
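One way to do that after the fact is a small pass over the finished hash. This is a sketch with a made-up helper name, lowercase_names, and it assumes the two-level shape described above.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a two-level hash, lowercasing section names
# and keys along the way. Values are carried over untouched.
sub lowercase_names {
    my ($config) = @_;
    my %lower;
    for my $section (keys %$config) {
        for my $key (keys %{ $config->{$section} }) {
            $lower{lc $section}{lc $key} = $config->{$section}{$key};
        }
    }
    return \%lower;
}

my $normalized = lowercase_names({ Login => { TimeOut => 30 } });
print "timeout is $normalized->{login}{timeout}\n";
```

Note that names differing only in case will collide here, with one silently replacing the other, and a section with no definitions is dropped from the copy.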
There you have it: an INI-like file parser made with Parse::RecDescent.
I hope this brief intro to this powerful module will get you interested
enough to read the rest of the documentation and study its amazing
array of features. And you'll never fear parsing an odd-looking
file again. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration, security,
technical writing, and training. He has coauthored the "must-have"
standards: Programming Perl, Learning Perl, Learning
Perl for Win32 Systems, and Effective Perl Programming,
as well as writing regular columns for WebTechniques and
Unix Review magazines. He's also a frequent contributor to
the Perl newsgroups, and has moderated comp.lang.perl.announce since
its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.