Parsing Interesting Things

Randal L. Schwartz

Someone recently popped into one of the newsgroups I frequent and asked how to parse an INI file. You might have seen those before, with sections and keyword=value lines, like:

[login]
timeout=30
remote=yes

[password]
minlength=6

I think they started in the Microsoft world, since no sane UNIX hacker would have come up with something like that. No, we come up with things like .Xdefaults and sendmail.cf and termcap. But the request seemed simple: parse the file and gather the information into a hash for quick access, two levels deep, of course.

Now, I usually carry the banner here for "use the CPAN", and in fact, there are numerous CPAN modules that parse INI files (too many, I think). But let's take a different route here. Suppose we were parsing a file that wasn't already CPANned to death. What tools could we use?

Well, certainly Perl's regular expressions are pretty powerful in the first place, and this task really wouldn't be that difficult with hand-written code, but we can go a bit further and pull out a nifty tool from the CPAN: the "madman of Perl" Damian Conway's Parse::RecDescent. This module permits extremely complex parsers to be built by specifying a nice hierarchical description of the data (as a grammar), and a series of actions to be taken as each portion of the data is returned. I find it very simple to use, and whipped up a parser in no time.
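For comparison, the hand-written route might look roughly like the sketch below. The subroutine name parse_ini_by_hand is invented for illustration, and two behaviors are assumptions rather than anything the original request specified: a repeated keyword simply overwrites the earlier value, and any unrecognizable line rejects the whole input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): parse INI-like text
# line by line with plain regexes and return a two-level hashref,
# or undef if a line doesn't fit.
sub parse_ini_by_hand {
  my ($text) = @_;
  my (%config, $section);
  for my $line (split /\n/, $text) {
    next if $line =~ /^\s*$/;                # skip blank lines
    if ($line =~ /^\s*\[(.*)\]\s*$/) {       # [section] marker
      $section = $1;
      $config{$section} ||= {};
    } elsif (defined $section and $line =~ /^\s*(\w+)=(.*)$/) {
      $config{$section}{$1} = $2;            # assumption: last one wins
    } else {
      return;                                # assumption: reject anything else
    }
  }
  return \%config;
}

my $config = parse_ini_by_hand(<<'END');
[login]
timeout=30
remote=yes

[password]
minlength=6
END
print "timeout is $config->{login}{timeout}\n";
```

That works, but the shape of the file is buried in the control flow, which is exactly what a grammar makes explicit.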

The key to a useful grammar is getting the description right, and then deciding what to do once you've matched it. First, let's look at a file. A file is a series of sections, so in the grammar language, that's given as:

file: sections

Actually, a file is a bit more than that. If we just used that, the grammar would be satisfied by any input that merely begins with some sections, no matter what junk follows. So, we need to anchor that:

file: sections /\z/

Which says, match sections, and when you're done matching sections, match the end of the string. If you're not at the end of the string when you are done matching sections, this isn't a file that we want.

And now, sections is zero or more sections, which we write as:

sections: section(s?)

with the (s?) suffix meaning "zero or more". Very readable so far. A section is a section marker (the square-bracket line) and some definitions:

section: section_marker definitions
definitions: definition(s?)

And we've defined the definitions as well. So far, we've managed to capture the essence of an INI-like file, but we've not actually matched anything (except the end of string). That's because we've been constructing "non-terminals". Grammar rules can also contain "terminals" (like the end-of-string token above) to define specific things to match. Let's start with a section marker:

section_marker: /\[.*\]/

There. A section marker is a square-bracketed thingy. And what's a definition?

definition: key /=/ value

Yeah, it's a key and a value, separated by an equals. But what are those? Why, more terminals!

key: /\w+/
value: /.*/

And already with just a few lines of code, we've defined most of the grammar. But now we need to introduce a bit more knowledge about Parse::RecDescent. Before each terminal, the generated parser is permitted to skip over anything matching the current skip string, which is a pattern for optional whitespace (newlines included) by default. This is fine for section markers: we don't mind any preceding whitespace being tossed. But it's a pain if whitespace gets in between the key and the rest of the line. Fortunately, we can alter the skip string for the remainder of a rule:

definition: key <skip: ''> /=/ value

which means that the string '' (the empty string) is now the skip string, meaning that the equals must be adjacent to the end of the key, and the value starts immediately after the equals. Good!

We could stick all the rules above into a string $GRAMMAR, and then create a parser $PARSER using these rules as:

use Parse::RecDescent;
my $PARSER = Parse::RecDescent->new($GRAMMAR)
  or die;

This $PARSER can then be used repeatedly to see whether a file fits the specifications. To do that, we call the top-level rule (file) as a method, passing it $INPUT, the contents of the file in question:

if (defined(my $result = $PARSER->file($INPUT))) {
  print "It's a valid INI file!\n";
} else {
  print "No good.\n";
}
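
To see the pieces so far working together, here is a minimal sketch of a complete well-formedness checker. It assumes Parse::RecDescent is installed from the CPAN; the helper name is_valid_ini and the inline sample text are invented for illustration.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The rules developed so far, collected into one grammar string.
# The single-quoted heredoc keeps Perl from touching the backslashes.
my $GRAMMAR = <<'END_OF_GRAMMAR';
file: sections /\z/
sections: section(s?)
section: section_marker definitions
definitions: definition(s?)
section_marker: /\[.*\]/
definition: key <skip: ''> /=/ value
key: /\w+/
value: /.*/
END_OF_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR)
  or die "bad grammar";

# Hypothetical helper: true if the text matches the top-level file rule.
sub is_valid_ini {
  my ($input) = @_;
  return defined $PARSER->file($input);
}

my $INPUT = "[login]\ntimeout=30\nremote=yes\n\n[password]\nminlength=6\n";
print is_valid_ini($INPUT) ? "It's a valid INI file!\n" : "No good.\n";
```

The parser is built once and reused for every call, since constructing it from the grammar is the expensive step.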

Now, if all we were doing was verifying well-formedness, that's enough. But we also wanted to use the data as it was parsed. To do that, we need to know that every rule is like a subroutine call, and passes back the value of the last item matched. By default, that's the string that matched a terminal, or whatever value a subrule returns. (For the repetitions above, an arrayref is returned of all the matches, if any.) However, we can include some Perl code enclosed in a block as the last item of a rule, and then the value of that block will be the return value.

For example, we really don't want the brackets included in the section marker. A terminal's value is the entire matched text, brackets and all, so adding capturing parentheses by themselves changes nothing:

section_marker: /\[(.*)\]/

That still hands back the whole bracketed string. But the parentheses do set $1, so we can select the brackets away by returning $1 explicitly from a block:

section_marker: /\[(.*)\]/ { $1 }

which says to perform the regex match, and if it succeeds, evaluate the block. As long as the block doesn't return undef, it's also considered a "match", and as the last thing in a rule, it's also the overall value of the rule.

But what about the definitions? We want to note both the key and the value, so we'll use some sort of Perl block at the end of the rule. And we can return an arrayref of the two items just fine, but we need to access the "value" of the key and value subrules through the magical %item hash. The keys to this hash are the names of the subrules. (Sorry for the overloading of the key/value terms here.)

definition: key <skip: ''> /=/ value
  { [$item{key}, $item{value}] }

And now a definition is an arrayref, consisting of the found key, and its found value. (If there's more than one item called "key", then you must resort to positional syntax, but it's almost always easier and clearer to just invent a new non-terminal name for that particular slot.)

Similarly, a section needs the name of the section and all of the definitions of that section.

section: section_marker definitions
  { [$item{section_marker}, $item{definitions}] }

Note that definitions will already be an arrayref of individual definitions, which are themselves references to two-element arrays. All this stacking is taken care of automatically by the parser built by Parse::RecDescent!
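
To make that stacking concrete, here is a sketch of the value a section would carry for the [login] part of the sample file, written out by hand rather than produced by the parser. It assumes section_marker hands back the bare name without the brackets.

```perl
use strict;
use warnings;

# Hand-written illustration of one section's return value.
my $section = [
  'login',                 # from section_marker
  [                        # from definitions: one entry per definition
    ['timeout', '30'],     # each definition is a [key, value] pair
    ['remote',  'yes'],
  ],
];

# Unpacking it mirrors the nesting of the rules.
my ($section_marker, $definitions) = @$section;
for my $definition (@$definitions) {
  my ($key, $value) = @$definition;
  print "$section_marker.$key = $value\n";
}
```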

Finally, the fun part. A file wants to be all the sections. And we could just punt and return that:

file: sections /\z/ { $item{sections} }

which will then be an arrayref pointing to a list of sections, each section being an arrayref pointing to a list of definitions in that section, each definition being an arrayref pointing to a key/value tuple. But let's convert this into a hash for quick access:

  file:
    sections /\z/
    { my %return;
      my $sections = $item{sections};
      for my $section (@$sections) {
        my ($section_marker, $definitions) = @$section;
        for my $definition (@$definitions) {
          my ($key, $value) = @$definition;
          for ($return{$section_marker}{$key}) {
            if (not defined $_) {
              $_ = $value;
            } elsif (not ref $_) {
              $_ = [$_, $value];
            } else {
              push @$_, $value;
            }
          }
        }
      }
      \%return;
    }

Wow. What was that? Well, first we define a hash to be returned (as a hashref), and then walk the multiple levels of the arrayrefs of arrayrefs of tuples. The interesting part starts in the middle, with a one-item for loop that merely aliases $return{$section_marker}{$key} to $_ for the body of the loop. If that value isn't defined, then this is the first time we've seen this keyword under the given section, so we stuff the value. If it's already defined but isn't a reference, then we've seen the same keyword twice. In this case, I decided to turn the value into an arrayref, so that the values are individually extractable. And finally, if it's already an arrayref, then we just push the latest hit onto the end.
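
The aliasing trick is easier to see outside the grammar. The following is a standalone sketch of the same three-way logic; the subroutine name store_value and the sample keys are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: store $value under $hash->{$key}, promoting the
# slot to an arrayref once the same key shows up a second time.
sub store_value {
  my ($hash, $key, $value) = @_;
  for ($hash->{$key}) {          # one-item loop: $_ aliases the hash slot
    if (not defined $_) {
      $_ = $value;               # first sighting: plain scalar
    } elsif (not ref $_) {
      $_ = [$_, $value];         # second sighting: promote to arrayref
    } else {
      push @$_, $value;          # later sightings: append
    }
  }
}

my %login;
store_value(\%login, 'timeout', 30);
store_value(\%login, 'server', 'alpha');
store_value(\%login, 'server', 'beta');
store_value(\%login, 'server', 'gamma');

print "timeout: $login{timeout}\n";
print "servers: @{ $login{server} }\n";
```

The price of this convenience is that a caller has to check with ref whether a given value is a plain string or an arrayref before using it.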

The return value of calling the file method is now either this hashref, or undef. So to get the "timeout" parameter from the example INI file above, we'd say:

my $timeout = $result->{login}{timeout};

Because the names are case sensitive, we might want to add a few other things to force all the section names and keys to lowercase, or perhaps we could do that while we were building the hash.
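
One way to do the lowercasing inside the grammar is to apply lc in the actions of the two rules that produce names. This is only a sketch: the names $NAME_RULES, section_name and key_name are invented here, it assumes Parse::RecDescent is installed, and the two rules are exercised on their own by calling them as start rules rather than through the full file rule.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Just the two name-producing rules, with lc applied in their actions.
# $item[1] is the value of the first item in the production.
my $NAME_RULES = <<'END_OF_GRAMMAR';
section_marker: /\[(.*)\]/ { lc $1 }
key: /\w+/ { lc $item[1] }
END_OF_GRAMMAR

my $name_parser = Parse::RecDescent->new($NAME_RULES)
  or die "bad grammar";

# Hypothetical wrappers so each rule can be tried in isolation.
sub section_name { my ($text) = @_; return scalar $name_parser->section_marker($text) }
sub key_name     { my ($text) = @_; return scalar $name_parser->key($text) }

print section_name("[Login]"), "\n";
print key_name("TimeOut"), "\n";
```

Dropping these two rules into the full grammar in place of the originals would make a lookup like $result->{login}{timeout} work however the file capitalizes its names.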

There you have it: an INI-like file parser made with Parse::RecDescent. I hope this brief intro to this powerful module will get you interested enough to read the rest of the documentation and study its amazing array of features. And you'll never fear parsing an odd-looking file again. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.