Grammars and Metagrammars
We can summarize the various components that make up a grammar as follows:
- A grammar is a collection of rules that describe how a language is structured;
- A rule is a collection of productions that list the alternative ways in which parts of the language can appear;
- A production is a list of non-terminals and/or terminals that must appear in the text being parsed if the
surrounding rule is to match;
- A non-terminal is a reference to some other rule that must be matched as part of the matching of the
surrounding rule.
- A terminal is something that actually can literally appear in a text (a word in the grammar's vocabulary);
What's interesting about this summary is how similar it is itself to a grammar. Just as each grammar rule is defined as a series of references to other rules (non-terminals) and token names (terminals), so the terms 'grammar', 'rule', 'production',
'non-terminal', and 'terminal' are defined by reference to other terms (non-terminals) or by self-contained explanations (terminals). That leads to the somewhat recursive conclusion that you could define a grammar using a grammar. Or maybe even using the same grammar!
And so you can. Here is the entire grammar for specifying Parse::RecDescent grammars, expressed itself as a
Parse::RecDescent grammar:
grammar: components(s)
component: rule | comment
rule: "\n" identifier ":" productions(?)
productions: production '|' productions | production
production: items(s)
item: lookahead(?) simpleitem | directive | comment
lookahead: '...' | '...!'
simpleitem: subrule args(?) | repetition | terminal | bracket args(?) | action
subrule: identifier
args: {extract_codeblock($text,'[]')}
repetition: subrule args(?) howoften
howoften: '(?)' | '(s?)' | '(s)' | /(\d+)?[.][.](/\d+)?
terminal: /[/]([\][/]|[^/])*[/][gimsoxe]*/ | /"([\]"|[^"])*"/ | /'([\]'|[^'])*'/
action: { extract_codeblock($text) }
bracket: '(' Item(s) production(s?) ')'
directive: '<commit>' | '<uncommit>' | '<resync>' | '<resync:' pattern '>'
| '<reject>' | '<reject:' condition '>' | '<error>'
| '<error:' string '>' | '<error?>' | '<error?:' string '>'
| '<rulevar:' /[^]+/ '>' | '<matchrule:' string '>'
identifier: /[a-z]\w*/i
comment: /#[^\n]*/
pattern: {extract_bracketed($text,'<')}
condition: {extract_codeblock($text,'{<')}
string: {extract_variable($text)} | {extract_quotelike($text)}
| {extract_bracketed($text,'<')}
The grammar is fairly straightforward, except perhaps the actions which call the various extract_ functions. These functions are imported from the Text::Balanced module and are useful shortcuts for parsing matching nested parentheses, Perl variables, quoted strings, Perl quotelike operators, and complete blocks of Perl text. It would be possible to dispense with all of these and encode the entire grammar without any actions, but then the grammar would be four
times bigger.
But, of course, now you're asking yourself: "Does Parse::RecDescent use a Parse::RecDescent grammar to convert grammars into parsers?" The answer is no. If Parse::RecDescent had to call Parse::RecDescent to build a parser for the Parse::RecDescent grammar, then that second version of Parse::RecDescent would itself have to call Parse::RecDescent to build a parser for the "Parse::RecDescent grammar" grammar, and that version would in turn have to call Parse::RecDescent to build a parser for the ' "Parse::RecDescent grammar" grammar' grammar, and so on right down to Little Cat Z.
Hence, although it can automatically build parsers for just about any other purpose, Parse::RecDescent must content itself with a hand-built parser.