Perl 6 By Example: Improved INI Parsing with Grammars

This blog post is part of my ongoing project
to write a book about Perl 6.


Last week we saw a collection of regexes that can parse a
configuration file in the INI format
that’s popular in the world of
Microsoft Windows applications.

Here we’ll explore grammars, a feature that groups regexes into a class-like
structure, and how to extract structured data from a successful match.

Grammars

A grammar is a class with some extra features that make it suitable for
parsing text. Along with methods and attributes you can put regexes into a
grammar.

This is what the INI file parser looks like when formulated as a grammar:

grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

You can use it to parse some text by calling the parse method, which uses
regex or token TOP as the entry point:

my $result = IniFile.parse($text);

Besides the standardized entry point, a grammar offers more advantages.
You can inherit from it like from a normal class, thus bringing even
more reusability to regexes. You can group extra functionality together with
the regexes by adding methods to the grammar. And then there are some
mechanisms in grammars that can make your life as a developer easier.

One of them is dealing with whitespace. In INI files, horizontal whitespace is
generally considered to be insignificant, in that key=value and key = value
lead to the same configuration of the application. So far we’ve dealt
with that explicitly by adding \h* to token pair. But there are places we
haven’t actually considered. For example it’s OK to have a comment that’s not
at the start of the line.

The mechanism that grammars offer is that you can define a rule called ws,
and when you declare a token with rule instead of token (or enable this
feature in regex through the :sigspace modifier), Perl 6 inserts implicit
<ws> calls for you where there is whitespace in the regex definition:

grammar IniFile {
    token ws { \h* }
    rule pair { <key>  '='  <value> \n+ }
    # rest as before
}

This might not be worth the effort for a single rule that needs to parse
whitespace, but when there are more, this really pays off by keeping
whitespace parsing in a single place.

Note that you should only parse insignificant whitespace in token ws. For
example for INI files, newlines are significant, so ws shouldn’t match
them.
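
As a minimal sketch of the effect, the following trimmed-down version of the grammar keeps only the tokens that pair needs, and parses both spellings of the same setting. Starting the parse at pair with :rule<pair> is only done here to keep the example short.

```raku
grammar IniFile {
    token ws      { \h* }    # insignificant whitespace: horizontal only
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    # "rule" turns each run of whitespace after an atom into an implicit <.ws> call
    rule  pair    { <key> '=' <value> \n+ }
}

for "key=value\n", "key = value\n" -> $line {
    my $m = IniFile.parse($line, :rule<pair>);
    say $m ?? "parsed: key={$m<key>}, value={$m<value>}" !! "no match";
}
```

Both lines should produce the same key and value, because the optional spaces around the equals sign are consumed by the implicit ws calls rather than by key or value.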

Extracting Data from the Match

So far the IniFile grammar only checks whether a given input matches the
grammar or not. But when it does match, we really want the result of the parse
in a data structure that’s easy to use. For example we could translate this
example INI file:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Into this data structure of nested hashes:

{
    _ => {
        key1 => "value2"
    },
    section1 => {
        key2 => "value2",
        key3 => "with spaces"
    },
    section2 => {
        more => "stuff"
    }
}

Key-value pairs from outside of any section show up in the _ top-level
key.

The result from the IniFile.parse call is a
Match object that has (nearly) all the
information necessary to extract the desired match. If you turn a Match object
into a string, it becomes the matched string. But there’s more. You can use it
like a hash to extract the matches from named submatches. For example if the
top-level match from

token TOP { <block> <section>* }

produces a Match object $m, then $m<block> is again a Match object, this
one from the match of the call of token block. And $m<section> is a list
of Match objects from the repeated calls to token section. So a Match is
really a tree of matches.
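
To make the tree shape concrete, here is a small sketch that parses a short input and looks at a few nodes. The input string and the variable names are made up for this illustration.

```raku
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

my $text = "key1=value1\n[section1]\nkey2=value2\n[section2]\nmore=stuff\n";
my $m = IniFile.parse($text);

# A named submatch is itself a Match object
say $m<block> ~~ Match;

# <section> is quantified with *, so $m<section> is a list of Match objects
say $m<section>.elems;

# Each element has its own named submatches, one level further down the tree
for $m<section>.list -> $section {
    say $section<header>.Str.trim, ' has ', $section<block><pair>.elems, ' pair(s)';
}
```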

We can walk this data structure to extract the nested hashes.
Token header matches a string like "[section1]\n", and we're only
interested in
“section1”. To get to the inner part, we can modify token
header by inserting a pair of round parentheses around the subregex whose
match we’re interested in:

token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
#                  ^^^^^^^^^^^^^^^^^^^^  a capturing group

That’s a capturing group, and we can get its match by using the top-level
match for header as an array, and accessing its first element. This leads us
to the full INI parser:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }

    sub hash-from-block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }

    my %result;
    %result<_> = hash-from-block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = hash-from-block($section);
    }
    return %result;
}

This top-down approach works, but it requires a very intimate understanding of
the grammar’s structure. Which means that if you change the structure during
maintenance, you’ll have a hard time figuring out how to change the data
extraction code.

So Perl 6 offers a bottom-up approach as well. It allows you to write a data
extraction or action method for each regex, token or rule. The grammar engine
passes in the match object as the single argument, and the action method can
call the routine make to attach a result to the match object. The result is
available through the .made method on the match object.

This execution of action methods happens as soon as a regex matches
successfully, which means that an action method for a regex can rely on the
fact that the action methods for subregex calls have already run. For example
when the rule pair { <key> '=' <value> \n+ } is being executed, first
token key matches successfully, and its action method runs immediately
afterwards. Then token value matches, and its action method runs too. Then
finally rule pair itself can match successfully, so its action method can
rely on $m<key>.made and $m<value>.made being available, assuming that the
match result is stored in variable $m.
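
The following sketch shows that ordering on a deliberately tiny grammar. KeyValue and KeyValue::Actions are hypothetical names invented for this example, and the key is upper-cased only to make it visible that the parent method sees the attached value, not the matched text. Because the parameter is called $m here, the sketch calls the make method on it directly.

```raku
grammar KeyValue {
    token key   { \w+ }
    token value { \w+ }
    token TOP   { <key> '=' <value> }
}

class KeyValue::Actions {
    # Runs as soon as token key has matched
    method key($m)   { $m.make: $m.Str.uc }
    # Runs as soon as token value has matched
    method value($m) { $m.make: $m.Str }
    # Runs last, so both .made values are already available
    method TOP($m)   { $m.make: $m<key>.made => $m<value>.made }
}

my $result = KeyValue.parse('name=perl', :actions(KeyValue::Actions));
say $result.made;
```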

Speaking of variables, a regex match implicitly stores its result in the
special variable $/, and it is customary to use $/ as the parameter in action
methods. And there is a shortcut for accessing named submatches: instead of
writing $/<key>, you can write $<key>. With this convention in mind, the
action class becomes:

class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}

The first two action methods are really simple. The result of a key or
value match is simply the string that matched. For a header, it’s just the
substring inside the brackets. Fittingly, a pair returns a
Pair object, composed from key and value.
Method block constructs a hash from all the lines in the block by iterating
over each pair submatch, extracting the already attached Pair object.
One level above that in the match tree, section takes that hash and pairs it
with the name of the section, extracted from $<header>.made. Finally the
top-level action method gathers the sectionless key-value pairs under the key
_ as well as all the sections, and returns them in a hash.

In each method of the action class, we only rely on the knowledge of the
first level of regexes called directly from the regex that corresponds to the
action method, and on the data types of their .made values. Thus when you
refactor one regex, you only have to change the corresponding action method.
Nobody needs to be aware of the global structure of the grammar.

Now we just have to tell Perl 6 to actually use the action class:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}
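
For reference, here is one way the pieces could be assembled into a single runnable script, using the example INI file from above. It is a sketch rather than the canonical listing: method TOP is written with an explicit loop instead of the compact hash literal shown earlier, and the variable names $ini and %config are chosen for this example.

```raku
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        # Sectionless pairs go under the key _
        my %ini = _ => $<block>.made;
        # Each section contributes a Pair of name => hash
        for $<section>.list -> $section {
            my $pair = $section.made;
            %ini{ $pair.key } = $pair.value;
        }
        make %ini;
    }
}

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }
    return $m.made;
}

my $ini = q:to/EOI/;
key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff
EOI

my %config = parse-ini($ini);
say %config<section1><key3>;
say %config<_><key1>;
```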

If you want to start parsing with a different rule than TOP (which you might
want to do in a test, for example), you can pass a named argument rule to
method parse:

sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

say parse-ini($ini).perl;

use Test;

is-deeply parse-ini("k = v\n", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;

To better encapsulate all the parsing functionality within the grammar, we can
turn parse-ini into a method:

grammar IniFile {
    # regexes/tokens unchanged as before

    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }

        return $m.made
    }
}

# Usage:

my $result = IniFile.parse-ini($text);

To make this work, the class IniFile::Actions either has to be declared before the
grammar, or it needs to be pre-declared with class IniFile::Actions { ... }
at the top of the file (with the literal three dots to mark it as a forward
declaration).

Summary

Match objects are really a tree of matches, with nodes for each named submatch
and for each capturing group. Action methods make it easy to decouple parsing
from data extraction.

Next we’ll explore how to generate better error messages from a failed parse.
