Perl 6 By Example: Parsing INI files

This blog post is part of my ongoing project
to write a book about Perl 6.

If you’re interested, please sign up for the mailing list at the bottom of
the article, or here. It will be
low volume (less than an email per month, on average).


You’ve probably seen .ini files before; they are quite common as
configuration files on the Microsoft Windows platform, but are also in many
other places like ODBC configuration files, Ansible’s inventory
files
and so on.

This is what they look like:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Perl 6 offers regexes for parsing, and grammars for structuring and reusing
regexes.

Regex Basics

A regex is a piece of code that acts as a pattern for strings with a common
structure. It’s derived from the computer science concept of a regular
expression
, but adopted to
allow more constructs than pure regular expressions allow, and with some added
features that make them easier to use.

We’ll use named regexes to match the primitives, and then use regexes that
call these named regexes to build a parser for the INI files. Since INI files
have no universally accepted, formal grammar, we have to make stuff up a we
go.

Let’s start with parsing value pairs, like key1=value1. First the key.
It may contain letters, digits and the underscore _. There’s a shortcut for
match such characters, w, and matching more at least one works by appending
a + character:

use v6;

my regex key { w+ }

multi sub MAIN('test') {
    use Test;
    ok 'abc'    ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}

my regex key { w+ } declares a lexically (my) scoped regex called key
that matches one or more word character.

There is a long tradition in programming languages to support so-called Perl
Compatible Regular Expressions
, short PCRE. Many programming languages
support some deviations from PCRE, including Perl itself, but common syntax
elements remain throughout most implementations. Perl 6 still supports some of
these elements, but deviates substantially in others.

Here w+ is the same as in PCRE, but the fact that white space around the
w+ is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^
<key> $/
delimit an anonymous regex. In this regex, ^ and $ stand for the
start and the end of the matched string, respectively, which is familiar from
PCRE. But then <key> calls the named regex key from earlier. This again
is a Perl 6 extension. In PCRE, the < in a regex matches a literal <. In
Perl 6 regexes, it introduces a subrule call.

In general, all non-word characters are reserved for “special” syntax, and you
have to quote or backslash them to get the literal meaning. For example <
or '<' in a regex match a less-then sign. Word characters (letters, digits
and the underscore) always match literally.

Parsing the INI primitives

Coming back the INI parsing, we have to think about what characters are
allowed inside a value. Listing allowed characters seems to be like a futile
exercise, since we are very likely to forget some. Instead, we should think
about what’s not allowed in a value. Newlines certainly aren’t, because they
introduce the next key/value pair or a section heading. Neither are
semicolons, because they introduce a comment.

We can formulate this exclusion as a negated character class: <-[ n ; ]>
matches any single character that is neither a newline nor a semicolon. Note
that inside a character class, nearly all characters lose their special
meaning. Only backslash, whitespace and the closing bracket stand for anything
other than themselves. Inside and outside of character classes alike, n
matches a single newline character, and s and whitespace. The upper-case
inverts that, so that for example S matches any single character that is
not whitespace.

This leads us to a version version of a regex for match a value in an ini
file:

my regex value { <-[ n ; ]>+ }

There is one problem with this regex: it also matches leading and trailing
whitespace, which we don’t want to consider as part of the value:

my regex value { <-[ n ; ]>+ }
if ' abc ' ~~ /<value>/ {
    say "matched '$/'";         # matched ' abc '
}

If Perl 6 regexes were limited to a regular language in the Computer Science
sense, we’d have to something like this:

my regex value { 
    # match a first non-whitespace character
    <-[ s ; ]> 
    [
        # then arbitrarily many that can contain whitespace
        <-[ n ; ]>* 
        # ... terminated by one non-whitespace 
        <-[ s ; ]> 
    ]?  # and make it optional, in case the value is only
        # only one non-whitespace character
}

And now you know why people respond with “And now you have two problems”
when proposing to solve problems with regexes. A simpler solution is to match
a value as introduced first, and then introduce a constraint that neither the
first nor the last character may be whitespace:

my regex value { <!before s> <-[ s ; ]>+ <!after s> }

along with accompanying tests:

is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> watches single non-whitespace too';
ok "anb" !~~ /^ <value> $/, '<valuee> does not match n';

<!before regex> is a negated look-ahead, that is, the following text must
not match the regex, and the text isn’t consumed while matching.
Unsurprisingly, <!after regex> is the negated look-behind, which tries to
match text that’s already been matched, and must not succeed in doing so for
the whole match to be successful.

This being Perl 6, there is of course yet another way to approach this
problem. If you formulate the requirements as “a value must not contain a
newline or semicolon and start with a non-whitepsace and end with a
non-whitespace”, it become obvious that if we just had an AND operator in
regexes, this could be easy. And it is:

my regex value { <-[ s ; ]>+ & S.* & .*S }

The & operator delimits two or more smaller regex expressions that must all
match the same string successfully for the whole match to succeed. S.*
matches any string that starts with a non-whitesspace character (S), followed
by any character (.) any number of times *. Likewise .*S matches any
string that ends with non-whitespace character.

Who would have thought that matching something as seemingly simple as a value
in a configuration file could be so involved? Luckily, matching a pair of key
and value is much simpler, now that we know how to match each on their own:

my regex pair { <key> '=' <value> }

And this works great, as long as there are no blanks surrounding the equality
sign. If there is, we have to match it separately:

my regex pair { <key> h* '=' h* <value> }

h matches a horizontal whitespace, that is, a blank, a tabulator character,
or any other fancy space-like thing that Unicode has in store for us (for
example also the non-breaking space), but not a newline.

Speaking of newlines, it’s a good idea to match a newline at the end of regex
pair, and since we ignore empty lines, let’s match more than one too:

my regex pair { <key> h* '=' h* <value> n+ }

Time to write some tests as well:

ok "key=vaulen" ~~ /<pair>/, 'simple pair';
ok "key = valuenn" ~~ /<pair>/, 'pair with blanks';
ok "keyn= valuen" !~~ /<pair>/, 'pair with newline before assignment';

A section header is a string in brackets, so the string itself shouldn’t
contain brackets or a newline:

my regex header { '[' <-[ [ ] n ]>+ ']' n+ }

# and in multi sub MAIN('test'):
ok "[abc]n"    ~~ /^ <header> $/, 'simple header';
ok "[a c]n"    ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[anb]n"  !~~ /^ <header> $/, 'No newlines inside headers';

The last remaining primitive is the comment:

my regex comment { ';' N*n+ }

N matches any character that’s not a newline, so the comment is just a
semicolon, and then anything until the end of the line.

Putting Things Together

A section of an INI file is a header followed by some key-value pairs or
comment lines:

my regex section {
    <header>
    [ <pair> | <comment> ]*
}

[...] groups a part of a regex, so that the quantifier * after it
applies to the whole group, not just to the last term.

The whole INI file consists of potentially some initial key/value pairs or
comments followed by some sections:

my regex inifile {
    [ <pair> | <comment> ]*
    <section>*
}

The avid reader has noticed that the [ <pair> | <comment> ]* part of a regex
has been used twice, so it’s a good idea to extract it into a standalone
regex:

my regex block   { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }

It’s time for the “ultimate” test:

my $ini = q:to/EOI/;
key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff
EOI

ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';

Backtracking

Regex matching seems magical to many programmers. You just state the pattern,
and the regex engine determines for you whether a string matches the pattern
or not. While implementing a regex engine is a tricky business, the basics
aren’t too hard to understand.

The regex engine goes through the parts of a regex from left to right, trying
to match each part of the regex. It keeps track of what part of the string
it matched so far in a cursor. If a part of a regex can’t find a match, the
regex engine tries to alter the previous match to take up fewer characters,
and the retry the failed match at the new position.

For example if you execute the regex match

'abc' ~~ /.* b/

the regex engine first evaluates the .*. The . matches any character.
The * quantifier is
greedy, which means it tries to match as many characters as it can. It ends
up matching the whole string, abc. Then the regex engine tries to match the
b, which is a literal. Since the previous match gobbled up the whole string,
matching c against the remaining empty string fails. So the previous regex
part, .*, must give up a character. It now matches ab, and the literal
matcher for the b compares b and c, and fails again. So there is a final
iteration where the .* once again gives up one character it matched, and now
the b literal can match the second character in the string.

This back and forth between the parts of a regex is called backtracking.
It’s great feature when you search for a pattern in a string. But in a parser,
it is usually not desirable. If for example the regex key matched the
substring "key2" in the input“key2=value2`, you don’t want it to match a
shorter substring just because the next part of the regex can’t match.

There are three major reasons why you don’t want that. The first is that it makes
debugging harder. When humans think about how a text is structured, they
usually commit pretty quickly to basic tokenization, such as where a word or a
sentence ends. Thus backtracking can be very uninuitive. If you generate error
messages based on which regexes failed to match, backtracking basically always
leads to the error message being pretty useless.

The second reason is that backtracking can lead to unexpected regex matches.
For example you want to match two words, optionally separated by whitespaces,
and you try to translate this directly to a regex:

say "two words" ~~ /w+s*w+/;     # 「two words」

This seems to work: the first w+ matches the first word, the seccond oen
matches the second word, all fine and good. Until you find that it actually
matches a single word too:

say "two" ~~ /w+s*w+/;           # 「two」

How did that happen? Well, the first w+ matched the whole word, s*
successfully matched the empty string, and then the second w+ failed,
forcing the previous to parts of the regex to match differently. So in the
second iteration, the first w+only matchestw, and the secondw+
matches
o`. And then you realize that if two words aren’t delimtied by
whitespace, how do you even tell where one word ends and the next one starts?
With backtracking disabled, the regex fails to match instead of matching in an
unintended way.

The third reason is performance. When you disable backtracking, the regex
engine has to look at each character only once, or once for each branch it can
take in the case of alternatives. With backtracking, the regex engine can be
stuck in backtracking loops that take over-proportionally longer with
increasing length of the input string.

To disable backtracking, you simply have to replace the word regex by
token in the declaration, or by using the :ratchet modifier inside the
regex.

In the INI file parser, only regex value needs backtracking (though other
formualtions discussed above don’t need it), all the other regexes can be
switched over to tokens safely:

my token key     { w+ }
my regex value   {  <!before s> <-[n;]>+ <!after s> }
my token pair    { <key> h* '=' h* <value> n+ }
my token header  { '[' <-[ [ ] n ]>+ ']' n+ }
my token comment { ';' N*n+  }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }

Summary

Perl 6 allows regex reuse by treating them as first-class citizens,
allowing them to be named and called like normal routines.
Further clutter is removed by allowing whitespace inside regexes.

These features allows you to write regexes to parse proper file formats and
even programming languages. So far we have only seen a binary decision about
whether a string matches a regex or not. In the future, we’ll explore ways to
improve code reuse even more, extract structured data from the match, and give
better error messages when the parse fails.

Subscribe to the Perl 6 book mailing list

* indicates required

(function($) {window.fnames = new Array(); window.ftypes = new Array();fnames[0]=’EMAIL’;ftypes[0]=’email’;fnames[1]=’FNAME’;ftypes[1]=’text’;}(jQuery));var $mcj = jQuery.noConflict(true);


Source From: perlgeek.de.
Original article title: Perl 6 By Example: Parsing INI files.
This full article can be read at: Perl 6 By Example: Parsing INI files.

Advertisement


  • Article By :

Random Article You May Like

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*