Subsecciones

Introducción

El Problema

La documentación de Regexp::Grammars establece cual es el problema que aborda el módulo:

...Perl5.10 makes possible to use regexes to recognize complex, hierarchical-and even recursive-textual structures. The problem is that Perl 5.10 doesn’t provide any support for extracting that hierarchical data into nested data structures. In other words, using Perl 5.10 you can match complex data, but not parse it into an internally useful form.

An additional problem when using Perl 5.10 regexes to match complex data formats is that you have to make sure you remember to insert whitespace- matching constructs (such as \s*) at every possible position where the data might contain ignorable whitespace. This reduces the readability of such patterns, and increases the chance of errors (typically caused by overlooking a location where whitespace might appear).

Una solución: Regexp::Grammars

The Regexp::Grammars module solves both those problems.

If you import the module into a particular lexical scope, it preprocesses any regex in that scope, so as to implement a number of extensions to the standard Perl 5.10 regex syntax. These extensions simplify the task of defining and calling subrules within a grammar, and allow those subrule calls to capture and retain the components of they match in a proper hierarchical manner.

La sintaxis de una expresión regular Regexp::Grammars

Las expresiones regulares Regexp::Grammars aumentan las regexp Perl 5.10. La sintáxis se expande y se modifica:

A Regexp::Grammars specification consists of a pattern (which may include both standard Perl 5.10 regex syntax, as well as special Regexp::Grammars directives), followed by one or more rule or token definitions.

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n balanced_brackets.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8      qr{
     9        (<pp>)
    10
    11        <rule: pp>   \( (?: [^()]*+ | <escape> | <pp> )* \)
    12
    13        <token: escape> \\.
    14
    15      }xs;
    16  };
    17
    18  while (my $input = <>) {
    19      while ($input =~ m{$rbb}g) {
    20          say("matches: <$&>");
    21          say Dumper \%/;
    22      }
    23  }

Note that there is no need to explicitly place \s* subpatterns throughout the rules; that is taken care of automatically.

...

The initial pattern ((<pp>)) acts like the top rule of the grammar, and must be matched completely for the grammar to match.

The rules and tokens are declarations only and they are not directly matched. Instead, they act like subroutines, and are invoked by name from the initial pattern (or from within a rule or token).

Each rule or token extends from the directive that introduces it up to either the next rule or token directive, or (in the case of the final rule or token) to the end of the grammar.

El hash %/: Una representación del AST

Al ejecutar el programa anterior con entrada (2*(3+5))*4+(2-3) produce:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 balanced_brackets.pl
(2*(3+5))*4+(2-3)
matches: <(2*(3+5))>
$VAR1 = {
          '' => '(2*(3+5))',
          'pp' => {
                    '' => '(2*(3+5))',
                    'pp' => '(3+5)'
                  }
        };

matches: <(2-3)>
$VAR1 = {
          '' => '(2-3)',
          'pp' => '(2-3)'
        };

Each rule calls the subrules specified within it, and then return a hash containing whatever result each of those subrules returned, with each result indexed by the subrule’s name.

In this way, each level of the hierarchical regex can generate hashes recording everything its own subrules matched, so when the entire pattern matches, it produces a tree of nested hashes that represent the structured data the pattern matched.

...

In addition each result-hash has one extra key: the empty string. The value for this key is whatever string the entire subrule call matched.

Diferencias entre token y rule

The difference between a token and a rule is that a token treats any whitespace within it exactly as a normal Perl regular expression would. That is, a sequence of whitespace in a token is ignored if the /x modifier is in effect, or else matches the same literal sequence of whitespace characters (if /x is not in effect).

En el ejemplo anterior el comportamiento es el mismo si se reescribe la regla para el token escape como:

    13        <rule: escape> \\.
En este otro ejemplo mostramos que la diferencia entre token y rule es significativa:
pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsrule.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8      qr{
     9        <s>
    10
    11        <rule: s> <a> <c>
    12
    13        <rule: c>  c d
    14
    15        <token: a>  a b
    16
    17      }xs;
    18  };
    19
    20  while (my $input = <>) {
    21      if ($input =~ m{$rbb}) {
    22          say("matches: <$&>");
    23          say Dumper \%/;
    24      }
    25      else {
    26          say "Does not match";
    27      }
    28  }

Al ejecutar este programa vemos la diferencia en la interpretación de los blancos:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 tokenvsrule.pl
ab c d
matches: <ab c d>
$VAR1 = {
          '' => 'ab c d',
          's' => {
                   '' => 'ab c d',
                   'c' => 'c d',
                   'a' => 'ab'
                 }
        };

a b c d
Does not match
ab cd
matches: <ab cd>
$VAR1 = {
          '' => 'ab cd',
          's' => {
                   '' => 'ab cd',
                   'c' => 'cd',
                   'a' => 'ab'
                 }
        };
Obsérvese como la entrada a b c d es rechazada mientras que la entrada ab c d es aceptada.

Redefinición de los espacios en blanco

In a rule, any sequence of whitespace (except those at the very start and the very end of the rule) is treated as matching the implicit subrule <.ws>, which is automatically predefined to match optional whitespace (i.e. \s*).

You can explicitly define a <ws> token to change that default behaviour. For example, you could alter the definition of whitespace to include Perlish comments, by adding an explicit <token: ws>:

                      <token: ws>
                         (?: \s+ | #[^\n]* )*

But be careful not to define <ws> as a rule, as this will lead to all kinds of infinitely recursive unpleasantness.
El siguiente ejemplo ilustra como redefinir <ws>:
pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsruleandws.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      no warnings 'uninitialized';
 9      qr{
10        <s>
11
12        <token: ws> (?: \s+ | /\* .*? \*/)*+
13
14        <rule: s> <a> <c>
15
16        <rule: c>  c d
17
18        <token: a>  a b
19
20      }xs;
21  };
22
23  while (my $input = <>) {
24      if ($input =~ m{$rbb}) {
25          say Dumper \%/;
26      }
27      else {
28          say "Does not match";
29      }
30  }
Ahora podemos introducir comentarios en la entrada:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 -w tokenvsruleandws.pl
ab /* 1 */ c d
$VAR1 = {
          '' => 'ab /* 1 */ c d',
          's' => {
                   '' => 'ab /* 1 */ c d',
                   'c' => 'c d',
                   'a' => 'ab'
                 }
        };

Llamando a las subreglas

To invoke a rule to match at any point, just enclose the rule’s name in angle brackets (like in Perl 6). There must be no space between the opening bracket and the rulename. For example:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               <name>            # Call <rule: name>
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

If you need to match a literal pattern that would otherwise look like a subrule call, just backslash-escape the leading angle:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

El siguiente programa ilustra algunos puntos discutidos en la cita anterior:

casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n badbracket.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      qr{
 9        (<pp>)
10
11        <rule: pp>   \( (?: <b  > | \< | < escape> | <pp> )* \)
12
13        <token: b  > b
14
15        <token: escape> \\.
16
17      }xs;
18  };
19
20  while (my $input = <>) {
21      while ($input =~ m{$rbb}g) {
22          say("matches: <$&>");
23          say Dumper \%/;
24      }
25  }

Obsérvense los blancos en < escape> y en <token: b > b. Pese a ello el programa funciona:

casiano@millo:~/src/perl/regexp-grammar-examples$ perl5.10.1 badbracket.pl
(\(\))
matches: <(\(\))>
$VAR1 = {
          '' => '(\\(\\))',
          'pp' => {
                    '' => '(\\(\\))',
                    'escape' => '\\)'
                  }
        };

(b)
matches: <(b)>
$VAR1 = {
          '' => '(b)',
          'pp' => {
                    '' => '(b)',
                    'b' => 'b'
                  }
        };

(<)
matches: <(<)>
$VAR1 = {
          '' => '(<)',
          'pp' => '(<)'
        };

(c)

casiano@millo:

Eliminación del anidamiento de ramas unarias en %/

...Note, however, that if the result-hash at any level contains only the empty-string key (i.e. the subrule did not call any sub-subrules or save any of their nested result-hashes), then the hash is unpacked and just the matched substring itself if returned.

For example, if <rule: sentence> had been defined:

    <rule: sentence>
        I see dead people

then a successful call to the rule would only add:

    sentence => 'I see dead people'

to the current result-hash.

This is a useful feature because it prevents a series of nested subrule calls from producing very unwieldy data structures. For example, without this automatic unpacking, even the simple earlier example:

    <rule: sentence>
        <noun> <verb> <object>

would produce something needlessly complex, such as:

    sentence => {
        ""     => 'I saw a dog',
        noun   => {
            "" => 'I',
        },
        verb   => {
            "" => 'saw',
        },
        object => {
            ""      => 'a dog',
            article => {
                "" => 'a',
            },
            noun    => {
                "" => 'dog',
            },
        },
    }

El siguiente ejemplo ilustra este punto:

pl@nereida:~/Lregexpgrammars/demo$ cat -n unaryproductions.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      qr{
 9        <s>
10
11        <rule: s> <noun> <verb> <object>
12
13        <token: noun> he | she | Peter | Jane
14
15        <token: verb> saw | sees
16
17        <token: object> a\s+dog | a\s+cat
18
19      }x;
20  };
21
22  while (my $input = <>) {
23      while ($input =~ m{$rbb}g) {
24          say("matches: <$&>");
25          say Dumper \%/;
26      }
27  }

Sigue una ejecución del programa anterior:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 unaryproductions.pl
he saw a dog
matches: <he saw a dog>
$VAR1 = {
          '' => 'he saw a dog',
          's' => {
                   '' => 'he saw a dog',
                   'object' => 'a dog',
                   'verb' => 'saw',
                   'noun' => 'he'
                 }
        };

Jane sees a cat
matches: <Jane sees a cat>
$VAR1 = {
          '' => 'Jane sees a cat',
          's' => {
                   '' => 'Jane sees a cat',
                   'object' => 'a cat',
                   'verb' => 'sees',
                   'noun' => 'Jane'
                 }
        };

Ámbito de uso de Regexp::Grammars

Cuando se usa Regexp::Grammars como parte de un programa que utiliza otras regexes hay que evitar que Regexp::Grammars procese las mismas. Regexp::Grammars reescribe las expresiones regulares durante la fase de preproceso. Esta por ello presenta las mismas limitaciones que cualquier otra forma de 'source filtering' (véase perlfilter). Por ello es una buena idea declarar la gramática en un bloque do restringiendo de esta forma el ámbito de acción del módulo.

 5  my $calculator = do{
 6      use Regexp::Grammars;
 7      qr{
 .          ........
28      }xms
29  };

Casiano Rodríguez León
2009-12-09