Lexical Analysis

Lexical analysis decomposes the input stream into a sequence of lexical units called tokens. Associated with each token is an attribute that carries the corresponding information. In the code example below, the attribute associated with the token NUM is its numerical value, and the attribute associated with the token VAR is the matched string. Each time the parser requires a new token, the lexer returns the pair (token, attribute) that matched. Some tokens, like PRINT, do not carry any special information; in such cases, just to keep the protocol simple, the lexer returns the pair (token, token). In Eyapp terminology such tokens are called syntactic tokens. Conversely, semantic tokens are those, like VAR or NUM, whose attributes transport useful information. When the end of the input is reached, the lexer returns the pair ('', undef).

sub Lex {
  my ($parser) = shift;

  for ($parser->YYData->{INPUT}) {
    m{\G[ \t]*}gc;                      # skip blanks and tabs
    m{\G\n}gc                           # count the newline, then rescan;
      and do { $lineno++; redo };       # without the redo, a second newline
                                        # or leading blanks on the next line
                                        # would fall through to end of input
                                        # ($lineno is declared at file scope)
    m{\G([0-9]+(?:\.[0-9]+)?)}gc
      and return ('NUM', $1);           # semantic token: numeric literal
    m{\Gprint}gc
      and return ('PRINT', 'PRINT');    # syntactic token: keyword
    m{\G([A-Za-z][A-Za-z0-9_]*)}gc
      and return ('VAR', $1);           # semantic token: identifier
    m{\G(.)}gc
      and return ($1, $1);              # any other character is its own token
    return ('', undef);                 # end of input
  }
}
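To make the (token, attribute) protocol concrete, here is a sketch of the same lexer re-implemented in Python; the names SPEC and tokens are illustrative, not part of Eyapp, and ('', None) plays the role of ('', undef):

```python
import re

# Token specification mirroring the Perl lexer above (illustrative names,
# not part of Eyapp). Order matters: 'print' is tried before VAR, and
# whitespace is consumed without producing a token.
SPEC = [
    ('SKIP',  re.compile(r'[ \t\n]+')),
    ('NUM',   re.compile(r'[0-9]+(?:\.[0-9]+)?')),
    ('PRINT', re.compile(r'print')),
    ('VAR',   re.compile(r'[A-Za-z][A-Za-z0-9_]*')),
]

def tokens(source):
    """Yield (token, attribute) pairs, ending with ('', None)."""
    pos = 0
    while pos < len(source):
        for name, pattern in SPEC:
            m = pattern.match(source, pos)
            if m:
                pos = m.end()
                if name == 'SKIP':
                    pass                      # whitespace carries no token
                elif name == 'PRINT':
                    yield ('PRINT', 'PRINT')  # syntactic token
                else:
                    yield (name, m.group())   # semantic token
                break
        else:
            ch = source[pos]                  # any other single character
            pos += 1                          # is returned as its own token
            yield (ch, ch)
    yield ('', None)                          # end of input
```

For the input "print a 4.5" this yields ('PRINT', 'PRINT'), ('VAR', 'a'), ('NUM', '4.5') and finally ('', None), exactly the stream of pairs the parser consumes.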

Lexical analyzers can have a non-negligible impact on overall performance. Ways to speed up this stage can be found in the works of Simoes [7] and Tambouras [8].
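One common speed-up, sketched here in Python (this is a generic technique, not taken from [7] or [8]), is to precompile all token patterns into a single alternation of named groups, so each scanning step performs one regex match instead of trying the rules one by one:

```python
import re

# One combined, precompiled pattern; each alternative is a named group.
# The first matching alternative wins, so 'print' must precede VAR.
MASTER = re.compile(r"""
    (?P<SKIP>[ \t\n]+)
  | (?P<NUM>[0-9]+(?:\.[0-9]+)?)
  | (?P<PRINT>print)
  | (?P<VAR>[A-Za-z][A-Za-z0-9_]*)
  | (?P<CHAR>.)
""", re.VERBOSE)

def fast_tokens(source):
    """Yield (token, attribute) pairs using one regex match per step."""
    for m in MASTER.finditer(source):
        kind = m.lastgroup            # name of the alternative that matched
        if kind == 'SKIP':
            continue                  # whitespace carries no token
        if kind == 'PRINT':
            yield ('PRINT', 'PRINT')  # syntactic token
        elif kind == 'CHAR':
            yield (m.group(), m.group())
        else:
            yield (kind, m.group())   # semantic token
    yield ('', None)                  # end of input
```

Because every alternative consumes at least one character, finditer walks the whole input; the regex engine dispatches among the rules internally, which is typically faster than a Python-level loop over separate patterns.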



Procesadores de Lenguaje 2007-03-01