NUM is its numerical value and the attribute associated with token VAR is the actual string.
Each time the parser requires a new token, the lexer returns the pair (token, attribute) that matched. Some tokens, like PRINT, do not carry any special information. In such cases, to keep the protocol simple, the lexer returns the pair (token, token). In Eyapp terminology such tokens are called syntactic tokens. In contrast, semantic tokens are those tokens, like VAR or NUM, whose attributes carry useful information. When the end of input is reached, the lexer returns the pair ('', undef).
sub Lex {
  my($parser)=shift;

  for ($parser->YYData->{INPUT}) {
    m{\G[ \t]*}gc;
    m{\G\n}gc                      and $lineno++;
    m{\G([0-9]+(?:\.[0-9]+)?)}gc   and return('NUM',$1);
    m{\Gprint}gc                   and return('PRINT', 'PRINT');
    m{\G([A-Za-z][A-Za-z0-9_]*)}gc and return('VAR',$1);
    m{\G(.)}gc                     and return($1,$1);
    return('',undef); # End of input
  }
}
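To see the protocol in action, the following sketch drives Lex by hand on a small input and prints each pair it returns. It is only an illustration under stated assumptions: MockParser is a hypothetical stand-in for the generated parser object and provides nothing but a YYData accessor returning a hash reference, and tokens_of is a helper introduced here for the demonstration. In a real program the parser generated by Eyapp calls the lexer itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the generated parser (an assumption of this
# sketch): it only offers the YYData accessor that the lexer relies on.
package MockParser;
sub new {
    my ($class) = @_;
    return bless({ data => {} }, $class);
}
sub YYData { return $_[0]{data} }

package main;

my $lineno = 1;

# Same lexer as in the text, repeated so that the sketch is self-contained.
sub Lex {
    my ($parser) = shift;
    for ($parser->YYData->{INPUT}) {
        m{\G[ \t]*}gc;                                          # skip blanks
        m{\G\n}gc                      and $lineno++;           # count lines
        m{\G([0-9]+(?:\.[0-9]+)?)}gc   and return ('NUM', $1);
        m{\Gprint}gc                   and return ('PRINT', 'PRINT');
        m{\G([A-Za-z][A-Za-z0-9_]*)}gc and return ('VAR', $1);
        m{\G(.)}gc                     and return ($1, $1);
        return ('', undef);                                     # end of input
    }
}

# Hypothetical helper: collect every (token, attribute) pair until the
# end-of-input pair ('', undef) is seen.
sub tokens_of {
    my ($input) = @_;
    my $parser = MockParser->new;
    $parser->YYData->{INPUT} = $input;
    my @pairs;
    while (1) {
        my ($token, $attr) = Lex($parser);
        last if $token eq '';
        push @pairs, [$token, $attr];
    }
    return @pairs;
}

printf "(%s, %s)\n", @$_ for tokens_of("a = 2.5\nprint a");
```

For this input the sketch prints (VAR, a), (=, =), (NUM, 2.5), (PRINT, PRINT) and (VAR, a), one pair per line. The semantic tokens VAR and NUM carry the matched text as attribute, while the syntactic tokens PRINT and = simply repeat the token name.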
Lexical analyzers can have a non-negligible impact on overall performance. Ways to speed up this stage can be found in the works of Simoes [7] and Tambouras [8].