Blex Tutorial
From BorielWiki
Step by Step Tutorial
This tutorial will show how to use the Blex class to create a simple lexical scanner.
The Blexer Module
The blexer module (blexer.py file) contains the definitions of the Blex class, the Token class and the Pattern classes. The first thing to do is import the names you need into your module:
from blexer import Blex, Token, YY_EOF
Now, create a Blex instance. In this example, we'll call it scanner.
scanner = Blex()
Then we'll set our scanner to use the greedy operation mode:
scanner.GREEDY = True
Adding Tokens
The next step is to tell the Blex instance which patterns it should expect. The method add_token allows us to define a pattern using a regular expression.
scanner.add_token('if', 'IF')
This way, we have defined the token 'IF'. The scanner will return an IF Token instance every time the input matches the word if. The add_token method takes 2 parameters, as defined in the Blex class. The first is a regular expression string that will be used to match the input. The second can be either a string containing a Token Name (e.g. 'IF', 'IDENTIFIER', 'EQ') or a Python tuple containing (Token Name, Hook function). The Token hook is a normal Python method or function (see Token hook for how to implement a pattern action).
Let's add some other tokens at once:
scanner.add_tokens({'then':'THEN', 'else':'ELSE', r'\(':'LP', r'\)':'RP',
'[0-9]+':'INTEGER', ';':'SC', '=':'EQ', r'\+':'PLUS', '<':'LT'})
The above statement is equivalent to:
scanner.add_token('then', 'THEN')
scanner.add_token('else', 'ELSE')
scanner.add_token(r'\(', 'LP') # Left Parenthesis "escaped"
scanner.add_token(r'\)', 'RP') # Right Parenthesis "escaped"
scanner.add_token('[0-9]+', 'INTEGER')
scanner.add_token(';', 'SC')
scanner.add_token('=', 'EQ')
scanner.add_token(r'\+', 'PLUS') # Plus sign, "escaped"
scanner.add_token('<', 'LT')
But this second form requires many more statements.
Warning: It's a good idea to use raw Python strings (prefixed with r) when writing regular expressions.
This matters for patterns that contain escaped (backslashed) characters, as in the example. Notice that the backslashed strings are r-prefixed.
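To see why the r prefix matters, the following small sketch compares raw and non-raw pattern strings. It uses only Python's standard re module, not Blex itself, and the variable names are illustrative.

```python
import re

# A raw string keeps the backslash as written, so both spellings are the same pattern.
raw_lp = r'\('
escaped_lp = '\\('
same_pattern = (raw_lp == escaped_lp)

# Without the r prefix, Python interprets some escapes before re ever sees them:
# '\b' is a single backspace character, while r'\b' is a backslash followed by 'b',
# which the re module reads as a word boundary.
plain_len = len('\b')
raw_len = len(r'\b')

# The raw pattern matches a literal left parenthesis.
lp_match = re.match(raw_lp, '(a < 5)')
```

Writing every pattern as a raw string avoids having to remember which escapes Python would otherwise consume.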
So far, our scanner recognizes the following tokens: 'IF', 'THEN', 'ELSE', 'LP', 'RP', and some others. We still need to define a SEPARATOR (tabs, spaces, newlines). Since we want separators to be ignored, we will use a Token hook to skip them.
scanner.add_token('[ \t\n]+', ('SEPARATOR', Token.skip))
Notice the different syntax. Now the 2nd parameter is a python tuple containing (Token Name, Token Hook).
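The idea behind a (Token Name, Token Hook) pair can be illustrated without Blex. The following is a minimal sketch built on Python's standard re module. The names RULES, skip and tokenize are hypothetical and say nothing about how Blex is implemented internally. Here a hook that returns None drops the token, which mimics ignoring separators.

```python
import re

def skip(name, text):
    """Hypothetical hook: returning None means 'discard this token'."""
    return None

# Each rule is (compiled regex, token name, optional hook). Illustrative names only.
RULES = [
    (re.compile(r'if'), 'IF', None),
    (re.compile(r'[0-9]+'), 'INTEGER', None),
    (re.compile(r'[ \t\n]+'), 'SEPARATOR', skip),
]

def tokenize(text):
    """Return a list of (name, text) pairs, trying the rules in order at each position."""
    pos, result = 0, []
    while pos < len(text):
        for regex, name, hook in RULES:
            m = regex.match(text, pos)
            if m:
                token = (name, m.group())
                if hook is not None:
                    token = hook(name, m.group())  # the hook may discard the token
                if token is not None:
                    result.append(token)
                pos = m.end()
                break
        else:
            raise ValueError('no rule matches at position %d' % pos)
    return result

tokens = tokenize('if 42\n')
```

In this sketch the separators are matched and consumed like any other pattern, but they never reach the caller.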
Defining the Input Buffer
We're almost done! The input buffer property is just a string. Let's add a simple python triple-quoted string:
scanner.set_buffer('''
if (a < 5) then
a = a + 1;
''')
Getting input from a file
Usually a compiler takes its input from a text file, called the source file. If you want to get the input from a file, you could write something like:
f = open(filename, 'rt')
scanner.set_buffer(f.read())
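When reading from a file it is common to use a with block, so that the file is closed even if reading fails. The sketch below uses a hypothetical helper, read_source, and a temporary file so that it is self-contained. The resulting string is what would be passed to scanner.set_buffer.

```python
import os
import tempfile

def read_source(filename):
    """Return the whole content of a source file as one string."""
    with open(filename, 'r') as f:  # the file is closed when the block exits
        return f.read()

# Demonstration with a temporary file standing in for a real source file.
fd, path = tempfile.mkstemp(suffix='.src')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('if (a < 5) then\na = a + 1;\n')

source = read_source(path)
os.remove(path)
# source can now be handed to the scanner: scanner.set_buffer(source)
```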
Using the scanner
Ok. We have finished configuring our scanner. Now let's use it:
tok = Token('START') # Dummy token
while tok.id != YY_EOF:
    tok = scanner.lex()
    print tok.lineno, tok.id, "'%s'" % tok.text, tok.value
Python Listing
If you join all the above lines together, this is the complete listing:
from blexer import Blex, Token, YY_EOF

scanner = Blex()

scanner.add_token('if', 'IF')
scanner.add_tokens({'then':'THEN', 'else':'ELSE', r'\(':'LP', r'\)':'RP',
    '[0-9]+':'INTEGER', ';':'SC', '=':'EQ', r'\+':'PLUS', '<':'LT'})
scanner.add_token('[ \t\n]+', ('SEPARATOR', Token.skip))

scanner.set_buffer('''
if (a < 5) then
a = a + 1;
''')

scanner.GREEDY = True

tok = Token('START') # Dummy token
while tok.id != YY_EOF:
    tok = scanner.lex()
    print tok.lineno, tok.id, "'%s'" % tok.text, tok.value
First Execution
If you execute the above script, you will get the following:
2 IF 'if' None
2 LP '(' None
2 ERROR 'a' None
2 LT '<' None
2 INTEGER '5' 5.0
2 RP ')' None
2 THEN 'then' None
3 ERROR 'a' None
3 EQ '=' None
3 ERROR 'a' None
3 PLUS '+' None
3 INTEGER '1' 1.0
3 SC ';' None
4 EOF None
Each line contains the line number where the token appeared, the Token ID, the Token text and the numeric value if applicable (None otherwise). See Token class properties for the information that Blex.lex() returns.
Some things to notice:
- The identifier a is not being recognized. When a pattern is not recognized, the ERROR Token is returned. You should check this for errors.
- The token EOF (End Of File) is a special reserved token, returned at the end of input. You should use it to check for the end of input. Successive calls to scanner.lex() will continue returning EOF Token instances.
- The SEPARATOR (newlines and spaces) Token is not being printed because we told the scanner to skip them using the Token hook.
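The first two points suggest a driving loop that stops at EOF and records every ERROR token. The sketch below is one way to write it. It works with any object that offers a lex() method returning tokens with id, lineno and text attributes. FakeToken and FakeScanner are hypothetical stand-ins so that the example runs without blexer, and the 'EOF' and 'ERROR' ids are assumed from the printed output above. With Blex you would compare against YY_EOF instead.

```python
from collections import namedtuple

# Hypothetical stand-ins for blexer's Token and Blex, for demonstration only.
FakeToken = namedtuple('FakeToken', 'lineno id text')

class FakeScanner(object):
    def __init__(self, tokens):
        self._tokens = list(tokens)

    def lex(self):
        # Keep returning the last token (EOF) once the input is exhausted.
        if len(self._tokens) > 1:
            return self._tokens.pop(0)
        return self._tokens[0]

def scan_all(scanner, eof_id='EOF', error_id='ERROR'):
    """Collect tokens until EOF and return (valid tokens, error tokens)."""
    valid, errors = [], []
    while True:
        tok = scanner.lex()
        if tok.id == eof_id:
            break
        if tok.id == error_id:
            errors.append(tok)
        else:
            valid.append(tok)
    return valid, errors

demo = FakeScanner([
    FakeToken(2, 'IF', 'if'),
    FakeToken(2, 'ERROR', 'a'),
    FakeToken(4, 'EOF', None),
])
valid, errors = scan_all(demo)
for tok in errors:
    print("line %d: unrecognized input '%s'" % (tok.lineno, tok.text))
```

Collecting the errors instead of stopping at the first one lets the caller report every unrecognized piece of input in a single pass.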
Recognizing identifiers
To recognize the a identifier, we will add this pattern after line #7:
scanner.add_token('[a-zA-Z]+', 'IDENTIFIER') # Identifiers containing only letters
This will recognize identifiers made of lowercase and uppercase letters, but not identifiers containing digits, like alpha1 or counter32. The above is not a C identifier pattern definition. Use '[_A-Za-z][_A-Za-z0-9]*' for C identifiers.
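The difference between the two patterns can be checked with Python's standard re module alone. This sketch does not use Blex. The helper matches_fully is hypothetical and simply tests whether a pattern covers a whole string.

```python
import re

LETTERS_ONLY = '[a-zA-Z]+'
C_IDENTIFIER = '[_A-Za-z][_A-Za-z0-9]*'

def matches_fully(pattern, text):
    """True if the pattern matches the entire text, not just a prefix."""
    return re.match(r'(?:%s)\Z' % pattern, text) is not None

for name in ('a', 'alpha1', 'counter32', '_tmp', '1abc'):
    print('%-10s letters-only=%-5s c-identifier=%s' % (
        name,
        matches_fully(LETTERS_ONLY, name),
        matches_fully(C_IDENTIFIER, name)))
```

Names such as alpha1 and counter32 are fully covered only by the C-style pattern, while 1abc is rejected by both because neither pattern allows a leading digit.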
Rerun your .py source. Now you will get:
2 IF 'if' None
2 LP '(' None
2 IDENTIFIER 'a' None
2 LT '<' None
2 INTEGER '5' 5.0
2 RP ')' None
2 THEN 'then' None
3 IDENTIFIER 'a' None
3 EQ '=' None
3 IDENTIFIER 'a' None
3 PLUS '+' None
3 INTEGER '1' 1.0
3 SC ';' None
4 EOF None
The uppercase words are the Token ids. This is the token sequence that will be passed to the parser:
IF LP IDENTIFIER LT INTEGER RP THEN IDENTIFIER EQ IDENTIFIER PLUS INTEGER SC EOF

