Blex Tutorial

From BorielWiki

Step by Step Tutorial

This tutorial will show how to use the Blex class to create a simple lexical scanner.

The Blexer Module

The blexer module (the blexer.py file) contains the definitions of the Blex, Token and Pattern classes. The first thing to do is import them into your module:

from blexer import Blex, Token, YY_EOF

Now, create a Blex instance. In this example, we'll call it scanner.

scanner = Blex()

Then we'll set our scanner to use the greedy operation mode:

scanner.GREEDY = True
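To see what greedy mode means, here is a minimal illustration using Python's re module. This is not Blex itself, just a sketch of the longest-match rule: when both a keyword pattern and an identifier pattern could match, a greedy scanner prefers the longest match at the current position.

```python
import re

# Sketch of the greedy (longest-match) rule; not part of blexer itself.
patterns = [('IF', 'if'), ('IDENTIFIER', '[a-zA-Z]+')]

def longest_match(text):
    """Return the (name, lexeme) pair with the longest match at position 0."""
    best = None
    for name, pat in patterns:
        m = re.match(pat, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# 'ifactive' is one identifier, not the keyword 'if' followed by 'active':
assert longest_match('ifactive') == ('IDENTIFIER', 'ifactive')
# On the exact keyword both patterns match 'if'; the first rule wins the tie:
assert longest_match('if') == ('IF', 'if')
```

Without greedy (longest-match) behaviour, a scanner could stop as soon as 'if' matched, splitting identifiers that merely start with a keyword.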

Adding Tokens

The next step is to tell the Blex instance which patterns it should expect. The method add_token allows us to define a pattern using a regular expression.

scanner.add_token('if', 'IF')

This way, we have defined the token 'IF'. The scanner will return an IF Token instance every time the input matches the word if. add_token takes two parameters, as defined in the Blex class. The former is a regular expression string that will be used to match the input. The latter can be either a string containing a Token name (e.g. 'IF', 'IDENTIFIER', 'EQ') or a Python tuple containing (Token name, hook function). A Token hook is a normal Python method or function (see Token hook for how to implement a pattern action).

Let's add some other tokens at once:

scanner.add_tokens({'then':'THEN', 'else':'ELSE', r'\(':'LP', r'\)':'RP',
    '[0-9]+':'INTEGER', ';':'SC', '=':'EQ', r'\+':'PLUS', '<':'LT'})

The above sentence is equivalent to:

scanner.add_token('then', 'THEN')
scanner.add_token('else', 'ELSE')
scanner.add_token(r'\(', 'LP')  # Left Parenthesis "escaped"
scanner.add_token(r'\)', 'RP')  # Right Parenthesis "escaped"
scanner.add_token('[0-9]+', 'INTEGER')
scanner.add_token(';', 'SC')
scanner.add_token(r'\+', 'PLUS') # Plus sign, "escaped"
scanner.add_token('<', 'LT')

But this last approach requires one statement per token.
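The equivalence can be pictured with a small stand-in scanner. Both DummyScanner and this add_tokens are illustrative, not the real blexer implementation: the point is that add_tokens behaves as if add_token were called once per dictionary entry.

```python
class DummyScanner:
    """Stand-in for Blex that just records the registered rules."""
    def __init__(self):
        self.rules = {}

    def add_token(self, pattern, name):
        self.rules[pattern] = name

def add_tokens(scanner, table):
    # Equivalent to calling add_token once per (pattern, name) entry.
    for pattern, name in table.items():
        scanner.add_token(pattern, name)

s = DummyScanner()
add_tokens(s, {'then': 'THEN', r'\)': 'RP', '[0-9]+': 'INTEGER'})
assert s.rules == {'then': 'THEN', r'\)': 'RP', '[0-9]+': 'INTEGER'}
```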

Warning: It's a good idea to use raw Python strings (prefixed with r) when writing regular expressions.

This matters when a pattern contains escaped (backslashed) characters, as in the example. Notice that the backslashed strings above are r-prefixed.

So far, our scanner recognizes the tokens 'IF', 'THEN', 'ELSE', 'LP', 'RP', and some others. We still need to define a SEPARATOR (tabs, whitespace). Since we want separators to be ignored, we will use a Token hook to skip them.

scanner.add_token('[ \t\n]+', ('SEPARATOR', Token.skip))

Notice the different syntax: the second parameter is now a Python tuple containing (Token name, Token hook).
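The separator pattern itself is an ordinary regular expression. A quick check with Python's re module (independent of Blex) confirms it swallows a whole run of spaces, tabs and newlines in a single match:

```python
import re

# The same separator pattern passed to add_token above.
separator = re.compile('[ \t\n]+')

# One match consumes the entire whitespace run:
assert separator.match('  \t\n if').group() == '  \t\n '
# It does not match input that starts with a non-separator character:
assert separator.match('if then') is None
```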

Defining the Input Buffer

We're almost done! The input buffer property is just a string. Let's set it to a simple Python triple-quoted string:

scanner.set_buffer('''
    if (a < 5) then
       a = a + 1;
    ''')

Getting input from a file

Usually a compiler takes its input from a text file called the source file. If you want to read the input from a file, you could write something like:

f = open(filename, 'rt')
scanner.set_buffer(f.read())
f.close()  # don't leave the file handle open
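A with statement closes the file automatically, even if reading fails. The sketch below uses a DummyScanner stand-in (and a temporary file) only so that it runs on its own; with the real Blex you would simply wrap the open/read in the same with block.

```python
import os
import tempfile

class DummyScanner:
    """Stand-in for Blex so this snippet is self-contained."""
    def __init__(self):
        self.buffer = ''
    def set_buffer(self, text):
        self.buffer = text

# Create a small source file for the demonstration.
fd, path = tempfile.mkstemp(suffix='.src')
os.close(fd)
with open(path, 'wt') as f:
    f.write('if (a < 5) then a = a + 1;')

# The context manager guarantees the file is closed after reading.
scanner = DummyScanner()
with open(path, 'rt') as f:
    scanner.set_buffer(f.read())
os.remove(path)

assert scanner.buffer == 'if (a < 5) then a = a + 1;'
```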

Using the scanner

Ok. We have finished configuring our scanner. Now let's use it:

tok = Token('START') # Dummy token
while tok.id != YY_EOF:
    tok = scanner.lex()
    print tok.lineno, tok.id, "'%s'" % tok.text, tok.value

Python Listing

If you join all the above lines together, this is the complete listing:

from blexer import Blex, Token, YY_EOF
 
scanner = Blex()
scanner.add_token('if', 'IF')
scanner.add_tokens({'then':'THEN', 'else':'ELSE', r'\(':'LP', r'\)':'RP',
    '[0-9]+':'INTEGER', ';':'SC', '=':'EQ', r'\+':'PLUS', '<':'LT'})
scanner.add_token('[ \t\n]+', ('SEPARATOR', Token.skip))
scanner.set_buffer('''
    if (a < 5) then
       a = a + 1;
    ''')
scanner.GREEDY = True
 
tok = Token('START') # Dummy token
while tok.id != YY_EOF:
    tok = scanner.lex()
    print tok.lineno, tok.id, "'%s'" % tok.text, tok.value

First Execution

If you execute the above script, you will get the following:

2 IF 'if' None
2 LP '(' None
2 ERROR 'a' None
2 LT '<' None
2 INTEGER '5' 5.0
2 RP ')' None
2 THEN 'then' None
3 ERROR 'a' None
3 EQ '=' None
3 ERROR 'a' None
3 PLUS '+' None
3 INTEGER '1' 1.0
3 SC ';' None
4 EOF  None

Each line contains the line number where the token appeared, the Token ID, the Token text, and its numeric value if applicable (None otherwise). See the Token class properties for details on what Blex.lex() returns.

Some things to notice:

  1. The identifier a is not being recognized. When no pattern matches the input, the ERROR Token is returned. You should check for it to detect errors.
  2. The EOF (End Of File) token is a special reserved token, returned at the end of the input. You should use it to check for end of input. Successive calls to scanner.lex() will keep returning EOF Token instances.
  3. The SEPARATOR tokens (newlines and spaces) are not printed because we told the scanner to skip them using the Token hook.
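The EOF behaviour in point 2 is the classic sentinel pattern. A self-contained sketch, with a toy lexer standing in for scanner.lex(), shows the loop shape:

```python
YY_EOF = 'EOF'  # stand-in for blexer's YY_EOF sentinel

def make_lexer(token_ids):
    """Toy lexer: yields each token id once, then EOF forever after."""
    it = iter(token_ids)
    def lex():
        return next(it, YY_EOF)
    return lex

lex = make_lexer(['IF', 'LP', 'RP'])
seen = []
tok = 'START'            # dummy start token, as in the tutorial's loop
while tok != YY_EOF:
    tok = lex()
    seen.append(tok)

assert seen == ['IF', 'LP', 'RP', 'EOF']
# Further calls keep returning the EOF sentinel:
assert lex() == YY_EOF
```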

Recognizing identifiers

To recognize the a identifier, we will add this pattern after line 7 of the listing above (the SEPARATOR definition):

scanner.add_token('[a-zA-Z]+', 'IDENTIFIER') # Identifiers containing only letters

This will recognize lower- and uppercase identifiers, but not identifiers containing digits, like alpha1 or counter32. The pattern above is not a C identifier pattern; use '[_A-Za-z][_A-Za-z0-9]*' for C identifiers.
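You can verify the difference between the two patterns directly with Python's re module:

```python
import re

letters_only = re.compile('[a-zA-Z]+')
c_identifier = re.compile('[_A-Za-z][_A-Za-z0-9]*')

# The letters-only pattern stops at the first digit:
assert letters_only.match('alpha1').group() == 'alpha'
# The C identifier pattern accepts trailing digits and underscores:
assert c_identifier.match('alpha1').group() == 'alpha1'
assert c_identifier.match('counter32').group() == 'counter32'
assert c_identifier.match('_tmp_1').group() == '_tmp_1'
```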

Rerun your .py script. Now you will get:

2 IF 'if' None
2 LP '(' None
2 IDENTIFIER 'a' None
2 LT '<' None
2 INTEGER '5' 5.0
2 RP ')' None
2 THEN 'then' None
3 IDENTIFIER 'a' None
3 EQ '=' None
3 IDENTIFIER 'a' None
3 PLUS '+' None
3 INTEGER '1' 1.0
3 SC ';' None
4 EOF  None

The uppercase words are the Token IDs. This is the token sequence that will be passed to the parser:

IF LP IDENTIFIER LT INTEGER RP THEN IDENTIFIER EQ IDENTIFIER PLUS INTEGER SC EOF