A lexical analyzer is a program or subroutine that processes the input and returns an ordered sequence of tokens. This process is called lexical analysis. Like (F)Lex and many other scanners, Blex uses regular expressions to match the input and return a Token.
This page describes the internal details of Blex. Since it is written entirely in Python, it is implemented as a class.
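To make the idea of regular-expression scanning concrete, here is a minimal sketch built only on Python's standard re module. It is not the Blex implementation: the PATTERNS table, the tokenize function, and the token IDs are hypothetical names chosen for illustration.

```python
import re

# Hypothetical pattern table: (compiled regular expression, token ID) pairs.
# A token ID of None means "match this text but do not report it".
PATTERNS = [
    (re.compile(r"[0-9]+"), "NUMBER"),
    (re.compile(r"[A-Za-z_][A-Za-z0-9_]*"), "ID"),
    (re.compile(r"[-+*/=]"), "OP"),
    (re.compile(r"[ \t]+"), None),
]


def tokenize(text):
    """Return the ordered list of (token_id, lexeme) pairs found in text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for regex, token_id in PATTERNS:
            match = regex.match(text, pos)  # try to match exactly at pos
            if match:
                if token_id is not None:
                    tokens.append((token_id, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise SyntaxError("unexpected character %r at offset %d" % (text[pos], pos))
    return tokens


result = tokenize("x = 42 + y")
```

In this sketch the first pattern that matches wins, so the order of the table matters; a real scanner may use a different rule.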
The Blex Class
The Blex class defines the lexical scanner. It provides several methods and properties to manage the scanner. Read the Blex tutorial for the typical steps on creating and using the Blex scanner.
A Blex instance has the following properties:
- The current line number in the input. This is an integer. Since the input is assumed to be text, it increases whenever the Blex.EOL string is found. By default, Blex.EOL is '\n', but you can change it so that the line counter increases under other circumstances. You can even modify the line number yourself (use with care). Blex does not use this value internally, but you may find it useful for reporting, for example, a syntax error together with the line number where it occurs.
- A dictionary (a hash table) that associates every status name (a string, like 'INITIAL') with its related status object.
- A string containing the name of the current Blex status. Blex looks this name up in its status_table property to get the status instance information (for example, which patterns can be matched). Only patterns defined within the current status are matched against the input. See Blex Algorithm for a brief, general description of how the input is processed.
- A (sometimes very long) string containing the input buffer. Since the buffer can be any object, you can replace it with a stream object, provided it allows indexed access (e.g. by overloading the '[...]' operator through the __getitem__ method).
- An integer counter that points to the current character being processed in the buffer. This value ranges from 0 to len(buffer)-1.
- A boolean (True | False) value, which sets the Blex operation mode.
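The buffer and line-counter properties described above can be sketched as follows. This is an illustration under stated assumptions, not Blex code: StreamBuffer and count_lines are hypothetical names, EOL is assumed to be a single character mirroring the default Blex.EOL, and the line counter is assumed to start at 1.

```python
import io

EOL = "\n"  # assumed to mirror the default value of Blex.EOL


class StreamBuffer:
    """Hypothetical buffer that wraps a seekable stream and allows buffer[i]."""

    def __init__(self, stream):
        self._stream = stream
        # Seeking to the end returns the stream length (works for io.StringIO).
        self._length = stream.seek(0, io.SEEK_END)

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch only supports integer indexes")
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("buffer index out of range")
        self._stream.seek(index)
        return self._stream.read(1)


def count_lines(buffer):
    """Walk the buffer with a lookahead index, counting a line per EOL found."""
    lineno = 1
    lookahead = 0
    while lookahead < len(buffer):
        if buffer[lookahead] == EOL:
            lineno += 1
        lookahead += 1
    return lineno


lines_in_stream = count_lines(StreamBuffer(io.StringIO("a\nb\nc")))
lines_in_string = count_lines("a\nb\nc")
```

Because count_lines only relies on len() and integer indexing, it accepts a plain string and the stream wrapper interchangeably, which is the property the buffer description depends on.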
The Blex class provides methods to interact with its instances and to read or set some of their properties:
- Use this to change the input to a new string (e.g. the content of a source code file). When this function is invoked, the previous buffer is discarded and the lookahead property is set to 0 (the beginning of the buffer).
- Creates a new status and gives it a name. For example, to create a status called 'COMMENT', just write: scanner.add_status('COMMENT').
- add_token(Pattern, Token_ID [, status])
- Creates a pattern and associates it with the given Token_ID. Optionally you can pass a status string or a list of strings, so this pattern will only match under the given status(es). By default, if no status is specified, the pattern will match under any status. See the Adding Tokens section for more information.
- add_tokens(dict [,status])
- Like the above, but lets you specify several tokens at once. The dict parameter is a Python dictionary object (hash) containing pattern:Token_ID pairs. See Adding Tokens for more information.
- Discards the input until end of line (end of line is defined in the global constant Blex.EOL).
- This is the scanner method. Every time this method is called, a new token instance is returned (and the lookahead property updated accordingly). Scanners using Blex should call Blex.lex() every time they need a new token to parse the input.
- Returns the next token (by calling Blex.lex()) without modifying either the scanner status or the lookahead input pointer. Keep in mind that pattern actions will still be executed, so use with care.
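The way these methods fit together can be sketched with a small stand-in class. ToyScanner is not Blex and does not reproduce its internals: set_input, skip_line, peek, and the any_status list are hypothetical names, tokens are plain (token_id, lexeme) tuples rather than token instances, lex returning None at end of input is an assumption, and pattern actions are left out entirely. Only add_status, add_token, add_tokens, and lex take their names from the descriptions above.

```python
import re


class ToyScanner:
    """Stand-in for illustration only; this is NOT the Blex implementation."""

    EOL = "\n"

    def __init__(self):
        self.status_table = {"INITIAL": []}  # status name -> [(regex, token_id)]
        self.any_status = []                 # patterns added without a status
        self.status = "INITIAL"
        self.buffer = ""
        self.lookahead = 0
        self.lineno = 1                      # assumed to start at 1

    def set_input(self, text):               # hypothetical method name
        """Discard the previous buffer and rewind lookahead to 0."""
        self.buffer = text
        self.lookahead = 0

    def add_status(self, name):
        self.status_table.setdefault(name, [])

    def add_token(self, pattern, token_id, status=None):
        entry = (re.compile(pattern), token_id)
        if status is None:
            self.any_status.append(entry)    # matches under any status
            return
        names = [status] if isinstance(status, str) else list(status)
        for name in names:
            self.status_table[name].append(entry)

    def add_tokens(self, tokens, status=None):
        for pattern, token_id in tokens.items():
            self.add_token(pattern, token_id, status)

    def skip_line(self):                     # hypothetical method name
        """Discard input up to and including the next EOL."""
        end = self.buffer.find(self.EOL, self.lookahead)
        if end == -1:
            self.lookahead = len(self.buffer)
        else:
            self.lookahead = end + len(self.EOL)
            self.lineno += 1

    def lex(self):
        """Return the next (token_id, lexeme) pair, or None at end of input."""
        if self.lookahead >= len(self.buffer):
            return None
        for regex, token_id in self.status_table[self.status] + self.any_status:
            match = regex.match(self.buffer, self.lookahead)
            if match and match.end() > self.lookahead:
                self.lookahead = match.end()
                self.lineno += match.group().count(self.EOL)
                return (token_id, match.group())
        raise SyntaxError("no pattern matches at line %d" % self.lineno)

    def peek(self):                          # hypothetical method name
        """Return the next token, then restore status, lookahead and lineno."""
        saved = (self.status, self.lookahead, self.lineno)
        try:
            return self.lex()
        finally:
            self.status, self.lookahead, self.lineno = saved


scanner = ToyScanner()
scanner.add_status("COMMENT")
scanner.add_tokens({r"[0-9]+": "NUMBER", r"[ \t]+": "SPACE"})
scanner.add_token(r"[a-z]+", "WORD", "INITIAL")  # only matches under INITIAL
scanner.set_input("12 ab")
upcoming = scanner.peek()   # looks at the NUMBER token without consuming it
first = scanner.lex()       # consumes the same token
```

In this sketch, patterns registered without a status are kept in a separate list that is consulted under every status, which is one simple way to get the "matches under any status" behaviour; the source does not say how Blex stores them.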