Chris Lattner
Project Proposal
Due: 20 Mar 2000
CS492 - Senior Seminar

Lexers are powerful tools that abstract the process of tokenizing symbols to a high level. Lexers provide power to the programmer, reducing the likeliness of errors, corner cases, and generally improve performance over hand written scanners. The Hasklex lexer accepts a relatively standard variety of regular expressions, turning a list of regex’s and their token values into a scanner object. This object may then be used to process a string in a variety of ways. In addition, the regular expression library is exposed, allowing singular matches to occur.

The Hasklex library is built up of several useful layers, each of which is exposed to the user. The most basic layer allows string matches to be performed against a DFA. Built on that is a regular expression layer, and then the Hasklex lexer is built out of the aggregate whole. The user may chose to import only the parts that they will use.

The DFA Module

The DFA Module provides no functions to build a DFA, only functions to modify one, test against one, or return statistics about one. As such, the following functions are exposed:

matchDFA :: FA -> String -> Bool

matchDFA takes a preconstructed DFA, and returns a boolean value indicating whether or not the string matches it.

matchDFALength :: FA -> String -> Int

matchDFALength is the equivalent function of matchDFA, but matches the start of a string, returning the longest match possible. The length of the match is returned.

matchDFALengthState :: FA -> String -> (Int, Int)

matchDFALengthState is also a function that matches a DFA against the start of a string, returning information about the longest possible match. The first number of the returned pair is the length of the match, and the second is the ID number of the final state that was reached to permit this match. This unique number may be set by the setFA_ID function, described below.

setFA_ID :: Int -> FA -> FA

setFA_ID is used to set the ID number of all of the final states of the DFA to a specific number. This only modifies final states that have an ID number of zero, so once they are set to something different, they cannot be reset by this function.

measureFA :: FA -> IO ()

measureFA inspects a DFA and prints information including the number of states, transitions and final states in the finite automata.

The RegEx Module

The RegEx module encapsulates a powerful regular expression library into an easy to use package. It exports only 3 functions, 2 of which are trivially implemented. This regular expression library provides a full complement of regular expression features, allowing patterns to be expressed concisely.

The following regular expression forms are supported:

RegEx Form

Example

Meaning

Literal character

"x"

Match a literal ‘x’ character.

Any character

"."

Match any character.

Juxtaposition

"xy"

Match ‘x’ then ‘y’.

Klein Enclosure

"x*"

Match zero or more ‘x’ characters.

Repetition

"x+"

Match one or more ‘x’ characters.

Optionalization

"x?"

Match zero or one ‘x’ characters.

Inversion

"x!"

Match anything but a single ‘x’

Alternation

"a|b|c"

Match ‘a’ or ‘b’ or ‘c’.

Difference

".-x"

Match any character but ‘x’

Character Class

"[a0-9]"

Match a collection of characters.

Grouping

"(ab)*"

Group a subexpression for an operator.

Although most of these forms are postfix operators (*, +, ?, and !), the alternation and difference operators are infix operators. Infix operators may be used without parentheses if there are no other operators between the operators. For example: "abc|def|ghi" is tokenized as "((abc)|(def))|(ghi)", but "ab*c|def" is tokenized as "(a(b*))((c)|(def))". Postfix operators bind very tightly, and thus must be used with the grouping operator to bind to units larger than a character. The character class operators are neither infix nor postfix operators, instead, they behave like a singular character.

The character class regex form is a very powerful form capable of matching in powerful ways. The character class always matches a single character against the set of characters contained within, but the set may be manipulated in powerful ways. Ranges of characters may be specified with the ‘-’ operator (ex: "[a-zA-Z]" matches all letters), the character class may be inverted with the ‘^’ operator (ex: "[^0-9]" matches everything except for digits), and escape sequences may be used in the character class (ex: "[\n\t ]" matches common whitespace characters). Any special character may be escaped with the ‘\’ character (ex: "[\n\t\-\]\\]" matches newline, tab, ‘-’, ‘]’ or ‘\’).

Other non-intuitive cases follow from these simple rules. For example, "[^-|]" matches anything but ‘-’ or ‘|’. To match the range ‘^’ through ‘|’, one would use "[\^-|]". To match anything (the equilivant of ‘.’, you could use "[^]").

In order to match the special operators ("[^-\\]" does not match caret, dash, slash!), special cases have been put into place. First, the character class inverse operator ‘^’ is only a special operator when it is the first character of the class. The ‘-’ character is only a special operator when it is not the first character of a class, and the ‘\’ character may be produced by double escaping itself: "[\\]". To match caret, dash, slash, one would produce this character class: "[-^\\]".

compileRegEx :: String -> FA

compileRegEx takes a regex specification as the first parameter, and returns the compiled DFA as the result.

measureRegEx :: String -> IO ()

measureRegEx takes a regex specification and measures the resulting DFA.

matchRegEx :: String -> String -> Bool

matchRegEx takes a regex specification and a string, compiles it and then matches the string against the DFA.

The Lexer Module

A lexer is not quite the same thing as a series of regular expressions bound together into a package. To operate efficiently (and correctly), a lexer must execute all regex’s in parallel, returning the longest match possible. It must provide a way to prioritize regex’s to disambiguate them. For example, the two patterns "[a-zA-Z][_a-zA-Z0-9]+" and "while" both match the input "while", but precedence should be given to the literal match. In addition, functionality should be provided to allow descriptive tokens to be used with the regular expressions, avoiding having to define constants or other unreadable systems.

The Hasklex lexer allows the user to do all of these things. The lexer is compiled by specifying a list of regular expressions (along with their token IDs), and an error token to use. When tokenizing an input string, the lexer returns the token corresponding to the regex that matched, or the error token if none match. Additionally, a priority is implicitly assigned to each regex by the ordering of the regex-token list.

type Token a = (String, a)

The Token type corresponds to a single token specifier. When used to compile a lexer, the first value is the regex to use, and the second is the token to associate with it. When returned by the lexer, a token corresponds the string that was lexed along with the token associated with the matching regex.

compileLexer :: [Token] -> a -> Lexer a

compileLexer takes a list of tokens (specifying regex/token pairs) and an error token. The error token is returned when no match can be made, and the pairs of regex’s and tokens are used as described above.

lexFirstToken :: Lexer a -> String -> Token a

lexFirstToken takes a lexer and a string to lex. It returns a token containing the first token id recognized and the string it matched. If no regex matches, the error token is returned with an empty string.

lexIntoList :: Lexer a -> String -> [Token a]

lexIntoList is much like lexFirstToken, except that it lexes the entire input string into a list of tokens. If an error occurs, then the last element of the list is the error token.

measureLexer :: Lexer a -> IO ()

measureLexer takes a lexer specification and measures the internal DFA.

Examples of Use

Matching strings with regular expressions [and the DFA module]:

testRE = compileRegEx "[a-zA-Z]+"
testRE is now a DFA that is compiled from the regular expression.
Main> measureFA testRE
2 states, 104 transitions, and 1 final states.
Main> matchDFA testRE "abcDEF"
True
Main> matchDFA testRE "abc0DEF"
False
Main> matchDFAHead testRE "abc0DEF"
"abc"

Using the lexer module:
testLexer =
compileLexer [("[ \n\t]+", TokIgnore), -- Ignore whitespace
("if", TokIf),
("then", TokThen),
("else", TokElse),
("[+]", TokPlus),
("-", TokMinus),
("[*]", TokStar),
("==", TokEqEq),
("[0-9]+", TokDigit),
("[a-zA-Z_][a-zA-Z0-9_]*", TokVar)
] TokError
testLexer is the lexer used in the examples below.
Main> lexIntoList testLexer "if then \telse I\nam + A then iffy"

[("if",TokIf), (" ",TokIgnore), ("then",TokThen), (" \t",TokIgnore), ("else",TokElse), (" ",TokIgnore), , ("I",TokVar), ("\n",TokIgnore), ("am",TokVar), (" ",TokIgnore), ("+",TokPlus), (" ",TokIgnore), ("A",TokVar), (" ",TokIgnore), ("then",TokThen), (" ",TokIgnore), ("iffy",TokVar)]

Main> filter ((/=TokIgnore).snd) $$

[("if",TokIf), ("then",TokThen), ("else",TokElse), ("I",TokVar), ("am",TokVar), ("+",TokPlus), ("A",TokVar), ("then",TokThen), ("iffy",TokVar)]