For the last week, we have been studying regular expressions and finite state automata. Keep in mind why we did this... We can use regular expressions to specify the concrete syntax of tokens in our language. We can then convert these regular expressions into a deterministic finite automaton. Finally, we can implement a lexical analyzer as a state machine that recognizes tokens in a compiler.
Today, let's begin with an exercise implementing a full scanner, and then consider some of the pragmatic issues of writing programs that do lexical analysis.
Background: In Session 3, I showed a scanner for the language ac, a simple arithmetic calculator language. I wrote that scanner in an ad hoc fashion, attacking the ac language spec with only my skills as a Java programmer. The language is simple enough that I was able to write a correct and complete scanner in a short time.
But, after Session 5 and Session 6, we now know a technique for designing lexical analyzers using regular expressions and finite state machines. This technique makes it possible to more confidently write a correct and complete scanner, even for a more complex language, in a reasonable amount of time.
Let's use the techniques you've learned on a simpler language before you dive the rest of the way into a lexical analyzer for the language you design.
Task: Design a scanner for ac using regular expressions and finite state machines.
Here is the grammar for ac:
program -> declarations statements
declarations -> declaration declarations
| Ø
declaration -> floatdecl identifier
| intdecl identifier
statements -> statement statements
| Ø
statement -> identifier assign value expressionTail
| print identifier
expressionTail -> plus value expressionTail
| minus value expressionTail
| Ø
value -> identifier
| number
floatdecl -> f
intdecl -> i
assign -> =
print -> p
plus -> +
minus -> -
identifier -> all lowercase letters except {f,i,p}
number -> integers, or reals with both whole and fractional parts
Follow the steps we've learned:
At the end, you should have a single deterministic FSA.
Here are regular expressions for ac tokens:
floatdecl f
intdecl i
assign =
print p
plus +
minus -
identifier [a-eghj-oq-z]
number 0 | [1-9][0-9]*
| (0 | [1-9][0-9]*).[0-9][0-9]*
We use these regular expressions to generate a non-deterministic finite state machine for my analyzer. The result is given here in two figures, a machine for non-number tokens:
... and a machine for number tokens:
Using the techniques that we learned last week, or a tool such as JFLAP, we convert the non-deterministic finite machine into a deterministic machine:
Finally, using techniques that we mentioned but did not study last week, we can convert this machine into a minimal dfa -- one that collapses the one-character choices to a single state and eliminates the duplication of the '.' transition:
Look how well the techniques worked... We end up with a clean, straightforward machine of six states, four of which accept tokens.
For your convenience, the session07/ directory also contains JFLAP source files for my NFA, DFA, and minimal DFA.
Now, use your DFA (or mine) to implement a new scanner for ac. You may reuse our existing scanner by implementing a new getNextToken() method. Here is the Token class that we use in our first acdc compiler. You may use it in your solution, too.
To implement a new scanner, all we need is a new getNextToken() method. As we discussed last time, there are several ways to implement a DFA in code. In the procedural style of our acdc scanner, we will write a switch.
Because you only have to override the getNextToken() method to write a new scanner, you could subclass the existing class (after adding accessors that interact with the input stream...). The rest of the code will be inherited and provide interaction with the parser, including the maintenance of a lookahead cache, and provide utility methods for dealing with whitespace and numeric digits.
A better design is to create an abstract superclass Scanner that implements the public interface of scanners: peek() and nextToken(). The latter depends on the getNextToken() method to do its job, but Scanner leaves this method for subclasses to implement. Then create two subclasses that provide their own versions of the getNextToken() method, say, AdHocScanner (our original) and FsmScanner (our new version). This is an example of the template method pattern that you may have encountered in your object-oriented programming course.
Here is code for my new version of our scanner:
You will note that I did not follow our state machine slavishly. For example, I did not implement a separate function for state q2, which accepts all one-character keywords and identifiers. Instead, I handled those cases immediately in the switch statement that displatches on them. Likewise, I collapsed state q4 into the method that handles state q3, because I could recognize all digits after the first decimal digit with a simple for loop.
Writing a pragmatic method that answers directly, without writing separate methods for each state, obscures the connection between the DFA and the code. For larger DFAs, this obscurity is rarely worth the confusion it creates for the programmer and for readers of the code! I suggest that you follow the state machine you create for your language's concrete syntax relatively closely. As a language becomes more complex, you will find more and more value in following the state machine. This value comes largely in increased confidence in the completeness and correctness of the lexical analyzer code and in a corresponding decrease in debugging time. If you need to optimize this code later for performance reasons, you always can.
Lexical analyzer generators such as Lex, Flex, JLex, and JavaCC generate code that follows a state machine exactly, state-by-state.
You have implemented a new scanner that should work just like the old one, only in a principled way. Now, integrate your new scanner class into our acdc compiler. Again, reuse as much code as possible...
You only have to make the run() method configurable with an instance of a Scanner. We can do this with a simple modification to the compiler class's design. First, factor the invocation of new Scanner() into a factory method. Then, make a subclass of AcDcCompiler that overrides the factory method to create an instance of your new class. (Of course, we could also use the template method pattern here, as we did for the scanner...)
Finally, make a new AcDc driver.
Here is code for my new version of our compiler:
(And, yes, I could probably do better for the driver class in terms of saying things once and only once. This is what I mean when I often say that main methods should do as little as possible. As soon as they do very much, they need to become objects of their own!)
This is the sort of refactoring that happens on a real project. Why didn't we just design for pluggable scanners and compilers in the first place? "You aren't gonna need it." Why, indeed! The number of possible configurations that we could have accounted for is quite large. Would they all have been worth the effort for a small demo program? Would the extra complexity have obscured the code unnecessarily? At this point, I've now implemented two extra degrees of flexibility in response to a real need. If I ever need the others, I can implement them then -- at relatively little development cost.
Download the new compiler as a .zip bundle
Regular expressions and finite automata are great tools for helping us to design lexical analyzers, but when we turn to actually writing a program we still face a lot of choices. The issues that make up the concrete choices we make when writing code are sometimes called pragmatics. Last time we discussed one pragmatic issue, how to implement a DFA in code. Let's consider a few more pragmatic issues now.
Managing the Input Stream. A scanner often reads all the characters from its input stream (often a file) into an array or a vector of strings before processing. This makes it easy for the scanner to move about the input stream and to provide detailed feedback when the compiler encounters an error. This includes lexical errors in the scanner as well as errors downstream. In order to support error reporting downstream, the tokens should keep track of where there were found in the source file.
Working with Multiple DFAs. Sometimes it is easier to create one DFA for each token type than it is to create one large DFA. If we have multiple DFAs, we then have a choice:
Matching the Longest Token. Unless the language specification says otherwise, a scanner will usually match the longest sequence of characters possible before accepting a token. Some symbols, such as parentheses, are usually self-delimiting, that is, they do not need to be surrounded by whitespace in order to be recognized.
For tokens that are not self-delimiting, how can the scanner recognize the longest sequence possible?
Handling Keywords. Languages with even a few keywords usually require many, many states in a DFA in order to distinguish between keywords and identifiers, as there are many "special cases" in long character sequences.
A technique for managing this problem is to build a table of keywords and their tokens before scanning. Then you can do one of at least two different things:
It is difficult for a compiler to detect meaningful errors at only the lexical analysis level. Consider the case of a misspelled keyword:
wihle ( 1 ) pollForInput();
The programmer almost surely intended the keyword while, but the scanner can only recognize this as an identifier. However, the scanner can find itself in a situation where the characters remaining in the stream match a pattern in state machine. How can it respond? There are several possibilities.
Perhaps the simplest is called panic mode. The scanner deletes characters until it finds a match. This is surprisingly effective in an interactive mode.
Other options include:
Most compilers operate on a single-error assumption and look for the one simplest change that would fix the stream. This is a common strategy when diagnosing faults in most engineered systems, and even natural systems like the human body. It works well in a compiler, too, except perhaps when working exclusively with novice programmers.
(Google uses strategies like this in real-time when it provides spell checking on queries.)