We have begun to examine the phases of a compiler in depth, starting with the scanner. To build a scanner, we need a language for writing the rules that describe the structure of tokens and a technique for writing programs that recognize tokens using such rules. For these purposes, we will use regular expressions and deterministic finite state automata, respectively. Last time, we learned a bit about regular expressions and finite state automata, which are tools we can use to build recognizers of strings. Today, we will learn a technique for converting regular expressions to FSAs.
Last time, we wrote a regular expression for non-negative integers, 0 | [1-9][0-9]* . Create a deterministic FSA that recognizes non-negative integers.
Solution: "Walking down" this expression allows us to build an FSA without much trouble at all:
For a little fun at home, do the same for the regular expression for comments that I asked you to create last time.
We closed last time with the claim that it is good for us that any regular expression can be recognized by a deterministic FSA. Why is that good for us? Because then we have a clear path to implementing a scanner, a program that recognizes tokens in a language.
Usually, we will break the second step above -- regular expression to DFA -- into two steps: convert the regular expression to an NFA, then convert the NFA to a DFA. We do this because it is often easier to translate a regular expression into an NFA than into a DFA. This is true for complex tokens and especially true when we have a choice among many different kinds of tokens, which is true for most programming languages. This will make our lives easier, because choices within a token and choices among tokens can be represented as nondeterministic moves out of a state.
Fortunately, theory tells us that any languages which can be recognized by a nondeterministic FSA can also be recognized by a deterministic FSA. Today we will learn a techniques for converting NFAs into DFAs. As in most design and implementation choices we make, their is a trade-off between NFAs and DFAs, involving the speed and size of the resulting machines.
This three-step outline gives us a repeatable, if tedious, technique for implementing token recognizers. Fortunately, we will have some tool support for doing some of the tedious conversion. Of course, this tedium is a sign that we can write a program to do all of the work for us, a scanner generator.
Let's lay the foundation for the rest of our session with this quickie:
Write a regular expression for the language consisting of all strings of a's and b's that end in abb.
Solution: This is correct:
(a*b*)*abb
But we can do better by doing simpler:
(a | b)*abb
Let's use this regular expression as a running example for the rest of our session, in which we learn a reliable and robust technique for turning a regular expression into a program that recognizes the corresponding language.
Now, convert this regular expression into an FSA.
Solution: Given the nondeterministic choice between a and b, an NFA is the simplest choice. Here is one candidate:
Is the process you used to produce your answer precise enough that we could write a program to do it?
With some creativity and trial-and-error, we can create an NFA from any regular expression by doing the following:
Some techniques follow these suggestions with gusto, producing remarkably verbose automata. For example, a textbook we once used in this course describes an algorithm that produces this NFA for (a | b)*abb:
There should also be ε transitions from State 0 to State 7 and from State 6 to State 1. [States 0 and 7 handle the case of (a | b)0, and States 1 and 6 handle the case of (a | b)+.] Remarkable!
The disadvantage of such a technique seems obvious: it wastes nodes and creates many unnecessary ε transitions. But the advantage is that the process is mechanistic and straightforward to automate in code. This means that a human programmer need not manage all of the details it requires. A second advantage is that we do not have to worry about its verbosity, because we have techniques available to us that can optimize the automaton -- and they, too, are straightforward to automate in code.
Creating an initial NFA from a small regular expression is usually a straightforward process, even when done ad hoc -- as long as you don't worry about creating the smallest or best automaton possible. We have techniques for improving the automaton later.
From the standpoint of a programmer building a recognizer, an NFA is more useful than a regular expression because it is simulatable -- we can write code that acts just like the automaton. However, being nondeterministic, they require the ability to guess a path and backtrack when the guess is wrong. We would like for our recognizers to be deterministic.
Fortunately, as we noted at the end of last session, the set of languages recognized by NFAs is identical to the set of languages recognized by DFAs, and we can use the elements of a proof for this claim to create an algorithm for converting any NFA into an equivalent DFA.
The idea is to simulate the NFA on all possible inputs. Whenever a symbol takes you from state i to states j, k, ... nondeterministically, we will create a compound state [j, k, ...] in the DFA. Then, when we consider a state [j, k, ...], we will create a compound state [m, n, ...] that includes all states reachable form any state in [j, k, ...].
We will continue this process until we have covered all of the branches out of all the strates in the growing DFA. We need to be certain to handle all of the ε transitions, too. But they do not consume any symbols. In the end, any state that contains a final state from the NFA is a final state in the DFA.
This sounds dangerous, but the NFA is finite, with some fixed number of states n, and so the process will terminate. The resulting DFA will have at most 2n states, though in practice it usually has far fewer.
Let's try it out on our simpler NFA... and we do.
The result is this DFA:
Not bad at all. The 4-state NFA results in a 4-state DFA, without much more complexity. (The small alphabet helps.)
What about the mechanically-produced 11-state NFA with all the ε transitions? It starts as:
We arrive at a couple of transitions from State 0 on a and b inputs by following ε transitions to states that do have a and b transitions. If we continue, we arrive at:
Very nice! The 11-state NFA results in a 5-state DFA, only slightly more complex than our simpler version. Again, having a small alphabet helps keep the FSA small. And only those states in the NFA having incoming a's and b's end up in the "superstates" of the resulting DFA.
Theory can help us do better. The translation technique is verbose, too. We can take advantage of "important states" when we compute the "ε closure" of a state, that is, the set of states reachable from a given state on a given input. This approach creates states only for cases when two sets of states have the same important states. The result can be smaller DFAs. (See Pages 134-135 of the Dragon book if you would like to learn more.)
The verbosity of our techniques for converting regular expressions into NFAs and then into DFAs means that we have room to compress the resulting DFA into a more efficient machine. If we apply one of the optimization algorithms to our verbose DFA, it produces:
... which is identical to the DFA we produced from our own NFA. We cannot do better, because it is a minimal DFA for this language!
One of the important conclusions we can draw from this experience is that we can use straightforward, mechanical techniques at each step along the way and produce a minimal DFA that recognizes our regular language. Another is that CS theory can does have a use for us programmers!
We won't study minimization techniques in this course, in the interest of time and learning other topics. If you are interested, you can read about them in your textbook or in the Dragon book.
Fortunately for us, these optimizing techniques are mechanical and thus programmable in code. Many people have done so.
I used the free tool JFLAP to generate my FSA diagrams for the last two sessions. I also used JFLAP to do the convert NFAs to DFAs and to minimize the DFAs. You can download JFLAP from the project home web site or from my local copy of a recent version.
There are many other tools available out there. If you run across a tool that you find to be useful, let me know, and I'll add it to the course resources page.
Isn't writing code to implement a DFA just a simple matter of programming? While tedious, implementing a deterministic finite state automaton in code really is straightforward. For a large automaton, though, the code can be quite long.
Each state in a DFA becomes a state in the code. How can we implement states?
In a procedural style, we might implement them as ints to be switched on. For example,
int state = START_STATE;
switch (state)
{
case 0:
// switch on character from input to change state
break;
case 1:
// switch on characters from input to change state
break;
...
}
We can often make such code easier to read if we implement each state as a procedure that knows which states to call on each possible character. For example:
protected void/Token state0( input )
{
switch (next character)
{
case a:
state1( input )
break;
...
}
}
Finally, we can eliminate much of the nested code by implementing the DFA as a table to be processed. For example, we can implement a table as a two-dimensional array, with states labeling the rows and input characters labeling the columns:
a b
0 1 0
1 1 2
2 1 3
3 1 0
Now, we write code that starts in state 0 and uses the array to choose the next value for the state. In this table, state 3 is the only terminal state. If the automaton exhausts its input stream while in this state, it recognizies the string.
We can also use a dead state d as a destination if ever a character is not allowed out of a given state. If the automaton ever reaches this state, it fails to recognizie the string. Dead states enable us to construct a complete table to drive the process.
In an object-oriented style, we can make our states objects that know how to switch among themselves. When a state reads the next character, it can send the next state a message and tell it process the rest of the input. Some states will need to know that a token has just been recognized, create it, and pass it back to the requestor. For more on this technique, see the State design pattern.
... tools that write the code for us ...