Consider this context-free grammar:
S := ( L )
| a
L := L , S
| S
Eliminate the ambiguity that will complicate writing a parser for the language.
Solution: First, eliminate the left recursion that can cause a parser to recurse infinitely:
S := ( L )
| a
L := S , L
| S
That is a simple change, huh? It requires a little thinking, though, about what can be derived from L.
Then, left factor the resulting grammar that will cause a parser to choose non-deterministically:
S := ( L )
| a
L := S L'
L' := , L
| Ø
We can see that null substitution must give a ), but all that really matters is that it can't give a , -- that's what assures us that the parser can choose determinstically.
Now we have a grammar for the same language in a form that we can use to build a simpler parser for it.
Bonus Problem: What language does this grammar specify?
We are discussing the syntax analysis phase of a compiler. Our discussion last time led us to use the BNF description of a context-free grammar. The next question is, how can we use such a grammar and the idea of a pushdown automaton to write a parser for the language?
There are two basic strategies:
This is goal-oriented parsing. If we look at the hierarchy of goals that this process creates, we realize that non-terminals at the top of the tree are recognized first. Thus, this strategy is usually called top-down parsing.
This is opportunistic parsing. If we look at the hierarchy of symbols recognized in the process, we see that non-terminals at the bottom of the tree are matched first and only later the non-terminals that contain them. Thus, this strategy is usually called bottom-up parsing.
Actually, there is a third general class of parsing techniques called universal parsing, such as the Cocke-Younger-Kasami and Earley algorithms. (Both of these algorithms use a form of dynamic programming.) These universal techniques work for any context-free grammar, but they are too inefficient and too opaque to use in a production compiler. The more efficient top-down and bottom-up algorithms work only for a subset of grammars, but they are sufficient for the purpose of compiling programming languages.
Using both top-down and bottom-up algorithms, the parser will look ahead some number of tokens in order to choose its next step. In principle, a parser can look ahead an arbitrary number of tokens. However, looking ahead more than one token results in much more complex algorithms and, much like the universal parsing techniques, turns out not to be valuable in recognizing the sorts of programming languages that most people can understand and use. So, from now on, we will consider only algorithms that require a lookahead of one token.
Typically, if we write a parser by hand, the parser works top-down. Bottom-up algorithms are so complex that writing them by hand is rather difficult. Parsers generated by tools may parse either top-down or bottom-up.
For reasons of both time constraints and complexity, we will focus nearly all of our attention on top-down parsing in this course.
The simplest kind of top-down parser uses recursive descent. This is the sort of parser we saw in our acdc compiler. Recursive descent draws directly from two ideas we've just seen:
While simple, recursive descent parsing is powerful enough to be used to write parsers by hand for many interesting programming languages.
To build a recursive descent parser, we write one procedure for each non-terminal, whose task is to determine if the sequence of tokens being processed is an instance of the non-terminal. Matching a terminal in the sequence requires a direct match, often a test for equivalence. Matching a non-terminal requires a call to the procedure that recognizes that element. Thus, the matching process "descends" down the expression to terminals by making function calls.
Often the procedure that is called to match a non-terminal is recursive, or mutually recursive. For instance, an if statement might consist of an expression (the condition) and two statements (the then and else blocks, respectively). The expression may contain another expression, and either statement may be an if statement.
By making recursive calls within the implementing language, we do not have to implement and manage our our own stack, because the mechanics of the recursive procedure calls creates and manages the stack of calls, saving and restoring the state of each state machine when a called procedure returns. In other words, we use our program's run-time behavior to implement a pushdown automaton for us!
In general, a recursive descent parser may have to backtrack. This occurs in a situation in which the parser has a choice of two procedures to call -- and chooses the one that doesn't match.
With certain language grammars, though, we can write recursive descent parsers that never have to backtrack. This special case is called predictive parsing, and it is remarkably useful for many programming languages.
The acdc parser we considered in our second week is a predictive parser. It takes advantage of the fact that acdc's grammar defines a unique way of determining the next production to follow based on the next token in the stream.
As we've seen, even some deceptively simple grammars cannot be translated directly into a predictive parser. Consider this grammar:
S := c A d
A := ab
| a
This grammar requires backtracking just top recognize "cad"!
While the grammar of ac was simple, it turns out that with careful grammar design we can almost always create a grammar suitable for a predictive parser. Last session we saw two techniques that help us eliminate ambiguities: eliminating the left recursion and left factoring.
For example, we can use left-factoring to fix the rules for A in our ambiguous grammar above to give:
A := a A'
A' := b | Ø
(Now, construct a derivation of "cad" to convince yourself that the grammar is predictive.)
The reason that so many grammars can be written carefully enough to be translated directly into a predictive parser is the presence of keywords and other language cues. For example, control structures such as if, for, and while signal their existence with a marker token. So, only the more implicit parts of the language, such as lists of expressions and arguments, tend to cause difficulty, and using our "left" techniques eliminates most problems with these parts of the grammar.
Last time, we said that we could implement a parser using a collection of finite-state machines for recognizing non-terminals. To do this, we build a state machine for each non-terminal in this way:
The program that uses these transition diagrams to parse a stream of tokens begins in the start state of the start symbol of the grammar. It then changes from its current state s based on the next token:
This approach works only if the state machine is deterministic. For example, note in Step 2 that there can be only one transition out of state s labeled with a non-terminal. If there were more than one, how would the parser know how to proceed? Such a grammar would need to be disambiguated before being converted to state machines.
Quick Example: Let's see how this process works for the grammar you fixed up in our opening exercise, recognizing the string (a,(a)). ... done in class.
Ø-transitions can cause nondeterminism, too, and our left recursion and left factoring techniques often introduce them. Sometimes we can make an assumption that eliminates the non-determinism, if not the Ø-transition.
For example, consider:
If we decide to take the + transition whenever it applies and the Ø-transition only when the + transition does not, we have eliminated the nondeterminism.
We can also simplify state diagrams in ways that aren't as obvious when we are dealing with the grammar. If you'd like to read more about this topic, take a look at the Dragon book, Pages 184-185.
The set of recursive-descent parsers is a subset of the set of top-down parsers. In recursive descent, the parser calls a procedure for matching each non-terminal it seeks to recognize. These procedures will call other such procedures, descending through a tree of goals down to terminal matches. Of course, with a mutually inductive grammar, the parser's match procedures will be mutually recursive, too.
Recursive-descent parsers may have to backtrack when a grammar defines two ways to match a non-terminal that cannot be distinguished by their prefixes. Compilers should generally not backtrack, due to the combinatorial explosion of possible paths to recognize a program in a complex language.
A subset of the top-drawn parsers is predictive, some of which use recursive descent. Predictive parsers never backtrack. In practice, predictive parsing is sufficient, because programming language grammars can almost always be designed to avoid backtracking. Given that, language implementers are unwilling to pay the cost of making multiple scans over the token stream. If a particular grammar cannot redesigned in this way, then the language designers will usually redesign the language.
So, if the set of recursive-descent parsers is a subset of the set of top-down parsers, what other sort of top-down parser is there?
We can build non-recursive parsers by implementing and maintaining our own stack of states. As is often the case, careful construction of a pushdown automaton by hand can produce a more efficient parser than results from an automaton created implicitly by the execution of recursive procedures in the implementing language.
The main question we have to answer when implementing such a parser is, what production should the parser apply when faced with a transition labeled by a non-terminal? We will use a table to record how to make such decisions. The result will be a non-recursive parser, usually called a table-driven parser.
As a table-driven parser processes a stream of tokens, it uses two data structures:
The stack is dynamic, of course, growing and shrinking as nested non-terminals are recognized. The table is static; it is built prior to processing and stays the same throughout.
(Of course, such a parser also has an input buffer of tokens to process and an output stream on which to write its results.)
The parsing table is a function M that maps ordered pairs onto production rules. The ordered pairs are of the form (N, t), where N is a non-terminal and t is a token. When the parser is attempting to recognize an N and the next input token is a t, it uses M to find the correct production rule to use. If no production exists for the pair, then the table must indicate an "error" value for that pair, either explicitly or implicitly.
Here is a simple table-driven algorithm. A parser using this algorithm acts on the basis of the current token i in the input stream and the symbol A on top of the stack, until it reaches the end of the input stream, denoted by $.
until A == $.
This algorithm requires no recursion or even any procedure calls to recognize an input stream. It grows and shrinks its stack on its own and loops through entries in the table, which embodies the knowledge that exists in the form of separate procedures in a recursive-descent parser.
Our next task: Learn how to construct the table!