Session 10

Table-Driven Parsing


810:155
Translation of Programming Languages


Opening Exercise

Whip up a quick design for a recursive-descent parser for this context-free grammar:

    statement  := repeat statement until expression
                | print identifier
                | identifier  expression
    expression := identifier = expression
                | zero? expression
                | number

The idea of the exercise is to get the structure of the parser right, so you can make reasonable assumptions about tokens and the scanner that let you focus on recursive descent.

Solution: We can write a predictive parser for this grammar because its choices are deterministic and keyed by a token. Here is a quick Java draft. I assume that I have a procedure called match() that (1) consumes a token and returns true when the next token in the stream matches the argument and (2) returns false otherwise.

    public boolean statement()
    {
       switch ( scanner.peek() )
       {
          Token.repeat :
             return match(Token.repeat) && statement() &&
                    match(Token.until)  && expression();
          Token.print :
             return match(Token.print) && match(Token.identifier);
          Token.identifier :
             return match(Token.identifier) &&
                    match(Token.assign)     && expression();
          default:
             return false;
       }
    }

    public boolean expression()
    {
       switch ( scanner.peek() )
       {
          Token.identifier :
             return match(Token.identifier) &&
                    match(Token.equals)     && expression();
          Token.zeroPredicate :
             return match(Token.zeroPredicate) && expression();
          Token.number :
             return match(Token.number);
          default:
             return false;
       }
    }

What would I have to do if we added identifier as a type of expression? (Hint: Left factoring may be involved. :-) What effect would this have on our parser?



Recap: Strategies for Parsing

We are discussing the syntax analysis phase of a compiler. Last time we discussed how parsers can work either top-down from an initial goal of the start symbol of a grammar or bottom-up from sequences of terminal tokens. A common class of simple top-down parsers can be implemented using a technique called recursive descent, in which the parser calls a procedure for matching each non-terminal it seeks to recognize. These procedures will call other such procedures, descending through a tree of goals down to terminal matches.

A recursive-descent parser may have to backtrack if the grammar allows two ways to match a non-terminal that cannot be distinguished by their prefixes. But often we can build a predictive parser that never has to backtrack by massaging the grammar. We can implement predictive parsers in a straightforward way using a collection of finite-state machines for recognizing non-terminals.

We can also build a predictive parser that does not need to recurse by implementing and maintaining our own stack of states. As is often the case, careful construction of a pushdown automaton by hand can result in a more efficient parser than results from an implicit automaton generated by recursion. We ended last session by looking at the idea of a table-driven parser, which uses a table to record which production to apply in any given state, and a parsing algorithm. This algorithm requires no recursion or even any procedure calls to recognize an input stream. It grows and shrinks its stack on its own and loops through the table, which encodes the knowledge that is distributed across the separate procedures of a recursive-descent parser.



An Example of Table-Driven Parsing

Consider this simple grammar for arithmetic expressions:

    expression  := term + expression
                 | term
    term        := factor * term
                 | factor
    factor      := ( expression )
                 | identifier

Notice how this grammar "implements" the common order of precedence for the operators (), *, and +...   It "nests" a higher-precedence operation deeper in the grammar with a new non-terminal. A factor must be recognized before the term that contains it, which ensures that the parentheses bind before the multiplication!

This grammar matches how we like to write expressions, but it causes a parser problems with its lefthand ambiguities. So we use left factoring to create this grammar:

    expression  := term expression'
    expression' := + term expression'
                 | Ø
    term        := factor term'
    term'       := * factor term'
                 | Ø
    factor      := ( expression )
                 | identifier

We can construct a parsing table for this grammar that identifies which rule to apply for each (non-terminal, token) pair:



. identifier + * ( ) $
E E := TE'     E := TE'    
E'   E' := +TE'     E' := ø E' := ø
T T := FT'     T := FT'    
T'   T' := ø T' := *FT'   T' := ø T' := ø
F F := identifier     F := ( E )    

Notice that each cell has at most one entry, so the table is predictive. Any cell without an entry indicates a parsing error.

Now let's trace our table-driven algorithm as it recognizes this token stream:

identifier + identifier * identifier


Stack Input Output
$E identifier + identifier * identifier $  
$E'T identifier + identifier * identifier $ E := TE'
$E'T'F identifier + identifier * identifier $ T := FT'
$E'T'identifier identifier + identifier * identifier $ F := identifier
$E'T' + identifier * identifier $  
$E' + identifier * identifier $ T' := ø
$E'T+ + identifier * identifier $ E' := +TE'
$E'T identifier * identifier $  
$E'T'F identifier * identifier $ T := FT'
$E'T'identifier identifier * identifier $ F := identifier
$E'T' * identifier $  
$E'T'F* * identifier $ T' := *FT'
$E'T'F identifier $  
$E'T'identifier identifier $ F := identifier
$E'T' $  
$E' $ T' := ø
$ $ E' := ø

Notice that the algorithm follows a left derivation of the input; that is, the output productions generate a left derivation of the input stream. Top-down parsers generally follow left derivations, because they try to match non-terminals to the leftmost tokens in the stream.



Parsing with a Table

Table-driven parsing shifts knowledge for parsing a grammar out of the procedures that recognize each non-terminal into data entries in a table, which are then processed by a single algorithm. Not surprisingly, then, the key to building a table-driven parser lies in constructing a suitable parsing table.

This is a common theme that you have encountered throughout your undergraduate curriculum, from the trade-off between procedure-based and data-based implementations in Programming Languages to techniques such as the strategy pattern in Object-Oriented Programming. It also plays an important role in the knowledge representation techniques you study in Artificial Intelligence. This is one of the Big Ideas of computer science!

We can build the parsing table for simple grammars like the one above in an ad hoc way simply by working through the grammar. Building parsing tables for more complex grammars requires more work.

But even for a simple grammar an ad hoc process is error-prone. You may not see immediately why we have the table entries we do. For example, how do we know to have entries in row E for identifier and ( input tokens and none for the others? The answer is that...

Doing this sort of ad hoc trace for a language such as BigRig would be error-prone indeed!



FIRST and FOLLOW Functions

We formalize this process in terms of two functions, FIRST and FOLLOW, that we can compute for a grammar and then use to generate complete and correct parsing tables mechanically.

Let α be any string of grammar symbols. FIRST(α) is the set of terminals that begin strings derived from α. If we can derive ø from α, then ø is in FIRST(α), too.

For example, in our grammar above, FIRST(expression) = {identifier, (}. This set follows from the reasoning we did above.

Quick Exercise: What is FIRST(expression')?

Let A be any non-terminal. FOLLOW(A) is the set of terminals that can appear immediately to the right of A in a derivation. If we can derive ø from the symbols that separate A and some terminal a, then a is in FOLLOW(A), too.

For example, in our grammar above, FOLLOW(term) = {+, ), $}. That's because...

Quick Exercise: What is FOLLOW(factor)?

Hint: A factor can end a term, so its FOLLOW set must contain at least the symbols in FOLLOW(term).



Building the FIRST and FOLLOW Sets

These two sets can help us build a parsing table, but we don't want to have to build the sets in such an ad hoc fashion, either. We can go one more step and formalize the construction of these two sets.

To compute FIRST sets, use these rules:

  1. If X is a terminal, then FIRST(X) = {X}.

  2. If there is a rule X := ø, then add ø to FIRST(X).
  3. If there is a rule X := Y1, Y2, ... Yn, then add all the non-ø symbols from FIRST(Y1) to FIRST(X).
    • If there is a rule Y1 := ø, then add all the non-ø symbols from FIRST(Y2) to FIRST(X).
    • If that is true and there is a rule Y2 := ø, then add all the non-ø symbols from FIRST(Y3) to FIRST(X).
      Continue in this way until you find a Yi that does not generate an ø.

We can use these rules to generate a complete listing of FIRST sets by proceeding through all the production rules, recursing each time we process a symbol in a production. Rule 3 tells us that we should build the sets bottom up, starting with the lowest-level non-terminals in the grammar.

For our grammar above, we might build the FIRST sets in this order:

To compute FOLLOW sets, we can use the FIRST sets. Use these rules:

  1. Put $ in FOLLOW(S) for the start symbol S.
  2. If there is a rule A := αBβ, then put everything from FIRST(β) -- except for ø -- into FOLLOW(B).
  3. If there is a rule A := αB or A := αBβ where FIRST(β) contains ø, then put everything from FOLLOW(A) into FOLLOW(B).

We can use these rules to generate a complete listing of FOLLOW sets by proceeding through the production rules on at a time. This time, Rule 3 tells us that we should build the sets top-down, starting with highest-level non-terminals.

For our grammar above, we might build the FOLLOW sets in this order:

This process can be tedious, but it is repeatable and guaranteed to produce complete and accurate FIRST and FOLLOW sets. Because it is tedious but formal and repeatable, this process is a perfect candidate for automation in a computer program!

These sets are not our goal; they are only a tool for achieving our goal: construction of a parsing table for the grammar that is both complete and sound. By complete we mean:

If a sentence is in the language, the parser accepts it.

By sound we mean:

If the parser accepts a sentence, then it is in the language.

Our next step is to reap the fruits of our labor building FIRST and FOLLOW sets.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... February 17, 2009