Session 11

Building Tables for Table-Driven Parsing


810:155
Translation of Programming Languages


Opening Exercise

Consider this context-free grammar:

    statement   := declare identifier option-list
    option-list := option option-list
                 | ø
    option      := scale | precision
    scale       := fixed | float
    precision   := single | double

Build the FIRST sets for this grammar. Then build the FOLLOW sets.

Solution: We saw a set of rules for creating FIRST and FOLLOW sets last time...

    FIRST(declare)      FIRST(fixed)   FIRST(single)
    FIRST(identifier)   FIRST(float)   FIRST(double)

       ... are themselves. (R1)

    FIRST(scale)       = { fixed, float }     (R3, P5)
    FIRST(precision)   = { single, double }   (R3, P6)

    FIRST(option)      = FIRST(scale)         (R3, P4)
                         U FIRST(precision)   (R3, P5)
    FIRST(option-list) = FIRST(option)        (R3, P2)
                         U { ø }              (R2, P3)
    FIRST(statement)   = { declare }          (R3, P1)

Then we use the FIRST sets to help create the FOLLOW sets:

    FOLLOW(statement)   = { $ }                         (R1)
    FOLLOW(option-list) = FOLLOW(statement)             (R3, P1)
    FOLLOW(option)      = (FIRST(option-list) - { ø })  (R2, P2)
                        + FOLLOW(option-list)           (R3, P2, P3)
    FOLLOW(scale)       = FOLLOW(option)                (R3, P4)
    FOLLOW(precision)   = FOLLOW(option)                (R3, P4)



Recap: Strategies for Parsing

We are discussing the syntax analysis phase of a compiler. Session 9 taught us that parsers can work either top-down from an initial goal of recognizing the start symbol of the grammar or bottom-up by recognizing sequences of terminal tokens. Session 10 taught us that we can build a top-down parser by constructing a parsing table that tells a generic algorithm which production rule to apply at any point in the token stream.

Table-driven parsing shifts knowledge for parsing a grammar out of the procedures that recognize each non-terminal into data, entries in a table, which are then processed by a common algorithm. Not surprisingly, then, the key to building a table-driven parser lies in constructing a suitable parsing table.

Last time, we formalized this process in terms of two functions, FIRST and FOLLOW, that can be computed for a grammar.

We have been working with this simple grammar for arithmetic expressions as a running example:

    expression  := term expression'
    expression' := + term expression'
                 | Ø
    term        := factor term'
    term'       := * factor term'
                 | Ø
    factor      := ( expression )
                 | identifier

Last time, we built FIRST and FOLLOW set for the grammar:

Today, we will learn how to use these functions to generate a complete and correct parsing table mechanically.



Building Parsing Tables with FIRST and FOLLOW

Given these two functions, we can now construct the parsing table for a grammar in a mechanical way. The idea is really pretty simple.

Suppose that A := α is a production and that terminal a is in FIRST(α). Then the parser should expand A by α whenever the input symbol is a. The only case that can break this rule is when there is a production α = ø, or ø is otherwise derivable from α. In that case, the parser should still expand A by α if the input symbol is in FOLLOW(A). The last situation holds even if the 'input symbol' is $.

So, given a grammar, produce a parsing table M as follows:

  1. For each production A := α, do the following.
    • For each terminal a in FIRST(α), add A := α to M[A, a].
    • If ø is in FIRST(α), add A := α to M[A, b] for each terminal b in FOLLOW(A).
  2. Make each undefined entry in M be an error.

Let's use this algorithm to build a parsing table for our arithmetic grammar from the FIRST and FOLLOW sets defined above. We can process the production rules in order:

  1. (Rule 1) expression := term expression'

  2. (Rule 2) expression' := + term expression'

  3. (Rule 3) expression' := ø

  4. (Rule 4) term := factor term'

  5. (Rule 5) term' := * factor term'

  6. (Rule 6) term' := ø

  7. (Rule 7) factor := ( expression )

  8. (Rule 8) factor := identifier

  9. All other entries in M[A,a] are errors.

Compare our entries with the parsing table I gave last time. They are identical!

Again, the process is tedious, but with care this algorithm builds a correct parsing table for any suitable grammar.



Follow-Up Exercise

Now, build a parsing table for the grammar from our opening exercise...



Types of Top-Down Parsers

The parsing table for our little expression language has a particular feature that makes the table-driven parsing algorithm work: each cell in the table has exactly one entry. (Remember, the blanks are implicitly 'error' entries.) This means that the algorithm can make deterministic choices at each point in the algorithm by looking ahead just one token.

Knuth characterized grammars and parsing techniques according to three characteristics: the order in which the parser reads the input stream, the kind of derivation that the parser follows, and the number of tokens that the parser must look ahead to decide what to do next. By this standard, a table-driven parser for our little expression grammar is LL(1).

We call a grammar for which we can build an LL(1) parser an LL(1) grammar.

The algorithm for building parsing tables that we studied above guarantees a complete and sound table for any LL(1) grammar. "Complete" means that the algorithm generates every entry that belongs in the table. "Sound" means that every entry generated by the algorithm belongs in the table. We know this from one of the proofs that we can do about LL(1) grammars and the algorithm.

You can determine if a grammar is LL(1) by verifying that it has these traits:

If a grammar is not LL(1), then our table-building algorithm will create cells that contain multiple entries. This means that the grammar is LL(k) for some k>1. There are algorithms capable of parsing LL(k) grammars, but they are less efficient than LL(1) algorithms because they either must look ahead farther, or backtrack when they guess wrong.

For some non-LL(1) grammars, we can make arbitrary choices that result in deterministic parsers. Often, these choices match up with programmer expectations and are codified in the non-BNF part of a language specification. Consider the case of dangling elses in this grammar fragment:

    statement  := if expression then statement statement'
                | a
    statement' := else statement
                | Ø
    expression := b

This grammar results from left-factoring a grammar that lists both kinds of if statement. The parsing table for this grammar is:

. a b expression if then $
S S := a . . S := if E then S S' . .
S' . . S' := else S
S' := ø
. . S' := ø
E . E := b . . . .

The entry for M[S', expression] contains both "S' := else S" and "S' := ø", because FOLLOW(S') = { else, $ }. This shows that the grammar is ambiguous, because a parser has two choices when trying to match an S' and an else token is encountered.

There is no way to re-write this grammar so that it is LL(1). What can we do?

We can resolve this ambiguity by deciding always to choose "S' := else S" when the choice arises. This decision corresponds to associating an else clause with the closest preceding then clause. Most programming language definitions specify this choice, in part to make the programmer's life easier and in part to simplify parsing.

(Note: In this grammar, choosing S' := ø is almost certainly wrong, as it would prevent ever putting an else on the stack or consuming it from the input stream!)

So, some grammars cannot be written in LL(1) form. Sometimes, we can resolve an ambiguity by making an arbitrary choice at each point of ambiguity in the parsing table. For a small number of conflicts of a certain kind, this strategy may suffice. But it requires human intervention and thus isn't amenable to automation. Fortunately, though, some grammars with this sort of ambiguity can be written in a way that can be processed efficiently, only not in an LL(1) way.

Every practical parsing technique scans its input stream left-to-right. Every practical parsing technique requires a lookahead of just 1. But some practical techniques do create rightmost derivations. These are the techniques that parse the token stream bottom-up. We characterize these parsers as LR(1). It turns out that the set of of LL(1) grammars is a proper subset of the set of LR(1) grammars. What this means for us in a practical sense is that many grammars which cannot be written in LL(1) form can be written in LR(1) form.



Bottom-Up Parsing

The recursive descent and table-driven approaches we've studied thus far follow the principle of top-down-parsing: the primary goal of parsing is to show that an input sequence can be derived from the start symbol of the grammar. Any non-termninal encountered along the way is treated as a subgoal to be achieved.

An alternative approach is to follow the principle of bottom-up parsing: process the input sequence with no particular goal in mind, matching non-terminals opportunistically whenever possible. The matching of the start symbol indicates a successful parse of the input.

Recall that in table-driven top-down parsing, we try to recognize the symbol on top of the stack. If it is a terminal, then we match the token against the next symbol on the input stream. If it is a non-terminal, then we expand the symbol on the stack using a production rule. We might call this match-expand parsing.

In bottom-up parsing, we shift terminal tokens from the input stream onto the stack. Whenever a string on top of the stack matches the righthand side of some production, we reduce the stack by replacing the string with the non-terminal symbol on the lefthand side of the rule. As a result, we sometimes call this style of parsing shift-reduce parsing.

Bottom-up parsing is an important topic in its own right, but one for which we do not have enough time to do justice this semester. Suffice to say that bottom-up parsing is even more complex and tedious to implement than table-driven top-down parsing. The trade-off is that it can handle grammars that LL(1) techniques cannot and do so very efficiently. These benefits mean that we often prefer to use an LR(1) parser. The costs mean that we almost always use a parser generator to create an LR(1) parser from the language's grammar.

If you would like to learn more about bottom-up parsing on your own, feel free to study Chapter 5 in Louden's text.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... February 19, 2008