Session 13

Building Abstract Syntax in the Parser


810:155
Translation of Programming Languages


Recap: Abstract Syntax and Semantic Actions

We are discussing the syntax analysis phase of a compiler. In particular, we are learning how to implement our first intermediate representation of a source program, the abstract syntax tree (AST) that serves as a primary input to all later phases in the compiler. In Session 12 we began to learn how to augment our table-driven parsing technique to create ASTs as a side effect, using the idea of a semantic action. We add a semantic action to each grammar rule that should create a node in the tree. When the parser encounters a semantic action on its parsing stack, it executes the action to create a part of the AST. The new piece of abstract syntax is placed on a second stack, called the semantic stack. It holds pieces of the AST until such time as they are used to build a higher-level node in the tree.

Let's apply this technique to our arithmetic language so that we can see how it all works. Then we'll look at some issues in implementing ASTs and semantic actions.


Applying the Technique to a Simple Table-Driven Parser

Recall our familiar little arithmetic grammar:

    expression  := term expression'
    expression' := + term expression'
                 | Ø
    term        := factor term'
    term'       := * factor term'
                 | Ø
    factor      := ( expression )
                 | identifier

What is the abstract syntax of this language? It contains addition expressions, multiplication expression, and identifiers. We don't need to represent parenthesized expressions independently, because the only effect of parentheses is to ensure a particular order of evaluation. We do, however, want our semantic actions to generate AST nodes in such a way as to reflect the specified order!

How do we modify our parsing table to produce an AST?

First, we augment the grammar with semantic actions for creating AST nodes. Given three kinds of abstract syntax, we need to augment the three rules that recognize them with actions.

The new productions are:

    expression  := term expression'
    expression' := + term expression' make-plus-exp
                 | Ø
    term        := factor term'
    term'       := * factor term' make-times-exp
                 | Ø
    factor      := ( expression )
                 | identifier make-identifier

Notice that we place the semantic action at the end of the grammar rule, because we are not able to create an AST node for a compound until we have recognized both of its operands.

These semantic actions do not affect how we build the parsing table, so it remains the same, with the addition of the new symbols to some rules:

. identifier + * ( ) $
E E := TE' . . E := TE' . .
E' . E' := +TE' make-+ . . E' := ø E' := ø
T T := FT' . . T := FT' . .
T' . T' := ø T' := *FT' make-* . T' := ø T' := ø
F F := identifier make-id . . F := ( E ) . .

Next, we write factory methods for each kind of abstract syntax. This is straightforward enough; these will be constructors for the records or objects that make up our representation. We will talk more about implementing the AST in code soon.

Now we modify our table-driven algorithm. The first step is also straightforward: we create and initialize a semantic stack before entering the parsing loop.

The second is to add an arm to the algorithm to process semantic actions. The parser still proceeds based on the symbol A on top of the parse stack and the next token i in the input stream. But now the symbol on top of the parse stack may be a semantic action, which doesn't affect what the parser recognizes but instead creates a new node for the AST.

Here is the augmented algorithm. After creating a parse stack and a semantic stack...

  1. Push the end of stream symbol, $, onto the parse stack.

  2. Push the start symbol onto the parse stack.

  3. Repeat

    1. If A is a terminal, then
      1. If A == i, then pop A from the parse stack and consume i.
      2. Else we have a token mismatch, so fail.

    2. Else if A is a non-terminal, then
      1. If table entry [A, i] contains the entry A := Y1, Y2, ... Yn, then pop A from the parse stack. Push Yn, Yn-1, ... Y1 onto the parse stack, in that order. Take any other desired action, such as printing the production.
      2. Else there is no transition for this pair, so fail.

    3. Else if A is a semantic action, then pop A from the parse stack and apply it to the semantic stack.

    until A == $.

  4. Return the value on top of the semantic stack.

Notice that we now have to distinguish between the two stacks at each step. The only step that refers to the semantic stack is the new step, which applies the semantic action just encountered to the semantic stack to build a new AST node, perhaps consuming nodes already on the semantic stack.

In a practical sense, the algorithm is incomplete. It recognizes failures in the cases of a token mismatch and missing entry in the parse table, but what other kinds of errors can occur? When and how should the algorithm recognize and signal them?


Watching the New Table-Driven Parser at Work

Let's trace this new version of the algorithm as it recognizes the token stream

x + y * z

This is the same expression we recognized back in Session 10, with generic identifiers replaced with real ones. Let's re-label the Stack column "Parse Stack" to reflect the fact that we now have two stacks to consider, and add a new column "Semantic Stack" in place of the Output column.

Parse Stack Input Semantic Stack
E x + y * z $ .
E'T x + y * z $ .
E'T'F x + y * z $ .
E'T' m-id id x + y * z $ .
E'T' m-id + y * z $ .
E'T' + y * z $ [x]
E' + y * z $ [x]
m-+ E'T+ + y * z $ [x]
m-+ E'T y * z $ [x]
m-+ E'T'F y * z $ [x]
m-+ E'T' m-id id y * z $ [x]
m-+ E'T' m-id * z $ [x]
m-+ E'T' * z $ [x] [y]
m-+ E' m-* T'F* * z $ [x] [y]
m-+ E' m-* T'F z $ [x] [y]
m-+ E' m-* T' m-id id z $ [x] [y]
m-+ E' m-* T' m-id $ [x] [y]
m-+ E' m-* T' $ [x] [y] [z]
m-+ E' m-* $ [x] [y] [z]
m-+ E' $ [x] [ [y] * [z] ]
m-+ $ [x] [ [y] * [z] ]
. $ [ [x] + [ [y] * [z] ] ]
. $ .

The main loop of the algorithm terminates when the parse stack goes empty, and returns the AST on the semantic stack, which is the representation of the input sequence. The brackets in the semantic stack show the grouping of parts in the AST, and they match the precedence rules implicit in the grammar.

Notice that the new tree-generating portion of our algorithm needs a bit more memory than just the new stack. It has to maintain a temporary variable in order to remember the content of terminals that have semantic content. When a terminal matches against an input token, the algorithm must retain the value matched until invoking the corresponding make-identifier semantic action to create an identifier node for the AST. In our example, matching an identifier against an x, y, or z requires remembering the actual identifier that was matched. This will also be true for other terminals with content, such as numeric literals. The semantic action operates not on just the semantic stack but on (semantic stack, matched terminal).

Notice, too, that the make-plus-exp and make-times-exp semantic actions access only the semantic stack, from which the action pops its two required values.

These two facts tell us something about how to implement SemanticActions in our program.


An Exercise with a Lesson

So, our augmented table-driven algorithm does the job, for at least one program in the language. But I'm concerned that the example we just traced isn't representative -- in particular the order that the + and * operators appeared in the sentence. We want the * operator to bind more tightly than the +, but will the parser construct the correct AST for a sentence in which the operators appear in reverse order? The make-* action will be pushed onto the parse stack first; will it be buried beneath a make-+ action?

One way to find out: trace the algorithm! If it works properly, great; if not, then perhaps the error will help us fix the algorithm. So...

Trace the new algorithm as it recognizes the token stream:

x * y + z

You may use the trace above to guide you, if you wish.

Solution:

Parse Stack Input Semantic Stack
E x * y + z $ .
E'T x * y + z $ .
E'T'F x * y + z $ .
E'T' m-id id x * y + z $ .
E'T' m-id * y + z $ .
E'T' * y + z $ [x]
E' m-* T'F* * y + z $ [x]
E' m-* T'F y + z $ [x]
E' m-* T' m-id id y + z $ [x]
E' m-* T' m-id + z $ [x]
E' m-* T' + z $ [y][x]
*** *** ***
E' m-* + z $ [y][x]
E' + z $ [[x] * [y]]
m-+ E'T+ + z $ [[x] * [y]]
m-+ E'T z $ [[x] * [y]]
m-+ E'T'F z $ [[x] * [y]]
m-+ E'T' m-id id z $ [[x] * [y]]
m-+ E'T' m-id $ [[x] * [y]]
m-+ E'T' $ [z] [[x] * [y]]
m-+ E' $ [z] [[x] * [y]]
m-+ $ [z] [[x] * [y]]
. $ [[[x] * [y]] + [z]]

I've added a row of ***s at the critical point in the derivation. Because terms cannot begin with a + operation, the parser sees the [T',+] pair and chooses the T' := ø rule. This ensures that the make-* semantic action is executed before the + causes an expansion with the E' := +TE' rule, and so the multiplication node is constructed first. Thus, the algorithm achieves the desired precedence binding.

I feel a lot better now.

But still...

Exercise for Home:

What about the input sequence x * y * z ? Does it produce the expected abstract syntax tree? If not, how might we modify our addition of semantic actions to the grammar to fix the problem?


Representing the Abstract Syntax Tree in a Program

When it comes time to implement our compiler in code, we can represent an AST in at least two different ways:

The choice between data and objects reflects a larger-scale decision between programming styles, based on a set of trade-offs.

With dead data, it is relatively easy to add new behaviors to the program, but not new types of things. We can add a new behavior by writing a new procedure, which must switch on the type of argument it receives. To add a new type of thing, we must modify every procedure that processes things, adding an arm to each for the new data type.

With objects, just the opposite is true. We can add a new kind of thing simply by creating a new class, defining for it a method for each kind of behavior in our system. To add a new behavior, though, we must modify every class of things, adding a new method for the new behavior.

In many large systems, especially ones that model parts of the the external world, this trade-off favors an object-oriented approach. But when implementing language processors, many people think that the balance shifts toward a procedural approach. Why? Because we are far more likely to know up front all the types of things we need to represent in our AST than we are to know all the kinds of behaviors we'll need. After all, the language grammar specifies the syntax of our language from the start!

Some folks use code generation as an example. While you are not likely to add a new kind of expression to the abstract syntax of a language, you may well want to change the target language produced by your compiler's code generator. That's a modification of behavior, which will require a change to every class in the abstract syntax hierarchy.

Whatever representation we choose, it must support downstream processing. The later stages of a compiler must read and modify the abstract syntax of a program in several different ways, some perhaps not anticipated by the compiler writer.

A pure object-oriented approach to abstract syntax will create a common interface for nodes in the abstract syntax tree and different classes for each kind of node. The AST of a program will be a form of composite, with compound expressions and statements that group particular configurations of nodes and with simple expressions such as literals and identifiers as "base cases".

ASTs as objects

In a context like compiler construction, the negatives of this approach begin to offset its positives. Because we are much more likely to implement new behaviors (analysis tools) than types (kinds of AST nodes), implementing all behaviors as methods requires that we continually modify and recompile all the files that define abstract syntax nodes.

A pure procedural approach implements behaviors that manipulate an AST as switches on node type, typically with recursive calls to manipulate the components of node. The AST will be some sort of record or structure that groups data but embodies no behavioral knowledge.

ASTs as records

In a context like compiler construction, many programmers prefer this approach to the OO approach, but even so its negatives nearly offset its positives. In a typed language, this approach requires frequent up- and downcasting of records in order to operate on them. Type checking becomes a programmer concern, not a compile- or run-time concern.

Can we achieve the best of both worlds? To do so, we need a way to add behavior for manipulating ASTs without having to modify the AST classes themselves, but we also need a way to expose type information dynamically without casting. Let's consider how we can implement abstract syntax trees in an object-oriented language in a way that balances these competing forces, using the Visitor pattern.

Update: Actually, we don't want to have to modify the AST classes themselves when that requires us to re-compile the entire application. This is a major issue in statically-typed, compiled languages. But the resurgence of dynamically-typed, interpreted languages makes this problem less a required result than a side effect of language design. One language that changes the dynamic on this issue is Ruby, with its open classes. This feature allows programmers to add a new set of methods for, say, pretty printing, by including the pretty_print method for each kind of AST node in a single file, without opening and editing the text files for those classes. I have not tried this in a compiler yet but will try it now... This seems to be a common way to implement Vistor in Ruby. Patterns come in many forms.



Using Visitors to Implement Abstract Syntax in an OO Language

For today, I asked you to read a fable about the visitor pattern. "... take a trip to the local supermarket with me."

Visitor is a pattern for balancing the forces between working with objects and adding new behaviors.

The strengths of this approach: It is relatively easy to add behaviors to the system, and we never have to edit the tree node classes.

The weaknesses: We must write a lot of little methods in the visitors. If we add a new type of things to our system, then we must edit and re-compile all of... our visitor classes! This technique effectively trades OO benefits for procedural benefits.


Bonus Code: Implementing Abstract Syntax Using Visitors

What might it look like for our little expression grammar? Consider this implementation:

Notice the interplay between accept() messages that are sent to Expressions and visit() messages that are sent to AstVisitors. Perhaps now you can understand why visitor is an example of something we call double dispatch.

Notice, too, the composite AdditionExp and MultiplicationExp delegate to their operands, which are themselves Expressions. In a functional or procedural solution, the sub-expressions would be processed by a recursive call. What you are seeing is the object-oriented version of recursion -- a composite delegating to simpler components, which ultimately reach the "base case" in the form of concrete components. This has been documented as a design pattern called Object Recursion pattern. (Check out the acknowledgements. :-)

You can download all of the code as a zip file.


Representing Semantic Actions in a Program

What are the alternative ways one might implement semantic actions? I can think of a few...

Just another small matter of programming, I guess.


LL(1) Parsing and Semantic Actions

The relatively simple extension of an LL(1) parser to execute semantic actions follows from the predictive nature of top-down parsing. A top-down algorithm looks ahead one token without consuming it and thus expands a non-terminal before processing the tokens that match the production's righthand side.

The primary job of the compiler writer is to add semantic actions to the grammar. This task isn't too difficult, as the grammar tells us most of what we need to know. But you do have to watch for occasional subtleties in precedence. When writing your compiler, be sure to test the output of your parser on many different kinds of programs, and inspect the results carefully.

In our next session, we will consider the pragmatics of implementing these ideas in a program, taking them down to the level of code.


Eugene Wallingford ..... wallingf@cs.uni.edu ..... March 10, 2009