The first job of a parser is to ensure that a program is syntactically correct. This is a boolean decision: yes, the program is legal, or no, the program is not. The parsing techniques we've looked at so far do this.
But in the context of a compiler, the parser must also produce data for use by the rest of the program:
These phases of the compiler need to know the structure of the program as well as the semantic values of the identifiers and literals used.
The parser could produce a parse tree. A parse tree records the derivation done by the parser. Consider this expression from our simple grammar in Sessions 10 and 11:
x + y
The parse tree for this expression looks something like this:
Yikes! That is a lot of tree for not much expression. By showing the full derivation, a parse tree records unnecessary information, including separators and even the marker non-terminals that we added solely to make the grammar work in a top-down algorithm. It also exposes details of the concrete syntax of the language to compiler stages downstream. We would like to implement those stages independent of the language grammar.
Instead, we prefer to have the parser generate an abstract syntax tree that records only the essential information embodied in the program. Here is an abstract syntax tree for the expression above:
That's much better! This tree records the information needed by the semantic analyzer, optimizer, and code generator to do their jobs -- and no more.
There is one more thing we'd like our parser to be able to do. When it encounters an errors, we would like for it to provide messages that the programmer can use to find and fix the errors. If we want the parser to be able to handle errors in a graceful way and report them to the programmer, we need to add one more bit of extra information to our AST: the position of element in the original source file. This information can also be useful for certain kinds of downstream processing, such as refactoring and unparsing.
The abstract syntax tree (AST) serves as the primary input to all later phases in the compiler.
Back in Session 10, I gave you a small context-free grammar and asked you to write a recursive-descent parser for it. Assuming a procedure called match(Token) that either returns true (and consumes a token) or returns false, we wrote code something like this.
Modify this parser to return an abstract syntax of the recognized program.
You may assume six simple AST classes or structs corresponding to the six kinds of things in our grammar.
Solution: Instead of returning true or false, we need to return an AST -- or not. So we don't just recognize statements and expressions; we build trees. Plus, identifiers and numbers have semantic values, so we can't just match and consume those tokens; we also need to record their values.
We might do something like this:
public Statement statement()
{
switch (scanner.peek())
{
Token.repeat :
match(Token.repeat);
Statement s = statement();
match(Token.until);
Expression e = expression();
return new RepeatStatement( s, e );
Token.print :
match(Token.print);
Token t = match(Token.identifier);
return new PrintStatement( new Identifier(t) );
Token.identifier :
Token t = match(Token.identifier);
match(Token.assign);
Expression e = expression();
return new AssignmentStatement( new Identifier(t), e );
default:
return null;
}
}
public Expression expression()
{
switch (scanner.peek())
{
Token.identifier :
Token t = match(Token.identifier);
match(Token.equals);
Expression e = expression();
return new EqualsTest( new Identifier(t), e );
Token.zeroPredicate :
match(Token.zeroPredicate);
Expression e = expression();
return new ZeroTest( e );
Token.number :
Token t = match(Token.number);
return new Number( t );
default:
return null;
}
}
That was pretty easy, wasn't it? The recursive descent technique provides a lot of support implicitly, by being predictive and using the procedure call stack to manage data passing.
When it comes time to implement our compiler in code, we can represent an AST in at least two different ways:
The choice between data and objects reflects a larger-scale decision between programming styles, based on a set of trade-offs. We'll discuss implementation of the AST in more detail next time. For now, let's assume that the parser will produce some record or object for each piece of abstract syntax it recognizes.
Recall our table-driven algorithm for top-down parsing. The parser acts on the basis of the current token i in the input stream and the symbol A on top of the stack, until it reaches the end of the input stream, denoted by $. Here is the algorithm:
until A == $.
This algorithm allows a parser to recognize a legal program, or signal an error when it encounters an illegal construction. But it does not construct the abstract syntax tree that is the desired output of a parser. Step 3.2.1 does provide a hook for taking some action associated with a newly-expanded production, but... What kind of action should the parser take?
What we need is the ability to associate with certain productions a command to construct a node for the abstract syntax. For example, in our familiar little arithmetic grammar, expanding a factor to an identifier should create an identifier node for the program's AST. This is the idea of a semantic action.
We saw a simple form of semantic action in our original acdc compiler in Week 2. That program's ad hoc recursive descent parser replaced explicit matching of subexpressions with calls to factory methods that built the appropriate AST expression from the inputs. The semantic actions in this parser are the calls to the factory methods.
Using Java's run-time stack hid the state of a growing AST behind message sends. For example, in parsing the ac expression A = B + 1, the AST grew as follows:
Table-driven parsing introduces a new wrinkle. Productions expand non-terminals before their components are processed, so the algorithm doesn't know at expansion time what kind of AST node to create. When the parser pops the expanded non-terminal, it loses record of the non-terminal -- and the need to produce an AST node for it. In a recursive descent parser, this information is maintained implicitly on the system stack.
The parser will need a new workspace in which to do this construction. Remember that a top-down parser builds a leftmost derivation of its input stream. When the parser builds a node for the tree, it must remember the result because the node may be part of a higher-level expression that hasn't been recognized yet. For example, a parser that is recognizing the expression x + y will recognize the term x before it has encountered the + token, which will result in an AST node that contains the node for x.
A recursive descent parser maintains, this information implicitly on the system stack. Another explicit stack makes a suitable workspace for a table-driven parser. This new stack is called the semantic stack. It holds pieces of the AST until such time as they are used to build a higher-level node in the tree.
Building an abstract syntax tree using a semantic stack involves several small changes to our original approach:
Note that the semantic actions are pushed onto the parse stack, along with parse actions. A semantic action operates on the semantic stack, popping zero or more items and pushing a new piece of the AST. The semantic stack holds the products of the semantic actions
The semantic actions that we add to the grammar do not affect any other uses of the grammar, such as the creation of FIRST and FOLLOW sets and the building of the parse table. They are annotations for use only by the parsing algorithm.
Next time, we'll apply this technique to our arithmetic language so that we can see how it all works in a concrete. Then we will be able to look at how to implement these ideas in code.