We are now considering the synthesis phase of the compiler, which converts a semantically-valid abstract syntax tree into an equivalent program in the target language. Last session, we discussed the idea of intermediate representation, a couple of candidate intermediate representations, and a third -- three-address code -- that serves as a useful point on the way to machine-level code. Statements in three-address code are patterned on the general form
x := y op z
with at most three identifier addresses. Such programs require the code generator to introduce temporary identifiers in order to decompose more complex expressions, including control structures, into three-address code. The essence of this stage of processing is that it linearizes the AST in a way that prepares for generating target code.
Today, we learn how to generate three-address code from an abstract syntax tree and how to represent three-address code programs in memory.
Recall that a three-address code language will likely have binary assignments, unary assignments, copy assignments, unconditional jumps, conditional jumps, a set of statements for higher-level transfers of control such as procedure calls, indexed assignments, and address assignments.
To generate a three-address code representation of an AST, we will use the same technique we used to create and process the abstract syntax tree in earlier stages of the compiler: walking the tree using structural recursion. For each of type of expression in the AST, write the code that generates an equivalent three-address code. In the compilers world, this is often referred to as syntax-directed translation.
Each expression E on the lefthand side of a grammar rule will be computed into a new temporary variable ti. The representation generated for E will consist of two parts:
Generating three-address code requires using many temporary variables. The code generator will need a procedure such as makeNewTemp() to create a new, unique temporary variable name each time it is called. For simplicity, we'll assume that we have such a procedure and that it generates the sequence t1, t2, t3, .... Also for simplicity, we will use a unique identifier for each temporary. A more efficient compiler could use a small pool of unique identifiers, reusing the same name multiple times in different scopes.
The code generator will need to emit code in three-address form. For simplicity, let's assume that we have a procedure emitCode() that works like Java's OutputStream.println method: it takes a single string, perhaps concatenated from many parts, and writes it followed by a new line character. In our discussion, we will use emitCode() to generate a string that we can store in an expression's code field, though in practice this procedure would typically write on an output stream, which may be bound to a file.
When writing your compiler, you can implement makeNewTemp() and emitCode() to behave in just these ways, and then use them!
Suppose that we have the following simple grammar:
S → id := E
E → E1 + E2
E → E1 * E2
E → - E1
E → ( E1 )
E → id
Here is the semantic action for the first arm of the grammar:
S → id := E
------------
S.code := [ E.code ]
emitCode( id.place " := " E.place )
The [ E.code ] means to make a recursive call to compute the three-address code for E and place it immediately before the code generated by the call to emitCode(). Notice that this grammar rule is a special case, as its lefthand symbol does not have a temporary variable associated with its value.
Now it's your turn...
Use the procedures makeNewTemp() and emitCode() to write the semantic actions for a three-address code generator.
Structural recursion does most of the work. These semantic actions simply package the code built by the recursive calls with the newly-generated statement, if any, in the correct order.
Here is a possible solution.
E → E1 + E2
------------
E.place := makeNewTemp()
E.code := [ E1.code ]
[ E2.code ]
emitCode( E.place " := " E1.place " + " E2.place )
E → E1 * E2
------------
E.place := makeNewTemp()
E.code := [ E1.code ]
[ E2.code ]
emitCode( E.place " := " E1.place " * " E2.place )
E → - E1
---------
E.place := makeNewTemp()
E.code := [ E1.code ]
emitCode( E.place " := negate " E1.place )
E → ( E1 )
-----------
E.place := E1.place
E.code := E1.code
E → id
-------
E.place := id.place
E.code := ""
When we implement this code, a programmer-defined identifier is replaced by a pointer to a symbol table entry for the name.
Most source languages have high-level control structures such as if..then..else.. and for..to..do... Three-address code can represent such statements using code labels and jumps.
Consider the simple while statement:
S → while E do S1
First of all, we need to make a decision about the semantics of boolean values in the little language. What values of E count as true, and which count as false? For simplicity, let's define 0 as false and everything else as true.
Second, we need to create a new procedure in our code generator: a procedure that produces unique labels for use in the generated code. Unlike the case of temporary variables, we cannot reuse labels multiple times, as they all exist in the same namespace. So they really must be unique.
Let's assume the presence of a procedure named makeNewLabel() that generates the sequence of unique labels L1, L2, L3, ....
Now we are ready to generate three-address code for the while statement. This translation resembles the one that you make when implementing a loop in assembly language. We need to generate code bodies for the constituent expression and statement and then use jumps to ensure that the three-address code has the same semantics as the while statement. Here is a picture of what the generated code will look like:
So, we need to generate two labels and make recursive calls to generate code for E and S1. The semantic action for the while statement would look something like this:
S → while E do S1
------------------
L1 = makeNewLabel()
L2 = makeNewLabel()
S.code := emitCode( L1 ": " )
[ E.code ]
emitCode( "if " E.place " = 0 goto " L2 )
[ S1.code ]
emitCode( "goto " L1 )
emitCode( L2 ": " )
Why don't you try your hand at a simple control structure. Let's add to our little language an if statement:
S → if E1 = E2 then S1
Write the semantic action to generate three-address code for this statement.
This turns out to be quite similar to the while statement. Here is a picture of what the generated code will look like:
The semantic action itself will look something like this:
S → if E1 = E2 then S1
-----------------------
L3 = makeNewLabel()
L4 = makeNewLabel()
S.code := [ E1.code ]
[ E2.code ]
emitCode( "if " E1.place " = " E1.place " goto " L3 )
emitCode( "goto " L4 )
emitCode( L3 ": " )
[ S1.code ]
emitCode( L4 ": " )
How does adding an else clause complicate code generation?
A three-address statement is an abstraction that the compiler writer must implement in code. In practice, each statement will be a record with fields for its parts. What might each three-address code statement look like? There are at least two options.
We could represent the record as a quadruple, a record with four fields: the operator, the left operand, the right operand, and the result. Unary operators can leave the right operand slot empty. Some operators won't need to use the result slot, either; for example, param needs to identify only a single argument to a procedure. Finally, jump instructions don't have results, but they do have target labels; the label can be stored in the result slot. (If you implement these statements using objects or variable-length records, then these conventions become unnecessary.)
Consider the three-address code for our old friend, the expression a := b*-c + b*-c:
As mentioned earlier, the slots that refer to programmer-defined names can be replaced with pointers to the corresponding symbol table entries.
Notice that using quadruples creates a kind of duplication. Each statement has a result, which is stored in a temporary location. The sequencing of these temps matches exactly the sequence of the statements themselves. We can eliminate this representation of the temporaries by storing in the corresponding argument slot the number of the instruction that computes it. The result is an implementation of three-address code in a real triple.
Here is what a triple representation would look like for our example:
Note that the record number (0 through 4) now stands in place of the five temporaries (t1 through t5) and eliminates the need for a result field.
Using triples creates a new wrinkle for statements such as a[i] := x. Assigning a value to an array slot requires two separate operations: computing the target slot in the array, and then assigning to it. Such a statement requires two triples:
There are some interesting trade-offs between these two representations. Triples are efficient and compact. Quadruples can use a single instruction in some cases where triples require two. Triples are hard to reorder, because some entries refer to other entries by their positions in the list.
Being able to reorder statements is an important feature if we want the compiler to improve the efficiency of the code it generates. We could represent the triples using a linked list with pointers instead of an array with indices, but that makes the compiler itself less efficient. We can stick with an array representation and still find a nice balance between triples and quadruples using a technique known as indirect triples. Let's leave that as material beyond the scope of the course. Feel free to read up on the idea, or to ask.
There is one other issue that can arise. When the compiler generates code for a boolean expression or a conditional statement, it may not yet know where the target of a jump is. The result is a label with no corresponding code in the target program. This can also occur when the compiler generates code "out of order", say, by generating code for local chunks of the AST independently and then stitching the pieces together at the end.
One solution is to generate code in two passes, a first to generate the framework of the target program and a second to compute the jumps and labels. This simplifies the compiler's task but makes it less efficient.
Another approach makes only a single pass over the whole program. When the compiler generates a branch whose jump target is undefined, it adds the statement to a list of "jumps to be completed". As soon as the correct label is known the compiler fills in the slot and removes the statement from the to-do list. This technique is known as backpatching. Again, let's leave this idea as beyond the scope of the course. Feel free to read up on the idea, or to ask.
Note that for simple languages this problem may not arise...
... boolean expressions create interesting problems. How can a three-code generator produce code for boolean expressions that "short citrcuits" evaluation as soon as possible?
If we translate the AST of the source program into an intermediate representation, then the code generator must include a second stage that translates the intermediate program into target code.
Whether our compiler translates the AST into an intermediate program and then into target code, translates the AST directly into target code, or even generates target code immediately during the parsing phase, it will follow a common pattern:
generateCode( AST tree )
{
if ( tree != null )
{
generate code to prepare for code of left subtree
generateCode( tree.leftChild() );
generate code to prepare for code of right subtree
generateCode( tree.rightChild() );
generate code to implement tree's behavior
}
}
This generalizes an idea we saw several times in our look at writing three-address code for particular kinds of program expressions. It is a post-order traversal of the tree, or of each instruction in a sequence, extended to generate code that must be inserted for the sub-expressions. The preparation code generally depends on the kind of tree being processed.
The best sort of intermediate representation as input to a code generator has data objects that map directly onto the primitives of the target machine. This includes the insertion of type conversion operators.
The output of the generator can be absolute or relative. Absolute code has all of its addresses already computed and placed in the code. A program in this form can be loaded anywhere and executed immediately. The downside is that it requires compiling the whole program at the same time, which limits the flexibility available to programmers who use the compiler. Relative code -- with addresses computed only as offsets to be resolved at the time the program is loaded to execute -- offers the flexibility of separate compilation but requires the cost of linking modules and loading each time the program is run.
A code generator that produces assembly language as the target language allows the compiler to rely on the assembler facilities of the target machine, which are often considerable. In contrast, a code generator that produces machine language must do all of the work that an assembler might already do -- but perhaps can do it more efficiently.