Write a BigRig program to write to standard output a list of the prime numbers less than or equal to a number read from standard input. You could use a sieve of Eratosthenes, but I use brute force.
Here is my solution, // modified with your suggestions in class that clarified the grammar and semantics of BigRig. (Be sure that this information makes it into your grammar and explanatory notes on the language!!) //.
As in our last session, I learned a lot by doing this exercise. Do this among yourselves with other simple programs. You will learn a lot and will end up with test programs as a side effect.
We have been considering the synthesis phase of the compiler, which converts a semantically-valid abstract syntax tree into an equivalent program in the target language. For the last few sessions, we learned about the run-time system that is necessary to support a compiled program. This involves the allocation of memory to data objects and sequence of instructions needed to execute and return from a procedure.
Now -- finally! -- we consider how to generate target code itself.
I can't just hear you muttering under your breath...
Wait a second. You said we were going to generate code. What's this "intermediate representation" stuff? I'm starting to think that you don't want to teach us how to generate code. Maybe you don't even know how to generate code... I knew I should have taken Project Management.
There are very good reasons why we don't want to immediately generate code for the target machine. The main idea is this: A large part of generating code for a particular target machine is (surprise!) machine-dependent. Yet not all machine-dependent elements of code generation depend in the same way on the machine. If we write a monolithic code generator, then any small change in the target machine -- perhaps even a small upgrade in the processor -- can cause changes all through the program. If instead we write a modular code generator, with modules that reflect particular shear layers in the generation process, then we may be able to contain changes in target machine specification to an easily identified subset of modules.
So, let's think of code generation in two parts:
Simple compilers may not require any intermediate representation. Translating ac to dc (a language at the same level of abstraction) or even to Java-like Java or Scheme-like Scheme (languages at much higher levels of abstraction) are often best accomplished with a direct code generator. Whenever you find yourself implementing an off-the-cuff little language to support some task, you will probably want to do the same.
For more complex target machines -- say, the Java virtual machine or MIPS assembly language -- we can benefit by writing our code generator as a sequence of transformations.
To sum up, intermediate transformations enable us to:
Finally, an intermediate representation sometimes outlives the compiler for which it was created. Chris Clark described an example in Build a Tree--Save a Parse:
Sometimes the intermediate language takes on a life of its own. Several systems that have multiple parsers, sophisticated optimizers, and multiple code generators, have been developed and marketed commercially. Each of these systems has its own common virtual assembly language used by the various parsers and code generators. These intermediate languages all began connecting just one parser to one code generator.
P-code is an example IL that took on a life of its own. It was invented by Nicklaus Wirth as the IL for the ETH Pascal compiler. Many variants of that compiler arose [Ne179], including the USCD Pascal compiler that was used at Stanford to define an optimizer [Cho83]. Chow's compiler evolved into the MIPS compiler suite, which was the basis for one of the DEC C compilers -- acc. That compiler did not parse the same language nor use any code from the ETH compiler, but the IL survived.
A good language design usually pays off.
In the analysis of phase of a compiler, we convert concrete code to an abstract syntax tree which is, strictly speaking, an intermediate representation of the source program on the way to a target program. Along the way, we might create a parse tree that retains some of the concrete details of the source program. We can use this idea to motivate a variation of the abstract syntax tree as our next level of representation.
Consider this AST, for the expression a := b*(-c) + b*(-c):
We could transform the AST into a directed, acyclic graph by identifying common subexpressions. This would result in a more concise representation and perhaps eliminate redundant code generation later. The result would be:
Another common intermediate language is based on the idea of a "stack machine". A stack machine behaves like an interpreter of postfix expressions. Some hardware provides machine-level support for such operations, so generating a postfix representation of an AST can be an efficient way of creating the target program. In other cases, a virtual machine interprets a postfix-like program. That is how the Java's virtual machine works.
Some higher-level languages are based on a stack machine model. Examples include Forth, a language intended for embedded control programs, and Joy (be sure to scroll past some status data), a functional language based not on function application but on function composition. (A few years ago I spent some time playing with Joy, including writing a Scheme interpreter for a subset of the language, and it's fun!)
By doing a post-order traversal of the AST above, we can convert it into the following postfix expression:
a b c - * b c - * + :=
In Joy, I could use the dup primitive (for duplicate) to directly represent the equivalent DAG:
a b c - * dup + :=
A postfix representation converts a tree into a linear form. Note that the postfix expression does not represent the edges of the tree directly; they are implicit in the order of items on the stack and the number of arguments expected by the operators on the stack.
(Converting from a tree to a postfix expression and back reminds me of a wonderful technique that I used to teach in my Data Structures course, in which one can recover a tree and its edges even without any information about the items -- if one has just two of its linear representations -- say, in postfix and infix. Algorithms used to do intermediate transformations in compilers use the same idea!)
Syntax trees can be represented in a compiler in many ways. In both, a node is represented with a record.
(And this reminds me of techniques we used when I first learned to program in Fortran, which did not have dynamic memory allocation. These techniques remain in use because they result in efficient implementations.)
We can do better by choosing an intermediate representation that moves us closer to the level of the machine, without tying us to the details of any particular machine. So, after manipulating the abstract syntax tree of a program, and perhaps a postfix representation of it, we will transform the program into a three-address code representation. Statements in three-address code are patterned on the general form
x := y op z
where op is a primitive operation, y and z are its argument, and x is the location in which to store the result.
The name of the representation derives from the fact that each expression uses at most three objects or locations.
Because all expressions in three-address code contain exactly one operator, we must decompose compound expressions into sequences of simpler expressions, perhaps using temporary variables to hold the intermediate results. For example, we might represent the expression
x + y * z
as the following three-code program:
t1 := y * z
t2 := x + t1
We create the temporary names t1 and t2 for locations to hold intermediate results.
We can represent the more complicated AST above using the following three-address code program:
t1 := - c
t2 := b * t1
t3 := - c
t4 := b * t3
t5 := t2 + t4
a := t5
Whereas we might represent the optimized DAG above using this three-address code program:
t1 := - c
t2 := b * t1
t3 := t2 + t2
a := t3
Notice that we have already adapted the general form of three-address code in two ways. First, some operators take only one argument and so need only two addresses. Second, some assignments merely copy a known value to another location, and such expressions need neither a third address nor an operator, other than the assignment that is implicit in all three-address statements.
You may also notice that we have used a temporary name for the whole expression, even though we immediately copy the value into the user-defined variable a. The optimizer or code generator may well eliminate this extra step later, but the use of the extra temporary enables certain other algorithms to be applied to the representation.
At least two features of three-address code make it a good representation for optimization and code generation. First, it forces the compiler to unravel compound expressions and flow-of-control operations into sequences of simpler expressions. Second, its use of names for intermediate values means that expressions can more easily be rearranged.
A three-address code program represents the nodes and edges that appear in an AST, but adds the use of names for internal nodes. This linearizes an AST just as a postfix expression does, but the three-address code program represents the edges explicitly via its temporaries.
As you can see, three-address code resembles assembly language in its level of expression. The expressiveness of a three-address code language depends on the number and kind of statements allowed. In particular, any useful three-address code language will likely have these basic operations:
To support flow of control, it will likely have:
Finally, in order to represent higher-level data types such as arrays and pointers, it may have:
A three-address code language for representing Pascal programs would require all of these but the address assignments, whereas a language for representing C programs would require address assignments, too.
Quick Exercise: Which of these elements are required to represent a BigRig program?
The design of a three-address code -- and especially its set of unary and binary operators -- has a large effect on the resulting code generator. As mentioned above, the intermediate language must be rich enough to implement the semantics of the source language. Beyond that, we must strike a balance between a small-enough language, which is easy to implement and retarget, and a too-small language, which leads to longer three-address code programs. The longer the resulting representation, the harder the optimizer and code generator must work harder to generate an efficient target program.
Write a three-address program using the features described above for this BigRig function from the test program euclid that we wrote last session:
function remainder :=
function with number a, number b and returns number :
number result := 0;
if a < b then,
assign result := a.
else
assign result := call remainder(a-b, b).
!
returns result.
!.
With a little practice, you can write three-address programs of this sort by hand with little effort -- though perhaps much tedium. You'll then be prepared to write code that writes three-address programs!
So, how can a compiler generate a representation in three-address code? We will use the same technique we used to process the abstract syntax tree in earlier stages of the compiler: walk the tree using structural recursion. For each of type of expression in the AST, write the code that generates an equivalent three-address code.
Some of the issues we will have to consider include:
That's what we'll do next time!