Session 2

The Components of a Compiler


810:155
Translation of Programming Languages


Where Compilers Come From

Background: Last time we spoke of a compiler as a translator of a program written in a source language, say S, to a program in a target language, say T. There is, of course, a third language involved: the language in which the compiler is written. We call this the implementation language, I, of the compiler. This language characterizes the platform on which the compiler runs.

We can represent a compiler as a triple of source, target, and implementation languages in a number of ways. A common textual representation looks like this:

C(S-T)(I)

A common graphical representation is called a T-diagram:

T(S,T,I)

So, a native C compiler on a PowerPC chip would have this T-diagram:

T(C,PPC,PPC)

Suppose that you have an S→M compiler that runs on machine M, and you would like to have an L→M compiler for a new language L, also on M. What would you do?

You could write an L→M(S) compiler. Compile it using the existing S→M(M) compiler. Voilá! -- an L→M(M) compiler. Here is a compound T-diagram that shows the process:

L(S)M + S(M)M = L(M)M

This simple process is how many compilers are created. Consider the example of the many tools available for the earliest versions of Unix. TROFF was one of the first text formatting systems built for Unix. However, it didn't provide any facilities for processing mathematical equations. So, someone wrote the first version of EQN, an equation formatter for TROFF, in C. He compiled this program using the standard C compiler on the PDP-11, creating a native EQN→TROFF compiler for the PDP.

Now for a tougher one...

Exercise: We have a brand-new processor N, for which there are no compilers. However, we do have a compiler for language L on machine M:

L(M)M

We write an L→N compiler in L itself:

L(L)N

How can we use these two compilers to create a native compiler for L on machine N, that is, L→N(N)?

Solution: Try this:

  1. Compile L→N(L) using L→M(M), giving L→N(M).

    Building L(M)N

  2. Compile L→N(L) again, this time using L→N(M), giving L→N(N).

    Building L(N)N

Isn't that amazing? We have created a compiler to run on machine N without ever having worked on machine N!

Bootstrapping a Compiler

With some creativity, we can create compilers of many different forms using only a sparse number of tools. The technique used in the example above is the heart of the answer to a common student question: "Where did the first native C come from?" You might ask the same question of Ada, or Java, but C is the "base case" in our usual devolution toward the machine; most compilers are written in C, including the C compiler.

Here is how we bootstrap the first native C compiler. First, we write a compiler for a minimal subset of C (denoted C0) in assembly language:

C0(ASM)ASM

Then we write a compiler for C in C0:

C(C0)ASM

Finally, we compile the second compiler using the first:

creating the native C compiler

If we now want a native C compiler written in full C, we can write it and compile it using our native C compiler.

The technique used in the exercise above can be used to bootstrap an optimizing compiler that optimizes itself. Huh? First, of course, we write an optimizing compiler for S in S that compiles to machine language M. Then we write a quick-and-dirty compiler that translates S into M. For all we care, both the translation process used by the quick-and-dirty compiler and the code it generates can be bad. Now, we compile our S→M(S) compiler twice:

a self-optimizing optimizing compiler

Note that second step: the optimizing compiler optimizes itself! Beautiful.



The Components of a Compiler

[ I have based this lecture on some good material in Chapter 1 of Fundamentals of Compilers by Karen Lemone, CRC Press, 1992. ]

We can think of a compiler as a sequences of processes:

block diagram of a compiler

This is a simplification, much like the waterfall process in software engineering. The stages of compilation can interact with one another, so the process isn't always a sequence. The interaction between scanning and parsing is so strong that many compilers are parser-driven -- the parser is the initial stage, and it calls the scanner whenever it wants another token.

(For you object-oriented programmers, this approach fits nicely with the notion that the scanner is an input stream, a decorator of sorts that produces tokens out of an underlying character stream.)

The order of the stages differs from textbook to textbook and from approach to approach, but the most likely differences involve the existence and ordering of preparation for code generation, initial code generation, and optimization. The Louden textbook adopts a traditional ordering that considers preparing for code generation as part of code generation. You'll also notice that he specifies a stage for optimizing target code.

Our discussion today describes what a compiler does by examining the processing of two simple lines of code:

                       x1 = a + bb * 12;
                       x2 = a/2 + bb *12;

The rest of the semester covers in detail how a compiler does what it does.



Lexical Analysis

This stage of the compiler converts a stream of characters into a stream (or sequence) of tokens. A token is the smallest language unit that conveys meaning. In our example code, the 'x's do not mean anything at the level of the program, but "x1" indicates a variable -- or, more accurately, an identifier. The ';' means something, as it terminates a statement; the white space in the program does not, as it merely separates tokens. Choosing tokens and spacing rules is an important part of language design, and doing it well depends much more on human psychology than on technical details of processing a language grammar.

The scanner read the input stream until it identifies a token, and it then classifies the token by its type. The various token types available in common languages include:

So, a token is really an ordered pair consisting of its type and its value, which may be optional. So, it is our compiler's first data structure!

Consider our example code. A scanner might convert this stream of characters:

x1 = a + bb * 12;
x2 = a/2 + bb *12;

into a sequence of tokens:

    IDENTIFIER x1
    OPERATOR   =
    IDENTIFIER a
    OPERATOR   +
    IDENTIFIER bb
    OPERATOR   *
    LITERAL    12
    TERMINATOR
    IDENTIFIER x2
    ...
    OPERATOR   *                 ... notice, even though no space
    LITERAL    12
    TERMINATOR

Now consider this example:

    if newValue < 1 then ...  

The tokens are a keyword, an identifier, an operator, another identifier, another keyword, .... The scanner doesn't know that newValue < 1 makes up a boolean condition; it simply finds the tokens.

At this level, our compiler can find only the simplest lexical errors, such as misspellings. We could build, or at least begin to build, a table of the identifiers recognized (a symbol table), but this is usually done in a later stage, along with some related processing.



Syntax Analysis

The next stage of the compiler converts a stream (or sequence) of tokens into optionally a parse tree and ultimately into an abstract syntax tree, our second and third data structures.

Parsing is like diagramming sentences in a natural language. With natural language, we can recognize a legal sentence by its composition from a noun phrase followed by a verb phrase. At this level, the compiler recognizes expressions, statements, blocks, procedures, classes, modules, and other meaningful units at the level above tokens.

We can parse a token stream using a number of different approaches, usually grouped as "top-down" and "bottom-up". Learning these techniques -- and how to implement them -- is a big part of a course on compilers.

A parse tree groups tokens into meaningful units but typically retains a lot of syntactic details, such as parentheses and statement terminators. An abstract syntax tree also groups tokens but retains only essential information about the statement or expression. Some compilers create a parse tree on the way to creating an abstract syntax tree, but many produce an AST directly.

The token stream for our sample code might be parsed into a parse tree such as this: [to be created].

... and then into an abstract syntax tree such as this:

PROGRAM
   ASSIGNMENT STATEMENT
     LHS
        IDENTIFIER x1
     RHS
        ADDITION EXPRESSION
           IDENTIFIER a
           MULTIPLICATION EXPRESSION
              IDENTIFIER bb
              INTEGER    12
   ASSIGNMENT STATEMENT
     LHS
        IDENTIFIER x2
   ...

Notice that both of these trees reflect semantic information such as the order of precedence on operators. bb * 12 is to be computed first, and its value added to a. This shows how the linear token stream is converted into a tree structure that reflects the meaning of the statement.



Semantic Analysis

The semantic analyzer takes as input the AST and generates as output an AST annotated with new information. It usually also produces a symbol table that contains all of the user-defined identifiers used in the program. Both of these are data structures in their own right.

The purpose of this stage of the compiler is to determine whether the abstract syntax tree produced by the parse is meaningful and then to add information that makes the AST more useful when generating code.

Static checking is the process of determining whether the tree is meaningful. It usually involves some or all of:

In the static checking phase, the compiler can generate error messages for semantic problems in the program. These can be of great help to the programmer.

Consider our running example. The types associated with each intermediate expression depend on the types of a and bb. If either is, say, a boolean variable, then both of these statements are illegal! If a is a floating-point value, then statement 2 may be illegal -- if / is the operator for integer division.



Source Code Optimization

By the end of semantic analysis, the compiler has produced an annotated abstract syntax tree of the program, which as compactly as possible represents as much information about the content of the program as possible.

an AST

The source code optimizer looks for ways to make the program more efficient, without changing what it computes. We can think of at least three kinds of optimization:

As the goal of our current discussion is only to illustrate what happens in each stage, we'll only consider local optimizations here, and then only two common forms.

Sometimes, a program stores a constant in a variable for use later. If the compiler can show that the variable still holds this value at a particular point in the computation later, then it can replace the variable reference with the constant value. Using the constant directly is more efficient than doing the memory fetch at run time.

This technique is called constant propagation or constant folding. If your compiler propagates constants, then you need never worry about creating a temporary variable solely for its value as a name!

A second common local optimization is elimination of common subexpressions. When two pieces of the code compute the same result, the compiler can generate one copy and reuse it. Consider our AST above. Both statments compute bb * 12, so the compiler can reuse the former instead of generating the latter:

an optimized AST

Of course, the compiler has to be able to show that under no circumstances does the value of the former get changed to something different from that of the latter. Perhaps now you can see how functional programming, with no side effects, can empower the tools that process programs!



From here on the notes are still partially incomplete but, I hope, better for you than nothing. -- EW



Preparation for Code Generation

This phase of the compiler uses a (maybe optimized) abstract syntax tree to create a plan for allocating memory and registers at run-time. These locations will hold the code of the program and the values of identifiers at various times during execution.

One of the primary issues in memory allocation is choosing between static allocation (fixed at compile time) and dynamic allocation (determined at run time).

One of the primary issues in register allocation is make the generated executable run as fast as possible. First, operations on registers are generally faster than operations on memory locations. Second, if an identifier can remain in a register for several operations, the code will run faster than if it has to repeatedly load registers and store their values out to memory.

A common approach is to assign registers and memory locations to variables in a preparation phase and then to generate target code with this information available to the code generator.

Unfortunately, we have a chicken-and-egg problem. To generate the best target code, we need to know how our registers and memory locations to be used; but to select registers and memory locations in the best way, we need to know what the target code looks like!

Registers are a scarce resource. RISC machines have more registers than older architectures, but there usually still aren't enough of them for the many partial computations even a small piece of code does. But: good register allocation often contributes a better performance improvement than code optimization!

... another approach: assign temporary registers and memory locations, and then modify them as needed during code generation

[for our code example]


Code Generation

intermediate representation (optimized AST) and memory and register allocation → assembly code

You've all written some assembly language, so you know that each statement in our AST requires several lines of machine code. The key to generating good code is the selection of efficient, small sequences of assembly code to implement each operation. We usually do this by having "code templates" for each kind of operation, which we know to do a good job in the general case. The generator can specialize this code for any peculiarities of the source code in question. (A target code optimizer might do even more of this!)

Note that readability of the generated code is usually of no importance. An especially tricky piece of code is preferred if it is smaller and faster than the alternatives. We can comment this code if we really want to look at. (This is one of the good uses of comments: to explain a piece of code that needs to be written in a way that is not obvious or easy to understand!)

For our code example above, a code generator might produce assembly code of this sort (comments added for your sake):

    PUSHADDR   X2           ; put address of X2 on stack
    PUSHADDR   X1           ; put address of X1 on stack
    PUSH       bb           ; put b on stack
    PUSH       a            ; put a on stack
    LOAD       1(S),R1      ; bb → R1
    MULT       #12,R1       ; bb * 12 → R1
    LOAD       (S),R2       ; a → R2
    STORE      R2,R3        ; copy a → R3
    ADD        R1,R3        ; a + bb * 12 → R3
    STORE      R3,@2(S)     ; X1 ← a + bb * 12
    DIV        #2,R2        ; a/2 → R2
    ADD        R1,R2        ; a/2 + bb * 12 → R2
    STORE      R2,@3(S)     ; X2 ← a/2 + bb * 12

The particular details of this fictional assembly language are not important. But notice how the optimized code is computed once into R1 and used twice. Notice, too, how a is used from R2 in the first computation and then halved for use in the second. These result from the optimization phase and the register allocation phase that preced code generation.


Next Time: An Implemented Compiler

The next two sessions, we will take the step from what a compiler does to how a compiler works. We will do so by studying a simple compiler for a simple language, dubbed AC. Implemented in Java, the compiler compiles programs written in AC into DC, an assembly language-like stack calculator available on many Unix systems.

To prepare for next time, please download the simple compiler. Read its README file, look at the Java code, and the compile and run it. The zip file contains sample programs written in AC, their compiled equivalents in DC, and a file with some sample input and output for earlier phases of the compiler. Also take a look at the man page for DC. You certainly don't need to understand all of the details of DC, but you'll probably want some sense of how the target language looks and works.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... January 20, 2009