Session 1

An Introduction to Compilers


810:155
Translation of Programming Languages


Oral Pop Quiz

Which language do you think is the harder to compile: Ada or Java? What about C or Java? What about Scheme or Ada? Why?

Is there a relationship between the difficulty of compiling a language and the difficulty of using a language? Why or why not?



Welcome to the Course

Welcome to 810:155, formally titled "Translation of Programming Languages". You and I will usually just call it "compilers", for reasons that will be obvious by the time you leave today, if not already. This is one of the great courses in any computer science curriculum. After study programming languages more generally in 810:154 (a course I called "a pivotal point in your computer science education"), you are ready to more fully understand the relationship between mathematics, language, and machine. And to write a real program.

This an exciting course because it is where the "high priests of computer science" reveal their the rest of their secrets. The course in Programming Languages teaches us that the tools we use to process our programs are just programs, like any others. There, we studied some of the basic concepts of programming languages by working with a simple form of a language processing program, an interpreter.

(This course is a natural continuation of that course. In many ways they are very different from one another, but in a sense they are like one course.)

But interpreters often avoid important details of language by simplifying the problem they address. This semester, we attack the full complexity of processing language head-on by studying and implementing compilers. These programs must generally deal with the full range of issues involved in executing a program on a digital computer -- from processing complex syntax, through processing complex abstract syntax and semantics, though generating code for a target machine. Along the way, we will consider other issues such as handling syntactic abstractions, processing a broader class of languages, and optimizing the execution of generated programs. As we do, you will come to appreciate at a new level the role of language in computing, the mathematical foundations of computation, the interaction between program and machine, and the value of good programming practice.



Details of the Course

As most of you know, I am Eugene Wallingford, and I'll be your instructor for the Translation of Programming Languages. (Everyone say, "Eugene"!) You should find the course challenging -- but I also hope that you find it fascinating.

I have passed out a short "vital statistics" page (page 1 of the syllabus) containing basic contact information for me and the course. If you have had me for a course before, then this will look familiar. It lists the URL of the course web page, which includes pointers to a full syllabus, a page with links to all our activities (lecture notes, programming assignments, exams), and a page of resources that we'll use this semester. Use this sheet to set a bookmark to the course web page in your favorite web browser. You never know when a strong urge to do 155 will strike you!

Study the syllabus carefully. It lists the policies by which we will run this course. Again, if you have had me for a course before, these will look familiar. You will need to know these policies and when they apply.

Also study the course activities page, which gives a tentative schedule for the semester, including reading assignments and exam dates. Some points that you should pay special attention to include:

Now, on to the course...



What is a Compiler?

Programmers write programs using what we usually call a source language. The result is the source code of the program. In older days, programmers used to speak of "writing a new code" when they were writing a program. (You can still hear such language from non-programmers, such as on the CBS show NUMB3RS. My mathematician and physicist friends often speak that way. Brits seem more likely: "a source code", "codes".)

We ordinarily think of source code undergoing a single transformation on its way to executable code:

 'the compiler'

That's what a compiler does: translate the source code into machine language, which is run directly on a machine, on "bare metal".

But underneath the hood, a compiler really implements a sequence of transformations on the way to its result. The phases of a compiler can be described and implemented in many ways, but here is a representative picture of the process:

the phases of a compiler

These phases transform the source code from a sequence of characters created by the programmer into an abstract representation of the program, and then into the desired target code. Each stage adds some new information to the representation.

Notice, too, that we think of the output of the compiler generally as target code, rather than machine language. That's because a compiler may generate code for a virtual machine, such as for Java or Squeak. It may also generate source code in another human programming language, such as C. Compiling to C is a common strategy for the implementors of new languages, as C is relatively portable and low level; the result is that the new language can be made quickly available in efficient form on a wide variety of machines.

The history of compilers parallels the history of programming. "In the beginning", when people were working with the first stored-program computers, programs really were codes. These codes were actual machine instructions that caused the computer to perform specific operations. Humans are language machines, and these first computer scientists soon created symbolic shorthand for machine instructions that made codes less tedious to write and easier to read. They wrote a program called an assembler to assemble the corresponding machine instructions from their symbolic codes.

But, humans being language machines, they weren't satisfied. Assembly code was less tedious to write and easier to read, but it was still tedious to write and difficult to read. Furthermore, there was a growing realization that programs were more general than any particular machine... If I write a cool program in assembly code and want to share it with you, then I have to describe it at a higher level and have you re-write it in the assembly language for your computer. This led to the creation of the first programming languages.

A program written in a high-level programming language such as Fortran described a set of computations more generally, in a language that was much closer to the way in which problem solvers thought about their problems. It was also much less dependent on the vagaries of any particular computer. But, to execute the program on a specific machine, they needed a new kind of translator -- which we now call a compiler.

The history of programming languages is a bit murky, as there were many different groups of folks working different subsets of the problem at the same time. (If you would like to learn more about the history of specific languages, I encourage to take a look at the History Of Programming Languages collections available in the library.) One of the best-documented of the early compiler efforts was the creation of the first Fortran compiler at IBM, led by Turing Award winner John Backus. That compiler took 18 person-years to build and taught us a lot about how to implement a translator for a high-level language. But if compilers had continued to require such massive expenditures of effort, computer science would have had a hard time taking off as a discipline.

The lessons from the Fortran project helped make compiler writing, and thus programming in high-level languages, more feasible. Two other more theoretical developments did, too. First, John McCarthy showed that an interpreter for a high-level language could be written in the same language. This meant that, once we had even a small, inefficient compiler for a language, we could create more complete and more efficient compilers much more easily. Second, in the world of linguistics, Noam Chomsky had begun to classify human languages according to the complexity of their grammars. This classification helped mathematicians and computer scientists focus their efforts on particular classes of languages, ultimately leading to a solid theoretical foundation for programs that process langauges of a particular class. This foundation, in turn, made it possible for us to more systematically write the phases of a compiler -- and even to write programs that write these programs for us!

The state of the art in compiler writing continues to advance as we learn more about languages and try to optimize programs under new constraints, such as just-in-time compilation for dynamic languages. But the fundamentals are well understood, which allows any computer science student to learn all she needs to know in order to write compilers and other language-processing programs.

That is why you are here.



The Project

Just some placeholders for our discussion...

What source language? A "real" one, or one created for a course? One the prof assigns, or one the students designe?

What target language? A "real" one, or one created for a course? A real machine, or a virtual machine?

The process: as project course, software engineering, ...

Which teams? 1x6 vs 2x3 vs 3x2 (vs 6x1)? There are trade-offs, both on the size of the language, the complexity of the target language, the type of ancillary products...



Program Analysis and Synthesis

A compiler performs two basic tasks:

(We will see more detail on what next session, and on how next week and throughout semester.)

Early compilers tightly coupled these two stages. One result of this coupling is that writing compilers for m languages on n platforms requires m x n different compilers. But -- at least in principle -- analysis requires only knowledge of the source language, not the target language, and synthesis requires only knowledge of the target language, not the source language. If we decouple the two stages by introducing an intermediate representation, we can perhaps simplify both. And then writing compilers for m languages on n platforms might require only m "front ends" + n "back ends".

Figure 2 above shows the tasks of a compiler broken down, with scanning and parsing doing the bulk of the program analysis and the code generator and target code optimizer doing the bulk of the program synthesis.

Analysis uses knowledge of the source language to decompose the source program into its meaningful parts. An abstract syntax tree can serve as a first intermediate representation.

Synthesis uses knowledge of the target language to translate structures in the intermediate representation into meaningful expressions in the target. The target can be:

Many of the most successful programs are not compilers in the "translate to executable machine code" sense, but they rely on many of the basic techniques of a compiler. This is one reason our course is titled "Translation of Programming Languages" and not "Compilers". The rise to prominence of refactoring browsers and modern IDEs demonstrate the value of this more general name, even if it is a bit confusing at times.



Note: These notes become more of an outline at this point...


The Family Tree of the Compiler

Compilers come in many different forms:

But there are many types of programs that work much like compilers, either with different sorts of output or with different intents. These relatives of the compiler include:

The techniques we study cover in this course make up the bulk of all these language processors.



The Effect of Compiler Technology on its Context

The need and desire to build compilers exerts some pressure on the design and implementation of other parts of our computer systems. For example, consider some of the effects on language design:

Or on computer design:



For Graduate Students

If you are a graduate student enrolled in this course, what will you do to earn graduate credit? Some options include:

See me to make arrangements.



Why 155 is For Every CS Major

...



Eugene Wallingford ..... wallingf@cs.uni.edu ..... January 15, 2008