Which language do you think is the harder to compile: Ada or Java? What about C or Java? What about Scheme or Ada? Why?
Is there a relationship between the difficulty of compiling a language and the difficulty of using a language? Why or why not?
Welcome to 810:155, formally titled "Translation of Programming Languages". You and I will usually just call it "compilers", for reasons that will be obvious by the time you leave today, if not already. This is one of the great courses in any computer science curriculum. After study programming languages more generally in 810:154 (a course I called "a pivotal point in your computer science education"), you are ready to more fully understand the relationship between mathematics, language, and machine. And to write a real program.
This an exciting course because it is where the "high priests of computer science" reveal their the rest of their secrets. The course in Programming Languages teaches us that the tools we use to process our programs are just programs, like any others. There, we studied some of the basic concepts of programming languages by working with a simple form of a language processing program, an interpreter.
(This course is a natural continuation of that course. In many ways they are very different from one another, but in a sense they are like one course.)
But interpreters often avoid important details of language by simplifying the problem they address. This semester, we attack the full complexity of processing language head-on by studying and implementing compilers. These programs must generally deal with the full range of issues involved in executing a program on a digital computer -- from processing complex syntax, through processing complex abstract syntax and semantics, though generating code for a target machine. Along the way, we will consider other issues such as handling syntactic abstractions, processing a broader class of languages, and optimizing the execution of generated programs. As we do, you will come to appreciate at a new level the role of language in computing, the mathematical foundations of computation, the interaction between program and machine, and the value of good programming practice.
As most of you know, I am Eugene Wallingford, and I'll be your instructor for the Translation of Programming Languages. (Everyone say, "Eugene"!) You should find the course challenging -- but I also hope that you find it fascinating.
I have passed out a short "vital statistics" page (page 1 of the syllabus) containing basic contact information for me and the course. If you have had me for a course before, then this will look familiar. It lists the URL of the course web page, which includes pointers to a full syllabus, a page with links to all our activities (lecture notes, programming assignments, exams), and a page of resources that we'll use this semester. Use this sheet to set a bookmark to the course web page in your favorite web browser. You never know when a strong urge to do 155 will strike you!
Study the syllabus carefully. It lists the policies by which we will run this course. Again, if you have had me for a course before, these will look familiar. You will need to know these policies and when they apply.
Also study the course activities page, which gives a tentative schedule for the semester, including reading assignments and exam dates. Some points that you should pay special attention to include:
Now, on to the course...
Programmers write programs using what we usually call a source language. The result is the source code of the program. In older days, programmers used to speak of "writing a new code" when they were writing a program. (You can still hear such language from non-programmers, such as on the CBS show NUMB3RS. My mathematician and physicist friends often speak that way. Brits seem more likely: "a source code", "codes".)
We ordinarily think of source code undergoing a single transformation on its way to executable code:
That's what a compiler does: translate the source code into machine language, which is run directly on a machine, on "bare metal".
But underneath the hood, a compiler really implements a sequence of transformations on the way to its result. The phases of a compiler can be described and implemented in many ways, but here is a representative picture of the process:
These phases transform the source code from a sequence of characters created by the programmer into an abstract representation of the program, and then into the desired target code. Each stage adds some new information to the representation.
Notice, too, that we think of the output of the compiler generally as target code, rather than machine language. That's because a compiler may generate code for a virtual machine, such as for Java or Squeak. It may also generate source code in another human programming language, such as C. Compiling to C is a common strategy for the implementors of new languages, as C is relatively portable and low level; the result is that the new language can be made quickly available in efficient form on a wide variety of machines.
The history of compilers parallels the history of programming. "In the beginning", when people were working with the first stored-program computers, programs really were codes. These codes were actual machine instructions that caused the computer to perform specific operations. Humans are language machines, and these first computer scientists soon created symbolic shorthand for machine instructions that made codes less tedious to write and easier to read. They wrote a program called an assembler to assemble the corresponding machine instructions from their symbolic codes.
But, humans being language machines, they weren't satisfied. Assembly code was less tedious to write and easier to read, but it was still tedious to write and difficult to read. Furthermore, there was a growing realization that programs were more general than any particular machine... If I write a cool program in assembly code and want to share it with you, then I have to describe it at a higher level and have you re-write it in the assembly language for your computer. This led to the creation of the first programming languages.
A program written in a high-level programming language such as Fortran described a set of computations more generally, in a language that was much closer to the way in which problem solvers thought about their problems. It was also much less dependent on the vagaries of any particular computer. But, to execute the program on a specific machine, they needed a new kind of translator -- which we now call a compiler.
The history of programming languages is a bit murky, as there were many different groups of folks working different subsets of the problem at the same time. (If you would like to learn more about the history of specific languages, I encourage to take a look at the History Of Programming Languages collections available in the library.) One of the best-documented of the early compiler efforts was the creation of the first Fortran compiler at IBM, led by Turing Award winner John Backus. That compiler took 18 person-years to build and taught us a lot about how to implement a translator for a high-level language. But if compilers had continued to require such massive expenditures of effort, computer science would have had a hard time taking off as a discipline.
The lessons from the Fortran project helped make compiler writing, and thus programming in high-level languages, more feasible. Two other more theoretical developments did, too. First, John McCarthy showed that an interpreter for a high-level language could be written in the same language. This meant that, once we had even a small, inefficient compiler for a language, we could create more complete and more efficient compilers much more easily. Second, in the world of linguistics, Noam Chomsky had begun to classify human languages according to the complexity of their grammars. This classification helped mathematicians and computer scientists focus their efforts on particular classes of languages, ultimately leading to a solid theoretical foundation for programs that process langauges of a particular class. This foundation, in turn, made it possible for us to more systematically write the phases of a compiler -- and even to write programs that write these programs for us!
The state of the art in compiler writing continues to advance as we learn more about languages and try to optimize programs under new constraints, such as just-in-time compilation for dynamic languages. But the fundamentals are well understood, which allows any computer science student to learn all she needs to know in order to write compilers and other language-processing programs.
That is why you are here.
Just some placeholders for our discussion...
What source language? A "real" one, or one created for a course? One the prof assigns, or one the students designe?
What target language? A "real" one, or one created for a course? A real machine, or a virtual machine?
The process: as project course, software engineering, ...
Which teams? 1x6 vs 2x3 vs 3x2 (vs 6x1)? There are trade-offs, both on the size of the language, the complexity of the target language, the type of ancillary products...
A compiler performs two basic tasks:
(We will see more detail on what next session, and on how next week and throughout semester.)
Early compilers tightly coupled these two stages. One result of this coupling is that writing compilers for m languages on n platforms requires m x n different compilers. But -- at least in principle -- analysis requires only knowledge of the source language, not the target language, and synthesis requires only knowledge of the target language, not the source language. If we decouple the two stages by introducing an intermediate representation, we can perhaps simplify both. And then writing compilers for m languages on n platforms might require only m "front ends" + n "back ends".
Figure 2 above shows the tasks of a compiler broken down, with scanning and parsing doing the bulk of the program analysis and the code generator and target code optimizer doing the bulk of the program synthesis.
Analysis uses knowledge of the source language to decompose the source program into its meaningful parts. An abstract syntax tree can serve as a first intermediate representation.
Synthesis uses knowledge of the target language to translate structures in the intermediate representation into meaningful expressions in the target. The target can be:
Many of the most successful programs are not compilers in the "translate to executable machine code" sense, but they rely on many of the basic techniques of a compiler. This is one reason our course is titled "Translation of Programming Languages" and not "Compilers". The rise to prominence of refactoring browsers and modern IDEs demonstrate the value of this more general name, even if it is a bit confusing at times.
Note: These notes become more of an outline at this point...
Compilers come in many different forms:
But there are many types of programs that work much like compilers, either with different sorts of output or with different intents. These relatives of the compiler include:
Common programs of this form check verify the types of identifiers and point out common problems. Perhaps the best known program of this sort is Lint, which detects potential portability problems in C programs. It can also check programs for syntax and data type errors, giving more (and more helpful) messages to the programmer.
A refactoring browser is much like Lint, but it tends to provide support at a higher level, helping the programmer move from a syntactically valid, working program to a structurally different but semantically equivalent program. Programmers use refactoring browsers to detect potential design flaws ("code smells") in their programs. See this list of refactoring tools for an idea of what is available. A refactoring browser manipulates the intermediate representation of the program, much like an optimizer, but with different goals, and then regenerates code in the source language!
An interpreter typically bypasses a deep representation of the syntax and structure of a program by evaluating expressions when their values are needed. It may do no synthesis at all. Interpreters are often used for command and scripting languages because most operations invoke complex functions that are already implemented. Interpreted code is good for 'gluing' together other pieces that may have been processed more deeply when they were created. Interpretation is also commonly used for languages that make static analysis insufficient for fully understanding the dynamic behavior of the program (APL, Smalltalk, and Java with reflection are examples).
A preprocessor is a program that takes input in a source language augmented with some special-purpose command and produces output in the same source language, in which the special-purpose commands have been replaced with valid source code. C is perhaps the best-known language with a significant preprocessor facility (for file inclusion and macro definition), but the history of programming includes many others, such as Knuth's 'weave' tool for literate programming and the weavers for aspect-oriented programming. Indeed, C++ was originally implemented using a C preprocessor!
The techniques we study cover in this course make up the bulk of all these language processors.
The need and desire to build compilers exerts some pressure on the design and implementation of other parts of our computer systems. For example, consider some of the effects on language design:
Or on computer design:
If you are a graduate student enrolled in this course, what will you do to earn graduate credit? Some options include:
See me to make arrangements.
...