Session 16

On to Semantic Analysis


810:155
Translation of Programming Languages


Opening Exercise

Write down a list of three or more features of a program that are essential in order for the program to be correct that the parser cannot check.

Are there program features that, while not necessary, are desired but which the parser cannot verify?



State of the Project

Do you have a parser that accepts BigRig programs (and only BigRig programs :-)?

Do you have an implementation of the abstract syntax of BigRig?

Do you have a parser that produces the abstract syntax tree of a BigRig program?



At Midterm

Your midterm exam is available as a take-home exam and due before our next session begins. Please follow the rules. I am relying on your honor.

We are now in the middle of the course, and in the middle of the compiler. This week we will discuss briefly the semantic analysis stage of a compiler and then move onto the synthesis of target code.

I hope you remain enthused with how important this topic is, even in 2009. Nearly every issue of Dr. Dobb's Journal contains an article about compiler technology, the most recent as of this writing being an Optimzing Compiler for Parallelization. Once regarded almost as "solved technology", compilers remain a focal point of computer science research, driven by several forces. One is the need for parallelism in multicore systems. Another is the need for security when programs run acros a network or in a distributed system. Yet another is the need for increased reliability in networked environments.

This awakening of compiler research echoes the beginning of compilers, when John Backus and his team at IBM created the first Fortran compiler. Back then, many people said that a compiler could never create an efficient, reliable program -- until the Fortran compiler proved them wrong. Many folks today say the same thing about parallelism. I expect that someone will prove them wrong soon, too.

One way to make it easier for a compiler to generate parallel code is to import more ideas from functional programming. Without side effects, a program's expressions often decompose naturally into parallel operations. But can we compile a functional program for handling massive datasets into an efficient executable? I am reminded of the language Sisal, which aimed to do just that -- only 15 years ahead of the symmetric multiprocessor (SMP) architectures needed to make it viable. But the emergence of multicore processors give new hope to it. And what about Single Assignment C (SAC) or what many regard as the most promising project in this area, Fortress? So much excitement in the world of compilers...

Sometimes, great inspiration comes from the past. A while back, I ran across a link to Forth -- The Early Years, and it reacquainted me with this unusual language. It is easy to forget how diverse the space of programming languages is, but reading about Forth will remind you.

Now let's get back to the business of learning how to write a compiler ourselves!



Semantic Analysis

We have just finished studying syntactic analysis, which showed us how to build a parser that can produce an abstract syntax tree of the program given to a compiler. This tree serves as input to the next phase of the compiler, semantic analysis:

semantic analysis in context

A parser verifies that the program is syntactically correct, that it follows the grammatical rules of the language. Semantic analysis is where the compiler does its best to ensure that the program is semantically correct, that it is meaningful according to the definition of the language's constructs.

In this stage, the compiler does static analysis. This means that it tries to verify only those features which can be observed by looking at the program content. A compiler can verify dynamic features only by executing or simulating the program, the costs of which are too high to implement in the typical compiler.

Static analysis of semantic conventions can report errors that are undetectable in earlier stages of the compiler and that are fatal to program execution. As we discussed in Session 8, some features of a programming language are context-sensitive, not context-free. A canonical example is matching the number of arguments in a function call to the number of formal parameters in a function definition. Rather than ask the parser to process a context-sensitive grammar, we opt for a simpler grammar and leave context-sensitive features as annotations in the language specification, to be verified by a follow-up step -- semantic analysis.

What sorts of static features can a compiler reliably check? Here are a few:

Just as with syntactic analysis, semantic analysis has two goals. The first is to ensure semantic correctness. The second goal is to prepare the abstract syntax tree for code generation. To satisfy this second goal, the semantic analysis phase usually produces two kinds of output:

We can also think about semantic analysis more broadly than in the context of a compiler. One of the common uses of semantic analyzers in contemporary programming environments is in tools that support automated refactoring. Even the simplest refactoring -- renaming a variable or a method -- requires semantic analysis to ensure that the new name does not create a conflict with an existing name in the same scope.

For this course, we will focus our semantic efforts on verifying type information and building the symbol table. Most other static checks performed by a compiler are routine by comparison. Most can be done using straightforward structural recursion over the AST. Some can be performed while building the symbol table. For example, the compiler can verify uniqueness of names at the time each entry is made in the symbol table.

First, let's consider type checking.


A Brief Introduction to Type Checking

A type checker verifies that the type of some program component matches what the program expects where the component occurs. Here are some examples of expectations that must be verified:

How can type checking help with code generation? Many target languages, including assembly languages, often support different operations for similar but different types, such as integers and real numbers. Knowing that a particular expression is an integer means that the compiler can generate code using the more efficient integer operation.

When an operator such as + can be used with arguments of different types, we say that the operator is overloaded. You may be familiar with overloading from languages such as Java and C++. Not only do these languages include overloaded built-in operators (in Java, + works on strings as well as numbers), but we can also write methods that take different types and numbers of arguments. In C++, the programmer can overload operators... Semantic analysis can identify the context in which an operator or function operates and record that information for later use.

Even a compiler as simple as acdc benefits from a type checker that recognizes integer and floating-point values and, when suitable, coerces an expression to be of the type expected by its context. We will look at the acdc type checker again in our next session.

To build a type checker, we use information from the language's grammar, which specifies the syntactic constructs that can appear in a program, and rules for assigning types to each construct. For example, the Java language specification says that, when both operands to a binary arithmetic operator are integers, the type of the result is also an integer. This kind of rule implies that every expression has a type associated with it. The Java spec also tells us that we can create an array of values by following a type name T with []; the result is a new type, array of T. This kind of rule implies that types have structure, because they can be constructed.

We use these conclusions to guide how we check types in an AST.



Type Expressions

In many languages, a type can be basic or constructed. Basic types are those provided as primitives in the language, such as int, double, char, and boolean. Constructed types are created by the programmer out of other types. A Java class and a C struct are record types that aggregate one or more values of other types. Arrays are homogeneous aggregates. In languages that support explicit pointers, such as C and Ada, a pointer is a type constructed from another type, too. Just as we can create an array of T for some T, we can also create a pointer to T.

We can even think of user-defined functions specifying a type. Consider the Java function

int frobnicate( String s, int start )

A call to frobnicate produces integer values for use in containing expressions, but the function header also creates an expectation for a call, namely, that it passes a String and an int as its only two arguments, in that order. We can think of this function as having the type

(String, int) → int

Compilers that want to reason about higher-order functions, and even do type checking on them, use such function types as an important tool. See Haskell for a language that does wonderful things with function types, including automated inference of types from untyped expressions. But even in more conventional languages such as Java -- or BigRig -- we can use function types effectively in our compilers.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... March 24, 2009