Session 18

Type Checking a Small Language


810:155
Translation of Programming Languages


Recap

We are considering the semantic analysis phase of a compiler, which does the last checking of the source program before the compiler to generates code in the target language. This phase can report errors that were undetectable in earlier stages, both those that indicate non-conformance to the language definition and those fatal to a meaningful program execution. We are focusing on type checking, one of several kinds of semantic analysis that can be done statically. Other kinds can be implemented in a straightforward fashion by doing structural recursion over the source program's abstract syntax tree.

A type checker assigns a type expression to the elements of the AST. Type expressions can be:

Constructed types -- especially in conjunction with named types -- are what complicate type checking.

Last time, we re-considered acdc's type checker, which implements a simple type system involving integers, floating-point numbers, and integer-to-float conversions. Today, let's apply our ideas about type systems to a slightly larger language, which will give us a chance to explore more issues in type checking.



An Example Language

Consider this simple language, which generates programs that consist of any number of declarations followed by a single expression:

    P → D ; E
    D → D { ; D }
      | id : T
    T → char
      | integer
      | array [num] of T
      | T
    E → literal
      | num
      | id
      | E mod E
      | E [ E ]
      | E

This language has two basic types, char and integer, and two constructed types, arrays and pointers. Each array has the index set [0..num-1], where num is the declared size of the array. Expressions include the integer operation mod, array dereferencing, and pointer dereferencing.

In this language, all identifiers are declared prior to being used. Here are two programs generated by the grammar:

    year: integer;          a: integer;
    year mod 1970           b: char;
                            c: array [10] of integer;
                            d: ↑integer;
                            c[d↑] mod c[a]

A type checker for this language can first build type expressions for each declared identifier and then compute the type of the program's expression. We can implement the semantic actions needed to type-check a program in separate arms of the type checker code, whether recursive imperative code or some OO implementation (say, as a visitor).

Let's consider declarations first. These actions require the program to record basic types and assemble constructed types from their components:

    P → D ; E
    D → D ; D
      | id : T               addType(id.value, T.type)
    T → char                T.type ← char
      | integer              T.type ← integer
      | array [num] of T1    T.type ← array(0..num.value-1), T1.type)
      | T1                  T.type ← pointer(T1.type)

Now let's consider the types of expressions. For literal values, the types are basic types:

    E → literal              E.type ← char
    E → num                  E.type ← integer

Identifiers have values associated with them in the symbol table:

    E → id                   lookupType(id.value)

Types for the three remaining kinds of expressions must be computed. Because they must be computed from parts that have specific types, there is the possibility of a type error. For example, the integer operation mod requires integer arguments.

    E → E1 mod E2            E.type ←
                               if E1.type = integer and
                                  E2.type = integer
                               then
                                  integer
                               else
                                  type error

Similarly, array indices much be integers. At this point, we don't care about the value of the index (is it in the index set?), only that the types match up.

    E → E1 [ E2 ]            E.type ←
                               if E1.type = array(s, t) and
                                  E2.type = integer
                               then
                                  t
                               else
                                  type error

Finally, pointer dereferencing works only for pointer types, and returns the pointed-to type:

    E → E1                 E.type ←
                               if E1.type = pointer(t)
                               then
                                  t
                               else
                                  type error

That's pretty much how it works. Not too bad!



A Quick Exercise

Add to our simple language:

Solution: Here are the changes we should make, along with the corresponding type-checking actions:

    T → ...
      | boolean            T.type ← boolean
    E → ...
      | E1 < E2            E.type ←
                             if E1.type = integer and
                                E2.type = integer
                             then
                                boolean
                             else
                                type error
      | ! E1               E.type ←
                             if E1.type = boolean
                             then
                                boolean
                             else
                                type error
      | E1 && E2           E.type ←
                             if E1.type = boolean and
                                E2.type = boolean
                             then
                                boolean
                             else
                                type error

Type checking expressions is, at its simplest, a matter of verifying argument types and setting result types. At the top level, it is straightforward structural recursion. At the bottom level, it is straightforward selection.



Type Checking Statements

Consider this change and addition to our simple language, which introduces statements:

    P → D ; S

    Sid := E
      |  if E then S
      |  while E do S
      |  S { ; S }

There is not much new required; the same techniques for type-checking expressions work for statements. One new wrinkle arises: in most languages, statements do not really have types, or need them. We can assign types to statements anyway, but more common in procedural languages is not to do so, through a custom type surrogate: void.



Type Checking Functions

Suppose that we wanted to add functions and function application to our little language. This would mean adding at least two new rules to the language's grammar, one for the new type and one for new kind of expression:

    T → T "" T
    E → E ( E )

In the first rule, the second → is quoted because it is part of the language, not part of our BNF description!

The semantic rule we need to type-check the new type expression is straightforward; what about the application expression?

    T → T1 "" T2    T.type ← function(T1.type, T2.type)
    E → E1 ( E2 )     E.type ←
                          if E1.type = function(s, t) and
                             E2.type = s
                          then
                             t
                          else
                             type error

Even with this simple and incomplete idea of functions, we can explore many of the interesting issues involved in type-checking functions.

Type expressions such as the one above bring back to our attention an issue we still need to address: how can we tell if a function expression consists of the appropriate types for it parts? Or for any structured type?



The Equality of Type Expressions

When type-checking a constructed type, the type checker's behavior is usually of the form "if this type expression is equal to that type expression, then return some type, else return a type error". For this reason, a type checker must be able to determine when two types are equal. This raises questions of both semantics and efficiency.

The semantic issue arises most obviously in the case of named types. Last time, we considered the example of a type named employee versus a record with identical fields and field types. Consider the relationship of this issue to that of == versus equals() in Java, identity versus equality. In types, we say that the issue is one of name equality versus structural equality.

This problem can also arise without named types, even with simple types. For example, Pascal allows implicit types:

         TYPE  link = ↑cell;
               ptr  = ↑cell;
         VAR   next : link;
               p    : ptr;
               q    : ↑cell;
               r    : ↑cell;

Are the types of any of next, p, q, and r the same? The Pascal language definition did not specify the meaning of identical types, so that the answer is "It depends" -- on the implementation! This is one of the problems solved in the Pascal descendant, Ada.

The efficiency issue arises with constructed types. To implement structural equality, a type checker must consider the equality of all components of the structures. A constructed type is essentially a tree in which the interior nodes are type constructors and the leaf are basic types, so implementing structural equality is effectively a comparison of trees of arbitrary size. As constructed types become more complex, this sort of tree comparison can be expensive.

Then there is the potential for a recursive tree structure. Consider:

         TYPE  data = ...;
               link = ↑cell;
               cell = record
                         value : data;
                         next  : link
                      end;

We can represent this type as a tree without a cycle, by using the type name link. But if we substitute a "pointer" for link to eliminate the extra name, we have a tree with a cycle. In Pascal, this issue is implementation-dependent, but the language's emphasis on type names led most implementors to use the cycle-free implementation. In C, the solution was clearer: C uses structural equivalence for all types except structs. When the type checker encounters a struct constructor, it uses name equivalence to compare.

Can we implement a more efficient type checker than one that must walk trees? Yes, if we are willing to make a few simplifying assumptions. These assumptions allow short-cuts in the checking for type equivalence, at the cost of weakening the type system. Consider a type encoding scheme similar to one used by some early C compilers:

    pointer(t)
    array(t)
    freturns(t)

This simplifies the array type (by not considering the index set) and the function type (by not considering the argument type). That information will be recorded and used elsewhere.

These type constructed are all unary operators, which allows for a simple, uniform representation. Consider:

    char
    freturns(char)
    pointer(freturns(char))
    array(pointer(freturns(char)))

We can represent each of these type expressions in a linear fashion, using a bit string. With three type constructors, we will need two-bit strings, perhaps:

    type         string

    freturns     01
    pointer      10
    array        11

We save 00 as the prefix for the primitive types:

    type         string

    integer      0000
    char         0001
    real         0010
    boolean      0011

Now we can represent the results of type constructors using a bit string comprising the strings of its primitive type and the constructors. For example:

    type                                        string

    char                                        0001 00000000
    freturns(char)                              0001 01000000
    pointer(freturns(char))                     0001 01100000
    array(pointer(freturns(char)))              0001 01101100
    pointer(freturns(freturns(array(real))))    0010 11010110

This saves space and makes for lightning-fast comparisons, as comparing bit strings can be done using bit operations at the machine level. Two different bit strings indicate two different types, which enables the machine to recognize certain type mismatches. However...

In