We are considering the semantic analysis phase of a compiler, which does the last checking of the source program before the compiler to generates code in the target language. This phase can report errors that were undetectable in earlier stages, both those that indicate non-conformance to the language definition and those fatal to a meaningful program execution. We are focusing on type checking, one of several kinds of semantic analysis that can be done statically. Other kinds can be implemented in a straightforward fashion by doing structural recursion over the source program's abstract syntax tree.
A type checker assigns a type expression to the elements of the AST. Type expressions can be:
Constructed types -- especially in conjunction with named types -- are what complicate type checking.
Last time, we re-considered acdc's type checker, which implements a simple type system involving integers, floating-point numbers, and integer-to-float conversions. Today, let's apply our ideas about type systems to a slightly larger language, which will give us a chance to explore more issues in type checking.
Consider this simple language, which generates programs that consist of any number of declarations followed by a single expression:
P → D ; E
D → D { ; D }
| id : T
T → char
| integer
| array [num] of T
| ↑T
E → literal
| num
| id
| E mod E
| E [ E ]
| E↑
This language has two basic types, char and integer, and two constructed types, arrays and pointers. Each array has the index set [0..num-1], where num is the declared size of the array. Expressions include the integer operation mod, array dereferencing, and pointer dereferencing.
In this language, all identifiers are declared prior to being used. Here are two programs generated by the grammar:
year: integer; a: integer;
year mod 1970 b: char;
c: array [10] of integer;
d: ↑integer;
c[d↑] mod c[a]
A type checker for this language can first build type expressions for each declared identifier and then compute the type of the program's expression. We can implement the semantic actions needed to type-check a program in separate arms of the type checker code, whether recursive imperative code or some OO implementation (say, as a visitor).
Let's consider declarations first. These actions require the program to record basic types and assemble constructed types from their components:
P → D ; E
D → D ; D
| id : T addType(id.value, T.type)
T → char T.type ← char
| integer T.type ← integer
| array [num] of T1 T.type ← array(0..num.value-1), T1.type)
| ↑T1 T.type ← pointer(T1.type)
Now let's consider the types of expressions. For literal values, the types are basic types:
E → literal E.type ← char
E → num E.type ← integer
Identifiers have values associated with them in the symbol table:
E → id lookupType(id.value)
Types for the three remaining kinds of expressions must be computed. Because they must be computed from parts that have specific types, there is the possibility of a type error. For example, the integer operation mod requires integer arguments.
E → E1 mod E2 E.type ←
if E1.type = integer and
E2.type = integer
then
integer
else
type error
Similarly, array indices much be integers. At this point, we don't care about the value of the index (is it in the index set?), only that the types match up.
E → E1 [ E2 ] E.type ←
if E1.type = array(s, t) and
E2.type = integer
then
t
else
type error
Finally, pointer dereferencing works only for pointer types, and returns the pointed-to type:
E → E1↑ E.type ←
if E1.type = pointer(t)
then
t
else
type error
That's pretty much how it works. Not too bad!
Add to our simple language:
Solution: Here are the changes we should make, along with the corresponding type-checking actions:
T → ...
| boolean T.type ← boolean
E → ...
| E1 < E2 E.type ←
if E1.type = integer and
E2.type = integer
then
boolean
else
type error
| ! E1 E.type ←
if E1.type = boolean
then
boolean
else
type error
| E1 && E2 E.type ←
if E1.type = boolean and
E2.type = boolean
then
boolean
else
type error
Type checking expressions is, at its simplest, a matter of verifying argument types and setting result types. At the top level, it is straightforward structural recursion. At the bottom level, it is straightforward selection.
Consider this change and addition to our simple language, which introduces statements:
P → D ; S
S → id := E
| if E then S
| while E do S
| S { ; S }
There is not much new required; the same techniques for type-checking expressions work for statements. One new wrinkle arises: in most languages, statements do not really have types, or need them. We can assign types to statements anyway, but more common in procedural languages is not to do so, through a custom type surrogate: void.
Suppose that we wanted to add functions and function application to our little language. This would mean adding at least two new rules to the language's grammar, one for the new type and one for new kind of expression:
T → T "→" T
E → E ( E )
In the first rule, the second → is quoted because it is part of the language, not part of our BNF description!
The semantic rule we need to type-check the new type expression is straightforward; what about the application expression?
T → T1 "→" T2 T.type ← function(T1.type, T2.type)
E → E1 ( E2 ) E.type ←
if E1.type = function(s, t) and
E2.type = s
then
t
else
type error
Even with this simple and incomplete idea of functions, we can explore many of the interesting issues involved in type-checking functions.
For example, we could specify the type of function like Scheme's map procedure as
((s→t) X array(s)) → array(t)
Algol had this feature, even as a statically-typed language -- way back in 1960! C has this feature in a more roundabout fashion, through its use of function pointers.
Type expressions such as the one above bring back to our attention an issue we still need to address: how can we tell if a function expression consists of the appropriate types for it parts? Or for any structured type?
When type-checking a constructed type, the type checker's behavior is usually of the form "if this type expression is equal to that type expression, then return some type, else return a type error". For this reason, a type checker must be able to determine when two types are equal. This raises questions of both semantics and efficiency.
The semantic issue arises most obviously in the case of named types. Last time, we considered the example of a type named employee versus a record with identical fields and field types. Consider the relationship of this issue to that of == versus equals() in Java, identity versus equality. In types, we say that the issue is one of name equality versus structural equality.
This problem can also arise without named types, even with simple types. For example, Pascal allows implicit types:
TYPE link = ↑cell;
ptr = ↑cell;
VAR next : link;
p : ptr;
q : ↑cell;
r : ↑cell;
Are the types of any of next, p, q, and r the same? The Pascal language definition did not specify the meaning of identical types, so that the answer is "It depends" -- on the implementation! This is one of the problems solved in the Pascal descendant, Ada.
The efficiency issue arises with constructed types. To implement structural equality, a type checker must consider the equality of all components of the structures. A constructed type is essentially a tree in which the interior nodes are type constructors and the leaf are basic types, so implementing structural equality is effectively a comparison of trees of arbitrary size. As constructed types become more complex, this sort of tree comparison can be expensive.
Then there is the potential for a recursive tree structure. Consider:
TYPE data = ...;
link = ↑cell;
cell = record
value : data;
next : link
end;
We can represent this type as a tree without a cycle, by using the type name link. But if we substitute a "pointer" for link to eliminate the extra name, we have a tree with a cycle. In Pascal, this issue is implementation-dependent, but the language's emphasis on type names led most implementors to use the cycle-free implementation. In C, the solution was clearer: C uses structural equivalence for all types except structs. When the type checker encounters a struct constructor, it uses name equivalence to compare.
Can we implement a more efficient type checker than one that must walk trees? Yes, if we are willing to make a few simplifying assumptions. These assumptions allow short-cuts in the checking for type equivalence, at the cost of weakening the type system. Consider a type encoding scheme similar to one used by some early C compilers:
pointer(t)
array(t)
freturns(t)
This simplifies the array type (by not considering the index set) and the function type (by not considering the argument type). That information will be recorded and used elsewhere.
These type constructed are all unary operators, which allows for a simple, uniform representation. Consider:
char
freturns(char)
pointer(freturns(char))
array(pointer(freturns(char)))
We can represent each of these type expressions in a linear fashion, using a bit string. With three type constructors, we will need two-bit strings, perhaps:
type string
freturns 01
pointer 10
array 11
We save 00 as the prefix for the primitive types:
type string
integer 0000
char 0001
real 0010
boolean 0011
Now we can represent the results of type constructors using a bit string comprising the strings of its primitive type and the constructors. For example:
type string
char 0001 00000000
freturns(char) 0001 01000000
pointer(freturns(char)) 0001 01100000
array(pointer(freturns(char))) 0001 01101100
pointer(freturns(freturns(array(real)))) 0010 11010110
This saves space and makes for lightning-fast comparisons, as comparing bit strings can be done using bit operations at the machine level. Two different bit strings indicate two different types, which enables the machine to recognize certain type mismatches. However...
In