Session 5

Lexical Analysis and Regular Languages


810:155

Translation of Programming Languages


Opening Exercise

Show the dc code generated by our acdc compiler for this ac program:

    i a
    a = 1922
    p a

Use the CodeGenerator class to walk the AST. (Question: What does the AST for the program look like?)

Then take a look at what the compiler produces. The templates for dc are simple, because the language is simple. Even without being able to write our own dc programs, we can follow the code generator's work and understand it.



Transition, From Breadth to Depth

Last week, we studied a complete compiler, from end to end. We were able to see all the phases of a compiler implemented, but only for a simple language that constrained the decisions that we had to make. This study gave us breadth but only a little depth.

For the rest of the semester, we will examine each of the phases of the compiler in much more depth. We begin at the beginning, with the scanner, the module of the compiler that performs lexical analysis.



Lexical Analysis

The job of the lexical analyzer, or scanner, is to transform a stream of characters into a stream of tokens. The token is the smallest meaningful unit of a language. A scanner needs to know how to recognize the terminals in our language grammar: the keywords, the punctuation marks, legal identifiers, and legal values. We know the exact form of keywords and punctuation, so our scanner can recognize them by perfect match. Identifiers and values, though, are defined by rules that tell us how to match an input string with one of these terminal types.

The scanner also needs to how how to recognize the characters that don't matter, the whitespace that programmers use to make a program easier to read but which do not contribute to the meaning of the program.

There are some potential conflicts. What if a program contains the string "if8"? Is it an identifier, or the keyword if followed by a number? The language definition should cover such cases, and the scanner for the language must implement this specification.

To build a scanner, we need:

For these purposes, we will use regular expressions and deterministic finite state automata, respectively. These may sound scarier than they are. CS theory will help us to develop these techniques and show us that we can directly convert our description of rules into recognition code.



Languages and Notation

An alphabet is a finite set of symbols. For example, {a, b, c, ..., z} ∪ {A, B, C, ..., Z} is the Roman alphabet that we use for words in English, and {0, 1} is the usual binary alphabet.

A string (or word or sentence) is a finite sequence of characters drawn from from an alphabet. "Vonnegut" is a favorite string of mine from the Roman alphabet. "01111111" is a word in binary. Both of these strings have a length of 8. Not all strings have length 8, of course. "Kurt" and "00" are other strings drawn from these alphabets; "chronosynclastic" is another favorite of mine. There is even a string of size 0, "", that can be drawn from any alphabet. We will refer to it with the symbol ε, the lowercase Greek letter epsilon.

A language is set of strings drawn from a particular alphabet. {"Kurt", "Vonnegut", "chronosynclastic", "infundibulum"} and {"01111111", "00"} are languages over the Roman and binary alphabets, respectively. ε can be a member of a language, too, so {"Kurt", "Vonnegut", ε} is a language. Finally, the empty set {} is a language, too. We will often denote it with the symbol ∅.

Unlike a string, a language can be infinite. We can't write down all of the members of an infinite language, so if we want to specify it we will need a notation that describes the members of the set.

In your programming languages course, you likely used BNF notation to describe the strings that made up a programming language. Here is a BNF description of a language drawn from the alphabet {a}:

    s → A
    A → aA
      | ε

This describes the language {"", "a", "aa"", "aaa", ...}, the set of strings consisting of 0 or more a's.

In a BNF description:

A choice is often called "alternation". Concatenation is often indicated by juxtaposition, simply placing two symbols side by side.

Finally, when we use a non-terminal on the righthand side of a production, we can replace it with any string that matches the non-terminal. This allows inductive definitions.

BNF notation is useful for defining programming languages because, as described here, it is capable of defining context free languages, a set of languages to which most programming languages belong. We'll use BNF for this purpose, to describe the syntax of our programming languages, which is useful when we move on to the topic of parsing, or syntactic analysis, in a few sessions.

But the tokens that our scanner must recognize form a simpler kind of language called a regular language. Regular languages are a subset of the context free languages whose syntax contain no recursion, only repetition. Let's turn our attention to regular languages and how to recognize them.



Regular Expressions

A regular expression is a description written in the notation described above, but with no recursion. For example, consider identifiers in the language Pascal, which consist of any number of letters and digits up to 1024 in size, but which must begin with a letter. A regular expression for these identifiers is:

    identifier → letter (letter | digit)*

Note that our regular expression does not incorporate the 1024-character maximum length. Length restrictions are specified outside the BNF description, which allows us to write simpler BNF descriptions that assume strings of unbounded length.

Of course, letter and digit are non-terminals, so technically we should define them, too:

    letter → a | b | c | ... | z | A | B | C | ... | Z
    digit  → 0 | 1 | 2 | ... | 9

You will commonly see BNF descriptions that leave some non-terminals undefined, relying on a common understanding of their meaning. But a formal language specification must include them.

Quick Exercise: Now try your hand at it. Write a regular expression to describe the set of non-negative integers.

Solution: Here is a solution that fails. Why?

    integer → digit*

Well, for one, it does not include 0, which is a non-negative integer. So perhaps:

    integer → 0 | digit*

Why does it fail? Because digit* permits zero repetitions, and the empty string is not a legal member of the set! So how about:

    integer → 0 | digit digit*

This still fails, because an integer of more than one digit cannot start with a 0. So finally let's try:

    integer      → 0 | nonzerodigit digit*
    nonzerodigit → 1 | 2 | ... | 9

Language design requires atention to detail, even the smallest details.

Writing out a full set of choices among alphabetic or numeric characters is tedious and not even helpful to readers, who know these sequences well. So we often use square brackets, [], to denote a character class. [] give us a shorthand for indicating a choice from the class. With this shorthand, we can describe non-negative integers more concisely as:

    integer → 0 | [1-9][0-9]*

Quick Exercise: Try one more. Write a regular expression to describe the set of Java (or C, Ada, or Scheme) identifiers.

Solution: For Java, we might write...

    identifier → javaLetter (javaLetter | digit)*
    javaLetter → letter | _ | $

Note a few things about Java identifiers:

Note, too, the use of parentheses, (), for grouping in the rule that defines identifier. They are not part of the identifier.

Quick Exercise: Here is one you can try at home... Write a regular expression to describe the set of Java (or C, Ada, or Scheme) comments.

Regular expressions are useful to us in this course for a couple of reasons. First, they match the notational needs of lexical analysis quite well, empowering us write concise and clear rules to describe the identifiers and values in a language. Second, it is relatively easy to write simple, efficient programs that recognize regular expressions.

The regular expressions we write should be:

We alluded to the ambiguity problem above. Consider this grammar:

    token      → keyword | identifier | integer
    keyword    → if | ...
    identifier → (letter | _) (letter | digit | _)*

Is the string "if8" an identifier of length 3, or the keyword if followed by (the beginning of) an integer?

Some languages are defined such that the presence or absence of whitespace matters, which helps us to make this choice. Others resolve such a conflict by specifying a rule for how to match rules. One such rule is "first rule to match", which makes the order of the rules in the BNF description significant. A common solution is the "longest match" rule, which returns the token type that matches the longest sequence of characters. Using "first rule to match", if8 is the reserved word if followed by (the beginning of) an integer; using a "longest match" rule, it is an identifier.

Programmers don't generally have the language grammar with them all the time, so using the "first rule to match" approach can create problems. Many common languages use the "longest match".



Finite State Automata

Once we have a description of a language, we need a way to determine whether a given string is in the language or not. A program that does this task is called a recognizer. Recognizers for regular languages can be built in a straightforward way based on their simple structure. The technique we use is to convert a regular expression into a finite state automaton, or FSA. A finite state automaton is the description of a machine, which we will be able to implement directly in a program.

An FSA is a mathematical "machine" consisting of states and transitions among them. The machine begins execution in its start state. At each step, it reads a character from the string and follows a transition to a new state. If it reaches the end of its input string and is in one of the designated terminal states, then it has recognized a member of the language it models. Otherwise, if the machine comes to rest in any state other than a terminal state after consuming every token in a string, then the automaton fails, which indicates that the string is not a member of the language it recognizes.

A complete machine will have a transition for each letter in the language's alphabet coming out of every state:

complete FSA #0, with all branches

But for simplicity's sake, we do not usually create write complete transition diagrams. In that case, if the FSA reaches a state that has no transition out for the character just read, then it fails.

Here is an FSA that recognizes the language { "if" }:

simple FSA #1, for the keyword <TT>if</TT>

q0 is the start state, and q2 is the terminal state. Note that the state state is marked with a triangle, and the lone terminal state is marked with a double circle.

This is not a complete FSA, as it does not have transitions for (q0, f) or (q1, i). If the alphabet of this languages contains more characters than 'i' and 'f', then it is missing many transitions. If this FSA ever encounters a character in a state without a transition for it, the machine fails.

If every state in an FSA has exactly one transition out for any member of the alphabet, we say that it is a deterministic finite automaton (DFA). Otherwise, it is nondeterministic (NFA). Our FSA for recognizing the language { "if" } is deterministic; each state X character pair takes the machine into a unique state.

Here is an FSA that recognizes the regular language if[a-z]:

simple FSA #2, for strings that start with 'if'

This machine has two terminal states, q2 and q4. It is nondeterminsitic because it has two transitions out of state q1 on the symbol f: one to q2 and another to q3.



FSAs and Regular Languages

The language of an FSA is the set of strings that leave it in a terminal state. It turns out that the language accepted by any FSA can be written as a regular expression. We can create this regex by simulating the machine to produce a sequence of sub-expressions. Your theory course teaches a nice proof of this!

Consider this DFA:

a simple DFA to convert into a regex

We can walk down its states to see that the language recognized is described by the regular expression a(x{,x}).

Note that the parentheses in this expression are part of this language! So we can't use them for grouping purposes in the syntax of our regular expression. Instead, we have used the grouping shorthand {} to indicate 0 or more repetitions.

Quick Exercise: Now try it for yourself. Convert this DFA to a regular expression:

another DFA to convert into a regex

This one is tougher, because it has two loops -- and so two repetitions -- one nested within the other. But we ought to be able to derive the regex a{+{c*}b}. .

As you can see from just a little practice, the process of converting a deterministic FSA into a regular expression is straightforward. It should not surprise you that we can write simple programs to automate the process!

So, FSA ⊂ RL. More important for writing compilers, though, the relationship holds the other way, too: RL ⊂ FSA. That is, any regular expression can be recognized by an FSA. (Again, your theory course teaches a nice proof of this!) Together, these two relationships tell us that the set of regular languages is the same as the set of languages recognized by FSAs.

Why is it good for us that any regular expression can be recognized by a deterministic FSA? Because then we have a clear path to writing a scanner! We will explore this path next session.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... January 29. 2009