We have been talking about scanning, and today we talk about parsing. So you won't mind scanning and parsing a little Java for me.
int price = 75;
int discount = -25;
System.out.println( "Price is " + price + discount + " dollars" );
System.out.println( "Price is " + price - discount + " dollars" );
System.out.println( "Price is " + price + - + discount + " dollars" );
System.out.println( "Price is " + price + - - discount + " dollars" );
System.out.println( "Price is " + price+-+discount + " dollars" );
System.out.println( "Price is " + price+--discount + " dollars" );
If anything causes an exception, just write EXCEPTION and assume that the rest of the code executes.
Here is the output:
Price is 75-25 dollars
EXCEPTION
Price is 7525 dollars
Price is 75-25 dollars
Price is 7525 dollars
Price is 75-26 dollars
The first thing to recall is that Java's binary + is overloaded; it performs both numeric addition and string concatenation. When the left operand to + is a string, Java's parser chooses the string concatenation operator.
The unary operators + and - apply to the value of discount, which will be printed. From our discussion of scanning, you should understand why the + that follows price is treated as a binary operator and not unary. These operators are self-delimiting, so the scanner does not require that they be surrounded by white space. With a little thought, you should be able to figure out why the last line generates a different result than the fourth line... The scanner matches the longest token possible, and -- is a unary prefix Java operator!
What happens if we parenthesize the price op(s) discount expressions? Now some arithmetic is performed!
Feel free to play around a bit with the original program and the parenthesized program. Ironically, they grew out of a discussion started by a question in CS I the day before this session in a previous offering of this course. The timing was perfect then and perfect now, given that you are designing a programming language and will probably be thinking about leading minus signs and tokens in the near future. Of course, CS1 students probably don't appreciate the subtlety of the lexical issues as well you all do!
We turn now to syntactic analysis. Recall the context:
The scanner provides a public interface that allows clients to consume the next token in recognizing an abstract syntax expression and to peek at the next token without consuming it.
In our previous usint, we used regular languages as our theoretical model for the concrete syntax of a programming language. Regular languages are the subset of context-free languages that do not allow recursion; we can use only repetition when we need to describe repeated structures. That works great for the rules of concrete syntax, which has a regular and relatively simple structure.
But the abstract syntax of programming languages requires the full power of a context-free language. Recursion is essential. Consider even this incomplete grammar for arithmetic expressions:
expression := identifier
| expression + expression
We cannot use substitution to eliminate the recursion, because each substitution of expression in the 'if' arm creates a longer sentence -- with more occurrences of the symbol that is being replaced!
We also know that the set of regular languages is equivalent to the set of languages recognized by a finite state machine. But we can't create a finite state machine to recognize the expressions in our small arithmetic language, because a finite state machine cannot keep track of how many levels the recursive calls nest. We can create an FSM to recognize expressions nested to n levels for any particular n, but that machine would not be able to recognize a legal expression nested n+1 levels deep.
This is precisely the strength of the context-free languages, which are a superset of the regular languages. Thus, context-free grammars will serve as our declarative model for abstract syntax.
BNF notation is the standard notation for describing languages, and the set of languages describable using BNF is equivalent to the set of context-free languages.
Keep in mind what context-free means here. Each occurrence of expression in the '+' arm above can be replaced by the righthand side of any sentence for expression, regardless of what symbols come before or after expression in the containing expression.
Context-free grammars are capable of describing nearly every element of programming language syntax. There are certain kinds of constraints that are not expressible in a context-free way:
These constraints are sensitive to the context in which the constrained token appears.
In a context-sensitive language, the lefthand side of a rule can specify context information that restricts a substitution. For example:
S := sAt
| xAy
sA := Sa
| b
The LHS of the second rule requires that it be used to replace As only in conjunction with preceding s's.
Unfortunately, processing context-sensitive grammars is considerably more complex and less efficient than processing context-free grammars. These costs hardly seems worth the benefit, as they apply only to a few isolated elements of common programming language syntax.
So, we will use context-free grammars to model abstract syntax. One effect of this choice is that the sequence of tokens accepted by a parser will be a superset of the set of valid programs. That's okay -- we will use later phases of the compiler to enforce context-sensitive rules.
Just as we generalize from regular languages to context-free languages when moving from describing lexical information to describing syntactic information, we can generalize recognizing sentences in our languages from being finite-state machines to a set of state machines that can call one another -- and themselves. (That's recursion!) In order for this process to work, the state of the calling machine must be resumed when the called machine terminates. Thus, the state must be saved. Because the machine calls are nested, a stack does the trick.
So, just as regular expressions find their complement in finite state machines, context-free languages find their complement in pushdown automata. Because a pushdown automaton provides an arbitrarily deep stack, it is able to handle nested expressions of arbitrary depth.
Like FSMs, the topic of pushdown automata is understood so well that theory supports us in recognizing context-free languages, and thus in writing parsers. The same theory also supports the automatic generation of parsers.
We use BNF to write a context-free grammar for a language. For syntax analysis, the terminals of the grammar are tokens, and the non-terminals are the abstract components of programs.
We can view how a grammar defines the sentences in its language in several ways. A particularly useful one is as a means for constructing a sequence of BNF rule applications that leads to a valid sentence. This process is called derivation. Derivation is closely related to the substitution model used to evaluate expressions in functional languages such as Scheme.
Derivation treats the grammar's production rules as rules for rewriting equivalent expressions. Consider this more complete arithmetic expression grammar:
expression := identifier
| expression operator expression
| ( expression )
| - expression
operator := + | - | * | / | ^
We can use this grammar to derive the sentence -(identifier + identifier) as follows:
expression => -expression substitute using rule 4
=> -(expression) substitute using rule 3
=> -(expression operator expression) substitute using rule 2
=> -(identifier operator expression) substitute using rule 1
=> -(identifier + expression) substitute using rule 5a
=> -(identifier + identifier) substitute using rule 1
The theory underlying context-free grammars and derivation is understood well enough that we can do induction proofs to show that:
We will use this fact to drive our construction of parsers using derivation as a model.
At each point in a derivation, we face two choices:
In my derivation of -(identifier + identifier), I always chose the leftmost non-terminal as the symbol to replace. The result is called a left derivation. Similarly, I could have generated a right derivation by choosing the rightmost non-terminal to replace:
expression => -expression substitute using rule 4
=> -(expression) substitute using rule 3
=> -(expression operator expression) substitute using rule 2
=> -(expression operator identifier) substitute using rule 1
=> -(expression + identifier) substitute using rule 5a
=> -(identifier + identifier) substitute using rule 1
Does it matter which of these we use, or if we use some other?
We can use derivation to generate a tree that gives the structure of the derived sentence. This is called a parse tree. The parse tree for -(identifier + identifier) looks like this:
expression
/ \
- expression
/ | \
( expression )
/ | \
expression operator expression
| | |
identifier + identifier
Parse trees are useful as intermediate data structures in syntax analysis and also as the motivation for parsing algorithms.
Quick Exercise: Give a left derivation of identifier + identifier * identifier and show the corresponding parse tree.
Here is a solution:
expression
/ | \
expression operator expression
| | / | \
identifier + / | \
/ | \
expression operator expression
| | |
identifier * identifier
But wait... Here is another:
expression
/ | \
expression operator expression
/ | \ | |
/ | \ * identifier
/ | \
expression operator expression
| | |
identifier + identifier
Is this a problem?
Some grammars are ambiguous in the sense that the same sentence has two different but valid parse trees. Committing to a leftmost or rightmost derivation does not avoid such ambiguities, because our second choice -- of a substitution rule -- admits non-determinism.
Theoretically, this sort of ambiguity may not matter, but practically it matters a lot in the construction of parsers. These two parse trees indicate different meanings for the underlying program. The first tree gives the result we would expect from the common precedence rule that * binds more tightly than +. But the precedence rule isn't the reason that it gives the expected result; that was just an accident. Our parser must commit to an approach that generates the correct abstract syntax for all programs.
Often, we can eliminate ambiguity from a grammar by restructuring it a bit. Three common techniques are:
Eliminate left recursion
When we chose context-free grammars so that we could create recursive structures, we created a potential problem. Consider this grammar:
A := Aa | b
This grammar is left-recursive because A occurs as the first symbol in the RHS of a rule that defines A. If you think about these two rules for a minute, you'll realize that b is the "lead" symbol for recognizing the application of both arms of this production. That is, all possible instances of A must begin with b; otherwise the recursion will never terminate. This creates a problem for a parser: when it tries to decide which rule to apply when it sees a b, it faces the fact that both rules apply. Thus, any algorithm that uses this grammar to recognize this language must be nondeterministic, which is not what we want in any realistic compiler.
So we rule out left recursion. This restriction turns out not to be much of a problem in practice, because we can always rewrite a rule that uses left recursion in a form that doesn't use it. For example, in this case, we could simply use repetition:
A := ba*
In other cases, we can factor the commonality into a separate rule and rewrite the original rules. Recall our grammar for arithmetic expressions:
expression := identifier
| expression operator expression
| ( expression )
| - expression
operator := + | - | * | / | ^
Consider just the first two rules. We can convert those rules into:
expression := identifier expression-tail
expression-tail := operator expression
| Ø
The new grammar generates the same set of sentences as the original two rules, without left recursion. You may remember that the BNF for ac featured a similar construction.
Left factoring
A similar technique removes ambiguities by isolating them in a single rule. Consider this common programming language situation:
statement := if expression then statement else statement
| if expression then statement
Upon seeing the token if, a parser would not be able to tell immediately which rule applied. We can defer the decision by expanding statement to common form:
statement := if expression then statement statement-rest
statement-rest := else statement
| Ø
The common parts of the two rules become part of a single rule for statement, and the choice becomes part of a second rule.
But doesn't the Ø rule create a new form of nondeterminism? Yes, but we'll be able to eliminate that choice by peeking ahead to what can follow a statement. More on that soon!
Note that this technique is much like factoring duplication out of code and isolating it in a new function. It's just that here, we create a new production!
Create 'marker' non-terminals
Our two 'left' techniques create new non-terminals in order to eliminate an ambiguity. This approach can be adapted to create non-terminals that help a grammar to remember higher-level structures. We'll not look at this approach in detail this semester, but you can read about it in other texts, such as the Dragon book, pp. 174-175.