Syntactic Analysis - cs.southern.edu · Advantages of CFGs Precise, easy-to-understand syntactic...
Context-free Grammars
• The syntax of programming language constructs can be described by context-free grammars (CFGs)
• Relatively simple and widely used
• More powerful grammars exist
– Context-sensitive grammars (CSG)
– Type-0 grammars
Both are too complex and inefficient for general use
• Backus-Naur Form (BNF) and extended BNF (EBNF) are a convenient way to represent CFGs
Compiler Construction Syntactic Analysis 2
Advantages of CFGs
• Precise, easy-to-understand syntactic specification of a programming language
• Efficient parsers can be automatically generated for some classes of CFGs
• This automatic generation process can reveal ambiguities that might otherwise go undetected during language design
• A well-designed grammar makes translation to object code easier
• Language evolution is expedited by an existing grammatical language description
Compiler Construction Syntactic Analysis 3
Role of the Syntactic Analyzer
• Second phase of compilation
• Input to parser is the output of the lexer
• Output of parser is (usually) a parse tree
[Diagram: source code → lexer → parser; the parser repeatedly asks the lexer to "get next token" and receives a token; both phases consult the symbol table]
Compiler Construction Syntactic Analysis 4
Parsers
• Universal parsers
– Cocke-Younger-Kasami algorithm
– Earley’s algorithm
– Both are too inefficient for production compilers
• “Normal” parsers
– Work only on subclasses of CFGs
– Examples: LL, LR, LALR(1)
– Automated tools available for the popular subclasses
Compiler Construction Syntactic Analysis 5
Context-free Grammar
A context-free grammar (CFG) is a 4-tuple
〈VN, VT, s, P〉
• VN is a set of non-terminal symbols
• VT is a set of terminal symbols
• s is a distinguished element of VN called the start symbol
• P is a set of productions or rules that specify how legal strings are built
P ⊆ VN × (VN ∪ VT)∗
Compiler Construction Syntactic Analysis 6
CFG Elements
• Terminals: basic symbols from which strings are formed (typically correspond to the tokens from the lexer)
• Non-terminals: syntactic variables that denote sets of strings and, in particular, language constructs
• Start symbol: a non-terminal; the set of strings denoted by the start symbol is the language defined by the grammar
• Productions: the set of rules that define how terminals and non-terminals can be combined to form strings in the language
A → bXYz
Compiler Construction Syntactic Analysis 7
Example
Symbol table interpreter
G = 〈VN,VT ,s,P〉
VN = {S}
VT = {new, id, num, insert, lookup, quit}
s = S
P : S → new id num
      | insert id id num
      | lookup id id
      | quit
Compiler Construction Syntactic Analysis 8
Example
An arithmetic expression language
G = 〈VN,VT ,s,P〉
VN = {E}
VT = {id, +, ∗, (, ), −}
s = E
P : E → E + E
      | E ∗ E
      | (E)
      | −E
      | id
Compiler Construction Syntactic Analysis 9
Notational Conventions (1)
Dragon book, pages 166, 167
Terminals
• Lower-case letters early in the alphabet (a, b, etc.)
• Operator symbols (+, ∗, etc.)
• Punctuation symbols (parentheses, commas, etc.)
• Digits
• Boldface strings (id, if, etc.)
Compiler Construction Syntactic Analysis 10
Notational Conventions (2)
Non-terminals
• Upper-case letters early in the alphabet (A, B, etc.)
• The letter S, if used, is usually the start symbol
• Lower-case italics names (expr, stmt, etc.)
Compiler Construction Syntactic Analysis 11
Notational Conventions (3)
• Grammar symbols (either terminals or non-terminals)
– Upper-case letters late in the alphabet (X , Y , etc.)
• Strings of terminals
– Lower-case letters late in the alphabet (u, v, etc.)
• Strings of grammar symbols
– Lower-case Greek letters (α, β, etc.)
– Useful for representing generic productions
Compiler Construction Syntactic Analysis 12
Notational Conventions (4)
• Productions with the same left side can be “merged” into one production using the | symbol
A→ α1, A→ α2, . . . , A→ αk
becomes
A→ α1 | α2 | . . . | αk
• Unless otherwise indicated, the left side of the first listed production is the start symbol
Compiler Construction Syntactic Analysis 13
Example
A programming language construct
stmt → ;
     | if ( expr ) stmt else stmt
     | while ( expr ) stmt
     | blk
     | id = expr ;

blk → { stmt∗ }
Compiler Construction Syntactic Analysis 14
Derivations
• Rewrite rule approach
• A production is treated as a rewriting rule in which a non-terminal on the left side of the production is replaced by the grammar symbols on the right side of the production
• Begin with the start symbol and, through a sequence of derivations, produce any string in L(G)
Compiler Construction Syntactic Analysis 15
Derivation
Given the productions
A→ αBβ
B→ λ1λ2 . . .λn
we can derive
A ⇒ αBβ ⇒ αλ1λ2 . . .λnβ
Compiler Construction Syntactic Analysis 16
A Derivation
Given the productions
E → E + E | E ∗ E | (E) | −E | id

we can derive −(id + id):

E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)
Compiler Construction Syntactic Analysis 17
Derivations
Let α be a string of grammar symbols (terminals and non-terminals)

α ⇒∗ β means β is derived from α in zero or more steps:

1. α ⇒∗ α (Base case)
2. If α ⇒∗ γ and γ ⇒∗ β, then α ⇒∗ β (Inductive case)
Compiler Construction Syntactic Analysis 18
The Language of a Grammar
Given a grammar G, the language of G is L(G)
L(G) ⊆ VT∗

L(G) = {w ∈ VT∗ | S ⇒∗ w}
Compiler Construction Syntactic Analysis 19
Sentential Forms
• Leftmost derivation
  – The leftmost non-terminal is replaced at each step
  – A rightmost derivation replaces the rightmost non-terminal at each step
• Sentential form
  A string of grammar symbols that may be obtained from a valid derivation
• Leftmost sentential form
  A string of grammar symbols that may be obtained from a valid leftmost derivation
Compiler Construction Syntactic Analysis 20
Regular Languages and CFLs
• All regular languages are context-free
• Consider the regular expression
a∗b∗
Let G = 〈{A, B}, {a, b}, A, {A → aA | B, B → bB | ε}〉
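This grammar can be checked directly in code — a minimal sketch of ours (the class and method names are assumptions, not from the slides), with one recursive method per non-terminal of A → aA | B, B → bB | ε:

```java
// Recognizer for the grammar A -> aA | B, B -> bB | epsilon,
// which generates the regular language a*b*.
public class AStarBStar {
    private final String input;
    private int pos = 0;

    public AStarBStar(String input) { this.input = input; }

    // A -> aA | B
    private boolean A() {
        if (pos < input.length() && input.charAt(pos) == 'a') {
            pos++;          // consume 'a'
            return A();
        }
        return B();         // fall through to B
    }

    // B -> bB | epsilon
    private boolean B() {
        if (pos < input.length() && input.charAt(pos) == 'b') {
            pos++;          // consume 'b'
            return B();
        }
        return true;        // epsilon: always succeeds
    }

    // The string is accepted if A succeeds and consumes all input
    public boolean accepts() {
        return A() && pos == input.length();
    }
}
```

For example, "aabbb" is accepted while "aba" is rejected.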
Compiler Construction Syntactic Analysis 21
Producing a Grammar from a Regular Language
1. Construct an NFA from the regular expression
2. Each state in the NFA corresponds to a non-terminal symbol
3. For a transition from state A to state B given input symbol x, add a production of the form
A→ xB
4. If A is a final state, add the production
A→ ε
Compiler Construction Syntactic Analysis 22
Parse Trees
• A graphical representation of a sequence of derivations
• Each interior node is a non-terminal and its children are the right side of one of the non-terminal’s productions

[Parse tree for id + id ∗ id: the root E has children E, +, E; the right-hand E has children E, ∗, E; the remaining E nodes each derive id]
Compiler Construction Syntactic Analysis 23
Parse Trees
• If you read the leaves of the tree from left to right they form a sentential form
  – Also called the “yield” or “frontier” of the parse tree
• All the leaves need not be terminals; the parse tree may be incomplete
• Valid sentential forms can contain non-terminals

[The same parse tree for id + id ∗ id as on the previous slide]
Compiler Construction Syntactic Analysis 24
Ambiguity
Given the productions
E → E +E | E ∗E | (E) | id
Derive id + id ∗ id:

E ⇒ E + E ⇒ id + E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

or

E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id
Compiler Construction Syntactic Analysis 25
Ambiguity and Parse Trees
A grammar G is ambiguous if a string in L(G) can have more than one parse tree

[Two parse trees for id + id ∗ id: one groups the string as id + (id ∗ id), the other as (id + id) ∗ id]
Compiler Construction Syntactic Analysis 26
Consequences of Ambiguity
• Ambiguity is generally bad
• Often means there is more than one way to interpret a string
Add before multiply or multiply before add?
• An ambiguous grammar should be rewritten to remove the ambiguity
Compiler Construction Syntactic Analysis 27
Removing the Ambiguity
Consider the rewritten productions
E → T | E + T
T → F | T ∗ F
F → (E) | id

[Parse tree for id + id ∗ id under this grammar: E derives E + T; the left E derives T → F → id, and the right T derives T ∗ F with each factor deriving id]

Here only one parse tree is possible
Compiler Construction Syntactic Analysis 28
Disambiguating Rules
Can we provide rules for disambiguating

id + (id ∗ id)

from

(id + id) ∗ id ?
Compiler Construction Syntactic Analysis 29
Top-down Parsing
• Recursive descent is an example
• Grows the parse tree from the root down to the leaves
• Useful for recognizing flow-of-control constructs since they are always labeled with a keyword (e.g., if, while, do, for)
• Requires each production for the same non-terminal to begin with a unique token
Compiler Construction Syntactic Analysis 30
Left factoring
Can be used to factor out a common prefix in two or more productions

For example, to parse if...then vs. if...then...else

C → if E then S else S
  | if E then S

Left factor the grammar (factor out the common left expression):

C → if E then S X
X → else S | ε
Compiler Construction Syntactic Analysis 31
Top-down Parsing
Two requirements
• Left-factor the grammar
Produce a grammar in which no two productions for the same non-terminal share a common prefix
• No left recursion
A ⇒+ Aα
Parser could get into an infinite loop
Compiler Construction Syntactic Analysis 32
Top-down Parsing
Top-down parsing produces a sequence of left-most derivations
A → Bx | Cy
B → z
C → w
Produces two strings: zx and wy
Compiler Construction Syntactic Analysis 33
Top-down Parsers
Two common approaches are used in top-down parsing
• Recursive descent parser
– Recursive
– The structure of the grammar is hard-coded into the parsing program
• Table-driven parser
– Non-recursive
– The structure of the language is encoded in a parse table
Compiler Construction Syntactic Analysis 34
Recursive Descent
• Relatively easy to implement
• Reads the input stream (from the scanner) left to right and verifies its correctness
• Perl has a recursive descent parser (Parse::RecDescent)
• “Recursive,” since parsing is accomplished via recursive procedures
• “Descent,” since parsing is top-down (descends from the root down the branches to the leaves)
Compiler Construction Syntactic Analysis 35
Recursive Descent

Each non-terminal is a subroutine call
A → Bx | Cy
B → z
C → w

[Call diagram: parsing A invokes B (which matches z) followed by x, or C (which matches w) followed by y; the numbers give the order of the calls and matches]
Compiler Construction Syntactic Analysis 36
Recursive Descent
• A candidate grammar:
E → T | E + T
T → F | T ∗ F
F → (E) | d
Bad because of left recursion
• The grammar can be modified to support a recursive descent parser:
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d
Compiler Construction Syntactic Analysis 37
Generalized Parser
import java.util.Scanner;

public abstract class RecursiveDescent {
    private String input;
    protected int cursor = 0;

    public RecursiveDescent() {
        getInputString();
        if ( parse() && cursor == input.length() ) {
            System.out.println("Accept");
        } else {
            error();
        }
    }

    protected final boolean checkNextToken(char ch) {
        // Ignore whitespace
        while ( cursor < input.length() &&
                (input.charAt(cursor) == ' ' || input.charAt(cursor) == '\t') ) {
            cursor++;
        }
        return (cursor < input.length())
            ? input.charAt(cursor++) == ch
            : false;
    }

    protected static void error() {
        System.out.println("Invalid string");
        System.exit(1);
    }

    protected final void getInputString() {
        // The original slides used a course-provided Console.In.getString()
        input = new Scanner(System.in).nextLine();
    }

    public abstract boolean parse();
}
Compiler Construction Syntactic Analysis 38
Subclass for Given Grammar (1)
public class Expression extends RecursiveDescent {
    /** Original Grammar:
     *    E -> T | E + T
     *    T -> F | T * F
     *    F -> ( E ) | d
     *
     *  Adapted Grammar:
     *    E  -> T E'
     *    E' -> + T E' | e
     *    T  -> F T'
     *    T' -> * F T' | e
     *    F  -> ( E ) | d
     *
     *  Note method names: E1() => E' and T1() => T'
     */
    public boolean parse() {
        return E();
    }

    public static void main(String[] args) {
        new Expression();
    }

    // Continued . . .
Compiler Construction Syntactic Analysis 39
Subclass for Given Grammar (2)
    private boolean E() {
        int pos = cursor;
        // E -> T E'
        if ( T() && E1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
E→ T E ′
Compiler Construction Syntactic Analysis 40
Subclass for Given Grammar (3)
    private boolean E1() {
        int pos = cursor;
        // E' -> + T E'
        if ( checkNextToken('+') && T() && E1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        // E' -> e
        return true;
    }
E ′→+T E ′ | ε
Compiler Construction Syntactic Analysis 41
Subclass for Given Grammar (4)
    private boolean T() {
        int pos = cursor;
        // T -> F T'
        if ( F() && T1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
T → FT ′
Compiler Construction Syntactic Analysis 42
Subclass for Given Grammar (5)
    private boolean T1() {
        int pos = cursor;
        // T' -> * F T'
        if ( checkNextToken('*') && F() && T1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        // T' -> e
        return true;
    }
T ′→∗FT ′ | ε
Compiler Construction Syntactic Analysis 43
Subclass for Given Grammar (6)
    private boolean F() {
        int pos = cursor;
        // F -> ( E )
        if ( checkNextToken('(') && E() && checkNextToken(')') ) {
            return true;
        }
        cursor = pos; // Backtrack
        // F -> d
        if ( checkNextToken('d') ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
}
F → (E) | d
Compiler Construction Syntactic Analysis 44
Backtracking
• The example recursive descent parser used backtracking
• Recursive descent parsing is criticized as being inefficient due to backtracking
• Some grammars can be written so that no backtracking is required
– The right side of the production starts with a terminal, so you know immediately which production to apply
– A top-down parser that requires no backtracking is called a predictive parser
Compiler Construction Syntactic Analysis 45
The Bad News
Some grammars cannot be processed with a top-down parser
We need to determine the characteristics required to make a top-down parser feasible
Compiler Construction Syntactic Analysis 46
Preprocessing Needed
FIRST(α) is the set of terminals that begin strings derived from α
A → Bx | Cy
B → z
C → w

FIRST(B) = {z}
FIRST(C) = {w}
FIRST(A) = {z, w}
Compiler Construction Syntactic Analysis 47
One Criterion
Given a production of the form
A → α | β
if FIRST(α) ∩ FIRST(β) ≠ ∅, then a top-down parser cannot be used
Compiler Construction Syntactic Analysis 48
ε Productions
• ε productions complicate the situation
• FOLLOW(A) is the set of terminals that can appear immediately to theright of A in some sentential form
A → Bx | Cy
B → z | ε
C → w

FIRST(B) = {z, ε}
FIRST(C) = {w}
FIRST(A) = {z, x, w}

FOLLOW(B) = {x}
FOLLOW(C) = {y}
FOLLOW(A) = {$} (end of input)
Compiler Construction Syntactic Analysis 49
FOLLOW
• Without any ε productions, FIRST would be sufficient
• Formally: If X ∈ VN ∪ VT, then

FIRST(X) = {X}, if X ∈ VT
FIRST(X) = {a | a ∈ VT and X ⇒∗ aβ}, otherwise

If A ∈ VN, then

FOLLOW(A) = {a | a ∈ VT and S ⇒∗ αAaβ}

(where S is the start symbol)
• How do we compute FIRST and FOLLOW?
Compiler Construction Syntactic Analysis 50
FIRST Computation

SetOfTerminalSymbols FIRST(GrammarSymbol X) {
    if ( X is a terminal ) F ← {X};                    FIRST(X) is just X
    else {
        F ← ∅;
        if ( X → ε is a production ) F ← F ∪ {ε};      Add ε to FIRST(X)
        if ( X → y1 y2 ... yn is a production ) {
            if ( ∃ i such that ε ∈ FIRST(y1), ε ∈ FIRST(y2), ..., ε ∈ FIRST(yi−1),
                 and a ∈ FIRST(yi) )
                F ← F ∪ {a};
            if ( ε ∈ FIRST(y1), ε ∈ FIRST(y2), ..., ε ∈ FIRST(yn) )
                F ← F ∪ {ε};                           Add ε to FIRST(X)
        }
    }
    return F;
}
Compiler Construction Syntactic Analysis 51
FIRST
In a nutshell:
• If A does not derive ε, then

FIRST(A) = {a ∈ VT | A ⇒∗ aβ}

• Else, if A ⇒∗ ε, then

FIRST(A) = {a ∈ VT | A ⇒∗ aβ} ∪ {ε}
Compiler Construction Syntactic Analysis 52
FOLLOW Computation
SetOfTerminalSymbols FOLLOW(NonTerminalSymbol A) {
    F ← ∅;
    if ( A is the start symbol )
        F ← F ∪ {$};
    if ( B → αAβ is a production )                     α can be ε
        F ← F ∪ (FIRST(β) − {ε});
    if ( C → αA or (C → αAγ and ε ∈ FIRST(γ)) )
        F ← F ∪ FOLLOW(C);
    return F;
}
Compiler Construction Syntactic Analysis 53
FOLLOW
In a nutshell:
• If there is no derivation S ⇒+ αA, then

FOLLOW(A) = {a ∈ VT | S ⇒+ αAaβ}

• Else, if S ⇒+ αA, then

FOLLOW(A) = {a ∈ VT | S ⇒+ αAaβ} ∪ {$}
Compiler Construction Syntactic Analysis 54
FIRST and FOLLOW Example
Compute the FIRST and FOLLOW sets for the grammar from which our recursive descent parser was built:

E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d
Compiler Construction Syntactic Analysis 55
FIRST and FOLLOW Example
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d

The solution:

FIRST(+) = {+}      FIRST(E) = {(, d}       FOLLOW(E) = {$, )}
FIRST(∗) = {∗}      FIRST(E′) = {ε, +}      FOLLOW(E′) = {$, )}
FIRST(d) = {d}      FIRST(T) = {(, d}       FOLLOW(T) = {+, ), $}
FIRST(() = {(}      FIRST(T′) = {ε, ∗}      FOLLOW(T′) = {+, ), $}
FIRST()) = {)}      FIRST(F) = {(, d}       FOLLOW(F) = {∗, +, ), $}
Compiler Construction Syntactic Analysis 56
LL(1) Grammar
• Scanning Left-to-right
• Leftmost derivation
• 1 symbol lookahead
LL(2), . . . , LL(k) means 2, . . . , k lookahead symbols
Most parsers have just one symbol of lookahead
Compiler Construction Syntactic Analysis 57
LL(1) Grammar
Formally, a grammar is LL(1) if and only if whenever A → α | β:

1. FIRST(α) ∩ FIRST(β) = ∅
2. At most one of α or β can derive ε
3. If β ⇒∗ ε, then α does not derive any string that starts with a terminal in FOLLOW(A)

All LL(1) grammars can be parsed by a recursive descent parser, and a recursive descent parser without backtracking (a predictive parser) can parse only LL(1) grammars
Compiler Construction Syntactic Analysis 58
Common Prefixes
Recall the common prefix example:

C → if E then S else S
  | if E then S

FIRST(if E then S else S) = {if}
FIRST(if E then S) = {if}

Thus the grammar is not LL(1). Left factoring removes the common prefix (though the dangling else still leaves the grammar ambiguous):

C → if E then S X
X → else S | ε
Compiler Construction Syntactic Analysis 59
Left Recursion
Consider the grammar:
E → E + d | d

FIRST(E + d) = {d}
FIRST(d) = {d}
Thus the grammar is not LL(1)
A recursive descent parser would succumb to infinite recursion
Compiler Construction Syntactic Analysis 60
Parse Table from FIRST, FOLLOW
• If A → α and b ∈ FIRST(α), then parsetable[A][b] = A → α
• If A → α and ε ∈ FIRST(α), then for each b ∈ FOLLOW(A), parsetable[A][b] = A → α
• If more than one production lands in the same entry, then the grammar is not LL(1); for any two productions A → α and A → β we need FIRST(α) ∩ FIRST(β) = ∅
Compiler Construction Syntactic Analysis 61
Parse Table for Example Grammar
Build an LL(1) parse table for our sample grammar:
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d

FIRST and FOLLOW sets:

FIRST(+) = {+}      FIRST(E) = {(, d}       FOLLOW(E) = {$, )}
FIRST(∗) = {∗}      FIRST(E′) = {ε, +}      FOLLOW(E′) = {$, )}
FIRST(d) = {d}      FIRST(T) = {(, d}       FOLLOW(T) = {+, ), $}
FIRST(() = {(}      FIRST(T′) = {ε, ∗}      FOLLOW(T′) = {+, ), $}
FIRST()) = {)}      FIRST(F) = {(, d}       FOLLOW(F) = {∗, +, ), $}
Compiler Construction Syntactic Analysis 62
Parse Table for Example Grammar
The solution:
Top of   Input Symbol
Stack    d          +            ∗            (          )         $
E        E → TE′                              E → TE′
E′                  E′ → +TE′                            E′ → ε    E′ → ε
T        T → FT′                              T → FT′
T′                  T′ → ε       T′ → ∗FT′               T′ → ε    T′ → ε
F        F → d                                F → (E)
Compiler Construction Syntactic Analysis 63
LL(1) Table-driven Parser
[Diagram: input a1 a2 a3 ... an $ read by the LL parser, which maintains a stack (with $ at the bottom), consults the parse table, and produces output]
Compiler Construction Syntactic Analysis 64
LL(1) Parsing Algorithm

LL Parser() {
    stack.push(S);                        Push start symbol onto empty stack
    a ← scanner.getNextToken();           Get next token
    while ( not stack.empty() ) {
        X ← stack.top();                  Look at top of stack
        if ( X is a non-terminal and parsetable[X][a] = X → y1...yk ) {
            stack.pop();                  Pop off top item
            stack.push(yk ... y1);        Push right-side symbols on in reverse order
        } else if ( X = a ) {
            stack.pop();                  Pop off top item
            a ← scanner.getNextToken();   Get next token
        } else
            Error();                      Illegal string
    }
}
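This algorithm can be sketched concretely in Java with the parse table for our grammar hard-coded (a sketch of ours; the names and the string-keyed table layout are assumptions, not from the slides):

```java
import java.util.*;

// Table-driven LL(1) parser for E -> T E', E' -> + T E' | eps,
// T -> F T', T' -> * F T' | eps, F -> ( E ) | d
public class LL1Parser {
    // parsetable[X][a], keyed "X,a"; the value is the right side to push
    // (an empty array encodes the eps production)
    static final Map<String, String[]> TABLE = new HashMap<>();
    static {
        TABLE.put("E,d",  new String[]{"T", "E'"});
        TABLE.put("E,(",  new String[]{"T", "E'"});
        TABLE.put("E',+", new String[]{"+", "T", "E'"});
        TABLE.put("E',)", new String[]{});
        TABLE.put("E',$", new String[]{});
        TABLE.put("T,d",  new String[]{"F", "T'"});
        TABLE.put("T,(",  new String[]{"F", "T'"});
        TABLE.put("T',+", new String[]{});
        TABLE.put("T',*", new String[]{"*", "F", "T'"});
        TABLE.put("T',)", new String[]{});
        TABLE.put("T',$", new String[]{});
        TABLE.put("F,d",  new String[]{"d"});
        TABLE.put("F,(",  new String[]{"(", "E", ")"});
    }
    static final Set<String> NT = Set.of("E", "E'", "T", "T'", "F");

    public static boolean parse(String w) {
        String input = w + "$";
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                        // start symbol
        int i = 0;
        while (!stack.isEmpty()) {
            String x = stack.peek();
            String a = String.valueOf(input.charAt(i));
            if (NT.contains(x)) {
                String[] rhs = TABLE.get(x + "," + a);
                if (rhs == null) return false;  // blank table entry: error
                stack.pop();
                for (int j = rhs.length - 1; j >= 0; j--)
                    stack.push(rhs[j]);         // push right side in reverse
            } else if (x.equals(a)) {
                stack.pop();                    // match terminal (or the final $)
                i++;
            } else {
                return false;                   // mismatch: error
            }
        }
        return i == input.length();
    }
}
```

parse("d+d*d") and parse("(d+d)*d") accept, while parse("d+") hits a blank entry and is rejected.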
Compiler Construction Syntactic Analysis 65
Parsing Example
Stack        Input          Rule
$ E          d + d * d $    E → T E′
$ E′T        d + d * d $    T → F T′
$ E′T′F      d + d * d $    F → d
$ E′T′d      d + d * d $
$ E′T′       + d * d $      T′ → ε
$ E′         + d * d $      E′ → +T E′
$ E′T+       + d * d $
$ E′T        d * d $        T → F T′
$ E′T′F      d * d $        F → d
$ E′T′d      d * d $
$ E′T′       * d $          T′ → ∗F T′
$ E′T′F*     * d $
$ E′T′F      d $            F → d
$ E′T′d      d $
$ E′T′       $              T′ → ε
$ E′         $              E′ → ε
$            $              Accept
Compiler Construction Syntactic Analysis 66
Another Parsing Example
Stack          Input            Rule
$ E            (d + d) * d $    E → T E′
$ E′T          (d + d) * d $    T → F T′
$ E′T′F        (d + d) * d $    F → (E)
$ E′T′)E(      (d + d) * d $
$ E′T′)E       d + d) * d $     E → T E′
$ E′T′)E′T     d + d) * d $     T → F T′
$ E′T′)E′T′F   d + d) * d $     F → d
$ E′T′)E′T′d   d + d) * d $
$ E′T′)E′T′    + d) * d $       T′ → ε
$ E′T′)E′      + d) * d $       E′ → +T E′
$ E′T′)E′T+    + d) * d $
$ E′T′)E′T     d) * d $         T → F T′
$ E′T′)E′T′F   d) * d $         F → d
$ E′T′)E′T′d   d) * d $
$ E′T′)E′T′    ) * d $          T′ → ε
$ E′T′)E′      ) * d $          E′ → ε
$ E′T′)        ) * d $
$ E′T′         * d $            T′ → ∗F T′
$ E′T′F*       * d $
$ E′T′F        d $              F → d
$ E′T′d        d $
$ E′T′         $                T′ → ε
$ E′           $                E′ → ε
$              $                Accept
Compiler Construction Syntactic Analysis 67
Try a Non-LL(1) Grammar
E→ E + id | id
Observe FIRST(E + id) = FIRST(id) = {id}
Recursive descent parser: infinite recursion
Parse table:

Top of   Input Symbol
Stack    id               $
E        E → id
         E → E + id

(two productions land in the same entry — a conflict)
Compiler Construction Syntactic Analysis 68
Top-down Parsing Summary
To produce a top-down parser:
1. Eliminate left recursion and common prefixes; this yields an LL(1) grammar
2. Find the FIRST and FOLLOW sets
3. Build either the recursive descent parser methods or the parsing table
Compiler Construction Syntactic Analysis 69
Limitations of LL(1) Grammars
• In many cases a grammar G1 can be easily devised to represent strings in a language L(G1), but G1 is not LL(1)
• Sometimes G1 can be rewritten to form G2, where L(G1) = L(G2) and G2 is LL(1)
• Some context-free languages have no LL(1) grammars
Compiler Construction Syntactic Analysis 70
Bottom-up Parsing
• Grows parse tree from the leaves up
• Only two choices when scanning input
– shift a symbol onto the stack
– reduce
• Parser reduces in the reverse order of a rightmost derivation
• Bottom-up parsers are more powerful than top-down parsers
They can be used to parse a larger variety of grammars
Compiler Construction Syntactic Analysis 71
Reduction
E→ E +E | E ∗E | (E) | −E | id
E ⇒ E + E ⇒ E + E ∗ E ⇒ E + E ∗ id ⇒ E + id ∗ id ⇒ id + id ∗ id

The parser performs a rightmost derivation in reverse
Compiler Construction Syntactic Analysis 72
Handles
• A handle of a string
– is a substring
– that matches the right side of a production
– whose reduction to the non-terminal on the left side represents one step along the reverse of a rightmost derivation

• For unambiguous grammars, every right-sentential form has a unique handle
Compiler Construction Syntactic Analysis 73
Handle—More Formally
• A handle of a right-sentential form γ is a production A → β and a position in γ where β can be found
• If (A → β, k) is a handle, then replacing β in γ at position k with A produces the previous right-sentential form in a rightmost derivation of γ

The substring to the right of a handle contains only terminal symbols
Compiler Construction Syntactic Analysis 74
Handle Pruning
• Begin with string to parse
• Find a handle and replace it with the left side of a production that produces that handle
• Repeat until only the start symbol remains
Compiler Construction Syntactic Analysis 75
Handle Pruning Example
E → E + T | T
T → T ∗ F | F
F → d

Sentential Form    Handle
d + d ∗ d          (F → d, 1)
F + d ∗ d          (T → F, 1)
T + d ∗ d          (E → T, 1)
E + d ∗ d          (F → d, 3)
E + F ∗ d          (T → F, 3)
E + T ∗ d          (F → d, 5)
E + T ∗ F          (T → T ∗ F, 3)
E + T              (E → E + T, 1)
E                  –

Observe that this is a rightmost derivation in reverse
Compiler Construction Syntactic Analysis 76
Shift-Reduce Parsing
Two problems to solve
• Find substring to be reduced in a right-sentential form
• Determine which production to choose in case more than one production has that substring on its right side
Compiler Construction Syntactic Analysis 77
Overview of Process
• Stack contains states and grammar symbols
• Grammar symbols on the stack represent a viable prefix

[Diagram: input a1 a2 a3 ... an $ feeding an LR parser that maintains a stack and consults a parse table with Action and Goto parts]
Compiler Construction Syntactic Analysis 78
Parse Table
• Action
– shift
– reduce
• Goto
– Next state
Compiler Construction Syntactic Analysis 79
Parse Table Actions

• Shift
  – Pushes the input symbol and a state onto the stack
• Reduce
  – Replaces a string of symbols on the stack with a non-terminal
  – Symbols on the stack can be either terminals or non-terminals
Compiler Construction Syntactic Analysis 80
Shift-Reduce Parsing
• Stack holds grammar symbols
– $ indicates bottom of stack
• Input buffer for string to be parsed
– $ indicates end of string
• Parser activity
– shifts zero or more input symbols onto the stack until a handle β is on the top of the stack
– β is then reduced to the left side of a production
Compiler Construction Syntactic Analysis 81
Shift-Reduce Parsing
• Initial parser state
– Stack: $ Input: w$
(Stack grows to the right; string is consumed from left to right)
• Final parser state (if no errors)
– Stack: $S Input: $
• Parser actions
– Shift next input symbol to top of stack
– Reduce handle on top of stack to non-terminal
– Accept when string consumed and S on stack
– Error when string cannot be parsed
Compiler Construction Syntactic Analysis 82
Viable Prefix
Prefix of a right sentential form that can appear on the stack of a shift-reduce parser
Compiler Construction Syntactic Analysis 83
Types of Bottom-up Parsers
• SLR
– “Simple LR”
– Built from LR(0) items; no lookahead in the items
• LR
– LR(1), more powerful, but requires a lot of memory
• LALR
– Look-ahead LR
– Yacc is LALR(1)
Compiler Construction Syntactic Analysis 84
SLR
• We’ll concentrate on SLR since it is the simplest form
• To construct an SLR parse table we need items
• An item consists of a production and a numeric position within that production
– An item encodes where you are in a production
Compiler Construction Syntactic Analysis 85
Expression Grammar
E→ E +E | E ∗E | (E) | id
compare to
E → E + T | T
T → T ∗ F | F
F → (E) | id
Compiler Construction Syntactic Analysis 86
Canonical LR(0) States
1. Augment the grammar by adding a new production
S ′→ S
2. The closure operation sets up the states
3. The goto operation computes transitions between states
Compiler Construction Syntactic Analysis 87
LR(0) Items
An LR(0) item of a grammar G is a production of G with a dot (·) at someposition of the right side.
Example: Four items can be derived from production A→ XYZ
A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·
Compiler Construction Syntactic Analysis 88
Interpreting LR(0) Items
• An item indicates how much of a production we have seen at a given point in the parsing process
• The item

[A → X·YZ]

means we have seen a string derivable from X and hope to see a string derivable from YZ
Compiler Construction Syntactic Analysis 89
Closure Algorithm
ItemSet closure(ItemSet I) {
    J ← I;
    do {
        Jold ← J;
        for each item [A → α·Bβ] ∈ J and each production B → γ ∈ G do {
            J ← J ∪ {B → ·γ};
        }
    } while ( J ≠ Jold );
    return J;
}
• B is a non-terminal
• If one B-production is added to the closure with a dot at the left end, then all B-productions will be added to the closure
Compiler Construction Syntactic Analysis 90
Closure
closure([E → E + ·T]) =

E → E + ·T
T → ·T ∗ F
T → ·F
F → ·(E)
F → ·id
Compiler Construction Syntactic Analysis 91
goto Function
goto(I,X)
• I is a set of items (really just a state)
• X is a grammar symbol
• goto(I, X) is defined as the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I
• Intuitively, if I is the set of items valid for a viable prefix γ, then goto(I, X) is the set of items valid for the viable prefix γX
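closure and goto are easy to implement once items are a value type. A compact Java sketch (our own illustration; the Item record and the grammar encoding are assumptions, not from the slides):

```java
import java.util.*;

// closure and goto over LR(0) items for the grammar
//   E -> E + T | T,  T -> T * F | F,  F -> ( E ) | id
public class LR0 {
    // An item A -> alpha . beta is (lhs, rhs, dot position)
    record Item(String lhs, List<String> rhs, int dot) {
        String afterDot() { return dot < rhs.size() ? rhs.get(dot) : null; }
    }

    static final Map<String, List<List<String>>> G = Map.of(
        "E", List.of(List.of("E", "+", "T"), List.of("T")),
        "T", List.of(List.of("T", "*", "F"), List.of("F")),
        "F", List.of(List.of("(", "E", ")"), List.of("id")));

    static Set<Item> closure(Set<Item> seed) {
        Set<Item> result = new LinkedHashSet<>(seed);
        Deque<Item> work = new ArrayDeque<>(seed);
        while (!work.isEmpty()) {
            String b = work.pop().afterDot();
            if (b != null && G.containsKey(b))          // dot sits before non-terminal B
                for (List<String> rhs : G.get(b)) {
                    Item fresh = new Item(b, rhs, 0);   // add B -> . gamma
                    if (result.add(fresh)) work.push(fresh);
                }
        }
        return result;
    }

    static Set<Item> goTo(Set<Item> items, String x) {
        Set<Item> moved = new LinkedHashSet<>();
        for (Item it : items)
            if (x.equals(it.afterDot()))                // advance the dot over X
                moved.add(new Item(it.lhs(), it.rhs(), it.dot() + 1));
        return closure(moved);
    }
}
```

closure of {[E → E + ·T]} yields exactly the five items listed on the closure slide, and goTo of that set on T gives the two-item state {[E → E + T·], [T → T·∗F]}.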
Compiler Construction Syntactic Analysis 92
LR(0) Item Sets
[Diagram: the canonical collection of LR(0) item sets I0–I11 for the augmented expression grammar, with goto transitions on the grammar symbols E, T, F, d, (, ), +, and ∗; the shift and goto entries of the SLR parse table on a later slide follow directly from these transitions]
Set-of-Items Construction
SetOfItems items(Grammar G′) {
    C ← { closure({[S′ → ·S]}) };
    do {
        Cold ← C;
        for each set of items I ∈ C and each grammar symbol X
                such that goto(I, X) is not empty do {
            C ← C ∪ { goto(I, X) };
        }
    } while ( C ≠ Cold );
    return C;
}
• G′ is the augmented grammar
Compiler Construction Syntactic Analysis 94
SLR Parse Table Construction
BuildSLRParser(Grammar G′) {
    Initialize all entries in the action and goto tables to “error”;
    C ← items(G′);                                     C = {I0, I1, ..., In}
    for each item set Ii ∈ C do {
        if ( [A → α·aβ] ∈ Ii and goto(Ii, a) = Ij )    a is a terminal
            action[i][a] ← “shift j”;
        if ( [A → α·] ∈ Ii and A ≠ S′ )
            for all a ∈ FOLLOW(A) do
                action[i][a] ← “reduce A → α”;
        if ( [S′ → S·] ∈ Ii )
            action[i][$] ← “accept”;
        for each non-terminal A of G′ do
            if ( goto(Ii, A) = Ij )
                goto[i][A] ← j;
    }
    The initial state of the parser is the i such that [S′ → ·S] ∈ Ii;
}
• G′ is the augmented grammar
Compiler Construction Syntactic Analysis 95
SLR Parsing Example
FOLLOW(E) = {$,+,)}
FOLLOW(T ) = {$,+,∗,)}
FOLLOW(F) = {$,+,∗,)}
Compiler Construction Syntactic Analysis 96
SLR Parse Table
           Action                                            Goto

State 0:   d: shift 5    (: shift 4                          E → 1, T → 2, F → 3
State 1:   +: shift 8    $: accept
State 2:   +, ), $: reduce E → T    ∗: shift 9
State 3:   +, ∗, ), $: reduce T → F
State 4:   d: shift 5    (: shift 4                          E → 6, T → 2, F → 3
State 5:   +, ∗, ), $: reduce F → d
State 6:   +: shift 8    ): shift 7
State 7:   +, ∗, ), $: reduce F → (E)
State 8:   d: shift 5    (: shift 4                          T → 11, F → 3
State 9:   d: shift 5    (: shift 4                          F → 10
State 10:  +, ∗, ), $: reduce T → T ∗ F
State 11:  +, ), $: reduce E → E + T    ∗: shift 9

All other entries are “error”
Compiler Construction Syntactic Analysis 97
LR Parsing Algorithm
LR Parser() {
    stack.push(S0);                        Push initial state onto empty stack
    done ← false;
    a ← scanner.getNextToken();            Get next token
    while ( not done ) {
        s ← stack.top();                   Look at state on top of stack
        if ( action[s][a] = shift s′ ) {
            stack.push(a);
            stack.push(s′);
            a ← scanner.getNextToken();
        } else if ( action[s][a] = reduce A → β ) {
            stack.pop 2×|β| symbols;       Pop off the handle and its states
            s′ ← stack.top();
            stack.push(A);
            stack.push(goto[s′][A]);
        } else if ( action[s][a] = accept ) {
            done ← true;
        } else {
            Error();                       Illegal string
        }
    }
}
Compiler Construction Syntactic Analysis 98
Parsing Example
Stack            Input            Rule
$ S0             (d + d) * d $    Shift 4
$ S0(4           d + d) * d $     Shift 5
$ S0(4d5         + d) * d $       Reduce F → d
$ S0(4F3         + d) * d $       Reduce T → F
$ S0(4T2         + d) * d $       Reduce E → T
$ S0(4E6         + d) * d $       Shift 8
$ S0(4E6+8       d) * d $         Shift 5
$ S0(4E6+8d5     ) * d $          Reduce F → d
$ S0(4E6+8F3     ) * d $          Reduce T → F
$ S0(4E6+8T11    ) * d $          Reduce E → E + T
$ S0(4E6         ) * d $          Shift 7
$ S0(4E6)7       * d $            Reduce F → (E)
$ S0F3           * d $            Reduce T → F
$ S0T2           * d $            Shift 9
$ S0T2*9         d $              Shift 5
$ S0T2*9d5       $                Reduce F → d
$ S0T2*9F10      $                Reduce T → T ∗ F
$ S0T2           $                Reduce E → T
$ S0E1           $                Accept
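The complete shift-reduce loop with the SLR table hard-coded can be sketched in Java (our own illustration; all names are assumptions, and we push only states, which suffices because each state determines the symbol that led to it):

```java
import java.util.*;

// SLR(1) shift-reduce parser for E -> E + T | T, T -> T * F | F, F -> d | ( E )
public class SLRParser {
    // Productions used by reduce actions: left side and right-side length
    // r0: E -> E + T, r1: E -> T, r2: T -> T * F, r3: T -> F, r4: F -> d, r5: F -> ( E )
    static final String[] LHS = {"E", "E", "T", "T", "F", "F"};
    static final int[]    LEN = { 3,   1,   3,   1,   1,   3 };

    static final Map<String, String> ACTION = new HashMap<>();  // "state,terminal"
    static final Map<String, Integer> GOTO = new HashMap<>();   // "state,nonterminal"
    static {
        String[][] entries = {
            {"0,d","s5"},{"0,(","s4"},{"1,+","s8"},{"1,$","acc"},
            {"2,+","r1"},{"2,*","s9"},{"2,)","r1"},{"2,$","r1"},
            {"3,+","r3"},{"3,*","r3"},{"3,)","r3"},{"3,$","r3"},
            {"4,d","s5"},{"4,(","s4"},
            {"5,+","r4"},{"5,*","r4"},{"5,)","r4"},{"5,$","r4"},
            {"6,+","s8"},{"6,)","s7"},
            {"7,+","r5"},{"7,*","r5"},{"7,)","r5"},{"7,$","r5"},
            {"8,d","s5"},{"8,(","s4"},{"9,d","s5"},{"9,(","s4"},
            {"10,+","r2"},{"10,*","r2"},{"10,)","r2"},{"10,$","r2"},
            {"11,+","r0"},{"11,*","s9"},{"11,)","r0"},{"11,$","r0"}};
        for (String[] e : entries) ACTION.put(e[0], e[1]);
        GOTO.put("0,E", 1); GOTO.put("0,T", 2); GOTO.put("0,F", 3);
        GOTO.put("4,E", 6); GOTO.put("4,T", 2); GOTO.put("4,F", 3);
        GOTO.put("8,T", 11); GOTO.put("8,F", 3); GOTO.put("9,F", 10);
    }

    public static boolean parse(String w) {
        String input = w + "$";
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int i = 0;
        while (true) {
            String act = ACTION.get(states.peek() + "," + input.charAt(i));
            if (act == null) return false;              // blank entry: error
            if (act.equals("acc")) return true;
            if (act.charAt(0) == 's') {                 // shift: push state, consume token
                states.push(Integer.parseInt(act.substring(1)));
                i++;
            } else {                                    // reduce A -> beta
                int p = Integer.parseInt(act.substring(1));
                for (int k = 0; k < LEN[p]; k++) states.pop();  // pop |beta| states
                Integer g = GOTO.get(states.peek() + "," + LHS[p]);
                if (g == null) return false;
                states.push(g);                         // goto on A
            }
        }
    }
}
```

parse("(d+d)*d") follows exactly the trace on the parsing-example slide and accepts, while parse("d+") hits a blank entry and is rejected.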
Compiler Construction Syntactic Analysis 99
Comparing Grammars
• LR(1) grammars describe languages that are a proper superset of languages represented by LL(1) grammars
• LR(1) is more powerful than LALR(1)
• LALR(1) is more efficient than LR(1)
• For a language like C:
– LR(1) parser has thousands of states
– LALR(1) parser has hundreds of states
Compiler Construction Syntactic Analysis 100
Comparing Context-free Grammars
[Venn diagram: the CFGs contain the LR(k) grammars, which contain LR(1), which contain LALR(1), which contain SLR(1); LL(1) is drawn as the innermost class]
Compiler Construction Syntactic Analysis 101
Chomsky’s Grammar Hierarchy
Consider productions of the form α→ β
Type    Name               Criteria     Recognizer
Type 3  Regular            A → a | aB   Finite automaton
Type 2  Context-free       A → α        Push-down automaton
Type 1  Context-sensitive  |α| ≤ |β|    Linear bounded automaton
Type 0  Unrestricted       α ≠ ε        Turing machine
Compiler Construction Syntactic Analysis 102
Grammar Hierarchy
[Diagram: nested hierarchy — Type 0 (Unrestricted) contains Type 1 (Context-sensitive), which contains Type 2 (Context-free), which contains Type 3 (Regular)]
Compiler Construction Syntactic Analysis 103
Error Handling
• Compilers cannot process only syntactically correct programs; they must cope with erroneous input as well
• Language specifications do not usually describe how the compiler should respond to syntactic errors
• Review of types of errors
– Lexical
– Syntactic
– Semantic
– Logical
Compiler Construction Syntactic Analysis 104
Syntactic Errors
What should be done when the stream of tokens coming from the lexer disobeys the grammatical rules of the language?
Compiler Construction Syntactic Analysis 105
Goals
• Errors should be reported clearly and accurately
• Some error recovery should be performed so subsequent errors can be detected
• The error detection and reporting mechanism should not significantly slow down the processing of correct programs
Compiler Construction Syntactic Analysis 106
Issues
• Sometimes an error exists many lines before it is detected
• Types of errors are dependent on the programming language used
• See Example 4.1 in the dragon book
Compiler Construction Syntactic Analysis 107
Error Handling
• Report the location of the detected error
– at least the line number
– possibly the position within that line
– report the problem
• Recovery
– A poor job may produce many “spurious” errors
– One strategy: skip “bad” tokens and require a number of “good” tokens before any subsequent errors are reported
Compiler Construction Syntactic Analysis 108
Error Recovery Strategies (1)
Panic-mode
• Discard tokens until some synchronizing token is detected
• Advantages

– simple to implement
– won’t enter an infinite loop
Error Recovery Strategies (2)
Phrase-level
• Perform local correction on the remaining input (e.g., replace a comma with a semicolon) to allow the parser to continue
• Used first with top-down parsers
• Has difficulty coping with errors that occur before the point of detection
Error Recovery Strategies (3)
Error productions
• Augment grammar with special “error rules”
• Very useful if certain erroneous constructs are anticipated
• Yacc supports error productions
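A hypothetical fragment in the Yacc style, using the reserved `error` token and the real `yyerrok` macro (which resets the parser’s error state); the rule names are invented for illustration:

```yacc
stmt : expr ';'        { /* normal statement */ }
     | error ';'       { yyerror("bad statement"); yyerrok; }
     ;
```

The `error ';'` alternative lets the parser absorb any malformed statement up to the next semicolon and then continue normally.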
Error Recovery Strategies (4)
Global correction
• Finds the minimal number of corrections required to produce a good parse tree from a bad one
• Interesting from a theoretical point of view, but not too practical
• The corrected parse tree obviously may not be what the programmer intended!
Yacc/Bison Program
• Used to generate LALR(1) parsers
• Developed by S.C. Johnson
• YACC stands for “Yet Another Compiler-Compiler”
• As with Lex, originally for C under Unix, but other platforms are supported
• Yacc-generated C code can be linked with Lex-generated C code for a ready-made lexer/parser combination
• GNU Bison is the modern version that we will use
We’ll just call it Yacc, though
Yacc Specification
%{
C/C++ Declarations
%}
Yacc Declarations
%%
Rules
%%
Programmer functions
Yacc Specification (2)
%{
C/C++ Declarations
%}
Yacc Declarations
%%
Rules
%%
Programmer functions
1. C/C++ macros and declarations are placed in the C/C++ declarations section
2. Yacc token declarations and precedence assignments are placed in the Yacc declarations section
3. Code to execute when productions are matched is placed in the rules section
4. Arbitrary C/C++ code is placed in the programmer functions section; functions named yylex() and yyerror() (normally produced by Lex) must be available
Yacc Rules
• Consist of a grammar production and an associated action
• The Yacc syntax for the rule
A → Bx | C

is

A : B x { $$ = new ANode($1, "x"); cout << "Matched A -> Bx" << endl; }
  | C   { $$ = new ANode($1); cout << "Matched A -> C" << endl; }
  ;
Yacc Rules
A → Bx | C

A : B x { $$ = new ANode($1, "x"); cout << "Matched A -> Bx" << endl; }
  | C   { $$ = new ANode($1); cout << "Matched A -> C" << endl; }
  ;
• The $$ metasymbol represents the value to be returned by the parser when the production is matched; it represents the left side non-terminal (A in this case)
• The $1, $2, etc. metasymbols represent the values of the grammar symbols matched on the right side of the production
• Since the parser works from the bottom up, any non-terminals on the right side will already have been matched, so their values are available
Example Yacc Specification

%{ /* -------------------------- C/C++ declarations */
#include <ctype.h>
int yylex();
void yyerror(char *);
%}
/* -------------------------- Yacc declarations */
%union {
    int value;
    int symbol;
}
%type <value> S E I
%token <symbol> digit
%left '+'
%left '*'
%% /* -------------------------- Rules */
S : E { printf("%d\n", $1); }
  | /* epsilon */ {}
  ;
E : E '+' E { $$ = $1 + $3; }
  | E '*' E { $$ = $1 * $3; }
  | '(' E ')' { $$ = $2; }
  | I { $$ = $1; }
  ;
I : I digit { $$ = 10 * $1 + ($2 - '0'); }
  | digit { $$ = $1 - '0'; }
  ;
%% /* -------------------------- C/C++ code */
int main() {
    while ( !feof(stdin) ) {
        yyparse();
    }
    return 0;
}
Yacc Specification to Parser
[Figure: the specification prog.y (Declarations %% Production rules %% C procedures, with main() calling yyparse()) is run through Yacc to produce y.tab.c, which contains yyparse() driven by a parse table (DFA)]