Syntactic Analysis - cs.southern.edu


Syntactic Analysis

Chapter 4

Compiler Construction Syntactic Analysis 1

Context-free Grammars

• The syntax of programming language constructs can be described by context-free grammars (CFGs)

• Relatively simple and widely used

• More powerful grammars exist

– Context-sensitive grammars (CSG)

– Type-0 grammars

Both are too complex and inefficient for general use

• Backus-Naur Form (BNF) and extended BNF (EBNF) are a convenient way to represent CFGs


Advantages of CFGs

• Precise, easy-to-understand syntactic specification of a programming language

• Efficient parsers can be automatically generated for some classes of CFGs

• This automatic generation process can reveal ambiguities that might otherwise go undetected during the language design

• A well-designed grammar makes translation to object code easier

• Language evolution is expedited by an existing grammatical language description


Role of the Syntactic Analyzer

• Second phase of compilation

• Input to parser is the output of the lexer

• Output of parser is (usually) a parse tree

[Figure: the lexer reads the source code and hands a token to the parser each time the parser issues "get next token"; both consult the symbol table]


Parsers

• Universal parsers

– Cocke-Younger-Kasami algorithm

– Earley’s algorithm

– Both too inefficient for production compilers

• “Normal” parsers

– Work only on subclasses of CFGs

– Examples: LL, LR, LALR(1)

– Automated tools available for the popular subclasses


Context-free Grammar

A context-free grammar (CFG) is a 4-tuple

⟨V_N, V_T, s, P⟩

• V_N is a set of non-terminal symbols

• V_T is a set of terminal symbols

• s is a distinguished element of V_N called the start symbol

• P is a set of productions or rules that specify how legal strings are built

P ⊆ V_N × (V_N ∪ V_T)*


CFG Elements

• Terminals: basic symbols from which strings are formed (typically corresponding to tokens from the lexer)

• Non-terminals: syntactic variables that denote sets of strings and, in particular, language constructs

• Start symbol: a non-terminal; the set of strings denoted by the start symbol is the language defined by the grammar

• Productions: set of rules that define how terminals and non-terminals can be combined to form strings in the language

A → bXYz


Example

Symbol table interpreter

G = ⟨V_N, V_T, s, P⟩

V_N = {S}
V_T = {new, id, num, insert, lookup, quit}
s = S
P: S → new id num
     | insert id id num
     | lookup id id
     | quit


Example

An arithmetic expression language

G = ⟨V_N, V_T, s, P⟩

V_N = {E}
V_T = {id, +, ∗, (, ), −}
s = E
P: E → E + E
     | E ∗ E
     | (E)
     | −E
     | id


Notational Conventions (1)

Dragon book, pages 166, 167

Terminals

• Lower-case letters early in the alphabet (a, b, etc.)

• Operator symbols (+, ∗, etc.)

• Punctuation symbols (parentheses, commas, etc.)

• Digits

• Boldface strings (id, if, etc.)


Notational Conventions (2)

Non-terminals

• Upper-case letters early in the alphabet (A, B, etc.)

• The letter S, if used, is usually the start symbol

• Lower-case italics names (expr, stmt, etc.)


Notational Conventions (3)

• Grammar symbols (either terminals or non-terminals)

– Upper-case letters late in the alphabet (X, Y, etc.)

• Strings of terminals

– Lower-case letters late in the alphabet (u, v, etc.)

• Strings of grammar symbols

– Lower-case Greek letters (α, β, etc.)

– Useful for representing generic productions


Notational Conventions (4)

• Productions with the same left side can be “merged” into one production using the | symbol

A → α1, A → α2, . . . , A → αk

becomes

A → α1 | α2 | . . . | αk

• Unless otherwise indicated, the left side of the first listed production is the start symbol


Example

A programming language construct

stmt → ;
     | if ( expr ) stmt else stmt
     | while ( expr ) stmt
     | blk
     | id = expr ;

blk → { stmt* }


Derivations

• Rewrite rule approach

• A production is treated as a rewriting rule in which a non-terminal on the left side of the production is replaced by the grammar symbols on the right side of the production

• Begin with the start symbol and through a sequence of derivations produce any string in L(G)


Derivation

Given the productions

A→ αBβ

B→ λ1λ2 . . .λn

we can derive

A ⇒ αBβ ⇒ αλ1λ2 . . .λnβ


A Derivation

Given the productions

E → E + E | E ∗ E | (E) | −E | id

we can derive −(id + id):

E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)


Derivations

Let α, β, and γ be strings of grammar symbols (terminals and non-terminals)

α ⇒* β means that β can be derived from α in zero or more derivation steps

1. α ⇒* α (Base case)

2. If α ⇒* γ and γ ⇒* β, then α ⇒* β (Inductive case)


The Language of a Grammar

Given a grammar G, the language of G is L(G)

L(G) ⊆ V_T*

L(G) = {w ∈ V_T* | S ⇒* w}


Sentential Forms

• Leftmost derivation

– Leftmost non-terminal is replaced at each step

– Rightmost derivation replaces the rightmost non-terminal at each step

• Sentential form

A string of grammar symbols that can be obtained from the start symbol by a sequence of valid derivation steps

• Leftmost sentential form

A string of grammar symbols that can be obtained from the start symbol by a sequence of valid leftmost derivation steps


Regular Languages and CFLs

• All regular languages are context-free

• Consider the regular expression

a∗b∗

Let G = ⟨{A, B}, {a, b}, A, {A → aA | B, B → bB | ε}⟩; then L(G) is exactly the language described by a∗b∗


Producing a Grammar from a Regular Language

1. Construct an NFA from the regular expression

2. Each state in the NFA corresponds to a non-terminal symbol

3. For a transition from state A to state B given input symbol x, add a production of the form

A → xB

4. If A is a final state, add the production

A→ ε
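The construction steps above can be sketched in Java. This is only an illustration, not code from the slides: the class name NfaToGrammar, the representation of each transition as a {from, symbol, to} triple, and the use of an empty symbol for an ε-move are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns NFA transitions into right-linear productions. */
final class NfaToGrammar {
    /**
     * Each edge is {from, symbol, to}; an empty symbol stands for an epsilon move.
     * States double as non-terminal names.
     */
    static List<String> productions(List<String[]> edges, List<String> finalStates) {
        List<String> result = new ArrayList<>();
        for (String[] e : edges) {
            // Transition A --x--> B becomes A -> xB (just A -> B for an epsilon move).
            result.add(e[0] + " -> " + e[1] + e[2]);
        }
        for (String f : finalStates) {
            // A final state may stop generating symbols.
            result.add(f + " -> \u03b5");
        }
        return result;
    }

    public static void main(String[] args) {
        // An NFA for a*b*: A loops on a, moves to B on epsilon, B loops on b, B is final.
        List<String[]> edges = Arrays.asList(
            new String[] {"A", "a", "A"},
            new String[] {"A", "", "B"},
            new String[] {"B", "b", "B"});
        for (String p : productions(edges, Arrays.asList("B"))) {
            System.out.println(p);
        }
    }
}
```

For the a∗b∗ automaton in main, the generated productions coincide with the grammar A → aA | B, B → bB | ε given on the previous slide.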


Parse Trees

• A graphical representation of a sequence of derivations

• Each interior node is a non-terminal and its children are the right side of one of the non-terminal’s productions

[Figure: parse tree for id + id ∗ id]


Parse Trees

• If you read the leaves of the tree from left to right they form a sentential form

– Also called the “yield” or “frontier” of the parse tree

• All the leaves need not be terminals; the parse tree may be incomplete

• Valid sentential forms can contain non-terminals

[Figure: parse tree for id + id ∗ id]


Ambiguity

Given the productions

E → E + E | E ∗ E | (E) | id

Derive id + id ∗ id:

E ⇒ E + E ⇒ id + E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

or

E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id


Ambiguity and Parse Trees

A grammar G is ambiguous if some string in L(G) has more than one parse tree

[Figure: two different parse trees for id + id ∗ id, one with + at the root and one with ∗ at the root]


Consequences of Ambiguity

• Ambiguity is generally bad

• Often means there is more than one way to interpret a string

Add before multiply or multiply before add?

• An ambiguous grammar should be rewritten to remove the ambiguity


Removing the Ambiguity

Consider the rewritten productions

E → T | E + T
T → F | T ∗ F
F → (E) | id

[Figure: the parse tree for id + id ∗ id under the rewritten grammar]

Here only one parse tree is possible


Disambiguating Rules

Can we provide rules for disambiguating

id + (id ∗ id)

from

(id + id) ∗ id?


Top-down Parsing

• Recursive descent is an example

• Grows the parse tree from the root down to the leaves

• Useful for recognizing flow-of-control constructs since they are always labeled with a keyword (e.g., if, while, do, for)

• Requires each production for the same non-terminal to begin with a unique token


Left factoring

Can be used to factor out a common prefix in two or more productions

For example, to parse if...then vs. if...then...else

C → if E then S else S
  | if E then S

Left factor the grammar (factor out the common left prefix):

C → if E then S X
X → else S | ε


Top-down Parsing

Two requirements

• Left-factor the grammar

Produce grammar in which no productions for the same non-terminal have a common prefix

• No left recursion

There must be no non-terminal A such that A ⇒+ Aα

Otherwise the parser could get into an infinite loop


Top-down Parsing

Top-down parsing produces a sequence of left-most derivations

A → Bx | Cy
B → z
C → w

Produces two strings: zx and wy


Top-down Parsers

Two common approaches are used in top-down parsing

• Recursive descent parser

– Recursive

– The structure of the grammar is hard-coded into the parsing program

• Table-driven parser

– Non-recursive

– The structure of the language is encoded in a parse table


Recursive Descent

• Relatively easy to implement

• Reads the input stream (from the scanner) left to right and verifies its correctness

• Perl has a recursive descent parser generator module (Parse::RecDescent)

• “Recursive,” since parsing is accomplished via recursive procedures

• “Descent,” since parsing is top-down (descends from the root down the branches to the leaves)


Recursive Descent

Each non-terminal is a subroutine call

A → Bx | Cy
B → z
C → w

[Figure: diagram of the subroutines A, B, and C and the terminals x, y, z, and w, with numbered steps showing the order of calls]


Recursive Descent

• A candidate grammar:

E → T | E + T
T → F | T ∗ F
F → (E) | d

Bad because of left recursion

• The grammar can be modified to support a recursive descent parser:

E → T E′
E′ → + T E′ | ε
T → F T′
T′ → ∗ F T′ | ε
F → (E) | d
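The rewrite from the candidate grammar to the modified one follows a standard pattern for immediate left recursion: A → Aα | β becomes A → βA′ and A′ → αA′ | ε. A minimal Java sketch of that pattern is shown here; the class name LeftRecursion, the list-of-symbols representation, and the use of an empty list for ε are assumptions of the example, and indirect left recursion is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: removes immediate left recursion for one non-terminal. */
final class LeftRecursion {
    /**
     * Rewrites A -> A a1 | ... | b1 | ... as
     *   A  -> b1 A' | ...
     *   A' -> a1 A' | ... | (empty list, meaning epsilon)
     */
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> alts) {
        String aPrime = a + "'";
        List<List<String>> base = new ArrayList<>();  // new alternatives for A
        List<List<String>> tail = new ArrayList<>();  // alternatives for A'
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                // Left-recursive alternative A -> A alpha: keep alpha, then A'.
                List<String> rest = new ArrayList<>(alt.subList(1, alt.size()));
                rest.add(aPrime);
                tail.add(rest);
            } else {
                // Ordinary alternative A -> beta becomes A -> beta A'.
                List<String> b = new ArrayList<>(alt);
                b.add(aPrime);
                base.add(b);
            }
        }
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        if (tail.isEmpty()) {          // no left recursion: leave the rules alone
            out.put(a, alts);
            return out;
        }
        tail.add(new ArrayList<String>());  // the epsilon alternative
        out.put(a, base);
        out.put(aPrime, tail);
        return out;
    }

    public static void main(String[] args) {
        // E -> T | E + T
        System.out.println(eliminate("E",
            Arrays.asList(Arrays.asList("T"), Arrays.asList("E", "+", "T"))));
    }
}
```

Applied to E → T | E + T, the sketch produces E → T E′ and E′ → + T E′ | ε, the same shape as the modified grammar above.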


Generalized Parser

import java.util.Scanner;

public abstract class RecursiveDescent {
    private String input;
    protected int cursor = 0;

    public RecursiveDescent() {
        getInputString();
        if ( parse() && cursor == input.length() ) {
            System.out.println("Accept");
        } else {
            error();
        }
    }

    protected final boolean checkNextToken(char ch) {
        // Ignore whitespace
        while ( cursor < input.length() &&
                (input.charAt(cursor) == ' ' || input.charAt(cursor) == '\t') ) {
            cursor++;
        }
        // The cursor advances even on a mismatch; callers backtrack by restoring it
        return (cursor < input.length()) ? input.charAt(cursor++) == ch
                                         : false;
    }

    protected static void error() {
        System.out.println("Invalid string");
        System.exit(1);
    }

    protected final void getInputString() {
        // Reads one line from standard input; stands in for the slide's
        // Console.In.getString() helper, which is not part of the standard library
        input = new Scanner(System.in).nextLine();
    }

    public abstract boolean parse();
}


Subclass for Given Grammar (1)

public class Expression extends RecursiveDescent {
    /**
     * Original Grammar:
     *   E -> T | E + T
     *   T -> F | T * F
     *   F -> ( E ) | d
     *
     * Adapted Grammar (e denotes the empty string):
     *   E  -> T E'
     *   E' -> + T E' | e
     *   T  -> F T'
     *   T' -> * F T' | e
     *   F  -> ( E ) | d
     *
     * Note method names: E1() => E' and T1() => T'
     */
    public boolean parse() {
        return E();
    }

    public static void main(String[] args) {
        new Expression();
    }
    // Continued . . .


Subclass for Given Grammar (2)

    private boolean E() {
        int pos = cursor;
        // E -> T E'
        if ( T() && E1() ) {
            return true;
        }
        cursor = pos;  // Backtrack
        return false;
    }

E → T E′


Subclass for Given Grammar (3)

    private boolean E1() {
        int pos = cursor;
        // E' -> + T E'
        if ( checkNextToken('+') && T() && E1() ) {
            return true;
        }
        cursor = pos;  // Backtrack
        // E' -> e
        return true;
    }

E′ → + T E′ | ε


Subclass for Given Grammar (4)

    private boolean T() {
        int pos = cursor;
        // T -> F T'
        if ( F() && T1() ) {
            return true;
        }
        cursor = pos;  // Backtrack
        return false;
    }

T → F T′


Subclass for Given Grammar (5)

    private boolean T1() {
        int pos = cursor;
        // T' -> * F T'
        if ( checkNextToken('*') && F() && T1() ) {
            return true;
        }
        cursor = pos;  // Backtrack
        // T' -> e
        return true;
    }

T′ → ∗ F T′ | ε


Subclass for Given Grammar (6)

    private boolean F() {
        int pos = cursor;
        // F -> ( E )
        if ( checkNextToken('(') && E() && checkNextToken(')') ) {
            return true;
        }
        cursor = pos;  // Backtrack
        // F -> d
        if ( checkNextToken('d') ) {
            return true;
        }
        cursor = pos;  // Backtrack
        return false;
    }
}

F → (E) | d


Backtracking

• The example recursive descent parser used backtracking

• Recursive descent parsing is criticized as being inefficient due to backtracking

• Some grammars can be written so that no backtracking is required

– When every right side for a non-terminal starts with a different terminal, the next token tells you immediately which production to apply

– A top-down parser that requires no backtracking is called a predictive parser
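For comparison with the backtracking parser on the earlier slides, here is a sketch of a predictive version for the same adapted expression grammar. It is an illustration under stated assumptions, not the course's code: the class name PredictiveParser and the accepts method are invented for the example, tokens are single characters, and an appended '$' marks the end of input. Each method looks only at the next character to pick a production, so the cursor never has to be restored.

```java
/** Hypothetical predictive (non-backtracking) parser for the adapted grammar. */
final class PredictiveParser {
    private final String input;
    private int cursor = 0;

    private PredictiveParser(String text) {
        // Strip whitespace and append an end marker so peek() is always valid.
        this.input = text.replaceAll("\\s+", "") + "$";
    }

    /** Returns true when the whole text is a sentence of the grammar. */
    static boolean accepts(String text) {
        PredictiveParser p = new PredictiveParser(text);
        return p.parseE() && p.cursor == p.input.length() - 1;
    }

    private char peek() { return input.charAt(cursor); }

    private boolean match(char c) {
        if (peek() == c) { cursor++; return true; }
        return false;
    }

    // E -> T E'
    private boolean parseE() { return parseT() && parseE1(); }

    // E' -> + T E' | e   (take the empty alternative only before ')' or end of input)
    private boolean parseE1() {
        if (peek() == '+') return match('+') && parseT() && parseE1();
        return peek() == ')' || peek() == '$';
    }

    // T -> F T'
    private boolean parseT() { return parseF() && parseT1(); }

    // T' -> * F T' | e   (take the empty alternative only before '+', ')' or end of input)
    private boolean parseT1() {
        if (peek() == '*') return match('*') && parseF() && parseT1();
        return peek() == '+' || peek() == ')' || peek() == '$';
    }

    // F -> ( E ) | d
    private boolean parseF() {
        if (peek() == '(') return match('(') && parseE() && match(')');
        return match('d');
    }

    public static void main(String[] args) {
        System.out.println(accepts("(d + d) * d"));
        System.out.println(accepts("d + "));
    }
}
```

The sets of characters that allow the empty alternative are written out by hand here; the FIRST and FOLLOW slides that follow describe how such sets are derived systematically.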


The Bad News

Some grammars cannot be processed with a top-down parser

We need to determine the characteristics required to make a top-down parser feasible


Preprocessing Needed

FIRST(α) is the set of terminals that begin strings derived from α

A → Bx | Cy
B → z
C → w

FIRST(B) = {z}
FIRST(C) = {w}
FIRST(A) = {z, w}


One Criterion

Given a production of the form

A → α | β

if FIRST(α) ∩ FIRST(β) ≠ ∅, then a predictive top-down parser cannot be used, because one lookahead token cannot select between α and β


ε Productions

• ε productions complicate the situation

• FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form

A → Bx | Cy
B → z | ε
C → w

FIRST(B) = {z, ε}
FIRST(C) = {w}
FIRST(A) = {z, x, w}

FOLLOW(B) = {x}
FOLLOW(C) = {y}
FOLLOW(A) = {$} (end of input)


FOLLOW

• Without any ε productions, FIRST would be sufficient

• Formally: If X ∈ V_N ∪ V_T, then

FIRST(X) = {X}, if X ∈ V_T
FIRST(X) = {a | a ∈ V_T and X ⇒* aβ}, otherwise (together with ε if X ⇒* ε)

If A ∈ V_N, then

FOLLOW(A) = {a | a ∈ V_T and S ⇒* αAaβ}

• How do we compute FIRST and FOLLOW?


FIRST Computation

SetOfTerminalSymbols FIRST(GrammarSymbol X) {
    if ( X is a terminal )
        F ← {X};                                   FIRST(X) is just X
    else {
        F ← ∅;
        if ( X → ε is a production )
            F ← F ∪ {ε};                           Add ε to FIRST(X)
        for each production X → y1 y2 . . . yn {
            for each i and each a ≠ ε such that
                    ε ∈ FIRST(y1), ε ∈ FIRST(y2), . . . , ε ∈ FIRST(yi−1), and a ∈ FIRST(yi)
                F ← F ∪ {a};
            if ( ε ∈ FIRST(y1), ε ∈ FIRST(y2), . . . , ε ∈ FIRST(yn) )
                F ← F ∪ {ε};                       Add ε to FIRST(X)
        }
    }
    return F;
}


FIRST

In a nutshell:

• If A ⇒* ε does not hold, then

FIRST(A) = {a ∈ V_T | A ⇒* aβ}

• Else, if A ⇒* ε, then

FIRST(A) = {a ∈ V_T | A ⇒* aβ} ∪ {ε}


FOLLOW Computation

SetOfTerminalSymbols FOLLOW(NonTerminalSymbol A) {
    F ← ∅;
    if ( A is the start symbol )
        F ← F ∪ {$};
    for each production B → αAβ                    α can be ε
        F ← F ∪ (FIRST(β) − {ε});
    for each production C → αA, or C → αAγ with ε ∈ FIRST(γ)
        F ← F ∪ FOLLOW(C);
    return F;
}


FOLLOW

In a nutshell:

• If S ⇒* αA does not hold, then

FOLLOW(A) = {a ∈ V_T | S ⇒* αAaβ}

• Else, if S ⇒* αA, then

FOLLOW(A) = {a ∈ V_T | S ⇒* αAaβ} ∪ {$}


FIRST and FOLLOW Example

Compute the FIRST and FOLLOW sets for the grammar from which our recursive descent parser was built:

E → T E′
E′ → + T E′ | ε
T → F T′
T′ → ∗ F T′ | ε
F → (E) | d


FIRST and FOLLOW Example

E → T E′
E′ → + T E′ | ε
T → F T′
T′ → ∗ F T′ | ε
F → (E) | d

The solution:

FIRST(+) = {+}
FIRST(∗) = {∗}
FIRST(d) = {d}
FIRST(() = {(}
FIRST()) = {)}

FIRST(E) = {(, d}
FIRST(E′) = {ε, +}
FIRST(T) = {(, d}
FIRST(T′) = {ε, ∗}
FIRST(F) = {(, d}

FOLLOW(E) = {$, )}
FOLLOW(E′) = {$, )}
FOLLOW(T) = {+, ), $}
FOLLOW(T′) = {+, ), $}
FOLLOW(F) = {∗, +, ), $}
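The sets in this solution can be reproduced mechanically. The following Java sketch iterates the FIRST and FOLLOW rules until nothing changes (a fixed-point formulation rather than the recursive pseudocode on the earlier slides). The class name FirstFollow, the array encoding of productions as {left side, right-side symbols...}, and the ASCII names E' and T' are assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical fixed-point computation of FIRST and FOLLOW sets. */
final class FirstFollow {
    static final String EPS = "\u03b5";   // marker for the empty string
    static final String END = "$";        // end-of-input marker

    /** Each row is {left side, right-side symbols...}; a row of length 1 is an epsilon rule. */
    static final String[][] GRAMMAR = {
        {"E", "T", "E'"}, {"E'", "+", "T", "E'"}, {"E'"},
        {"T", "F", "T'"}, {"T'", "*", "F", "T'"}, {"T'"},
        {"F", "(", "E", ")"}, {"F", "d"},
    };

    /** FIRST of the symbol string p[from..]; symbols without an entry in first are terminals. */
    static Set<String> firstOf(String[] p, int from, Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (int i = from; i < p.length; i++) {
            Set<String> f = first.get(p[i]);
            if (f == null) { out.add(p[i]); return out; }      // terminal
            for (String a : f) if (!a.equals(EPS)) out.add(a);
            if (!f.contains(EPS)) return out;                  // this symbol cannot vanish
        }
        out.add(EPS);                                          // every symbol can vanish
        return out;
    }

    static Map<String, Set<String>> first(String[][] g) {
        Map<String, Set<String>> first = new TreeMap<>();
        for (String[] p : g) first.put(p[0], new TreeSet<String>());
        boolean changed = true;
        while (changed) {                                      // repeat until stable
            changed = false;
            for (String[] p : g) changed |= first.get(p[0]).addAll(firstOf(p, 1, first));
        }
        return first;
    }

    static Map<String, Set<String>> follow(String[][] g, Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new TreeMap<>();
        for (String[] p : g) follow.put(p[0], new TreeSet<String>());
        follow.get(g[0][0]).add(END);                          // start symbol gets $
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] p : g) {
                for (int i = 1; i < p.length; i++) {
                    Set<String> target = follow.get(p[i]);
                    if (target == null) continue;              // terminals have no FOLLOW
                    Set<String> rest = firstOf(p, i + 1, first);
                    boolean restCanVanish = rest.remove(EPS);
                    changed |= target.addAll(rest);
                    if (restCanVanish && !p[i].equals(p[0])) {
                        changed |= target.addAll(follow.get(p[0]));
                    }
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> first = first(GRAMMAR);
        System.out.println("FIRST  = " + first);
        System.out.println("FOLLOW = " + follow(GRAMMAR, first));
    }
}
```

The sets are kept in sorted containers only so that the printed output is stable; the element order has no meaning.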


LL(1) Grammar

• Scanning Left-to-right

• Leftmost derivation

• 1 symbol lookahead

LL(2), . . . , LL(k) means 2, . . . , k lookahead symbols

Most parsers have just one symbol of lookahead


LL(1) Grammar

Formally, a grammar is LL(1) if and only if whenever A → α | β

1. FIRST(α)∩FIRST(β) = ∅

2. At most one of α or β can derive ε

3. If β ⇒* ε, then α does not derive any string that starts with a terminal in FOLLOW(A)

All LL(1) grammars can be parsed by a predictive (non-backtracking) recursive descent parser using one token of lookahead, and such parsers can handle only LL(1) grammars


Common Prefixes

Recall the common prefix example:

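C → if E then S else S
  | if E then S

FIRST(if E then S else S) = {if}
FIRST(if E then S) = {if}

Thus the grammar is not LL(1). The factored grammar removes this FIRST/FIRST conflict:

C → if E then S X
X → else S | ε

If S can itself derive an if statement, the factored grammar is still ambiguous (the dangling else), and no ambiguous grammar is strictly LL(1); parsers usually resolve this by attaching each else to the nearest if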


Left Recursion

Consider the grammar:

E → E + d | d

FIRST(E + d) = {d}
FIRST(d) = {d}

Thus the grammar is not LL(1)

A recursive descent parser would succumb to infinite recursion


Parse Table from FIRST, FOLLOW

• If more than one production lands in the same table entry, then the grammar is not LL(1)

• In an LL(1) grammar, for any two productions A → α and A → β of the same non-terminal, FIRST(α) ∩ FIRST(β) = ∅

• If A → α and b ∈ FIRST(α), then parsetable[A][b] = A → α

• If X → α and ε ∈ FIRST(α), then for each b ∈ FOLLOW(X), parsetable[X][b] = X → α


Parse Table for Example Grammar

Build an LL(1) parse table for our sample grammar:

E → T E′
E′ → + T E′ | ε
T → F T′
T′ → ∗ F T′ | ε
F → (E) | d

FIRST and FOLLOW sets:

FIRST(+) = {+}
FIRST(∗) = {∗}
FIRST(d) = {d}
FIRST(() = {(}
FIRST()) = {)}

FIRST(E) = {(, d}
FIRST(E′) = {ε, +}
FIRST(T) = {(, d}
FIRST(T′) = {ε, ∗}
FIRST(F) = {(, d}

FOLLOW(E) = {$, )}
FOLLOW(E′) = {$, )}
FOLLOW(T) = {+, ), $}
FOLLOW(T′) = {+, ), $}
FOLLOW(F) = {∗, +, ), $}


Parse Table for Example Grammar

The solution:

               Input Symbol
Top of Stack | d        | +          | ∗          | (        | )      | $
E            | E → TE′  |            |            | E → TE′  |        |
E′           |          | E′ → +TE′  |            |          | E′ → ε | E′ → ε
T            | T → FT′  |            |            | T → FT′  |        |
T′           |          | T′ → ε     | T′ → ∗FT′  |          | T′ → ε | T′ → ε
F            | F → d    |            |            | F → (E)  |        |


LL(1) Table-driven Parser

[Figure: an LL parser driven by a parse table, with a stack, an input buffer a1 a2 a3 . . . an $, and an output stream]


LL(1) Parsing Algorithm

LL Parser() {
    stack.push(S);                        Push start symbol onto empty stack
    a ← scanner.getNextToken();           Get next token
    while ( not stack.empty() ) {
        X ← stack.top();                  Look at top of stack
        if ( X is a non-terminal and parsetable[X][a] = X → y1 . . . yk ) {
            stack.pop();                  Pop off top item
            stack.push(yk . . . y1);      Push right side symbols on in reverse order
        } else if ( X = a ) {
            stack.pop();                  Pop off top item
            a ← scanner.getNextToken();   Get next token
        } else
            Error();                      Illegal string
    }
    if ( a ≠ $ ) Error();                 All input must have been consumed
}
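A runnable Java sketch of this algorithm for the example grammar is shown here. It is an illustration, not the course's implementation: the class name LL1Parser, the string-keyed table, and the single-character tokens are assumptions of the example, and unlike the pseudocode it keeps a '$' marker at the bottom of the stack so that acceptance is a single check.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical table-driven LL(1) parser for the example expression grammar. */
final class LL1Parser {
    /** Maps "nonTerminal lookahead" to the right side to push. */
    private static final Map<String, String[]> TABLE = new HashMap<>();

    private static void rule(String nonTerminal, String lookaheads, String... rhs) {
        for (char a : lookaheads.toCharArray()) {
            TABLE.put(nonTerminal + " " + a, rhs);
        }
    }

    static {
        rule("E",  "d(",  "T", "E'");
        rule("E'", "+",   "+", "T", "E'");
        rule("E'", ")$");                    // E' -> epsilon
        rule("T",  "d(",  "F", "T'");
        rule("T'", "*",   "*", "F", "T'");
        rule("T'", "+)$");                   // T' -> epsilon
        rule("F",  "d",   "d");
        rule("F",  "(",   "(", "E", ")");
    }

    static boolean accepts(String text) {
        String input = text.replaceAll("\\s+", "") + "$";   // one character per token
        int pos = 0;
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");                     // bottom-of-stack marker
        stack.push("E");                     // start symbol
        while (true) {
            String x = stack.peek();
            String a = String.valueOf(input.charAt(pos));
            if (x.equals("$")) {
                // Accept only if the end marker is the last unread character.
                return a.equals("$") && pos == input.length() - 1;
            }
            String[] rhs = TABLE.get(x + " " + a);
            if (rhs != null) {               // expand a non-terminal
                stack.pop();
                for (int i = rhs.length - 1; i >= 0; i--) stack.push(rhs[i]);
            } else if (x.equals(a)) {        // match a terminal
                stack.pop();
                pos++;
            } else {
                return false;                // empty table entry or mismatch
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("d + d * d"));
        System.out.println(accepts("(d + d"));
    }
}
```

The rule calls mirror the rows of the parse table two slides back: the second argument lists the input symbols whose column holds that production.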


Parsing Example

Stack        Input          Rule
$ E          d + d * d $    E → T E′
$ E′T        d + d * d $    T → F T′
$ E′T′F      d + d * d $    F → d
$ E′T′d      d + d * d $
$ E′T′       + d * d $      T′ → ε
$ E′         + d * d $      E′ → + T E′
$ E′T+       + d * d $
$ E′T        d * d $        T → F T′
$ E′T′F      d * d $        F → d
$ E′T′d      d * d $
$ E′T′       * d $          T′ → ∗ F T′
$ E′T′F*     * d $
$ E′T′F      d $            F → d
$ E′T′d      d $
$ E′T′       $              T′ → ε
$ E′         $              E′ → ε
$            $              Accept


Another Parsing Example

Stack             Input            Rule
$ E               (d + d) * d $    E → T E′
$ E′T             (d + d) * d $    T → F T′
$ E′T′F           (d + d) * d $    F → (E)
$ E′T′)E(         (d + d) * d $
$ E′T′)E          d + d) * d $     E → T E′
$ E′T′)E′T        d + d) * d $     T → F T′
$ E′T′)E′T′F      d + d) * d $     F → d
$ E′T′)E′T′d      d + d) * d $
$ E′T′)E′T′       + d) * d $       T′ → ε
$ E′T′)E′         + d) * d $       E′ → + T E′
$ E′T′)E′T+       + d) * d $
$ E′T′)E′T        d) * d $         T → F T′
$ E′T′)E′T′F      d) * d $         F → d
$ E′T′)E′T′d      d) * d $
$ E′T′)E′T′       ) * d $          T′ → ε
$ E′T′)E′         ) * d $          E′ → ε
$ E′T′)           ) * d $
$ E′T′            * d $            T′ → ∗ F T′
$ E′T′F*          * d $
$ E′T′F           d $              F → d
$ E′T′d           d $
$ E′T′            $                T′ → ε
$ E′              $                E′ → ε
$                 $                Accept


Try a Non-LL(1) Grammar

E→ E + id | id

Observe FIRST(E + id) = FIRST(id) = {id}

Recursive descent parser: infinite recursion

Parse table (the entry for E on input id holds two productions):

               Input Symbol
Top of Stack | id                   | $
E            | E → id, E → E + id   |


Top-down Parsing Summary

To produce a top-down parser:

1. Eliminate left recursion and common prefixes; the goal is an LL(1) grammar

2. Find the FIRST and FOLLOW sets

3. Build either the recursive descent parser methods or the parsing table


Limitations of LL(1) Grammars

• In many cases a grammar G1 can be easily devised to represent strings in a language L(G1), but G1 is not LL(1)

• Sometimes G1 can be rewritten to form G2, where L(G1) = L(G2) and G2 is LL(1)

• Some context-free languages have no LL(1) grammars


Bottom-up Parsing

• Grows parse tree from the leaves up

• Only two choices when scanning input

– shift symbol onto stack

– reduce

• Parser reduces in the reverse order of a rightmost derivation

• Bottom-up parsers are more powerful than top-down parsers

They can be used to parse a larger variety of grammars


Reduction

E → E + E | E ∗ E | (E) | −E | id

E ⇒ E + E ⇒ E + E ∗ E ⇒ E + E ∗ id ⇒ E + id ∗ id ⇒ id + id ∗ id

The parser discovers this rightmost derivation in reverse, reducing id + id ∗ id back to E


Handles

• A handle of a string

– is a substring

– that matches the right side of a production

– whose reduction to the non-terminal on the left side represents one step along the reverse of a rightmost derivation

• For unambiguous grammars, every right-sentential form has a unique handle


Handle—More Formally

• A handle of a right-sentential form γ is a production A → β and a position in γ where β can be found

• If (A → β, k) is a handle, then replacing β in γ at position k with A produces the previous right-sentential form in a rightmost derivation of γ

The substring to the right of a handle contains only terminal symbols


Handle Pruning

• Begin with string to parse

• Find handle and replace with the left side of a production that produces that handle

• Repeat until only the start symbol remains


Handle Pruning Example

E → E + T | T
T → T ∗ F | F
F → d

Sentential Form    Handle
d + d ∗ d          (F → d, 1)
F + d ∗ d          (T → F, 1)
T + d ∗ d          (E → T, 1)
E + d ∗ d          (F → d, 3)
E + F ∗ d          (T → F, 3)
E + T ∗ d          (F → d, 5)
E + T ∗ F          (T → T ∗ F, 3)
E + T              (E → E + T, 1)
E                  –

Observe that this is a rightmost derivation in reverse


Shift-Reduce Parsing

Two problems to solve

• Find substring to be reduced in a right-sentential form

• Determine what production to choose in case more than one production has that substring on its right side


Overview of Process

• Stack contains states and grammar symbols

• Grammar symbols on stack represent a viable prefix

[Figure: an LR parser with a stack, an input buffer a1 a2 a3 . . . an $, and a parse table split into Action and Goto parts]


Parse Table

• Action

– shift

– reduce

• Goto

– Next state



Parse Table Actions

• Shift

– Pushes input symbol and state on to the stack

• Reduce

– Replaces a string of symbols on the stack with a non-terminal

– Symbols on the stack can be either terminals or non-terminals



Shift-Reduce Parsing

• Stack holds grammar symbols

– $ indicates bottom of stack

• Input buffer for string to be parsed

– $ indicates end of string

• Parser activity

– shifts zero or more input symbols onto the stack until a handle β is on the top of the stack

– β is then reduced to the left side of a production


Shift-Reduce Parsing

• Initial parser state

– Stack: $ Input: w$

(Stack grows to the right; string is consumed from left to right)

• Final parser state (if no errors)

– Stack: $S Input: $

• Parser actions

– Shift next input symbol to top of stack

– Reduce handle on top of stack to non-terminal

– Accept when string consumed and S on stack

– Error when string cannot be parsed


Viable Prefix

Prefix of a right sentential form that can appear on the stack of a shift-reduce parser


Types of Bottom-up Parsers

• SLR

– “Simple LR”

– Built from LR(0) items, which carry no lookahead; FOLLOW sets decide when to reduce

• LR

– LR(1), more powerful, but requires a lot of memory

• LALR

– Look ahead LR

– Yacc is LALR(1)


SLR

• We’ll concentrate on SLR since it is the simplest form

• To construct an SLR parse table we need items

• An item consists of a production and a numeric position within that production

– An item encodes where you are in a production


Expression Grammar

E → E + E | E ∗ E | (E) | id

compare to

E → E + T | T
T → T ∗ F | F
F → (E) | id


Canonical LR(0) States

1. Augment the grammar by adding a new production

S′ → S

2. closure operation sets up states

3. goto operation computes transitions between states


LR(0) Items

An LR(0) item of a grammar G is a production of G with a dot (·) at some position of the right side.

Example: Four items can be derived from production A → XYZ

A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·


Interpreting LR(0) Items

• An item indicates how much of a production we have seen at a given point in the parsing process

• The item [A → X·YZ] means we have seen a string derivable from X and hope to see a string derivable from YZ


Closure Algorithm

ItemSet closure(ItemSet I) {
    J ← I;
    do {
        Jold ← J;
        for each item [A → α·Bβ] ∈ J and each production B → γ ∈ G do {
            J ← J ∪ {[B → ·γ]};
        }
    } while ( J ≠ Jold );
    return J;
}

• B is a non-terminal

• If one B-production is added to the closure with a dot on the left end, then all B-productions will be added to the closure


Closure

closure([E → E + ·T]) =

E → E + ·T
T → ·T ∗ F
T → ·F
F → ·(E)
F → ·id


goto Function

goto(I,X)

• I is a set of items (really just a state)

• X is a grammar symbol

• goto(I, X) is defined as the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I

• Intuitively, if I is the set of items valid for a viable prefix γ, then goto(I, X) is the set of items valid for the viable prefix γX
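The closure and goto operations can be tried out with a short Java sketch. It assumes the augmented expression grammar with the terminal d standing in for id; the class name LR0Items, the encoding of an item as the integer production × 16 + dot position, and the method names are all inventions of the example. It uses a work list instead of the repeat-until-stable loop of the closure pseudocode, which yields the same set.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical LR(0) item-set operations for the augmented expression grammar. */
final class LR0Items {
    /** Each row is {left side, right-side symbols...}; row 0 is the added production E' -> E. */
    static final String[][] G = {
        {"E'", "E"},
        {"E", "E", "+", "T"}, {"E", "T"},
        {"T", "T", "*", "F"}, {"T", "F"},
        {"F", "(", "E", ")"}, {"F", "d"},
    };

    /** Encodes "production prod with dot symbols already seen" as one integer. */
    static int item(int prod, int dot) { return prod * 16 + dot; }

    static Set<Integer> closure(Set<Integer> items) {
        Set<Integer> j = new TreeSet<>(items);
        Deque<Integer> work = new ArrayDeque<>(items);
        while (!work.isEmpty()) {
            int it = work.pop();
            String[] p = G[it / 16];
            int dot = it % 16;
            if (dot + 1 >= p.length) continue;       // dot is at the right end
            String next = p[dot + 1];                // symbol just after the dot
            for (int k = 0; k < G.length; k++) {
                // For a non-terminal after the dot, add each of its productions with the dot at the left end.
                if (G[k][0].equals(next) && j.add(item(k, 0))) work.push(item(k, 0));
            }
        }
        return j;
    }

    static Set<Integer> gotoSet(Set<Integer> items, String x) {
        Set<Integer> moved = new TreeSet<>();
        for (int it : items) {
            String[] p = G[it / 16];
            int dot = it % 16;
            // Move the dot over x wherever x is the symbol after the dot.
            if (dot + 1 < p.length && p[dot + 1].equals(x)) moved.add(it + 1);
        }
        return closure(moved);
    }

    /** Renders an item with " ." marking the dot position. */
    static String show(int it) {
        String[] p = G[it / 16];
        int dot = it % 16;
        StringBuilder sb = new StringBuilder(p[0]).append(" ->");
        for (int i = 1; i < p.length; i++) {
            if (i == dot + 1) sb.append(" .");
            sb.append(' ').append(p[i]);
        }
        if (dot + 1 == p.length) sb.append(" .");
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<Integer> start = new TreeSet<>();
        start.add(item(1, 2));                       // the item E -> E + . T
        for (int it : closure(start)) System.out.println(show(it));
    }
}
```

The main method reproduces the closure example from the Closure slide, printing the five items of closure([E → E + ·T]).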


LR(0) Item Sets

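[Figure: the LR(0) item sets I0 through I11 for the augmented expression grammar, connected by goto transitions labelled E, T, F, d, +, ∗, ( and )]

I0: E′ → ·E, E → ·E + T, E → ·T, T → ·T ∗ F, T → ·F, F → ·(E), F → ·d
I1: E′ → E·, E → E· + T
I2: E → T·, T → T· ∗ F
I3: T → F·
I4: F → (·E), E → ·E + T, E → ·T, T → ·T ∗ F, T → ·F, F → ·(E), F → ·d
I5: F → d·
I6: F → (E·), E → E· + T
I7: F → (E)·
I8: E → E + ·T, T → ·T ∗ F, T → ·F, F → ·(E), F → ·d
I9: T → T ∗ ·F, F → ·(E), F → ·d
I10: T → T ∗ F·
I11: E → E + T·, T → T· ∗ F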


Set-of-Items Construction

SetOfItems items(Grammar G′) {
    C ← { closure([S′ → ·S]) };
    do {
        Cold ← C;
        for each set of items I ∈ C and each grammar symbol X such that
                goto(I, X) is not empty do {
            C ← C ∪ { goto(I, X) };
        }
    } while ( C ≠ Cold );
    return C;
}

• G′ is the augmented grammar


SLR Parse Table Construction

BuildSLRParser(Grammar G′) {
    Initialize all the entries in the goto and action tables to “error”;
    C ← items(G′);                                  C = {I0, I1, . . . , In}
    for each item set Ii ∈ C do {
        if [A → α·aβ] ∈ Ii and goto(Ii, a) = Ij
            action[i][a] ← “shift j”;               a is a terminal
        if [A → α·] ∈ Ii and A ≠ S′
            for all a ∈ FOLLOW(A) do
                action[i][a] ← “reduce A → α”;
        if [S′ → S·] ∈ Ii
            action[i][$] ← “accept”;
        for each non-terminal A ∈ G′ do
            if goto(Ii, A) = Ij
                goto[i][A] ← j;
    }
    The initial state of the parser is i where [S′ → ·S] ∈ Ii;
}

• G′ is the augmented grammar


SLR Parsing Example

FOLLOW(E) = {$,+,)}

FOLLOW(T ) = {$,+,∗,)}

FOLLOW(F) = {$,+,∗,)}


SLR Parse Table

      | Action                                                                            | Goto
State | d       | +            | ∗            | (       | )            | $            | E | T  | F
0     | shift 5 |              |              | shift 4 |              |              | 1 | 2  | 3
1     |         | shift 8      |              |         |              | accept       |   |    |
2     |         | reduce E→T   | shift 9      |         | reduce E→T   | reduce E→T   |   |    |
3     |         | reduce T→F   | reduce T→F   |         | reduce T→F   | reduce T→F   |   |    |
4     | shift 5 |              |              | shift 4 |              |              | 6 | 2  | 3
5     |         | reduce F→d   | reduce F→d   |         | reduce F→d   | reduce F→d   |   |    |
6     |         | shift 8      |              |         | shift 7      |              |   |    |
7     |         | reduce F→(E) | reduce F→(E) |         | reduce F→(E) | reduce F→(E) |   |    |
8     | shift 5 |              |              | shift 4 |              |              |   | 11 | 3
9     | shift 5 |              |              | shift 4 |              |              |   |    | 10
10    |         | reduce T→T∗F | reduce T→T∗F |         | reduce T→T∗F | reduce T→T∗F |   |    |
11    |         | reduce E→E+T | shift 9      |         | reduce E→E+T | reduce E→E+T |   |    |


LR Parsing Algorithm

LR Parser() {
    stack.push(s0);                       Push initial state onto empty stack
    done ← false;
    a ← scanner.getNextToken();           Get next token
    while ( not done ) {
        s ← stack.top();                  Look at state on top of stack
        if ( action[s][a] = shift s′ ) {
            stack.push(a);
            stack.push(s′);
            a ← scanner.getNextToken();
        } else if ( action[s][a] = reduce A → β ) {
            stack.pop 2×|β| symbols;      Pop |β| grammar symbols and their states
            s′ ← stack.top();
            stack.push(A);
            stack.push(goto[s′][A]);
        } else if ( action[s][a] = accept ) {
            done ← true;
        } else {
            Error();                      Illegal string
        }
    }
}
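The driver can be exercised against the SLR parse table from the previous slide with a small Java sketch. Several things here are assumptions of the example rather than part of the slides: the class name SlrParser, the compact action codes ("s5" for shift 5, "r2" for reduce by production 2, "acc" for accept), the production numbering 1 to 6, and single-character tokens. The sketch also keeps only states on the stack, so a reduce pops |β| entries instead of the 2×|β| used when grammar symbols are stacked as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical SLR driver using the parse table for the expression grammar. */
final class SlrParser {
    // Productions: 1 E->E+T  2 E->T  3 T->T*F  4 T->F  5 F->(E)  6 F->d
    private static final String[] LHS = {"", "E", "E", "T", "T", "F", "F"};
    private static final int[] RHS_LEN = {0, 3, 1, 3, 1, 3, 1};
    private static final Map<String, String> ACTION = new HashMap<>();
    private static final Map<String, Integer> GOTO = new HashMap<>();

    private static void act(int state, String tokens, String action) {
        for (char t : tokens.toCharArray()) ACTION.put(state + " " + t, action);
    }

    private static void go(int state, String nonTerminal, int target) {
        GOTO.put(state + " " + nonTerminal, target);
    }

    static {
        for (int s : new int[] {0, 4, 8, 9}) { act(s, "d", "s5"); act(s, "(", "s4"); }
        act(1, "+", "s8");    act(1, "$", "acc");
        act(2, "+)$", "r2");  act(2, "*", "s9");
        act(3, "+*)$", "r4");
        act(5, "+*)$", "r6");
        act(6, "+", "s8");    act(6, ")", "s7");
        act(7, "+*)$", "r5");
        act(10, "+*)$", "r3");
        act(11, "+)$", "r1"); act(11, "*", "s9");
        go(0, "E", 1); go(0, "T", 2); go(0, "F", 3);
        go(4, "E", 6); go(4, "T", 2); go(4, "F", 3);
        go(8, "T", 11); go(8, "F", 3);
        go(9, "F", 10);
    }

    static boolean accepts(String text) {
        String input = text.replaceAll("\\s+", "") + "$";   // one character per token
        int pos = 0;
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);                                      // initial state
        while (true) {
            String action = ACTION.get(states.peek() + " " + input.charAt(pos));
            if (action == null) return false;                // error entry
            if (action.equals("acc")) return pos == input.length() - 1;
            int n = Integer.parseInt(action.substring(1));
            if (action.charAt(0) == 's') {                   // shift: push state n, consume token
                states.push(n);
                pos++;
            } else {                                         // reduce by production n
                for (int i = 0; i < RHS_LEN[n]; i++) states.pop();
                states.push(GOTO.get(states.peek() + " " + LHS[n]));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("(d + d) * d"));
        System.out.println(accepts("d +"));
    }
}
```

Running accepts on (d + d) * d performs the same sequence of shifts and reductions as the parsing example on the next slide.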


Parsing Example

Stack              Input            Action
$ S0               (d + d) * d $    Shift 4
$ S0(4             d + d) * d $     Shift 5
$ S0(4d5           + d) * d $       Reduce F → d
$ S0(4F3           + d) * d $       Reduce T → F
$ S0(4T2           + d) * d $       Reduce E → T
$ S0(4E6           + d) * d $       Shift 8
$ S0(4E6+8         d) * d $         Shift 5
$ S0(4E6+8d5       ) * d $          Reduce F → d
$ S0(4E6+8F3       ) * d $          Reduce T → F
$ S0(4E6+8T11      ) * d $          Reduce E → E + T
$ S0(4E6           ) * d $          Shift 7
$ S0(4E6)7         * d $            Reduce F → (E)
$ S0F3             * d $            Reduce T → F
$ S0T2             * d $            Shift 9
$ S0T2*9           d $              Shift 5
$ S0T2*9d5         $                Reduce F → d
$ S0T2*9F10        $                Reduce T → T ∗ F
$ S0T2             $                Reduce E → T
$ S0E1             $                Accept


Comparing Grammars

• LR(1) grammars describe languages that are a proper superset of the languages represented by LL(1) grammars

• LR(1) is more powerful than LALR(1)

• LALR(1) is more efficient than LR(1)

• For a language like C:

– LR(1) parser has thousands of states

– LALR(1) parser has hundreds of states


Comparing Context-free Grammars

[Figure: containment diagram of the grammar classes CFGs, LR(k), LR(1), LALR(1), SLR(1), and LL(1)]


Chomsky’s Grammar Hierarchy

Consider productions of the form α→ β

Type     Name                Criteria       Recognizer
Type 3   Regular             A → a | aB     Finite automaton
Type 2   Context-free        A → α          Push-down automaton
Type 1   Context-sensitive   |α| ≤ |β|      Linear bounded automaton
Type 0   Unrestricted        α ≠ ε          Turing machine


Grammar Hierarchy

[Figure: nested sets showing Type 3 (regular) inside Type 2 (context-free) inside Type 1 (context-sensitive) inside Type 0 (unrestricted)]


Error Handling

• Compilers cannot assume that every program is syntactically correct; they must also cope with erroneous input

• Language specifications do not usually describe how the compiler should respond to syntactical errors

• Review of types of errors

– Lexical

– Syntactic

– Semantic

– Logical


Syntactic Errors

What should be done when the stream of tokens coming from the lexer disobeys the grammatical rules of the language?


Goals

• Errors should be reported clearly and accurately

• Some error recovery should be performed so subsequent errors can be detected

• The error detection and reporting mechanism should not significantly slow down the processing of correct programs


Issues

• Sometimes an error exists many lines before it is detected

• Types of errors are dependent on the programming language used

• See Example 4.1 in the dragon book


Error Handling

• Report the location of the detected error

  – at least the line number
  – possibly the position within that line
  – a description of the problem

• Recovery

  – A poor job may produce many “spurious” errors
  – One strategy: skip “bad” tokens, then require a number of “good” tokens to be parsed successfully before any subsequent errors are reported


Error Recovery Strategies (1)

Panic-mode

• Discard tokens until some synchronizing token is detected

• Advantages

  – simple to implement
  – won’t enter an infinite loop
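The skipping step of panic-mode recovery can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are single characters, the synchronizing set is passed as a string such as ";}", and the function name panic_skip is hypothetical. A real parser would work on lexer tokens and choose the synchronizing set per non-terminal.

```c
#include <stddef.h>
#include <string.h>

/* Discard tokens starting at position pos until a synchronizing token
   is found. Returns the index of that token, or n if the input runs out.
   Each iteration consumes one token, so the loop always terminates. */
size_t panic_skip(const char *tokens, size_t n, size_t pos, const char *sync_set) {
    while (pos < n && strchr(sync_set, tokens[pos]) == NULL) {
        pos++;   /* discard one token */
    }
    return pos;
}
```

For the token string "x+*y;z=1;" with an error detected at index 1, panic_skip(tokens, 9, 1, ";}") returns 4, the position of the first semicolon, and parsing can resume from there.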


Error Recovery Strategies (2)

Phrase-level

• Perform local correction on remaining input (e.g., replace comma by semicolon) to allow parser to continue

• Used first with top-down parsers

• Has difficulty coping with errors that occur before the point of detection


Error Recovery Strategies (3)

Error productions

• Augment grammar with special “error rules”

• Very useful if certain erroneous constructs are anticipated

• Yacc supports error productions


Error Recovery Strategies (4)

Global correction

• Finds the minimal number of corrections required to produce a good parse tree from a bad one

• Interesting from a theoretical point of view, but not too practical

• Corrected parse tree obviously may not be what the programmer intended!
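The "number of corrections" can be made concrete as an edit distance over tokens: the fewest single-token insertions, deletions, and replacements that turn the erroneous input into a correct string. The sketch below only computes that cost for one given candidate string; a global-correction algorithm would have to search over every string in the language, which is what makes it impractical. Single-character tokens, the 64-token limit, and the name token_edit_distance are assumptions made for this example.

```c
#include <string.h>

#define MAX_TOKENS 64

/* Minimum number of single-token insertions, deletions, and replacements
   needed to turn x into y. Returns -1 if either string is too long. */
int token_edit_distance(const char *x, const char *y) {
    size_t n = strlen(x), m = strlen(y);
    int prev[MAX_TOKENS + 1], cur[MAX_TOKENS + 1];
    if (n > MAX_TOKENS || m > MAX_TOKENS) return -1;

    for (size_t j = 0; j <= m; j++) prev[j] = (int)j;   /* empty x: j insertions */
    for (size_t i = 1; i <= n; i++) {
        cur[0] = (int)i;                                /* empty y: i deletions */
        for (size_t j = 1; j <= m; j++) {
            int sub = prev[j - 1] + (x[i - 1] != y[j - 1]);  /* keep or replace */
            int del = prev[j] + 1;                           /* delete from x */
            int ins = cur[j - 1] + 1;                        /* insert into x */
            int best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        memcpy(prev, cur, (m + 1) * sizeof(int));
    }
    return prev[m];
}
```

For example, token_edit_distance("(d+d*d", "(d+d)*d") is 1, since inserting one right parenthesis repairs the input. Whether that repair matches what the programmer meant is a separate question.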


Yacc/Bison Program

• Used to generate LALR(1) parsers

• Developed by S.C. Johnson

• YACC stands for “Yet Another Compiler-Compiler”

• As with Lex, originally for C under Unix, but other platforms are supported

• Yacc-generated C code can be linked with Lex-generated C code for a ready-made lexer/parser combination

• GNU Bison is the modern version that we will use

We’ll just call it Yacc, though


Yacc Specification

%{
C/C++ Declarations
%}
Yacc Declarations
%%
Rules
%%
Programmer functions


Yacc Specification (2)


1. C/C++ macros and declarations are placed in the C/C++ declarations section

2. Yacc token declarations and precedence assignments are placed in the Yacc declarations section

3. Code to execute when productions are matched is placed in the rules section

4. Arbitrary C/C++ code is placed in the programmer functions section; functions named yylex() (normally produced by Lex) and yyerror() must be available


Yacc Rules

• Consist of a grammar production and an associated action

• The Yacc syntax for the rule

A → Bx | C

is

A : B x  { $$ = new ANode($1, "x"); cout << "Matched A -> Bx" << endl; }
  | C    { $$ = new ANode($1); cout << "Matched A -> C" << endl; }
  ;


Yacc Rules


• The $$ metasymbol represents the value to be returned when the production is matched; it is the value of the left-side non-terminal (A in this case)

• The $1, $2, etc. metasymbols represent the values of the grammar symbols matched on the right side of the production

• Since the parser works from the bottom up, the right-side symbols will have already been matched and their values will be available
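One way to picture $$ and $1, $2, ... is as positions on a value stack that the parser keeps alongside its state stack. The sketch below is a simplified stand-in, not the code Yacc generates: the ValueStack type and the functions vs_push and reduce_add are hypothetical, values are plain ints, and there are no overflow checks. It mimics a reduction by a rule of the form E : E '+' E { $$ = $1 + $3; }.

```c
/* A tiny value stack, standing in for the one a generated parser maintains. */
typedef struct {
    int vals[64];
    int top;       /* number of values currently on the stack */
} ValueStack;

void vs_push(ValueStack *s, int v) {
    s->vals[s->top++] = v;
}

/* Reduce by E -> E '+' E with action { $$ = $1 + $3; }.
   The right side has three symbols, so its values are the top three entries. */
void reduce_add(ValueStack *s) {
    int *rhs = &s->vals[s->top - 3];   /* rhs[0] is $1, rhs[1] is $2, rhs[2] is $3 */
    int result = rhs[0] + rhs[2];      /* $$ = $1 + $3 */
    s->top -= 3;                       /* pop the right-side values */
    vs_push(s, result);                /* push $$ as the value of the left side */
}
```

After pushing 2, the token '+', and 5, a call to reduce_add leaves a single value, 7, on the stack. The right-side values are already there when the action runs because bottom-up parsing matches them first.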


Example Yacc Specification

%{ /* -------------------------- C/C++ declarations */
#include <ctype.h>
#include <stdio.h>   /* printf, feof */
int yylex();
void yyerror(char *);
%} /* -------------------------- Yacc declarations */
%union {
    int value;
    int symbol;
}
%type <value> S E I
%token <symbol> digit
%left '+'
%left '*'
%% /* -------------------------- Rules */
S : E             { printf("%d\n", $1); }
  | /* epsilon */ {}
  ;

E : E '+' E   { $$ = $1 + $3; }
  | E '*' E   { $$ = $1 * $3; }
  | '(' E ')' { $$ = $2; }
  | I         { $$ = $1; }
  ;

I : I digit { $$ = 10 * $1 + ($2 - '0'); }
  | digit   { $$ = $1 - '0'; }
  ;

%% /* -------------------------- C/C++ code */
int main() {
    while ( !feof(stdin) ) {
        yyparse();
    }
    return 0;
}


Yacc Specification to Parser

[Figure: prog.y (Declarations, %%, Production rules, %%, C procedures with a main() that calls yyparse()) is translated into y.tab.c (Parse Table, DFA, yyparse())]


Build Process

[Figure: prog.y → yacc → y.tab.c → gcc → prog]

yacc prog.y
gcc -o prog y.tab.c
