Syntactic Analysis - cs.southern.edu · Advantages of CFGs Precise, easy-to-understand syntactic...
Context-free Grammars
• The syntax of programming language constructs can be described by context-free grammars (CFGs)
• Relatively simple and widely used
• More powerful grammars exist
– Context-sensitive grammars (CSG)
– Type-0 grammars
Both are too complex and inefficient for general use
• Backus-Naur Form (BNF) and extended BNF (EBNF) are a convenient way to represent CFGs
Compiler Construction Syntactic Analysis 2
Advantages of CFGs
• Precise, easy-to-understand syntactic specification of a programming language
• Efficient parsers can be automatically generated for some classes of CFGs
• This automatic generation process can reveal ambiguities that might otherwise go undetected during language design
• A well-designed grammar makes translation to object code easier
• Language evolution is expedited by an existing grammatical language description
Compiler Construction Syntactic Analysis 3
Role of the Syntactic Analyzer
• Second phase of compilation
• Input to parser is the output of the lexer
• Output of parser is (usually) a parse tree
[Diagram: source code → lexer → parser; the parser repeatedly asks the lexer to "get next token" and receives a token; both phases consult the symbol table]
Compiler Construction Syntactic Analysis 4
Parsers
• Universal parsers
– Cocke-Younger-Kasami algorithm
– Earley’s algorithm
– Both are too inefficient for production compilers
• “Normal” parsers
– Work only on subclasses of CFGs
– Examples: LL, LR, LALR(1)
– Automated tools available for the popular subclasses
Compiler Construction Syntactic Analysis 5
Context-free Grammar
A context-free grammar (CFG) is a 4-tuple
〈VN, VT, s, P〉
• VN is a set of non-terminal symbols
• VT is a set of terminal symbols
• s is a distinguished element of VN called the start symbol
• P is a set of productions or rules that specify how legal strings are built
P ⊆ VN × (VN ∪ VT)∗
Compiler Construction Syntactic Analysis 6
CFG Elements
• Terminals: basic symbols from which strings are formed (typically correspond to the tokens from the lexer)
• Non-terminals: syntactic variables that denote sets of strings and, in particular, language constructs
• Start symbol: a non-terminal; the set of strings denoted by the start symbol is the language defined by the grammar
• Productions: the set of rules that define how terminals and non-terminals can be combined to form strings in the language
A → bXYz
Compiler Construction Syntactic Analysis 7
Example
Symbol table interpreter
G = 〈VN,VT ,s,P〉
VN = {S}
VT = {new, id, num, insert, lookup, quit}
s = S
P : S → new id num
      | insert id id num
      | lookup id id
      | quit
Compiler Construction Syntactic Analysis 8
Example
An arithmetic expression language
G = 〈VN,VT ,s,P〉
VN = {E}
VT = {id, +, ∗, (, ), −}
s = E
P : E → E + E
      | E ∗ E
      | (E)
      | −E
      | id
Compiler Construction Syntactic Analysis 9
Notational Conventions (1)
Dragon book, pages 166, 167
Terminals
• Lower-case letters early in the alphabet (a, b, etc.)
• Operator symbols (+, ∗, etc.)
• Punctuation symbols (parentheses, commas, etc.)
• Digits
• Boldface strings (id, if, etc.)
Compiler Construction Syntactic Analysis 10
Notational Conventions (2)
Non-terminals
• Upper-case letters early in the alphabet (A, B, etc.)
• The letter S, if used, is usually the start symbol
• Lower-case italics names (expr, stmt, etc.)
Compiler Construction Syntactic Analysis 11
Notational Conventions (3)
• Grammar symbols (either terminals or non-terminals)
– Upper-case letters late in the alphabet (X , Y , etc.)
• Strings of terminals
– Lower-case letters late in the alphabet (u, v, etc.)
• Strings of grammar symbols
– Lower-case Greek letters (α, β, etc.)
– Useful for representing generic productions
Compiler Construction Syntactic Analysis 12
Notational Conventions (4)
• Productions with the same left side can be “merged” into one production using the | symbol
A→ α1, A→ α2, . . . , A→ αk
becomes
A→ α1 | α2 | . . . | αk
• Unless otherwise indicated, the left side of the first listed production is the start symbol
Compiler Construction Syntactic Analysis 13
Example
A programming language construct
stmt → ;
     | if ( expr ) stmt else stmt
     | while ( expr ) stmt
     | blk
     | id = expr ;

blk → { stmt∗ }
Compiler Construction Syntactic Analysis 14
Derivations
• Rewrite rule approach
• A production is treated as a rewriting rule in which a non-terminal on the left side of the production is replaced by the grammar symbols on the right side of the production
• Begin with the start symbol and, through a sequence of derivations, produce any string in L(G)
Compiler Construction Syntactic Analysis 15
Derivation
Given the productions
A→ αBβ
B→ λ1λ2 . . .λn
we can derive
A ⇒ αBβ ⇒ αλ1λ2 . . .λnβ
Compiler Construction Syntactic Analysis 16
A Derivation
Given the productions
E → E + E | E ∗ E | (E) | −E | id

we can derive −(id + id):

E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)
Compiler Construction Syntactic Analysis 17
Derivations
Let α be a string of grammar symbols (terminals and non-terminals)

α ⇒∗ β means β is derived from α in zero or more steps:

1. α ⇒∗ α (Base case)
2. If α ⇒∗ γ and γ ⇒∗ β, then α ⇒∗ β (Inductive case)
Compiler Construction Syntactic Analysis 18
The Language of a Grammar
Given a grammar G, the language of G is L(G)
L(G) ⊆ VT∗

L(G) = {w ∈ VT∗ | S ⇒∗ w}
Compiler Construction Syntactic Analysis 19
Sentential Forms
• Leftmost derivation
  – The leftmost non-terminal is replaced at each step
  – A rightmost derivation replaces the rightmost non-terminal at each step
• Sentential form
  A string of grammar symbols that may be obtained from a valid derivation
• Leftmost sentential form
  A string of grammar symbols that may be obtained from a valid leftmost derivation
Compiler Construction Syntactic Analysis 20
Regular Languages and CFLs
• All regular languages are context-free
• Consider the regular expression
a∗b∗
Let G = 〈{A, B}, {a, b}, A, {A → aA | B, B → bB | ε}〉
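This grammar can be checked directly in code — a minimal sketch of ours (the class and method names are assumptions, not from the slides), with one recursive method per non-terminal of A → aA | B, B → bB | ε:

```java
// Recognizer for the grammar A -> aA | B, B -> bB | epsilon,
// which generates the regular language a*b*.
public class AStarBStar {
    private final String input;
    private int pos = 0;

    public AStarBStar(String input) { this.input = input; }

    // A -> aA | B
    private boolean A() {
        if (pos < input.length() && input.charAt(pos) == 'a') {
            pos++;          // consume 'a'
            return A();
        }
        return B();         // fall through to B
    }

    // B -> bB | epsilon
    private boolean B() {
        if (pos < input.length() && input.charAt(pos) == 'b') {
            pos++;          // consume 'b'
            return B();
        }
        return true;        // epsilon: always succeeds
    }

    // The string is accepted if A succeeds and consumes all input
    public boolean accepts() {
        return A() && pos == input.length();
    }
}
```

For example, "aabbb" is accepted while "aba" is rejected.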
Compiler Construction Syntactic Analysis 21
Producing a Grammar from a Regular Language
1. Construct an NFA from the regular expression
2. Each state in the NFA corresponds to a non-terminal symbol
3. For a transition from state A to state B given input symbol x, add a production of the form
A→ xB
4. If A is a final state, add the production
A→ ε
Compiler Construction Syntactic Analysis 22
Parse Trees
• A graphical representation of a sequence of derivations
• Each interior node is a non-terminal and its children are the right side of one of the non-terminal’s productions

[Parse tree for id + id ∗ id: the root E has children E, +, E; the right-hand E has children E, ∗, E; the remaining E nodes each derive id]
Compiler Construction Syntactic Analysis 23
Parse Trees
• If you read the leaves of the tree from left to right they form a sentential form
  – Also called the “yield” or “frontier” of the parse tree
• All the leaves need not be terminals; the parse tree may be incomplete
• Valid sentential forms can contain non-terminals

[The same parse tree for id + id ∗ id as on the previous slide]
Compiler Construction Syntactic Analysis 24
Ambiguity
Given the productions
E → E +E | E ∗E | (E) | id
Derive id + id ∗ id:

E ⇒ E + E ⇒ id + E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

or

E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id
Compiler Construction Syntactic Analysis 25
Ambiguity and Parse Trees
A grammar G is ambiguous if a string in L(G) can have more than one parse tree

[Two parse trees for id + id ∗ id: one groups the string as id + (id ∗ id), the other as (id + id) ∗ id]
Compiler Construction Syntactic Analysis 26
Consequences of Ambiguity
• Ambiguity is generally bad
• Often means there is more than one way to interpret a string
Add before multiply or multiply before add?
• An ambiguous grammar should be rewritten to remove the ambiguity
Compiler Construction Syntactic Analysis 27
Removing the Ambiguity
Consider the rewritten productions
E → T | E + T
T → F | T ∗ F
F → (E) | id

[Parse tree for id + id ∗ id under this grammar: E derives E + T; the left E derives T → F → id, and the right T derives T ∗ F with each factor deriving id]

Here only one parse tree is possible
Compiler Construction Syntactic Analysis 28
Disambiguating Rules
Can we provide rules for disambiguating

id + (id ∗ id)

from

(id + id) ∗ id ?
Compiler Construction Syntactic Analysis 29
Top-down Parsing
• Recursive descent is an example
• Grows the parse tree from the root down to the leaves
• Useful for recognizing flow-of-control constructs since they are always labeled with a keyword (e.g., if, while, do, for)
• Requires each production for the same non-terminal to begin with a unique token
Compiler Construction Syntactic Analysis 30
Left factoring
Can be used to factor out a common prefix in two or more productions

For example, to parse if...then vs. if...then...else

C → if E then S else S
  | if E then S

Left factor the grammar (factor out the common left expression):

C → if E then S X
X → else S | ε
Compiler Construction Syntactic Analysis 31
Top-down Parsing
Two requirements
• Left-factor the grammar
Produce a grammar in which no two productions for the same non-terminal share a common prefix
• No left recursion
A ⇒+ Aα
Parser could get into an infinite loop
Compiler Construction Syntactic Analysis 32
Top-down Parsing
Top-down parsing produces a sequence of left-most derivations
A → Bx | Cy
B → z
C → w
Produces two strings: zx and wy
Compiler Construction Syntactic Analysis 33
Top-down Parsers
Two common approaches are used in top-down parsing
• Recursive descent parser
– Recursive
– The structure of the grammar is hard-coded into the parsing program
• Table-driven parser
– Non-recursive
– The structure of the language is encoded in a parse table
Compiler Construction Syntactic Analysis 34
Recursive Descent
• Relatively easy to implement
• Reads the input stream (from the scanner) left to right and verifies its correctness
• Perl has a recursive descent parser (Parse::RecDescent)
• “Recursive,” since parsing is accomplished via recursive procedures
• “Descent,” since parsing is top-down (descends from the root down the branches to the leaves)
Compiler Construction Syntactic Analysis 35
Recursive Descent

Each non-terminal is a subroutine call
A → Bx | Cy
B → z
C → w

[Call diagram: parsing A invokes B (which matches z) followed by x, or C (which matches w) followed by y; the numbers give the order of the calls and matches]
Compiler Construction Syntactic Analysis 36
Recursive Descent
• A candidate grammar:
E → T | E + T
T → F | T ∗ F
F → (E) | d
Bad because of left recursion
• The grammar can be modified to support a recursive descent parser:
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d
Compiler Construction Syntactic Analysis 37
Generalized Parser
import java.util.Scanner;

public abstract class RecursiveDescent {
    private String input;
    protected int cursor = 0;

    public RecursiveDescent() {
        getInputString();
        if ( parse() && cursor == input.length() ) {
            System.out.println("Accept");
        } else {
            error();
        }
    }

    protected final boolean checkNextToken(char ch) {
        // Ignore whitespace
        while ( cursor < input.length() &&
                (input.charAt(cursor) == ' ' || input.charAt(cursor) == '\t') ) {
            cursor++;
        }
        return (cursor < input.length())
            ? input.charAt(cursor++) == ch
            : false;
    }

    protected static void error() {
        System.out.println("Invalid string");
        System.exit(1);
    }

    protected final void getInputString() {
        // The original slides used a course-provided Console.In.getString()
        input = new Scanner(System.in).nextLine();
    }

    public abstract boolean parse();
}
Compiler Construction Syntactic Analysis 38
Subclass for Given Grammar (1)
public class Expression extends RecursiveDescent {
    /** Original Grammar:
     *    E -> T | E + T
     *    T -> F | T * F
     *    F -> ( E ) | d
     *
     *  Adapted Grammar:
     *    E  -> T E'
     *    E' -> + T E' | e
     *    T  -> F T'
     *    T' -> * F T' | e
     *    F  -> ( E ) | d
     *
     *  Note method names: E1() => E' and T1() => T'
     */
    public boolean parse() {
        return E();
    }

    public static void main(String[] args) {
        new Expression();
    }

    // Continued . . .
Compiler Construction Syntactic Analysis 39
Subclass for Given Grammar (2)
    private boolean E() {
        int pos = cursor;
        // E -> T E'
        if ( T() && E1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
E→ T E ′
Compiler Construction Syntactic Analysis 40
Subclass for Given Grammar (3)
    private boolean E1() {
        int pos = cursor;
        // E' -> + T E'
        if ( checkNextToken('+') && T() && E1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        // E' -> e
        return true;
    }
E ′→+T E ′ | ε
Compiler Construction Syntactic Analysis 41
Subclass for Given Grammar (4)
    private boolean T() {
        int pos = cursor;
        // T -> F T'
        if ( F() && T1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
T → FT ′
Compiler Construction Syntactic Analysis 42
Subclass for Given Grammar (5)
    private boolean T1() {
        int pos = cursor;
        // T' -> * F T'
        if ( checkNextToken('*') && F() && T1() ) {
            return true;
        }
        cursor = pos; // Backtrack
        // T' -> e
        return true;
    }
T ′→∗FT ′ | ε
Compiler Construction Syntactic Analysis 43
Subclass for Given Grammar (6)
    private boolean F() {
        int pos = cursor;
        // F -> ( E )
        if ( checkNextToken('(') && E() && checkNextToken(')') ) {
            return true;
        }
        cursor = pos; // Backtrack
        // F -> d
        if ( checkNextToken('d') ) {
            return true;
        }
        cursor = pos; // Backtrack
        return false;
    }
}
F → (E) | d
Compiler Construction Syntactic Analysis 44
Backtracking
• The example recursive descent parser used backtracking
• Recursive descent parsing is criticized as being inefficient due to backtracking
• Some grammars can be written so that no backtracking is required
– The right side of the production starts with a terminal, so you know immediately which production to apply
– A top-down parser that requires no backtracking is called a predictive parser
Compiler Construction Syntactic Analysis 45
The Bad News
Some grammars cannot be processed with a top-down parser
We need to determine the characteristics required to make a top-down parser feasible
Compiler Construction Syntactic Analysis 46
Preprocessing Needed
FIRST(α) is the set of terminals that begin strings derived from α
A → Bx | Cy
B → z
C → w

FIRST(B) = {z}
FIRST(C) = {w}
FIRST(A) = {z, w}
Compiler Construction Syntactic Analysis 47
One Criterion
Given a production of the form
A → α | β
if FIRST(α) ∩ FIRST(β) ≠ ∅, then a top-down parser cannot be used
Compiler Construction Syntactic Analysis 48
ε Productions
• ε productions complicate the situation
• FOLLOW(A) is the set of terminals that can appear immediately to theright of A in some sentential form
A → Bx | Cy
B → z | ε
C → w

FIRST(B) = {z, ε}
FIRST(C) = {w}
FIRST(A) = {z, x, w}

FOLLOW(B) = {x}
FOLLOW(C) = {y}
FOLLOW(A) = {$} (end of input)
Compiler Construction Syntactic Analysis 49
FOLLOW
• Without any ε productions, FIRST would be sufficient
• Formally: If X ∈ VN ∪ VT, then

FIRST(X) = {X}, if X ∈ VT
FIRST(X) = {a | a ∈ VT and X ⇒∗ aβ}, otherwise

If A ∈ VN, then

FOLLOW(A) = {a | a ∈ VT and S ⇒∗ αAaβ}

(where S is the start symbol)
• How do we compute FIRST and FOLLOW?
Compiler Construction Syntactic Analysis 50
FIRST Computation

SetOfTerminalSymbols FIRST(GrammarSymbol X) {
    if ( X is a terminal ) F ← {X};                    FIRST(X) is just X
    else {
        F ← ∅;
        if ( X → ε is a production ) F ← F ∪ {ε};      Add ε to FIRST(X)
        if ( X → y1 y2 ... yn is a production ) {
            if ( ∃ i such that ε ∈ FIRST(y1), ε ∈ FIRST(y2), ..., ε ∈ FIRST(yi−1),
                 and a ∈ FIRST(yi) )
                F ← F ∪ {a};
            if ( ε ∈ FIRST(y1), ε ∈ FIRST(y2), ..., ε ∈ FIRST(yn) )
                F ← F ∪ {ε};                           Add ε to FIRST(X)
        }
    }
    return F;
}
Compiler Construction Syntactic Analysis 51
FIRST
In a nutshell:
• If A does not derive ε, then

FIRST(A) = {a ∈ VT | A ⇒∗ aβ}

• Else, if A ⇒∗ ε, then

FIRST(A) = {a ∈ VT | A ⇒∗ aβ} ∪ {ε}
Compiler Construction Syntactic Analysis 52
FOLLOW Computation
SetOfTerminalSymbols FOLLOW(NonTerminalSymbol A) {
    F ← ∅;
    if ( A is the start symbol )
        F ← F ∪ {$};
    if ( B → αAβ is a production )                     α can be ε
        F ← F ∪ (FIRST(β) − {ε});
    if ( C → αA or (C → αAγ and ε ∈ FIRST(γ)) )
        F ← F ∪ FOLLOW(C);
    return F;
}
Compiler Construction Syntactic Analysis 53
FOLLOW
In a nutshell:
• If there is no derivation S ⇒+ αA, then

FOLLOW(A) = {a ∈ VT | S ⇒+ αAaβ}

• Else, if S ⇒+ αA, then

FOLLOW(A) = {a ∈ VT | S ⇒+ αAaβ} ∪ {$}
Compiler Construction Syntactic Analysis 54
FIRST and FOLLOW Example
Compute the FIRST and FOLLOW sets for the grammar from which our recursive descent parser was built:

E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d
Compiler Construction Syntactic Analysis 55
FIRST and FOLLOW Example
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d

The solution:

FIRST(+) = {+}      FIRST(E) = {(, d}       FOLLOW(E) = {$, )}
FIRST(∗) = {∗}      FIRST(E′) = {ε, +}      FOLLOW(E′) = {$, )}
FIRST(d) = {d}      FIRST(T) = {(, d}       FOLLOW(T) = {+, ), $}
FIRST(() = {(}      FIRST(T′) = {ε, ∗}      FOLLOW(T′) = {+, ), $}
FIRST()) = {)}      FIRST(F) = {(, d}       FOLLOW(F) = {∗, +, ), $}
Compiler Construction Syntactic Analysis 56
LL(1) Grammar
• Scanning Left-to-right
• Leftmost derivation
• 1 symbol lookahead
LL(2), . . . , LL(k) means 2, . . . , k lookahead symbols
Most parsers have just one symbol of lookahead
Compiler Construction Syntactic Analysis 57
LL(1) Grammar
Formally, a grammar is LL(1) if and only if whenever A → α | β:

1. FIRST(α) ∩ FIRST(β) = ∅
2. At most one of α or β can derive ε
3. If β ⇒∗ ε, then α does not derive any string that starts with a terminal in FOLLOW(A)

All LL(1) grammars can be parsed by a recursive descent parser, and a recursive descent parser without backtracking (a predictive parser) can parse only LL(1) grammars
Compiler Construction Syntactic Analysis 58
Common Prefixes
Recall the common prefix example:

C → if E then S else S
  | if E then S

FIRST(if E then S else S) = {if}
FIRST(if E then S) = {if}

Thus the grammar is not LL(1). Left factoring removes the common prefix (though the dangling else still leaves the grammar ambiguous):

C → if E then S X
X → else S | ε
Compiler Construction Syntactic Analysis 59
Left Recursion
Consider the grammar:
E → E + d | d

FIRST(E + d) = {d}
FIRST(d) = {d}
Thus the grammar is not LL(1)
A recursive descent parser would succumb to infinite recursion
Compiler Construction Syntactic Analysis 60
Parse Table from FIRST, FOLLOW
• If A → α and b ∈ FIRST(α), then parsetable[A][b] = A → α
• If A → α and ε ∈ FIRST(α), then for each b ∈ FOLLOW(A), parsetable[A][b] = A → α
• If more than one production lands in the same entry, then the grammar is not LL(1); for any two productions A → α and A → β we need FIRST(α) ∩ FIRST(β) = ∅
Compiler Construction Syntactic Analysis 61
Parse Table for Example Grammar
Build an LL(1) parse table for our sample grammar:
E → T E′
E′ → +T E′ | ε
T → F T′
T′ → ∗F T′ | ε
F → (E) | d

FIRST and FOLLOW sets:

FIRST(+) = {+}      FIRST(E) = {(, d}       FOLLOW(E) = {$, )}
FIRST(∗) = {∗}      FIRST(E′) = {ε, +}      FOLLOW(E′) = {$, )}
FIRST(d) = {d}      FIRST(T) = {(, d}       FOLLOW(T) = {+, ), $}
FIRST(() = {(}      FIRST(T′) = {ε, ∗}      FOLLOW(T′) = {+, ), $}
FIRST()) = {)}      FIRST(F) = {(, d}       FOLLOW(F) = {∗, +, ), $}
Compiler Construction Syntactic Analysis 62
Parse Table for Example Grammar
The solution:
Top of   Input Symbol
Stack    d          +            ∗            (          )         $
E        E → TE′                              E → TE′
E′                  E′ → +TE′                            E′ → ε    E′ → ε
T        T → FT′                              T → FT′
T′                  T′ → ε       T′ → ∗FT′               T′ → ε    T′ → ε
F        F → d                                F → (E)
Compiler Construction Syntactic Analysis 63
LL(1) Table-driven Parser
[Diagram: input a1 a2 a3 ... an $ read by the LL parser, which maintains a stack (with $ at the bottom), consults the parse table, and produces output]
Compiler Construction Syntactic Analysis 64
LL(1) Parsing Algorithm

LL Parser() {
    stack.push(S);                        Push start symbol onto empty stack
    a ← scanner.getNextToken();           Get next token
    while ( not stack.empty() ) {
        X ← stack.top();                  Look at top of stack
        if ( X is a non-terminal and parsetable[X][a] = X → y1...yk ) {
            stack.pop();                  Pop off top item
            stack.push(yk ... y1);        Push right-side symbols on in reverse order
        } else if ( X = a ) {
            stack.pop();                  Pop off top item
            a ← scanner.getNextToken();   Get next token
        } else
            Error();                      Illegal string
    }
}
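This algorithm can be sketched concretely in Java with the parse table for our grammar hard-coded (a sketch of ours; the names and the string-keyed table layout are assumptions, not from the slides):

```java
import java.util.*;

// Table-driven LL(1) parser for E -> T E', E' -> + T E' | eps,
// T -> F T', T' -> * F T' | eps, F -> ( E ) | d
public class LL1Parser {
    // parsetable[X][a], keyed "X,a"; the value is the right side to push
    // (an empty array encodes the eps production)
    static final Map<String, String[]> TABLE = new HashMap<>();
    static {
        TABLE.put("E,d",  new String[]{"T", "E'"});
        TABLE.put("E,(",  new String[]{"T", "E'"});
        TABLE.put("E',+", new String[]{"+", "T", "E'"});
        TABLE.put("E',)", new String[]{});
        TABLE.put("E',$", new String[]{});
        TABLE.put("T,d",  new String[]{"F", "T'"});
        TABLE.put("T,(",  new String[]{"F", "T'"});
        TABLE.put("T',+", new String[]{});
        TABLE.put("T',*", new String[]{"*", "F", "T'"});
        TABLE.put("T',)", new String[]{});
        TABLE.put("T',$", new String[]{});
        TABLE.put("F,d",  new String[]{"d"});
        TABLE.put("F,(",  new String[]{"(", "E", ")"});
    }
    static final Set<String> NT = Set.of("E", "E'", "T", "T'", "F");

    public static boolean parse(String w) {
        String input = w + "$";
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                        // start symbol
        int i = 0;
        while (!stack.isEmpty()) {
            String x = stack.peek();
            String a = String.valueOf(input.charAt(i));
            if (NT.contains(x)) {
                String[] rhs = TABLE.get(x + "," + a);
                if (rhs == null) return false;  // blank table entry: error
                stack.pop();
                for (int j = rhs.length - 1; j >= 0; j--)
                    stack.push(rhs[j]);         // push right side in reverse
            } else if (x.equals(a)) {
                stack.pop();                    // match terminal (or the final $)
                i++;
            } else {
                return false;                   // mismatch: error
            }
        }
        return i == input.length();
    }
}
```

parse("d+d*d") and parse("(d+d)*d") accept, while parse("d+") hits a blank entry and is rejected.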
Compiler Construction Syntactic Analysis 65
Parsing Example
Stack        Input          Rule
$ E          d + d * d $    E → T E′
$ E′T        d + d * d $    T → F T′
$ E′T′F      d + d * d $    F → d
$ E′T′d      d + d * d $
$ E′T′       + d * d $      T′ → ε
$ E′         + d * d $      E′ → +T E′
$ E′T+       + d * d $
$ E′T        d * d $        T → F T′
$ E′T′F      d * d $        F → d
$ E′T′d      d * d $
$ E′T′       * d $          T′ → ∗F T′
$ E′T′F*     * d $
$ E′T′F      d $            F → d
$ E′T′d      d $
$ E′T′       $              T′ → ε
$ E′         $              E′ → ε
$            $              Accept
Compiler Construction Syntactic Analysis 66
Another Parsing Example
Stack          Input            Rule
$ E            (d + d) * d $    E → T E′
$ E′T          (d + d) * d $    T → F T′
$ E′T′F        (d + d) * d $    F → (E)
$ E′T′)E(      (d + d) * d $
$ E′T′)E       d + d) * d $     E → T E′
$ E′T′)E′T     d + d) * d $     T → F T′
$ E′T′)E′T′F   d + d) * d $     F → d
$ E′T′)E′T′d   d + d) * d $
$ E′T′)E′T′    + d) * d $       T′ → ε
$ E′T′)E′      + d) * d $       E′ → +T E′
$ E′T′)E′T+    + d) * d $
$ E′T′)E′T     d) * d $         T → F T′
$ E′T′)E′T′F   d) * d $         F → d
$ E′T′)E′T′d   d) * d $
$ E′T′)E′T′    ) * d $          T′ → ε
$ E′T′)E′      ) * d $          E′ → ε
$ E′T′)        ) * d $
$ E′T′         * d $            T′ → ∗F T′
$ E′T′F*       * d $
$ E′T′F        d $              F → d
$ E′T′d        d $
$ E′T′         $                T′ → ε
$ E′           $                E′ → ε
$              $                Accept
Compiler Construction Syntactic Analysis 67
Try a Non-LL(1) Grammar
E→ E + id | id
Observe FIRST(E + id) = FIRST(id) = {id}
Recursive descent parser: infinite recursion
Parse table:

Top of   Input Symbol
Stack    id               $
E        E → id
         E → E + id

(two productions land in the same entry — a conflict)
Compiler Construction Syntactic Analysis 68
Top-down Parsing Summary
To produce a top-down parser:
1. Eliminate left recursion and common prefixes; this yields an LL(1) grammar
2. Find the FIRST and FOLLOW sets
3. Build either the recursive descent parser methods or the parsing table
Compiler Construction Syntactic Analysis 69
Limitations of LL(1) Grammars
• In many cases a grammar G1 can be easily devised to represent strings in a language L(G1), but G1 is not LL(1)
• Sometimes G1 can be rewritten to form G2, where L(G1) = L(G2) and G2 is LL(1)
• Some context-free languages have no LL(1) grammars
Compiler Construction Syntactic Analysis 70
Bottom-up Parsing
• Grows parse tree from the leaves up
• Only two choices when scanning input
– shift a symbol onto the stack
– reduce
• Parser reduces in the reverse order of a rightmost derivation
• Bottom-up parsers are more powerful than top-down parsers
They can be used to parse a larger variety of grammars
Compiler Construction Syntactic Analysis 71
Reduction
E→ E +E | E ∗E | (E) | −E | id
E ⇒ E + E ⇒ E + E ∗ E ⇒ E + E ∗ id ⇒ E + id ∗ id ⇒ id + id ∗ id

The parser performs a rightmost derivation in reverse
Compiler Construction Syntactic Analysis 72
Handles
• A handle of a string
– is a substring
– that matches the right side of a production
– whose reduction to the non-terminal on the left side represents one step along the reverse of a rightmost derivation

• For unambiguous grammars, every right-sentential form has a unique handle
Compiler Construction Syntactic Analysis 73
Handle—More Formally
• A handle of a right-sentential form γ is a production A → β and a position in γ where β can be found
• If (A → β, k) is a handle, then replacing β in γ at position k with A produces the previous right-sentential form in a rightmost derivation of γ

The substring to the right of a handle contains only terminal symbols
Compiler Construction Syntactic Analysis 74
Handle Pruning
• Begin with string to parse
• Find a handle and replace it with the left side of a production that produces that handle
• Repeat until only the start symbol remains
Compiler Construction Syntactic Analysis 75
Handle Pruning Example
E → E + T | T
T → T ∗ F | F
F → d

Sentential Form    Handle
d + d ∗ d          (F → d, 1)
F + d ∗ d          (T → F, 1)
T + d ∗ d          (E → T, 1)
E + d ∗ d          (F → d, 3)
E + F ∗ d          (T → F, 3)
E + T ∗ d          (F → d, 5)
E + T ∗ F          (T → T ∗ F, 3)
E + T              (E → E + T, 1)
E                  –

Observe that this is a rightmost derivation in reverse
Compiler Construction Syntactic Analysis 76
Shift-Reduce Parsing
Two problems to solve
• Find substring to be reduced in a right-sentential form
• Determine which production to choose in case more than one production has that substring on its right side
Compiler Construction Syntactic Analysis 77
Overview of Process
• Stack contains states and grammar symbols
• Grammar symbols on the stack represent a viable prefix

[Diagram: input a1 a2 a3 ... an $ feeding an LR parser that maintains a stack and consults a parse table with Action and Goto parts]
Compiler Construction Syntactic Analysis 78
Parse Table
• Action
– shift
– reduce
• Goto
– Next state
Compiler Construction Syntactic Analysis 79
Parse Table Actions

• Shift
  – Pushes the input symbol and a state onto the stack
• Reduce
  – Replaces a string of symbols on the stack with a non-terminal
  – Symbols on the stack can be either terminals or non-terminals
Compiler Construction Syntactic Analysis 80
Shift-Reduce Parsing
• Stack holds grammar symbols
– $ indicates bottom of stack
• Input buffer for string to be parsed
– $ indicates end of string
• Parser activity
– shifts zero or more input symbols onto the stack until a handle β is on the top of the stack
– β is then reduced to the left side of a production
Compiler Construction Syntactic Analysis 81
Shift-Reduce Parsing
• Initial parser state
– Stack: $ Input: w$
(Stack grows to the right; string is consumed from left to right)
• Final parser state (if no errors)
– Stack: $S Input: $
• Parser actions
– Shift next input symbol to top of stack
– Reduce handle on top of stack to non-terminal
– Accept when string consumed and S on stack
– Error when string cannot be parsed
Compiler Construction Syntactic Analysis 82
Viable Prefix
Prefix of a right sentential form that can appear on the stack of a shift-reduce parser
Compiler Construction Syntactic Analysis 83
Types of Bottom-up Parsers
• SLR
– “Simple LR”
– Built from LR(0) items; no lookahead in the items
• LR
– LR(1), more powerful, but requires a lot of memory
• LALR
– Look-ahead LR
– Yacc is LALR(1)
Compiler Construction Syntactic Analysis 84
SLR
• We’ll concentrate on SLR since it is the simplest form
• To construct an SLR parse table we need items
• An item consists of a production and a numeric position within that production
– An item encodes where you are in a production
Compiler Construction Syntactic Analysis 85
Expression Grammar
E→ E +E | E ∗E | (E) | id
compare to
E → E + T | T
T → T ∗ F | F
F → (E) | id
Compiler Construction Syntactic Analysis 86
Canonical LR(0) States
1. Augment the grammar by adding a new production
S ′→ S
2. The closure operation sets up the states
3. The goto operation computes transitions between states
Compiler Construction Syntactic Analysis 87
LR(0) Items
An LR(0) item of a grammar G is a production of G with a dot (·) at someposition of the right side.
Example: Four items can be derived from production A→ XYZ
A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·
Compiler Construction Syntactic Analysis 88
Interpreting LR(0) Items
• An item indicates how much of a production we have seen at a given point in the parsing process
• The item

[A → X·YZ]

means we have seen a string derivable from X and hope to see a string derivable from YZ
Compiler Construction Syntactic Analysis 89
Closure Algorithm
ItemSet closure(ItemSet I) {
    J ← I;
    do {
        Jold ← J;
        for each item [A → α·Bβ] ∈ J and each production B → γ ∈ G do {
            J ← J ∪ {B → ·γ};
        }
    } while ( J ≠ Jold );
    return J;
}
• B is a non-terminal
• If one B-production is added to the closure with a dot at the left end, then all B-productions will be added to the closure
Compiler Construction Syntactic Analysis 90
Closure
closure([E → E + ·T]) =

E → E + ·T
T → ·T ∗ F
T → ·F
F → ·(E)
F → ·id
Compiler Construction Syntactic Analysis 91
goto Function
goto(I,X)
• I is a set of items (really just a state)
• X is a grammar symbol
• goto(I, X) is defined as the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I
• Intuitively, if I is the set of items valid for a viable prefix γ, then goto(I, X) is the set of items valid for the viable prefix γX
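closure and goto are easy to implement once items are a value type. A compact Java sketch (our own illustration; the Item record and the grammar encoding are assumptions, not from the slides):

```java
import java.util.*;

// closure and goto over LR(0) items for the grammar
//   E -> E + T | T,  T -> T * F | F,  F -> ( E ) | id
public class LR0 {
    // An item A -> alpha . beta is (lhs, rhs, dot position)
    record Item(String lhs, List<String> rhs, int dot) {
        String afterDot() { return dot < rhs.size() ? rhs.get(dot) : null; }
    }

    static final Map<String, List<List<String>>> G = Map.of(
        "E", List.of(List.of("E", "+", "T"), List.of("T")),
        "T", List.of(List.of("T", "*", "F"), List.of("F")),
        "F", List.of(List.of("(", "E", ")"), List.of("id")));

    static Set<Item> closure(Set<Item> seed) {
        Set<Item> result = new LinkedHashSet<>(seed);
        Deque<Item> work = new ArrayDeque<>(seed);
        while (!work.isEmpty()) {
            String b = work.pop().afterDot();
            if (b != null && G.containsKey(b))          // dot sits before non-terminal B
                for (List<String> rhs : G.get(b)) {
                    Item fresh = new Item(b, rhs, 0);   // add B -> . gamma
                    if (result.add(fresh)) work.push(fresh);
                }
        }
        return result;
    }

    static Set<Item> goTo(Set<Item> items, String x) {
        Set<Item> moved = new LinkedHashSet<>();
        for (Item it : items)
            if (x.equals(it.afterDot()))                // advance the dot over X
                moved.add(new Item(it.lhs(), it.rhs(), it.dot() + 1));
        return closure(moved);
    }
}
```

closure of {[E → E + ·T]} yields exactly the five items listed on the closure slide, and goTo of that set on T gives the two-item state {[E → E + T·], [T → T·∗F]}.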
Compiler Construction Syntactic Analysis 92
LR(0) Item Sets
[Diagram: the canonical collection of LR(0) item sets I0–I11 for the augmented expression grammar, with goto transitions on the grammar symbols E, T, F, d, (, ), +, and ∗; the shift and goto entries of the SLR parse table on a later slide follow directly from these transitions]
Set-of-Items Construction
SetOfItems items(Grammar G′) {
    C ← { closure({[S′ → ·S]}) };
    do {
        Cold ← C;
        for each set of items I ∈ C and each grammar symbol X
                such that goto(I, X) is not empty do {
            C ← C ∪ { goto(I, X) };
        }
    } while ( C ≠ Cold );
    return C;
}
• G′ is the augmented grammar
Compiler Construction Syntactic Analysis 94
SLR Parse Table Construction
BuildSLRParser(Grammar G′) {
    Initialize all entries in the action and goto tables to “error”;
    C ← items(G′);                                     C = {I0, I1, ..., In}
    for each item set Ii ∈ C do {
        if ( [A → α·aβ] ∈ Ii and goto(Ii, a) = Ij )    a is a terminal
            action[i][a] ← “shift j”;
        if ( [A → α·] ∈ Ii and A ≠ S′ )
            for all a ∈ FOLLOW(A) do
                action[i][a] ← “reduce A → α”;
        if ( [S′ → S·] ∈ Ii )
            action[i][$] ← “accept”;
        for each non-terminal A of G′ do
            if ( goto(Ii, A) = Ij )
                goto[i][A] ← j;
    }
    The initial state of the parser is the i such that [S′ → ·S] ∈ Ii;
}
• G′ is the augmented grammar
Compiler Construction Syntactic Analysis 95
SLR Parsing Example
FOLLOW(E) = {$,+,)}
FOLLOW(T ) = {$,+,∗,)}
FOLLOW(F) = {$,+,∗,)}
Compiler Construction Syntactic Analysis 96
SLR Parse Table
           Action                                            Goto

State 0:   d: shift 5    (: shift 4                          E → 1, T → 2, F → 3
State 1:   +: shift 8    $: accept
State 2:   +, ), $: reduce E → T    ∗: shift 9
State 3:   +, ∗, ), $: reduce T → F
State 4:   d: shift 5    (: shift 4                          E → 6, T → 2, F → 3
State 5:   +, ∗, ), $: reduce F → d
State 6:   +: shift 8    ): shift 7
State 7:   +, ∗, ), $: reduce F → (E)
State 8:   d: shift 5    (: shift 4                          T → 11, F → 3
State 9:   d: shift 5    (: shift 4                          F → 10
State 10:  +, ∗, ), $: reduce T → T ∗ F
State 11:  +, ), $: reduce E → E + T    ∗: shift 9

All other entries are “error”
Compiler Construction Syntactic Analysis 97
LR Parsing Algorithm
LR Parser() {
    stack.push(S0);                        Push initial state onto empty stack
    done ← false;
    a ← scanner.getNextToken();            Get next token
    while ( not done ) {
        s ← stack.top();                   Look at state on top of stack
        if ( action[s][a] = shift s′ ) {
            stack.push(a);
            stack.push(s′);
            a ← scanner.getNextToken();
        } else if ( action[s][a] = reduce A → β ) {
            stack.pop 2×|β| symbols;       Pop off the handle and its states
            s′ ← stack.top();
            stack.push(A);
            stack.push(goto[s′][A]);
        } else if ( action[s][a] = accept ) {
            done ← true;
        } else {
            Error();                       Illegal string
        }
    }
}
Compiler Construction Syntactic Analysis 98
Parsing Example
Stack            Input            Rule
$ S0             (d + d) * d $    Shift 4
$ S0(4           d + d) * d $     Shift 5
$ S0(4d5         + d) * d $       Reduce F → d
$ S0(4F3         + d) * d $       Reduce T → F
$ S0(4T2         + d) * d $       Reduce E → T
$ S0(4E6         + d) * d $       Shift 8
$ S0(4E6+8       d) * d $         Shift 5
$ S0(4E6+8d5     ) * d $          Reduce F → d
$ S0(4E6+8F3     ) * d $          Reduce T → F
$ S0(4E6+8T11    ) * d $          Reduce E → E + T
$ S0(4E6         ) * d $          Shift 7
$ S0(4E6)7       * d $            Reduce F → (E)
$ S0F3           * d $            Reduce T → F
$ S0T2           * d $            Shift 9
$ S0T2*9         d $              Shift 5
$ S0T2*9d5       $                Reduce F → d
$ S0T2*9F10      $                Reduce T → T ∗ F
$ S0T2           $                Reduce E → T
$ S0E1           $                Accept
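The complete shift-reduce loop with the SLR table hard-coded can be sketched in Java (our own illustration; all names are assumptions, and we push only states, which suffices because each state determines the symbol that led to it):

```java
import java.util.*;

// SLR(1) shift-reduce parser for E -> E + T | T, T -> T * F | F, F -> d | ( E )
public class SLRParser {
    // Productions used by reduce actions: left side and right-side length
    // r0: E -> E + T, r1: E -> T, r2: T -> T * F, r3: T -> F, r4: F -> d, r5: F -> ( E )
    static final String[] LHS = {"E", "E", "T", "T", "F", "F"};
    static final int[]    LEN = { 3,   1,   3,   1,   1,   3 };

    static final Map<String, String> ACTION = new HashMap<>();  // "state,terminal"
    static final Map<String, Integer> GOTO = new HashMap<>();   // "state,nonterminal"
    static {
        String[][] entries = {
            {"0,d","s5"},{"0,(","s4"},{"1,+","s8"},{"1,$","acc"},
            {"2,+","r1"},{"2,*","s9"},{"2,)","r1"},{"2,$","r1"},
            {"3,+","r3"},{"3,*","r3"},{"3,)","r3"},{"3,$","r3"},
            {"4,d","s5"},{"4,(","s4"},
            {"5,+","r4"},{"5,*","r4"},{"5,)","r4"},{"5,$","r4"},
            {"6,+","s8"},{"6,)","s7"},
            {"7,+","r5"},{"7,*","r5"},{"7,)","r5"},{"7,$","r5"},
            {"8,d","s5"},{"8,(","s4"},{"9,d","s5"},{"9,(","s4"},
            {"10,+","r2"},{"10,*","r2"},{"10,)","r2"},{"10,$","r2"},
            {"11,+","r0"},{"11,*","s9"},{"11,)","r0"},{"11,$","r0"}};
        for (String[] e : entries) ACTION.put(e[0], e[1]);
        GOTO.put("0,E", 1); GOTO.put("0,T", 2); GOTO.put("0,F", 3);
        GOTO.put("4,E", 6); GOTO.put("4,T", 2); GOTO.put("4,F", 3);
        GOTO.put("8,T", 11); GOTO.put("8,F", 3); GOTO.put("9,F", 10);
    }

    public static boolean parse(String w) {
        String input = w + "$";
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int i = 0;
        while (true) {
            String act = ACTION.get(states.peek() + "," + input.charAt(i));
            if (act == null) return false;              // blank entry: error
            if (act.equals("acc")) return true;
            if (act.charAt(0) == 's') {                 // shift: push state, consume token
                states.push(Integer.parseInt(act.substring(1)));
                i++;
            } else {                                    // reduce A -> beta
                int p = Integer.parseInt(act.substring(1));
                for (int k = 0; k < LEN[p]; k++) states.pop();  // pop |beta| states
                Integer g = GOTO.get(states.peek() + "," + LHS[p]);
                if (g == null) return false;
                states.push(g);                         // goto on A
            }
        }
    }
}
```

parse("(d+d)*d") follows exactly the trace on the parsing-example slide and accepts, while parse("d+") hits a blank entry and is rejected.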
Compiler Construction Syntactic Analysis 99
Comparing Grammars
• LR(1) grammars describe languages that are a proper superset of languages represented by LL(1) grammars
• LR(1) is more powerful than LALR(1)
• LALR(1) is more efficient than LR(1)
• For a language like C:
– LR(1) parser has thousands of states
– LALR(1) parser has hundreds of states
Compiler Construction Syntactic Analysis 100
Comparing Context-free Grammars
[Venn diagram: the CFGs contain the LR(k) grammars, which contain LR(1), which contain LALR(1), which contain SLR(1); LL(1) is drawn as the innermost class]
Compiler Construction Syntactic Analysis 101
Chomsky’s Grammar Hierarchy
Consider productions of the form α→ β
Type    Name               Criteria     Recognizer
Type 3  Regular            A → a | aB   Finite automaton
Type 2  Context-free       A → α        Push-down automaton
Type 1  Context-sensitive  |α| ≤ |β|    Linear bounded automaton
Type 0  Unrestricted       α ≠ ε        Turing machine
Compiler Construction Syntactic Analysis 102
Grammar Hierarchy
[Diagram: nested hierarchy — Type 0 (Unrestricted) contains Type 1 (Context-sensitive), which contains Type 2 (Context-free), which contains Type 3 (Regular)]
Compiler Construction Syntactic Analysis 103
Error Handling
• Compilers cannot process only syntactically correct programs; they must cope with erroneous input as well
• Language specifications do not usually describe how the compiler should respond to syntactic errors
• Review of types of errors
– Lexical
– Syntactic
– Semantic
– Logical
Compiler Construction Syntactic Analysis 104
Syntactic Errors
What should be done when the stream of tokens coming from the lexer disobeys the grammatical rules of the language?
Compiler Construction Syntactic Analysis 105
Goals
• Errors should be reported clearly and accurately
• Some error recovery should be performed so subsequent errors can be detected
• The error detection and reporting mechanism should not significantly slow down the processing of correct programs
Compiler Construction Syntactic Analysis 106
Issues
• Sometimes an error exists many lines before it is detected
• Types of errors are dependent on the programming language used
• See Example 4.1 in the dragon book
Compiler Construction Syntactic Analysis 107
Error Handling
• Report the location of the detected error
– at least the line number
– possibly the position within that line
– report the problem
• Recovery
– A poor job may produce many “spurious” errors
– One strategy: skip “bad” tokens and require a number of “good” tokens before any subsequent errors are reported
Compiler Construction Syntactic Analysis 108
Error Recovery Strategies (1)
Panic-mode
• Discard tokens until some synchronizing token is detected
• Advantages

– simple to implement
– won’t enter an infinite loop
Error Recovery Strategies (2)
Phrase-level
• Perform local correction on the remaining input (e.g., replace a comma with a semicolon) to allow the parser to continue
• Used first with top-down parsers
• Has difficulty coping with errors that occur before the point of detection
Error Recovery Strategies (3)
Error productions
• Augment grammar with special “error rules”
• Very useful if certain erroneous constructs are anticipated
• Yacc supports error productions
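A hypothetical fragment in the Yacc style, using the reserved `error` token and the real `yyerrok` macro (which resets the parser’s error state); the rule names are invented for illustration:

```yacc
stmt : expr ';'        { /* normal statement */ }
     | error ';'       { yyerror("bad statement"); yyerrok; }
     ;
```

The `error ';'` alternative lets the parser absorb any malformed statement up to the next semicolon and then continue normally.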
Error Recovery Strategies (4)
Global correction
• Finds the minimal number of corrections required to produce a good parse tree from a bad one
• Interesting from a theoretical point of view, but not too practical
• The corrected parse tree obviously may not be what the programmer intended!
Yacc/Bison Program
• Used to generate LALR(1) parsers
• Developed by S.C. Johnson
• YACC stands for “Yet Another Compiler-Compiler”
• As with Lex, originally for C under Unix, but other platforms are supported
• Yacc-generated C code can be linked with Lex-generated C code for a ready-made lexer/parser combination
• GNU Bison is the modern version that we will use
We’ll just call it Yacc, though
Yacc Specification
%{
C/C++ Declarations
%}
Yacc Declarations
%%
Rules
%%
Programmer functions
Yacc Specification (2)
%{
C/C++ Declarations
%}
Yacc Declarations
%%
Rules
%%
Programmer functions
1. C/C++ macros and declarations are placed in the C/C++ declarations section
2. Yacc token declarations and precedence assignments are placed in the Yacc declarations section
3. Code to execute when productions are matched is placed in the rules section
4. Arbitrary C/C++ code is placed in the programmer functions section; functions named yylex() and yyerror() (normally produced by Lex) must be available
Yacc Rules
• Consist of a grammar production and an associated action
• The Yacc syntax for the rule
A → Bx | C

is

A : B x { $$ = new ANode($1, "x"); cout << "Matched A -> Bx" << endl; }
  | C   { $$ = new ANode($1); cout << "Matched A -> C" << endl; }
  ;
Yacc Rules
A → Bx | C

A : B x { $$ = new ANode($1, "x"); cout << "Matched A -> Bx" << endl; }
  | C   { $$ = new ANode($1); cout << "Matched A -> C" << endl; }
  ;
• The $$ metasymbol represents the value to be returned by the parser when the production is matched; it represents the left side non-terminal (A in this case)
• The $1, $2, etc. metasymbols represent the values of the grammar symbols matched on the right side of the production
• Since the parser works from the bottom up, any non-terminals on the right side will already have been matched, so their values are available
Example Yacc Specification

%{ /* -------------------------- C/C++ declarations */
#include <ctype.h>
int yylex();
void yyerror(char *);
%}
/* -------------------------- Yacc declarations */
%union {
    int value;
    int symbol;
}
%type <value> S E I
%token <symbol> digit
%left '+'
%left '*'
%% /* -------------------------- Rules */
S : E { printf("%d\n", $1); }
  | /* epsilon */ {}
  ;
E : E '+' E { $$ = $1 + $3; }
  | E '*' E { $$ = $1 * $3; }
  | '(' E ')' { $$ = $2; }
  | I { $$ = $1; }
  ;
I : I digit { $$ = 10 * $1 + ($2 - '0'); }
  | digit { $$ = $1 - '0'; }
  ;
%% /* -------------------------- C/C++ code */
int main() {
    while ( !feof(stdin) ) {
        yyparse();
    }
    return 0;
}
Yacc Specification to Parser
[Figure: the specification prog.y (Declarations %% Production rules %% C procedures, with main() calling yyparse()) is run through Yacc to produce y.tab.c, which contains yyparse() driven by a parse table (DFA)]