Post on 04-Jan-2016
241-437 Compilers: syntax/4 1
Compiler Structures
• Objective: describe general syntax analysis, grammars, parse trees, and FIRST and FOLLOW sets
241-437, Semester 1, 2011-2012
4. Syntax Analysis
Overview
1. What is a Syntax Analyzer?
2. What is a Grammar?
3. Parse Trees
4. Types of CFG Parsing
5. Syntax Analysis Sets
In this lecture
[Figure: compiler structure, with the Syntax Analyzer as the topic of this lecture.
Front end: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator → Intermediate Code
Back end: Intermediate Code → Code Optimizer → Target Code Generator → Target Lang. Prog.]
1. What is a Syntax Analyzer?
• The Lexical Analyzer splits the source text into tokens:
if (a == 0) a = b;    becomes    if ( a == 0 ) a = b ;
• The Syntax Analyzer builds a parse tree from the tokens:
IF
  EQ
    a
    0
  ASSIGN
    a
    b
Syntax Analyses that we do
• Identify the function of each word
• Recognize if a sentence is grammatically correct
Example sentence: I gave Jim the card
– grammar types/categories: I = pronoun, gave = verb, Jim = proper noun, the = article, card = noun
– phrases: verb phrase; noun phrase (article + noun: "the card")
– sentence: I (subject), gave (action), Jim (indirect object), the card (object)
Languages
• We use a natural language to communicate
– its grammar rules are very complex
– the rules don’t cover important things
• We use a formal language to define a programming language
– its grammar rules are fairly simple
– the rules cover almost everything
2. What is a Grammar?
• A grammar is a notation for defining a language, and is made from 4 parts:
– the terminal symbols
– the syntactic categories (nonterminal symbols), e.g. statement, expression, noun, verb
– the grammar rules (productions), e.g. A => B1 B2 ... Bn
– the starting nonterminal: the top-most syntactic category for this grammar
• We define a grammar G as a 4-tuple: G = (T, N, P, S)
– T = terminal symbols
– N = nonterminal symbols
– P = productions/rules
– S = starting nonterminal
2.1. Example 1
• Consider the grammar:
T = {0, 1}
N = {S, R}
P = { S => 0
      S => 0 R
      R => 1 S }
S is the starting nonterminal
• The right-hand sides of productions usually use a mix of terminals and nonterminals.
Is “01010” in the language?
• Start with an S rule:
Rule          String Generated
--            S
S => 0 R      0 R
R => 1 S      0 1 S
S => 0 R      0 1 0 R
R => 1 S      0 1 0 1 S
S => 0        0 1 0 1 0
• No more rules can be applied since there are no more nonterminals left in the string.
• Yes, it is in the language.
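The derivation of “01010” can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides: the `PRODUCTIONS` table and the `derive` helper are hypothetical names, and the sketch assumes each rule is applied to the leftmost occurrence of its head nonterminal.

```python
# Example 1 grammar: nonterminal -> list of production bodies.
# (Hypothetical encoding; each body is a list of symbols.)
PRODUCTIONS = {
    "S": [["0"], ["0", "R"]],
    "R": [["1", "S"]],
}


def derive(start, steps):
    """Apply (head, body) rules in order, each to the leftmost
    occurrence of head, and return the final list of symbols."""
    sentential = [start]
    for head, body in steps:
        if body not in PRODUCTIONS.get(head, []):
            raise ValueError(f"{head} => {' '.join(body)} is not a production")
        i = sentential.index(head)  # ValueError if head is not present
        sentential[i:i + 1] = body  # rewrite the nonterminal in place
    return sentential


steps = [
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0"]),
]
result = derive("S", steps)
print("".join(result))  # 01010
```

The derivation is finished when no symbol of the result is a key of `PRODUCTIONS`, i.e. no nonterminals are left.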
Example 2
• Consider the grammar:
T = {a, b, c, d, z}
N = {S, R, U, V}
P = { S => R U z | z
      R => a | b R
      U => d V U | c
      V => b | c }
S is the starting nonterminal
• The notation:
X => Y | Z
is shorthand for the two rules:
X => Y
X => Z
• Read ‘|’ as ‘or’.
Is “adbdbcz” in the language?
Rule            String Generated
--              S
S => R U z      R U z
R => a          a U z
U => d V U      a d V U z
V => b          a d b U z
U => d V U      a d b d V U z
V => b          a d b d b U z
U => c          a d b d b c z
• Yes!
• This grammar has choices about how to rewrite the string.
Example 3: Sums
• The grammar:
T = {+, -, 0, 1, 2, 3, ..., 9}
N = {L, D}
P = { L => L + D | L - D | D
      D => 0 | 1 | 2 | ... | 9 }
L is the starting nonterminal
e.g. 5 + 6 - 2
Example 4: Brackets
• The grammar:
T = { '(', ')' }
N = {L}
P = { L => '(' L ')' L
      L => ε }
L is the starting nonterminal
ε means 'nothing'
2.2. Derivations
A sequence of the form: w0 => w1 => … => wn
is a derivation of wn from w0 (written w0 =>* wn)
Example:
L
=> ( L ) L          rule L => ( L ) L
=> ( ) L            rule L => ε
=> ( )              rule L => ε
so L =>* ( )
This means that the sentence ( ) can be derived from L.
A longer example:
L
=> ( L ) L              rule L => ( L ) L
=> ( L ) ( L ) L        rule L => ( L ) L
=> ( L ) ( L )          rule L => ε
=> ( ( L ) L ) ( L )    rule L => ( L ) L
=> ( ( ) L ) ( L )      rule L => ε
=> ( ( ) ) ( L )        rule L => ε
=> ( ( ) ) ( )          rule L => ε
so L =>* (( )) ( )
2.3. Kinds of Grammars
• There are 4 main kinds of grammar, of increasing expressive power:
– regular (type 3) grammars
– context-free (type 2) grammars
– context-sensitive (type 1) grammars
– unrestricted (type 0) grammars
• They vary in the kinds of productions they allow.
Regular Grammars
• Every production is of the form:
A => a | a B | ε
– A, B are nonterminals, a is a terminal
• These are sometimes called right linear rules because if a nonterminal appears in the rule body, then it must appear last.
• Regular grammars are equivalent to REs (regular expressions).
• Example:
S => w T
T => x T
T => a
Example
• Integer => + UInt | - UInt | 0 Digits | 1 Digits | ... | 9 Digits
UInt => 0 Digits | 1 Digits | ... | 9 Digits
Digits => 0 Digits | 1 Digits | ... | 9 Digits | ε
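Since regular grammars are equivalent to REs, the Integer grammar can also be written as a regular expression. The sketch below is one possible translation, assuming the last alternative of Digits is ε: an optional sign followed by one or more digits. The names `INTEGER_RE` and `is_integer` are illustrative only.

```python
import re

# Integer => optional sign, then one digit, then Digits (zero or more digits).
INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def is_integer(text):
    """True if the whole of text is generated by the Integer grammar."""
    return INTEGER_RE.fullmatch(text) is not None


for sample in ["42", "+7", "-007", "+", "", "4-2"]:
    print(repr(sample), is_integer(sample))
```

`fullmatch` is used so that the entire string must match, in the same way that a derivation must generate the entire string.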
Context-Free Grammars (CFGs)
• Every production is of the form:
A => α
– A is a nonterminal, α can be any number of nonterminals or terminals
• The Syntax Analyzer uses CFGs.
• Example:
A => a
A => a B c d
B => a e
2.4. REs for Syntax Analysis?
• Why not use REs to describe the syntax of a programming language?
– they don’t have enough power
• Examples:
– nested blocks, if statements, balanced braces
• We need the ability to 'count', which can be implemented with CFGs but not REs.
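To illustrate the 'counting' that a CFG provides, here is a small sketch of a recognizer for the brackets grammar of Example 4 (L => ( L ) L | ε). It is written as one recursive function per nonterminal, so the recursion depth tracks how many brackets are currently open. The function names are hypothetical, and the sketch assumes short inputs (Python's recursion limit is not handled).

```python
def parse_L(s, i):
    """Match the longest prefix of s[i:] derivable from L and
    return the index just after it."""
    if i < len(s) and s[i] == "(":
        # L => ( L ) L
        i = parse_L(s, i + 1)            # inner L
        if i >= len(s) or s[i] != ")":
            raise ValueError(f"expected ')' at position {i}")
        return parse_L(s, i + 1)         # trailing L
    # L => ε : consume nothing
    return i


def is_balanced(s):
    """True if the whole string s is in the language of L."""
    try:
        return parse_L(s, 0) == len(s)
    except ValueError:
        return False


for sample in ["", "()", "(())()", "(()", ")("]:
    print(repr(sample), is_balanced(sample))
```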
3. Parse Trees
• A parse tree is a graphical way of showing how productions are used to generate a string.
• The syntax analyzer creates a parse tree to store information about the program being compiled.
Example
• The grammar:
T = { a, b }
N = { S }
P = { S => S S | a S b | a b | b a }
S is the starting nonterminal
Parse Tree for “aabbba”
• The root of the tree is the start symbol S.
• Expand the root using S => S S.
• Expand the left S using S => a S b.
• Expand the inner S using S => a b.
• Expand the right S using S => b a.
• The finished tree:
S
  S
    a
    S
      a
      b
    b
  S
    b
    a
• Stop when there are no more nonterminals in leaf positions.
• Read off the string by reading the leaves left to right.
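Reading off the leaves can be sketched in a few lines of Python. The tree encoding here is an assumption made for illustration: an interior node is a `(symbol, children)` tuple and a leaf is a plain string holding a terminal.

```python
# Parse tree for "aabbba" using S => S S, S => a S b, S => a b, S => b a.
tree = ("S", [
    ("S", ["a", ("S", ["a", "b"]), "b"]),
    ("S", ["b", "a"]),
])


def leaves(node):
    """Return the terminals at the leaves of node, left to right."""
    if isinstance(node, str):      # a leaf holds a terminal
        return [node]
    _symbol, children = node
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


print("".join(leaves(tree)))  # aabbba
```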
3.1. Ambiguity
Two (or more) parse trees for the same string
E => E + E
E => E - E
E => 0 | … | 9
Two parse trees for 2 - 3 + 4:
Tree 1:
E
  E
    E  2
    -
    E  3
  +
  E  4
or
Tree 2:
E
  E  2
  -
  E
    E  3
    +
    E  4
• The two derivations:
(1) E => E + E => E - E + E => 2 - E + E => 2 - 3 + E => 2 - 3 + 4
(2) E => E - E => 2 - E => 2 - E + E => 2 - 3 + E => 2 - 3 + 4
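Ambiguity matters because the two parse trees for 2 - 3 + 4 group the operators differently and so can mean different things. The sketch below encodes each tree as a nested `(operator, left, right)` tuple (an encoding chosen for this illustration only) and evaluates both.

```python
# Tree with + at the root: (2 - 3) + 4
tree_plus_root = ("+", ("-", 2, 3), 4)
# Tree with - at the root: 2 - (3 + 4)
tree_minus_root = ("-", 2, ("+", 3, 4))


def evaluate(node):
    """Evaluate an expression tree of ints, '+' and '-'."""
    if isinstance(node, int):
        return node
    op, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs


def flatten(node):
    """Read the leaves and operators left to right."""
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    return f"{flatten(left)} {op} {flatten(right)}"


print(flatten(tree_plus_root), "=", evaluate(tree_plus_root))    # 2 - 3 + 4 = 3
print(flatten(tree_minus_root), "=", evaluate(tree_minus_root))  # 2 - 3 + 4 = -5
```

Both trees read off as the same string, but they evaluate to 3 and -5.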
Fixing Ambiguity
• An ambiguous grammar can sometimes be made unambiguous:
E => E + T | E - T | T
T => 0 | … | 9
• We'll look at some techniques in chapter 5.
4. Types of CFG Parsing
• Top-down (chapter 5)
– recursive descent (predictive) parsing
– LL methods
• Bottom-up (chapter 6)
– operator precedence parsing
– LR methods: SLR, canonical LR, LALR
4.1. A Statement Block Grammar
• The grammar:
T = {begin, end, simplestmt, ;}
N = {B, SS, S}
P = { B => begin SS end
      SS => S ; SS | ε
      S => simplestmt | begin SS end }
B is the starting nonterminal
Parse Tree
Input: begin simplestmt ; simplestmt ; end
B                       B => begin SS end
  begin
  SS                    SS => S ; SS
    S                   S => simplestmt
      simplestmt
    ;
    SS                  SS => S ; SS
      S                 S => simplestmt
        simplestmt
      ;
      SS                SS => ε
  end
4.2. Top Down (LL) Parsing
Input: begin simplestmt ; simplestmt ; end
• The parse tree is built from the root B down towards the leaves.
• The productions are applied in this order:
1. B => begin SS end
2. SS => S ; SS
3. S => simplestmt      (first simplestmt)
4. SS => S ; SS
5. S => simplestmt      (second simplestmt)
6. SS => ε
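A top-down parse of the statement block grammar can be sketched as a recursive-descent parser with one function per nonterminal. This is an illustrative sketch rather than the parser developed in the course: `parse_block`, `peek`, and `expect` are hypothetical names, the input is assumed to be already split into tokens, and one token of lookahead is used to choose between productions.

```python
def parse_block(tokens):
    """Parse tokens with B as the start symbol and return the
    productions in the order they were applied."""
    pos = 0
    trace = []

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise ValueError(f"expected {token!r}, found {peek()!r}")
        pos += 1

    def B():
        trace.append("B => begin SS end")
        expect("begin"); SS(); expect("end")

    def SS():
        if peek() in ("simplestmt", "begin"):   # tokens that can start S
            trace.append("SS => S ; SS")
            S(); expect(";"); SS()
        else:
            trace.append("SS => ε")

    def S():
        if peek() == "simplestmt":
            trace.append("S => simplestmt")
            expect("simplestmt")
        else:
            trace.append("S => begin SS end")
            expect("begin"); SS(); expect("end")

    B()
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return trace


for step, rule in enumerate(
        parse_block("begin simplestmt ; simplestmt ; end".split()), 1):
    print(step, rule)
```

For the input from the slides this lists six productions, starting at the root B and ending with SS => ε.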
4.3. Bottom-up (LR) Parsing
Input: begin simplestmt ; simplestmt ; end
• The parse tree is built from the leaves up towards the root B.
• The productions are applied (as reductions) in this order:
1. S => simplestmt      (first simplestmt)
2. S => simplestmt      (second simplestmt)
3. SS => ε
4. SS => S ; SS         (inner SS)
5. SS => S ; SS         (outer SS)
6. B => begin SS end
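The two numberings of the nonterminal nodes (1 to 6 for top-down, 1 to 6 for bottom-up) correspond to two traversal orders of the same parse tree. The sketch below is an illustration under assumed conventions: nodes are `(label, children)` tuples, and labels such as `SS#2` are hypothetical names used only to tell the nodes apart.

```python
# Parse tree for: begin simplestmt ; simplestmt ; end
tree = ("B", [
    "begin",
    ("SS#1", [
        ("S#1", ["simplestmt"]),
        ";",
        ("SS#2", [
            ("S#2", ["simplestmt"]),
            ";",
            ("SS#3", []),          # SS => ε
        ]),
    ]),
    "end",
])


def preorder(node):
    """Nonterminal nodes, parent before children (top-down order)."""
    if isinstance(node, str):
        return []
    label, children = node
    order = [label]
    for child in children:
        order += preorder(child)
    return order


def postorder(node):
    """Nonterminal nodes, children before parent (bottom-up order)."""
    if isinstance(node, str):
        return []
    label, children = node
    order = []
    for child in children:
        order += postorder(child)
    return order + [label]


print("top-down: ", preorder(tree))
print("bottom-up:", postorder(tree))
```

A top-down parser creates a node before its children, while a bottom-up parser can only create a node once all of its children exist.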
5. Syntax Analysis Sets
• Syntax analyzers for top-down (LL) and bottom-up (LR) parsing utilize two types of sets:
– FIRST sets
– FOLLOW sets
• These sets are generated from the programming language CFG.
5.1. The FIRST Sets
• FIRST( <non-terminal> ) = set of all terminals that start productions for that non-terminal
• Example:
S => ping
S => begin S end
FIRST(S) = { ping, begin }
More Mathematically
• A is a non-terminal.
• FIRST(A) = { c | A =>* c α, c is a terminal } ∪ { ε } if A =>* ε
• α is the rest of the terminals and nonterminals after 'c'
Building FIRST Sets
• For each non-terminal A,
FIRST(A) = FIRST_SEQ(α) ∪ FIRST_SEQ(β) ∪ ...
for all productions A => α, A => β, ...
– α, β are the bodies of the productions
FIRST_SEQ()
• FIRST_SEQ(ε) = { ε }
• FIRST_SEQ(c α) = { c }, if c is a terminal
• FIRST_SEQ(A α)
= FIRST(A), if ε ∉ FIRST(A)
= (FIRST(A) - {ε}) ∪ FIRST_SEQ(α), if ε ∈ FIRST(A)
– α is a sequence of terminals and non-terminals, and possibly empty
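The FIRST and FIRST_SEQ definitions can be turned into a short program. The sketch below makes two assumptions that are not in the slides: a grammar is encoded as a dict from each nonterminal to a list of bodies (an empty list stands for ε), and the sets are found by repeating the rules until nothing changes, instead of choosing a good order by hand. The names `first_seq` and `first_sets` are hypothetical.

```python
EPS = "ε"


def first_seq(seq, first, nonterminals):
    """FIRST_SEQ of a list of symbols, using the FIRST sets found so far."""
    result = set()
    for sym in seq:
        if sym not in nonterminals:        # FIRST_SEQ(c α) = { c }
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:          # this nonterminal cannot vanish
            return result
    result.add(EPS)                        # empty, or every symbol can vanish
    return result


def first_sets(grammar):
    """Compute FIRST(A) for every nonterminal A of grammar."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:                         # repeat until no set grows
        changed = False
        for a, bodies in grammar.items():
            for body in bodies:
                new = first_seq(body, first, grammar)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


# A grammar with ε rules: R and S can both vanish.
grammar = {
    "P": [["i"], ["c"], ["n", "T", "S"]],
    "Q": [["P"], ["a", "S"], ["b", "S", "c", "S", "T"]],
    "R": [["b"], []],
    "S": [["c"], ["R", "n"], []],
    "T": [["R", "S", "q"]],
}
for nonterminal, terminals in first_sets(grammar).items():
    print(f"FIRST({nonterminal}) =", sorted(terminals))
```

Iterating to a fixed point also copes with left-recursive rules such as S => S T S, where no simple ordering of the nonterminals exists.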
FIRST() Example 1
• S => a S e
• S => B
• B => b B e
• B => C
• C => c C e
• C => d
Step 1: start with FIRST(C) since its rules only start with terminals
• FIRST(C) = {c,d}
Step 2: do FIRST(B) now that we know FIRST(C)
• FIRST(B) = {b,c,d}
Step 3: do FIRST(S) now that we know FIRST(B)
• FIRST(S) = {a,b,c,d}
FIRST() Example 2
• P => i | c | n T S
• Q => P | a S | b S c S T
• R => b | ε
• S => c | R n | ε
• T => R S q
Step 1: start with P and R since their rules only start with terminals or ε
• FIRST(P) = {i,c,n}
• FIRST(R) = {b,ε}
Step 2: do FIRST(Q) now that we know FIRST(P)
• FIRST(Q) = {i,c,n,a,b}
Step 3: do FIRST(S) now that we know FIRST(R)
• FIRST(S) = {c,b,n,ε}
Note: S => R n => n because R =>* ε
Step 4: do FIRST(T) now that we know FIRST(R) and FIRST(S)
• FIRST(T) = {b,c,n,q}
Note: T => R S q => S q => q because both R and S =>* ε
FIRST() Example 3
• S => a S e | S T S
• T => R S e | Q
• R => r S r | ε
• Q => S T | ε
• FIRST(S) = {a}
• FIRST(T) = {r, a, ε}
• FIRST(R) = {r, ε}
• FIRST(Q) = {a, ε}
Order: 1) R, S  2) Q  3) T
5.2. The FOLLOW Sets
• FOLLOW( <non-terminal> ) =
– set of all the terminals that follow <non-terminal> in productions
– the set includes $ if nothing follows <non-terminal>
• Example:
S => bing A bong | ping A pong | zing A
A => ha
• FOLLOW(A) = { bong, pong, $ }
More Mathematically
• A is a non-terminal.
• FOLLOW(A) = { c in terminals | S =>+ . . . A c . . . } ∪ { $ } if S =>+ . . . A
– '. . .' is a sequence of terminals and non-terminals
– =>+ is one or more => expansions
Building FOLLOW() Sets
• To make the FOLLOW(A) set, apply rules 1-4:
1. for all productions (B => . . . A β) add FIRST_SEQ(β) - {ε}
2. for all (B => . . . A β) with ε ∈ FIRST_SEQ(β), add FOLLOW(B)
3. for all (B => . . . A) add FOLLOW(B)
4. if A is the start symbol then add { $ }
• β is a sequence of terminals and non-terminals
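Rules 1-4 can be sketched as a program in the same style. As before, the encoding is an assumption made for illustration: a grammar is a dict from nonterminals to lists of bodies, an empty body stands for ε, and the rules are repeated until no FOLLOW set grows. The function names are hypothetical, and a compact FIRST computation is included so that the block is self-contained.

```python
EPS, END = "ε", "$"


def first_seq(seq, first):
    """FIRST_SEQ of a list of symbols; keys of first are the nonterminals."""
    out = set()
    for sym in seq:
        if sym not in first:               # terminal
            return out | {sym}
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    return out | {EPS}


def first_sets(grammar):
    first = {a: set() for a in grammar}
    while True:
        before = sum(len(s) for s in first.values())
        for a, bodies in grammar.items():
            for body in bodies:
                first[a] |= first_seq(body, first)
        if sum(len(s) for s in first.values()) == before:
            return first


def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {a: set() for a in grammar}
    follow[start].add(END)                           # rule 4
    changed = True
    while changed:
        changed = False
        for b, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:           # only nonterminals
                        continue
                    rest = first_seq(body[i + 1:], first)
                    new = rest - {EPS}               # rule 1
                    if EPS in rest:                  # rules 2 and 3
                        new |= follow[b]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


grammar = {
    "S": [["a", "S", "e"], ["B"]],
    "B": [["b", "B", "C", "f"], ["C"]],
    "C": [["c", "C", "g"], ["d"], []],
}
for nonterminal, terminals in follow_sets(grammar, "S").items():
    print(f"FOLLOW({nonterminal}) =", sorted(terminals))
```

Rule 3 needs no separate case here because FIRST_SEQ of an empty remainder is { ε }, which triggers the same branch as rule 2.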
Small Examples
• What is in FOLLOW(A) for the productions:
B => A C
C => s
• FOLLOW(A) gets FIRST_SEQ(C) == FIRST(C) == { s }
– uses rule 1
• What is in FOLLOW(A) for the productions:
C => B r
B => t A
• FOLLOW(A) gets FOLLOW(B) == { r }
– uses rule 3
FOLLOW() Example 1
• S => a S e | B
• B => b B C f | C
• C => c C g | d | ε
• FIRST(C) = {c,d,ε}
• FIRST(B) = {b,c,d,ε}
• FIRST(S) = {a,b,c,d,ε}
Step 1: S is the start symbol
• FOLLOW(S) = {$, e}
Step 2:
• FOLLOW(B) = (FIRST_SEQ(C f) - {ε}) ∪ FOLLOW(S) = {c, d, f, $, e}
Step 3:
• FOLLOW(C) = {f,g} ∪ FOLLOW(B) = {f,g,c,d,$,e}
FOLLOW() Example 2
• S => ( A ) | ε
• A => T E
• E => & T E | ε
• T => ( A ) | a | b | c
• FIRST(T) = {(,a,b,c}
• FIRST(E) = {&,ε}
• FIRST(A) = {(,a,b,c}
• FIRST(S) = {(,ε}
• FOLLOW(S) = { $ }
• FOLLOW(A) = { ) }
• FOLLOW(E) = FOLLOW(A) ∪ FOLLOW(E) = { ) }
• FOLLOW(T) = (FIRST_SEQ(E) - {ε}) ∪ FOLLOW(A) ∪ FOLLOW(E) = {&, )}
FOLLOW() Example 3
• S => T E1
• E1 => + T E1 | ε
• T => F T1
• T1 => * F T1 | ε
• F => ( S ) | id
• FIRST(F) = FIRST(T) = FIRST(S) = {(,id}
• FIRST(T1) = {*,ε}
• FIRST(E1) = {+,ε}
• FOLLOW(S) = {$,)}
• FOLLOW(E1) = FOLLOW(S) ∪ FOLLOW(E1) = {$,)}
• FOLLOW(T) = (FIRST(E1) - {ε}) ∪ FOLLOW(S) ∪ FOLLOW(E1) = {+,$,)}
• FOLLOW(T1) = FOLLOW(T) = {+,$,)}
• FOLLOW(F) = (FIRST(T1) - {ε}) ∪ FOLLOW(T) ∪ FOLLOW(T1) = {*,+,$,)}
FOLLOW() Example 4
• S => A B C | A D
• A => a | a A
• B => b | c | ε
• C => D a C
• D => b b | c c
• FIRST(D) = FIRST(C) = {b,c}
• FIRST(B) = {b,c,ε}
• FIRST(A) = FIRST(S) = {a}
• FOLLOW(S) = {$}
• FOLLOW(D) = {a,$}
• FOLLOW(A) = {b,c}
• FOLLOW(B) = {b,c}
• FOLLOW(C) = {$}