Compiler Components and their Generators - Traditional Parsing Algorithms
Fall 2014-2015 Compiler Principles
Lecture 2: Parsing part 1
Roman Manevich
Ben-Gurion University
Previously: lexical analysis
• High-level process
• Scanner generator (e.g., JFlex) automatically generates scanner code
– List of regular expressions (one per lexeme) → NFA+ε → DFA → minimization → code: Token nextToken() {…}
– The generated code implements maximal munch with a tie-breaking policy
Books
• Compilers: Principles, Techniques, and Tools. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman
• Advanced Compiler Design and Implementation. Steven Muchnick
• Modern Compiler Design. D. Grune, H. Bal, C. Jacobs, K. Langendoen
• Modern Compiler Implementation in Java. Andrew W. Appel
Tentative syllabus
• Front End
– Scanning
– Top-down Parsing (LL)
– Bottom-up Parsing (LR)
– Attribute Grammars
• Intermediate Representation
– Lowering
• Optimizations
– Local Optimizations
– Dataflow Analysis
– Loop Optimizations
• Code Generation
– Register Allocation
– Instruction Selection
(mid-term exam)
Agenda
• Understand role of syntax analysis
• Context-free grammars refresher
• Top-down parsing
The bigger picture
• Compilers include different kinds of program analyses; each further constrains the set of legal programs
– Lexical constraints: program consists of legal tokens
– Syntax constraints: program included in a given context-free language
– Semantic constraints: program included in a given attribute grammar (type checking, legal inheritance graph, variables initialized before use)
– “Logical” constraints (Verifying Compiler grand challenge): memory safety (null dereference, array-out-of-bounds access, data races), functional correctness (program meets specification)
Syntax analysis overview
Role of syntax analysis
• Recover structure from stream of tokens
– Parse tree / abstract syntax tree
• Error reporting (recovery)
• Other possible tasks
– Syntax directed translation (one pass compilers)
– Create symbol table
– Create pretty-printed version of the program, e.g., Auto Formatting function in Eclipse
[Diagram: High-level Language (scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code]
From tokens to abstract syntax trees
program text: 59 + (1257 * xPosition)
Lexical Analyzer (regular expressions, finite automata): lexical error / valid
token stream: num + ( num * id )
Parser (context-free grammars, push-down automata): syntax error / valid
Grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )
Abstract Syntax Tree: + at the root with children num and *; * has children num and x
Context-free grammars refresher
Example grammar
S → S ; S
S → id := E
S → print (L)
E → id
E → num
E → E + E
L → E
L → L, E
S is shorthand for Statement, E for Expression, L for List (of expressions)
CFG terminology
Symbols:
• Terminals (tokens): ; := ( ) , + id num print
• Non-terminals: S E L
• Start non-terminal: S
Convention: the non-terminal appearing in the first derivation rule
Grammar productions (rules): N → α
S → S ; S | id := E | print (L)
E → id | num | E + E
L → E | L, E
More definitions
• Sentential form: a sequence of symbols, terminals (tokens) and non-terminals
• Sentence: a sequence of terminals (tokens)
• Derivation step: given a sentential form αNβ and a rule N → µ, a step is the transition αNβ ⇒ αµβ
• Derivation sequence: a sequence of derivation steps ρ1 ⇒ … ⇒ ρk such that ρi ⇒ ρi+1 is the result of applying one production and ρk is a sentence
Language of a CFG
• A word ω is in L(G) (a valid program) if there exists a corresponding derivation sequence
– Start with the start symbol
– Repeatedly replace one of the non-terminals by a right-hand side of a production
– Stop when the sentential form contains only terminals
• ω is in L(G) if S ⇒* ω
– Rightmost derivation
– Leftmost derivation
Leftmost derivation
S
=> S ; S
=> id := E ; S
=> id := num ; S
=> id := num ; id := E
=> id := num ; id := E + E
=> id := num ; id := num + E
=> id := num ; id := num + num
a := 56 ; b := 7 + 3
id := num ; id := num + num
S → S ; S | id := E | print (L)
E → id | num | E + E
L → E | L, E
Rightmost derivation
S
=> S ; S
=> S ; id := E
=> S ; id := E + E
=> S ; id := E + num
=> S ; id := num + num
=> id := E ; id := num + num
=> id := num ; id := num + num
a := 56 ; b := 7 + 3
id := num ; id := num + num
S → S ; S | id := E | print (L)
E → id | num | E + E
L → E | L, E
Canonical derivations
• Leftmost/rightmost derivations of a word may not be unique, but they allow describing a derivation by just the sequence of production rules taken (since the non-terminal to expand at each step is already determined)
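The point about canonical derivations can be sketched in a few lines of Python: once we agree to always expand the leftmost non-terminal, a list of rule numbers is enough to replay the whole derivation. The rule numbering and the names `PRODUCTIONS` and `leftmost_derivation` are assumptions made for this illustration; they do not come from the lecture.

```python
# Example grammar from the slides; the rule numbers are our own convention.
PRODUCTIONS = [
    ("S", ["S", ";", "S"]),            # 0
    ("S", ["id", ":=", "E"]),          # 1
    ("S", ["print", "(", "L", ")"]),   # 2
    ("E", ["id"]),                     # 3
    ("E", ["num"]),                    # 4
    ("E", ["E", "+", "E"]),            # 5
    ("L", ["E"]),                      # 6
    ("L", ["L", ",", "E"]),            # 7
]
NONTERMINALS = {"S", "E", "L"}


def leftmost_derivation(rule_numbers, start="S"):
    """Replay a leftmost derivation and return every sentential form."""
    form = [start]
    steps = [form]
    for r in rule_numbers:
        lhs, rhs = PRODUCTIONS[r]
        # The non-terminal to expand is implied: it is always the leftmost one.
        pos = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[pos] != lhs:
            raise ValueError(f"rule {r} does not expand {form[pos]}")
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(form)
    return steps


steps = leftmost_derivation([0, 1, 4, 1, 5, 4, 4])
for form in steps:
    print("=>", " ".join(form))
```

The rule sequence 0, 1, 4, 1, 5, 4, 4 reproduces the leftmost derivation of id := num ; id := num + num shown earlier.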
Parse trees
• Tree nodes are symbols, children ordered left-to-right
• Each internal node is a non-terminal and its children correspond to one of its productions N → µ1 … µk
• Root is start non-terminal
• Leaves are tokens
• Yield of parse tree: left-to-right walk over leaves
[Figure: node N with children µ1 … µk]
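As a minimal sketch of these definitions, the Python below represents a parse tree as nodes with ordered children and computes its yield by a left-to-right walk over the leaves. The class `Node`, the helpers `assign` and `num`, and the function `tree_yield` are hypothetical names chosen for this example.

```python
class Node:
    """A parse-tree node: a grammar symbol plus its children, ordered left to right."""

    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children or []


def tree_yield(node):
    """Left-to-right walk over the leaves of the tree."""
    if not node.children:
        return [node.symbol]
    result = []
    for child in node.children:
        result.extend(tree_yield(child))
    return result


def assign(expr):
    # S -> id := E
    return Node("S", [Node("id"), Node(":="), expr])


def num():
    # E -> num
    return Node("E", [Node("num")])


# Parse tree for: id := num ; id := num + num
tree = Node("S", [
    assign(num()),
    Node(";"),
    assign(Node("E", [num(), Node("+"), num()])),   # E -> E + E
])

print(" ".join(tree_yield(tree)))
```

Every internal node here corresponds to one production of the example grammar, and the yield is the token sequence the tree derives.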
Parse tree exercise
Draw the parse tree for the input
id := num ; id := num + num
S → S ; S | id := E | print (L)
E → id | num | E + E
L → E | L, E
[Figure: parse tree. The root S has children S ; S. The left S derives id := E with E → num; the right S derives id := E with E → E + E, where both inner E nodes derive num.]
Order-independent representation
Equivalently, add parentheses labeled by non-terminal names:
(S (S a := (E 56 )E )S ; (S b := (E (E 7 )E + (E 3 )E )E )S )S
Capabilities and limitations of CFGs
• CFGs naturally express
– Hierarchical structure
  • A program is a list of classes, a class is a list of definitions…
– Alternatives
  • A definition is either a field definition or a method definition
– Beginning-end type of constraints
  • Balanced parentheses: S → (S)S | ε
• Cannot express
– Correlations between unbounded strings (identifiers)
– For example: variables are declared before use: ω S ω
• Handled by semantic analysis (attribute grammars)
p. 173
Bad grammars
By Oren neu dag (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Badly-formed grammars
• A non-terminal N is reachable if S ⇒* αNβ
• A non-terminal N is generating if N ⇒* ω for some string of terminals ω
• A grammar G is badly-formed if it contains either unreachable non-terminals or non-generating non-terminals
– G1 = { S → x, N → y } (N is unreachable)
– G2 = { S → x | N, N → a N b N } (N is non-generating)
• Theorem: for every grammar G there exists an equivalent well-formed grammar G’ (that is, L(G) = L(G’))
Proof: exercise
• From now on, we will only handle well-formed grammars
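Both properties can be computed by simple fixed-point iterations. The sketch below is one possible way to do it, assuming a grammar is given as a dictionary from non-terminals to lists of right-hand sides; the representation and the function names `generating` and `reachable` are our own.

```python
def generating(grammar, terminals):
    """Non-terminals that derive at least one string of terminals."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in gen:
                continue
            # lhs is generating if some right-hand side uses only
            # terminals and non-terminals already known to be generating.
            if any(all(s in terminals or s in gen for s in rhs)
                   for rhs in alternatives):
                gen.add(lhs)
                changed = True
    return gen


def reachable(grammar, start):
    """Non-terminals that appear in some sentential form derived from start."""
    seen = {start}
    worklist = [start]
    while worklist:
        n = worklist.pop()
        for rhs in grammar.get(n, []):
            for sym in rhs:
                if sym in grammar and sym not in seen:
                    seen.add(sym)
                    worklist.append(sym)
    return seen


# The two badly-formed grammars from the slide.
G1 = {"S": [["x"]], "N": [["y"]]}
G2 = {"S": [["x"], ["N"]], "N": [["a", "N", "b", "N"]]}

print(reachable(G1, "S"))                 # N is missing: unreachable
print(generating(G2, {"x", "a", "b"}))    # N is missing: non-generating
```

Removing the non-generating non-terminals first and the unreachable ones second is the usual way to obtain a well-formed grammar; the sketch only detects them.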
Ambiguity in Context-free grammars
Sometimes there are two parse trees
Arithmetic expressions:
E → id
E → num
E → E + E
E → E * E
E → ( E )
Input: 1 + 2 + 3
Leftmost derivation
E
=> E + E
=> num + E
=> num + E + E
=> num + num + E
=> num + num + num
Parse tree: 1 + (2 + 3)
Rightmost derivation
E
=> E + E
=> E + num
=> E + E + num
=> E + num + num
=> num + num + num
Parse tree: (1 + 2) + 3
Is ambiguity a problem for compilers?
Arithmetic expressions: E → id | num | E + E | E * E | ( E )
Input: 1 + 2 + 3
The two parse trees for this input evaluate to the same value:
1 + (2 + 3) = 6
(1 + 2) + 3 = 6
Depends on semantics
Problematic ambiguity example
Arithmetic expressions: E → id | num | E + E | E * E | ( E )
Input: 1 + 2 * 3
Leftmost derivation
E
=> E + E
=> num + E
=> num + E * E
=> num + num * E
=> num + num * num
Parse tree: 1 + (2 * 3) = 7
This is what we usually want: * has precedence over +
Rightmost derivation
E
=> E * E
=> E * num
=> E + E * num
=> E + num * num
=> num + num * num
Parse tree: (1 + 2) * 3 = 9
Ambiguous grammars
• A grammar is ambiguous if there exists a word for which there are (equivalently)
– Two different leftmost derivations
– Two different rightmost derivations
– Two different parse trees
• Property of grammars, not languages
• Some languages are inherently ambiguous: no unambiguous grammars exist for them
• No algorithm can decide whether an arbitrary grammar is ambiguous
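Although ambiguity of a grammar cannot be decided in general, for one fixed word we can count its parse trees. The sketch below does this for the fragment E → E + E | E * E | num of the arithmetic grammar (parentheses and id are left out to keep it short); the function name `count_parse_trees` is an assumption made for this example.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E + E | E * E | num."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees that derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "num" else 0
        total = 0
        # Try every operator position k as the root of the tree.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("num + num + num".split()))
```

A word with more than one parse tree is a witness that the grammar is ambiguous; finding no witness among the words tried proves nothing.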
Drawbacks of ambiguous grammars
• Ambiguous semantics
• Parsing complexity
• May affect other phases
• Solutions?
– Allow only non-ambiguous grammars
– Transform grammar into non-ambiguous
– Handle as part of parsing method
• Using special form of “precedence”
• Wait for bottom-up parsing lecture
Transforming ambiguous grammars to non-ambiguous by layering
Ambiguous grammar
E → E + E
E → E * E
E → id
E → num
E → ( E )
Unambiguous grammar
E → E + T     (Layer 1)
E → T
T → T * F     (Layer 2)
T → F
F → id        (Layer 3)
F → num
F → ( E )
Each layer takes care of one way of composing sub-strings to form a string:
1: by +
2: by *
3: atoms
Let’s derive 1 + 2 * 3
Transformed grammar: * precedes +
Unambiguous grammar
E → E + T
E → T
T → T * F
T → F
F → id
F → num
F → ( E )
Derivation
E
=> E + T
=> T + T
=> F + T
=> 1 + T
=> 1 + T * F
=> 1 + F * F
=> 1 + 2 * F
=> 1 + 2 * 3
Parse tree
[Figure: parse tree that groups the input as 1 + (2 * 3)]
Transformed grammar: + precedes *
Unambiguous grammar
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )
Derivation
E
=> E * T
=> T * T
=> T + F * T
=> F + F * T
=> 1 + F * T
=> 1 + 2 * T
=> 1 + 2 * F
=> 1 + 2 * 3
Parse tree
[Figure: parse tree that groups the input as (1 + 2) * 3]
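The effect of the layer order can be checked with a small evaluator. In the sketch below, `layers` lists the binary operators from the outermost layer to the innermost one, so `["+", "*"]` corresponds to the grammar where * precedes + and `["*", "+"]` to the one where + precedes *. The left-recursive rules are handled with a loop, which is an implementation choice of this example and not something taken from the slides; the name `evaluate` is likewise hypothetical.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def evaluate(tokens, layers):
    """Evaluate tokens with a layered grammar; atoms are integers or ( E )."""
    pos = 0

    def parse_layer(level):
        nonlocal pos
        if level == len(layers):
            return parse_atom()
        value = parse_layer(level + 1)
        # Rule "X -> X op Y" of this layer, iterated from left to right.
        while pos < len(tokens) and tokens[pos] == layers[level]:
            op = OPS[tokens[pos]]
            pos += 1
            value = op(value, parse_layer(level + 1))
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_layer(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return value
        return int(tok)

    result = parse_layer(0)
    if pos != len(tokens):
        raise SyntaxError("unexpected token")
    return result


print(evaluate("1 + 2 * 3".split(), ["+", "*"]))   # * binds tighter
print(evaluate("1 + 2 * 3".split(), ["*", "+"]))   # + binds tighter
```

The same token sequence yields 7 under the first layering and 9 under the second, matching the two parse trees above.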
Another example for layering
Ambiguous grammar
P → ε
| P P
| ( P )
[Figure: two different parse trees for the same string of balanced parentheses; the second tree uses an extra P → P P step with a P → ε child]
Unambiguous grammar
S → P S       (takes care of “concatenation”)
| ε
P → ( S )     (takes care of nesting)
[Figure: the single parse tree for the same string under the unambiguous grammar]
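A recognizer for the unambiguous grammar can be written with one function per non-terminal, which also shows that a single token of lookahead is enough to choose between S → P S and S → ε. This is a minimal sketch; `is_balanced` and the nested functions `S` and `P` are names chosen for the illustration.

```python
def is_balanced(text):
    """Recognize the language of S -> P S | ε, P -> ( S )."""
    pos = 0

    def S():
        nonlocal pos
        # S -> P S if the next character is "(", otherwise S -> ε.
        if pos < len(text) and text[pos] == "(":
            P()
            S()

    def P():
        nonlocal pos
        pos += 1                      # match "(" (already checked by S)
        S()
        if pos >= len(text) or text[pos] != ")":
            raise SyntaxError("expected ')'")
        pos += 1                      # match ")"

    try:
        S()
    except SyntaxError:
        return False
    # Accept only if the whole input was consumed.
    return pos == len(text)


for sample in ["", "()()", "(()())", "(()", ")("]:
    print(repr(sample), is_balanced(sample))
```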
“dangling-else” example
Ambiguous grammar
S → if E then S
| if E then S else S
| other
Input: if E1 then if E2 then S1 else S2
Two parse trees:
– if E1 then (if E2 then S1 else S2)
  This is what we usually want: match else to closest unmatched then
– if E1 then (if E2 then S1) else S2
Unambiguous grammar
S → M | U
M → if E then M else M     (matched statements)
| other
U → if E then S            (unmatched statements)
| if E then M else U
p. 174
Parsing strategies
Broad kinds of parsers
• Parsers for arbitrary grammars
– Cocke-Younger-Kasami [’65] method, O(n³)
– Earley’s method (implemented by NLTK)
– Not commonly used by compilers
• Parsers for restricted classes of grammars
– Top-down
  • With/without backtracking
– Bottom-up
Top-down parsing
• Constructs parse tree in a top-down manner
• Preorder tree traversal
• Find the leftmost derivation
• Predictive: for every non-terminal and k tokens of lookahead, predict the next production: LL(k)
• Challenge: beginning with the start symbol, try to guess the productions to apply to end up at the user's program
By Fidelio (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Top-down parsing example
Unambiguous grammar
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )
Input: 1 + 2 * 3
We need the rule E → E * T to match the * in the input
[Figure: animation frames that build the parse tree from the root down, starting with E → E * T at the root and expanding non-terminals until the leaves 1, 2 and 3 are reached]
Bottom-up parsing
• Construct parse tree in a bottom-up manner
• Find the rightmost derivation in reverse order
• For every potential right-hand side and k tokens of lookahead, decide when a production is found: LR(k)
• Postorder tree traversal
• Challenge: beginning with the user's program, try to apply productions in reverse to convert the program back into the start symbol
Bottom-up parsing example
Unambiguous grammar
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )
Input: 1 + 2 * 3
[Figure: animation frames that build the parse tree from the leaves up, one reduction per frame, starting with F nodes over the leaves and ending with E at the root]
Top-down parsing via recursive descent
By Vahram Mekhitarian (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Challenges in top-down parsing
• Top-down parsing begins with virtually no information
– Begins with just the start symbol, which matches every program
• How can we know which productions to apply?
• In general, we can’t
– There are some grammars for which the best we can do is guess and backtrack if we're wrong
• If we have to guess, how do we do it?
– Parsing as a search algorithm
– Too expensive in theory (exponential worst-case time) and practice
Predictive parsing
• Given a grammar G and a word ω, attempt to derive ω using G
• Idea
– Apply production to leftmost nonterminal
– Pick production rule based on next input token
• General grammar
– More than one option for choosing the next production based on a token
• Restricted grammars (LL)
– Know exactly which single rule to apply
– May require some lookahead to decide
Boolean expressions example
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
Input: not ( not true or false )
Derivation (production to apply known from next token):
E
=> not E
=> not ( E OP E )
=> not ( not E OP E )
=> not ( not LIT OP E )
=> not ( not true OP E )
=> not ( not true or E )
=> not ( not true or LIT )
=> not ( not true or false )
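One way to make "production to apply known from next token" concrete is a lookup table from (nonterminal, next token) to a right-hand side, driven by an explicit stack. The sketch below uses this table-driven style for the Boolean grammar; the table `TABLE`, the end marker `"$"` and the function `predictive_parse` are assumptions of this example and not the lecture's code.

```python
# (nonterminal, next token) -> right-hand side to apply.
TABLE = {
    ("E", "true"): ["LIT"],
    ("E", "false"): ["LIT"],
    ("E", "("): ["(", "E", "OP", "E", ")"],
    ("E", "not"): ["not", "E"],
    ("LIT", "true"): ["true"],
    ("LIT", "false"): ["false"],
    ("OP", "and"): ["and"],
    ("OP", "or"): ["or"],
    ("OP", "xor"): ["xor"],
}
NONTERMINALS = {"E", "LIT", "OP"}


def predictive_parse(tokens):
    """Return the productions of the leftmost derivation, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]     # "$" marks the end of input (our convention)
    stack = ["$", "E"]                # top of the stack is the end of the list
    pos = 0
    used = []
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, tok))
            if rhs is None:
                raise SyntaxError(f"unexpected {tok!r} while expanding {top}")
            used.append((top, rhs))
            # Push the right-hand side so its first symbol ends up on top.
            stack.extend(reversed(rhs))
        elif top == tok:
            pos += 1                  # terminal matches the input: consume it
        else:
            raise SyntaxError(f"expected {top!r}, found {tok!r}")
    return used


for lhs, rhs in predictive_parse("not ( not true or false )".split()):
    print(lhs, "->", " ".join(rhs))
```

Each table entry is unique for this grammar, which is what lets a single token of lookahead determine the production.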
Recursive descent parsing
• Define a function for every nonterminal
• Every function works as follows
– Find applicable production rule
– For each terminal in the rule, check that it matches the next input token (report an error if it does not)
– For each nonterminal in the rule, call (recursively) the function of that nonterminal
• If there are several applicable productions for a nonterminal, use lookahead
Matching tokens
• Variable current holds the current input token
match(token t) {
  if (current == t)
    current = next_token();
  else
    error;
}
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
Functions for nonterminals
E() {
  if (current ∈ {TRUE, FALSE}) {      // E → LIT
    LIT();
  } else if (current == LPAREN) {     // E → ( E OP E )
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {        // E → not E
    match(NOT); E();
  } else {
    error;
  }
}
LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
Implementation via recursion
E → LIT
| ( E OP E )
| not E
LIT → true
| false
OP → and
| or
| xor
E() {
  if (current ∈ {TRUE, FALSE}) { LIT(); }
  else if (current == LPAREN) { match(LPAREN); E(); OP(); E(); match(RPAREN); }
  else if (current == NOT) { match(NOT); E(); }
  else error;
}
LIT() {
if (current == TRUE) match(TRUE);
else if (current == FALSE) match(FALSE);
else error;
}
OP() {
if (current == AND) match(AND);
else if (current == OR) match(OR);
else if (current == XOR) match(XOR);
else error;
}
Adding semantic actions
• Can add an action to perform on each production rule
• Can build the parse tree
– Every function returns an object of type Node
– Every Node maintains a list of children
– Function calls can add new children
Building the parse tree
Node E() {
  Node result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {      // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {     // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {        // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else {
    error;
  }
  return result;
}
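For readers who want to run the idea, here is a sketch of the same parser in Python: one method per nonterminal, each returning a node whose children are added as the production is matched. Tokens are plain strings, `"EOF"` is an end marker we add ourselves, and the names `Parser`, `parse` and `leaves` are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["EOF"]   # end marker (our convention)
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume the expected token and return a leaf node for it."""
        if self.current != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.current!r}")
        self.pos += 1
        return Node(expected)

    def E(self):
        result = Node("E")
        if self.current in ("true", "false"):          # E -> LIT
            result.add_child(self.LIT())
        elif self.current == "(":                      # E -> ( E OP E )
            result.add_child(self.match("("))
            result.add_child(self.E())
            result.add_child(self.OP())
            result.add_child(self.E())
            result.add_child(self.match(")"))
        elif self.current == "not":                    # E -> not E
            result.add_child(self.match("not"))
            result.add_child(self.E())
        else:
            raise SyntaxError(f"unexpected {self.current!r}")
        return result

    def LIT(self):
        return self._one_of("LIT", ("true", "false"))

    def OP(self):
        return self._one_of("OP", ("and", "or", "xor"))

    def _one_of(self, name, choices):
        if self.current not in choices:
            raise SyntaxError(f"unexpected {self.current!r} in {name}")
        result = Node(name)
        result.add_child(self.match(self.current))
        return result


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.E()
    parser.match("EOF")          # reject trailing input
    return tree


def leaves(node):
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaves(child)]


tree = parse("not ( not true or false )".split())
print(" ".join(leaves(tree)))
```

The yield of the returned tree is the original token sequence, a quick sanity check that no token was dropped or duplicated.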
Recursive descent
• How do you pick the right A-production?
• Generally – try them all and use backtracking
• In our case – use lookahead
void A() {
  choose an A-production, A → X1 X2 … Xk;
  for (i = 1; i ≤ k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current terminal)
      advance input;
    else
      report error;
  }
}
Technical challenges with recursive descent
Recursive descent: problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• With lookahead 1, the function for indexed_elem will never be tried…
– What happens for input of the form ID [ expr ]?
Recursive descent: problem 2
S → A a b
A → a | ε
int S() {
  return A() && match(token('a')) && match(token('b'));
}
int A() {
  return match(token('a')) || 1;
}
What happens for input "ab"?
What happens if you flip the order of the alternatives and try "aab"?
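The two questions can be answered by running a small Python transcription of these procedures. It tries the alternatives of A in a given order and commits to the first one that succeeds, with no backtracking, exactly as the C-like code does. The function `recognize`, its `a_alternatives` parameter and the final end-of-input check are additions made for this sketch.

```python
def recognize(text, a_alternatives):
    """Recognizer for S -> A a b, where A's alternatives are tried in order.

    An empty string in a_alternatives stands for ε.
    """
    pos = 0

    def match(ch):
        nonlocal pos
        if pos < len(text) and text[pos] == ch:
            pos += 1
            return True
        return False

    def A():
        # Commit to the first alternative that succeeds; never revisit it.
        for alt in a_alternatives:
            if alt == "" or match(alt):
                return True
        return False

    def S():
        return A() and match("a") and match("b") and pos == len(text)

    return S()


# Original order, A -> a | ε
print(recognize("ab", ["a", ""]))    # A consumes the only "a"
print(recognize("aab", ["a", ""]))
# Flipped order, A -> ε | a
print(recognize("ab", ["", "a"]))
print(recognize("aab", ["", "a"]))   # A always picks ε
```

Both "ab" and "aab" are in the language, yet each ordering rejects one of them: a committed choice without lookahead or backtracking cannot be right for both inputs.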
Recursive descent: problem 3
E → E - term | term
int E() {
  return E() && match(token('-')) && term();
}
What happens with this procedure?
Recursive descent parsers cannot handle left-recursive grammars
p. 127
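To see what happens, the sketch below transcribes the left-recursive rule directly into Python. Here `term` is treated as a single token to keep the example small, and the names `parse_E`, `parse_term` and `overflows_stack` are our own. The function for E calls itself before consuming any input, so the recursion never makes progress and Python eventually raises `RecursionError`.

```python
def parse_term(tokens, pos):
    """Return the position after a term token, or None if there is none."""
    if pos < len(tokens) and tokens[pos] == "term":
        return pos + 1
    return None


def parse_E(tokens, pos):
    """Direct transcription of E -> E - term | term, trying E - term first."""
    # The recursive call happens at the same input position: nothing has
    # been consumed yet, so every call repeats the previous one.
    after_e = parse_E(tokens, pos)
    if after_e is not None and after_e < len(tokens) and tokens[after_e] == "-":
        return parse_term(tokens, after_e + 1)
    return parse_term(tokens, pos)


def overflows_stack(tokens):
    try:
        parse_E(tokens, 0)
    except RecursionError:
        return True
    return False


print(overflows_stack(["term", "-", "term"]))
```

The outcome does not depend on the input at all, since no token is inspected before the recursive call.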
Indirect left recursion
E → F - term | term
F → E
int E() {
  return F() && match(token('-')) && term();
}
int F() {
  return E();
}
A grammar is left-recursive if it allows a derivation sequence of the form N ⇒+ Nα (one or more steps lead from N back to a sentential form that starts with N)
Example: E ⇒ F - term ⇒ E - term
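Left recursion, direct or indirect, can be detected by following the leftmost symbol of each production. The sketch below assumes a grammar without ε-productions, so only the first symbol of every right-hand side matters, and treats any symbol without productions (such as `term` here) as a terminal. The function name `left_recursive_nonterminals` and the dictionary representation are assumptions of this example.

```python
def left_recursive_nonterminals(grammar):
    """Return the non-terminals N for which N =>+ N alpha holds.

    grammar maps each non-terminal to a list of right-hand sides.
    Assumes there are no ε-productions.
    """
    # leftmost[N]: non-terminals that can appear first after one step from N.
    leftmost = {
        n: {rhs[0] for rhs in alternatives if rhs and rhs[0] in grammar}
        for n, alternatives in grammar.items()
    }
    result = set()
    for n in grammar:
        seen = set()
        worklist = list(leftmost[n])
        while worklist:
            m = worklist.pop()
            if m not in seen:
                seen.add(m)
                worklist.extend(leftmost[m])
        if n in seen:               # N reaches itself in leftmost position
            result.add(n)
    return result


DIRECT = {"E": [["E", "-", "term"], ["term"]]}
INDIRECT = {"E": [["F", "-", "term"], ["term"]], "F": [["E"]]}
RIGHT_RECURSIVE = {"E": [["term", "-", "E"], ["term"]]}

print(left_recursive_nonterminals(DIRECT))
print(left_recursive_nonterminals(INDIRECT))
print(left_recursive_nonterminals(RIGHT_RECURSIVE))
```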
Next lecture: more on top-down parsing