Context-Free Grammars
Source: web.cse.ohio-state.edu/.../extras/slides/25.Context-Free-Grammars.pdf
BL Compiler Structure
24 October 2013 OSU CSE 2
string of characters (source code) → Tokenizer → string of tokens (“words”) → Parser → abstract program → Code Generator → string of integers (object code)
The parser is arguably the most interesting, and most difficult,
piece of the BL compiler.
Plan for the BL Parser
• Design a context-free grammar (CFG) to specify syntactically valid BL programs
• Use the grammar to implement a recursive-descent parser (i.e., an algorithm to parse a BL program and construct the corresponding Program object)
A grammar is a set of formation rules for strings in
a language.
A grammar is context-free if it satisfies certain technical conditions described herein.
Languages
• A language is a set of strings over some alphabet Σ
• If L is a language, then mathematically it is a set of strings over Σ
Aside: Characters vs. Tokens
• In the following examples of CFGs, we deal with languages over the alphabet of individual characters (e.g., Java’s char values): Σ = character
• In the BL project, we deal with languages over an alphabet of tokens (to be explained later)
Example: Real-Number Constants
• Some syntactically valid real-number constants (i.e., some strings in the “language of valid real-number constants”):
  37.044
  615.22E16
  99241.
  18.E-93
CFG Rewrite Rules
real-const → digit-seq . digit-seq |
             digit-seq . digit-seq exponent |
             digit-seq . |
             digit-seq . exponent
exponent → E digit-seq |
           E + digit-seq |
           E - digit-seq
digit-seq → digit digit-seq |
            digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Each entry above is a rewrite rule (a replacement rule), which describes how strings in the language may be formed.
A name on the left of a rewrite rule is called a non-terminal symbol.
The special CFG symbol → means “can be rewritten as” or “can be replaced by”.
The special CFG symbol | means “or”, i.e., there are multiple possible “rewrites” for the same non-terminal.
So the single real-const rule written with | means exactly the same thing as these four separate rewrite rules:
real-const → digit-seq . digit-seq
real-const → digit-seq . digit-seq exponent
real-const → digit-seq .
real-const → digit-seq . exponent
(The rules for exponent, digit-seq, and digit are unchanged.)
One non-terminal symbol (normally in the first rewrite rule) is called the start symbol.
A symbol from the alphabet on the right-hand side of a rewrite rule is called a terminal symbol.
To remember the name: terminal symbols are what you end up with when generating strings in the language (see below).
Four Components of a CFG
• Non-terminal symbols for this CFG:
  – real-const, exponent, digit-seq, digit
• Terminal symbols for this CFG:
  – ., E, +, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Start symbol for this CFG:
  – real-const
• Rewrite rules for this CFG:
  – (see previous slides)
Derivations
• A derivation of a string of terminal symbols consists of a sequence of specific rewrite-rule applications that begin with the start symbol and continue until only terminal symbols remain
  – A string is in the language of the CFG iff there is a derivation that leads to it
• The symbol ⇒ indicates a derivation step, i.e., a specific rewrite-rule application
Example: Derivation of 5.6E10
• Begin with the start symbol:
  real-const
• ... and pick one possible rewrite:
  real-const → digit-seq . digit-seq |
               digit-seq . digit-seq exponent |
               digit-seq . |
               digit-seq . exponent
Which rewrite is appropriate
to derive 5.6E10?
Example: Derivation of 5.6E10
• This is the first step of the derivation:
  real-const ⇒ digit-seq . digit-seq exponent
Example: Derivation of 5.6E10
• Choose a non-terminal to rewrite:
  real-const ⇒ digit-seq . digit-seq exponent
• ... and pick one possible rewrite:
  digit-seq → digit digit-seq |
              digit
Which rewrite is appropriate
to derive 5.6E10?
Example: Derivation of 5.6E10
• This is the second step of the derivation:
  real-const ⇒ digit-seq . digit-seq exponent
             ⇒ digit . digit-seq exponent
Example: Derivation of 5.6E10
• Choose a non-terminal to rewrite:
  real-const ⇒ digit-seq . digit-seq exponent
             ⇒ digit . digit-seq exponent
• ... and pick one possible rewrite:
  digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example: Derivation of 5.6E10
• This is the third step of the derivation:
  real-const ⇒ digit-seq . digit-seq exponent
             ⇒ digit . digit-seq exponent
             ⇒ 5 . digit-seq exponent
Example: Derivation of 5.6E10
• Choose a non-terminal to rewrite:
  real-const ⇒ digit-seq . digit-seq exponent
             ⇒ digit . digit-seq exponent
             ⇒ 5 . digit-seq exponent
• ... and pick one possible rewrite:
  digit-seq → digit digit-seq |
              digit
One Derivation of 5.6E10
real-const ⇒ digit-seq . digit-seq exponent
           ⇒ digit . digit-seq exponent
           ⇒ 5 . digit-seq exponent
           ⇒ 5 . digit exponent
           ⇒ 5 . 6 exponent
           ⇒ 5 . 6 E digit-seq
           ⇒ 5 . 6 E digit digit-seq
           ⇒ 5 . 6 E 1 digit-seq
           ⇒ 5 . 6 E 1 digit
           ⇒ 5 . 6 E 1 0
Note that a derivation is used in this way to
generate a string in the language of the CFG.
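To illustrate derivations as a way of generating strings, here is a small sketch that performs a random derivation: it repeatedly rewrites the leftmost non-terminal with a randomly chosen alternative until only terminals remain. The names `GRAMMAR` and `generate` are hypothetical, and always rewriting the leftmost non-terminal is an arbitrary choice made for simplicity.

```python
import random

# Hypothetical encoding of the real-number-constant CFG:
# non-terminal -> list of alternatives, each a list of symbols.
GRAMMAR = {
    "real-const": [
        ["digit-seq", ".", "digit-seq"],
        ["digit-seq", ".", "digit-seq", "exponent"],
        ["digit-seq", "."],
        ["digit-seq", ".", "exponent"],
    ],
    "exponent": [
        ["E", "digit-seq"],
        ["E", "+", "digit-seq"],
        ["E", "-", "digit-seq"],
    ],
    "digit-seq": [["digit", "digit-seq"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}


def generate(rng, start="real-const"):
    """Derive one string by random rewrites of the leftmost non-terminal."""
    form = [start]
    while True:
        # Find the leftmost symbol that still has rewrite rules.
        index = next((i for i, s in enumerate(form) if s in GRAMMAR), None)
        if index is None:
            return "".join(form)  # only terminal symbols remain
        form[index:index + 1] = rng.choice(GRAMMAR[form[index]])


rng = random.Random(2013)  # fixed seed so runs are repeatable
for _ in range(5):
    print(generate(rng))
```

Every string printed this way is in the language of the CFG, because it was produced by a derivation from the start symbol.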
Another Derivation of 5.6E10
real-const ⇒ digit-seq . digit-seq exponent
           ⇒ digit-seq . digit-seq E digit-seq
           ⇒ digit-seq . digit-seq E digit digit-seq
           ⇒ digit-seq . digit-seq E digit digit
           ⇒ digit-seq . digit-seq E digit 0
           ⇒ digit-seq . digit-seq E 1 0
           ⇒ digit-seq . digit E 1 0
           ⇒ digit-seq . 6 E 1 0
           ⇒ digit . 6 E 1 0
           ⇒ 5 . 6 E 1 0
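The two derivations of 5.6E10 can be replayed mechanically. In this sketch (the names `GRAMMAR`, `apply_step`, `derive`, `FIRST`, and `SECOND` are hypothetical), each derivation step is recorded as the position of the non-terminal to rewrite plus the chosen right-hand side, and every step is checked against the rewrite rules before it is applied.

```python
# Hypothetical encoding of the real-number-constant CFG.
GRAMMAR = {
    "real-const": [
        ["digit-seq", ".", "digit-seq"],
        ["digit-seq", ".", "digit-seq", "exponent"],
        ["digit-seq", "."],
        ["digit-seq", ".", "exponent"],
    ],
    "exponent": [
        ["E", "digit-seq"],
        ["E", "+", "digit-seq"],
        ["E", "-", "digit-seq"],
    ],
    "digit-seq": [["digit", "digit-seq"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}


def apply_step(form, index, replacement):
    """One derivation step: rewrite the non-terminal at `index`."""
    if replacement not in GRAMMAR.get(form[index], []):
        raise ValueError(f"no rule {form[index]} -> {' '.join(replacement)}")
    return form[:index] + replacement + form[index + 1:]


def derive(choices, start="real-const"):
    """Apply the steps in order and return the resulting string."""
    form = [start]
    for index, replacement in choices:
        form = apply_step(form, index, replacement)
    return "".join(form)


# The first derivation: always rewrites the leftmost non-terminal.
FIRST = [
    (0, ["digit-seq", ".", "digit-seq", "exponent"]),
    (0, ["digit"]), (0, ["5"]),
    (2, ["digit"]), (2, ["6"]),
    (3, ["E", "digit-seq"]),
    (4, ["digit", "digit-seq"]), (4, ["1"]),
    (5, ["digit"]), (5, ["0"]),
]

# The second derivation: always rewrites the rightmost non-terminal.
SECOND = [
    (0, ["digit-seq", ".", "digit-seq", "exponent"]),
    (3, ["E", "digit-seq"]),
    (4, ["digit", "digit-seq"]),
    (5, ["digit"]), (5, ["0"]), (4, ["1"]),
    (2, ["digit"]), (2, ["6"]),
    (0, ["digit"]), (0, ["5"]),
]

print(derive(FIRST), derive(SECOND))
```

Both sequences use the same ten rule applications and differ only in the order in which they are applied.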
Derivation Trees
• A derivation tree depicts a derivation (such as those above) in a tree
• Note that the order in which rewrites are done is sometimes arbitrary
  – A tree captures the required temporal order of rewrites from top-to-bottom
  – A tree captures the required spatial order among terminal symbols from left-to-right
A Derivation Tree for 5.6E10
(Tree diagram, shown here as an indented outline; the children of each node are listed left to right.)
real-const
  digit-seq
    digit
      5
  .
  digit-seq
    digit
      6
  exponent
    E
    digit-seq
      digit
        1
      digit-seq
        digit
          0
This tree captures both derivations previously illustrated (and all others) for 5.6E10.
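A derivation tree is easy to hold as a nested data structure. The sketch below is an illustration only: the `node` helper and the `(symbol, children)` tuple layout are assumptions, not something the slides prescribe. Reading the leaves left to right gives back the derived string, and each internal node corresponds to one rewrite-rule application.

```python
def node(symbol, *children):
    """A tree node is (symbol, [children]); a terminal has no children."""
    return (symbol, list(children))


# The derivation tree for 5.6E10, children listed left to right.
TREE = node(
    "real-const",
    node("digit-seq", node("digit", node("5"))),
    node("."),
    node("digit-seq", node("digit", node("6"))),
    node(
        "exponent",
        node("E"),
        node(
            "digit-seq",
            node("digit", node("1")),
            node("digit-seq", node("digit", node("0"))),
        ),
    ),
)


def leaves(tree):
    """Terminal symbols in left-to-right (spatial) order."""
    symbol, children = tree
    if not children:
        return [symbol]
    return [leaf for child in children for leaf in leaves(child)]


def rewrite_count(tree):
    """Number of internal nodes, i.e., rewrite-rule applications."""
    _, children = tree
    if not children:
        return 0
    return 1 + sum(rewrite_count(child) for child in children)


print("".join(leaves(TREE)), rewrite_count(TREE))
```

The count of internal nodes matches the number of ⇒ steps in any derivation of this string, whatever order the rewrites are done in.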
Other Examples
• Can you find a derivation tree for 5.E3?
  – If so, it’s in the language of the CFG; otherwise it’s not in that language
• Can you find a derivation tree for .6E10?
  – If so, it’s in the language of the CFG; otherwise it’s not in that language
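One way to check answers to questions like these is a small recognizer written directly from the rewrite rules. The following is a sketch under assumptions: the function names are hypothetical, and the recursion in digit-seq is replaced by a loop, which accepts the same digit sequences. It only answers yes or no and does not build a tree.

```python
DIGITS = "0123456789"


def skip_digit_seq(text, start):
    """Index just past a digit-seq beginning at `start`, or None if absent."""
    end = start
    while end < len(text) and text[end] in DIGITS:
        end += 1
    return end if end > start else None


def skip_exponent(text, start):
    """Index just past an exponent (E, optional sign, digit-seq), or None."""
    if start >= len(text) or text[start] != "E":
        return None
    pos = start + 1
    if pos < len(text) and text[pos] in "+-":
        pos += 1
    return skip_digit_seq(text, pos)


def is_real_const(text):
    """True iff `text` is in the language of the real-const CFG."""
    pos = skip_digit_seq(text, 0)           # leading digit-seq is required
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False
    pos += 1                                # consume "."
    after_digits = skip_digit_seq(text, pos)
    if after_digits is not None:            # fractional digit-seq is optional
        pos = after_digits
    if pos == len(text):
        return True                         # no exponent
    return skip_exponent(text, pos) == len(text)


for candidate in ["37.044", "615.22E16", "99241.", "18.E-93"]:
    print(candidate, is_real_const(candidate))
```

Calling `is_real_const` on the two strings in the questions above is a quick way to confirm whether a derivation tree exists for each.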
A Famous CFG
expr → expr add-op term | term
term → term mult-op factor | factor
factor → ( expr ) | digit-seq
add-op → + | -
mult-op → * | DIV | REM
digit-seq → digit digit-seq | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
A Derivation Tree for 4+6*2
(Tree diagram, shown here as an indented outline; the children of each node are listed left to right.)
expr
  expr
    term
      factor
        digit-seq
          digit
            4
  add-op
    +
  term
    term
      factor
        digit-seq
          digit
            6
    mult-op
      *
    factor
      digit-seq
        digit
          2
Example: (4+6)*2
• Find a derivation tree for (4+6)*2
• How is it different from the previous one?
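To see how the shape of this grammar drives evaluation, here is a hedged sketch of an evaluator with one function per non-terminal (expr, term, factor). Several things are assumptions rather than anything stated in the slides: the names `tokenize` and `evaluate`, the use of loops in place of the left-recursive rules, the tolerance of spaces between tokens, and the mapping of DIV and REM to Python's `//` and `%`.

```python
import re

# One token: a digit-seq, DIV, REM, or a single operator/parenthesis.
TOKEN = re.compile(r"\s*([0-9]+|DIV|REM|[-+*()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():   # expr -> expr add-op term | term, as a left-to-right loop
        value = term()
        while peek() in ("+", "-"):
            op, right = advance(), term()
            value = value + right if op == "+" else value - right
        return value

    def term():   # term -> term mult-op factor | factor
        value = factor()
        while peek() in ("*", "DIV", "REM"):
            op, right = advance(), factor()
            if op == "*":
                value *= right
            elif op == "DIV":
                value //= right
            else:
                value %= right
        return value

    def factor():  # factor -> ( expr ) | digit-seq
        token = peek()
        if token == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return value
        if token is not None and token.isdigit():
            return int(advance())
        raise ValueError(f"unexpected token {token!r}")

    result = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


print(evaluate("4+6*2"), evaluate("(4+6)*2"))
```

Because mult-op appears only inside term, and term sits below expr, multiplication groups more tightly than addition unless parentheses force a different tree.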
A Simpler CFG for Expressions
expr → expr op expr | ( expr ) | digit-seq
op → + | - | * | DIV | REM
digit-seq → digit digit-seq | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
One Derivation Tree for 4+6*2
(Tree diagram, shown here as an indented outline; the children of each node are listed left to right.)
expr
  expr
    digit-seq
      digit
        4
  op
    +
  expr
    expr
      digit-seq
        digit
          6
    op
      *
    expr
      digit-seq
        digit
          2
Another Derivation Tree for 4+6*2
(Tree diagram, shown here as an indented outline; the children of each node are listed left to right.)
expr
  expr
    expr
      digit-seq
        digit
          4
    op
      +
    expr
      digit-seq
        digit
          6
  op
    *
  expr
    digit-seq
      digit
        2
Ambiguity
• The second (simpler) CFG for arithmetic expressions is ambiguous because some strings in the language of the CFG have more than one derivation tree
• As is often the case, ambiguity is bad
  – If you want to use the derivation tree as the basis for evaluating the expression, only one of the derivation trees shown above results in the right answer (which one?)
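The practical cost of ambiguity shows up as soon as the tree is used for evaluation. In this sketch the two derivation trees for 4+6*2 are abbreviated to nested tuples of the form (left, op, right), with each digit-seq subtree collapsed to an integer; that abbreviation and the names `evaluate`, `TREE_ONE`, and `TREE_TWO` are assumptions made for brevity.

```python
def evaluate(tree):
    """Evaluate an abbreviated derivation tree bottom-up."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "DIV":
        return a // b
    if op == "REM":
        return a % b
    raise ValueError(f"unknown operator {op!r}")


# First tree: the top-level op is +, and 6*2 is a subtree.
TREE_ONE = (4, "+", (6, "*", 2))

# Second tree: the top-level op is *, and 4+6 is a subtree.
TREE_TWO = ((4, "+", 6), "*", 2)

print(evaluate(TREE_ONE), evaluate(TREE_TWO))
```

The same string yields two different values depending on which tree is chosen, which is why a grammar that admits both trees is a poor basis for a parser.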