Chap. 4, Formal Grammars and Parsing J. H. Wang Oct. 19, 2015.
-
Upload
baldwin-anderson -
Category
Documents
-
view
219 -
download
0
Transcript of Chap. 4, Formal Grammars and Parsing J. H. Wang Oct. 19, 2015.
Outline
• Introduction• Context-Free Grammars• Properties of CFGs• Transforming Extended Grammars• Parsers and Recognizers• Grammar Analysis Algorithms
Introduction
• A natural language’s grammar: to capture a small but important aspect of a sentence’s validity with respect to a natural language
• Regular sets: guiding the actions of automatically constructed scanner– Chap. 3
• Grammar: guiding the actions of the parsers– Chap. 5, 6
• Semantic analysis: enforcing programming language rules that are not easily expressed by grammars– Chap. 7, 8, 9
The Role of the Parser
LexicalAnalyzer
LexicalAnalyzer ParserParser
Symbol Table
source program
Parse treetoken
Get next token
Rest of Front End
Rest of Front End
Intermediate
representation
Context-Free Grammars
• Components: G=(N,,P,S)– A finite terminal alphabet : the set of
tokens produced by the scanner– A finite nonterminal alphabet N: variables of
the grammar– A start symbol S: SN that initiates all
derivations• Goal symbol
– A finite set of productions P: AX1…Xm, where AN, XiN, 1≤i≤m and m≥0.
• Rewriting rules
• Vocabulary V=N– N=
• CFG: recipe for creating strings• Derivation: a rewriting step using the
production A replaces the nonterminal A with the vocabulary symbols in – Left-hand side (LHS): A– Right-hand side (RHS):
• Context-free language of grammar G L(G): the set of terminal strings derivable from S
Notations
Names Beginning with
Represent Symbols In
Examples
Uppercase N A, B, C, Prefix
Lowercase and punctuation
a, b, c, if, then, (, ;
X, Y N Xi, Y3
Other Greek letters
(N)* , ,
• Or notation:– A
| … |
– AA …A
A=>: one step of derivation using the production A– =>+: derives in one or more steps– =>*: derives in zero or more
steps
• S=>*: is a sentential form of the CFG
• SF(G): the set of sentential forms of G
• L(G)={w*|S=>+w}– L(G)=SF(G)*
• Two conventions that nonterminals are rewritten in some systematic order– Leftmost derivation: from left to right– Rightmost derivation: from right to left
Leftmost Derivation
• A derivation that always chooses the leftmost possible nonterminal at each step– =>lm, =>+
lm, =>*lm
– A left sentential form• A sentential form produced via a leftmost
derivation• E.g. production sequence in top-down
parsers• (Fig. 4.1)
• E.g: a leftmost derivation of f ( v + v )– E =>lm Prefix ( E )
=>lm f ( E ) =>lm f ( v Tail ) =>lm f ( v + E ) =>lm f ( v + v Tail ) =>lm f ( v + v )
Rightmost Derivations
• The rightmost possible nonterminal is always expanded– Canonical derivation
– =>rm, =>+rm, =>*rm
– A right sentential form• A sentential form produced via a rightmost
derivation• E.g. produced by bottom-up parsers (Ch. 6)• (Fig. 4.1)
• E.g: a rightmost derivation of f ( v + v )– E =>rm Prefix ( E )
=>rm Prefix ( v Tail ) =>rm Prefix ( v + E ) =>rm Prefix ( v + v Tail ) =>rm Prefix ( v + v ) =>rm f ( v + v )
Parse Trees
• Parse tree: graphical representation of a derivation– Root: start symbol S– Each node: either grammar symbol or λ– Interior nodes: nonterminals
• An interior node and its children: production
– E.g. Fig. 4.2
• Phrase of the sentential form: a sequence of symbols descended from a single nonterminal in the parse tree
• A simple or prime phrase: a phrase that contains no smaller phrase
• Handle of a sentential form: the leftmost simple phrase
• E.g. f ( v Tail ) in Fig. 4.2
Other Types of Grammars
• Regular grammars: less powerful• Context-sensitive and unrestricted
grammars: more powerful
Regular Grammars
• A CFG that is limited to productions of the form AaB or Cd– RHS: either a symbol from {λ}
followed by a nonterminal symbol, or a symbol from {λ}
– Regular set• E.g. {[i]i|i>=1} not regular
– S TT [ T ] | λ
• Regular sets are a proper subset of the context-free languages
Beyond Context-Free Grammars
• Context-sensitive grammar: nonterminals are rewritten only when they appear in a particular context (A), provided the rule never causes the sentential form to contract in length
• Unrestricted grammar (type-0 grammar): the most general
• More powerful, but less useful– Efficient parsers for such grammars do
not exist– It’s difficult to prove properties about
such grammars
• CFGs: a nice balance between generality and practicability
Properties of CFGs
• Some grammars might have problems:– Include useless symbols– Allow multiple, distinct derivations for
some input string– Include strings not in the language, or
exclude strings in the language
Reduced Grammars
• Each of its nonterminals and productions participates in the derivation of some string– Useless nonterminals: can be safely removed– E.g.
• SA | BAaBB bCc
– Algorithms to detect useless nonterminals• Ex.16 and Ex.19
Ambiguity
• Allow a derived string to have two or more different parse trees– E.g.
• Expr Expr – Expr | id
• Two different parse trees for id – id – id– Fig. 4.3
– No algorithm for checking an arbitrary CFG for ambiguity• Undecidable
Faulty Language Definition
• Terminal strings derivable by the grammar do not correspond exactly to the strings in the language
• Determining in general whether two CFGs generate the same language is an undecidable problem
Transforming Extended Grammars
• BNF (Backus-Naur form)– Optional symbols: enclosed in square brackets
• A [X1…Xn]
– Repeated symbols: enclosed in braces• B {X1…Xm}
– E.g. Java-like declaration• Declaration [final][static][const] Type identifier {,
identifier }
– Transforming extended BNF grammars into standard form
• Fig. 4.4
Parsers and Recognizers
• Recognizer: to determine if input string x L(G)
• Parser: to determine the string’s validity and structure (parse tree)– Top-down: starting at the root, expanding
the tree in a depth-first manner• Preorder traversal, predictive
– Bottom-up: starting at the leaves• Postorder traversal
• E.g. grammar– Program begin Stmts end $
Stmts Stmt; Stmts | λStmt simplestmt | begin Stmts end
– String: begin simplestmt; simplestmt; end $• Top-down parse: Fig. 4.5• Bottom-up parse: Fig. 4.6
• Parsing techniques– E.g. LL(1), LR(1) are the best-known top-
down and bottom-up parsing strategies
• L: token sequence is processed from left to right
• L,R: Leftmost or Rightmost parse• 1: the number of lookahead symbols
Grammar Analysis Algorithms
• Grammar representation– Programming language constructs assumed:
• A set: an unordered collection of distinct entities• A list: an ordered collection of entities• An iterator: a construct that enumerates the contents
of a set or list– Observations
• Symbols are rarely deleted from a grammar• Transformations can add symbols and productions to a
grammar• Typically visit all rules for a nonterminal, or visit all
occurrences of a symbol in productions• A production’s RHS processed one symbol at a time
– -> a production is represented by LHS and RHS symbols
Grammar Utilities• Creating or adding:
– Grammar(S)– Production(A, rhs)– Nonterminal(A)– Terminal(x)
• Iterators:– Productions()– Noterminals()– Terminals()– RHS(p)– LHS(p)– ProductionsFor(A)– Occurrences(X)– Tail(y)
• Others– IsTerminal(X)– Production(y)
Deriving the Empty String
• It’s common to determine which nonterminals can derive λ– Not trivial because the derivation can
take more than one step• A=>BCD=>BC=>B=> λ
– Fig. 4.7
ERIVES MPTY TRING
ON ERMINALS
RODUCTIONS
RODUCTION
HECK OR MPTY
HECK OR MPTY
HECK OR MPTY
CCURRENCES
• The algorithm establishes two structures– RuleDerivesEmpty(p)– SymbolDerivesEmpty(A)– Useful in grammar analysis and parsing
algorithms in Chap.4, 5, & 6
First Sets
• The set of all terminal symbols that can begin a sentential form derivable from the string – First()={ a| =>*a }– We never include λ in First() even if
=>*λ– E.g. (in Fig.4.1)
• First(Tail) = {+}• First(Prefix) = {f}• First(E) = {v, f, (}
– Fig.4.8, Fig. 4.9, Fig. 4.10
Follow Sets
• The set of terminals that can follow a nonterminal A in some sentential form– For AN,
• Follow(A) = {b|S=>+Ab}
– The right context associated with A– Fig. 4.11
NTERNAL OLLOW
NTERNAL OLLOW
NTERNAL OLLOW
OLLOW
CCURRENCES
IRST AIL
LL ERIVE MPTY
LL ERIVE MPTY
RODUCTION
ON ERMINALS
AIL
• First and Follow sets can be generalized to include strings of length k– Firstk(), Followk(A)
– Useful in parsing techniques that use k-symbol lookaheads (e.g. LL(k), LR(k))
More on FIRST and FOLLOW[Aho, Lam, Sethi, Ullman]
• Two functions FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol
• FIRST(): the set of terminals that begin strings derived from – Ex: A=>* c, c is in FIRST(A)
• FOLLOW(A): the set of terminals a that can appear immediately to the right of A in some sentential form– Ex: S =>* Aa
• To compute FIRST(X) for all grammar symbols X– If X is a terminal, FIRST(X)={X}
– If X is a nonterminal and XY1Y2…Yk, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and Y1…Yi-1=>* λ
– (If Xλ is a production, add λ to FIRST(X))– (NOTE: In [Fischer 2009], never including
λ in First(X) even if X =>*λ)
• To compute FOLLOW(A) for all nonterminals A– (Place $ in FOLLOW(S))
• (Note: $ not needed in [Fischer 2009])
– If there’s a production AB, then everything in FIRST() except λ is in FOLLOW(B)
– If there’s a production AB, or AB, where FIRST() contains λ, then everything in FOLLOW(A) is in FOLLOW(B)
Example
• Ex: (4.28)– E T E’
E’ + T E’ | λT F T’T’ * F T’ | λF (E) | id
– FIRST(F)=FIRST(T)=FISRT(E)={(,id}– FIRST(E’)={+}– FIRST(T’)={*}– FOLLOW(E)=FOLLOW(E’)={)}– FOLLOW(T)=FOLLOW(T’)={+,)}– FOLLOW(F)={+,*,)}