COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.
-
Upload
estella-burke -
Category
Documents
-
view
223 -
download
1
Transcript of COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.
COP4020 Spring 2014 204/19/23
Overview
Syntax analysis overview Grammar and context-free grammar Grammar derivations Parse trees
Syntax analysis
Syntax analysis is done by the parser. Detects whether the program is written following the grammar
rules and reports syntax errors. Produces a parse tree from which intermediate code can be
generated.
Sourceprogram
Lexicalanalyzer
token
Request for token
parserRest of front end
Symboltable
Int.code
Parsetree
The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)). Similar to the languages specified by regular
expressions, but more general. A grammar gives a precise syntactic specification
of a language. From some classes of grammars, tools exist that
can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.
A compiler based on a grammatical description of a language is more easily maintained and updated.
Grammars
A grammar has four components G=(N, T, P, S):
T is a finite set of tokens (terminal symbols)
N is a finite set of nonterminals
P is a finite set of productions of the form .Where and
S is a special nonterminal that is a designated start symbol
COP4020 Spring 2014 504/19/23
*)(*)( TNNTN *)( TN
Example Grammar for expression (T=?, N=?, P=?, S=?)
Production: E ->E+E
E-> E-E
E-> (E)
E-> -E
E->num
E->id
How does this correspond to a language?Informally, you can expand the non-terminals using the productions
until all are expanded: the ending sentence (a sequence of tokens) is recognized by the grammar.
COP4020 Spring 2014 604/19/23
Language recognized by a grammar We say “aAb derives awb in one step”, denoted as
“aAb=>awb”, if A->w is a production and a and b are arbitrary strings of terminal or nonterminal symbols.
We say a1 derives am if a1=>a2=>…=>am, written as a1=>am
The languages L(G) defined by G are the set of strings of the terminals w such that S=>w.
COP4020 Spring 2014 704/19/23
*
*
Example
A->aA
A->bA
A->a
A->b
G=(N, T, P, S) N=? T=? P=? S=?
What is the language recognized by this grammer?
COP4020 Spring 2014 804/19/23
Chomsky Hierarchy (classification of grammars)
A grammar is said to be regular if it is
right-linear, where each production in P has the form, or . Here, A and B are non-terminals and w is a terminal
or left-linear context-free if each production in P is of the form
, where and context sensitive if each production in P is of the form
where unrestricted if each production in P is of the form
where
All languages recognized by regular expression can be represented by a regular grammar.
wBA wA
ANA *)( TN
||||
A context free grammar has four components G=(N, T, P, S):
T is a finite set of tokens (terminal symbols)
N is a finite set of nonterminals
P is a finite set of productions of the form Where and .
S is a special nonterminal that is a designated start symbol.
Context free grammar is more expressive than regular expression. Consider language
{ab, aabb, aaabbb, …}
ANA *)( TN
COP4020 Spring 2014 1104/19/23
BNF Notation (another form of context free grammar) Backus-Naur Form (BNF) notation for productions:
<nonterminal> ::= sequence of (non)terminals
where Each terminal in the grammar is a token A <nonterminal> defines a syntactic category The symbol | denotes alternative forms in a production The special symbol denotes empty
COP4020 Spring 2014 1204/19/23
Example<Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .<Block> ::= <Variables> begin <Stmt> <More_Stmts> end<More_ids> ::= , <id> <More_ids>
| <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>
| <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
| <Stmt> ::= <id> := <Exp>
| if <Exp> then <Stmt> else <Stmt>| while <Exp> do <Stmt>| begin <Stmt> <More_Stmts> end
<More_Stmts> ::= ; <Stmt> <More_Stmts>|
<Exp> ::= <num>| <id>| <Exp> + <Exp>| <Exp> - <Exp>
COP4020 Spring 2014 1304/19/23
Derivations
From a grammar we can derive strings (= sequences of tokens) The opposite process of parsing
Starting with the grammar’s designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal A sentence (in the language) is a sequence of terminals that can
be derived from the start symbol. A sentential form is a sequence of terminals and nonterminals
that can be derived from the start symbol.
COP4020 Spring 2014 1404/19/23
Example Derivation
<expression> <expression> <operator> <expression> <expression> <operator> identifier <expression> + identifier <expression> <operator> <expression> + identifier <expression> <operator> identifier + identifier <expression> * identifier + identifier identifier * identifier + identifier
<expression> ::= identifier | unsigned_integer | - <expression> | ( <expression> ) | <expression> <operator> <expression><operator> ::= + | - | * | /
Start symbol
Replacement of nonterminal with one of its productions
The final string is the yield
Sen
tent
ial f
orm
s
COP4020 Spring 2014 1504/19/23
Rightmost versus Leftmost Derivations
When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step the derivation is called right-most (left-most)
<expression> <expression> <operator> <expression> <expression> <operator> identifier
Replace in leftmost derivation
Replace in rightmost derivation
<expression> <expression> <operator> <expression> identifier <operator> <expression> Replace in leftmost derivation
Replace in rightmost derivation
COP4020 Spring 2014 1604/19/23
A Language Generated by a Grammar A context-free grammar is a generator of a context-free language The language defined by a grammar G is the set of all strings w that
can be derived from the start symbol S
L(G) = { w | S * w }
<S> ::= a | ‘(’ <S> ‘)’ L(G) = { set of all strings a (a) ((a)) (((a))) … }
<S> ::= <B> | <C><B> ::= <C> + <C><C> ::= 0 | 1
L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
COP4020 Spring 2014 1704/19/23
Parse Trees
A parse tree depicts the end result of a derivation The internal nodes are the nonterminals The children of a node are the symbols (terminals and
nonterminals) on a right-hand side of a production The leaves are the terminals
<expression>
<expression> <operator>
identifier
<operator> <expression><expression>
<expression>
identifieridentifier * +
Parse Trees
COP4020 Spring 2014 1804/19/23
<expression>
<expression> <operator>
identifier
<operator> <expression><expression>
<expression>
identifieridentifier * +
<expression> <expression> <operator> <expression> <expression> <operator> identifier <expression> + identifier <expression> <operator> <expression> + identifier <expression> <operator> identifier + identifier <expression> * identifier + identifier identifier * identifier + identifier
COP4020 Spring 2014 1904/19/23
Ambiguity
There is another parse tree for the same grammar and input: the grammar is ambiguous
This parse tree is not desired, since it appears that + has precedence over *
<expression>
<expression> <operator>
identifier
<operator> <expression><expression>
<expression>
identifieridentifier +*
COP4020 Spring 2014 2004/19/23
Ambiguous Grammars
Ambiguous grammar: more than one distinct derivation of a string results in different parse trees
A programming language construct should have only one parse tree to avoid misinterpretation by a compiler
For expression grammars, associativity and precedence of operators is used to disambiguate
<expression> ::= <term> | <expression> <add_op> <term><term> ::= <factor> | <term> <mult_op> <factor><factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )<add_op> ::= + | -<mult_op> ::= * | /
COP4020 Spring 2014 2104/19/23
Ambiguous if-then-else:the “Dangling Else” A classical example of an ambiguous grammar are the
grammar productions for if-then-else:
<stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>
It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design
Ada uses different syntax to avoid ambiguity:
<stmt> ::= if <expr> then <stmt> end if | if <expr> then <stmt> else <stmt> end if