COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.

21
COP4020 Programming Languages Syntax analysis Prof. Xin Yuan

Transcript of COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.

COP4020Programming Languages

Syntax analysis

Prof. Xin Yuan

COP4020 Spring 2014 204/19/23

Overview

Syntax analysis overview Grammar and context-free grammar Grammar derivations Parse trees

Syntax analysis

Syntax analysis is done by the parser. Detects whether the program is written following the grammar

rules and reports syntax errors. Produces a parse tree from which intermediate code can be

generated.

Sourceprogram

Lexicalanalyzer

token

Request for token

parserRest of front end

Symboltable

Int.code

Parsetree

The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)). Similar to the languages specified by regular

expressions, but more general. A grammar gives a precise syntactic specification

of a language. From some classes of grammars, tools exist that

can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.

A compiler based on a grammatical description of a language is more easily maintained and updated.

Grammars

A grammar has four components G=(N, T, P, S):

T is a finite set of tokens (terminal symbols)

N is a finite set of nonterminals

P is a finite set of productions of the form .Where and

S is a special nonterminal that is a designated start symbol

COP4020 Spring 2014 504/19/23

*)(*)( TNNTN *)( TN

Example Grammar for expression (T=?, N=?, P=?, S=?)

Production: E ->E+E

E-> E-E

E-> (E)

E-> -E

E->num

E->id

How does this correspond to a language?Informally, you can expand the non-terminals using the productions

until all are expanded: the ending sentence (a sequence of tokens) is recognized by the grammar.

COP4020 Spring 2014 604/19/23

Language recognized by a grammar We say “aAb derives awb in one step”, denoted as

“aAb=>awb”, if A->w is a production and a and b are arbitrary strings of terminal or nonterminal symbols.

We say a1 derives am if a1=>a2=>…=>am, written as a1=>am

The languages L(G) defined by G are the set of strings of the terminals w such that S=>w.

COP4020 Spring 2014 704/19/23

*

*

Example

A->aA

A->bA

A->a

A->b

G=(N, T, P, S) N=? T=? P=? S=?

What is the language recognized by this grammer?

COP4020 Spring 2014 804/19/23

Chomsky Hierarchy (classification of grammars)

A grammar is said to be regular if it is

right-linear, where each production in P has the form, or . Here, A and B are non-terminals and w is a terminal

or left-linear context-free if each production in P is of the form

, where and context sensitive if each production in P is of the form

where unrestricted if each production in P is of the form

where

All languages recognized by regular expression can be represented by a regular grammar.

wBA wA

ANA *)( TN

||||

A context free grammar has four components G=(N, T, P, S):

T is a finite set of tokens (terminal symbols)

N is a finite set of nonterminals

P is a finite set of productions of the form Where and .

S is a special nonterminal that is a designated start symbol.

Context free grammar is more expressive than regular expression. Consider language

{ab, aabb, aaabbb, …}

ANA *)( TN

COP4020 Spring 2014 1104/19/23

BNF Notation (another form of context free grammar) Backus-Naur Form (BNF) notation for productions:

<nonterminal> ::= sequence of (non)terminals

where Each terminal in the grammar is a token A <nonterminal> defines a syntactic category The symbol | denotes alternative forms in a production The special symbol denotes empty

COP4020 Spring 2014 1204/19/23

Example<Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .<Block> ::= <Variables> begin <Stmt> <More_Stmts> end<More_ids> ::= , <id> <More_ids>

| <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>

| <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>

| <Stmt> ::= <id> := <Exp>

| if <Exp> then <Stmt> else <Stmt>| while <Exp> do <Stmt>| begin <Stmt> <More_Stmts> end

<More_Stmts> ::= ; <Stmt> <More_Stmts>|

<Exp> ::= <num>| <id>| <Exp> + <Exp>| <Exp> - <Exp>

COP4020 Spring 2014 1304/19/23

Derivations

From a grammar we can derive strings (= sequences of tokens) The opposite process of parsing

Starting with the grammar’s designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal A sentence (in the language) is a sequence of terminals that can

be derived from the start symbol. A sentential form is a sequence of terminals and nonterminals

that can be derived from the start symbol.

COP4020 Spring 2014 1404/19/23

Example Derivation

<expression>  <expression> <operator> <expression>  <expression> <operator> identifier  <expression> + identifier  <expression> <operator> <expression> + identifier  <expression> <operator> identifier + identifier  <expression> * identifier + identifier  identifier * identifier + identifier

<expression> ::= identifier              | unsigned_integer              | - <expression>              | ( <expression> )              | <expression> <operator> <expression><operator> ::= + | - | * | /

Start symbol

Replacement of nonterminal with one of its productions

The final string is the yield

Sen

tent

ial f

orm

s

COP4020 Spring 2014 1504/19/23

Rightmost versus Leftmost Derivations

When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step the derivation is called right-most (left-most)

<expression>  <expression> <operator> <expression>  <expression> <operator> identifier 

Replace in leftmost derivation

Replace in rightmost derivation

<expression>  <expression> <operator> <expression>  identifier <operator> <expression> Replace in leftmost derivation

Replace in rightmost derivation

COP4020 Spring 2014 1604/19/23

A Language Generated by a Grammar A context-free grammar is a generator of a context-free language The language defined by a grammar G is the set of all strings w that

can be derived from the start symbol S

L(G) = { w | S * w }

<S> ::= a | ‘(’ <S> ‘)’ L(G) = { set of all strings a (a) ((a)) (((a))) … }

<S> ::= <B> | <C><B> ::= <C> + <C><C> ::= 0 | 1

L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }

COP4020 Spring 2014 1704/19/23

Parse Trees

A parse tree depicts the end result of a derivation The internal nodes are the nonterminals The children of a node are the symbols (terminals and

nonterminals) on a right-hand side of a production The leaves are the terminals

<expression>

<expression> <operator>

identifier

<operator> <expression><expression>

<expression>

identifieridentifier * +

Parse Trees

COP4020 Spring 2014 1804/19/23

<expression>

<expression> <operator>

identifier

<operator> <expression><expression>

<expression>

identifieridentifier * +

<expression>  <expression> <operator> <expression>  <expression> <operator> identifier  <expression> + identifier  <expression> <operator> <expression> + identifier  <expression> <operator> identifier + identifier  <expression> * identifier + identifier  identifier * identifier + identifier

COP4020 Spring 2014 1904/19/23

Ambiguity

There is another parse tree for the same grammar and input: the grammar is ambiguous

This parse tree is not desired, since it appears that + has precedence over *

<expression>

<expression> <operator>

identifier

<operator> <expression><expression>

<expression>

identifieridentifier +*

COP4020 Spring 2014 2004/19/23

Ambiguous Grammars

Ambiguous grammar: more than one distinct derivation of a string results in different parse trees

A programming language construct should have only one parse tree to avoid misinterpretation by a compiler

For expression grammars, associativity and precedence of operators is used to disambiguate

<expression> ::= <term> | <expression> <add_op> <term><term> ::= <factor> | <term> <mult_op> <factor><factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )<add_op> ::= + | -<mult_op> ::= * | /

COP4020 Spring 2014 2104/19/23

Ambiguous if-then-else:the “Dangling Else” A classical example of an ambiguous grammar are the

grammar productions for if-then-else:

<stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>

It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design

Ada uses different syntax to avoid ambiguity:

<stmt> ::= if <expr> then <stmt> end if | if <expr> then <stmt> else <stmt> end if