COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.

COP4020Programming Languages

Syntax analysis

Prof. Xin Yuan

COP4020 Spring 2014 204/19/23

Overview

Syntax analysis overview Grammar and context-free grammar Grammar derivations Parse trees

Syntax analysis

Syntax analysis is done by the parser. Detects whether the program is written following the grammar

rules and reports syntax errors. Produces a parse tree from which intermediate code can be

generated.

Sourceprogram

Lexicalanalyzer

token

Request for token

parserRest of front end

Symboltable

Int.code

Parsetree

The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)). Similar to the languages specified by regular

expressions, but more general. A grammar gives a precise syntactic specification

of a language. From some classes of grammars, tools exist that

can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.

A compiler based on a grammatical description of a language is more easily maintained and updated.

Grammars

A grammar has four components G=(N, T, P, S):

T is a finite set of tokens (terminal symbols)

N is a finite set of nonterminals

P is a finite set of productions of the form .Where and

S is a special nonterminal that is a designated start symbol

COP4020 Spring 2014 504/19/23

*)(*)( TNNTN *)( TN

Example Grammar for expression (T=?, N=?, P=?, S=?)

Production: E ->E+E

E-> E-E

E-> (E)

E-> -E

E->num

E->id

How does this correspond to a language?Informally, you can expand the non-terminals using the productions

until all are expanded: the ending sentence (a sequence of tokens) is recognized by the grammar.

COP4020 Spring 2014 604/19/23

Language recognized by a grammar We say “aAb derives awb in one step”, denoted as

“aAb=>awb”, if A->w is a production and a and b are arbitrary strings of terminal or nonterminal symbols.

We say a1 derives am if a1=>a2=>…=>am, written as a1=>am

The languages L(G) defined by G are the set of strings of the terminals w such that S=>w.

COP4020 Spring 2014 704/19/23

*

*

Example

A->aA

A->bA

A->a

A->b

G=(N, T, P, S) N=? T=? P=? S=?

What is the language recognized by this grammer?

COP4020 Spring 2014 804/19/23

Chomsky Hierarchy (classification of grammars)

A grammar is said to be regular if it is

right-linear, where each production in P has the form, or . Here, A and B are non-terminals and w is a terminal

or left-linear context-free if each production in P is of the form

, where and context sensitive if each production in P is of the form

where unrestricted if each production in P is of the form

where

All languages recognized by regular expression can be represented by a regular grammar.

wBA wA

ANA *)( TN

||||

A context free grammar has four components G=(N, T, P, S):

T is a finite set of tokens (terminal symbols)

N is a finite set of nonterminals

P is a finite set of productions of the form Where and .

S is a special nonterminal that is a designated start symbol.

Context free grammar is more expressive than regular expression. Consider language

{ab, aabb, aaabbb, …}

ANA *)( TN

COP4020 Spring 2014 1104/19/23

BNF Notation (another form of context free grammar) Backus-Naur Form (BNF) notation for productions:

<nonterminal> ::= sequence of (non)terminals

where Each terminal in the grammar is a token A <nonterminal> defines a syntactic category The symbol | denotes alternative forms in a production The special symbol denotes empty

COP4020 Spring 2014 1204/19/23

Example<Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .<Block> ::= <Variables> begin <Stmt> <More_Stmts> end<More_ids> ::= , <id> <More_ids>

| <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>

| <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>

| <Stmt> ::= <id> := <Exp>

| if <Exp> then <Stmt> else <Stmt>| while <Exp> do <Stmt>| begin <Stmt> <More_Stmts> end

<More_Stmts> ::= ; <Stmt> <More_Stmts>|

<Exp> ::= <num>| <id>| <Exp> + <Exp>| <Exp> - <Exp>

COP4020 Spring 2014 1304/19/23

Derivations

From a grammar we can derive strings (= sequences of tokens) The opposite process of parsing

Starting with the grammar’s designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal A sentence (in the language) is a sequence of terminals that can

be derived from the start symbol. A sentential form is a sequence of terminals and nonterminals

that can be derived from the start symbol.

COP4020 Spring 2014 1404/19/23

Example Derivation

<expression> <expression> <operator> <expression> <expression> <operator> identifier <expression> + identifier <expression> <operator> <expression> + identifier <expression> <operator> identifier + identifier <expression> * identifier + identifier identifier * identifier + identifier

<expression> ::= identifier | unsigned_integer | - <expression> | ( <expression> ) | <expression> <operator> <expression><operator> ::= + | - | * | /

Start symbol

Replacement of nonterminal with one of its productions

The final string is the yield

Sen

tent

ial f

orm

s

COP4020 Spring 2014 1504/19/23

Rightmost versus Leftmost Derivations

When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step the derivation is called right-most (left-most)

<expression> <expression> <operator> <expression> <expression> <operator> identifier

Replace in leftmost derivation

Replace in rightmost derivation

<expression> <expression> <operator> <expression> identifier <operator> <expression> Replace in leftmost derivation

Replace in rightmost derivation

COP4020 Spring 2014 1604/19/23

A Language Generated by a Grammar A context-free grammar is a generator of a context-free language The language defined by a grammar G is the set of all strings w that

can be derived from the start symbol S

L(G) = { w | S * w }

<S> ::= a | ‘(’ <S> ‘)’ L(G) = { set of all strings a (a) ((a)) (((a))) … }

<S> ::= <B> | <C><B> ::= <C> + <C><C> ::= 0 | 1

L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }

COP4020 Spring 2014 1704/19/23

Parse Trees

A parse tree depicts the end result of a derivation The internal nodes are the nonterminals The children of a node are the symbols (terminals and

nonterminals) on a right-hand side of a production The leaves are the terminals

<expression>

<expression> <operator>

identifier

<operator> <expression><expression>

<expression>

identifieridentifier * +

Parse Trees

COP4020 Spring 2014 1804/19/23

<expression>


identifier


<expression>

identifieridentifier * +

<expression> <expression> <operator> <expression> <expression> <operator> identifier <expression> + identifier <expression> <operator> <expression> + identifier <expression> <operator> identifier + identifier <expression> * identifier + identifier identifier * identifier + identifier

COP4020 Spring 2014 1904/19/23

Ambiguity

There is another parse tree for the same grammar and input: the grammar is ambiguous

This parse tree is not desired, since it appears that + has precedence over *

<expression>


identifier


<expression>

identifieridentifier +*

COP4020 Spring 2014 2004/19/23

Ambiguous Grammars

Ambiguous grammar: more than one distinct derivation of a string results in different parse trees

A programming language construct should have only one parse tree to avoid misinterpretation by a compiler

For expression grammars, associativity and precedence of operators is used to disambiguate

<expression> ::= <term> | <expression> <add_op> <term><term> ::= <factor> | <term> <mult_op> <factor><factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )<add_op> ::= + | -<mult_op> ::= * | /

COP4020 Spring 2014 2104/19/23

Ambiguous if-then-else:the “Dangling Else” A classical example of an ambiguous grammar are the

grammar productions for if-then-else:

<stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>

It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design

Ada uses different syntax to avoid ambiguity:

<stmt> ::= if <expr> then <stmt> end if | if <expr> then <stmt> else <stmt> end if

COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.

Documents

Transcript of COP4020 Programming Languages Syntax analysis Prof. Xin Yuan.