Module 2 Compiler and their Working Software Construction Lecture 10,11 and 12.


What are Compilers
Translate information from one representation to another
Usually information = program
Typical compilers:
• VC, VC++, GCC, JavaC
• FORTRAN, Pascal, VB
Translators:
• Word to PDF
• PDF to Postscript

Source Code
• Optimized for human readability
• Matches human notions of grammar
• Uses named constructs such as variables and procedures

How to Translate
• Translation is a complex process
• Source language and generated code are very different
• Need to structure the translation

Two-pass Compiler
[Diagram: source code → Front End → IR → Back End → machine code; both ends report errors]
• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code

The Front-End
Modules:
• Scanner (also called lexical analyzer)
• Parser
[Diagram: source code → scanner → tokens → parser → IR; both modules report errors]

Scanner
[Diagram: source code → scanner → tokens → parser → IR; errors reported]
Maps character stream into words – basic unit of syntax
Produces pairs:
• a word and
• its part of speech

Scanner Example
x = x + y becomes
<id,x> <assign,=> <id,x> <op,+> <id,y>
In <id,x>, "id" is the token type and "x" is the word.
We call the pair “<token type, word>” a “token”.
Typical tokens: number, identifier, +, -, new, while, if
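The scanner example above can be sketched in a few lines of Python. This is only an illustration: the token type names and the regular expressions in TOKEN_SPEC are assumptions chosen to reproduce the slide's output, not a fixed standard.

```python
import re

# Token types and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("op",     r"[+\-*/]"),
    ("skip",   r"\s+"),          # whitespace is matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (token type, word) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("x = x + y"))
```

Each pair in the result corresponds to one token such as <id,x> on the slide.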

Parser
[Diagram: source code → scanner → tokens → parser → IR; errors reported]
• Recognizes context-free syntax and reports errors
• Guides context-sensitive (“semantic”) analysis
• Builds IR for source program

What is Context-Free Syntax
To understand this we need a basis in context-free grammars.
A context-free grammar is a set of rewrite rules, as defined on the following slides.

Context-Free Grammars
Context-free syntax is specified with a grammar G = (S, N, T, P)
• S is the start symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols or words
• P is a set of productions or rewrite rules

Context-Free Grammars
Grammar for expressions
1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

The Front End
For this CFG
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }

Context-Free Grammars
Given a CFG, we can derive sentences by repeated substitution
Consider the sentence (expression)
x + 2 – y

Derivation
Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr – y
2            expr op term – y
4            expr op 2 – y
6            expr + 2 – y
3            term + 2 – y
5            x + 2 – y
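The derivation can be replayed mechanically. The sketch below is an illustration under stated assumptions: the dictionary layout and the helper name rewrite_rightmost are hypothetical, "-" is written in ASCII, and the sentential forms contain token categories (id, number) rather than the words x, 2 and y. Each step rewrites the rightmost non-terminal, which is what the production sequence above does.

```python
# Productions of the expression grammar, numbered as in the grammar above.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op",   ["+"]),
    7: ("op",   ["-"]),
}
NON_TERMINALS = {"goal", "expr", "term", "op"}

def rewrite_rightmost(form, rule):
    """Replace the rightmost non-terminal in `form` using production `rule`."""
    lhs, rhs = PRODUCTIONS[rule]
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NON_TERMINALS:
            if form[i] != lhs:
                raise ValueError(f"rule {rule} does not apply to {form[i]}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

form = ["goal"]
for rule in [1, 2, 5, 7, 2, 4, 6, 3, 5]:
    form = rewrite_rightmost(form, rule)
    print(rule, " ".join(form))
```

The last line printed is the token-category form of x + 2 – y.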

The Front End
To recognize a valid sentence in some CFG, we reverse this process and build up a parse
A parse can be represented by a tree: parse tree or syntax tree

Parse
Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr – y
2            expr op term – y
4            expr op 2 – y
6            expr + 2 – y
3            term + 2 – y
5            x + 2 – y

Syntax Tree
x + 2 – y
goal
└─ expr
   ├─ expr
   │  ├─ expr
   │  │  └─ term
   │  │     └─ <id,x>
   │  ├─ op
   │  │  └─ +
   │  └─ term
   │     └─ <number,2>
   ├─ op
   │  └─ –
   └─ term
      └─ <id,y>

Abstract Syntax Trees
The parse tree contains a lot of unneeded information.
Compilers often use an abstract syntax tree (AST).

Abstract Syntax Trees
–
├─ +
│  ├─ <id,x>
│  └─ <number,2>
└─ <id,y>
• This is much more concise
• AST summarizes grammatical structure without the details of derivation
• ASTs are one kind of intermediate representation (IR)
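As a rough illustration, the AST for x + 2 – y can be written as a small data structure. The class names Id, Number and BinOp and the evaluate helper are assumptions made for this sketch; real compilers use richer node types.

```python
from dataclasses import dataclass

@dataclass
class Id:
    name: str

@dataclass
class Number:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# AST for x + 2 - y: "-" is the root and "+" is its left child.
ast = BinOp("-", BinOp("+", Id("x"), Number(2)), Id("y"))

def evaluate(node, env):
    """Walk the tree bottom-up; `env` maps identifier names to values."""
    if isinstance(node, Number):
        return node.value
    if isinstance(node, Id):
        return env[node.name]
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    if node.op == "+":
        return left + right
    if node.op == "-":
        return left - right
    raise ValueError(f"unknown operator {node.op!r}")

print(evaluate(ast, {"x": 5, "y": 3}))
```

Only operators and operands remain in the tree; the goal, expr, term and op nodes of the parse tree are gone.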

Three-pass Compiler
[Diagram: source code → Front End → IR → Middle End → IR → Back End → machine code; all stages report errors]
• Intermediate stage for code improvement or optimization
• Analyzes IR and rewrites (or transforms) IR
• Primary goal is to reduce running time of the compiled code
• May also improve space usage, power consumption, ...
• Must preserve “meaning” of the code.

Lexical Analysis
[Diagram: the scanner stage of the front end: source code → scanner → tokens → parser → IR; errors reported]

Lexical Analysis
• The task of the scanner is to take a program written in some programming language as a stream of characters and break it into a stream of tokens.
• This activity is called lexical analysis.
• The lexical analyzer partitions the input string into substrings, called words, and classifies them according to their role.
• Output of lexical analysis is a stream of tokens.

Tokens
Example:
if( i == j )
    z = 0;
else
    z = 1;
Input is just a sequence of characters:
if ( \b i \b = = \b j \n \t ....

Tokens
Goal:
• partition input string into substrings
• classify them according to their role
A token is a syntactic category
Natural language: “He wrote the program”
Words: “He”, “wrote”, “the”, “program”
Programming language: “if(b == 0) a = b”
Words: “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=”, “b”

Tokens
Identifiers: x y11 maxsize
Keywords: if else while for
Integers: 2 1000 -44 5L
Floats: 2.0 0.0034 1e5
Symbols: ( ) + * / { } < > ==
Strings: “enter x” “error”

How to Describe Tokens?
Regular languages are the most popular for specifying tokens
• Simple and useful theory
• Easy to understand
• Efficient implementations

Example of Languages
Alphabet = English characters
Language = English sentences
Alphabet = ASCII
Language = C++ programs, Java, C#

Recap
Tokens: strings of characters representing lexical units of programs such as identifiers, numbers, operators.
Regular Expressions: concise description of tokens. A regular expression describes a set of strings.
Language L(R): set of strings represented by a regular expression R. L(R) is the language denoted by regular expression R.

Regular Expression
R|S = either R or S
RS = R followed by S (concatenation)
R* = concatenation of R zero or more times (R* = ε|R|RR|RRR...)
R? = ε | R (zero or one R)
R+ = RR* (one or more R)
[abc] = a|b|c (any of listed)
[a-z] = a|b|....|z (range)
[^ab] = c|d|... (anything but ‘a’ ‘b’)
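The operators listed above can be tried out with Python's re module, whose syntax for these operators matches the slide. The helper name in_language is an assumption for this sketch; it asks whether a whole word belongs to L(R). Python regexes also offer features beyond regular languages, which are not used here.

```python
import re

def in_language(pattern, word):
    """True if the whole word belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, word) is not None

print(in_language(r"a|b", "b"))            # R|S: either R or S
print(in_language(r"ab", "ab"))            # RS: concatenation
print(in_language(r"a*", ""))              # R*: zero or more, so the empty string is included
print(in_language(r"a?", "aa"))            # R?: zero or one, so "aa" is rejected
print(in_language(r"a+", ""))              # R+: one or more, so the empty string is rejected
print(in_language(r"[abc]", "c"))          # any of the listed characters
print(in_language(r"[a-z]+", "maxsize"))   # range, repeated one or more times
print(in_language(r"[^ab]", "a"))          # anything but 'a' or 'b'
```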

How to Use REs
We need a mechanism to determine if an input string w belongs to L(R), the language denoted by regular expression R.

Acceptor
Such a mechanism is called an acceptor.
[Diagram: input string w and language L → acceptor → yes, if w ∈ L; no, if w ∉ L]

Finite Automata (FA)
Specification: Regular Expressions
Implementation: Finite Automata
A finite automaton accepts a string if we can follow transitions labelled with characters in the string from start state to some accepting state
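A minimal sketch of such an automaton, assuming identifiers are a letter followed by letters or digits. The state names, the char_class helper and the transition table are illustrative choices, not taken from the slides.

```python
def char_class(ch):
    """Map a character to the label used on the automaton's transitions."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (state, character class) -> next state; missing entries mean "no transition".
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"):  "ident",
}
ACCEPTING = {"ident"}

def accepts(word):
    """Follow transitions from the start state; accept if we end in an accepting state."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

for word in ["x", "y11", "maxsize", "1x", ""]:
    print(repr(word), accepts(word))
```

The table-driven loop is the acceptor: it answers yes or no for each input string.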

SYNTACTIC VS SEMANTIC ANALYSIS

Syntactic Analysis
Natural language analogy: consider the sentence
He wrote the program
Words:      He       wrote      the      program
Categories: noun     verb       article  noun
Roles:      subject  predicate  object (the program)
Whole:      sentence

Syntactic Analysis
Programming language
if ( b <= 0 ) a = b
Parts: ( b <= 0 ) is a bool expr; a = b is an assignment
Whole: if-statement

Syntactic Analysis
int* foo(int i, int j))
{
    for(k=0; i j; )
        fi( i > j )
            return j;
}
Syntax errors:
• extra parenthesis after the parameter list
• missing expression in the for header
• "fi" is not a keyword

Semantic Analysis
Grammatically correct:
He wrote the computer
Words:      He       wrote      the      computer
Categories: noun     verb       article  noun
Roles:      subject  predicate  object (the computer)
Whole:      sentence

Semantic Analysis
int* foo(int i, int j)
{
    for(k=0; i < j; j++ )
        if( i < j-2 )
            sum = sum+i;
    return sum;
}
Semantic errors:
• undeclared var
• return type mismatch

Role of the Parser
• Not all sequences of tokens are programs.
• Parser must distinguish between valid and invalid sequences of tokens.

What we need
• An expressive way to describe the syntax
• An acceptor mechanism that determines if input token stream satisfies the syntax
Parsing is the process of discovering a derivation for some sentence
• Mathematical model of syntax – a grammar G.
• Algorithm for testing membership in L(G).

Backus-Naur Form (BNF)
• Context-free grammars are (often) given by BNF expressions (Backus-Naur Form)
• Grammar rules in a similar form were first used in the description of the Algol60 Language.
• The notation was developed by John Backus and adapted by Peter Naur for the Algol60 report.
• Thus the term Backus-Naur Form (BNF).

The meta-symbols of BNF are:
Symbol   Definition or description
::=      meaning "is defined as"
|        meaning "or"
< >      angle brackets used to surround category names
[ ]      optional items are enclosed in meta symbols [ and ]

Meta-symbols of BNF
• optional items are enclosed in meta symbols [ and ], example:
  <if_statement> ::= if <boolean_expression> then <statement_sequence> [ else <statement_sequence> ] end if ;
• repetitive items (zero or more times) are enclosed in meta symbols { and }, example:
  <identifier> ::= <letter> { <letter> | <digit> }
• terminals of only one character are surrounded by quotes (") to distinguish them from meta-symbols, example:
  <statement_sequence> ::= <statement> { ";" <statement> }

In recent textbooks, terminal and non-terminal symbols are distinguished by using boldface for terminals and suppressing < and > around non-terminals. This greatly improves readability.

The example then becomes:
if_statement ::= if boolean_expression then statement_sequence [ else statement_sequence ] end if ";"

More Useful Grammar
1 expr → expr op expr
2      | num
3      | id
4 op   → +
5      | –
6      | *
7      | /

Derivation: x – 2 * y
Rule   Sentential Form
-      expr
1      expr op expr
3      <id,x> op expr
5      <id,x> – expr
1      <id,x> – expr op expr
2      <id,x> – <num,2> op expr
6      <id,x> – <num,2> * expr
3      <id,x> – <num,2> * <id,y>

Derivation
• Such a process of rewrites is called a derivation.
• The process of discovering a derivation is called parsing.
• At each step, we choose a non-terminal to replace.
• Different choices can lead to different derivations.

Two derivations are of interest

1. Leftmost derivation

2. Rightmost derivation

Derivations
• Leftmost derivation: replace leftmost non-terminal (NT) at each step
• Rightmost derivation: replace rightmost NT at each step
• The example on the preceding slides was a leftmost derivation
• There is also a rightmost derivation

Rightmost Derivation
Rule   Sentential Form
-      expr
1      expr op expr
3      expr op <id,y>
6      expr * <id,y>
1      expr op expr * <id,y>
2      expr op <num,2> * <id,y>
5      expr – <num,2> * <id,y>
3      <id,x> – <num,2> * <id,y>

Derivations
• The two derivations produce different parse trees.
• The parse trees imply different evaluation orders!

Parse Trees (leftmost derivation)
G
└─ E
   ├─ E: x
   ├─ op: –
   └─ E
      ├─ E: 2
      ├─ op: *
      └─ E: y
Evaluation order: x – ( 2 * y )

Parse Trees (rightmost derivation)
G
└─ E
   ├─ E
   │  ├─ E: x
   │  ├─ op: –
   │  └─ E: 2
   ├─ op: *
   └─ E: y
Evaluation order: (x – 2 ) * y
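To see why the two trees matter, the sketch below evaluates both of them for the same input values. The nested-tuple encoding (left, operator, right) and the sample values x = 10, y = 3 are assumptions for illustration only.

```python
def evaluate(tree, env):
    """Evaluate a tree given as (left, op, right); leaves are names or numbers."""
    if isinstance(tree, tuple):
        left, op, right = tree
        a, b = evaluate(left, env), evaluate(right, env)
        if op == "-":
            return a - b
        if op == "*":
            return a * b
        raise ValueError(f"unknown operator {op!r}")
    if isinstance(tree, str):
        return env[tree]      # identifier: look up its value
    return tree               # number literal

leftmost_tree = ("x", "-", (2, "*", "y"))     # x - (2 * y)
rightmost_tree = (("x", "-", 2), "*", "y")    # (x - 2) * y

env = {"x": 10, "y": 3}
print(evaluate(leftmost_tree, env))
print(evaluate(rightmost_tree, env))
```

With these values the first tree gives 4 and the second gives 24, so the same sentence has two different meanings under this grammar.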

Precedence
These two derivations point out a problem with the grammar
It has no notion of precedence, or implied order of evaluation

To add precedence

Create a non-terminal for each level of precedence

Isolate corresponding part of grammar

Force parser to recognize high precedence subexpressions first.

Precedence
For algebraic expressions
• Multiplication and division, first (level one)
• Subtraction and addition, next (level two)

Precedence
• This grammar is larger
• Takes more rewriting to reach some of the terminal symbols
• But it encodes expected precedence
• Produces same parse tree under leftmost and rightmost derivations
Let’s see how it parses x – 2 * y

Precedence: x – 2 * y
1 Goal   → expr
2 expr   → expr + term
3        | expr – term
4        | term
5 term   → term * factor
6        | term / factor
7        | factor
8 factor → number
9        | Id

Rule   Sentential Form
-      Goal
1      expr
3      expr – term
5      expr – term * factor
9      expr – term * <id,y>
7      expr – factor * <id,y>
8      expr – <num,2> * <id,y>
4      term – <num,2> * <id,y>
7      factor – <num,2> * <id,y>
9      <id,x> – <num,2> * <id,y>
The rightmost derivation

Parse Trees
G
└─ E
   ├─ E
   │  └─ T
   │     └─ F
   │        └─ <id,x>
   ├─ –
   └─ T
      ├─ T
      │  └─ F
      │     └─ <num,2>
      ├─ *
      └─ F
         └─ <id,y>
Evaluation order: x – ( 2 * y )

Precedence
Both leftmost and rightmost derivations give the same expression
Because the grammar directly encodes the desired precedence.

Parsing Techniques

Parsing Techniques
Top-down parsers
• Start at the root of the parse tree and grow towards leaves.
• Pick a production and try to match the input
• Bad “pick” may need to backtrack
• Some grammars are backtrack-free.

Top-down parsers
Also called LL parsing
• L means that tokens are read left to right
• L means that the parser constructs a leftmost derivation.

Parsing Techniques
Bottom-up parsers
• Start at the leaves and grow toward root
• As input is consumed, encode possibilities in an internal state.
• Start in a state valid for legal first tokens
• Bottom-up parsers handle a large class of grammars
• Preferred method in practice

Bottom-up Parsing
Also called LR parsing
• L means that tokens are read left to right
• R means that the parser constructs a rightmost derivation (in reverse).

Top-Down Parser
• A top-down parser starts with the root of the parse tree.
• The root node is labeled with the goal symbol of the grammar

Top-Down Parsing Algorithm
Construct the root node of the parse tree
Repeat until the fringe [leaves] of the parse tree matches input string
1. At a node labeled A, select a production with A on its lhs; for each symbol on its rhs, construct the appropriate child
2. When a terminal symbol is added to the fringe and it does not match the input, backtrack
3. Find the next node to be expanded

Top-Down Parsing
The key is picking the right production in step 1.
That choice should be guided by the input string

P   Sentential Form    Input
-   Goal               x – 2 * y
1   expr               x – 2 * y
2   expr + term        x – 2 * y
4   term + term        x – 2 * y
7   factor + term      x – 2 * y
9   <id,x> + term      x – 2 * y
9   <id,x> + term      x – 2 * y

This worked well except that “–” does not match “+”


The parser must backtrack to here


This time the “–” and “–” matched

P   Sentential Form    Input
-   Goal               x – 2 * y
1   expr               x – 2 * y
2   expr – term        x – 2 * y
4   term – term        x – 2 * y
7   factor – term      x – 2 * y
9   <id,x> – term      x – 2 * y
9   <id,x> – term      x – 2 * y

We can advance past “–” to look at “2”

P   Sentential Form    Input
-   Goal               x – 2 * y
1   expr               x – 2 * y
2   expr – term        x – 2 * y
4   term – term        x – 2 * y
7   factor – term      x – 2 * y
9   <id,x> – term      x – 2 * y
9   <id,x> – term      x – 2 * y
-   <id,x> – term      x – 2 * y

Now, we need to expand “term”


P   Sentential Form      Input
-   <id,x> – term        x – 2 * y
7   <id,x> – factor      x – 2 * y
8   <id,x> – <num,2>     x – 2 * y
-   <id,x> – <num,2>     x – 2 * y

“2” matches “2”
We have more input but no non-terminals left to expand
The expansion terminated too soon
Need to backtrack


P   Sentential Form                Input
-   <id,x> – term                  x – 2 * y
5   <id,x> – term * factor         x – 2 * y
7   <id,x> – factor * factor       x – 2 * y
8   <id,x> – <num,2> * factor      x – 2 * y
-   <id,x> – <num,2> * factor      x – 2 * y
-   <id,x> – <num,2> * factor      x – 2 * y
9   <id,x> – <num,2> * <id,y>      x – 2 * y
-   <id,x> – <num,2> * <id,y>      x – 2 * y

Success! We matched and consumed all the input
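The pick, match and backtrack cycle traced above can be sketched as a small recognizer. Two assumptions are made loudly here: the grammar is rewritten with right recursion (a left-recursive rule could make this naive search expand forever), and tokens are reduced to their categories with "-" in ASCII. The names GRAMMAR, derive and accepts are hypothetical.

```python
# Right-recursive variant of the expression grammar (an assumption of this sketch).
GRAMMAR = {
    "goal":   [["expr"]],
    "expr":   [["term", "+", "expr"], ["term", "-", "expr"], ["term"]],
    "term":   [["factor", "*", "term"], ["factor", "/", "term"], ["factor"]],
    "factor": [["number"], ["id"]],
}

def derive(symbols, tokens, pos):
    """Yield every input position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Non-terminal: try each production in turn; a failed one simply yields nothing.
        for production in GRAMMAR[first]:
            for mid in derive(production, tokens, pos):
                yield from derive(rest, tokens, mid)
    elif pos < len(tokens) and tokens[pos] == first:
        # Terminal that matches the input: advance past it.
        yield from derive(rest, tokens, pos + 1)
    # Terminal mismatch: yield nothing, which makes the caller backtrack.

def accepts(tokens):
    """Accept if some derivation from goal consumes all the input."""
    return any(end == len(tokens) for end in derive(["goal"], tokens, 0))

print(accepts(["id", "-", "number", "*", "id"]))   # x - 2 * y
print(accepts(["id", "-"]))                        # incomplete expression
```

Python generators do the bookkeeping: abandoning a production and trying the next one is the backtracking step from the trace.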

Another Possible Parse
P   Sentential Form                        Input
-   Goal                                   x – 2 * y
1   expr                                   x – 2 * y
2   expr + term                            x – 2 * y
2   expr + term + term                     x – 2 * y
2   expr + term + term + term              x – 2 * y
2   expr + term + term + term + ....       x – 2 * y

• Consuming no input!!
• Wrong choice of expansion leads to non-termination
• Parser must make the right choice

Left Recursion
Top-down parsers cannot handle left-recursive grammars

Left Recursion
Our expression grammar is left recursive.

This can lead to non-termination in a top-down parser

Non-termination is bad in any part of a compiler

For a top-down parser, any recursion must be a right recursion

We would like to convert left recursion to right recursion

To remove left recursion, we transform the grammar
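A small demonstration of the non-termination problem, assuming a parser that always tries expr → expr + term first. The function names and the MAX_DEPTH cut-off are hypothetical; the cut-off exists only so that the example stops instead of recursing until the interpreter gives up.

```python
MAX_DEPTH = 50  # arbitrary cut-off (an assumption) so that the demonstration stops

def expand_expr(tokens, pos, depth=0):
    """Naively expand expr -> expr + term, always choosing that production first."""
    if depth > MAX_DEPTH:
        raise RuntimeError(f"{depth} expansions, input position still {pos}")
    # The first rhs symbol is expr itself, so we recurse before consuming any token.
    pos = expand_expr(tokens, pos, depth + 1)
    if pos < len(tokens) and tokens[pos] == "+":
        return pos + 2  # would skip "+" and a one-token term; never reached here
    return pos

def demo():
    """Run the naive expansion and report how it ended."""
    try:
        expand_expr(["id", "-", "number", "*", "id"], 0)
    except RuntimeError as err:
        return str(err)
    return "terminated"

print(demo())
```

The input position never moves away from 0, which is exactly the "consuming no input" behaviour described above.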

Eliminating Left Recursion
Consider a grammar fragment:
A → A α | β
where neither α nor β starts with A.

Eliminating Left Recursion
We can rewrite this as:
A  → β A'
A' → α A' | ε
where A' is a new non-terminal

Eliminating Left Recursion
A  → β A'
A' → α A' | ε
This accepts the same language but uses only right recursion

Eliminating Left Recursion
The expression grammar we have been using contains two cases of left recursion

Eliminating Left Recursion

expr → expr + term | expr – term | term

term → term * factor | term ∕ factor | factor

Eliminating Left Recursion
Applying the transformation yields
expr  → term expr'
expr' → + term expr' | – term expr' | ε

Eliminating Left Recursion
Applying the transformation yields
term  → factor term'
term' → * factor term' | ∕ factor term' | ε

Eliminating Left Recursion
• These fragments use only right recursion
• A top-down parser will terminate using them.
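The transformation A → A α | β into A → β A', A' → α A' | ε can be sketched as a short function. This is a minimal illustration that handles only immediate left recursion; the function name, the list-of-lists encoding of productions and the use of the string "ε" for the empty production are assumptions.

```python
EPSILON = "ε"

def eliminate_left_recursion(nt, productions):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε (immediate left recursion only)."""
    new_nt = nt + "'"
    # α parts: what follows the leading A in each left-recursive alternative.
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # β parts: alternatives that do not start with A.
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}   # nothing to do
    return {
        nt:     [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[EPSILON]],
    }

expr = [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
result = eliminate_left_recursion("expr", expr)
for lhs, alternatives in result.items():
    print(lhs, "→", " | ".join(" ".join(rhs) for rhs in alternatives))
```

Applied to the expr rule, this prints the same right-recursive fragment as shown above.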

Predictive Parsing
• If a top-down parser picks the wrong production, it may need to backtrack
• Alternative is to look ahead in input and use context to pick correctly
How much lookahead is needed?
• In general, an arbitrarily large amount
• Fortunately, large classes of CFGs can be parsed with limited lookahead
• Most programming language constructs fall in those subclasses

LL(1) ... LL(k) Parsing
• scan input from Left to right
• do a Leftmost derivation
• use 1..k symbols of lookahead
• is a top-down parsing technique
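A minimal sketch of a predictive parser with one token of lookahead, written as a recursive-descent parser over the right-recursive expression grammar (expr → term expr', term → factor term', and so on). The class name, the (token type, word) input format, the ASCII "-" and the nested-tuple result are assumptions made for illustration.

```python
class Parser:
    """Recursive-descent parser that decides each step from one lookahead token."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + [("eof", "")]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def goal(self):
        tree = self.expr()
        if self.peek()[0] != "eof":
            raise SyntaxError(f"unexpected token {self.peek()}")
        return tree

    def expr(self):                      # expr -> term expr'
        return self.expr_prime(self.term())

    def expr_prime(self, left):          # expr' -> + term expr' | - term expr' | ε
        kind, word = self.peek()
        if kind == "op" and word in ("+", "-"):
            self.advance()
            return self.expr_prime((left, word, self.term()))
        return left                      # ε: the lookahead cannot continue an expr

    def term(self):                      # term -> factor term'
        return self.term_prime(self.factor())

    def term_prime(self, left):          # term' -> * factor term' | / factor term' | ε
        kind, word = self.peek()
        if kind == "op" and word in ("*", "/"):
            self.advance()
            return self.term_prime((left, word, self.factor()))
        return left

    def factor(self):                    # factor -> number | id
        kind, word = self.advance()
        if kind == "number":
            return int(word)
        if kind == "id":
            return word
        raise SyntaxError(f"expected number or id, got {(kind, word)}")

tokens = [("id", "x"), ("op", "-"), ("number", "2"), ("op", "*"), ("id", "y")]
print(Parser(tokens).goal())
```

No backtracking is needed: at every choice point the next token alone selects the production, and the resulting tree groups 2 * y before the subtraction.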

FURTHER IN THE ADVANCED COURSE: COMPILER CONSTRUCTION, 7TH SEMESTER
