Source Code
 Optimized for human readability
 Matches human notions of grammar
 Uses named constructs such as variables and procedures

Software Construction
Lecture 10, 11 and 12

What are Compilers

Translate information from one representation to another
Usually information = program

Typical Compilers:
• VC, VC++, GCC, JavaC
• FORTRAN, Pascal, VB

Translators
• Word to PDF
• PDF to Postscript

How to Translate

Translation is a complex process

Source code and generated code are very different

Need to structure the translation

Two-pass Compiler

[Figure: source code → front end → IR → back end → machine code; errors are reported along the way]

• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code

The Front-End

Modules
• Scanner (also called lexical analyzer)
• Parser

[Figure: source code → scanner → tokens → parser → IR]


Scanner
• Maps character stream into words, the basic unit of syntax
• Produces pairs: a word and its part of speech


Scanner
Example

x = x + y becomes

<id,x> <assign,=> <id,x> <op,+> <id,y>

Each pair has the form <token type, word>

• We call the pair "<token type, word>" a "token"
• Typical tokens: number, identifier, +, -, new, while, if
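The scanner example above can be sketched in a few lines of Python. This is only an illustration: the token class names and patterns in `TOKEN_SPEC` and the function name `scan` are assumptions made for this sketch, not part of the lecture.

```python
import re

# Hypothetical token classes for this sketch; the first matching class wins.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("op",     r"[+\-*/]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return a list of (token type, word) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":      # whitespace separates words but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("x = x + y"))
# [('id', 'x'), ('assign', '='), ('id', 'x'), ('op', '+'), ('id', 'y')]
```

Each pair in the result corresponds to one token of the form <token type, word>.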

Parser
• Recognizes context-free syntax and reports errors
• Guides context-sensitive ("semantic") analysis
• Builds IR for source program

What is Context-Free Syntax

Context-free syntax is described by a context-free grammar. It is a set of rewrite rules.


Context-Free Grammars

Context-free syntax is specified with a grammar G=(S,N,T,P)
S is the start symbol
N is a set of non-terminal symbols
T is a set of terminal symbols or words
P is a set of productions or rewrite rules


Context-Free Grammars
Grammar for expressions
1. goal → expr
2. expr → expr op term
3. | term
4. term → number
5. | id
6. op → +
7. | -

The Front End
For this CFG

S = goal
T = { number, id, +, -}
N = { goal, expr, term, op}
P = { 1, 2, 3, 4, 5, 6, 7}

Context-Free Grammars

Given a CFG, we can derive sentences by repeated substitution

Consider the sentence (expression)

x + 2 – y

Derivation
Production  Result
-           goal
1           expr
2           expr op term
5           expr op y
7           expr – y
2           expr op term – y
4           expr op 2 – y
6           expr + 2 – y
3           term + 2 – y
5           x + 2 – y
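The derivation can be replayed mechanically. The sketch below assumes each step rewrites the rightmost non-terminal, which is what the production sequence 1, 2, 5, 7, 2, 4, 6, 3, 5 does; the names `PRODUCTIONS` and `derive_rightmost` are made up for this illustration. Terminals are token categories, so the final sentence reads id + number - id, which stands for x + 2 - y.

```python
# Productions of the expression grammar, keyed by rule number.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}

def derive_rightmost(start, rule_numbers):
    """Apply each rule to the rightmost non-terminal; return all sentential forms."""
    form = [start]
    steps = [" ".join(form)]
    for number in rule_numbers:
        lhs, rhs = PRODUCTIONS[number]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions or form[positions[-1]] != lhs:
            raise ValueError(f"rule {number} does not apply to {' '.join(form)!r}")
        last = positions[-1]
        form[last:last + 1] = rhs          # substitute the right-hand side
        steps.append(" ".join(form))
    return steps

for step in derive_rightmost("goal", [1, 2, 5, 7, 2, 4, 6, 3, 5]):
    print(step)
```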

The Front End

To recognize a valid sentence in some CFG, we reverse this process and build up a parse

A parse can be represented by a tree: parse tree or syntax tree

Parse
Production  Result
-           goal
1           expr
2           expr op term
5           expr op y
7           expr – y
2           expr op term – y
4           expr op 2 – y
6           expr + 2 – y
3           term + 2 – y
5           x + 2 – y

Syntax Tree
x + 2 - y

[Figure: parse tree for x + 2 - y, with leaves including +, <number,2>, – and <id,y>]

Abstract Syntax Trees

The parse tree contains a lot of unneeded information.

Compilers often use an abstract syntax tree (AST) instead.

Abstract Syntax Trees

This is much more concise

The AST omits the details of derivation
ASTs are one kind of intermediate representation

[Figure: AST for x + 2 - y, with leaves including <id,x> and <number,2>]
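As a rough illustration of how compact an AST is, the sketch below builds one for x + 2 - y and evaluates it. The node classes `Num`, `Id` and `BinOp` and the function `evaluate` are hypothetical names chosen for this example; real compilers use richer node types.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: int

@dataclass
class Id:
    name: str

@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"

Node = Union[Num, Id, BinOp]

def evaluate(node, env):
    """Evaluate an AST, looking up identifiers in the dictionary env."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Id):
        return env[node.name]
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    if node.op == "+":
        return left + right
    if node.op == "-":
        return left - right
    raise ValueError(f"unknown operator {node.op!r}")

# AST for x + 2 - y: the subtraction is the root, the addition is its left child.
ast = BinOp("-", BinOp("+", Id("x"), Num(2)), Id("y"))
print(evaluate(ast, {"x": 5, "y": 3}))  # 4
```

Only operators and operands appear in the tree; the non-terminals goal, expr, term and op from the derivation are gone.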


Three-pass Compiler

Intermediate stage for code improvement or optimization
Analyzes IR and rewrites (or transforms) IR
Primary goal is to reduce running time of the compiled code
May also improve space usage, power consumption, ...
Must preserve "meaning" of the code.

[Figure: source code → front end → IR → optimizer → IR → back end → machine code; errors are reported along the way]
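One simple example of an IR-to-IR rewrite is constant folding. The sketch below assumes a toy IR in which an expression is either a leaf (an int constant or a variable name) or a tuple `(op, left, right)`; this representation and the name `fold_constants` are assumptions for illustration only.

```python
def fold_constants(node):
    """Return an equivalent tree with constant subexpressions replaced by their value."""
    if not isinstance(node, tuple):
        return node                        # leaf: int constant or variable name
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        # Both operands are known at compile time, so compute the result now.
        return {"+": left + right, "-": left - right, "*": left * right}[op]
    return (op, left, right)

ir = ("+", "x", ("*", 2, 3))
print(fold_constants(ir))  # ('+', 'x', 6)
```

The rewritten tree computes the same value for every x, so the "meaning" of the code is preserved while one multiplication is removed from the running program.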





Lexical Analysis

The task is to read a program written in some programming language as a stream of characters and break it into a stream of tokens.

This activity is called lexical analysis.

The lexical analyzer partitions the input string into substrings, called words, and classifies them according to their role

Output of lexical analysis is a stream of tokens

Example

if( i == j )
  z = 0;
else
  z = 1;

Input is just a sequence of characters:

if ( \b i \b = = \b j \b ) \n \t ....

Tokens
Goal: partition input string into substrings and classify them according to their role
A token is a syntactic category

Natural language:
"He wrote the program"
Words: "He", "wrote", "the", "program"

Programming language:
"if(b == 0) a = b"

Words: "if", "(", "b", "==", "0", ")", "a", "=", "b"

Tokens
Identifiers: x y11 maxsize
Keywords: if else while for
Integers: 2 1000 -44 5L
Floats: 2.0 0.0034 1e5
Symbols: ( ) + * / { } < > ==
Strings: "enter x" "error"

How to Describe Tokens?

Regular languages are the most popular formalism for specifying tokens

• Simple and useful theory
• Easy to understand
• Efficient implementations

Example of Languages

Alphabet = English characters
Language = English sentences

Alphabet = ASCII
Language = C++, Java, C# programs

Tokens:
strings of characters representing lexical units of programs such as identifiers, numbers, operators.

Regular Expressions:
concise description of tokens. A regular expression describes a set of strings.

Language L(R):
set of strings represented by a regular expression R. L(R) is the language denoted by regular expression R.

Regular Expression
R|S = either R or S
RS = R followed by S (concatenation)
R* = concatenation of R zero or more times

(R*= ε|R|RR|RRR...)
R? = ε| R (zero or one R)
R+ = RR* (one or more R)
[abc] = a|b|c (any of listed)
[a-z] = a|b|....|z (range)
[^ab] = c|d|... (anything but 'a' or 'b')
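These operators can be tried out with Python's `re` module, whose pattern syntax includes all of the notation above. The helper name `in_language` is made up for this sketch; `re.fullmatch` requires the whole word to match, which corresponds to asking whether the word is in L(R). Note that Python's engine is a backtracking matcher, not the finite automaton construction discussed in the lecture.

```python
import re

def in_language(pattern, word):
    """True if the whole word is in the language denoted by the pattern."""
    return re.fullmatch(pattern, word) is not None

print(in_language(r"a|b", "b"))               # R|S
print(in_language(r"(ab)*", ""))              # R* includes the empty string
print(in_language(r"(ab)+", ""))              # R+ needs at least one R
print(in_language(r"-?[0-9]+", "-44"))        # R? and a character range
print(in_language(r"[a-z][a-z0-9]*", "y11"))  # a simple identifier pattern
print(in_language(r"[^ab]", "a"))             # negated character class
```

The third and last calls print False; the others print True.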

How to Use REs

We need a mechanism to determine if an input string w belongs to L(R), the language denoted by regular expression R.

Acceptor

Such a mechanism is called an acceptor.

input string w → acceptor → yes, if w ∈ L; no, if w ∉ L

Finite Automata (FA)

Specification: Regular Expressions

Implementation: Finite Automata
A finite automaton accepts a string if we can follow transitions labelled with characters in the string from start state to some accepting state
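The acceptance rule can be sketched as a table-driven deterministic automaton. The example below recognizes identifiers (a letter followed by letters or digits); the state names, the `char_class` helper and the `TRANSITIONS` table are assumptions chosen for this illustration.

```python
def char_class(ch):
    """Map a character to the label used on the transitions."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (state, label) -> next state; a missing entry means there is no transition.
TRANSITIONS = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
}
ACCEPTING = {"in_id"}

def accepts(word):
    """Follow transitions from the start state; accept if we end in an accepting state."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False                   # no transition for this character
    return state in ACCEPTING

print(accepts("maxsize"), accepts("y11"), accepts("11y"))  # True True False
```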

Syntactic Analysis
Natural language analogy: consider the sentence

He wrote the program

noun verb article noun

subject predicate object


Syntactic Analysis
Programming language

if ( b <= 0 ) a = b
    bool expr assignment


Syntactic Analysis
int* foo(int i, int j)
{
  for(k=0; i j; )
    fi( i > j )
      return j;
}

Errors flagged in this code:
• extra parenthesis
• missing expression
• not a keyword

Semantic Analysis
Grammatically correct

He wrote the computer

noun verb article noun

subject predicate object


Semantic Analysis
int* foo(int i, int j)
{
  for(k=0; i < j; j++ )
    if( i < j-2 )
      sum = sum+i;
  return sum;
}

Errors flagged in this code:
• undeclared var
• return type


Role of the Parser
Not all sequences of tokens are programs.
Parser must distinguish between valid and invalid sequences of tokens.

What we need
An expressive way to describe the syntax
An acceptor mechanism that determines if input token stream satisfies the syntax
Parsing is the process of discovering a derivation for some sentence
Mathematical model of syntax – a grammar G.
Algorithm for testing membership in L(G).

Backus-Naur Form (BNF)

Context-free grammars are (often) given by BNF expressions (Backus-Naur Form)
Grammar rules in a similar form were first used in the description of the Algol60 Language.
The notation was developed by John Backus and adapted by Peter Naur for the Algol60 report. Thus the term Backus-Naur Form (BNF).

The meta-symbols of BNF are:
• ::= meaning "is defined as"
• | meaning "or"
• < > angle brackets used to surround category names
• [ ] square brackets used to enclose optional items

Meta-symbols of BNF
• optional items are enclosed in meta symbols [ and ], example:
  <if_statement> ::= if <boolean_expression> then <statement_sequence> [ else <statement_sequence> ] end if ;
• repetitive items (zero or more times) are enclosed in meta symbols { and }, example:
  <identifier> ::= <letter> { <letter> | <digit> }
• terminals of only one character are surrounded by quotes (") to distinguish them from meta-symbols, example:
  <statement_sequence> ::= <statement> { ";" <statement> }

In recent textbooks, terminal and non-terminal symbols are distinguished by using bold face for terminals and suppressing < and > around non-terminals. This greatly improves readability.

The example then becomes:
if_statement ::= if boolean_expression then statement_sequence [ else statement_sequence ] end if ";"

More Useful Grammar
1. expr → expr op expr
2. | num
3. | id
4. op → +
5. | –
6. | *
7. | /

Derivation: x – 2 * y
Rule  Sentential Form
-     expr
1     expr op expr
3     <id,x> op expr
5     <id,x> – expr
1     <id,x> – expr op expr
2     <id,x> – <num,2> op expr
6     <id,x> – <num,2> * expr
3     <id,x> – <num,2> * <id,y>

Derivation
Such a process of rewrites is called a derivation.
The process of discovering a derivation is called parsing.
At each step, we choose a non-terminal to replace.
Different choices can lead to different derivations.

Two derivations are of interest

1. Leftmost derivation

2. Rightmost derivation

Derivations
• Leftmost derivation: replace the leftmost non-terminal (NT) at each step
• Rightmost derivation: replace the rightmost NT at each step
• The example on the preceding slides was a leftmost derivation
• There is also a rightmost derivation

Rightmost Derivation
Rule  Sentential Form
-     expr
1     expr op expr
3     expr op <id,y>
6     expr * <id,y>
1     expr op expr * <id,y>
2     expr op <num,2> * <id,y>
5     expr – <num,2> * <id,y>
3     <id,x> – <num,2> * <id,y>

Derivations
The two derivations produce different parse trees

The parse trees imply different evaluation orders!

Parse Trees
Leftmost derivation

[Figure: parse tree with root E op E, where the left E is x, the op is – and the right E expands to E op E for 2 * y]

evaluation order: x – ( 2 * y )

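Parse Trees
Rightmost derivation

[Figure: parse tree with root E op E, where the left E expands to E op E for x – 2, the op is * and the right E is y]

evaluation order: ( x – 2 ) * y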
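To see that the two trees really compute different values, the sketch below evaluates both shapes for x = 10 and y = 3. The tuple representation `(op, left, right)` and the function name `evaluate` are assumptions made for this illustration.

```python
def evaluate(tree, env):
    """Evaluate a tree of the form (op, left, right), a variable name, or an int."""
    if isinstance(tree, str):
        return env[tree]
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left, env), evaluate(right, env)
    return a - b if op == "-" else a * b   # only - and * occur in this example

env = {"x": 10, "y": 3}
leftmost_tree = ("-", "x", ("*", 2, "y"))     # x - (2 * y)
rightmost_tree = ("*", ("-", "x", 2), "y")    # (x - 2) * y
print(evaluate(leftmost_tree, env))   # 4
print(evaluate(rightmost_tree, env))  # 24
```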


Precedence
These two derivations point out a problem with the grammar
It has no notion of precedence, or implied order of evaluation

To add precedence

Create a non-terminal for each level of precedence

Isolate corresponding part of grammar

Force parser to recognize high precedence subexpressions first.

Precedence
For algebraic expressions
• Multiplication and division, first (level one)
• Subtraction and addition, next (level two)

1. Goal → expr
2. expr → expr + term
3. | expr – term
4. | term
5. term → term * factor
6. | term / factor
7. | factor
8. factor → number
9. | id



Precedence
This grammar is larger
• Takes more rewriting to reach some of the terminal symbols
• But it encodes expected precedence
• Produces same parse tree under leftmost and rightmost derivations

Let's see how it parses

x – 2 * y

Precedence
The rightmost derivation of x – 2 * y with the precedence grammar:

Rule  Sentential Form
-     Goal
1     expr
3     expr – term
5     expr – term * factor
9     expr – term * <id,y>
7     expr – factor * <id,y>
8     expr – <num,2> * <id,y>
4     term – <num,2> * <id,y>
7     factor – <num,2> * <id,y>
9     <id,x> – <num,2> * <id,y>

Parse Trees

[Figure: parse tree for x – 2 * y under the precedence grammar]

evaluation order: x – ( 2 * y )

Precedence
Both leftmost and rightmost derivations give the same expression

Because the grammar directly encodes the desired precedence.

Parsing Techniques
Top-down parsers
• Start at the root of the parse tree and grow towards leaves.
• Pick a production and try to match the input
• Bad "pick" may need to backtrack
• Some grammars are backtrack-free.

Top-down parsers
Also called LL parsing
• The first L means that tokens are read left to right
• The second L means that the parser constructs a leftmost derivation.

Parsing Techniques
Bottom-up parsers
• Start at the leaves and grow toward root
• As input is consumed, encode possibilities in an internal state.
• Start in a state valid for legal first tokens
• Bottom-up parsers handle a large class of grammars
• Preferred method in practice

Bottom-up Parsing
Also called LR parsing
• L means that tokens are read left to right
• R means that the parser constructs a rightmost derivation (in reverse).

Top-Down Parser
• A top-down parser starts with the root of the parse tree.
• The root node is labeled with the goal symbol of the grammar

Top-Down Parsing Algorithm
Construct the root node of the parse tree
Repeat until the fringe (leaves) of the parse tree matches the input string:

1. At a node labeled A, select a production with A on its lhs and, for each symbol on its rhs, construct the appropriate child
2. When a terminal symbol is added to the fringe and it does not match the input, backtrack
3. Find the next node to be expanded

Top-Down Parsing
The key is picking the right production in step 1

That choice should be guided by the input string

Expression Grammar
1. Goal → expr
2. expr → expr + term
3. | expr – term
4. | term
5. term → term * factor
6. | term / factor
7. | factor
8. factor → number
9. | id
10. | ( expr )

Let's try parsing

x – 2 * y

Input: x – 2 * y

Rows marked "-" apply no production: the first is the start symbol, later ones match a terminal and advance the input.

P  Sentential Form
-  Goal
1  expr
2  expr + term
4  term + term
7  factor + term
9  <id,x> + term
-  <id,x> + term

This worked well except that "–" does not match "+"

The parser must backtrack to the point where it chose production 2 and try production 3 instead

P  Sentential Form
-  Goal
1  expr
3  expr – term
4  term – term
7  factor – term
9  <id,x> – term
-  <id,x> – term
-  <id,x> – term

This time the "–" and "–" matched

We can advance past "–" to look at "2"

Now, we need to expand "term"

P  Sentential Form
-  <id,x> – term
7  <id,x> – factor
8  <id,x> – <num,2>
-  <id,x> – <num,2>

"2" matches "2"

We have more input but no non-terminals left to expand

The expansion terminated too soon

Need to backtrack

P  Sentential Form
-  <id,x> – term
5  <id,x> – term * factor
7  <id,x> – factor * factor
8  <id,x> – <num,2> * factor
-  <id,x> – <num,2> * factor
-  <id,x> – <num,2> * factor
9  <id,x> – <num,2> * <id,y>
-  <id,x> – <num,2> * <id,y>

Success! We matched and consumed all the input

Another Possible Parse
Input: x – 2 * y

P  Sentential Form
-  Goal
1  expr
2  expr + term
2  expr + term + term
2  expr + term + term + term
2  expr + term + term + term + ....

• Consuming no input!
• Wrong choice of expansion leads to non-termination
• Parser must make the right choice

Left Recursion

Top-down parsers cannot handle left-recursive grammars

Left Recursion
Our expression grammar is left recursive.

This can lead to non-termination in a top-down parser

Non-termination is bad in any part of a compiler

For a top-down parser, any recursion must be a right recursion

We would like to convert left recursion to right recursion

To remove left recursion, we transform the grammar
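The non-termination is easy to reproduce. The sketch below is a deliberately naive recursive procedure for the production expr → expr + term; the function names are hypothetical. Because the first symbol on the right-hand side is expr itself, the procedure calls itself before consuming any input, and Python eventually stops it with a `RecursionError`.

```python
def parse_expr_left_recursive(tokens, pos):
    # expr -> expr + term: expanding the leading expr comes first,
    # so we recurse with the same position and never make progress.
    pos = parse_expr_left_recursive(tokens, pos)
    return pos

def try_parse(tokens):
    """Run the naive parser and report what happened."""
    try:
        parse_expr_left_recursive(tokens, 0)
        return "parsed"
    except RecursionError:
        return "no progress: recursion limit reached"

print(try_parse(["x", "-", "2", "*", "y"]))
```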

Eliminating Left Recursion
Consider a grammar fragment:

A → A α | β

where neither α nor β starts with A.

We can rewrite this as:

A → β A'
A' → α A' | ε

where A' is a new non-terminal

This accepts the same language but uses only right recursion
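The rewrite can be expressed as a small function over the alternatives of one non-terminal. This is a sketch for immediate left recursion only; the function name, the list-of-lists grammar representation and the use of the string "ε" for the empty production are assumptions made for illustration.

```python
EPSILON = "ε"

def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon.

    alternatives is a list of right-hand sides, each a list of symbols.
    """
    new_name = nonterminal + "'"
    # Left-recursive alternatives, with the leading A stripped off (the alphas).
    alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nonterminal]
    # All other alternatives (the betas).
    betas = [rhs for rhs in alternatives if not rhs or rhs[0] != nonterminal]
    if not alphas:
        return {nonterminal: alternatives}     # nothing to do
    return {
        nonterminal: [beta + [new_name] for beta in betas],
        new_name: [alpha + [new_name] for alpha in alphas] + [[EPSILON]],
    }

expr_rules = [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
for lhs, rhss in eliminate_left_recursion("expr", expr_rules).items():
    print(lhs, "->", " | ".join(" ".join(rhs) for rhs in rhss))
```

For the expr productions of the expression grammar this prints `expr -> term expr'` and `expr' -> + term expr' | - term expr' | ε`.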

Eliminating Left Recursion

The expression grammar we have been using contains two cases of left recursion

Eliminating Left Recursion

expr → expr + term | expr – term | term

term → term * factor | term ∕ factor | factor

Eliminating Left Recursion
Applying the transformation yields

expr → term expr'
expr' → + term expr'
| – term expr'
| ε

term → factor term'
term' → * factor term'
| / factor term'
| ε

Eliminating Left Recursion
• These fragments use only right recursion
• A top-down parser will terminate using them.

1. Goal → expr
2. expr → term expr'
3. expr' → + term expr'
4. | – term expr'
5. | ε
6. term → factor term'
7. term' → * factor term'
8. | / factor term'
9. | ε
10. factor → number
11. | id
12. | ( expr )

Predictive Parsing
• If a top-down parser picks the wrong production, it may need to backtrack
• Alternative is to look ahead in input and use context to pick correctly
• How much lookahead is needed?
• In general, an arbitrarily large amount
• Fortunately, large classes of CFGs can be parsed with limited lookahead
• Most programming language constructs fall in those subclasses

LL(1) ... LL(k) Parsing
• Scan input from Left to right
• Do a Leftmost derivation
• Use 1 to k symbols of lookahead
• LL(k) is a top-down parsing technique
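With left recursion removed, the expression grammar can be parsed with a single token of lookahead and no backtracking. The sketch below is one possible recursive-descent parser, written under several assumptions: the input is already split into words, "$" is an end-of-input marker invented for this example, ASCII "-" and "/" stand for the operators, and trees are tuples `(op, left, right)`.

```python
def parse(words):
    """Parse a list of words with the right-recursive expression grammar."""
    tokens = list(words) + ["$"]           # "$" is an assumed end-of-input marker
    pos = 0

    def peek():
        return tokens[pos]                 # the single lookahead token

    def advance():
        nonlocal pos
        pos += 1

    def expr():                            # expr -> term expr'
        return expr_prime(term())

    def expr_prime(left):                  # expr' -> + term expr' | - term expr' | ε
        if peek() in ("+", "-"):
            op = peek()
            advance()
            return expr_prime((op, left, term()))
        return left                        # ε: lookahead is not + or -

    def term():                            # term -> factor term'
        return term_prime(factor())

    def term_prime(left):                  # term' -> * factor term' | / factor term' | ε
        if peek() in ("*", "/"):
            op = peek()
            advance()
            return term_prime((op, left, factor()))
        return left

    def factor():                          # factor -> number | id | ( expr )
        word = peek()
        if word == "(":
            advance()
            inner = expr()
            if peek() != ")":
                raise SyntaxError("expected )")
            advance()
            return inner
        if word.isdigit() or word.isidentifier():
            advance()
            return word
        raise SyntaxError(f"unexpected {word!r}")

    tree = expr()
    if peek() != "$":
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree

print(parse("x - 2 * y".split()))  # ('-', 'x', ('*', '2', 'y'))
```

Each function decides which production to use by looking only at `peek()`, which is what makes the parser predictive. Passing the left operand into `expr_prime` and `term_prime` keeps the operators left-associative even though the grammar is right recursive.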

