
Compiler Construction

Class Notes

Reg Dodds

Department of Computer Science

University of the Western Cape

[email protected]

© 2006, 2017 Reg Dodds

March 22, 2017

Introduction

• What is a Compiler?

• What is an Interpreter?

• Why Compiler Construction?

• What languages?

• An example of a very simple compilation.

• Why write a compiler?

• Layout of a compiler.

1

What is interpretation?

• Let L ∈ 𝓛 be a programming language, with 𝓛 = {Fortran, Lisp, Algol, COBOL, PL/1, BASIC, APL, SNOBOL, Pascal, C, C++, Ada, SQL, Java, ML, Haskell, · · ·}.

• I_L is an interpreter for a program p_L ∈ L, and input ∈ A∗ is data, where A is usually called a character set and A∗ is its Kleene closure, from which I_L computes output data output ∈ A∗.

• The execution of the interpreter may abort and lead to an error condition:

  I_L : L × A∗ → A∗ ∪ {error}

  (p_L, input) —interpreted by I_L→ output ∪ {error}, which we may also write as:

  I_L(p_L, input) = output ∪ {error}

A single process takes place: the source program is directly interpreted.
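The relation I_L(p_L, input) = output ∪ {error} can be made concrete with a toy interpreter. A minimal sketch in C (the '+'/'-' command language is invented here for illustration; it is not from the notes):

#include <stdio.h>

/* interpret(): the program is a string of '+' and '-' commands applied
   to a numeric input; any other character aborts with an error.       */
int interpret(const char *program, int input, int *output) {
    int value = input;
    for (const char *p = program; *p; p++) {
        if (*p == '+') value++;          /* command: increment          */
        else if (*p == '-') value--;     /* command: decrement          */
        else return 1;                   /* abort: the error condition  */
    }
    *output = value;
    return 0;                            /* success: output is valid    */
}

int main(void) {
    int out;
    if (interpret("++-", 40, &out) == 0)
        printf("output = %d\n", out);    /* prints: output = 41         */
    return 0;
}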

2

Making interpreters efficient

• In a production quality interpreter it is advantageous to produce some sort of compact interpretable code by a process that is similar to compilation, once, and then subsequently reinterpret this compact code repeatedly.

This process is used by Java and many interpreters for BASIC such as GWBasic. Typically a command-line interface interprets the command directly.

An even better idea is to compile blocks of code incrementally, directly to executable machine code. When a block is altered its corresponding code is replaced with new code.

Interpreters often have direct access to the original source code—this is very useful for finding errors in the source program. Stepping mechanisms that move line-by-line through the source are easily implemented with interpreters.

3

One view of a compiler

• When compiling is involved, two processes are applied to execute a source program.

• A compiler C_L for a language L translates a syntactically correct source program p_L ∈ L into equivalent machine code.

  [Diagram: source program → Compiler → machine language]

• Examples:

  – A source program in C++ is translated into Mips machine code.

  – Visual Basic source code is compiled into Intel-x86 machine code.

  – A Java source program is translated into JVM byte code.

4

Execution of machine language

• The machine code produced by the compiler is somehow executed by hardware.

• Hardware may be emulated by microcode, or it may be hardwired.

• Some instructions may be entirely executable by hardware.

• Certain instructions may be emulated by microcode.

• The user is usually not aware that some of the machine code instructions, or even all of them, are being emulated.

• On some machines the machine instruction set may change dynamically, depending on the application.

• It is likely that compiled machine code, on any particular machine, runs faster than code running on an interpreter on the same machine.

5

What is compilation?

• The source program p_L ∈ L is first translated by a compiler C_L into an equivalent machine executable program p_M.

• Next p_M is interpreted, or executed, by a machine plus its input to create output and/or an error.

• To run a program: (1) it is compiled and (2) then it is executed. C_L(p_L) = p_M ∪ {error}; if there are no compilation errors then the second step may be invoked: I_M(p_M, input) = output ∪ {error}.

• Notice that the interpreter I_L has now become I_M, which is perhaps hardware.

• Interpreters and computers are different realizations of computing machines.

• Sun's picoJava chip or the Java Virtual Machine on your computer can be used interchangeably to run the same byte code program p_M.

6

Java source program

public class simple {

public static void main (String argsv[]) {

int a;

a = 41;

a = a + 19;

}

}
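The byte code on the next slide can be reproduced with the JDK tools, assuming the source above is saved as simple.java (the exact layout of the disassembly varies between JDK versions):

javac simple.java
javap -c simple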

7

Java byte code

Compiled from "simple.java"

public class simple extends java.lang.Object{

public static void main (java.lang.String[ ]);

public simple();

}

Method void main(java.lang.String[ ]);

0 bipush 41

2 istore 1

3 iload 1

4 bipush 19

6 iadd

7 istore 1

8 return

Method simple()

0 aload 0

1 invokespecial #12 <Method java.lang.Object.<init>()>

4 return

Note there is a main method and a constructor method.

8

Overview of course

• Programs related to compilers.

• The compilation process: phases, intermediate code,

structures.

• Bootstrapping and transfer, T-diagrams, Louden’s

TINY and TM.

• SEPL: interpreter, emulator, compiler.

9

Programs related to compilers

(Louden p 4-6)

• interpreters

• assemblers

• linkers

• loaders

• preprocessors

• editors

• debuggers

• profilers

• project managers—SCCS and RCS

10

The compilation process

(Louden p 7, 8-14)

• Phases, with the intermediate code each produces:

  – source code
  – scanner—lexical analyser → tokens
  – syntax analyser → abstract syntax tree
  – semantic analyser → annotated syntax tree
  – intermediate code optimizer → intermediate code
  – code generator → target code
  – target code optimizer → optimized code
  – linker-loader → executable code

• Structures

  – literals
  – symbol table
  – error handler
  – temporary files

11

Bootstrapping and transfer of programming languages

(Louden ¶1.6, p 18-21)

• T-diagrams—next slide.

• Pascal in 1970 on CDC 6600.

• P-code compiler for Pascal in 1973.

• P-code emulators written in Algol 60, and in Fortran, led to widespread usage of Pascal. (Why?)

12

T-diagrams

• A T-diagram represents a Source language being run in Host code to produce a Target language.

  [T-diagram: Source on the left arm, Target on the right arm, Host in the stem]

• Let two compilers run on the same host machine. One compiler translates from language Start into an intermediate language IL and the other compiler translates from IL into language Final.

  [T-diagrams: (Start → IL on Host) composed with (IL → Final on Host) yields (Start → Final on Host)]

  We have produced a system that can compile from Start into Final.

13

T-diagrams

• One compiler for Pascal creates P-code, but runs on machine M.

• Another processor running on M can generate code for machine N.

  [T-diagrams: the Pascal → P-code compiler on M combined with a processor on M that generates code for machine N]

• We have produced a system that can compile from Pascal into P-code on a new machine.

14

T-diagrams: compiler-OLD ⇒ compiler-NEW

[T-diagram repeated from the previous slide: the Pascal → P-code compiler on M and the processor generating code for N]

15

T-diagrams

• define the SEPL language.

• write an interpreter for it.

• develop a machine emulator—or use an available one.

• develop a compiler that compiles to our machine's machine code.

• add an optimizing phase to the compiler.

• alter the compiler to produce code for another machine.

16

Students' Educational Programming Language (SEPL)

Various projects lie ahead.

• Define the SEPL language—Louden calls his ‘TINY’.

• Develop its syntax and informal semantics.

• Write an interpreter for it using flex/lex and bison/yacc.

• Decide on a target machine.

• Develop a machine emulator for the target or use a real machine.

• Develop a compiler that produces executable code.

• Introduce an optimization phase—not really enough time.

• How much time is required to produce the compiler?

17

Scanning—Lexical analysis

(Louden Chapter 2)

• producing tokens from lexemes is done quite well by flex.

• regular expressions (Louden p 38).

• extension of the notation for regular expressions—does not give the notation any more power, but simplifies its practical use.

• regular expressions are widely used: flex, vim, sed, emacs, python, bash, tcl/Tk, grep, awk, perl, etc.

• regular expressions and FSAs (Louden p 47–).

• DFSA–FSA relationship (Louden p 46–72).

• minimization of the number of states.

• Louden's TINY scanner: gives insight into the direct connection between an FSA and a scanner. (Louden ¶2.5)

• application of flex for scanning—lexical analysis.
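The direct FSA–scanner connection can be seen in a small hand-written scanner. A minimal sketch in C (invented for illustration, not Louden's TINY scanner): each if branch plays the role of a DFA state, and each loop is that state's self-transition.

#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_EOF, TOK_ERROR } TokenType;

static char lexeme[64];                    /* text of the last token    */

TokenType get_token(FILE *in) {
    int c = getc(in), i = 0;
    while (isspace(c)) c = getc(in);       /* START state: skip blanks  */
    if (c == EOF) return TOK_EOF;
    if (isalpha(c)) {                      /* IN_ID state               */
        while (isalnum(c) && i < 63) { lexeme[i++] = c; c = getc(in); }
        ungetc(c, in); lexeme[i] = '\0';   /* put back the lookahead    */
        return TOK_ID;
    }
    if (isdigit(c)) {                      /* IN_NUM state              */
        while (isdigit(c) && i < 63) { lexeme[i++] = c; c = getc(in); }
        ungetc(c, in); lexeme[i] = '\0';
        return TOK_NUM;
    }
    return TOK_ERROR;                      /* no transition: error      */
}

int main(void) {
    TokenType t;
    while ((t = get_token(stdin)) != TOK_EOF && t != TOK_ERROR)
        printf("%s: %s\n", t == TOK_ID ? "id" : "number", lexeme);
    return 0;
}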

18

Context-free languages (CFLs) and syntax analysis

(Louden Chapter 3)

• Syntax analysers are based on CFLs.

[Diagram: list of tokens → syntax analyser → abstract syntax tree]

• syntaxtree = analyse();

19

Parse trees

• have dynamic structure.

• recursive structure.

• tree keeps track of attributes such as:

  types,
  scope,
  liveness,
  nesting and
  values.

[Annotated parse tree for a[i] = 6;: an assignment node whose left child is a subscript expression—id_a of type integer[] subscripted by id_i of type integer, the whole of type integer—and whose right child is number 6 of type integer]

20

Context-free grammars (CFGs)

(Louden ¶3.2)

• Formally a CFG is a four-tuple G = (N, T, P, S) where N and T are alphabets, N is the set of non-terminals—or variables—and T is the set of terminals, P ⊆ N × (N ∪ T)∗ is the set of production rules and S ∈ N is the start symbol.

• Example:

  N = {exp, op},
  T = {number, +, −, ∗},
  P = {exp → exp op exp | (exp) | number,
       op → + | − | ∗} and
  S = exp

• Note that number is treated as a token.

• The source string (117 − 17) ∗ 5 is first tokenized to (number − number) ∗ number before it is analysed.

• P₁ = {E → E O E | (E) | n, O → + | − | ∗} is a set of productions not different from P.

21

Derivations

• sentential form: any string ∈ (N ∪ T)∗ derived from S, the start symbol.

• direct derivation: one production is applied to a part of a sentential form, matching a non-terminal with the left-hand side of the production and replacing it with the right-hand side.

• Example: The production exp → (exp) can be applied to bring about the direct derivation exp ∗ number ⇒ (exp) ∗ number.

• derivation: a chain of direct derivations applied one after the other to transform the sentential form s₀ to another sentential form sₙ. It is written as s₀ ⇒∗ sₙ.

• language: all strings s ∈ T∗ that can be derived from the start symbol S, symbolically: L(G) = {s ∈ T∗ | S ⇒∗ s}.

22

Derivation: exp ⇒∗ (number − number) ∗ number

[exp → exp op exp]   exp ⇒ exp op exp
[exp → number]           ⇒ exp op number
[op → ∗]                 ⇒ exp ∗ number
[exp → (exp)]            ⇒ (exp) ∗ number
[exp → exp op exp]       ⇒ (exp op exp) ∗ number
[exp → number]           ⇒ (exp op number) ∗ number
[op → −]                 ⇒ (exp − number) ∗ number
[exp → number]           ⇒ (number − number) ∗ number

23

language, sentence, examples

• language: all strings s ∈ T∗ that can be derived from the start symbol S, symbolically: L(G) = {s ∈ T∗ | S ⇒∗ s}

• sentence: the elements of the language L(G), s ∈ L(G), are known as sentences.

Example: G = ({E}, {a, (, )}, {E → (E) | a}, E)

E → a, i.e. E ⇒ a, i.e. E ⇒∗ a, so a ∈ L(G). Similarly E ⇒ (E) ⇒ (a), i.e. E ⇒∗ (a), and E ⇒ (E) ⇒ ((E)) ⇒ ((a)), i.e. E ⇒∗ ((a)).

Theorem: E ⇒∗ (ⁿa)ⁿ, ∀n ∈ ℕ₀, where (ⁿa)ⁿ denotes n opening parentheses, an a, and n closing parentheses.

Proof: Using induction.

P₀: E ⇒∗ (⁰a)⁰ = a, since E ⇒ a by E → a.

P₁: E ⇒∗ (a), because E ⇒ (E) ⇒ (a).

Pₖ ⊢ Pₖ₊₁: Assume that Pₖ holds, i.e. E ⇒∗ (ᵏa)ᵏ. Now E → (E), in other words E ⇒ (E) ⇒∗ ((ᵏa)ᵏ) ≡ (ᵏ⁺¹a)ᵏ⁺¹, and so E ⇒∗ (ⁿa)ⁿ, ∀n ∈ ℕ₀,

i.e. L(G) = {(ⁿa)ⁿ | n ∈ ℕ₀}.

24

Examples

Problem with empty base

If P = {E → (E)} then L(G) = { } = ∅. This is empty because it is impossible to form the bases P₀ or P₁. Since the base does not exist an infinite regress ensues.

However, we can prove that E ⇒∗ (ⁿE)ⁿ, but this is of little value, since E cannot derive a terminal string.

CFL using regular expressions

If P = {E → E + a | a}, then L(G) = a(+a)∗, where a(+a)∗ ≡ {a, a + a, a + a + a, . . .}.

25

An if -statement

G = ({statement, if-statement, expression},
     {0, 1, if, else, other},
     {statement → if-statement | other,
      if-statement → if (expression) statement
                   | if (expression) statement else statement,
      expression → 0 | 1},
     statement)

and L(G) =

{ other, if (0) other, if (1) other,
  if (0) other else other, if (1) other else other,
  if (0) if (0) other, if (1) if (0) other,
  if (0) if (1) other, if (1) if (1) other,
  if (0) if (0) other else other,
  if (1) if (0) other else other,
  if (0) if (1) other else other,
  if (1) if (1) other else other,
  . . . }

26

The use of ε

Consider the grammar—we only show the productions P:

{statement → if-statement | other,
 if-statement → if (expression) statement
              | if (expression) statement else statement,
 expression → 0 | 1}

It may be rewritten using an ε-grammar as follows:

{statement → if-statement | other,
 if-statement → if (expression) statement else-part,
 else-part → else statement | ε,
 expression → 0 | 1}

ε is also useful for lists:

list → statement ; list | statement
statement → s

This generates the language L(G) = {s, s; s, s; s; s, . . .} ≡ s⁺. It is rewritten using ε as follows:

list → non-ε-list | ε
non-ε-list → statement ; non-ε-list | statement
statement → s

27

Left- and right recursion

The regular language a⁺ is represented as follows with left recursive productions: A → Aa | a. a ∈ L(G) since A → a, thus A ⇒∗ a; but A → Aa, so A ⇒∗ aa, and A may again be replaced using A → Aa, so that A ⇒∗ aaa. It is simple to prove with mathematical induction that L(G) = a⁺.

Our notation is rather informal: the set represented by a⁺ was formerly represented more exactly by L(a⁺), which represents the set {a, aa, aaa, . . .}.

Similarly we can prove that a grammar using the right recursive productions A → aA | a generates the same language.

How is a∗ represented?

A → Aa | ε or using A → aA | ε

What is L(G) for the grammar with the productions A → (A)A | ε?

28

Parse trees and abstract syntax trees (ASTs)

• It is convenient to distinguish between a parse tree and an abstract syntax tree.

• An abstract syntax tree is often called a syntax tree.

• A parse tree contains all the information concerning the syntactical structure of the derivation.

• Consider the parse tree and its corresponding stripped down (abstract) syntax tree generated by the derivation on the next slide.

• Syntax trees usually show the actual values at the terminals and not merely the tokens.

29

Right derivation for exp ⇒∗ (number − number) ∗ number

The derivation below is executed in a determinate order. The rightmost non-terminal is replaced in each step until no more non-terminals remain.

(1) [exp → exp op exp]   exp ⇒ exp op exp
(2) [exp → number]           ⇒ exp op number
(3) [op → ∗]                 ⇒ exp ∗ number
(4) [exp → (exp)]            ⇒ (exp) ∗ number
(5) [exp → exp op exp]       ⇒ (exp op exp) ∗ number
(6) [exp → number]           ⇒ (exp op number) ∗ number
(7) [op → −]                 ⇒ (exp − number) ∗ number
(8) [exp → number]           ⇒ (number − number) ∗ number

64

Parse tree and syntax tree for the derivation exp ⇒∗ (29 - 11) * 47

• Parse tree for (29 - 11) * 47:

  [Diagram: the parse tree of the right derivation, with interior nodes exp and op and the tokens ( 29 - 11 ) * 47 at the leaves; the numbers 1–8 mark the order in which the productions were applied]

• Syntax tree for (29 - 11) * 47:

        *
       / \
      -   47
     / \
   29   11

65


Parse tree for the right derivation of exp ⇒∗ (number − number) ∗ number

[Diagram: the parse tree, with interior nodes exp and op and the leaves ( number − number ) ∗ number; the numbers 1–8 mark the order in which the productions were applied]

65

Leftmost derivation for exp ⇒∗ (number − number) ∗ number

The derivation below is executed in a determinate order. The leftmost non-terminal of the sentential form is replaced each time until there are no more non-terminals.

(1) [exp → exp op exp]   exp ⇒ exp op exp
(2) [exp → (exp)]            ⇒ (exp) op exp
(3) [exp → exp op exp]       ⇒ (exp op exp) op exp
(4) [exp → number]           ⇒ (number op exp) op exp
(5) [op → −]                 ⇒ (number − exp) op exp
(6) [exp → number]           ⇒ (number − number) op exp
(7) [op → ∗]                 ⇒ (number − number) ∗ exp
(8) [exp → number]           ⇒ (number − number) ∗ number

66

A parse tree for the derivation of exp ⇒∗ (number − number) ∗ number

[Diagram: the parse tree produced by the leftmost derivation—identical in shape to the tree for the rightmost derivation; only the order in which the nodes were expanded differs]

67

Right and left derivations for number + number

A left derivation:

(1) exp ⇒ exp op exp
        ⇒ number op exp
        ⇒ number + exp
        ⇒ number + number

[Diagram: parse tree exp over (exp op exp) over the leaves number + number; the numbers 1–4 mark the expansion order]

68

Rightmost derivation

A rightmost derivation for number + number:

(1) exp ⇒ exp op exp
        ⇒ exp op number
        ⇒ exp + number
        ⇒ number + number

[Diagram: the same parse tree; only the expansion order (marked 1–4) differs]

69

Ambiguous grammars

The grammar with

P = { exp → exp op exp | (exp) | number,
      op → + | − | ∗ }

is ambiguous because some strings have two different parse trees. Such a string will also therefore have two different leftmost—and rightmost—derivations, because each parse tree has a unique leftmost derivation.

[Diagram: one of the two parse trees for number − number ∗ number]

and now the other tree.

70

Ambiguous grammars

A different parse tree for number − number ∗ number

[Diagram: the second of the two parse trees]

Ambiguous: If two different parse trees can be derived from a given grammar for the same string, then the grammar is ambiguous.

It is preferable to use an unambiguous grammar for defining a computing language.

Ambiguity can be eliminated in two ways: the grammar can be altered so that it becomes unambiguous, or—the way bison/yacc does it—precedence rules or associativity rules can be applied when there are ambiguities.

71

The dangling else problem (Louden p.120–123)

The string if (0) if (1) other else other has two parse trees. This is the dangling else problem.

[Diagrams: the two parse trees—one attaches the else part to the outer if (0) …, the other to the inner if (1) …]

72

The dangling else problem

The C code

if (x != 0)
    if (y == 1/x) OK = TRUE;
    else z = 1/x;

could have had two interpretations:

if (x != 0) {                      /* else attached to outer if */
    if (y == 1/x) OK = TRUE;
}
else z = 1/x;

if (x != 0) {                      /* else attached to inner if */
    if (y == 1/x) OK = TRUE;
    else z = 1/x;
}

C disambiguates if with the most-closely-nested rule, which resolves the ambiguity in favour of the second interpretation.

The grammar rules may be adapted as follows:

if-statement → matched | unmatched
matched → if (exp) matched else matched
        | other
unmatched → if (exp) if-statement
          | if (exp) matched else unmatched
exp → 0 | 1

The next slide shows the unambiguous parse tree.

73

An unambiguous grammar for C’s if-statement

if-statement → matched | unmatched
matched → if (exp) matched else matched
        | other
unmatched → if (exp) if-statement
          | if (exp) matched else unmatched
exp → 0 | 1

[Diagram: the unique parse tree for if (0) if (1) other else other under this grammar—the outer if derives an unmatched statement whose body is the matched if (1) other else other, so the else attaches to the inner if]

74

Representations of syntax: BNF

• BNF—Backus-Naur form.

The metasymbol ::= is used like → in production rules, and | separates alternatives. Angle brackets < and > delimit non-terminals. Terminals are written in plain text, or in bold face.

The code below defines a <program>:

<program> ::= program
              <declaration-list>
              begin
              <statement-list>
              end.

A program starts with program, and is followed by a list of declarations, then a begin, and a list of statements terminated with end and a full stop.

• EBNF—Extended BNF. BNF was made more convenient to use by extending it slightly.

75

Representations of syntax: EBNF

• EBNF—Extended BNF.

Put optional items inside brackets [ and ]:

<if-statement> ::=
    if <boolean> then <statement-list>
    [else <statement-list>]
    end if ;

Repetition is done using braces, { and }:

<identifier> ::=
    <letter> { <letter> | <digit> }

An <identifier> is a word that starts with a letter and has any number of letters or digits.

<statement-list> ::=
    <statement> { ; <statement> }

A <statement-list> is a <statement> or a list of <statement>s separated by semicolons.
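The EBNF metasymbols map directly onto control structures in a recursive-descent parser, which is what makes the notation convenient. A minimal sketch in C (token, match and statement are assumed helpers, declared but not defined here):

extern int token;             /* current lookahead token              */
void match(int expected);     /* consume the token if it matches      */
void statement(void);

void statement_list(void) {
    statement();
    while (token == ';') {    /* { ; <statement> }: zero or more      */
        match(';');
        statement();
    }
}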

76

Representations of syntax: EBNF

• tramline diagrams—used by Wirth for Pascal, and for ANS Fortran.

• two-level grammar—Algol 68.

• etc.

77

Formal properties of CFLs

(Louden p.128–133)

• See Louden.

78

The Chomsky hierarchy (Louden p. 131)

Chomsky type: Description

3: Regular languages. Let A ∈ N and α ∈ T; then the productions in the grammar have the form A → α or A → Aα—or alternatively the recursion may be on the right (A → αA). Only one kind of recursion may be present, i.e. left or right—otherwise G is a CFL.

2: Let A ∈ N and γ ∈ (N ∪ T)∗ and A → γ. In a context-free language A can always be replaced in any context by γ.

1: If the production A → γ is in a context sensitive language, then it may be applied only in a predetermined context, i.e., A may produce γ only if A is in a given context, e.g. αAβ → αγβ, where α ≠ ε. Such a rule is context sensitive. An example of context sensitivity is the restriction that variables must be declared before they may be used.

0: Phrase structure grammars are the most powerful.

79

Top-down parsing (Louden Chapter 4)

• Recursive-descent

• LL(1) parsing

• first and follow sets

• Error recovery in top-down parsers

80

Top-down parsing

• A top-down parser executes a leftmost derivation. It starts from the start symbol and works its way down to the terminals in the form of tokens.

• Predictive parser: attempts to forecast the next construction by using lookahead tokens.

• Backtracking parser: attempts different possibilities for parsing the known input, and backs up when it hits dead ends.

  – Slower than predictive parsers.
  – May use exponential time.
  – More powerful.

• Recursive-descent parsing is usually applied in hand-written compilers—Wirth's compilers often use RD parsers. Your 1st-year compiler was RD.

• LL(1) parsing: the L on the left—the input is followed from left to right; the L on the right—the derivation is leftmost. The 1 means that only one token is used to predict the progress of the parser.

81

LL(1) parsing

• LL(1) parsers work from left to right through the input and follow a leftmost derivation that uses one lookahead token.

• Viable-prefix property—in such languages it is easy to see very quickly that there is an error, when the lookahead token does not correspond with what we expect. The viable prefix corresponds to first.

• LL(k) parsers are also possible where k > 1. It is more difficult to see errors.

• first and follow sets derived from the grammar are used to construct the tables that will be used for LL(1) parsing.

82

first and follow sets

• The set first(X), where X is a terminal or ε, is simply {X}.

• Suppose X is a nonterminal; then first(X) is the set of all terminals x such that X ⇒∗ xβ, where β may be ε.

• In other words first is the set of leading terminals of the sentential forms derivable from X.

• The definition may be altered to accommodate LL(k) parsers by replacing x with strings of k terminals, or, if β is ε, |x| < k. (See also Louden p. 168)

83

first sets

• In the grammar for arithmetic expressions:

  exp → exp addop term | term
  addop → + | −
  term → term mulop factor | factor
  mulop → ∗
  factor → ( exp ) | number

• first(addop) = { +, − }

• first(mulop) = { ∗ }

• first(exp) = { (, number }

• first(term) = { (, number }

• first(factor) = { (, number }

84

first in the grammar for an if-statement

G = ({statement, if-statement, expression},
     {0, 1, if, else, rest},
     {statement → if-statement | rest,
      if-statement → if (expression) statement else-part,
      else-part → else statement | ε,
      expression → 0 | 1},
     statement)

first(statement) = {if, rest}

first(expression) = {0, 1}

first(if-statement) = {if}

first(else-part) = {else, ε}

85

Basic LL(1) parsing (Louden p. 152)

• LL(1) parsers use a push-down stack rather than backtracking from recursive procedure calls.

• Consider S → ( S ) S | ε

• Initialize the stack to $S

• Parse actions:

     Parsing stack   Input   Action
  1  $ S             ()$     S → (S)S
  2  $ S)S(          ()$     match
  3  $ S)S           )$      S → ε
  4  $ S)            )$      match
  5  $ S             $       S → ε
  6  $               $       accept

• Two actions:

  1. Replace A ∈ N at the top of the stack by α, where A → α and α ∈ (N ∪ T)∗, and

  2. Match the token on top of the stack with the next input token.

86

LL(1) parsing

• Parse actions:

     Parsing stack   Input   Action
  1  $ S             ()$     S → (S)S
  2  $ S)S(          ()$     match
  3  $ S)S           )$      S → ε
  4  $ S)            )$      match
  5  $ S             $       S → ε
  6  $               $       accept

• At step 1 the stack contains $S and the input is ()$.

• Apply rule S → (S)S.

• The RHS is pushed item-by-item onto the stack, so that it appears reversed.

• Remove the matched ( on top of the stack in step 2, because it matches the token at the start of the input.
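The stack machine of the last two slides fits in a few lines of C. A minimal sketch (invented here, not from Louden) of a table-driven LL(1) parser for S → ( S ) S | ε, with the table entries hard-coded:

#include <stdio.h>

int parse(const char *input) {
    char stack[100];
    int top = 0;
    int pos = 0;
    stack[top++] = '$';
    stack[top++] = 'S';               /* initialize the stack to $S   */
    while (top > 0) {
        char X = stack[top - 1];
        char a = input[pos] ? input[pos] : '$';
        if (X == 'S') {               /* nonterminal: consult M[S,a]  */
            top--;
            if (a == '(') {           /* M[S,(] = S -> (S)S           */
                stack[top++] = 'S';   /* push the RHS reversed        */
                stack[top++] = ')';
                stack[top++] = 'S';
                stack[top++] = '(';
            }                         /* M[S,)] = M[S,$] = S -> ε:
                                         push nothing                 */
        } else if (X == a) {          /* terminal or $: match         */
            top--;
            if (X == '$') return 1;   /* accept                       */
            pos++;
        } else {
            return 0;                 /* error                        */
        }
    }
    return 0;
}

int main(void) {
    printf("%d\n", parse("()"));      /* prints 1: accepted           */
    printf("%d\n", parse("(()"));     /* prints 0: rejected           */
    return 0;
}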

87

LL(1) recursion-free productions for arithmetic (Louden p.160)

exp → term exp′
exp′ → addop term exp′ | ε
addop → + | −
term → factor term′
term′ → mulop factor term′ | ε
mulop → ∗
factor → ( exp ) | number

88


Parse tree and syntax tree for 3-4-5 (Louden p. 161)

• The parse tree for the expression 3 − 4 − 5 does not represent the left associativity of subtraction.

• The parser should still construct the left associative syntax tree.

  1. The value 3 must be passed up to the root exp.
  2. The root exp hands 3 down to exp′, which subtracts 4 from it.
  3. The resulting −1 is passed down to the next exp′,
  4. which subtracts 5 yielding −6,
  5. which is passed to the next exp′.
  6. The rightmost exp′ has an ε child and finally passes the −6 back to the root exp.

87

Building the syntax tree with an LL(1)-grammar

Implement exp → term exp′ as follows (writing exp′ as exp_prime in C):

void exp(void) { term(); exp_prime(); }

To compute the expression it is rewritten as:

int exp(void) {
    int temp = term();
    return exp_prime(temp);
}

88

Code for arithmetic

The code for exp′ → addop term exp′ | ε is

void exp_prime(void) {
    switch (token) {
    case '+': match('+'); term(); exp_prime(); break;
    case '-': match('-'); term(); exp_prime(); break;
    }
}

To compute the expression it could be rewritten as:

int exp_prime(int val) {
    switch (token) {
    case '+': match('+'); val += term(); return exp_prime(val);
    case '-': match('-'); val -= term(); return exp_prime(val);
    default:  return val;
    }
}

Note that exp′ requires a parameter passed from exp.
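Putting the pieces together gives a complete evaluator for the recursion-free productions above. A minimal sketch in C (assumptions made here: single-character tokens, single-digit numbers, no error recovery; exp′ and term′ are written exp_prime and term_prime):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *input;            /* rest of the input string      */
static int token;                    /* one-token lookahead           */

static void next(void) { token = *input ? *input++ : '$'; }
static void error(void) { fprintf(stderr, "syntax error\n"); exit(1); }
static void match(int expected) { if (token == expected) next(); else error(); }

static int exp_(void);               /* forward declaration           */

static int factor(void) {            /* factor -> ( exp ) | number    */
    if (token == '(') { match('('); int v = exp_(); match(')'); return v; }
    if (!isdigit(token)) error();
    int v = token - '0';
    next();
    return v;
}

static int term_prime(int val) {     /* term' -> mulop factor term' | ε */
    if (token == '*') { match('*'); val *= factor(); return term_prime(val); }
    return val;
}

static int term(void) { return term_prime(factor()); }

static int exp_prime(int val) {      /* exp' -> addop term exp' | ε   */
    switch (token) {
    case '+': match('+'); val += term(); return exp_prime(val);
    case '-': match('-'); val -= term(); return exp_prime(val);
    default:  return val;
    }
}

static int exp_(void) { return exp_prime(term()); }  /* exp -> term exp' */

int main(void) {
    input = "3-4-5";
    next();
    printf("%d\n", exp_());          /* prints -6: left associative   */
    return 0;
}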

89

Left factoring

Left factoring is needed when right-hand sides of productions share a common prefix, e.g.

A → αβ | αγ

Typical practical examples are:

stmt-sequence → stmt ; stmt-sequence | stmt
stmt → s

and

if-stmt → if ( exp ) statement
        | if ( exp ) statement else statement

An LL(1) parser cannot distinguish between such productions. The solution is to factor out the common prefix as follows:

A → αA′, A′ → β | γ

For factoring to work properly α should be the longest common left prefix.

Louden gives a left-factoring algorithm and many examples on pp.164–166.

90

follow sets

• In this discussion we regard $ as a terminal.

• Recall that first(A) is the set of leading terminals of the sentential forms derivable from A.

• Informally, follow(A) is the set of terminals that may be derived from nonterminals appearing after A on the right-hand side of productions, or it is the set of those terminals that follow A in such productions.

• Since $ is regarded as a terminal, if A is the start symbol then $ is in follow(A).

• Formally: follow(A) is the set of terminals such that if there is a production B → αAγ,

  1. then first(γ) \ {ε} is in follow(A), and

  2. if ε is in first(γ), then follow(A) contains follow(B).

• follow sets are only defined for nonterminals.

91

An algorithm for follow(A)—Algol style

for all nonterminals A do
    follow(A) := { };
follow(start-symbol) := {$};
while there are changes to any follow sets do
    for each production A → X₁X₂ . . . Xₙ do
        for each Xᵢ that is a nonterminal do
            add first(Xᵢ₊₁Xᵢ₊₂ . . . Xₙ) \ {ε} to follow(Xᵢ)
            comment if i = n then Xᵢ₊₁Xᵢ₊₂ . . . Xₙ = ε;
            if ε ∈ first(Xᵢ₊₁Xᵢ₊₂ . . . Xₙ) then
                add follow(A) to follow(Xᵢ)

92

An algorithm for follow(A)—C-style

for (all nonterminals A)
    follow(A) = { };
follow(start-symbol) = {$};
while (there are changes to any follow sets)
    for (each production A → X₁X₂ . . . Xₙ)
        for (each Xᵢ that is a nonterminal) {
            add first(Xᵢ₊₁Xᵢ₊₂ . . . Xₙ) \ {ε} to follow(Xᵢ);
            /* Note: if i = n then Xᵢ₊₁Xᵢ₊₂ . . . Xₙ = ε */
            if (ε ∈ first(Xᵢ₊₁Xᵢ₊₂ . . . Xₙ))
                add follow(A) to follow(Xᵢ);
        }
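The algorithm is easy to run mechanically. A minimal sketch in C (the bitmask encoding of terminal sets is an assumption made here) that computes the follow sets of the arithmetic grammar numbered (1)–(9) on the following slides, iterating to a fixed point:

#include <stdio.h>

enum { PLUS = 1, MINUS = 2, STAR = 4, LP = 8, RP = 16, NUM = 32, END = 64 };
enum { E, A, T, M, F, NT };          /* exp, addop, term, mulop, factor */

static int follow[NT];
static const int firstA = PLUS | MINUS, firstT = LP | NUM,
                 firstM = STAR,         firstF = LP | NUM;

/* add bits to a set; report whether the set changed */
static int add(int *set, int bits) {
    int changed = (*set | bits) != *set;
    *set |= bits;
    return changed;
}

int main(void) {
    int changed = 1;
    follow[E] = END;                           /* follow(start) = {$}   */
    while (changed) {                          /* until nothing changes */
        changed = 0;
        /* (1) exp -> exp addop term */
        changed |= add(&follow[E], firstA);    /* first(addop) \ {ε}    */
        changed |= add(&follow[A], firstT);    /* first(term) \ {ε}     */
        changed |= add(&follow[T], follow[E]); /* end of rule           */
        /* (2) exp -> term */
        changed |= add(&follow[T], follow[E]);
        /* (5) term -> term mulop factor */
        changed |= add(&follow[T], firstM);
        changed |= add(&follow[M], firstF);
        changed |= add(&follow[F], follow[T]);
        /* (6) term -> factor */
        changed |= add(&follow[F], follow[T]);
        /* (8) factor -> ( exp ) */
        changed |= add(&follow[E], RP);        /* first( ) ) = { ) }    */
    }
    printf("follow(exp)    = %#04x\n", follow[E]);  /* { $ + - ) }     */
    printf("follow(term)   = %#04x\n", follow[T]);  /* { $ + - * ) }   */
    printf("follow(factor) = %#04x\n", follow[F]);  /* { $ + - * ) }   */
    return 0;
}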

93

Construct follow from the first set

• In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

• first(addop) = { +, - }

• first(mulop) = { * }

• first(factor) = { (, number }

• first(term) = { (, number }

• first(exp) = { (, number }

94

Constructing follow from first

• In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

• Ignore (3), (4), (7) and (9)—no RH nonterminals

• Set all follow(A) = { }; follow(exp) = {$}

• (1) affects the follow of exp, addop and term:

  first(addop) is added to follow(exp), so follow(exp) = { $, -, + } and

  first(term) is added to follow(addop), so follow(addop) = { (, number } and

  follow(exp) is added to follow(term), so follow(term) = { $, +, - }

95

Constructing follow from first

• In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

• (2) causes follow(exp) to be added to follow(term), which does not add anything new.

• (5) is similar to (1). first(mulop) is added to follow(term), so follow(term) = { $, +, -, * } and

  first(factor) is added to follow(mulop), so follow(mulop) = { (, number } and

  follow(term) is added to follow(factor), so follow(factor) = { $, +, -, * }

96

Constructing follow from first

• In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

• (6) adds follow(term) to follow(factor)—no effect.

• (8) adds first()) = { ) } to follow(exp), such that follow(exp) = { $, +, -, ) }

• During the second pass, (1) adds ) to follow(term), and (5) and (6) then add it to follow(factor), so that follow(term) = { $, +, -, *, ) } and follow(factor) = { $, +, -, *, ) }

97

Constructing LL(1) parse tables

The parse table M[A, a] contains productions added according to the rules:

1. If A → α is a production rule such that there is a derivation α ⇒∗ aβ, where a is a token, then the rule A → α is added to M[A, a].

2. If A → α ⇒∗ ε is an ε-production and there is a derivation S$ ⇒∗ αAaβ, where S is the start symbol and a is a token, or $, then the production A → ε is added to M[A, a].

The token a in Rule 1 is in first(α) and the token in Rule 2 is in follow(A). This is repeatedly applied for each nonterminal A and each production A → α:

1. For each token a in first(α), add A → α to the entry M[A, a].

2. If ε ∈ first(α), for each element a ∈ follow(A), add A → α to M[A, a].

98

Characterizing an LL(1) grammar

A grammar in BNF is LL(1) if the following conditions

are satisfied:

1. For every production A → α₁ | α₂ | . . . | αₙ, first(αᵢ) ∩ first(αⱼ) is empty for all i and j, i, j ∈ [1..n], i ≠ j.

2. For every nonterminal A such that ε ∈ first(A), first(A) ∩ follow(A) is empty.

99

Examples

• See Louden’s examples on p. 178–180.

100

Bottom-up parsing

• Overview.

• Finite automata of LR(0) items

and LR(0) parsing.

• SLR(1) parsing.

• General LR(1) and LALR(1) parsing.

• bison—an LALR(1) parser generator.

• Generation of a parser using bison.

• Error recovery in bottom-up parsers.

101

Bottom-up parsing—an overview

• The most general bottom-up parser is the LR(1) parser—the L indicates that the input is processed from left to right, the R indicates that a rightmost derivation is applied, and the 1 indicates that a single token is used for lookahead.

• LR(0) parsers are also possible where there is no lookahead, i.e. the "lookahead" token can be examined only after it appears on the parse stack.

• SLR(1) parsers improve on LR(0) parsing.

• An even more powerful method, but still not as general as LR(1) parsing, is the LALR(1) parser.

• Bottom-up parsers are generally more powerful than their top-down counterparts—for example left recursion can be handled.

• Bottom-up parsers are unsuitable for hand coding, so parser generators like bison are used.

102

Bottom-up parsing—overview

• The parse stack contains tokens and nonterminals plus state information.

• The parse stack starts empty, and ends with the start symbol alone on the stack and an empty input string.

• Actions: shift, reduce and accept.

• A shift merely moves a token from the input to the top of the stack.

• A reduce replaces the string α on top of the stack with a nonterminal A, given A → α.

• Top-down parsers are generate-match parsers and bottom-up parsers are shift-reduce parsers.

• If the grammar does not possess a unique start symbol that appears only once in the grammar, then the grammar is augmented with such a start symbol.

103

Bottom-up parse of ()

• Consider the grammar with P = {S → ( S ) S | ε}.

• Augment it by adding: S′ → S.

• A bottom-up parse for the parenthesis grammar of () follows:

     Parsing stack   Input   Action
  1  $               ()$     shift
  2  $ (             )$      reduce S → ε
  3  $ ( S           )$      shift
  4  $ ( S )         $       reduce S → ε
  5  $ ( S ) S       $       reduce S → ( S ) S
  6  $ S             $       reduce S′ → S
  7  $ S′            $       accept

• The bottom-up parser looks deeper into its parse stack and thus requires arbitrary stack lookahead.

• The derivation is: S′ ⇒ S ⇒ (S)S ⇒ (S) ⇒ ()

  Clearly the rightmost nonterminal is replaced at each derivation step.

104

A bottom-up parse of + grammar

• Consider the grammar with P = {E → E + n | n}.

• Augment it by adding: E′ → E.

• A bottom-up parse for the + grammar of n + n:

     Parsing stack   Input    Action
  1  $               n + n$   shift
  2  $ n             + n$     reduce E → n
  3  $ E             + n$     shift
  4  $ E +           n$       shift
  5  $ E + n         $        reduce E → E + n
  6  $ E             $        reduce E′ → E
  7  $ E′            $        accept

• The derivation is: E′ ⇒ E ⇒ E + n ⇒ n + n

  We see that the rightmost nonterminal is replaced at each derivation step.

105

Bottom-up parse—overview

     Parsing stack   Input    Action
  1  $               n + n$   shift
  2  $ n             + n$     reduce E → n
  3  $ E             + n$     shift
  4  $ E +           n$       shift
  5  $ E + n         $        reduce E → E + n
  6  $ E             $        reduce E′ → E
  7  $ E′            $        accept

• In the derivation E′ ⇒ E ⇒ E + n ⇒ n + n, each of the intermediate strings is called a right sentential form, and it is split between the parse stack and the input.

• E + n occurs in step 3 of the parse as E ‖ + n, as E + ‖ n in step 4, and finally as E + n ‖.

• The string of symbols on top of the stack is called a viable prefix of the right sentential form. E, E+ and E + n are all viable prefixes of E + n.

• The viable prefixes of n + n are ε and n, but n+ and n + n are not.

106

Bottom-up parse—overview

• A shift-reduce parser will shift terminals onto the stack until it can perform a reduction to obtain the next right sentential form.

• This occurs when the stack top matches the right-hand side of a production.

• This string, together with the position in the right sentential form where it occurs and the production used to reduce it, is known as the handle.

• Handles are unique in unambiguous grammars.

• The handle of n + n is thus E → n, and the handle of E + n, to which the previous form is reduced, is E → E + n.

• The main task of a shift-reduce parser is finding the next handle.

107

Bottom-up parse—overview

     Parsing stack   Input   Action
  1  $               ()$     shift
  2  $ (             )$      reduce S → ε
  3  $ ( S           )$      shift
  4  $ ( S )         $       reduce S → ε
  5  $ ( S ) S       $       reduce S → ( S ) S
  6  $ S             $       reduce S′ → S
  7  $ S′            $       accept

• The main task of a shift-reduce parser is finding the next handle.

• Reductions only occur when the reduced string is a right sentential form.

• In step 3 above the reduction S → ε cannot be performed, because the resulting string after the shift of ) onto the stack would be (S S) which is not a right sentential form. Thus S → ε is not a handle at this position of the sentential form (S.

• To reduce with S → (S)S the parser knows that (S)S appears on the right of a production and that it is already on the stack by using a DFA of "items".

108

LR(0) items

• The grammar with P = {S′ → S, S → (S)S | ε} has three productions and eight LR(0) items:

  S′ → .S
  S′ → S.
  S → .(S)S
  S → (.S)S
  S → (S.)S
  S → (S).S
  S → (S)S.
  S → .

• When P = {E′ → E, E → E + n | n} there are three productions and eight LR(0) items:

  E′ → .E
  E′ → E.
  E → .E + n
  E → E. + n
  E → E + .n
  E → E + n.
  E → .n
  E → n.
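An LR(0) item is cheap to represent: a production number plus a dot position. A minimal sketch in C (the encoding is an assumption made here) that enumerates the eight items of the E grammar above:

#include <stdio.h>
#include <string.h>

struct item { int prod; int dot; };   /* production index, dot offset  */

/* grammar: 0: Z -> E   1: E -> E+n   2: E -> n  (Z stands for E')     */
static const char lhs[] = { 'Z', 'E', 'E' };
static const char *rhs[] = { "E", "E+n", "n" };

static void print_item(struct item it) {
    int n = (int)strlen(rhs[it.prod]);
    printf("%c -> ", lhs[it.prod]);
    for (int i = 0; i <= n; i++) {
        if (i == it.dot) putchar('.');
        if (i < n) putchar(rhs[it.prod][i]);
    }
    putchar('\n');
}

int main(void) {
    for (int p = 0; p < 3; p++)       /* every production              */
        for (int d = 0; d <= (int)strlen(rhs[p]); d++)  /* every dot   */
            print_item((struct item){ p, d });
    return 0;                         /* prints the eight LR(0) items  */
}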

109

LR(0) parsing—LR(0) items

• An LR(0) item of a CFG is a production with a distinguished position in its right-hand side.

• The distinguished position is usually denoted with the meta symbol ‘.’, i.e. a period.

• E.g. if A → α, and β and γ are any two strings of symbols, including ε, such that α = βγ, then A → .βγ, A → β.γ and A → βγ. are all LR(0) items.

• They are called LR(0) items because they contain no explicit reference to lookahead.

• The item "records" the recognition of the right-hand side of a particular production.

• Specifically A → β.γ, constructed from A → βγ, denotes that the β part has already been seen and it may be possible to derive the next input tokens from γ.

110

LR(0) parsing—LR(0) items

• The item A → .α indicates that A could be reduced from α—it is called an initial item.

• The item A → α. indicates that α is on the top of the stack and may be the handle, if A → α is used to reduce α to A—it is called a complete item.

• The LR(0) items are used as states of a finite automaton that maintains information about the parse stack and the progress of a shift-reduce parse.

111

LR(0) parsing—finite automata of items

• LR(0) items denote the states of an FSA that maintains the progress of a shift-reduce parse.

• One approach is to first construct a nondeterministic FSA of LR(0) items and then derive a DFA from it. Another approach is to construct the DFA of sets of LR(0) items directly.

• What transitions are represented in the NFA of LR(0) items?

• Suppose that the symbol X ∈ (N ∪ T). Let A → α.Xη be an LR(0) item which represents a state reached where α has been recognized and where the focal point—the dot—is directly before X.

• If X is a token, then there is a transition on the token X to the next LR(0) state:

  A → α.Xη —X→ A → αX.η

112

LR(0) parsing—finite automata of items

• We are considering A → α.Xη where the focal point is directly before X.

• Suppose that X is a nonterminal; then it cannot be directly matched with a token on the input stream. The transition

  A → α.Xη —X→ A → αX.η

  corresponds to pushing X onto the stack as a result of a reduction of some β to X, by applying the rule X → β.

• Such a reduction must be preceded by the recognition of β. The state denoted by X → .β represents the start of the process of recognizing β.

• So when X is a nonterminal, ε-transitions must also be provided, leaving from A → α.Xη for every production X → β with X on the left, and going to the LR(0) state X → .β:

  A → α.Xη —ε→ X → .β

113

LR(0) parsing—finite automata of items

• The two transitions

  A → α.Xη —X→ A → αX.η

  and

  A → α.Xη —ε→ X → .β

  are the only ones in the NFA of LR(0) items.

• The start state of the NFA must correspond to the initial conditions of the parser: the parse stack is empty and the start symbol S is about to be parsed, i.e. any initial item S → .α could be used.

• Since we want the start state to be unique, the simple device of augmenting the grammar with a new, unique start symbol S′, for which S′ → S, suffices.

• The start state then is S′ → .S.

114

LR(0) parsing—finite automata of items

• What are the accepting states of the NFA?

• The NFA does not need accepting states.

• The NFA is not being used to do the recognition of the language.

• The NFA is merely being applied to keep track of the state of the parse.

• The parser itself determines when it accepts an input stream: by determining that the input stream is empty and the start symbol is on the top of the parse stack.

115

LR(0) parsing—finite automata of items

• The grammar with P = {S′ → S, S → (S)S | ε} has three productions and eight LR(0) items:

  S′ → .S
  S′ → S.
  S → .(S)S
  S → (.S)S
  S → (S.)S
  S → (S).S
  S → (S)S.
  S → .

• The NFA of LR(0) items for the S grammar:

  [Diagram: the eight items as states, with (-, )- and S-transitions between them, and ε-transitions from each item with the dot before S to S → .(S)S and S → .]

• The next step is to produce the DFA that corresponds to the NFA.

116

LR(0) parsing: converting the NFA into a DFA

[Diagram: the NFA of LR(0) items for the S grammar, repeated from the previous slide]

• Form the ε-closure of each LR(0) item.

• The closure always contains the set itself.

• Add each item for which there are ε-transitions from the original set.

• Then recursively add all sets which are ε-reachable from the sets already aggregated.

• Do this for every LR(0) item in the NFA.

• Add the terminal transitions that leave each aggregate.

117

LR(0) parsing: an NFA and its corresponding DFA

• The NFA for the S grammar: [diagram as on the previous slides]

• The DFA derived from the NFA:

  [Diagram — six states of sets of LR(0) items:
   0. {S′ → .S, S → .(S)S, S → .}
   1. {S′ → S.}
   2. {S → (.S)S, S → .(S)S, S → .}
   3. {S → (S.)S}
   4. {S → (S).S, S → .(S)S, S → .}
   5. {S → (S)S.}
   with transitions 0 —S→ 1, 0 —(→ 2, 2 —(→ 2, 2 —S→ 3, 3 —)→ 4, 4 —(→ 2, 4 —S→ 5]

118

LR(0) parsing—finite automata of items

• When P = {E′ → E, E → E + n | n} there are three productions and eight LR(0) items:

  E′ → .E
  E′ → E.
  E → .E + n
  E → E. + n
  E → E + .n
  E → E + n.
  E → .n
  E → n.

• The NFA of LR(0) items for the E grammar:

  [Diagram: E-, +- and n-transitions between the items, with ε-transitions from the items with the dot before E to E → .E + n and E → .n]

• The next step is to produce the DFA that corresponds to the NFA.

119

LR(0) parsing: NFA and equivalent DFA

• The NFA for the E grammar: [diagram as on the previous slide]

• The DFA derived from the above NFA:

  [Diagram — five states of sets of LR(0) items:
   0. {E′ → .E, E → .E + n, E → .n}
   1. {E′ → E., E → E. + n}
   2. {E → n.}
   3. {E → E + .n}
   4. {E → E + n.}
   with transitions 0 —E→ 1, 0 —n→ 2, 1 —+→ 3, 3 —n→ 4]

• The items that are added by the ε-closure are known as closure items and those items that originate the state are called kernel items.

120

LR(0) parsing

• The LR(0) algorithm keeps track of the current state in the DFA of LR(0) items.

• The parse stack need hold only state numbers, since they represent all the necessary information.

• For the sake of simplifying the description of the algorithm, the grammar symbol will also be pushed onto the parse stack before the state number.

• The parse starts with:

     Parsing stack   Input
  1  $ 0             input string$

• Suppose the token n is shifted onto the stack and the next state is 2:

     Parsing stack   Input
  2  $ 0 n 2         rest of input string$

• The LR(0) parsing algorithm chooses its next action depending on the state on the top of the stack and the current input token.

121

The LR(0) parsing algorithm

Let s be the current state.

1. If state s contains the item A → α.Xβ, where X is a terminal, then the action is a shift.

   • If the token is X then the next state contains A → αX.β.

   • If the token is not X then there is an error.

2. If s contains a complete item such as A → γ., then the action is to reduce γ by the rule A → γ.

   • When the start symbol S is reduced by the rule S′ → S and the input is empty, then accept; if it is not empty then announce an error.

   • In every other case the next state is computed as follows:

     (a) Pop γ off the stack.

     (b) Set s = top, a state which contains an item B → δ.Aβ.

     (c) Push A and then push the state containing B → δA.β.

122

LR(0) parsing: shift-reduce and reduce-reduce conflicts

• A grammar is said to be an LR(0) grammar if the parser rules are unambiguous.

• If a state contains the complete item A → α., then it can contain no other items.

• If such a state were also to contain the shift item A → α.Xβ, where X is a terminal, then an ambiguity arises as to whether action (1) or (2) must be executed. This is called a shift-reduce conflict.

• If such a state were also to contain another complete item B → β., then an ambiguity arises as to which production to apply—A → α. or B → β.—this is known as a reduce-reduce conflict.

• A grammar is therefore LR(0) if and only if each state is either a shift state or a reduce state containing a single complete item.

123

SLR(1) parsing

• The SLR(1) parsing algorithm.

• Disambiguating rules for parsing conflicts.

• Limits of SLR(1) parsing power.

• SLR(k) grammars.

124

The SLR(1) parsing algorithm

• Simple LR(1), i.e. SLR(1), parsing uses a DFA of sets of LR(0) items.

• The power of LR(0) is significantly increased by using the next token in the input stream to direct its actions in two ways:

  1. The input token is consulted before a shift is made, to ensure that an appropriate DFA transition exists, and

  2. it uses the follow set of a nonterminal to decide if a reduction should be performed.

• This is powerful enough to parse almost all common language constructs.

125

The SLR(1) parsing algorithm

Let s be the current state, i.e. the state on top of the stack.

1. If s contains any item of the form A → α.Xβ, where X is the next token in the input stream, then shift X onto the stack and push the state containing the item A → αX.β.

2. If s contains the complete item A → γ. and the next token in the input stream is in follow(A), then reduce by the rule A → γ—more details follow on the next slide.

3. If the next input token is not accommodated by (1) or (2), then an error is declared.

126

The SLR(1) parsing algorithm—2.

2. If s contains the complete item A → γ. and the next token in the input stream is in follow(A), then reduce by the rule A → γ.

   The reduction by S′ → S, where S is the start symbol, with the next token $, implies acceptance; otherwise the new state is computed as follows:

   (a) Remove the string γ and all its corresponding states from the parse stack.

   (b) Back up the DFA to the state where the construction of γ started.

   (c) By construction, this state contains an item of the form B → δ.Aβ. Push A onto the stack and push the state containing B → δA.β.

127

SLR(1) grammar

A grammar is an SLR(1) grammar if the application of the SLR(1) parsing rules does not result in an ambiguity. A grammar is SLR(1) ⇐⇒:

1. For any item A → α.Xβ ∈ s, where X is a token, there is no complete item B → γ. in s with X ∈ follow(B).

   A violation of this condition is a shift-reduce conflict.

2. For any two complete items A → α. ∈ s and B → β. ∈ s, follow(A) ∩ follow(B) = ∅.

   A violation of this condition is a reduce-reduce conflict.

128

Table-driven SLR(1) grammar

• The grammar with P = {E′ → E, E → E + n | n} is not LR(0) but is SLR(1), and its DFA of sets of items is:

  [Diagram: the five-state DFA of sets of LR(0) items shown earlier for the E grammar]

• follow(E′) = {$}, and follow(E) = {$, +}

• q(1, $) = accept instead of r(E′ → E)

     State   n    +               $               Go to E
     0       s2                                   1
     1            s3              accept
     2            r(E → n)        r(E → n)
     3       s4
     4            r(E → E + n)    r(E → E + n)
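The table can be driven by a small loop. A minimal sketch in C (invented here) encoding exactly the SLR(1) table above, parsing strings over n and +:

#include <stdio.h>

int parse(const char *in) {
    int stack[100], top = 0, pos = 0;
    stack[top] = 0;                               /* start in state 0  */
    for (;;) {
        int s = stack[top];
        char a = in[pos] ? in[pos] : '$';
        if (s == 0 && a == 'n')      { stack[++top] = 2; pos++; }  /* s2 */
        else if (s == 1 && a == '+') { stack[++top] = 3; pos++; }  /* s3 */
        else if (s == 1 && a == '$') return 1;                 /* accept */
        else if (s == 3 && a == 'n') { stack[++top] = 4; pos++; }  /* s4 */
        else if (s == 2 && (a == '+' || a == '$')) {   /* r(E -> n)     */
            top -= 1;                 /* pop one state for n            */
            stack[++top] = 1;         /* goto: state 0 on E is 1        */
        } else if (s == 4 && (a == '+' || a == '$')) { /* r(E -> E + n) */
            top -= 3;                 /* pop three states for E + n     */
            stack[++top] = 1;         /* goto: state 0 on E is 1        */
        } else return 0;              /* error                          */
    }
}

int main(void) {
    printf("%d\n", parse("n+n+n"));   /* prints 1: accepted            */
    printf("%d\n", parse("n+"));      /* prints 0: rejected            */
    return 0;
}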

129

SLR(1) parse of n + n + n

     State   n    +               $               Go to E
     0       s2                                   1
     1            s3              accept
     2            r(E → n)        r(E → n)
     3       s4
     4            r(E → E + n)    r(E → E + n)

     Parsing stack      Input       Action
  1  $ 0                n + n + n$  shift 2
  2  $ 0 n 2            + n + n$    reduce E → n
  3  $ 0 E 1            + n + n$    shift 3
  4  $ 0 E 1 + 3        n + n$      shift 4
  5  $ 0 E 1 + 3 n 4    + n$        reduce E → E + n
  6  $ 0 E 1            + n$        shift 3
  7  $ 0 E 1 + 3        n$          shift 4
  8  $ 0 E 1 + 3 n 4    $           reduce E → E + n
  9  $ 0 E 1            $           accept

130

SLR(1) parse of ()()

     State   (    )               $               Go to S
     0       s2   r(S → ε)        r(S → ε)        1
     1                            accept
     2       s2   r(S → ε)        r(S → ε)        3
     3            s4
     4       s2   r(S → ε)        r(S → ε)        5
     5            r(S → (S)S)     r(S → (S)S)

      Parsing stack                     Input   Action
   1  $ 0                               ()()$   shift 2
   2  $ 0 ( 2                           )()$    reduce S → ε
   3  $ 0 ( 2 S 3                       )()$    shift 4
   4  $ 0 ( 2 S 3 ) 4                   ()$     shift 2
   5  $ 0 ( 2 S 3 ) 4 ( 2               )$      reduce S → ε
   6  $ 0 ( 2 S 3 ) 4 ( 2 S 3           )$      shift 4
   7  $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4       $       reduce S → ε
   8  $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4 S 5   $       reduce S → (S)S
   9  $ 0 ( 2 S 3 ) 4 S 5               $       reduce S → (S)S
  10  $ 0 S 1                           $       accept

131

Disambiguating rules for parsing conflicts

• Shift-reduce conflicts have a natural disambiguating rule: prefer the shift over the reduce.

• Reduce-reduce conflicts are more complex to resolve—they usually require the grammar to be altered.

• Preferring the shift over the reduce in the dangling-else ambiguity leads to incorporating the most-closely-nested-if rule.

• The grammar with the following productions is ambiguous:

  statement → if-statement | other
  if-statement → if (exp) statement
               | if (exp) statement else statement
  exp → 0 | 1

• We will consider the even simpler grammar:

  S → I | other
  I → if S | if S else S

132

Disambiguating a shift-reduce conflict

Consider the grammar:

S → I | other
I → if S | if S else S

Since follow(I) = follow(S) = {$, else}, there is a parsing conflict—in state 5 the complete item I → if S. indicates a reduction on inputting else or $, but the item I → if S. else S indicates a shift when else is read.

[Diagram: the DFA of sets of LR(0) items for this grammar (states 0–7); state 5 contains both I → if S. and I → if S. else S, the source of the shift-reduce conflict]

133

SLR(1) table without conflicts

• The rules are numbered:

  (1) S → I
  (2) S → other
  (3) I → if S
  (4) I → if S else S

• The SLR(1) parse table:

     State   if    else   other   $        Go to S   I
     0       s4           s3               1         2
     1                            accept
     2             r1             r1
     3             r2             r2
     4       s4           s3               5         2
     5             s6             r3
     6       s4           s3               7         2
     7             r4             r4

134

Limits of SLR(1) parsing power

• Consider the grammar which describes parameterless procedures and assignment statements:

  stmt → call-stmt | assign-stmt
  call-stmt → identifier
  assign-stmt → var := exp
  var → var [ exp ] | identifier
  exp → var | number

• Assignments and procedure calls both start with an identifier.

• The parser can only decide at the end of the statement, or when the token ':=' appears, whether a call or an assignment is being processed.

135

Limits of SLR(1) parsing power

• Consider the simplified grammar:

  S → id | V := E
  V → id
  E → V | n

• The start state of the DFA of sets of items contains:

  S′ → .S
  S → .id
  S → .V := E
  V → .id

• The state has a shift transition on id to the state:

  S → id.
  V → id.

• follow(S) = {$} and follow(V) = {:=, $}. On getting the input token $ the SLR(1) parser will try to reduce by both the rules S → id and V → id—this is a reduce-reduce conflict.

• This simple problem can be solved by using an SLR(k) grammar.

136

SLR(k) grammars

• The SLR(1) algorithm can be extended to SLR(k) parsing, with k ≥ 1 lookahead symbols.

• Use first_k and follow_k sets and the two rules:

  1. If s contains A → α.Xβ, where X is a token and Xw ∈ first_k(Xβ) matches the next k tokens in the input stream, then the action is to shift the current input token onto the stack, and to push the state containing the item A → αX.β.

  2. If s contains A → α. and w ∈ follow_k(A) matches the next tokens in the input string, then the action is to reduce by the rule A → α.

• SLR(k) parsing is more powerful than SLR(1) parsing when k > 1, but it is substantially slower, since the cost of parsing grows exponentially in k.

• Typical non-SLR(1) constructs are handled using an LALR(1) parser, by using standard disambiguating rules, or by rewriting the grammar.

137

General LR(1) and LALR(1) parsing

• LR(1), also called canonical LR(1), parsing overcomes the problems with SLR(1) parsing, but at the cost of increased time complexity.

• Lookahead LR(1), or LALR(1), parsing preserves the efficiency of SLR(1) parsing and retains the benefits of general LR(1) parsing.

• We will discuss:

  – Finite automata of LR(1) items.
  – The LR(1) parsing algorithm.
  – LALR(1) parsing.

138

Finite automata of LR(1) items (Louden p. 217–220)

• SLR(1) applies lookahead after constructing the DFA of LR(0) items—the construction ignores the advantages that may ensue from considering lookaheads.

• General LR(1) uses a new DFA that has lookaheads built in from the start.

• This DFA uses items that are an extension of LR(0) items.

• They are called LR(1) items because they include a single lookahead token in each item.

• LR(1) items are written

  [A → α.β, a]

  where A → α.β is an LR(0) item, and a is the lookahead token.

• Next the transitions between LR(1) items will be defined.

139

Transitions between LR(1) items

• There are several similarities with DFAs of LR(0) items.

• They include ε-transitions.

• The DFA states are also built from ε-closures.

• However, transitions between LR(1) items must keep track of the lookahead token.

• Normal, i.e. non-ε, transitions are quite similar to those in DFAs of LR(0) items.

• The major difference lies in the definition of ε-transitions.

140

Definition of LR(1)-transitions

• Given an LR(1) item [A → α.Xγ, a], where X ∈ N ∪ T, there is a transition on X to the item [A → αX.γ, a].

• Given an LR(1) item [A → α.Bγ, a], where B ∈ N, there are ε-transitions to items [B → .β, b] for every production B → β and for every token b ∈ first(γa).

• Only ε-transitions create new lookaheads.

141

DFA of sets of LR(0) items for A → (A) | a (Louden p. 208)

• The augmented grammar with P = {A′ → A, A → (A) | a} has the DFA of sets of LR(0) items:

  [Diagram — six states: 0. {A′ → .A, A → .(A), A → .a}, 1. {A′ → A.}, 2. {A → a.}, 3. {A → (.A), A → .(A), A → .a}, 4. {A → (A.)}, 5. {A → (A).}, with transitions 0 —A→ 1, 0 —a→ 2, 0 —(→ 3, 3 —(→ 3, 3 —a→ 2, 3 —A→ 4, 4 —)→ 5]

• The parsing actions for the input ((a)) follow:

     Parsing stack        Input    Action
  1  $ 0                  ((a))$   shift
  2  $ 0 ( 3              (a))$    shift
  3  $ 0 ( 3 ( 3          a))$     shift
  4  $ 0 ( 3 ( 3 a 2      ))$      reduce A → a
  5  $ 0 ( 3 ( 3 A 4      ))$      shift
  6  $ 0 ( 3 ( 3 A 4 ) 5  )$       reduce A → (A)
  7  $ 0 ( 3 A 4          )$       shift
  8  $ 0 ( 3 A 4 ) 5      $        reduce A → (A)
  9  $ 0 A 1              $        accept

142

DFA of sets of LR(1) items for A → (A) | a (Louden p. 218)

• Augment the grammar by adding A′ → A.

• State 0: first put [A′ → .A, $] into State 0.

  To complete the closure, add ε-transitions to items with an A on the left of productions, with $ as the lookahead: [A → .(A), $] and [A → .a, $].

  State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]

• State 1: There is a transition from State 0 on A to the closure of the set that includes [A′ → A., $]. The action for this state will be to accept.

  State 1: [A′ → A., $]

143

DFA of sets of LR(1) items for A → (A)|a

  State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]

• State 2: There is a transition on '(' leaving State 0 to the closure of the LR(1) item [A → (.A), $], which forms the basis of State 2. There are ε-transitions from this item to [A → .(A), )] and to [A → .a, )], because the lookahead for the A in parentheses is first()$) = {)}.

• Note that there is a new lookahead token.

• The complete State 2 is:

  State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]

144

DFA of sets of LR(1) items for A → (A)|a

• State 3: This state emanates from State 0 with a transition on 'a' from [A → .a, $] to [A → a., $].

  State 3: [A → a., $]

• Note that the lookahead does not change.

• This completes the states that emanate from State 0.

  State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]

• State 4: A transition on A leaves State 2 to the state containing [A → (A.), $].

  State 4: [A → (A.), $]

145

DFA of sets of LR(1) items for A → (A)|a

• The next state emanates from State 2:

  State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]

• State 5: The transition on '(' goes to the ε-closure of [A → (.A), )], which once again adds all the items with A on the left of a production, namely [A → .(A), )] and [A → .a, )].

  State 5: [A → (.A), )], [A → .(A), )], [A → .a, )]

• States 2 and 5 differ only in the lookaheads of their first item.

146

DFA of sets of LR(1) items for A → (A)|a

• State 6: The last state emanating from State 2 is the transition on 'a' to the item [A → a., )].

• It differs from State 3 in the lookahead.

• State 7: There is a transition on ')' from State 4 to the item [A → (A)., $].

• State 8: State 5 has a transition on '(' to itself and a transition on 'A' to the item [A → (A.), )].

• State 9: There is a transition on ')' from State 8 to the item [A → (A)., )].

147

DFA of sets of LR(1) items for A → (A)|a

[Diagram: the complete DFA of sets of LR(1) items for A → (A) | a, assembling States 0–9 from the preceding slides]

148

The general LR(1) parsing algorithm(Louden p. 220–223)

Let s be the current state, i.e. the state on top of the

stack. The actions are defined as follows:

1. If s contains any LR(1) item of the form [A →

α.Xβ, a], where X is the next token in the input

stream, then shift X onto the stack and push the

state containing the LR(1) item [A → αX.β, a].

2. If s contains the complete LR(1) item [A → γ., a]

and the next token in the input stream is a, then

reduce by the rule A → γ.—more details follow

on next slide.

3. If the next input token is not accommodated by (1) or (2), then an error is declared.

149

The general LR(1) parsing algorithm—2.

2. If s contains the complete item A → γ. and the

next token in the input stream is a, then reduce

by the rule A → γ.

The reduction by S′ → S, where S is the start symbol, and the next token is $, implies acceptance; otherwise the new state is computed as follows:

(a) Remove the string γ and all its corresponding

states from the parse stack.

(b) Back up the DFA to the state where the con-

struction of γ started.

(c) By construction, the exposed state contains an LR(1) item of the form [B → α.Aβ, b]. Push A onto the stack and push the state containing [B → αA.β, b].

150

LR(1) grammar

A grammar is an LR(1) grammar if the application of the LR(1) parsing rules does not result in an ambiguity.

A grammar is LR(1) ⇐⇒ for every state s of its DFA of sets of LR(1) items:

1. For any item [A → α.Xβ, a] ∈ s, where X is a

token, there is no complete item in s of the form

[B → γ., X]

A violation of this condition is a shift-reduce

conflict.

2. There are no two distinct complete LR(1) items of the form [A → α., a] ∈ s and [B → β., a] ∈ s, otherwise

it would lead to a reduce-reduce conflict.

151

LR(1) parse table for A → (A)|a

Number the two productions as follows:

(1) A → (A)  and  (2) A → a

The LR(1) parse table:

State        Input                      Go to
        (      a      )      $            A
  0     s2     s3                         1
  1                          accept
  2     s5     s6                         4
  3                          r2
  4                   s7
  5     s5     s6                         8
  6                   r2
  7                          r1
  8                   s9
  9                   r1

• The parse table is extracted directly from the DFA of sets of LR(1) items.

• This grammar is LR(0) and thus also SLR(1).
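The table is small enough to encode directly. Below is a minimal, self-contained C sketch that runs the general LR(1) algorithm of the previous slides against this table; the action encoding (positive = shift to that state, −1/−2 = reduce by rule 1/2, 99 = accept, 0 = error) and the token names are my own assumptions, not part of Louden's presentation.

#include <stdio.h>

enum Tok { T_LP, T_A, T_RP, T_EOF };            /* ( a ) $ */

/* ACTION table, row per state, column per token, from the slide. */
static const int action[10][4] = {
/*            (    a    )    $  */
/* 0 */   {   2,   3,   0,   0 },
/* 1 */   {   0,   0,   0,  99 },
/* 2 */   {   5,   6,   0,   0 },
/* 3 */   {   0,   0,   0,  -2 },
/* 4 */   {   0,   0,   7,   0 },
/* 5 */   {   5,   6,   0,   0 },
/* 6 */   {   0,   0,  -2,   0 },
/* 7 */   {   0,   0,   0,  -1 },
/* 8 */   {   0,   0,   9,   0 },
/* 9 */   {   0,   0,  -1,   0 },
};
static const int goto_A[10]  = { 1, 0, 4, 0, 0, 8, 0, 0, 0, 0 };
static const int rule_len[3] = { 0, 3, 1 };     /* |(A)| = 3, |a| = 1 */

int parse(const enum Tok *in)
{
    int stack[64], top = 0;
    stack[0] = 0;                               /* start in state 0 */
    for (;;) {
        int act = action[stack[top]][*in];
        if (act == 99) return 1;                /* accept           */
        if (act > 0) { stack[++top] = act; in++; }   /* shift       */
        else if (act < 0) {                     /* reduce by -act   */
            top -= rule_len[-act];              /* pop the handle   */
            stack[top + 1] = goto_A[stack[top]];/* GOTO on A        */
            top++;
        } else return 0;                        /* empty entry: error */
    }
}

int main(void)
{
    enum Tok input[] = { T_LP, T_LP, T_A, T_RP, T_RP, T_EOF };  /* ((a))$ */
    printf("%s\n", parse(input) ? "accept" : "error");
    return 0;
}

Tracing this on ((a))$ reproduces exactly the shift/reduce sequence of the parse on Slide 142, ending in the accept action of state 1.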

152

General LR(1) parsing

• The grammar with the rules

S → id | V:=E
V → id
E → V | n

proves not to be SLR(1).

• We construct its DFA of sets of LR(1) items.

• The start state is the ε-closure of the LR(1) item

[S′ → .S, $]. So it also contains the LR(1) items

[S → .id, $] and [S → .V :=E, $].

• The last item, in turn, gives rise to the LR(1) item

[V → .id, :=].

• The lookahead is ‘:=’ because a ‘V ’ must only be

recognized if it is actually followed by ‘:=’.

0. [S′ → .S, $]
   [S → .id, $]
   [S → .V:=E, $]
   [V → .id, :=]

153

General LR(1) parsing

• Consider state 0:

0. [S′ → .S, $]
   [S → .id, $]
   [S → .V:=E, $]
   [V → .id, :=]

A transition from state 0 on ‘S’ goes to state 1:

1. [S′ → S., $]

• State 0 has a transition on ‘id’ to state 2:

2. [S → id., $]   [V → id., :=]

• State 0 has a transition on ‘V ’ to state 3:

3. [S → V.:=E, $]

• No transitions leave states 1 and 2.

154

General LR(1) parsing

• The third state has a transition on ‘:=’ to the clo-

sure of the item [S → V :=.E, $]. Since E has

no symbols following it, the lookaheads will be ‘$’.

The two items [E → .V, $] and [E → .n, $] must

be added. The first of these leads to the item

[V → .id, $]. The complete State 4 is:

4. [S → V:=.E, $]
   [E → .V, $]   [E → .n, $]   [V → .id, $]

• Each of these items in state 4 has the general form [A → α.Xβ, $] and in turn leads to a transition on X ∈ {E, V, n, id} to a state with the single item [A → αX.β, $] in it.

• State 2 gave rise to a parsing conflict in the SLR(1)

parser. The LR(1) items now clearly distinguish

between the two reductions by their lookaheads:

Select S → id on ‘$’ and V → id on ‘:=’.

155

General LR(1) parsing (Louden p. 223)

State 0: [S′ → .S, $]   [S → .id, $]   [S → .V:=E, $]   [V → .id, :=]
State 1: [S′ → S., $]                                      [from 0 on S]
State 2: [S → id., $]   [V → id., :=]                      [from 0 on id]
State 3: [S → V.:=E, $]                                    [from 0 on V]
State 4: [S → V:=.E, $] [E → .V, $] [E → .n, $] [V → .id, $]   [from 3 on ‘:=’]
State 5: [S → V:=E., $]                                    [from 4 on E]
State 6: [E → V., $]                                       [from 4 on V]
State 7: [E → n., $]                                       [from 4 on n]
State 8: [V → id., $]                                      [from 4 on id]

156

LALR(1) parsing (Louden p. 224–226)

• In the DFA of sets of LR(1) items many states differ only in some of the lookaheads of their items.

• The DFA of sets of LR(0) items of the grammar with P = {A → (A) | a} has only 6 states while its DFA of sets of LR(1) items has 10 states.

• In the DFA of sets of LR(1) items states 2–5, 4–8, 7–9, and 3–6 differ only in some item lookaheads.

• e.g. the item [A → (.A), $] from state 2 differs from the item [A → (.A), )] from state 5 only in its lookahead.

• The LALR(1) algorithm caters for these almost-duplicate states by coalescing such pairs into items with sets of lookahead tokens, e.g. [A → (.A), $/)].

• The DFA of sets of LALR(1) items is identical to the corresponding DFA of sets of LR(0) items, excepting that the former includes sets of lookahead tokens.
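A minimal sketch of the coalescing step in C: a state's core is its list of LR(0) items, and two LR(1) states with equal cores merge by unioning their lookahead sets item by item. The representation here (items encoded as sorted integers, lookaheads as bitsets) is an assumption made purely for illustration.

#include <string.h>

#define MAXITEMS 8

typedef struct {
    int      item[MAXITEMS];  /* LR(0) items, encoded as ints, sorted */
    unsigned la[MAXITEMS];    /* lookahead set per item, as a bitset  */
    int      n;
} State;

/* Two LR(1) states have the same core if their LR(0) items agree. */
static int same_core(const State *s, const State *t)
{
    return s->n == t->n &&
           memcmp(s->item, t->item, s->n * sizeof(int)) == 0;
}

/* Merge t into s: union the lookahead sets item by item, giving the
   $/)-style lookahead sets of the LALR(1) items. */
static void merge(State *s, const State *t)
{
    for (int i = 0; i < s->n; i++)
        s->la[i] |= t->la[i];
}

Running same_core/merge over every pair of LR(1) states collapses 2–5, 4–8, 7–9 and 3–6 of the A-grammar into the six LALR(1) states shown on Slide 160.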

157

LALR(1) parsing

• The LALR(1) parsing algorithm preserves the benefit of the smaller DFA of sets of LR(0) items while retaining some of the advantage that LR(1) parsing has over SLR(1) parsing.

• Definition: the core of a state of the DFA of sets of LR(1) items is the set of LR(0) items consisting of the first components of all LR(1) items of the state.

• First principle of LALR(1) parsing

The core of a state of the DFA of sets of LR(1)

items is a state of the DFA of sets of LR(0) items

• Second principle of LALR(1) parsing

Given two states s1 and s2 of the DFA of sets

of LR(1) items that have the same core, suppose

there is a transition on the symbol X from state

s1 to state t1, then there is also a transition on

the symbol X from state s2 to state t2 and the

states t1 and t2 have the same core.

158

LALR(1) parsing

• The two principles of LALR(1) parsing allow us to

construct the DFA of sets of LALR(1) items which

is built up from the DFA of sets of LR(1) items

by identifying all states that have the same core

and forming the union of the lookahead symbols

for each LR(0) item.

• Thus each LALR(1) item in this DFA will have

an LR(0) item as its first component and a set of

lookahead tokens as its second component.

• Multiple lookaheads are separated by ‘/’.

159

LALR(1) parsing

• The DFA of sets of LALR(1) items.

State 0: [A′ → .A, $]     [A → .(A), $]   [A → .a, $]
State 1: [A′ → A., $]                                  [from 0 on A]
State 2: [A → (.A), $/)]  [A → .(A), )]   [A → .a, )]  [from 0 and 2 on ‘(’]
State 3: [A → a., $/)]                                 [from 0 and 2 on a]
State 4: [A → (A.), $/)]                               [from 2 on A]
State 7: [A → (A)., $/)]                               [from 4 on ‘)’]

• As would be expected this DFA is identical to the DFA of sets of LR(0) items for this grammar, except for the lookaheads.

160

LALR(1) parsing algorithm

• The LALR(1) parsing algorithm is identical to the

general LR(1) parsing algorithm.

• Definition: if no parsing conflicts arise when

parsing a grammar with the LALR(1) parsing al-

gorithm it is known as an LALR(1) grammar.

• It is possible for the LALR(1) construction to create

parsing conflicts that do not exist in general LR(1)

parsing.

• There cannot be any shift-reduce conflicts but

reduce-reduce conflicts are possible.

• Every SLR(1) grammar is certainly LALR(1) and

LALR(1) parsers often do as well as general LR(1)

parsers in removing typical conflicts that occur in

SLR(1) parsing.

• The id grammar is not SLR(1) but is LALR(1).

161

LALR(1) parsing

• Combining LR(1) states to form the DFA of sets of

LALR(1) items solves the problem of large parsing

tables, but it still requires the entire DFA of sets

of LR(1) items to be computed.

• It is possible to compute the DFA of sets of LALR(1) items directly from the DFA of sets of LR(0) items by propagating lookaheads, which is a relatively simple process.

• Consider the LALR(1) DFA of the A-grammar.

• Begin constructing lookaheads by adding ‘$’ to the lookahead of the item A′ → .A in state 0. The ‘$’ is said to be spontaneously generated.

• Then by the rules of ε-closure the ‘$’ propagates

to the two closure items of ‘.A’. By following the

three transitions leaving state 0, the ‘$’ propagates

to states 1, 2, and 3.

162

LALR(1) parsing

• Continuing with state 2 the closure items get the

lookahead ‘)’ by spontaneous generation—because

in A → (.A), the core item of the state, ‘.A’ is

followed by ‘)’.

• The transition on a to state 3 causes the ‘)’ to be

propagated to the lookahead of the item in that

state.

• The transition on ‘(’ from state 2 to itself causes

the ‘)’ to propagate to the lookahead of the kernel

item—which now has ‘)’ and ‘$’ in its lookahead

set.

• Now this lookahead set ‘)/$’ propagates to states

4 and 7.

• We have now demonstrated how to build the DFA of sets of LALR(1) items directly from the DFA of sets of LR(0) items.
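The propagation just demonstrated is a fixpoint computation over the DFA. A minimal sketch, assuming lookaheads are kept as bitsets and the closure and transition relationships have been flattened into a hypothetical edge list (an edge means "lookaheads of item from flow into item to"):

#include <string.h>

#define NITEMS 16

typedef struct { int from, to; } Edge;

/* la[i] is the lookahead bitset of item i. Spontaneously generated
   lookaheads (like the initial $ of A' -> .A) are seeded into la[]
   before calling propagate(). */
void propagate(unsigned la[NITEMS], const Edge *edges, int nedges)
{
    int changed = 1;
    while (changed) {                     /* iterate to a fixpoint */
        changed = 0;
        for (int i = 0; i < nedges; i++) {
            unsigned before = la[edges[i].to];
            la[edges[i].to] |= la[edges[i].from];
            if (la[edges[i].to] != before) changed = 1;
        }
    }
}

For the A-grammar, seeding $ into state 0 and adding the spontaneous ‘)’ of state 2's closure items reproduces the ‘$/)’ lookahead sets of states 2, 3, 4 and 7 after two sweeps.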

163

The hierarchy of parsers

[Venn diagram: the hierarchy of grammar classes LL(0), LR(0), SLR(1), LALR(1), LR(1), SLR(k), LR(k), with LL(1) and LL(k) cutting across; the containments are stated in the bullets below.]

• All LL(0) grammars are LR(0) but there exist LR(0)

grammars that are not LL(0).

• LR(0) grammars are SLR(1) and there are SLR(1)

grammars that are not LR(0) grammars.

• SLR(1) grammars are LALR(1) and there are LALR(1) grammars that are not SLR(1) grammars.

• LALR(1) grammars are LR(1) and there are LR(1)

grammars that are not LALR(1).

164

The hierarchy of parsers—continued

[Venn diagram repeated from the previous slide.]

• LL(k) grammars cut across these grammars but are

a subset of LR(k) and obviously include all LL(1)

and LL(0) grammars.

• Similarly the LR(k) grammars are obviously a superset of the LR(1) and LR(0) grammars.

165

bison, an LALR(1) parser generator (Louden p. 226–250)

• bison basics.

• bison options.

• Parsing conflicts and disambiguating rules.

• Tracing the execution of a bison parser.

• Arbitrary value types in bison.

• Embedded actions in bison.

166

Generation of a parser using bison

The TINY parser using bison.

• See the code for tiny.y

• Use of YYPARSER

• YYSTYPE is used to define the values returned by the bison procedures as follows:

#define YYSTYPE TreeNode *

where TreeNode is defined as:

typedef struct treeNode {
  struct treeNode *child[MAXCHILDREN];
  struct treeNode *sibling;
  int lineno;
  NodeKind nodekind;
  union { StmtKind stmt; ExpKind exp; } kind;
  union { TokenType op; int val; char *name; } attr;
  ExpType type; /* for type checking of exps */
} TreeNode;

167

Error recovery in bottom-up parsers

• The normal state of a compiler is dealing with

errors—most source files presented to a compiler

are erroneous.

• It is not acceptable for a compiler to give up on

the first parse error it finds.

• Compilers should be designed to cope with as many

errors as possible.

• Detecting errors in bottom-up parsing.

• Panic mode error recovery.

• Error recovery in bison.

• Error recovery in your compiler.

168

Detecting errors in bottom-up parsing

• An empty entry in the parse table indicates an error.

• Contrary to intuition such entries are very useful—a negative effect is that they increase the size of the table.

• Because of their construction an LR(1) parser detects errors earlier than an LALR(1) parser, which in turn can detect errors earlier than an LR(0) parser.

• Using the A-grammar with the incorrect input ‘(a$’ an LR(1) parser will shift ‘(’ and ‘a’ onto the stack and move to state 6, where it immediately reports an error since there is no entry under ‘$’.

• An LR(0) parser will first reduce by A → a before discovering the missing ‘)’.

• Note that none of these parsers can ever shift a

terminal token in error.

169

Panic mode error recovery

It is possible to achieve good error recovery by removing

symbols from either the parse stack or the input stream

or both.

There are three possible actions:

1. Pop a state from the stack.

2. Successively pop tokens from the input until a to-

ken is seen for which we can restart the parse.

3. Push a new state onto the stack.

170

Panic mode error recovery

An effective way to choose an action when an error

occurs is to:

1. Pop states from the parse stack until a state with

nonempty ‘goto’ entries is found.

2. If there is a legal action on the current input token

from one of the ‘goto’ states, push that state onto

the stack and restart the parse. If there are several

such states, prefer a shift to a reduce. Among the

reduce actions, prefer one whose associated non-

terminal is least general.

3. If there is no legal action on the current input token

from one of the ‘goto’ states advance the input

until there is a legal action or the end is reached.

These rules have the effect of forcing the completion of the construct that was being recognized when the error occurred—this is known as panic mode error recovery.

171

Panic mode error recovery—example

Consider the parse below of (2+*)—it proceeds normally until the * is seen. At that point panic mode would cause the following actions to take place on the parsing stack.

Parsing stack                      Input    Action
...                                ...      ...
$ 0 ( 6 E 10 + 7                   *)$      error: push T, goto 11
$ 0 ( 6 E 10 + 7 T 11              *)$      shift 9
$ 0 ( 6 E 10 + 7 T 11 * 9          )$       error: push F, goto 13
$ 0 ( 6 E 10 + 7 T 11 * 9 F 13     )$       reduce T → T ∗ F
...                                ...      ...

• At the first error the parser is in state 7, which has legal goto states 11 and 4. Since state 11 has a shift on the next input token ‘*’, that goto is preferred, and the token is shifted.

• The parser goes into state 9, with ‘)’ as input—another error.

• In state 9 there is a single goto entry, to state 13, and state 13 has a legal (reduce) action for ‘)’, so F is pushed together with state 13 and the parse can now proceed normally.

172

Error recovery in bison.

• bison uses error productions.

• An error production contains the pseudo token

error.

• error marks the context in which erroneous to-

kens can be removed until a suitable synchronizing

token is seen.

• error productions allow the programmer to man-

ually mark those nonterminals whose goto entries

can be used in error recovery.

• bison also provides ‘yyerrok’, which cancels the error state so that subsequent tokens are not discarded while it is doing error recovery.

173

How bison uses error.

• When an error is encountered, states are popped

from the parse stack until it reaches a state in

which ‘error’ is a legal lookahead.

• If ‘error’ is never a legal lookahead for a shift then

the parse stack will be emptied, aborting the parse.

• When a state with ‘error’ as legal lookahead is

found, the parser carries on as if it has seen ‘error’

followed by the lookahead that caused the error.

• The previous lookahead token can be discarded

with ‘yyclearin’.

• If the parser is in the Error State and then dis-

covers further errors, the input tokens causing the

errors will be discarded without any messages un-

til three tokens have been shifted legally onto the

parsing stack.

174

bison and the error state.

• While the parser is in the error recovery mode the

value of ‘YYRECOVERING’ is 1—normally it is 0.

• The parser can be removed from the error state by using ‘yyerrok’.

175

Error recovery in bison—examples

Consider the example

%token NUMBER
%%
command
  : exp { printf("%d\n", $1); }   /* prints the result */
  ;

exp : exp '+' term { $$ = $1 + $3; }
    | exp '-' term { $$ = $1 - $3; }
    | term         { $$ = $1; }
    ;

term : term '*' factor { $$ = $1 * $3; }
     | factor          { $$ = $1; }
     ;

factor : NUMBER        { $$ = $1; }
       | '(' exp ')'   { $$ = $2; }
       ;
%%

176

Error recovery in bison—examples

Consider this replacement for command:

command : exp   { printf("%d\n", $1); }
        | error { yyerror("incorrect expression"); }
        ;

Now suppose the input ‘2++3’ is given to the parser. This will lead to the configuration:

Parsing stack            Input
$ 0 exp 2 + 7            +3$

The parser enters the error state and begins popping states from the stack until state 0 is uncovered. Then the production for command provides that error is a legal lookahead; it is shifted onto the stack and immediately reduced to command, causing the error message “incorrect expression” to be printed. The stack now becomes:

Parsing stack            Input
$ 0 command 1            +3$

At this stage the only legal lookahead is ‘$’ correspond-

ing to the return of ‘EOF ’ by ‘yylex’ and the parser

will delete the remaining input tokens ‘+3’ before exit-

ing the error state.

177

Error recovery in bison—examples

• A better idea is to reenter the line after the error. This is done with a synchronizing token, such as ‘\n’.

command : exp '\n'   { printf("%d\n", $1); }
        | error '\n' { yyerrok;
                       printf("reenter expression: "); }
          command
        ;

• When the error occurs the parser will skip all the tokens up to the end-of-line symbol, whereupon it will execute ‘yyerrok’ and the printf statement and will then try to get another command.

• The call to yyerrok is needed to cancel the error

state, otherwise bison will eat up input until it

finds three legal tokens.

178

Where to put error

Follow these goals when placing error:

• as close as possible to the start symbol—this ensures that there is always a point to recover from;

• as close as possible to each terminal—recovery can be improved further by inserting the action yyerrok;

• without introducing new conflicts—this can be difficult; allow parsing to continue beyond the expression but trash the rest of the statement.

Place error symbols:

• into each recursive construct or repetition;

• don’t add yyerrok; in productions ending with error—it may lead to cascading error messages and even loops if the parser cannot discard input;

• non-empty lists require two error variants, one at the start of a list and another for the end;

• possibly empty lists require an error symbol inside the empty branch—otherwise add the symbol where the empty list is being used.

179

Where to put error

The table below is a good guideline:

Construct           EBNF         bison input
optional sequence   x : {y}      x : /* empty */
                                   | x y {yyerrok;}
                                   | x error
                                   ;
sequence            x : y {y}    x : y
                                   | x y {yyerrok;}
                                   | error
                                   | x error
                                   ;
list                x : y {T y}  x : y
                                   | x T y {yyerrok;}
                                   | error
                                   | x error
                                   | x error y {yyerrok;}
                                   | x T error
                                   ;

Note that we used a yyerrok; action after a production with an error symbol.

180


Semantic analysis

Introduction

• Semantic analysis is sometimes referred to as context-sensitive analysis because coping with some of the simplest semantics—such as using a variable if and only if it has already been declared—is beyond the capabilities of a CFG.

• Generally, use symbol table and bison actions to

perform or compute semantics.

• More formally, syntax-directed translation with at-

tribute grammars may be used.

• Use type-checking algorithms based on attribute

dependency and propagation.

181

Semantic analysis

• Semantic analysis involves computing beyond the reach of CFGs and parsing algorithms, and lies in the realm of context-sensitive grammars.

• The semantic information is closely related to the

eventual meaning or semantics of the program be-

ing translated.

• Since it takes place prior to execution it may be

regarded as static.

• Semantic analysis in a statically typed language

such as C involves building a symbol table to keep

track of the meanings of identifiers and performing

type inference to propagate these meanings and

type checking on expressions and statements.

182

Semantic analysis

What sort of meaning is involved that extends beyondthe capabilities of a CFG?

1. Has x been declared only once?

2. Has x already been declared before its first use?

3. Has x been defined before its first use?

4. Is x a scalar, an array, a function, or a class?

5. Is x declared but never used?

6. To which declaration does x refer?

7. Are the types in an expression compatible?

8. Does the dimension match the declaration?

9. Is an array reference within its declared bounds?

10. Where is x stored? When is it allocated or created?

11. Does *p refer to the result of a new or of a malloc()?

12. Does the expression produce a constant value?

183

Semantic analysis

Semantic analysis can be divided into two categories:

the analysis of a program

1. to establish its correctness in order to guarantee proper execution—this varies according to the typing strength of the language in question. Languages can be ordered more or less in terms of their typing strength:

   LISP ≺ Smalltalk ≺ Fortran ≺ Basic ≺ C ≺ Pascal ≺ Oberon ≺ Ada and Java

2. to enhance the efficiency of its execution—this

is usually relegated to “optimization.”

184

Static semantic analysis

• Involves both the description of the analyses to perform, as well as the implementation of the analyses using appropriate algorithms.

• Denotational semantics (Strachey) may be used.

• Attributes and attribute grammars (Donald Knuth, 1965) may be used to write semantic rules.

• bison’s actions often boil down to semantic rules.

• Attribute grammars may be useful for languages which obey the principle of syntax-directed semantics: the semantic content of a program is closely related to its syntax.

• Modern programming languages tend to follow this principle.

• Despite this, semantics are often not formally specified by the language designers and the task of figuring out the attribute grammar is left to the compiler writer.

185

Attributes and attribute grammars

• An attribute is a property of a programming lan-

guage construct.

• The attributes of an object include: name, type, value, location, scope, size, extent, lexical/dynamic, register/memory, transient/persistent, static/automatic.

• data type of an object.

• value of an expression, or object code.

• location of a variable in memory or on disk.

• number of significant digits in a variable.

• more attributes are given in the list above.

186

Attributes and attribute grammars

• Abstract syntax represented by an abstract syntax

tree is a better basis for semantics—but this too is

usually left to the whims of the compiler writer.

• Attributes may be fixed prior to the compilation

process.

• Binding is the process of computing an attribute

and associating its computed value with the lan-

guage construct.

• The time that it occurs is called binding time.

• Attributes that can be prebound by the compiler

are static.

• Those attributes that are bound at run time are

dynamic.

187

Binding time

• In C or Pascal the data type of a variable can be determined at compile time by the type checker. In LISP this is usually done at run time.

• The values in expressions are usually dynamic. Some constant expressions such as (1+2)*3 can be calculated during compilation—constant folding.

• Variables are usually allocated during compilation. Storage for variables can also be created at run time, but typically these are used via pointers residing in statically allocated relative addresses.

• The object code is static, because all of it is created at compile time. Some languages cater for creating code dynamically.

• In a language like BASIC the size and type of numbers is determined at run time. But usually the scanner needs to know the number of digits it must accumulate—otherwise numbers can be stored as strings and the accuracy sorted out by the run-time routines.
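Constant folding itself is a small bottom-up computation. A minimal sketch over a hypothetical expression-tree type (names enode and fold are mine; only + and ∗ are shown): if both children of an operator node are already constants, the node is collapsed into a constant at compile time, so (1+2)*3 becomes the single constant 9.

#include <stddef.h>

typedef struct enode {
    enum { CONST_, OP_ } kind;
    char op;                       /* '+' or '*' when kind == OP_ */
    int  val;                      /* when kind == CONST_         */
    struct enode *l, *r;
} enode;

/* Fold constant subexpressions bottom-up; returns 1 iff t is now a
   constant node. */
int fold(enode *t)
{
    if (t == NULL || t->kind == CONST_) return t != NULL;
    int lc = fold(t->l), rc = fold(t->r);
    if (lc && rc) {
        t->val  = (t->op == '+') ? t->l->val + t->r->val
                                 : t->l->val * t->r->val;
        t->kind = CONST_;          /* node now behaves as a constant */
        return 1;
    }
    return 0;
}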

188

Attribute grammars

• If X ∈ N ∪ T and a is an attribute of X, write X.a for the value of a associated with X.

• Given a collection of attributes a1, a2, . . . , ak the principle of syntax-directed semantics implies that for each grammar rule X0 → X1X2 . . . Xn, where X0 ∈ N and Xi ∈ N ∪ T for i ∈ [1..n], the values of the attributes Xi.aj of each symbol Xi are related to the values of the attributes of the other symbols in the rule.

• Each relationship is specified by an attribute equation

  Xi.aj = fij(X0.a1, . . . , X0.ak, X1.a1, . . . , X1.ak, . . . , Xn.a1, . . . , Xn.ak)

• An attribute grammar for the attributes a1, a2, . . . , ak is the collection of all such equations for all the grammar rules of the language.

189

Attribute grammar for number grammar

Typically attribute equations are written with each grammar rule.

The number grammar has the 12 productions:

number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Its attribute grammar follows:

Grammar rules             Semantic rules
number1 → number2 digit   number1.val = number2.val ∗ 10 + digit.val
number → digit            number.val = digit.val
digit → 0                 digit.val = 0
digit → 1                 digit.val = 1
· · ·
digit → 9                 digit.val = 9
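The semantic rules compute number.val bottom-up; a direct C transcription over the digit string, applying the number1.val rule left to right, might be:

/* number.val as in the attribute grammar:
   number2.val * 10 + digit.val, applied left to right. */
int number_val(const char *s)
{
    int val = 0;
    while (*s >= '0' && *s <= '9')
        val = val * 10 + (*s++ - '0');   /* digit.val of each digit */
    return val;                          /* number_val("345") == 345 */
}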

190

The parse tree for number grammar

The parse tree for the number grammar of the integer

345 follows

number (val = 34∗10+5 = 345)
  number (val = 3∗10+4 = 34)
    number (val = 3)
      digit (val = 3)   →  3
    digit (val = 4)     →  4
  digit (val = 5)       →  5

191

Attribute grammar for exp grammar

The exp grammar has the 7 productions:

exp → exp + term | exp − term | term
term → term ∗ factor | factor
factor → (exp) | number

Its attribute grammar follows:

Grammar rules            Semantic rules
exp1 → exp2 + term       exp1.val = exp2.val + term.val
exp1 → exp2 - term       exp1.val = exp2.val − term.val
exp → term               exp.val = term.val
term1 → term2 * factor   term1.val = term2.val ∗ factor.val
term → factor            term.val = factor.val
factor → (exp)           factor.val = exp.val
factor → number          factor.val = number.val

• Note that the ‘+’ in the grammar rule ‘exp1 →

exp2+term’ represents the token in the source pro-

gram, and the ‘+’ in the semantic rule represents

the arithmetic operation to be performed at exe-

cution time.

• There is no equation with ‘number.val’ on the LHS,

since this value is calculated prior to the semantic

phase, e.g. by the scanner.

192

Parse tree for exp grammar

The parse tree of (34-3)*42 for the exp grammar

[Annotated parse tree of (34-3)*42: the numbers 34 and 3 pass their values up through factor and term; exp (val = 34−3 = 31) combines them; the parenthesized factor (val = 31) and factor (val = 42) combine via term ∗ factor into term (val = 31∗42 = 1302), giving the root exp (val = 1302).]

193

Attribute grammar for decl grammar

The decl grammar has the 5 productions:

decl → type var-list
type → int | float
var-list → id, var-list | id

Its attribute grammar follows:

Grammar rules              Semantic rules
decl → type var-list       var-list.dtype = type.dtype
type → int                 type.dtype = integer
type → float               type.dtype = real
var-list1 → id, var-list2  id.dtype = var-list1.dtype
                           var-list2.dtype = var-list1.dtype
var-list → id              id.dtype = var-list.dtype

The parse tree for the string float x,y:

decl
  type (dtype = real)      →  float
  var-list (dtype = real)
    id(x) (dtype = real)
    ,
    var-list (dtype = real)
      id(y) (dtype = real)

194

Attribute grammar for based-num grammar

The based-num grammar has the 15 productions:

based-num → num basechar
basechar → o | d
num → num digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Its attribute grammar follows:

Grammar rules             Semantic rules
based-num → num basechar  based-num.val = num.val
                          num.base = basechar.base
basechar → o              basechar.base = 8
basechar → d              basechar.base = 10
num1 → num2 digit         num1.val = if digit.val = error or num2.val = error
                                     then error
                                     else num2.val ∗ num1.base + digit.val
                          num2.base = num1.base
                          digit.base = num1.base
num → digit               num.val = digit.val
                          digit.base = num.base
digit → 0                 digit.val = 0
digit → 1                 digit.val = 1
· · ·
digit → 7                 digit.val = 7
digit → 8                 digit.val = if digit.base = 8 then error else 8
digit → 9                 digit.val = if digit.base = 8 then error else 9
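The rules transcribe into C almost line for line. In this sketch error is modelled as −1 (an assumption—the grammar just says error), and the base is fixed by first inspecting the trailing basechar; based_val("345o", 4) returns 229, matching the parse tree on the next slide.

/* based-num.val per the attribute grammar: the basechar at the end
   fixes num.base; digits 8 and 9 are errors in base 8. error = -1. */
int based_val(const char *s, int len)
{
    int base = (s[len-1] == 'o') ? 8 : 10;   /* basechar: o=8, d=10 */
    int val = 0;
    for (int i = 0; i < len - 1; i++) {
        int d = s[i] - '0';
        if (d < 0 || d > 9 || (base == 8 && d >= 8))
            return -1;                       /* digit.val = error   */
        val = val * base + d;                /* num1.val rule       */
    }
    return val;
}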

195

Parse tree for based-num grammar

The parse tree for the based-num grammar follows

based-num (val = 229)
  num (val = 28∗8+5 = 229, base = 8)
    num (val = 3∗8+4 = 28, base = 8)
      num (val = 3, base = 8)
        digit (val = 3, base = 8)  →  3
      digit (val = 4, base = 8)    →  4
    digit (val = 5, base = 8)      →  5
  basechar (base = 8)              →  o

196

Simplifications and extensions to AGs

Some useful but obvious extensions to the AG meta-

language are

• The use of if ... then ... else statements,

and a case statement.

• Certain functions can also enhance the functionality, e.g. the function ‘numval’ in the attribute equation digit.val = numval(D), which converts ‘D’—the token for a digit—into its numerical value. The C function below does the trick:

The C function below does the trick

int numval(char D) {

return (int)D - (int)’0’;

}

197

AG for creating abstract syntax tree

An abstract syntax tree is created by the semantic rules.

Grammar rules            Semantic rules
exp1 → exp2 + term       exp1.tree = mkOpNode(+, exp2.tree, term.tree)
exp1 → exp2 - term       exp1.tree = mkOpNode(-, exp2.tree, term.tree)
exp → term               exp.tree = term.tree
term1 → term2 * factor   term1.tree = mkOpNode(*, term2.tree, factor.tree)
term → factor            term.tree = factor.tree
factor → (exp)           factor.tree = exp.tree
factor → number          factor.tree = mkNumNode(number.lexval)
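mkOpNode and mkNumNode are ordinary tree constructors. A minimal sketch—the field names are my own assumptions, chosen to be consistent with the streenode structure that appears later on Slide 329:

#include <stdlib.h>

typedef struct tnode {
    int isNum;                 /* 1 = number leaf, 0 = operator node */
    char op;                   /* '+', '-', '*' for operator nodes   */
    int  lexval;               /* for number leaves                  */
    struct tnode *l, *r;
} tnode;

tnode *mkNumNode(int lexval)
{
    tnode *t = malloc(sizeof *t);      /* error checking omitted */
    t->isNum = 1; t->lexval = lexval; t->l = t->r = NULL;
    return t;
}

tnode *mkOpNode(char op, tnode *l, tnode *r)
{
    tnode *t = malloc(sizeof *t);
    t->isNum = 0; t->op = op; t->l = l; t->r = r;
    return t;
}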

198

Algorithms for attribute computation

• Consider the attribute equation

Xi.aj = fij(X0.a1, . . . , X0.ak, X1.a1, . . . , X1.ak, . . . , Xn.a1, . . . , Xn.ak)

• It may be viewed as an assignment of the value

of the RHS to the attribute Xi.aj where all the

attributes used on the RHS must be known.

• Some of the RHS attributes depend on having the

value of others available before they can be com-

puted.

• These dependencies are subject to some inherent

order preordained by the code being translated.

The dependencies are quite easy to determine by

building a dependency graph.

199

Dependency graphs and evaluation order

• Given an attribute grammar, each production rule

has an associated dependency graph.

• This graph has a node labelled by each attribute

Xi.aj of each symbol in the grammar rule, and for

each attribute equation

Xi.aj = fij(. . . , Xm.ak, . . .)

associated with the grammar rule there is an edge from each node Xm.ak in the RHS to the node Xi.aj:

  Xm.ak ——→ Xi.aj

• e.g. for the production rule number1 → number2 digit and its attribute equation number1.val = number2.val ∗ 10 + digit.val, the dependency graph is

  number2.val ——→ number1.val ←—— digit.val

200

Dependency graphs

• The dependency graphs for grammar rules of the form digit → D are simple: they consist of a single node digit.val without any edges, where digit.val = numval(D).

• The grammar rule number → digit with attribute equation number.val = digit.val has the dependency graph

  digit.val ——→ number.val

• The dependency graph for the string 345 chains the digit.val of each digit into the number.val of the enclosing number nodes, bottom-up.

201

Dependency graphs

• In the grammar for declarations, the rule var-list1 → id, var-list2 has two associated attribute equations:

  id.dtype = var-list1.dtype
  var-list2.dtype = var-list1.dtype

  and the dependency graph

  var-list1.dtype ——→ id.dtype,  var-list1.dtype ——→ var-list2.dtype

• Similarly the grammar rule var-list → id has the dependency graph

  var-list.dtype ——→ id.dtype

• The two rules type → int and type → float

have trivial dependency graphs.

202

Dependency graph of decl grammar

• The rule decl → type var-list with equation var-list.dtype = type.dtype has the dependency graph (DG)

  type.dtype ——→ var-list.dtype

• Since decl is not involved in the DG, it is not clear which grammar rule is associated with it. So the DG is drawn over its corresponding parse tree segment, with decl as the parent of type and var-list.

• Now it is clear to which rule the dependency is associated.

203

Dependency graph of decl grammar

• The dependency graph superimposed over the parse tree for var-list1 → id, var-list2: dtype flows from the parent var-list down to id and to the child var-list.

• The dependency graph for float x,y: type.dtype (real) flows to the outer var-list.dtype, then to id(x).dtype and the inner var-list.dtype, and finally to id(y).dtype.

204

Dependency graph of based-num grammar

• The dependency graph for the grammar rule based-num → num basechar:

  num.val ——→ based-num.val,  basechar.base ——→ num.base

  The graph shows the dependencies based-num.val = num.val and num.base = basechar.base.

• The dependency graph for digit → 9:

  digit.base ——→ digit.val

  The dependency is created by

  digit.val = if digit.base = 8 then error else 9

  i.e. digit.val depends upon digit.base.

205

Dependency graph of based-num grammar

• The dependency graph for num1 → num2 digit:

  num1.base ——→ num2.base,  num1.base ——→ digit.base,
  num2.val ——→ num1.val,  digit.val ——→ num1.val

  The graph shows the dependencies of the three attribute equations

  num1.val = if digit.val = error or num2.val = error
             then error
             else num2.val ∗ num1.base + digit.val
  num2.base = num1.base
  digit.base = num1.base

• The dependency graph for num → digit is similar: num.base ——→ digit.base and digit.val ——→ num.val.

206

Dependency graph for 345o

• The graph shows the dependencies for 345o: basechar.base flows into based-num and down through the base attributes of every num and digit node, while the val attributes flow upwards from the digits to based-num.val.

207

Dependency graph for 345o

• The same graph with the dependencies for 345o numbered in a valid order of computation: the base attributes (starting from basechar.base) are computed first, the val attributes afterwards, ending with based-num.val.

208

Synthesized and inherited attributes

• Rule-based attribute evaluation is based on traver-

sal of the parse or syntax tree.

• There are various approaches.

• The simplest to handle are synthesized attributes.

• Definition: An attribute is synthesized if all its

dependencies point upwards—from child to parent—

in the parse tree.

• An attribute a is synthesized if, given the rule A → X1X2 . . . Xn, the only associated attribute equation with an a on the LHS is of the form

  A.a = f(X1.a1, . . . , X1.ak, . . . , Xn.a1, . . . , Xn.ak)

• An attribute grammar in which all the attributes

are synthesized is an S-attributed grammar.

• From the examples on Slides 200, 201 and 202 it

follows that the number grammar is S-attributed.

209

S-attributed grammar—example

• The decl grammar is not S-attributed, as is obvious from the dependency graph for float x,y given earlier: dtype flows downwards from type to the var-list and id nodes.

• An S-attributed grammar can be evaluated by a single bottom-up, LRN—or postorder—traversal of the parse or syntax tree.

• The following pseudo code may be used:

void posteval(treenode T) {
  for (each child C of T)
    posteval(C);
  compute all synthesized attributes of T;
}

210

C-code for posteval

typedef enum {Plus, Minus, Times} OpKind;
typedef enum {Opkind, Constkind} ExpKind;

typedef struct streenode {
  ExpKind kind;
  OpKind op;
  struct streenode *lchild, *rchild;
  int val;
} STreeNode;

void posteval(STreeNode *t) {
  if (t->kind == Opkind) {
    posteval(t->lchild);              // traverse left child
    posteval(t->rchild);              // traverse right child
    switch (t->op) {
    case Plus:
      t->val = t->lchild->val + t->rchild->val; break;
    case Minus:
      t->val = t->lchild->val - t->rchild->val; break;
    case Times:
      t->val = t->lchild->val * t->rchild->val; break;
    } // end switch
  } // end if
} // end posteval

211

Inherited attributes

• Not all attributes are synthesized.

• Definition: An attribute that is not synthesized is

inherited.

• There are three kinds of attribute inheritance, viz.

(a) Inheritance from parent to siblings,

(b) Inheritance from sibling to sibling,

(c) Sibling inheritance via sibling pointers.

[Diagram: (a) an attribute a inherited from parent A by its children B and C; (b) a passed from sibling B to sibling C; (c) a passed between siblings via sibling pointers.]

• Inherited attributes are evaluated by a preorder—or NLR—traversal of the parse or syntax tree:

212

Inherited attributes

• Inherited attributes are calculated by an NLR or preorder traversal of the parse or syntax tree.

void preeval(treenode T) {
  for (each child C of T) {
    compute all inherited attributes of C;
    preeval(C);
  }
}

213

Evaluating inherited attributes

• The decl grammar with semantic rules is as follows.

Grammar rules              Semantic rules
decl → type var-list       var-list.dtype = type.dtype
type → int                 type.dtype = integer
type → float               type.dtype = real
var-list1 → id, var-list2  id.dtype = var-list1.dtype
                           var-list2.dtype = var-list1.dtype
var-list → id              id.dtype = var-list.dtype

• The pseudo code to evaluate the dtype attributes:

void evaltype(treenode T) {
  switch (T->nodekind) {
  case decl:
    evaltype(T->type child);
    T->var-list child->dtype = T->type child->dtype;
    evaltype(T->var-list child);
    break;
  case type:
    if (T->child == int) T.dtype = integer;
    else T.dtype = real;
    break;
  case var-list:
    T->first child->dtype = T.dtype;
    if (T->third child != NIL) {
      T->third child->dtype = T.dtype;
      evaltype(T->third child);
    }
    break;
  } // end switch
} // end evaltype

214

NLR traversal for float x,y;

[Diagram: preorder (NLR) traversal order for the parse tree of float x,y; — dtype is computed and passed down before each child is visited.]

C code for NLR:

typedef enum {decl, type, id} nodekind;
typedef enum {integer, real} typekind;

typedef struct treeNode {
  nodekind kind;
  struct treeNode *lchild, *rchild, *sibling;
  typekind dtype;   // for type and id nodes
  char *name;       // for id nodes
} treeNode;

void evaltype(treeNode *t) {
  switch (t->kind) {
  case decl:
    t->rchild->dtype = t->lchild->dtype;
    evaltype(t->rchild);
    break;
  case id:
    if (t->sibling != NULL) {
      t->sibling->dtype = t->dtype;
      evaltype(t->sibling);
    }
    break;
  } // end switch
} // end evaltype

215

Attribute grammars—more examples

See examples of calculating attributes for

• ‘based-num’ grammar

• ‘exp → exp / exp|num|num.num’ grammar

in Louden pp. 282–284.

216

Other means of handling attributes

• Attribute as parameters/returned values

It sometimes saves storage space to use parameters

and returned values to transfer attribute values—

rather than storing them in the nodes of the syntax

tree record structure. Louden gives a worked ex-

ample of this on pp. 285–287.

• Attribute as external data structures

It may be convenient to use data structures in addition to the symbol table, such as another lookup table, graphs, stacks, etc. Louden has examples on pp. 287–289.

• Computing attributes during parsing

This question is more interesting because, depending on the grammar, the total effort during compilation and the effort of constructing the compiler can be reduced by having fewer passes through the syntax tree and a shorter compiler. This is discussed by Louden on pp. 288–295.

217

The symbol table

• The structure of the symbol table

• Declarations

• Scope rules and block structure

• Interaction of same-level declarations

• Attribute grammar using a symbol table

218

The symbol table

• After the syntax tree, the symbol table is the major

inherited attribute in a compiler.

• It is possible, but unnecessary, to delay building

the symbol table until after the parsing has com-

pleted.

• It is easier to build the symbol table as information

becomes available from the scanner and parser.

• The principal operations are:

– insert—for putting properties such as type and

scope into the symbol table,

– lookup—for retrieving attributes of a name in

the table, and

– delete—for removing items from the table.

• Using a hash table for the symbol table is prefer-

able because of its speed for these three operations.

219

The structure of the symbol table

• The hash table implements insertion, lookup and

deletion in O(1) time.

• The greatest disadvantage of using a hash table is

that it cannot produce a lexicographical listing of

its entries.

• Using slower O(log n) methods such as a binary-sequence search tree (BSST), AVL trees or even B-trees, which can easily display the symbols alphabetically in O(n) time, is not warranted.

• Closed hash tables where the entries are placed

directly into the table tend to become slower as

they become over 85% full. The tables must then

be resized and rehashed.

• Open hash tables are easier to tune and behave

more consistently as more entries are added.

220

Open hash table

• The performance of an open hash table with m

buckets is easy to tune by increasing the size of m.

• A uniform random mapping of keys to indices i ∈ [0..m−1] is usually good enough.

• A simple hash code is

  h0 = 0
  hi+1 = (α·hi + ci) mod m

  where α is a suitable number and ci is some integer representing a character from the key.

• A variation is to apply the mod once only after

the final iteration.
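In C the scheme above, applying the mod on every step, is only a few lines; the multiplier 31 stands in for α and is an assumed, conventional choice:

/* h0 = 0, h_{i+1} = (alpha * h_i + c_i) mod m */
unsigned hash(const char *key, unsigned m)
{
    const unsigned alpha = 31;           /* a suitable multiplier */
    unsigned h = 0;
    for ( ; *key != '\0'; key++)
        h = (alpha * h + (unsigned char)*key) % m;
    return h;                            /* index into H[0..m-1]  */
}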

221

Open hash table

• It is not unusual to mix predetermined keywords

with transient variable names in one and the same

table.

• Separating keywords and programmer identifiers

unnecessarily complicates the code.

• In the diagram ‘Hsize = m.’

[Diagram: an open hash table H[0..Hsize−1]; each bucket points to a singly linked chain of stNodes holding the entries apricots, carrots, marrows, apples, pears and beans; unused buckets and chain ends are Λ.]

222

-

// symboltable.h
#ifndef SYMBOLTABLE_H
#define SYMBOLTABLE_H
struct stNode;
typedef stNode* stNodeP;

class st {
public:
  bool isEmpty() const;
  // post: return value == nil or true;
  st(); // Constructor
  // post: st created && st.isEmpty() == true;
  //       hashtable H is set up with Hsize elements,
  //       each element H[i] == NULL
  ~st();
  int lookupId(char* id, int idClass, int idState);
  int printSymbolTable();
  int insertId(char* id, int& idClass, int& idState);
  int deleteId(char* id, int idClass, int idState);
  int hash(char* id);
private:
  enum {Hsize = 127};
  st* root;
  stNodeP H[Hsize];
  int noNodes;
};
#endif

223

Declarations

Four kinds of declarations occur frequently:

1. constants, such as

const int SIZE = 199;

2. types, such as

type Table = array [1..SIZE] of Entry;

and struct and union declarations such as

struct Entry {

char *name;

int count;

struct Entry *next;

};

and a typedef mechanism such as

typedef struct Entry *EntryPtr;

3. variables and

4. procedures/functions.

224

Declarations

Four kinds of declarations occur frequently:

1. constants,

2. types,

3. variables, such as

   int a, b[100]; which defines a and b and allocates memory to them;

   static int a, b[100]; which declares variables local to the procedure but not placed on the procedure’s stack;

   extern int a, b[100]; which tells the compiler that the linker will find this variable allocated and initialized in another module; and

   register int x; which allocates the variable to a register instead of memory.

4. procedures/functions, which are defined by giving a body of statements to execute. Prototypes may also be declared.

225

Scope rules and block structure

• Explicit declaration prior to use helps the programmer to reduce type and reference errors.

• This simplifies the symbol table operations—it is easy to detect variables that have not been declared, and it tends to make programming more idiot proof.

• It also enables single-pass compilation.

• Languages where explicit declaration or declaration before use is not required cannot easily be compiled in a single pass.

• Block structure leads to ‘older’ variables being shadowed by ‘more recently’ declared variables of the same name.

• Although block structure is not a particularly useful programming feature, it makes it possible to save some run-time memory at the cost of more elaborate execution procedures.

226

Scope rules and block structure

int i,j;

int f (int size) {

char i, temp;

... // body of f

{ double j; // block A

...

}

... // body of f

{ char *j; // block B

...

}

} // end of f

The nonlocal int i cannot be reached from the compound statement body of the function f and is thus said to have a scope hole there.

The nonlocal int j can be reached from the compound statement body of the function f but not from within either of the two blocks.

In Pascal and Ada functions can be nested, complicating the run-time environment.

227

Scope rules and block structure

The Pascal code below reflects a similar symbol table structure as the C example. But that is where the similarity ends. Very interesting scoping and access problems arise during run time, since f, g and h can be called by one another in many different ways.

program Ex;
var i, j: integer;

function f(size: integer): integer;
var i, temp: char;

  procedure g;
  var j: real;
  begin
    ...
  end; { g }

  procedure h;
  var j: ^char;
  begin
    ...
  end; { h }

begin { body f }
  ...
end; { f }

begin { main program }
  ...
end.

228

Scope rules and block structure

• To implement nested scopes and shadowing, the stInsert operation must not overwrite the previous declarations, but must temporarily hide them from view, so that stLookup finds only the shadowing variables.

• Similarly delete must only remove the shadowing

variable and leave the previously hidden variables.

• The shadowing is thus easily implemented.

• See also the procedures.ps Slides 4–16 of Mooly

Sagiv and Reinhard Wilhelm made available on

the class web page.

• Some pictorial examples from Louden pp. 304–305

follow.

229

Symbol table contents

• After the declarations of the body of f:

[Diagram: hash table chains — i: char (body of f) shadowing i: int (global); size: int; temp: char; j: int (global); f: function.]

• After the declaration of Block B in the body of f:

[Diagram: as above, with j: char * (Block B) now chained in front of the global j: int.]

• After leaving f and deleting its declarations:

[Diagram: only the global entries remain — i: int, j: int and f: function.]

230

Using separate tables for each scope

• Using separate tables for each scope

[Diagram: one hash table per scope — an outer table holding i (int), j (int) and f (function); a table for the body of f holding i (char), size (int) and temp (char); and a table for Block B holding j (char *).]

231

Variables in scope holes

• The global integer variable i in the Ex program

on Slide 228 may be accessible by using notation

that defines the scope, i.e. Ex.i

• In this manner the references to the various variables called j could be written as Ex.j, Ex.f.g.j or Ex.f.h.j.

• The nesting depth may also be used. Ex has a

nesting depth 0, and f has nesting depth 1, while

g and h both have a nesting depth of 2.

• The scope resolution operator of C++ may also be

used to define the scope of a class declaration:

class A {
  ...
  int f(); // f is a member function
  ...
};

int A::f() { // this is the definition of f in A
  ...
}

232

Dynamic scope

• Common Lisp normally uses static—also called lexical—scope, but older Lisp implementations often used dynamic scope.

• Variables with dynamic scope are possible in Common Lisp.

• The following C++ example illustrates dynamic scoping:

#include <iostream.h>
int i = 1;
void f(void) {
  cout << "i = " << i << endl;
}
int main(void) {
  int i = 2;
  f();
  return 0;
}

• In a normal C++ program the value printed for ‘i’ will be ‘1’ since C++ uses static scoping.

• If C++ used dynamic scoping the value printed would be ‘2’.

233

Interaction of same-level declarations

• In a correct C++ compiler the following example

typedef int i;

int i

should produce a compilation error.

• This kind of error is detected by using the symbol

table—the symbol table should not permit a spe-

cific variable to be inserted once it is already there

at a given level.

234

Interaction of same-level declarations

• In this example

#include <iostream.h>

int i = 1;

void f(void) {

  int i = 2, j = i + 1;

  cout << "j = " << j << endl;

}

int main(void) {

  f();
  return 0;

}

The value printed for ‘j’ is a ‘3’ because the dec-

larations are evaluated sequentially.

• Some languages permit collateral declaration in

which case the value of ‘j’ is derived from the ‘i’ in

the outer block, because the inner ‘i’ cannot yet

be regarded to exist.

• This kind of declaration is possible in Common

Lisp, ML and Scheme.

235

Interaction of same-level declarations

• Yet another possibility is recursive declaration in

which declarations refer to one another.

int gcd(int m, int n) {

  if (m == 0) return n;

  else return gcd(n % m, m);

}

• In this case the name ‘gcd’ must be added to the symbol table before the body of the function is processed, otherwise the function will not be known when the compiler encounters it at the recursive call.

236

Interaction of same-level declarations

• In the case of mutually recursive functions even that is not enough.

void f(void) {
  ... g(); ...
}
void g(void) {
  ... f(); ...
}

Some sort of a forward declaration is needed, such as a prototype in C++:

void g(void); // prototype for g().

void f(void) {
  ... g(); ...
}
void g(void) {
  ... f(); ...
}

In Pascal the keyword ‘forward’ is used to prede-

clare the procedure heading.

237

Attribute grammar using a symbol table

238

Data types and type checking

• Type expressions and type constructors.

• Type names, type declarations and recursive types.

• Type equivalence.

• Type inference and type checking.

• Additional topics in type checking.

260

A practical semantic analyzer

• A symbol table.

• A semantic analyzer.

280

Run-time environments

• Memory organization during program execution.

• Fully static run-time environments.

• Stack-based run-time environments.

• Dynamic memory.

• Parameter passing mechanisms.

300

Code generation

• Intermediate code and data structures for code

generation.

• Basic code generation techniques.

• Code generation of data structure references.

• Code generation of control statements.

• Code generation of procedure and function calls.

• Code generation in commercial compilers.

• TM: a simple target machine.

• A survey of code optimization techniques.

320

Code generation

• Generate executable code for a target machine

• Executable code depends on:

1. source language,

2. runtime environment,

3. target machine and the

4. operating system.

• May produce assembler output or relocatable bi-

nary, necessitating an assembler and a linker.

• Code should be optimized.

• Use an intermediate representation (IR) such as a

parse tree, or produce code directly.

• Our compiler produces directly executable P-code.

321

Intermediate representation (IR)

We will discuss two popular forms of intermediate code:

• Principal IR is an abstract syntax tree (AST).

• AST relies heavily upon the symbol table

• AST does not resemble target code well enough.

• Forms of intermediate code that are closer to target

machines are:

– 3-Address code

– P-code

• Data structures for implementing 3-address code

– Usually represented as quads.

– Often avoided because an extra compilation

pass is needed to produce final target code.

– Discussion is merited because it assists in un-

derstanding code generation.

322

3-Address code

• The general form of an arithmetic operation is

x = y op z

The obvious semantics apply, namely, x must be an L-value, while y and z may be either L-values or R-values (such as constants, which have no run-time addresses).

• The expression 2*a+(b-3) has the syntax tree with + at the root, a left subtree * with children 2 and a, and a right subtree - with children b and 3.

• The expression 2*a+(b-3) may be translated into

t1 = 2 * a

t2 = b - 3

t3 = t1 + t2
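Three-address code of this shape falls out of a post-order walk that allocates a fresh temporary per operator node. A minimal sketch over a hypothetical expression-tree type (xnode and emit3ac are my names, not Louden's); running it on the tree for 2*a+(b-3) prints exactly the three lines above:

#include <stdio.h>

typedef struct xnode {                  /* hypothetical expression tree */
    const char *leaf;                   /* "2", "a", ...; NULL if op    */
    char op;
    struct xnode *l, *r;
} xnode;

/* Emit one three-address line per operator, post-order, and return
   the name holding this node's value; temporaries tN are generated. */
static const char *emit3ac(const xnode *t)
{
    static char names[32][8];
    static int  ntemps = 0;
    if (t->leaf) return t->leaf;
    const char *y = emit3ac(t->l), *z = emit3ac(t->r);
    char *x = names[ntemps++];
    sprintf(x, "t%d", ntemps);
    printf("%s = %s %c %s\n", x, y, t->op, z);
    return x;
}

int main(void)                          /* builds 2*a + (b-3) */
{
    xnode two = {"2"}, a = {"a"}, b = {"b"}, three = {"3"};
    xnode m = {NULL, '*', &two, &a}, s = {NULL, '-', &b, &three};
    xnode p = {NULL, '+', &m, &s};
    emit3ac(&p);                        /* prints the t1, t2, t3 lines */
    return 0;
}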

323

A simple program with its 3-address code

read x;       { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact  { output factorial of x }
end

read x

t1 = x > 0

if_false t1 goto L1

fact = 1

label L2

t2 = fact * x

fact = t2

t3 = x - 1

x = t3

t4 = x == 0

if_false t4 goto L2

write fact

label L1

halt

324

Represented as triples

read x;       { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact  { output factorial of x }
end

(0) (rd,x,_)

(1) (gt,x,0)

(2) (if_f,(1),(11))

(3) (asn,1,fact)

(4) (mul,fact,x)

(5) (asn,(4),fact)

(6) (sub,x,1)

(7) (asn,(6),x)

(8) (eq,x,0)

(9) (if_f,(8),(4))

(10) (wri,fact,_)

(11) (halt,_,_)

325

Represented as quads

read x;       { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact  { output factorial of x }
end

(rd,x,_,_)

(gt,x,0,t1)

(if_f,t1,L1,_)

(asn,1,fact,_)

(lab,L2,_,_)

(mul,fact,x,t2)

(asn,t2,fact,_)

(sub,x,1,t3)

(asn,t3,x,_)

(eq,x,0,t4)

(if_f,t4,L2,_)

(lab,L1,_,_)

(halt,_,_,_)
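The quads suggest a direct data structure: an opcode plus three operand fields, any of which may be empty. A minimal sketch in C—the type and constant names are my assumptions, not Louden's:

typedef enum { RD, GT, IF_F, ASN, MUL, SUB, EQ, WRI, LAB, HALT } QuadOp;

typedef struct {
    QuadOp op;
    const char *a1, *a2, *res;    /* operands/result; NULL plays "_" */
} Quad;

/* e.g. the first three quads of the factorial program: */
static const Quad code[] = {
    { RD,   "x",  NULL, NULL },   /* (rd,x,_,_)     */
    { GT,   "x",  "0",  "t1" },   /* (gt,x,0,t1)    */
    { IF_F, "t1", "L1", NULL },   /* (if_f,t1,L1,_) */
};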

326

3-address code with P-code equivalent

• The expression 2*a+(b-3) may be translated into

t1 = 2 * a

t2 = b - 3

t3 = t1 + t2

• This code may be translated into:

ldc a A(a)   ; t1 = 2 * a
ldi i
ldc i 2
mul          ; leave value of t1 on stack
ldc a A(b)   ; t2 = b - 3
ldi i
ldc i 3
sub          ; leave value of t2 on stack
add          ; t3 = t1 + t2, t3 on stack

327

Intermediate code as a synthesized attribute

(x=x+3)+4

lda x
lod x
ldc 3
adi
stn
ldc 4
adi

Grammar Rule             Semantic Rules
exp1 → id = exp2         exp1.pcode = "lda"||id.strval ++ exp2.pcode ++ "stn"
exp → aexp               exp.pcode = aexp.pcode
aexp1 → aexp2 + factor   aexp1.pcode = aexp2.pcode ++ factor.pcode ++ "adi"
aexp → factor            aexp.pcode = factor.pcode
factor → (exp)           factor.pcode = exp.pcode
factor → num             factor.pcode = "ldc"||num.strval
factor → id              factor.pcode = "lod"||id.strval

328

Practical code generation: genCode

procedure genCode(T: treenode);
begin
  if T is not nil then
    generate code to prepare for code of left child of T;
    genCode(left child of T);
    generate code to prepare for code of right child of T;
    genCode(right child of T);
    generate code to implement the action of T;
end;

enum Optype {Plus, Assign};
enum NodeKind {OpKind, ConstKind, IdKind};

typedef struct streenode {
  NodeKind kind;
  Optype op;                 // used with OpKind
  struct streenode *lchild, *rchild;
  int val;                   // used with ConstKind
  char *strval;              // used for identifiers and numbers
} STreeNode;

typedef STreeNode *SyntaxTree;

329

Practical code generation: genCode

void genCode(SyntaxTree t) {
  char codestr[CODESIZE];  // CODESIZE = max length of a P-code line

  if (t != NULL) {
    switch (t->kind) {
    case OpKind:
      switch (t->op) {
      case Plus:
        genCode(t->lchild);
        genCode(t->rchild);
        emitCode("adi");
        break;
      case Assign:
        sprintf(codestr, "%s %s", "lda", t->strval);
        emitCode(codestr);
        genCode(t->lchild);
        emitCode("stn");
        break;
      default:
        emitCode("Error");
        break;
      }
      break;
    case ConstKind:
      sprintf(codestr, "%s %s", "ldc", t->strval);
      emitCode(codestr);
      break;
    case IdKind:
      sprintf(codestr, "%s %s", "lod", t->strval);
      emitCode(codestr);
      break;
    default:
      emitCode("Error");
      break;
    }
  }
}

330

Practical code generation: Bison

%{
#define YYSTYPE char *
/* make Bison/Yacc use strings as values */
/* Other inclusion code ... */
%}
%token NUM ID
%%
exp : ID { sprintf(codestr, "%s %s", "lda", $1);
           emitCode(codestr); }
      '=' exp { emitCode("stn"); }
    | aexp
    ;

aexp : aexp '+' factor { emitCode("adi"); }
     | factor
     ;

factor : '(' exp ')'
       | NUM { sprintf(codestr, "%s %s", "ldc", $1);
               emitCode(codestr); }
       | ID  { sprintf(codestr, "%s %s", "lod", $1);
               emitCode(codestr); }
       ;
%%
/* utility functions ... */

331


[Figure exp1.latex: parse tree for exp → exp op exp over the string number + number.]

[Figures expbig1.latex, expbig2.latex: parse tree for (number - number) * number; the second copy numbers the nodes in construction order.]

[Figures exp1lefttree.latex, exp-rightparse.latex: the same tree for number + number with nodes numbered in leftmost and rightmost derivation order.]

[Figures exp-ambigous1.latex, exp-ambigous2.latex: the two parse trees exhibiting the ambiguity of number - number * number.]