
  • A COURSE MATERIAL ON

    COMPILER DESIGN – ECGS22

    2017-2018

    BY

    Dr.M.Deepamalar., MCA.,M.Phil.,Ph.D

    ASSISTANT PROFESSOR

    DEPARTMENT OF COMPUTER SCIENCE

    PARVATHY'S ARTS AND SCIENCE COLLEGE

    DINDIGUL


    PARVATHY’S ARTS AND SCIENCE COLLEGE, DINDIGUL

    DEPARTMENT OF COMPUTER SCIENCE

    COMPILER DESIGN – ECGS22

    I M.SC (CS) – (2017 -2018)

    SYLLABUS

    UNIT – I

    Compilers and Translators – Why Do We Need Translators? – The Structure of a Compiler – Lexical Analysis – Syntax Analysis – Intermediate Code Generation – Optimization – Code Generation – Book-keeping – Error Handling – Compiler-Writing Tools – Getting Started.

    The Role of the Lexical Analyzer – A Simple Approach to the Design of a Lexical Analyzer – Regular Expressions – Finite Automata – From Regular Expressions to Finite Automata – Minimizing the Number of States of a DFA – A Language for Specifying Lexical Analyzers – Implementing a Lexical Analyzer – The Scanner Generator as Swiss Army Knife.

    UNIT – II

    The Syntactic Specification of Programming Languages – Derivations and Parse Trees – Capability of Context-Free Grammars. Parsers: Shift-Reduce Parsing – Operator-Precedence Parsing – Top-Down Parsing – Predictive Parsers.

    UNIT – III

    LR Parsers – The Canonical Collection of LR(0) Items – Constructing SLR Parsing Tables – Constructing Canonical LR Parsing Tables – Constructing LALR Parsing Tables – Using Ambiguous Grammars – An Automatic Parser Generator – Implementation of LR Parsing Tables – Constructing LALR Sets of Items. Syntax-Directed Translation Schemes – Implementation of Syntax-Directed Translation Schemes – Intermediate Code – Parse Trees and Syntax Trees – Three-Address Code, Quadruples, and Triples – Translation of Assignment Statements – Boolean Expressions – Statements That Alter the Flow of Control – Postfix Translations – Translations with a Top-Down Parser.

    UNIT – IV

    The Contents of a Symbol Table – Data Structures for a Symbol Table – Representing Scope Information. Errors: Lexical-Phase Errors – Syntactic-Phase Errors – Semantic Errors. The Principal Sources of Optimization – Loop Optimization – The DAG Representation of Basic Blocks – Value Numbers and Algebraic Laws – Global Data-Flow Analysis.

    UNIT – V

    Dominators – Reducible Flow Graphs – Depth-First Search – Loop-Invariant Computations – Induction Variable Elimination – Some Other Loop Optimizations. Code Generation – Object Programs – A Machine Model – A Simple Code Generator – Register Allocation and Assignment – Code Generation from DAGs – Peephole Optimization.

    Text Book: Principles of Compiler Design, Alfred V. Aho & Jeffrey D. Ullman, 25th Reprint, 2002.


    UNIT – I : COMPILERS AND TRANSLATORS

    Translator

    • A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer.

    • The important roles of a translator are:
      1. Translating the HLL program input into an equivalent ML program.
      2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.

    Type of Translators

    INTERPRETER

    Converts the source code into machine code one line at a time.

    The program therefore runs very slowly.

    The main reason why an interpreter is used is at the testing/development stage: programmers can quickly identify errors and fix them.

    The translator must be present on the computer for the program to be run.

    COMPILER

    Converts the whole code into one file (often a .exe file).

    The file can then be run on any computer without the translator needing to be present.

    Can take a long time to compile source code, as the translator will often have to convert the instructions into various sets of machine code, because different CPUs understand instructions with different machine code from one another.

    ASSEMBLER

    This type of translator is used for assembly language (not high-level languages).

    It converts mnemonic assembly language instructions into machine code.

    Why do We Need Translators?

    Translators are programs that convert high-level language commands (print, IF, FOR, etc.)

    …into a set of machine code commands:

    1011, 11001, 11000011110, etc.

    …so that the CPU can process the data!

    There are 2 ways in which translators work:


    1. Take the whole code and convert it into machine code before running it (known as compiling).

    2. Take the code one instruction at a time, translate and run the instruction, before translating the next instruction (known as interpreting).

    Compiler

    • A compiler is a program that reads a program written in one language, called the source language, and translates it into an equivalent program in another language, called the target language. The target program is then provided with input to produce output. C, Java, and Pascal are all compiled.

    • A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.

    • Executing a program written in an HLL programming language basically has two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed.

    List of Compilers

    1. Ada compilers
    2. ALGOL compilers
    3. BASIC compilers
    4. C# compilers
    5. C compilers
    6. C++ compilers
    7. COBOL compilers
    8. D compilers
    9. Common Lisp compilers
    10. ECMAScript interpreters


    11. Eiffel compilers
    12. Felix compilers
    13. Fortran compilers
    14. Haskell compilers
    15. Java compilers
    16. Pascal compilers
    17. PL/I compilers
    18. Python compilers
    19. Scheme compilers
    20. Smalltalk compilers
    21. CIL compilers

    Why do we need Compilers?

    Compilers are important – they are responsible for many aspects of system performance, and attaining performance has become more difficult over time.

    Compilers are interesting – they include many applications of theory to practice, and writing a compiler exposes algorithmic and engineering issues.

    Compilers are everywhere – many practical applications have embedded languages: commands, macros, formatting tags.

    Challenges of Compiler Construction

    Compiler construction poses challenging and interesting problems:
    o Compilers must process large inputs and perform complex algorithms, but also run quickly.
    o Compilers have primary responsibility for run-time performance.
    o Compilers are responsible for making it acceptable to use the full power of the programming language.
    o Computer architects perpetually create new challenges for the compiler by building more complex machines.
    o Compilers must hide that complexity from the programmer.

    A successful compiler requires mastery of the many complex interactions between its constituent parts.

    The Structure of a Compiler

    Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below. There are two parts of compilation:

    a. Analysis (Machine Independent/Language Dependent)

    b. Synthesis (Machine Dependent/Language Independent)


    The compilation process is partitioned into a number of sub-processes called 'phases'.

    Lexical Analysis:-

    The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.

    Token

    A token has two parts:
    1. Type of the token.
    2. Value of the token.

    Type: variable, operator, keyword, constant.
    Value: name of the variable, current variable, or pointer to the symbol table.

    If the symbols are given in the standard format, the LA accepts them and produces tokens as output. Each token is a sub-string of the program that is to be treated as a single unit. Tokens are of two types:
    1. Specific strings such as IF or a semicolon.
    2. Classes of strings such as identifiers, labels, constants.
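    For illustration, here is a minimal C sketch of the <type, value> pair the LA emits for each token (the type names and layout are assumed for illustration, not taken from the text):

        #include <stdio.h>

        /* Type of the token: the classes named above. */
        enum token_type { TOK_ID, TOK_KEYWORD, TOK_OPERATOR, TOK_CONSTANT };

        /* Value of the token: here just the lexeme string; a real scanner
           would often store a pointer into the symbol table instead. */
        struct token {
            enum token_type type;
            const char *lexeme;
        };

        int main(void) {
            /* tokens produced for the fragment "IF x" */
            struct token t1 = { TOK_KEYWORD, "IF" };
            struct token t2 = { TOK_ID, "x" };
            printf("<%d,%s> <%d,%s>\n", t1.type, t1.lexeme, t2.type, t2.lexeme);
            return 0;
        }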


    Syntax Analysis:-

    The second stage of translation is called syntax analysis or parsing. In this phase expressions, statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by using techniques based on the formal grammar of the programming language.

    Intermediate Code Generations:-

    An intermediate representation of the final machine language code is produced. This phase

    bridges the analysis and synthesis phases of translation.

    Code Optimization :-

    This is an optional phase designed to improve the intermediate code so that the output runs faster

    and takes less space.

    Code Generation:-

    The last phase of translation is code generation. A number of optimizations to reduce the length

    of machine language program are carried out during this phase. The output of the code

    generator is the machine language program of the specified computer.

    Table Management (or) Book-keeping:- This is the portion that keeps the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.

    Error Handlers:-

    It is invoked when a flaw in the source program is detected. The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.

    The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.

    For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression. Syntax analysis makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.

    For example, A/B*C has two possible interpretations:

    1. divide A by B and then multiply by C, or
    2. multiply B by C and then use the result to divide A.

    Each of these two interpretations can be represented in terms of a parse tree.


    Intermediate Code Generation:-

    The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate language representation of the source program.

    Code Optimization

    This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.

    a. Local Optimization:-

    There are local transformations that can be applied to a program to make an improvement. For

    example,

    If A > B goto L2

    Goto L3

    L2 :

    This can be replaced by the single statement

    If A ≤ B goto L3

    Another important local optimization is the elimination of common sub-expressions

    A := B + C + D

    E := B + C + F

    Might be evaluated as

    T1 := B + C

    A := T1 + D

    E := T1 + F

    This takes advantage of the common sub-expression B + C.

    b. Loop Optimization:-

    Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
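    A minimal C sketch of this transformation (the function names and the example computation are assumed for illustration):

        #include <stddef.h>

        /* Before: x * y is loop-invariant but is recomputed on every iteration. */
        void fill_naive(int *a, size_t n, int x, int y) {
            for (size_t i = 0; i < n; i++)
                a[i] = x * y + (int)i;
        }

        /* After: the invariant computation is moved to just before the loop. */
        void fill_hoisted(int *a, size_t n, int x, int y) {
            int t = x * y;                 /* computed once, outside the loop */
            for (size_t i = 0; i < n; i++)
                a[i] = t + (int)i;
        }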

    Code generator :-

    Code Generator produces the object code by deciding on the memory locations for data,

    selecting code to access each datum and selecting the registers in which each computation is to

    be done. Many computers have only a few high speed registers in which computations can be

    performed quickly. A good code generator would attempt to utilize registers as efficiently as

    possible.

    Table Management OR Book-keeping :-

    A compiler needs to collect information about all the data objects that appear in the source

    program. The information about data objects is collected by the early phases of the compiler-

    lexical and syntactic analyzers. The data structure used to record this information is called a Symbol Table.


    Error Handling :-

    One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler. Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.

    Example: Compilation Process of a source code through phases


    Compiler-Construction Tools or Compiler Writing Tools

    Software development tools are available to implement one or more compiler phases:
    – Scanner generators (Lex and Flex)
    – Parser generators (Yacc and Bison)
    – Syntax-directed translation engines
    – Automatic code generators
    – Data-flow engines

    The role of lexical analyzer

    It is the first phase of compiler

    Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis

    Reasons to make it a separate phase are:
    – Simplifies the design of the compiler
    – Provides efficient implementation
    – Improves portability

    Interaction of the Lexical Analyzer with the Parser

    Lexical Analysis Vs Parsing

    Lexical analysis:
    A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc. The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens.

    Parsing:
    A parser converts this list of tokens into a tree-like object representing how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence); the "parser" proper turns whole tokens into sentences of your grammar. A parser does not give the nodes any meaning beyond structural cohesion; the next thing to do is extract meaning from this structure (sometimes called contextual analysis).


    Tokens, Patterns, and Lexemes

    • A token is a classification of lexical units.
      – For example: id and num
    • Lexemes are the specific character strings that make up a token.
      – For example: abc and 123
    • Patterns are rules describing the set of lexemes belonging to a token.
      – For example: "letter followed by letters and digits" and "non-empty sequence of digits"

    Difference between Token, Lexeme and Pattern

    Token      Lexeme                  Pattern
    if         if                      if
    relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
    id         y, x                    letter followed by letters and digits
    num        31, 28                  any numeric constant
    operator   +, *, -, /              any arithmetic operator: + or * or - or /

    Attributes of Tokens


    Specification of Tokens

    • Alphabet: a finite, nonempty set of symbols.
      Example: ∑ = {0,1}, the binary alphabet.
    • String: a finite sequence of symbols from an alphabet, e.g. 0011001.
    • Empty string: the string with zero occurrences of symbols from the alphabet; the empty string is denoted by ε.
    • Length of a string: the number of positions for symbols in the string; |w| denotes the length of string w.
      Example: |0110| = 4; |ε| = 0.
    • Powers of an alphabet: ∑^k = the set of strings of length k with symbols from ∑.
    • The set of all strings over ∑ is denoted ∑*.
    • Language: a specific set of strings over some fixed alphabet.
      Examples: the set of legal English words; the set of strings consisting of n 0's followed by n 1's, {ε, 01, 0011, 000111, …}; L_P = the set of binary numbers whose value is prime, {10, 11, 101, 111, 1011, …}.

    Concatenation and Exponentiation

    • The concatenation of two strings x and y is denoted by xy.
    • The exponentiation of a string s is defined by
      s^0 = ε
      s^i = s^(i-1) s for i > 0
      Note that s^1 = s and εs = sε = s.


    Language Operations

    • Union: L ∪ M = { s | s ∈ L or s ∈ M }

    • Concatenation: LM = { xy | x ∈ L and y ∈ M }

    • Exponentiation: L^0 = {ε}; L^i = L^(i-1) L

    • Kleene closure: L* = the union of L^i for i = 0, …, ∞

    • Positive closure: L+ = the union of L^i for i = 1, …, ∞

    Regular Expressions

    • Basis symbols:

      – ε is a regular expression denoting the language {ε}

      – a ∈ ∑ is a regular expression denoting {a}

    • If r and s are regular expressions denoting languages L(r) and M(s) respectively, then

      – r | s is a regular expression denoting L(r) ∪ M(s)

      – rs is a regular expression denoting L(r)M(s)

      – r* is a regular expression denoting L(r)*

      – (r) is a regular expression denoting L(r)

    • A language defined by a regular expression is called a Regular Set or a Regular Language.

    Regular Definitions

    • Regular definitions introduce a naming convention:

      d1 → r1
      d2 → r2
      …
      dn → rn

      where each ri is a regular expression over ∑ ∪ {d1, d2, …, di-1}


    • Example:

      letter → A | B | … | Z | a | b | … | z
      digit → 0 | 1 | … | 9
      id → letter ( letter | digit )*

    • The following shorthands are often used:

      r+ = rr*
      r? = r | ε
      [a-z] = a | b | c | … | z

    • Examples:

      digit → [0-9]
      num → digit+ (. digit+)? ( E (+|-)? digit+ )?

    Regular Definitions and Grammars

    Grammar

    Regular definitions


    Coding Regular Definitions in Transition Diagrams

    relop → < | <= | = | <> | > | >=

    id → letter ( letter | digit )*

    Finite Automata

    • Finite automata are used as a model for:

      – software for designing digital circuits
      – the lexical analyzer of a compiler
      – searching for keywords in a file or on the web
      – software for verifying finite-state systems, such as communication protocols

    Design of a Lexical Analyzer Generator

    • Translate regular expressions to an NFA
    • Translate the NFA to an efficient DFA


    Nondeterministic Finite Automata

    • An NFA is a 5-tuple (S, ∑, δ, s0, F) where

      S is a finite set of states

      ∑ is a finite set of symbols, the alphabet

      δ is a mapping from S × (∑ ∪ {ε}) to sets of states

      s0 ∈ S is the start state

      F ⊆ S is the set of accepting (or final) states

    Transition Graph

    • An NFA can be diagrammatically represented by a labeled directed graph called a

    transition graph

    Transition Table

    • The mapping of an NFA can be represented in a transition table


    The Language Defined by an NFA

    • An NFA accepts an input string x if and only if there is some path with edges

    labeled with symbols from x in sequence from the start state to some accepting

    state in the transition graph

    • A state transition from one state to another on the path is called a move

    • The language defined by an NFA is the set of input strings it accepts, such as

    (a|b)*abb for the example NFA

    Converting RE to NFA

    • This is one way to convert a regular expression into an NFA.
    • There can be other (more efficient) ways to do the conversion.
    • Thompson's Construction is a simple and systematic method.
    • It guarantees that the resulting NFA will have exactly one final state and one start state.
    • Construction starts from the simplest parts (alphabet symbols).
    • To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.

    From Regular Expression to ε-NFA


    Example:

    For the RE (a|b)*a, the NFA construction is shown below.
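    As a sketch of how the construction can be coded (the data layout and helper names below are assumptions for illustration, not part of the text), each helper builds an NFA fragment with exactly one start state and one final state, and ε-edges glue the sub-NFAs together:

        #include <stdio.h>
        #include <stdlib.h>

        #define EPS 0                       /* label used for epsilon edges */

        struct state {
            int c1, c2;                     /* labels of up to two out-edges */
            struct state *out1, *out2;      /* NULL means the edge is absent */
        };
        struct frag { struct state *start, *final; };

        static struct state *new_state(void) {
            return calloc(1, sizeof(struct state));
        }

        /* NFA for a single alphabet symbol a:  start --a--> final */
        static struct frag sym(int a) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = a; f.start->out1 = f.final;
            return f;
        }

        /* rs: the final state of r reaches the start of s by an epsilon edge */
        static struct frag cat(struct frag r, struct frag s) {
            r.final->c1 = EPS; r.final->out1 = s.start;
            return (struct frag){ r.start, s.final };
        }

        /* r|s: fresh start/final states with epsilon edges around both branches */
        static struct frag alt(struct frag r, struct frag s) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = EPS; f.start->out1 = r.start;
            f.start->c2 = EPS; f.start->out2 = s.start;
            r.final->c1 = EPS; r.final->out1 = f.final;
            s.final->c1 = EPS; s.final->out1 = f.final;
            return f;
        }

        /* r*: epsilon edges allow zero or more passes through r */
        static struct frag star(struct frag r) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = EPS; f.start->out1 = r.start;
            f.start->c2 = EPS; f.start->out2 = f.final;
            r.final->c1 = EPS; r.final->out1 = r.start;
            r.final->c2 = EPS; r.final->out2 = f.final;
            return f;
        }

        int main(void) {
            /* build the NFA for (a|b)*a bottom-up, as in the example above */
            struct frag n = cat(star(alt(sym('a'), sym('b'))), sym('a'));
            printf("built NFA: start=%p final=%p\n", (void *)n.start, (void *)n.final);
            return 0;
        }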

    Combining the NFAs of a Set of Regular Expressions

    Deterministic Finite Automata

    • A deterministic finite automaton is a special case of an NFA

      – No state has an ε-transition

      – For each state s and input symbol a there is at most one edge labeled a leaving s


    • Each entry in the transition table is a single state

    – At most one path exists to accept a string

    – Simulation algorithm is simple

    Example DFA

    A DFA that accepts (a|b)*abb
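    A table-driven C sketch of simulating this DFA (the state numbering is an assumption; state 3 is the accepting state):

        #include <stdio.h>

        /* nxt[state][input], where input 0 means 'a' and input 1 means 'b' */
        static const int nxt[4][2] = {
            {1, 0},   /* state 0 (start) */
            {1, 2},   /* state 1 */
            {1, 3},   /* state 2 */
            {1, 0},   /* state 3 (accepting) */
        };

        static int accepts(const char *s) {
            int state = 0;
            for (; *s; s++) {
                if (*s != 'a' && *s != 'b') return 0;  /* not in the alphabet */
                state = nxt[state][*s == 'b'];
            }
            return state == 3;
        }

        int main(void) {
            printf("%d %d\n", accepts("abababb"), accepts("abab"));  /* 1 0 */
            return 0;
        }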

    Conversion of an NFA into a DFA

    • The subset construction algorithm converts an NFA into a DFA using:

      ε-closure(s) = {s} ∪ { t | s can reach t by ε-transitions alone }

      ε-closure(T) = the union of ε-closure(s) for all s ∈ T

      move(T,a) = { t | s →a t for some s ∈ T }

    • The algorithm produces:

      Dstates, the set of states of the new DFA, consisting of sets of states of the NFA

      Dtran, the transition table of the new DFA

    ε-closure and move Examples

    ε-closure({0}) = {0,1,3,7}

    move({0,1,3,7},a) = {2,4,7}

    ε-closure({2,4,7}) = {2,4,7}

    move({2,4,7},a) = {7}

    ε-closure({7}) = {7}

    move({7},b) = {8}

    ε-closure({8}) = {8}

    move({8},a) = ∅
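    Both operations are easy to code with bitmasks. The sketch below is an illustration under assumptions (NFA states 0..10 for the Thompson NFA of (a|b)*abb, with state 10 accepting — a different NFA from the one behind the numbers above); it computes ε-closure as a fixed point and move as a union of per-state transitions:

        #include <stdio.h>

        #define NSTATES 11
        typedef unsigned set;                 /* bit i set <=> NFA state i in the set */

        /* eps[i] = states reachable from i by one epsilon edge */
        static const set eps[NSTATES] = {
            [0] = 1u<<1 | 1u<<7, [1] = 1u<<2 | 1u<<4, [3] = 1u<<6,
            [5] = 1u<<6,         [6] = 1u<<1 | 1u<<7,
        };
        /* on[i][c]: labeled transitions, c 0 = 'a', c 1 = 'b' */
        static const set on[NSTATES][2] = {
            [2] = {1u<<3, 0}, [4] = {0, 1u<<5},
            [7] = {1u<<8, 0}, [8] = {0, 1u<<9}, [9] = {0, 1u<<10},
        };

        static set eps_closure(set T) {
            set prev = 0;
            while (T != prev) {               /* grow until a fixed point */
                prev = T;
                for (int i = 0; i < NSTATES; i++)
                    if (T & (1u << i)) T |= eps[i];
            }
            return T;
        }

        static set move(set T, int c) {
            set r = 0;
            for (int i = 0; i < NSTATES; i++)
                if (T & (1u << i)) r |= on[i][c];
            return r;
        }

        int main(void) {
            set A = eps_closure(1u << 0);     /* the DFA start state */
            set B = eps_closure(move(A, 0));  /* Dtran[A, a] */
            set D = eps_closure(move(eps_closure(move(B, 1)), 1)); /* then b, b */
            printf("A=%#x B=%#x accepting=%d\n", A, B, (int)((D >> 10) & 1));
            return 0;
        }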


    Subset Construction Example 1

    Subset Construction Example 2


    Minimizing the number of states of a DFA

    Hopcroft’s Algorithm

    • Input: A DFA M with set of states S, set of inputs ∑, transition function δ, start state s0, and set of accepting states F.

    • Output: A DFA M' accepting the same language as M and having as few states as possible.

    • Method:

    • Step 1: Construct an initial partition P of the states with two groups: the accepting states (F) and the non-accepting states (S-F).

    • Step 2: Apply the following procedure (construction of Pnew) to construct a new partition Pnew.

    Procedure for Pnew construction

    • For each group G of P, partition G into subgroups such that two states s and t are in the same subgroup if and only if, for all input symbols a, states s and t have transitions on a to states in the same group of P.

    • Replace G in Pnew by the set of all subgroups formed.

    • Step 3: If Pnew = P, proceed to step 4. Otherwise repeat step 2 with P = Pnew.

    • Step 4: Choose one state in each group as the representative of that group; these representatives are the states of M'.

    • Step 5: If M' has a dead state or an unreachable state, remove those states. (A dead state is a non-accepting state that has transitions to itself on all inputs. An unreachable state is any state not reachable from the start state.)

    • Step 6: Complete.


    Example:

    • The DFA for (a|b)*abb

    Applying Minimization

    Lexical Errors

    Lexical errors are the errors thrown by your lexer when it is unable to continue, which means that there is no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, are thrown by your parser when a given set of already recognised valid tokens does not match any of the right sides of your grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

    Error-recovery actions are:

    i. Delete one character from the remaining input.

    ii. Insert a missing character into the remaining input.

    iii. Replace a character by another character.

    iv. Transpose two adjacent characters.

    Definition Of Context Free Grammar (CFG)

    A CFG contains terminals, non-terminals, a start symbol, and productions.

    Terminals are the basic symbols from which strings are formed.

    Non-terminals are syntactic variables that denote sets of strings.

    In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar.

    The productions of the grammar specify the manner in which the terminals and non-terminals can be combined to form strings.

    Each production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals and terminals.

    Definition of Symbol Table

    A symbol table is an extensible array of records: each identifier is stored with an associated record containing collected information about that identifier.

    FUNCTION identify (identifier name)

    RETURNING a pointer to identifier information, which contains:

    The actual string

    A macro definition

    A keyword definition

    A list of type, variable & function definitions

    A list of structure and union name definitions

    A list of structure and union field selector definitions
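    A minimal C sketch of such a table (the record layout and growth policy are assumptions for illustration): an extensible array of records plus an identify() that looks an identifier up, inserting it on first sight:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct id_info {
            char *name;      /* the actual string */
            int   kind;      /* e.g. variable, function, keyword, ... */
        };

        static struct id_info *table;         /* the extensible array */
        static size_t used, cap;

        /* Return the record for name, inserting a new one on first sight. */
        static struct id_info *identify(const char *name) {
            for (size_t i = 0; i < used; i++)
                if (strcmp(table[i].name, name) == 0)
                    return &table[i];
            if (used == cap) {                /* grow the array when full */
                cap = cap ? 2 * cap : 8;
                table = realloc(table, cap * sizeof *table);
            }
            table[used].name = malloc(strlen(name) + 1);
            strcpy(table[used].name, name);
            table[used].kind = 0;
            return &table[used++];
        }

        int main(void) {
            identify("x")->kind = 1;          /* first sight: inserted */
            printf("kind=%d same=%d\n", identify("x")->kind,
                   identify("x") == identify("x"));
            return 0;
        }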


    A language for specifying lexical analyzer

    Lex specifications

    A Lex program (the .l file ) consists of three parts:

    declarations

    %%

    translation rules

    %%

    auxiliary procedures

    1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.

    2. The translation rules of a Lex program are statements of the form :

    p1 {action 1}

    p2 {action 2}

    p3 {action 3}

    … …

    … …

    where each p is a regular expression and each action is a program fragment describing

    what action the lexical analyzer should take when a pattern p matches a lexeme. In

    Lex the actions are written in C.

    3. The third section holds whatever auxiliary procedures are needed by the actions.

    Alternatively these procedures can be compiled separately and loaded with the lexical

    analyzer.
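    Putting the three parts together, a minimal Lex specification in this layout might look as follows (a sketch: the token codes ID and NUM are assumed for illustration; the actions are ordinary C, as described above):

        %{
        /* declarations section: a literal C block plus assumed token codes */
        #include <stdlib.h>
        #define ID  1
        #define NUM 2
        int yylval;                     /* holds the value of a NUM token */
        %}
        letter  [A-Za-z]
        digit   [0-9]
        %%
        {letter}({letter}|{digit})*   { return ID; }
        {digit}+                      { yylval = atoi(yytext); return NUM; }
        [ \t\n]                       { /* skip white space */ }
        %%
        int yywrap(void) { return 1; }  /* auxiliary procedure: stop at EOF */

    The generated yylex() returns ID or NUM each time it matches a lexeme, and can be linked with a parser that consumes these token codes.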

    Input Buffering

    The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

    Buffering techniques:

    1. Buffer pairs

    2. Sentinels


    The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A look-ahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.

    The distance which the look-ahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see

    DECLARE (ARG1, ARG2, …, ARGn)

    without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the look-ahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure is of limited size, there is an implied constraint on how much look-ahead can be used before the next token is discovered. In the above example, if the look-ahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of look-ahead is limited.
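    A C sketch of the buffer-pair scheme with sentinels (the buffer size, names, and the choice of '\0' as the sentinel are assumptions; the scheme also assumes the sentinel byte does not occur inside the source text):

        #include <stdio.h>

        #define HALF 4096
        #define SENTINEL '\0'               /* marks the end of a buffer half */

        static char buf[2 * HALF + 2];      /* two halves, each ending in a sentinel */
        static char *forward = buf;         /* the look-ahead pointer */
        static FILE *src;

        static void fill(char *half) {
            size_t n = fread(half, 1, HALF, src);
            half[n] = SENTINEL;             /* sentinel right after the data */
        }

        /* Advance the look-ahead pointer; reload a half when its sentinel is hit.
           The common case costs a single comparison per character. */
        static int next_char(void) {
            char c = *forward++;
            if (c != SENTINEL) return (unsigned char)c;
            if (forward == buf + HALF + 1) {        /* end of the first half */
                fill(buf + HALF + 1);
                return next_char();
            }
            if (forward == buf + 2 * HALF + 2) {    /* end of the second half */
                fill(buf);                          /* wrap around */
                forward = buf;
                return next_char();
            }
            return EOF;                     /* sentinel inside data: real end */
        }

        int main(void) {
            src = stdin;
            fill(buf);                      /* prime the first half */
            long count = 0;
            while (next_char() != EOF) count++;
            printf("%ld characters\n", count);
            return 0;
        }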


    Implementing a lexical analyzer with Lex

    Lex is a popular scanner (lexical analyzer) generator
    o Developed by M.E. Lesk and E. Schmidt of AT&T Bell Labs
    o Other versions of Lex exist, most notably flex (for Fast Lex)

    Input to Lex is called a Lex specification or Lex program
    o Lex generates a scanner module in C from a Lex specification file
    o The scanner module can be compiled and linked with other C/C++ modules

    Commands:
    o lex filename.l
    o cc -c lex.yy.c
    o cc lex.yy.o other.o -o scan
    o scan infile outfile

    Lex Specification

    A Lex specification file consists of three sections:

    definition section

    %%

    rules section

    %%

    auxiliary functions

    – The definition section contains a literal block and regular definitions.

    – The literal block is C code delimited by %{ and %}; it contains variable declarations and function prototypes.

    – A regular definition gives a name to a regular expression.

    – A regular definition has the form: name expression

    – A regular definition can be used by writing its name in braces: {name}

    – The rules section contains regular expressions and C code; it has the form:

      r1 action1
      r2 action2
      . . .
      rn actionn

      where each ri is a regular expression and each actioni is a C code fragment. When ri matches an input string, actioni is executed. An action should be enclosed in { } if it consists of more than one statement.

    Lex Operators

    \ C escape sequence

    \n is newline, \t is tab, \\ is backslash, \" is double quote, etc.

    * Matches zero or more of the preceding expression: x* matches ε, x, xx, ...

    + Matches one or more of the preceding expression:

    (ab)+ matches ab, abab, ababab, ...

    ? Matches zero or one occurrence of the preceding expression:

    (ab)? matches ε or ab

    | Matches the preceding or the subsequent expression: a|b matches a or b

    ( ) Used for grouping sub-expressions in a regular expression

    [ ] Matches any one of the characters within brackets

    [xyz] means (x|y|z)

    A range of characters is indicated with the dash operator (–)

    [0-9] matches any decimal digit, [A-Za-z] matches any letter

    If first character after [ is ^, it complements the character class

    [^A-Za-z] matches all characters which are NOT letters

    Meta-characters other than \ lose their meaning inside [ ]

    . Matches any single character except the newline character

    " " Matches everything within the quotation marks literally

    "x*" matches exactly x*

    Meta-characters, other than \ , lose their meaning inside " "

    C escape sequences retain their meaning inside " "


    { } {name} refers to a regular definition from the first section

    [A-Z]{3} matches strings of exactly 3 capital letters

    [A-Z]{1,3} matches strings of 1, 2, or 3 capital letters

    / The lookahead operator

    matches the left expression but only if followed by the right expression

    0/1 matches 0 in 01, but not in 02

    Only one slash is permitted per regular expression

    ^ As first character of a regular expression, ^ matches beginning of a line

    $ As last character of a regular expression, $ matches end of a line

    Same as /\n

    The scanner generator as Swiss Army Knife

    The scanner generator is the "Swiss Army knife" among compiler-writing tools.

    Features of the Swiss Army Knife approach:

    Subject it to serious scrutiny

    Strive for simplicity

    Reusable components should be a design goal

    Avoid futurities

    Avoid digressions

    Avoid quantum leaps


    UNIT II- THE SYNTACTIC SPECIFICATION OF THE PROGRAMMING

    LANGUAGES

    Programming Language Definition

    Appearance of a programming language:

    Vocabulary : Regular expressions
    Syntax : Backus-Naur Form (BNF) or Context-Free Grammar (CFG)
    Semantics : Informal language or some examples

    The Syntax and Semantics of Programming Language

    A programming language must include the specification of syntax (structure) and semantics (meaning).

    Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs)

    Ex. a = b + c is syntactically legal b + c = a is illegal

    The semantics of a programming language are commonly divided into two classes:

    Static semantics

    Semantic rules that can be checked at compile time.

    Ex. The type and number of a function’s arguments

    Runtime semantics

    Semantic rules that can be checked only at run time

    The Difference Between Syntax And Semantic

    • Syntax is the way in which we construct sentences by following principles and rules.

    • Semantics is the interpretation of, and the meaning derived from, the sentence: the transmission and understanding of the message; in other words, whether the logical sentences make sense or not.


    Syntax Definition

    To specify the syntax of a language: CFG and BNF
    o Example: the if-else statement in C has the form
      statement → if ( expression ) statement else statement

    An alphabet of a language is a set of symbols.
    o Examples: {0,1} for a binary number system (language) = {0, 1, 100, 101, ...}
    o {a,b,c} for a language = {a, b, c, ac, abcc, ...}
    o {if, (, ), else, ...} for if-statements = {if (a==1) goto 10, ...}

    A string over an alphabet
    o is a sequence of zero or more symbols from the alphabet.
    o Examples: 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
    o The null string is a string which does not have any symbols of the alphabet.

    Language
    o is a subset of all the strings over a given alphabet.
    o Alphabets Ai and languages Li for Ai:
      A0 = {0,1}            L0 = {0, 1, 100, 101, ...}
      A1 = {a,b,c}          L1 = {a, b, c, ac, abcc, ...}
      A2 = {all C tokens}   L2 = {all sentences of C programs}

    Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
    o Language of expressions L = {9-5+2, 3-1, ...}
    o The productions of the grammar for this language L are:
      list → list + digit
      list → list - digit
      list → digit
      digit → 0|1|2|3|4|5|6|7|8|9
    o list, digit : grammar variables (grammar symbols)
    o 0,1,2,3,4,5,6,7,8,9,-,+ : tokens (terminal symbols)

    Conventions for specifying a grammar
    o Terminal symbols: boldface strings, e.g. if, num, id
    o Nonterminal symbols (grammar variables): italicized names, e.g. list, digit, A, B

    Grammar G = (N, T, P, S)
    o N : a set of nonterminal symbols
    o T : a set of terminal symbols (tokens)
    o P : a set of production rules
    o S : a start symbol, S ∈ N

    Grammar G for the language L = {9-5+2, 3-1, ...}
    o G = (N, T, P, S)
    o N = {list, digit}
    o T = {0,1,2,3,4,5,6,7,8,9,-,+}
    o P : list → list + digit
          list → list - digit
          list → digit
          digit → 0|1|2|3|4|5|6|7|8|9
    o S = list

    Some definitions for a language L and its grammar G
    o Derivation: a sequence of replacements S ⇒ α1 ⇒ α2 ⇒ … ⇒ αn is a derivation of αn.
    o Example: a derivation of 1+9 from the grammar G
      left-most derivation: list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
      right-most derivation: list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9

    Language of a grammar, L(G)
    o L(G) is the set of sentences that can be generated from the grammar G.
    o L(G) = { x | S ⇒* x }, where x is a sequence of terminal symbols

    Example: Consider a grammar G = (N,T,P,S):
    o N = {S}, T = {a,b}, S = S, P = { S → aSb | ε }
    o Is aabb a sentence of L(G)? (derivation of the string aabb)
    o S ⇒ aSb ⇒ aaSbb ⇒ aaεbb ⇒ aabb (i.e., S ⇒* aabb), so aabb ∈ L(G)
    o There is no derivation for aa, so aa ∉ L(G)
    o Note L(G) = { a^n b^n | n ≥ 0 }, where a^n b^n means n a's followed by n b's.
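    A tiny C sketch of this grammar in action (hypothetical, for illustration): a recursive-descent recognizer for L(G) = { a^n b^n | n ≥ 0 }, with one function for the non-terminal S mirroring S → aSb | ε:

        #include <stdio.h>

        static const char *p;            /* cursor into the input string */

        static int S(void) {
            if (*p == 'a') {             /* try S -> a S b */
                p++;
                if (!S()) return 0;
                if (*p != 'b') return 0;
                p++;
                return 1;
            }
            return 1;                    /* otherwise S -> epsilon */
        }

        static int accepts(const char *s) {
            p = s;
            return S() && *p == '\0';    /* the whole input must be consumed */
        }

        int main(void) {
            /* aabb and ab are in L(G); aab is not */
            printf("%d %d %d\n", accepts("aabb"), accepts("ab"), accepts("aab"));
            return 0;
        }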

    Syntax Analysis

    • Syntax Analysis is also called Parsing or Hierarchical Analysis.
    • A parser implements the grammar of the language, be it C, C++, etc.
    • The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language.
    • The grammar that a parser implements is called a Context-Free Grammar, or CFG.

    The Syntactic Specification of Programming Language

    Program Aspects

    Syntax: what valid programs look like.
    Semantics: what valid programs mean; what they should compute.
    A compiler must contain both kinds of information.

    A programming language must include the specification of syntax (structure) and semantics (meaning).

    Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs).

    Ex. a = b + c is syntactically legal; b + c = a is illegal.

    The semantics of a programming language are commonly divided into two classes:

    Static semantics: semantic rules that can be checked at compile time. Ex. the type and number of a function's arguments.

    Runtime semantics: semantic rules that can be checked only at run time.

    Basics of Syntax Analysis

    Syntax analysis or parsing is the second phase of a compiler.

    Syntax analyzer creates the syntactic structure of the given source program.

    This syntactic structure is mostly a parse tree.

    The syntax analyzer or parser checks whether a given source program satisfies the rules

    implied by a context-free grammar or not.

    o If it satisfies, the parser creates the parse tree of that program.

    o Otherwise the parser gives the error message.

    A context free grammar

    o Gives a precise syntactic specification of a programming language.

    o The design of the grammar is an initial phase of the design of a compiler

    o A grammar can be directly converted into a parser by some tools.

    Syntax analysis is done by the parser. o Detects whether the program is written following the grammar rules and reports

    syntax errors.

    o Produces a parse tree from which intermediate code can be generated.

    Limitations of Syntax Analyzers

    Syntax analyzers or parsers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks:

    they cannot determine if a token is valid,

    they cannot determine if a token is declared before it is being used,

    they cannot determine if a token is initialized before it is being used,

    they cannot determine if an operation performed on a token type is valid or not.

    These tasks are accomplished by the semantic analyzer, and are defined under Semantic Analysis.


    Capability of Context Free Grammar

    Context Free Grammar (CFG)

    A lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence, due to the limitations of regular expressions: regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses a context-free grammar (CFG), which is recognized by pushdown automata. A CFG, on the other hand, is a superset of regular grammar. This implies that every regular grammar is also context-free, but there exist some problems that are beyond the scope of regular grammar. CFG is a helpful tool in describing the syntax of programming languages.

    • The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)).

    – Similar to the languages specified by regular expressions, but more general.
    – A grammar gives a precise syntactic specification of a language.
    – From some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.
    – A compiler based on a grammatical description of a language is more easily maintained and updated.

    A context-free grammar has four components:

    1. A set of non-terminals V. Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated

    by the grammar.

    2. A set of tokens, known as terminal symbols Σ. Terminals are the basic symbols from which strings are formed.

    3. A set of productions P. The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of

    a non-terminal called the left side of the production, an arrow, and a sequence of tokens

    and/or non-terminals, called the right side of the production.

    4. One of the non-terminals is designated as the start symbol S; from where the production begins.


    The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.

    • A grammar G = (N, T, P, S)
      o N is a finite set of non-terminal symbols
      o T is a finite set of terminal symbols
      o P is a finite subset of (N ∪ T)* N (N ∪ T)* × (N ∪ T)*
        • An element (α, β) ∈ P is written as α → β
      o S is a distinguished symbol in N and is called the start symbol.
      o Inherently recursive structures of a programming language are defined by a context-free grammar.
      o In a context-free grammar, we have:
        • A finite set of terminals (in our case, this will be the set of tokens)
        • A finite set of non-terminals (syntactic variables)
        • A finite set of production rules of the following form:
          A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
        • A start symbol (one of the non-terminal symbols)
      o L(G) is the language of G (the language generated by G), which is a set of sentences.
      o A sentence of L(G) is a string of terminal symbols of G.
      o If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of terminals of G.
      o If G is a context-free grammar, L(G) is a context-free language.

    • Language defined by a grammar
      o "αAβ derives αγβ in one step", denoted αAβ ⇒ αγβ, if A → γ is a production and α and β are arbitrary strings of terminal or non-terminal symbols.
      o α1 derives αm if α1 ⇒ α2 ⇒ … ⇒ αm, written α1 ⇒* αm.

    The language L(G) defined by G is the set of strings of terminals w such that S ⇒* w.

    Example


    The palindrome language cannot be described by means of a regular expression; that is, L = { w | w = wR } is not a regular language. But it can be described by means of a CFG, as illustrated below:

    G = ( V, Σ, P, S )

    Where:

    V = { Q, Z, N }

    Σ = { 0, 1 }

    P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ℇ | Z → 0Q0 | N → 1Q1 }

    S = { Q }

    This grammar describes the palindrome language, with strings such as: 1001, 11100111, 00100, 1010101, 11111, …

    Chomsky Hierarchy (classification of grammars)

    i) A grammar is said to be regular if it is

    • right-linear, where each production in P has the form A → wB or A → w, with A and B non-terminals and w a string of terminals,

    • or left-linear.

    ii) Context-free, if each production in P is of the form A → α, where 𝐴𝜖𝑁 and 𝛼 ∈ (𝑁 ∪ 𝑇)∗

    iii) Context-sensitive, if each production in P is of the form α → β where |α| ≤ |β|

    iv) Unrestricted, if each production in P is of the form α → β where α ≠ ɛ

    • Context-free grammar is sufficient to describe most programming languages. • Example: a grammar for arithmetic expressions.

    ->

    -> ( )

    -> -

    -> id

    -> + | - | * | /

    derive -(id) from the grammar:

    => - => - () =>-(id)

    sentence: a strings of terminals that can be derived from S

    sentential form: a strings of terminals or none terminals that can be derived from S.

    derive id + id * id from the grammar: E=>E+E=>E+E*E=>E+E*id=>E+id*id=>id+id*id

    leftmost/rightmost derivation -- each step replaces leftmost/rightmost non-terminal. E=>E+E=>id+E=>id+E*E=>id+id*E=>id+id*id


    Derivation and Parse Trees

    Derivations

    Starting with start symbol

    At each step: a nonterminal replaced with the body of a production

    A derivation is basically a sequence of production rule applications, in order to get the input string. During parsing, we make two decisions for some sentential form of the input:

    o Deciding which non-terminal is to be replaced.
    o Deciding the production rule by which the non-terminal will be replaced.

    To decide which non-terminal to replace with a production rule, we have two options.

    Left-most Derivation

    If the sentential form of an input is scanned and replaced from left to right, it is called a left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

    Right-most Derivation

    If we scan and replace the input with production rules from right to left, it is known as a right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.

    Example

    Production rules:

    E → E + E
    E → E * E
    E → id

    Input string: id + id * id

    The left-most derivation is:

    E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

    Notice that the left-most non-terminal is always processed first.

    The right-most derivation is:

    E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id

    Definition : Derivation

    o In general a derivation step is αAβ ⇒ αγβ if there is a production rule A→γ in our grammar where α and β are arbitrary strings of terminal and non-terminal

    symbols.


    o α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
    o At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.

    o If we always choose the left-most non-terminal in each derivation step, this derivation is called as left-most derivation.

    Example:

    E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

    o If we always choose the right-most non-terminal in each derivation step, this derivation is called as right-most derivation.

    Example:

    E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

    o We will see that the top-down parsers try to find the left-most derivation of the given source program.

    o We will see that the bottom-up parsers try to find the right-most derivation of the given source program in the reverse order.

    More on Derivations

    Example


    Parse Tree

    A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the start symbol.

    A parse tree pictorially shows how the start symbol of a grammar derives a specific string in the language.

    It filters out the order in which non-terminals are replaced; there is a many-to-one relationship between derivations and parse trees.

    Given a context-free grammar, a parse tree has the following properties:
    o The root is labeled by the start symbol.
    o Each leaf is labeled by a token or the empty string.
    o Each interior node is labeled by a non-terminal.
    o If A is the non-terminal labeling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then A → X1X2…Xn is a production of the grammar.

    Example 1: Construct the Parse Tree for –(id+id)

    The left-most derivation for –(id+id):

    E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)


    Parse Tree for –(id+id)

    Example 2: Construct a Parse Tree for id+id*id

    The left-most derivation for id+id*id:

    E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

    Parse Tree for id+id*id


    Ambiguity

    A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

    For most parsers, the grammar must be unambiguous.

    An unambiguous grammar gives a unique selection of the parse tree for a sentence.

    We should eliminate the ambiguity in the grammar during the design phase of the compiler: an unambiguous grammar should be written to eliminate the ambiguity.

    We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) to disambiguate that grammar, restricting it to this choice.

    Ambiguous grammars (ambiguous because of their operators) can be disambiguated according to precedence and associativity rules.

    Example Production Rules:

    E → E + E
    E → E * E
    E → id

    For the string id+id*id, the above grammar produces two parse trees.

    Two parse trees for id+id*id for the above grammar

    A language is said to be inherently ambiguous if every grammar for it is ambiguous.

    Ambiguity in a grammar is not good for compiler construction.

    No method can detect and remove ambiguity automatically, but it can be removed by either

    i) re-writing the whole grammar without ambiguity, or ii) by setting and following associativity and precedence constraints.

    Associativity

    If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand will be taken by the left operator; if the operation is right-associative, the right operator will take the operand.

    i) Left Associative: operations such as Addition, Multiplication, Subtraction, and Division are left-associative. If the expression contains id op id op id, it will be evaluated as (id op id) op id. For example, id + id + id is evaluated as (id + id) + id.

    ii) Right Associative: operations like Exponentiation are right-associative, i.e., the order of evaluation in the same expression will be id op (id op id). For example, id ^ id ^ id is evaluated as id ^ (id ^ id).


    Precedence

    o If two different operators share a common operand, the precedence of the operators decides which will take the operand.
    o Use the precedence of operators as follows:

      ^ (right to left)
      * (left to right)
      + (left to right)

    Both the Associativity and Precedence decrease the chances of ambiguity in a language or its grammar.

    Example To disambiguate the grammar E → E+E | E*E | E^E | id | (E), use precedence of

    operators as follows:

    ^ (right to left)

    * (left to right)

    + (left to right)

    We get the following unambiguous grammar:

    E → E+T | T

    T → T*F | F

    F → G^F | G

    G → id | (E)

    Left Recursion

    o A grammar becomes left-recursive if it has any non-terminal 'A' whose derivation contains 'A' itself as the left-most symbol.
    o A left-recursive grammar is considered to be a problematic situation for top-down parsers.
    o Top-down parsers start parsing from the start symbol, which in itself is a non-terminal.
    o So, when the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the left non-terminal, and it goes into an infinite loop.
    o A grammar is left-recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
    o Top-down parsing techniques cannot handle left-recursive grammars.
    o So, we have to convert our left-recursive grammar into an equivalent grammar which is not left-recursive.
    o The left-recursion may appear in a single step of the derivation (immediate left-recursion), or may appear in more than one step of the derivation.

    Example:

    (1) A => Aα | β

    (2) S => Aα | β

    A => Sd


    (1) is an example of immediate left recursion, where A is a non-terminal symbol and α represents a string of terminals and non-terminals.

    (2) is an example of indirect left recursion.

    A top-down parser will first parse A, which in turn will yield a string consisting of A itself, and the parser may go into a loop forever.

    Immediate Left Recursion and Its Elimination

    A → Aα | β (where β does not start with A)

    ⇓ eliminate immediate left recursion

    A → β A'
    A' → α A' | ε (an equivalent grammar)

    In general,

    A → Aα1 | ... | Aαm | β1 | ... | βn (where β1 ... βn do not start with A)

    ⇓ eliminate immediate left recursion

    A → β1 A' | ... | βn A'
    A' → α1 A' | ... | αm A' | ε (an equivalent grammar)


    Example:

    E → E+T | T

    T → T*F | F

    F → id | (E)

    ⇓ Eliminate immediate left recursion

    E → T E’

    E’ → +T E’ | ε

    T → F T’

    T’ → *F T’ | ε

    F → id | (E)
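    Because this grammar is no longer left-recursive, a top-down parser can be written directly from it. Here is a C sketch of a recursive-descent (predictive) parser for the grammar above, one function per non-terminal ('i' stands for the token id; the names are assumptions for illustration):

        #include <stdio.h>

        static const char *p;            /* next input token */
        static int ok;

        static void E(void);
        static void match(char t) { if (*p == t) p++; else ok = 0; }

        static void F(void) {                       /* F -> id | ( E ) */
            if (*p == '(') { match('('); E(); match(')'); }
            else match('i');
        }
        static void Tp(void) {                      /* T' -> *F T' | epsilon */
            if (*p == '*') { match('*'); F(); Tp(); }
        }
        static void T(void)  { F(); Tp(); }         /* T -> F T' */
        static void Ep(void) {                      /* E' -> +T E' | epsilon */
            if (*p == '+') { match('+'); T(); Ep(); }
        }
        static void E(void)  { T(); Ep(); }         /* E -> T E' */

        static int parse(const char *s) { p = s; ok = 1; E(); return ok && *p == '\0'; }

        int main(void) {
            printf("%d %d\n", parse("i+i*i"), parse("i+*i"));   /* 1 0 */
            return 0;
        }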

    A grammar may contain no immediate left recursion and still be left-recursive.

    By just eliminating the immediate left-recursion, we may not get a grammar which is not left-recursive.

    Example:

    S → Aa | b

    A → Sc | d

    This grammar is not immediately left-recursive, but it is still left-recursive.

    S ⇒ Aa ⇒ Sca, or

    A ⇒ Sc ⇒ Aac, causes a left-recursion. So, we have to eliminate all left-recursions from the grammar.


    Elimination of all left recursion

    Arrange non-terminals in some order: A1 ... An

    for i from 1 to n do

    {

    for j from 1 to i-1 do

    {

    replace each production

    Ai → Aj γ

    by

    Ai → α1 γ | ... | αk γ

    where Aj → α1 | ... | αk

    }

    Eliminate immediate left-recursions among Ai productions

    }

    Example:

    S → Aa | b

    A → Ac | Sd | f

    Case 1: Order of non-terminals: S, A

    for S:

    We do not enter the inner loop; there is no immediate left recursion in S.

    for A:

    Replace A → Sd with A → Aad | bd.
    So, we will have A → Ac | Aad | bd | f.
    Eliminate the immediate left recursion in A:

    A → bdA’ | fA’

    A’ → cA’ | adA’ | ε

    So, the resulting equivalent grammar which is not left-recursive is:

    S → Aa | b

    A → bdA’ | fA’

    A’ → cA’ | adA’ | ε

    Case 2: Order of non-terminals: A, S

    for A:

    We do not enter the inner loop. Eliminate the immediate left recursion in A:

    A → SdA’ | fA’

    A’ → cA’ | ε

    for S:

    Replace S → Aa with S → SdA’a | fA’a


    So, we will have S → SdA'a | fA'a | b.
    Eliminate the immediate left recursion in S:

    S → fA’aS’ | bS’

    S’ → dA’aS’ | ε

    So, the resulting equivalent grammar which is not left-recursive is:

    S → fA’aS’ | bS’

    S’ → dA’aS’ | ε

    A → SdA’ | fA’

    A’ → cA’ | ε
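    The full algorithm can be sketched in Python as well (our own code, not from the text, reusing eliminate_immediate_left_recursion from the sketch above):

    def eliminate_left_recursion(grammar, order):
        # `order` lists the non-terminals A1 ... An in the chosen order
        for i, Ai in enumerate(order):
            for Aj in order[:i]:
                # replace each Ai -> Aj g by Ai -> a1 g | ... | ak g,
                # where Aj -> a1 | ... | ak
                new_alts = []
                for alt in grammar[Ai]:
                    if alt and alt[0] == Aj:
                        new_alts += [alpha + alt[1:] for alpha in grammar[Aj]]
                    else:
                        new_alts.append(alt)
                grammar[Ai] = new_alts
            eliminate_immediate_left_recursion(grammar, Ai)

    g = {'S': [['A', 'a'], ['b']],
         'A': [['A', 'c'], ['S', 'd'], ['f']]}
    eliminate_left_recursion(g, ['S', 'A'])      # Case 1 above
    # g now holds S -> Aa | b ; A -> bdA' | fA' ; A' -> cA' | adA' | ε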

    Left Factoring

    If two or more production rules of a grammar have a common prefix string, then a top-down parser cannot make a choice as to which production it should take to parse the string in hand: both productions start with the same terminal or non-terminal. To remove this confusion, we use a technique called left factoring.

    Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.

    A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.

    Example: consider the grammar

      stmt → if expr then stmt else stmt
           | if expr then stmt

    When we see if, we cannot know which production rule to choose to re-write stmt in the derivation; we need a new equivalent grammar suitable for predictive parsing.

    In general,

      A → αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.

    When processing α, we cannot know whether to expand A to αβ1 or to αβ2.

    But if we re-write the grammar as follows:

      A → αA'
      A' → β1 | β2

    we can immediately expand A to αA'.

    Algorithm: For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, say

      A → αβ1 | ... | αβn | γ1 | ... | γm

    convert it into

      A → αA' | γ1 | ... | γm
      A' → β1 | ... | βn

    Example 1:

    A → abB | aB | cdg | cdeB | cdfB


    ⇓ A → aA’ | cdg | cdeB | cdfB

    A’ → bB | B

    ⇓ A → aA’ | cdA’’

    A’ → bB | B

    A’’ → g | eB | fB

    Example 2:

    A → ad | a | ab | abc | b

    ⇓ A → aA’ | b

    A’ → d | ε | b | bc

    ⇓ A → aA’ | b

    A’ → d | ε | bA’’

    A’’ → ε | c
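    A left-factoring sketch in the same style (our own code, not from the text; it groups alternatives by their first symbol and factors out the longest common prefix, exactly as in the examples above):

    def common_prefix(alts):
        # longest sequence of symbols shared by every alternative in `alts`
        prefix = []
        for column in zip(*alts):
            if all(sym == column[0] for sym in column):
                prefix.append(column[0])
            else:
                break
        return prefix

    def left_factor(grammar, A):
        # Repeatedly factor the alternatives that share a common first symbol,
        # introducing fresh non-terminals A', A'', ... as in the examples above.
        ticks = 0
        work = [A]
        while work:
            head = work.pop()
            groups = {}
            for alt in grammar[head]:
                groups.setdefault(alt[0] if alt else None, []).append(alt)
            new_alts = []
            for first, alts in groups.items():
                if first is None or len(alts) == 1:
                    new_alts.extend(alts)           # nothing to factor here
                    continue
                alpha = common_prefix(alts)         # the common prefix to pull out
                ticks += 1
                fresh = A + "'" * ticks
                grammar[fresh] = [alt[len(alpha):] for alt in alts]   # may hold ε = []
                new_alts.append(alpha + [fresh])
                work.append(fresh)                  # the fresh non-terminal may need factoring too
            grammar[head] = new_alts

    g = {'A': [['a','b','B'], ['a','B'], ['c','d','g'], ['c','d','e','B'], ['c','d','f','B']]}
    left_factor(g, 'A')
    # g now holds A -> aA' | cdA'' ; A' -> bB | B ; A'' -> g | eB | fB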



    Unit III

    Parsers

    Parser

    A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams.

    The parser analyzes the source code token stream against the production rules to detect any errors in the code.

    The output of the phase is a parse tree.

    This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a parse tree as the output of the phase.

    Parsers are expected to parse the whole code even if some errors exist in the program.

    Parsers use error recovering strategies.

    Three general types of parsers

    Universal parsing methods:

    o can parse any grammar

    o too inefficient to use in production compilers

    Top-down methods:

    o Parse-trees built from root to leaves.

    o Input to parser scanned from left to right one symbol at a time

    Bottom-up methods:

    o Start from leaves and work their way up to the root.

    o Input to parser scanned from left to right one symbol at a time


    Given a formal syntax specification (typically as a context-free grammar, CFG), the parser reads tokens and groups them into units as specified by the productions of the CFG being used.

    As syntactic structure is recognized, the parser either calls corresponding semantic

    routines directly or builds a syntax tree.

    CFG ( Context-Free Grammar )

    BNF ( Backus-Naur Form )

    GAA ( Grammar Analysis Algorithms )

    LL, LR, SLR, LALR Parsers

    YACC

    TOP-DOWN PARSING

    • Constructing a parse tree for an input string starting from the root.
    • The parse tree is built in preorder (depth-first).
    • Finding a left-most derivation.
    • At each step of a top-down parse:
      o determine the production to be applied;
      o match terminal symbols in the production body with the input string.

    The parse tree is created top to bottom.

    Top-down parser

    Recursive-Descent Parsing
      o Backtracking is needed (if a choice of a production rule does not work, we backtrack to try other alternatives).
      o It is a general parsing technique, but not widely used.
      o Not efficient.

    Predictive Parsing
      o No backtracking.
      o Efficient.
      o Needs a special form of grammar, i.e. LL(1) grammars.


    o Recursive Predictive Parsing is a special form of Recursive Descent parsing without backtracking.

    o Non-Recursive (Table Driven) Predictive Parser is also known as LL (1) parser.

    Top-Down Parsing

    1. Recursive-Descent Parsing (uses backtracking)
       • Backtracking is needed.
       • It tries to find the left-most derivation.
    2. Predictive Parsing (no backtracking)
       i. Recursive Predictive Parser
       ii. Non-recursive Predictive Parser

    Algorithm of Recursive Descent Parsing

    Example 1:

    If the grammar is S → aBc; B → bc | b and the input is abc:

    The parser first expands S to aBc and matches a. For B it tries the first alternative B → bc: b matches, but the c of B consumes the input's final c, and the trailing c of S → aBc then finds no input left, so this attempt fails. The parser backtracks and tries B → b instead; now a, b and c all match, and the successful parse tree is S(a, B(b), c).
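    A minimal backtracking recursive-descent recognizer for this grammar (our own sketch, not part of the text):

    def parse(tokens):
        # Backtracking recursive-descent recognizer for S -> aBc ; B -> bc | b.
        def match(pos, tok):
            # return the next position if tokens[pos] is tok, else None
            return pos + 1 if pos < len(tokens) and tokens[pos] == tok else None

        def B(pos):
            # yield every position where a parse of B starting at pos can end
            p = match(pos, 'b')
            if p is not None:
                q = match(p, 'c')
                if q is not None:
                    yield q          # alternative B -> bc
                yield p              # alternative B -> b

        def S(pos):
            p = match(pos, 'a')
            if p is None:
                return None
            for q in B(p):           # backtracking point: try each way B can match
                r = match(q, 'c')
                if r is not None:
                    return r         # S -> aBc succeeded
            return None

        return S(0) == len(tokens)

    print(parse(list('abc')))        # True: B -> bc fails on the final c, so the
                                     # parser backtracks and uses B -> b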



    2. Predictive Parser

    When re-writing a non-terminal in a derivation step, a predictive parser can uniquely choose a production rule by just looking at the current symbol in the input string.

    Example:

    stmt → if ...... |

    while ...... |

    begin ...... |

    for .....


    If the current token is if, we have to choose the first production rule: when we are trying to re-write the non-terminal stmt, we can uniquely choose the production rule by just looking at the current token.

    Even after we eliminate the left recursion in a grammar and left-factor it, the result may still not be suitable for predictive parsing (it may not be an LL(1) grammar).

    3. Recursive Predictive Parsing

    Each non-terminal corresponds to a procedure.

    Example:

    A → aBb | bAB

    proc A

    {

    case of the current token

    {

    ‘a’: - match the current token with a, and move to the next token;

    - call ‘B’;

    - match the current token with b, and move to the next token;

    ‘b’: - match the current token with b, and move to the next token;

    - call ‘A’;

    - call ‘B’;

    }

    }

    Applying ε-productions

    A → aA | bB | ε

    If all other productions fail, we should apply an ε-production. For example, if the current token is not a or b, we may apply the ε-production.

    The most correct choice: we should apply an ε-production for a non-terminal A only when the current token is in the FOLLOW set of A (the terminals that can follow A in sentential forms).

    Example:

    A → aBe | cBd | C

    B → bB | ε

    C → f

    proc A

    {

    case of the current token

    {

    a: - match the current token with a and move to the next token;


    - call B;

    - match the current token with e and move to the next token;

    c: - match the current token with c and move to the next token;

    - call B;

    - match the current token with d and move to the next token;

    f: - call C //First Set of C

    }

    }

    proc C

    {

    match the current token with f and move to the next token;

    }

    proc B

    {

    case of the current token

    {

    b: - match the current token with b and move to the next token;

    - call B

    e,d: - do nothing //Follow Set of B

    }

    }
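    The same procedures can be written as a runnable recursive predictive parser (our own sketch, not from the text) for A → aBe | cBd | C; B → bB | ε; C → f:

    def parse(tokens):
        tokens = list(tokens) + ['$']
        pos = 0
        def look():
            return tokens[pos]
        def eat(t):
            nonlocal pos
            if look() != t:
                raise SyntaxError(f"expected {t}, found {look()}")
            pos += 1
        def A():
            if look() == 'a':
                eat('a'); B(); eat('e')            # A -> aBe
            elif look() == 'c':
                eat('c'); B(); eat('d')            # A -> cBd
            elif look() == 'f':                    # FIRST(C) = { f }
                C()                                # A -> C
            else:
                raise SyntaxError(f"unexpected {look()}")
        def B():
            if look() == 'b':
                eat('b'); B()                      # B -> bB
            elif look() in ('e', 'd'):             # FOLLOW(B) = { e, d }
                pass                               # B -> ε
            else:
                raise SyntaxError(f"unexpected {look()}")
        def C():
            eat('f')                               # C -> f
        A()
        eat('$')                                   # input must be exhausted
        return True

    print(parse('abbe'))    # True
    print(parse('cd'))      # True: B -> ε chosen because d is in FOLLOW(B)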

    4. Non-Recursive Predictive Parsing - LL(1) Parser

    Non-recursive predictive parsing is table driven.

    • It is a top-down parser.
    • It is also known as an LL(1) parser.

    • Input buffer
      o the string to be parsed; we will assume that its end is marked with a special symbol $.


    • Output
      o a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input buffer.

    • Stack
      o contains the grammar symbols;
      o at the bottom of the stack, there is a special end-marker symbol $;
      o initially the stack contains only the symbol $ and the starting symbol S ($S is the initial stack);
      o when the stack is emptied (i.e. only $ is left in the stack), parsing is complete.

    • Parsing table
      o a two-dimensional array M[A, a];
      o each row is a non-terminal symbol;
      o each column is a terminal symbol or the special symbol $;
      o each entry holds a production rule.

    Parser Actions

    The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action. There are four possible parser actions:

    1. If X = a = $, the parser halts and announces successful completion of parsing (accept).
    2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol (match).
    3. If X is a non-terminal, the parser consults entry M[X, a] of the parsing table. If M[X, a] holds a production X → Y1Y2...Yk, the parser replaces X on top of the stack by Yk ... Y2Y1 (with Y1 on top) and outputs the production.
    4. Otherwise (X is a terminal different from a, or M[X, a] is empty), the parser reports an error.

    Example:

    For the grammar S → aBa; B → bB | ε and the following LL(1) parsing table:
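    The table figure is not reproduced here; the entries below are reconstructed by hand from FIRST/FOLLOW (FOLLOW(B) = {a}), and the loop implements the four actions listed above (our own sketch):

    # LL(1) table for S -> aBa ; B -> bB | ε   (entries reconstructed by hand)
    TABLE = {
        ('S', 'a'): ['a', 'B', 'a'],
        ('B', 'b'): ['b', 'B'],
        ('B', 'a'): [],                           # B -> ε
    }
    NONTERMINALS = {'S', 'B'}

    def ll1_parse(tokens):
        stack = ['$', 'S']                        # $ at the bottom, start symbol on top
        tokens = list(tokens) + ['$']
        i = 0
        while stack:
            X, a = stack.pop(), tokens[i]
            if X == '$' and a == '$':
                return True                       # accept
            if X not in NONTERMINALS:             # terminal on top of the stack
                if X != a:
                    return False                  # error
                i += 1                            # match, advance input
            else:
                rhs = TABLE.get((X, a))
                if rhs is None:
                    return False                  # error: empty table entry
                stack.extend(reversed(rhs))       # output X -> rhs; push rhs reversed

    print(ll1_parse('abba'))    # True
    print(ll1_parse('ab'))      # False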


    Construction of Predictive Parsing Tables – LL(1) Parsing Table :

    Definition FIRST

    FIRST(α) is the set of terminals that begin the strings derivable from α; if α ⇒* ε, then ε is also in FIRST(α). FIRST is computed by the rules:
    1. If X is a terminal, then FIRST(X) = { X }.
    2. If X → ε is a production, then add ε to FIRST(X).
    3. If X → Y1Y2...Yk is a production, add FIRST(Y1) except ε to FIRST(X); if Y1 ⇒* ε, also add FIRST(Y2) except ε, and so on; if all Yi derive ε, add ε to FIRST(X).
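    The computation can be sketched as a fixed-point iteration (our own code, not from the text; ε is written as the empty string '' and ε-alternatives as the empty tuple):

    # The left-recursion-free expression grammar from the earlier example.
    G = {
        'E':  [('T', "E'")],
        "E'": [('+', 'T', "E'"), ()],      # () is the ε-alternative
        'T':  [('F', "T'")],
        "T'": [('*', 'F', "T'"), ()],
        'F':  [('(', 'E', ')'), ('id',)],
    }

    def first_sets(g):
        FIRST = {A: set() for A in g}
        changed = True
        while changed:                            # iterate until nothing is added
            changed = False
            for A, alts in g.items():
                for alt in alts:
                    nullable = True
                    for X in alt:
                        fx = FIRST[X] if X in g else {X}   # terminal: FIRST(X) = {X}
                        added = (fx - {''}) - FIRST[A]
                        if added:
                            FIRST[A] |= added
                            changed = True
                        if '' not in fx:
                            nullable = False
                            break
                    if nullable and '' not in FIRST[A]:    # whole alternative can vanish
                        FIRST[A].add('')
                        changed = True
        return FIRST

    print(first_sets(G)['E'])     # {'(', 'id'}
    print(first_sets(G)["E'"])    # {'+', ''}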


    Definition FOLLOW

    FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form, i.e. S ⇒* αAaβ. FOLLOW is computed by the rules:
    1. $ is in FOLLOW(S), where S is the start symbol.
    2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
    3. If there is a production A → αB, or a production A → αBβ where ε is in FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).

    Definition LL(1) Grammar

    A grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G:
    1. FIRST(α) and FIRST(β) are disjoint.
    2. At most one of α and β can derive the empty string.
    3. If β ⇒* ε, then FIRST(α) is disjoint from FOLLOW(A).


    LL(1) Grammars


    Definition Parsing Table

    The LL(1) parsing table M is constructed as follows. For each production A → α of the grammar:
    1. For each terminal a in FIRST(α), add A → α to M[A, a].
    2. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A); if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.
    All entries of M left undefined are errors.


    Derivation of id+id*id Using Predictive Parsing Table

    Non – LL (1) Grammars


    Bottom Up Parsing

    Given a string of terminals

    Build parse tree starting from leaves and working up toward the root

    Reverse of right-most derivation

    Used for a type of grammar called LR

    LR parsers are difficult to build by hand

    We use automatic parser generators for LR grammars

    A bottom-up parser creates the parse tree of the given input starting from leaves towards the root.

    A bottom-up parser tries to find the right-most derivation of the given input in the reverse order.

    (a) S ⇒ ... ⇒ ω (the right-most derivation of ω)

    (b) The bottom-up parser finds the steps of this right-most derivation in the reverse order.

    Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift and reduce.

    o At each shift action, the current symbol in the input string is pushed onto a stack.
    o At each reduction step, the symbols at the top of the stack (this symbol sequence is the right side of a production) are replaced by the non-terminal at the left side of that production.
    o There are also two more actions: accept and error.


    Example
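    The original figure is not available; the following hand-worked trace (our own example, for the grammar E → E+T | T; T → T*F | F; F → id and the input id+id) illustrates the shift and reduce actions:

    Stack       Input      Action
    $           id+id$     shift
    $ id        +id$       reduce by F → id
    $ F         +id$       reduce by T → F
    $ T         +id$       reduce by E → T
    $ E         +id$       shift
    $ E +       id$        shift
    $ E + id    $          reduce by F → id
    $ E + F     $          reduce by T → F
    $ E + T     $          reduce by E → E+T
    $ E         $          accept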

    1. Shift – Reduce Parsing

    • A shift-reduce parser tries to reduce the given input string to the starting symbol.
    • At each reduction step, a substring of the input matching the right side of a production rule is replaced by the non-terminal at the left side of that production rule.
    • If the substring is chosen correctly, the right-most derivation of that string is created in the reverse order.
    • It is a form of bottom-up parsing.
    • It consists of:
      – a stack, which holds grammar symbols;
      – an input buffer, which holds the rest of the string to be parsed.
    • The handle always appears on the top of the stack.



    Example

    Handle

    A handle of a right-sentential form γ is a production A → β together with a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a right-most derivation of γ. Shift-reduce parsing works by handle pruning: reducing the handle at each step reverses the right-most derivation.


    Stack Implementation
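    A toy stack implementation (our own sketch, not from the text) for E → E+T | T; T → T*F | F; F → id. It reduces A → β whenever β is on top of the stack and the lookahead is in FOLLOW(A); this naive rule happens to suffice for this tiny grammar, whereas real shift-reduce parsers consult an LR parsing table instead:

    PRODUCTIONS = [                      # listed longest right side first
        ('E', ['E', '+', 'T']),
        ('T', ['T', '*', 'F']),
        ('F', ['id']),
        ('T', ['F']),
        ('E', ['T']),
    ]
    FOLLOW = {'E': {'+', ')', '$'},
              'T': {'+', '*', ')', '$'},
              'F': {'+', '*', ')', '$'}}

    def shift_reduce(tokens):
        tokens = tokens + ['$']
        stack, i = ['$'], 0
        while True:
            la = tokens[i]
            for A, beta in PRODUCTIONS:                  # try to reduce first
                if stack[-len(beta):] == beta and la in FOLLOW[A]:
                    del stack[-len(beta):]               # pop the handle ...
                    stack.append(A)                      # ... push its left side
                    print('reduce by', A, '->', ' '.join(beta))
                    break
            else:                                        # no reduction applied
                if la == '$':
                    return stack == ['$', 'E']           # accept or error
                stack.append(la); i += 1                 # shift
                print('shift', la)

    print(shift_reduce(['id', '+', 'id', '*', 'id']))    # True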


    Conflicts during Shift Reduce Parsing

    • There are context-free grammars for which shift-reduce parsers cannot be used.
    • The stack contents and the next input symbol may not be enough to decide the action:
      – shift/reduce conflict: the parser cannot decide whether to make a shift operation or a reduction;
      – reduce/reduce conflict: the parser cannot decide which of several reductions to make.

    If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k) grammar.

    Types of Shift-Reduce Parsing

    There are two main types: operator-precedence parsing and LR parsing, discussed in turn below.


    2. Operator Precedence Parsing

    Precedence Relation

    • In operator-precedence parsing, we define three disjoint precedence relations between certain pairs of terminals:
      o a <· b : a yields precedence to b (a has lower precedence than b)
      o a =· b : a has the same precedence as b
      o a ·> b : a takes precedence over b (b has lower precedence than a)

    • The determination of the correct precedence relations between terminals is based on the traditional notions of associativity and precedence of operators. (Unary minus causes a problem.)

    • The intention of the precedence relations is to find the handle of a right-sentential form: <· marks the left end of the handle and ·> marks the right end.

    • In the input string $a1a2...an$, we insert the precedence relation between each pair of adjacent terminals (the relation that holds for that pair).

    • Example

    Using Precedence Relation to Find Handles

    • Scan the string from the left end until the first ·> is encountered.
    • Then scan backwards (to the left) over any =· until a <· is encountered.


    • The handle contains everything to the left of the first ·> and to the right of the <· encountered in the previous step, including any intervening or surrounding non-terminals.


    Creating Operator Precedence Relation From Associativity and Precedence

    Example
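    The original table is not available; the following is our reconstruction of the standard relations for the terminals +, *, id and $, with * having higher precedence than + and both operators left-associative (blank entries are errors):

            +     *     id    $
      +     ·>    <·    <·    ·>
      *     ·>    ·>    <·    ·>
      id    ·>    ·>          ·>
      $     <·    <·    <·

    For the input id+id*id we then obtain

      $ <· id ·> + <· id ·> * <· id ·> $

    so the first handle found is the leftmost id, which is reduced first, exactly as in the shift-reduce trace shown earlier.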


    Operator Precedence Grammar

    There is another more general way to compute precedence relations among terminals:

    1. a =· b if there is a right side of a production of the form αaβbγ, where β is either a single non-terminal or ε.

    2. a <· b if for some non-terminal A there is a right side of the form αaAβ and A derives γbδ, where γ is a single non-terminal or ε.

    3. a ·> b if for some non-terminal A there is a right side of the form αAbβ and A derives γaδ, where δ is a single non-terminal or ε.

    Note that the grammar must be unambiguous for this method. Unlike the previous method, it

    does not take into account any other property and is based purely on grammar productions. An

    ambiguous grammar will result in multiple entries in the table and thus cannot be used.

    Handling Unary Minus

    Operator-precedence parsing cannot handle the unary minus when the grammar also uses the binary minus.

    • The best approach to solve this problem is to let the lexical analyzer handle it: the lexical analyzer returns two different operators for the unary minus and the binary minus.

    • The lexical analyzer needs look-ahead to distinguish the binary minus from the unary minus.

    • Then, we make O <· unary-minus for any operator O if unary minus has higher precedence than O.

    Advantages and Disadvantages

    Advantages:
      o simple
      o powerful enough for expressions in programming languages


    Disadvantages:
      o It cannot handle the unary minus (the lexical analyzer should handle it).
      o It works only for a small class of grammars.
      o It is difficult to decide exactly which language the grammar recognizes.

    3. LR Parser or LR Parsing

    LR parsing is attractive because:

    LR parsing is the most general non-backtracking shift-reduce parsing method, yet it is still efficient.

    The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers: LL(1) grammars ⊂ LR(1) grammars.

    An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.

    Parser Configuration


    Parser Actions

    An LR parser decides among four actions by consulting action[s, a], where s is the state on top of the stack and a is the current input symbol:

    1. shift s': push the input symbol a and the state s' onto the stack, and advance the input.
    2. reduce A → β: pop |β| states off the stack; if s'' is the state now on top, push goto[s'', A]; output the production A → β.
    3. accept: parsing is successfully completed.
    4. error: a syntax error has been discovered; call an error-recovery routine.

    Construction of Parsing Tables

    An LR parser using SLR parsing tables for a grammar G is called the SLR parser for G.

    If a grammar G has an SLR parsing table, it is called an SLR grammar.

    Every SLR grammar is unambiguous, but not every unambiguous grammar is SLR.

    Augmented Grammar: G’ is G with a new production rule S’→S where S’ is the new starting symbol.

    Closure Operation

    If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:


    1. Initially, every LR(0) item in I is added to closure(I).

    2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). We apply this rule until no more new LR(0) items can be added to closure(I).

    GOTO Operation

    If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then

    goto(I,X) is defined as follows:

    If A → α.Xβ in I then every item in closure({A → αX.β}) will be in goto(I,X).

    Example:

    I = { E’ → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }

    goto(I,E) = { E’ → E., E → E.+T }

    goto(I,T) = { E → T., T → T.*F }

    goto(I,F) = {T → F. }

    goto(I,() = {F→ (.E), E→ .E+T, E→ .T, T→ .T*F, T→ .F, F→ .(E), F→ .id }

    goto(I,id) = { F → id. }

    Construction of the Canonical LR(0) items

    To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)

    collection of the grammar G’.

    Algorithm:

    C is { closure({S’→.S}) }

    repeat the followings until no more set of LR(0) items can be added to C.

    for each I in C and each grammar symbol X

    if goto(I,X) is not empty and not in C

    add goto(I,X) to C

    GOTO function is a DFA on the sets in C.
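    These operations can be sketched directly in Python (our own code, not from the text); an LR(0) item A → α.β is represented as a triple (A, rhs, dot), and the grammar is the augmented expression grammar used above:

    GRAMMAR = {
        "E'": [('E',)],
        'E':  [('E', '+', 'T'), ('T',)],
        'T':  [('T', '*', 'F'), ('F',)],
        'F':  [('(', 'E', ')'), ('id',)],
    }

    def closure(items):
        # rule 1: every item of I is in closure(I); rule 2: while the dot is in
        # front of a non-terminal B, add B -> .γ for every production B -> γ
        items = set(items)
        while True:
            new = set()
            for A, rhs, dot in items:
                if dot < len(rhs) and rhs[dot] in GRAMMAR:
                    for gamma in GRAMMAR[rhs[dot]]:
                        new.add((rhs[dot], gamma, 0))
            if new <= items:
                return frozenset(items)
            items |= new

    def goto(I, X):
        # move the dot over X in every item of I, then take the closure
        moved = {(A, rhs, dot + 1) for A, rhs, dot in I
                 if dot < len(rhs) and rhs[dot] == X}
        return closure(moved) if moved else None

    def canonical_collection():
        # C := { closure({S' -> .S}) }, then keep adding goto(I, X) sets
        start = closure({("E'", ('E',), 0)})
        C, work = {start}, [start]
        symbols = {s for alts in GRAMMAR.values() for rhs in alts for s in rhs}
        while work:
            I = work.pop()
            for X in symbols:
                J = goto(I, X)
                if J is not None and J not in C:
                    C.add(J)
                    work.append(J)
        return C

    print(len(canonical_collection()))     # 12 item sets (I0..I11) for this grammar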

    Example

    Let the following be the grammar and its LR parsing table:


    Construction of Parsing Table

    1. Construct the canonical collection of sets of LR(0) items for G’.

    C←{I0,...,In}

    2. Create the parsing action table as follows:

    a. If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.

    b. If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'.

    c. If S' → S. is in Ii, then action[i,$] is accept.

    d. If any conflicting actions are generated by these rules, the grammar is not SLR(1).

    3. Create the parsing goto table

    a. for all non-terminals A, if goto(Ii,A)=Ij then goto[i,A]=j

    4. All entries not defined by (2) and (3) are errors.

    5. Initial state of the parser contains S’→.S
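    Steps 2 and 3 can be sketched as follows (our own code, reusing GRAMMAR, closure and goto from the previous sketch; the FOLLOW sets are assumed to be precomputed by hand, and the conflict check of rule d is omitted for brevity):

    def slr_tables(C, follow):
        state = {I: i for i, I in enumerate(C)}        # number the item sets
        ACTION, GOTO = {}, {}
        for I, i in state.items():
            for A, rhs, dot in I:
                if dot < len(rhs):                     # item A -> α.Xβ
                    X = rhs[dot]
                    j = state[goto(I, X)]
                    if X in GRAMMAR:
                        GOTO[i, X] = j                 # rule 3: goto entry
                    else:
                        ACTION[i, X] = ('shift', j)    # rule 2a: shift
                elif A != "E'":                        # item A -> α.
                    for a in follow[A]:                # rule 2b: reduce on FOLLOW(A)
                        ACTION[i, a] = ('reduce', A, rhs)
                else:
                    ACTION[i, '$'] = ('accept',)       # rule 2c

        return ACTION, GOTO

    # FOLLOW sets for the expression grammar, computed by hand:
    FOLLOW = {'E': {'+', ')', '$'}, 'T': {'+', '*', ')', '$'},
              'F': {'+', '*', ')', '$'}}
    action, goto_table = slr_tables(canonical_collection(), FOLLOW)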


    LALR Parsing tables

    LALR stands for LookAhead LR.

    LALR parsers are often used in practice because LALR parsing tables are smaller than Canonical LR parsing tables.

    The SLR and LALR parsing tables for a grammar G have the same number of states.

    But LALR parsers recognize more grammars than SLR parsers.

    yacc creates an LALR parser for the given grammar.

    A state of the LALR parser is again a set of LR(1) items; it is obtained by merging states of the canonical LR(1) parser that have the same core.

    This shrinking process may introduce a reduce/reduce conflict in the resulting LALR parser.

    In that case the grammar is NOT LALR.

    The shrinking process cannot produce a shift/reduce conflict.

    Constructing LALR set of items

    The core of a set of LR(1) items is the set of their first components, i.e., the underlying LR(0) items with the lookaheads ignored.


    Find the states (sets of LR(1) items) in a canonical LR(1) parser that have the same core, and merge them into a single state.

    Do this for all states of the canonical LR(1) parser to obtain the states of the LALR parser.

    In fact, the number of states of the LALR parser for a grammar will be equal to the number of states of the SLR parser for that grammar.

    Parsing Tables Construction

    Shift / Reduce Conflict


    Reduce / Reduce Conflict

    LALR(1) Items


    Using Ambiguous Grammar

    All grammars used in the construction of LR parsing tables must be unambiguous.

    Can we create LR parsing tables for ambiguous grammars?
      o Yes, but they will have conflicts.
      o We can resolve these conflicts in favor of one of the actions to disambiguate the grammar.
      o At the end, we will again have an unambiguous grammar.

    Why would we want to use an ambiguous grammar?
      o Some ambiguous grammars are more natural, and a corresponding unambiguous grammar can be very complex.
      o Using an ambiguous grammar may eliminate unnecessary reductions.

    Example

    Sets of LR(0) Items for Ambiguous Grammar


    SLR-Parsing Tables for Ambiguous Grammar
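    The table figures are not available; our summary of the standard resolution for the classic ambiguous grammar E → E+E | E*E | (E) | id: the state containing the item E → E+E. has a shift/reduce conflict on the lookaheads + and *. It is resolved by precedence and associativity (this is what yacc's precedence declarations express): on lookahead *, shift, because * has higher precedence than +; on lookahead +, reduce, because + is left-associative. Symmetrically, in the state containing E → E*E. we reduce on both + and *.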


    Syntax Directed Translation Schemes

    Syntax Directed Translation

    Translation guided by syntax

    Used to generate intermediate code, evaluate expressions, and perform type checking.

    Attach rules to productions in the grammar

    Rules (program fragments) executed when the production is used during syntax analysis

    Grammar symbols are associated with attributes to associate information with the

    programming language constructs that they represent.

    Values of these attributes are evaluated by the semantic rules associated with the production rules.

    Evaluation of these semantic rules:
      o may generate intermediate codes;
      o may put information into the symbol table;
      o may perform type checking;
      o may issue error messages;
      o may perform some other activities.
    In fact, they may perform almost any activity.

    An attribute may hold almost anything: a string, a number, a memory location, a complex record.

    Evaluation of a semantic rule defines the value of an attribute, but a semantic rule may also have side effects, such as printing a value.


    Syntax Directed Definition

    Example : Consider the grammar
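    The original figure is not available; a typical syntax-directed definition of this kind is the classical desk-calculator SDD (reconstructed here from standard material), in which every attribute val is synthesized:

    Production        Semantic Rule
    L → E n           L.val = E.val
    E → E1 + T        E.val = E1.val + T.val
    E → T             E.val = T.val
    T → T1 * F        T.val = T1.val * F.val
    T → F             T.val = F.val
    F → (E)           F.val = E.val
    F → digit         F.val = digit.lexval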


    Implementation of Syntax Directed Translation

    The Syntax Directed Translation can be implemented by

    Dependency Graph

    S-Attributed Definitions

    L-Attributed Definitions

    Synthesized Attributes

    SDD to Construct Syntax Tree

    Inherited Attributes


    S-Attributed Definitions

    An SDD is S-attributed if every attribute is synthesized.

    We can use a post-order traversal of the parse tree to evaluate attributes in S-attributed definitions:

    postorder(N)

    {

    for (each child C of N, from the left) postorder(C);

    evaluate the attributes associated with node N;

    }

    S-Attributed definitions can be implemented during bottom-up parsing without the need to explicitly create parse trees
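    A concrete post-order evaluation (our own sketch, not from the text) for the synthesized attribute val of the desk-calculator SDD above, with a parse tree encoded as nested tuples and numbers at the leaves:

    def postorder_val(node):
        # evaluate the synthesized attribute `val` bottom-up
        if isinstance(node, (int, float)):            # leaf: digit.lexval
            return node
        op, children = node
        vals = [postorder_val(c) for c in children]   # children first (post-order)
        if op == '+':
            return vals[0] + vals[1]                  # E -> E + T
        if op == '*':
            return vals[0] * vals[1]                  # T -> T * F
        return vals[0]                                # unit productions, parentheses

    tree = ('+', [3, ('*', [4, 5])])                  # parse tree of 3 + 4 * 5
    print(postorder_val(tree))                        # 23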

    L-Attributed Definitions

    An SDD is L-attributed if each attribute is either (1) synthesized, or (2) an inherited attribute of a symbol Xi on the right side of a production A → X1X2...Xn that depends only on the inherited attributes of A and the attributes of the symbols X1, ..., Xi−1 to the left of Xi.


    Syntax Directed Translation Schemes (SDT)

    Translation schemes are more implementation oriented than syntax-directed definitions, since they indicate the order in which semantic rules and attributes are to be

    evaluated.

    Definition. A translation scheme is a context-free grammar in which:
    1. attributes are associated with grammar symbols;
    2. semantic actions are enclosed between braces {} and are inserted within the right-hand side of productions.

    Yacc uses translation schemes.

    Translation schemes deal with both synthesized and inherited attributes.

    Semantic actions are treated as terminal symbols: annotated parse trees contain semantic actions as children of the node standing for the corresponding production.

    Translation schemes are useful to evaluate L-attributed definitions at parsing time (even if they are a general mechanism).

    An L-Attributed Syntax-Directed Definition can be turned into a Translation Scheme.

    An SDT is a context-free grammar with program fragments embedded within production bodies.

    • Those program fragments are called semantic actions.
    • They can appear at any position within the production body.
    • Any SDT can be implemented by first building a parse tree and then performing the actions in a left-to-right depth-first order.
    • Typically, SDTs are implemented during parsing, without building a parse tree.

    Consider the Translation Scheme for the L-Attributed Definition for “type declarations”:
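    The original figure is not available; the standard translation scheme for type declarations (reconstructed from standard material, with inherited attribute inh carrying the type down the list of identifiers) is:

    D → T { L.inh := T.type } L
    T → int { T.type := integer }
    T → float { T.type := float }
    L → { L1.inh := L.inh } L1 , id { addtype(id.entry, L.inh) }
    L → id { addtype(id.entry, L.inh) }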


    Semantic Analysis

    Check for semantic consistency with the language definition

    Type checking - checks if each operator is applied to the right type of operands

    Checks that cannot be done in syntax analysis, like variables declared before use

    Type conversions - coercions

    Uses intermediate representation (syntax tree for example) and symbol table

    Semantic analysis computes additional information related to the meaning of the program once the syntactic structure of the program is known.