Post on 23-Oct-2014
Unit 6: Compilers
Introduction

A compiler is a program that reads a program written in one language, called the source language, and translates it into an equivalent program in another language, called the target language.

There are two parts of compilation: analysis and synthesis.
Analysis: creates an intermediate representation of the source program (SP).
Synthesis: constructs the desired target program.
[Diagram: the source program enters the compiler, which produces the target program and, when needed, error messages.]
Phases of a Compiler

The compiler operates in phases:
Lexical analyzer
Syntax analyzer
Semantic analyzer
Intermediate code generator
Code optimizer
Code generator

The symbol table manager and the error handler interact with all six phases. The source program enters the first phase and the target program leaves the last.
Phases of a Compiler

Lexical analyzer:
Performs lexical analysis, also known as linear analysis or scanning.
The stream of characters is read from left to right and grouped into tokens.
White space is eliminated during lexical analysis.
For example, in
position = initial + rate * 60
the following tokens are formed:
the identifier position, the assignment symbol =, the identifier initial, the plus sign, the identifier rate, the multiplication sign, and the number 60.
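The scanning step above can be sketched with Python's re module (a minimal illustration; the token names and patterns here are our own, not a fixed standard):

```python
import re

# Token specification for the example statement; "ws" is matched but
# discarded, since white space is eliminated during lexical analysis.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("ws",     r"\s+"),
]

def tokenize(source):
    """Scan `source` left to right and group characters into tokens."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, source):
        if m.lastgroup != "ws":            # eliminate white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position = initial + rate * 60"))
```

Running this prints exactly the seven tokens listed above, in order.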
Phases of a Compiler

Syntax analyzer:
Performs syntax analysis, also known as hierarchical analysis or parsing.
It involves grouping the tokens of the source program into grammatical phrases.
These phrases are then represented by a parse tree.
Phases of a Compiler

[Parse tree for position = initial + rate * 60: the root is an assignment statement with three children, the identifier position, the symbol =, and an expression. That expression is the sum (+) of an expression for the identifier initial and an expression for the product (*) of the identifier rate and the number 60.]
Phases of a Compiler
A syntax tree is a compressed representation of a parse tree. The operators appear in the interior nodes, and the operands of an operator are the children of the node for that operator.

[Syntax tree for position = initial + rate * 60: = at the root, with children position and +; the + node has children initial and *; the * node has children rate and 60.]
Phases of a Compiler

Semantic analyzer:
Performs semantic analysis.
It involves checking the source program for semantic errors and gathering type information.
An important component is type checking.

[Syntax tree after semantic analysis: the same tree as above, except that an inttoreal node is inserted above the integer 60 to convert it to a real before the multiplication.]
Phases of a Compiler

Intermediate code generation:
The intermediate code must have two properties: it must be easy to produce and easy to translate. It can take different forms; one such form is "three-address code", which is like an assembly language. Three-address code consists of a sequence of instructions, each of which has at most three operands.
For id1 = id2 + id3 * 60:
temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
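The translation into three-address code can be sketched as a small tree walk (the tuple AST and the temp-naming scheme are our own illustration, not a fixed format):

```python
def gen_tac(node, code, counter):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, str):              # identifier or literal leaf
        return node
    if len(node) == 2:                     # unary node, e.g. ("inttoreal", "60")
        op, arg = node
        a = gen_tac(arg, code, counter)
        counter[0] += 1
        t = f"temp{counter[0]}"
        code.append(f"{t} = {op}({a})")
        return t
    op, left, right = node                 # binary node
    l = gen_tac(left, code, counter)
    r = gen_tac(right, code, counter)
    counter[0] += 1
    t = f"temp{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

# id1 = id2 + id3 * 60, with the semantic analyzer's inttoreal inserted
ast = ("+", "id2", ("*", "id3", ("inttoreal", "60")))
code = []
code.append(f"id1 = {gen_tac(ast, code, [0])}")
for line in code:
    print(line)
```

This reproduces the four-instruction sequence shown above.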
Phases of a Compiler

Code optimizer:
Attempts to improve the intermediate code. For example,
temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
is optimized to
temp1 = id3 * 60.0
id1 = id2 + temp1
Phases of a Compiler

Code generator:
Deals with generation of the target code, consisting of relocatable machine code or assembly code. For example,
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Here the F suffix indicates floating point, '#' marks an immediate constant, and each instruction names a source and a destination operand.
Phases of a Compiler

Symbol table management:
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
It allows us to find the record for each identifier and to store or retrieve data from that record.

Error detection and reporting:
Each phase can encounter errors, and each phase must deal with those errors.
Lexical Analyzer

The lexical analyzer is the first phase of the compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

[Figure: interaction of lexical analyzer with parser; the parser issues a "get next token" request, the lexical analyzer reads the source program and returns a token, and both consult the symbol table.]
Lexical Analyzer

Tokens, Patterns and Lexemes:
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
A token is an abstract symbol representing a kind of lexical unit, e.g., a keyword or an identifier.
A pattern is a description of the form that the lexemes of a token may take.
For example, in the statement
const pi = 3.1416;
the substring pi is a lexeme for the token identifier.
Lexical Analyzer

Tokens, Patterns and Lexemes:
As another example, in the C statement
printf("Total=%d\n", score);
both printf and score are lexemes matching the pattern for the token id, and "Total=%d\n" is a lexeme matching the pattern for the token literal.
In most programming languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.
Lexical Analyzer

Tokens, Patterns and Lexemes:

TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or > or >=
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "
Lexical Analyzer: Specification of Tokens

Regular expressions are an important notation for specifying tokens.

Strings and Languages:
The term alphabet or character class denotes any finite set of symbols; e.g., the set {0, 1} is the binary alphabet.
A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
The term language denotes any set of strings over some fixed alphabet.
Lexical Analyzer: Specification of Tokens

Operations on languages:
There are several important operations, such as union, concatenation and closure, that can be applied to languages.
For example, let L be the set {A, B, …, Z, a, b, …, z} and D be the set {0, 1, …, 9}.
1. L ∪ D is the set of letters and digits
2. LD is the set of strings consisting of a letter followed by a digit
3. L⁴ is the set of all four-letter strings
4. L* is the set of all strings of letters, including ε, the empty string
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
6. D⁺ is the set of all strings of one or more digits
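These operations can be made concrete on tiny stand-in sets (a sketch; the real L and D are the full letter and digit alphabets, and L* is infinite, so the closure here is truncated at a fixed length):

```python
# Two-element stand-ins for the letter and digit alphabets.
L = {"a", "b"}
D = {"0", "1"}

def concat(A, B):
    """Concatenation AB: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}

def power(A, n):
    """A^n: A concatenated with itself n times (A^0 is {ε})."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def closure(A, upto):
    """A* truncated at `upto` concatenations (the true A* is infinite)."""
    result = set()
    for n in range(upto + 1):
        result |= power(A, n)
    return result

print(sorted(L | D))          # L ∪ D
print(sorted(concat(L, D)))   # LD: a letter followed by a digit
print(len(power(L, 4)))       # |L^4| = 2^4 = 16
print("" in closure(L, 3))    # ε belongs to L*
```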
Lexical Analyzer: Specification of Tokens

Regular Expressions:
An identifier is a letter followed by zero or more letters or digits, written as the expression letter (letter | digit)*.
The | here means "or", the parentheses are used to group subexpressions, the star means "zero or more instances of" the parenthesized expression, and the juxtaposition of letter with the remainder of the expression means concatenation.
A regular expression is built up out of simpler regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r).
Lexical Analyzer: Specification of Tokens

Regular Expressions (RE):
The rules that define the regular expressions over an alphabet ∑:
ε is a RE that denotes {ε}, the set containing the empty string.
If a is a symbol in ∑, then a is a RE that denotes {a}, the set containing the string a.
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a RE denoting L(r) ∪ L(s)
b) (r)(s) is a RE denoting L(r)L(s)
c) (r)* is a RE denoting (L(r))*
d) (r) is a RE denoting L(r)
A language denoted by a RE is said to be a regular set.
Lexical Analyzer: Specification of Tokens

Regular Expressions: Example: Let ∑ = {a, b}.
The RE a|b denotes the set {a, b}.
The RE (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two.
The RE a* denotes the set of all strings of zero or more a's: {ε, a, aa, …}.
The RE (a|b)* denotes the set of all strings of zero or more instances of an a or b.
The RE a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.
Lexical Analyzer: Specification of Tokens

Regular definitions:
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1, d2 → r2, …, dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in ∑ ∪ {d1, d2, …, di-1}.
Example: Consider the set of strings of letters and digits beginning with a letter. The regular definition for the set is
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
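This regular definition translates directly into Python's re syntax (a sketch; character classes stand in for the letter and digit definitions, and fullmatch checks the whole string against the pattern):

```python
import re

# letter → A..Z | a..z, digit → 0..9, id → letter (letter | digit)*
letter = "[A-Za-z]"
digit = "[0-9]"
id_re = re.compile(f"{letter}({letter}|{digit})*")

for s in ["count", "D2", "x9y", "2abc", ""]:
    print(s, bool(id_re.fullmatch(s)))
```

Strings that start with a digit, or the empty string, fail to match, exactly as the definition requires.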
Lexical Analyzer: Recognition of Tokens

Consider the following grammar fragment:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? (E (+|-)? digit+)?
Lexical Analyzer: Recognition of Tokens

REGULAR EXPRESSION   TOKEN   ATTRIBUTE VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Lexical Analyzer: Finite Automata

A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
A finite automaton can be deterministic or nondeterministic.
Finite automata are represented by transition graphs: labeled directed graphs in which the nodes are the states and the labeled edges represent the transition function.
Lexical Analyzer: Finite Automata
Nondeterministic Finite Automata (NFA):
A mathematical model consisting of:
1) a set of states S
2) an input alphabet ∑
3) a transition function move that maps state-symbol pairs to sets of states
4) a state s0 as the start or initial state
5) a set of states F as the final or accepting states
Lexical Analyzer: Finite Automata

Nondeterministic Finite Automata:
Example: the transition graph for an NFA that recognizes the language (a|b)*abb:
Set of states S: {0, 1, 2, 3}
Input alphabet ∑ = {a, b}
The initial state is 0.
The accepting state is 3, indicated by a double circle.

[Transition graph: start → 0; 0 loops to itself on a and b; 0 → 1 on a; 1 → 2 on b; 2 → 3 on b.]
Lexical Analyzer: Finite Automata

Nondeterministic Finite Automata:
Transition table: there is a row for each state and a column for each input symbol. The entry for row i and symbol a is the set of states that can be reached by a transition from state i on input a.
State Input symbols
a b
0 {0,1} {0}
1 - {2}
2 - {3}
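The transition table can be simulated directly: on each input symbol we track the set of states the NFA could be in (a sketch; the dictionary encoding is our own):

```python
# Transition table of the NFA for (a|b)*abb, as given above.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = {0}, {3}

def accepts(word):
    """Run the NFA on `word`, keeping the set of reachable states."""
    states = set(START)
    for ch in word:
        states = set().union(*(NFA.get((s, ch), set()) for s in states))
    return bool(states & ACCEPT)

print(accepts("aabb"))   # ends in abb: accepted
print(accepts("babb"))   # ends in abb: accepted
print(accepts("abab"))   # does not end in abb: rejected
```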
Lexical Analyzer: Finite Automata
Deterministic Finite Automata (DFA):
A mathematical model in which
1) no state has an ε-transition, i.e., a transition on input ε, and
2) for each state s and input symbol a, there is at most one edge labeled a leaving s.
Since there is at most one transition from each state on any input, it is very easy to determine whether a DFA accepts an input string.
Lexical Analyzer: Finite Automata

Conversion of NFA to DFA: subset construction algorithm
Input: NFA N
Output: equivalent DFA D
Method: the operations used are:

Operation       Description
ε-closure(s)    set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)    set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)      set of NFA states to which there is a transition on input symbol a from some NFA state s in T
Lexical Analyzer: Finite Automata

Conversion of NFA to DFA: subset construction algorithm:
initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtrans[T, a] := U
    end
end
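The algorithm above can be sketched in Python (the data representation is our own; the (a|b)*abb NFA from earlier is reused, and since it has no ε-transitions, ε-closure is the identity there):

```python
from collections import deque

def eps_closure(states, eps):
    """All NFA states reachable from `states` on ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa, eps, start, alphabet):
    """Subset construction: Dstates/Dtrans built from unmarked states."""
    start_d = eps_closure({start}, eps)
    dstates, dtrans = {start_d}, {}
    unmarked = deque([start_d])
    while unmarked:                    # while an unmarked state T remains
        T = unmarked.popleft()        # mark T
        for a in alphabet:
            move = set().union(*(nfa.get((s, a), set()) for s in T))
            U = eps_closure(move, eps)
            dtrans[(T, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    return dstates, dtrans

NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dstates, dtrans = subset_construction(NFA, {}, 0, "ab")
print(len(dstates))   # the DFA states are {0}, {0,1}, {0,2}, {0,3}
```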
Lexical Analyzer: Finite Automata

From a regular expression to an NFA: Thompson's Construction
To convert a regular expression r over an alphabet ∑ into an NFA N accepting L(r):
parse r into its constituent subexpressions, then construct NFAs for each of the basic symbols in r and combine them.
Lexical Analyzer Generator

Lex is used to specify lexical analyzers for a variety of languages. The tool is referred to as the Lex compiler.

[Figure: creating a lexical analyzer with Lex; a Lex source program lex.l is fed to the Lex compiler, which produces lex.yy.c; lex.yy.c is compiled by the C compiler into a.out; a.out then reads an input stream and produces a sequence of tokens.]
Lexical Analyzer Generator

Lex specifications:
A Lex program consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
The declaration section includes declarations of variables, manifest constants and regular definitions.
The translation rules are of the form:
p1 {action1}
p2 {action2}
… ….
Here each pi is a regular expression and each actioni is a program fragment describing what action is to be taken when pattern pi matches a lexeme.
Lexical Analyzer Generator

Design:
Given a set of specifications, the lexical analyzer should look for lexemes. This is usually implemented using a finite automaton.
The lexical analyzer generator constructs a transition table for a finite automaton from the regular expression patterns in the lexical analyzer generator specification.
The lexical analyzer itself consists of a finite automaton simulator that uses this transition table to look for the regular expression patterns in the input buffer.
This can be implemented using an NFA or a DFA. The transition table for an NFA is considerably smaller than that for a DFA, but the DFA recognizes patterns faster than the NFA.
Lexical Analyzer Generator

Design:

[Figure: model of a Lex compiler; (a) the Lex compiler turns a Lex specification into a transition table; (b) the schematic lexical analyzer is an FA simulator that uses the transition table to find the next lexeme in the input buffer.]
Syntax Analysis

Every programming language has rules that prescribe the syntactic structure of well-formed programs.
The syntax of programming language constructs can be described by context-free grammars or BNF (Backus-Naur Form) notation.
Grammars offer significant advantages:
They give a precise, yet easy-to-understand, syntactic specification of a programming language.
From certain classes of grammars, we can automatically construct an efficient parser that determines whether a source program is syntactically well-formed.
A properly designed grammar imparts a structure to a programming language that is useful for the translation of source programs.
New constructs can be added to a language easily.
Syntax Analysis

The parser obtains a string of tokens from the lexical analyzer. It then verifies that the string can be generated by the grammar for the source language. The parser should report syntax errors, if any.

[Figure: position of the parser in the compiler model; the lexical analyzer supplies tokens to the parser on request, the parser builds a parse tree for the rest of the front end, which produces an intermediate representation; all components consult the symbol table.]
Syntax Analysis

Three general types of parsers for grammars:
Universal parsing methods: can parse any grammar, but are too inefficient to use in production compilers.
Top-down methods: build parse trees from the top (root) to the bottom (leaves).
Bottom-up methods: start from the leaves and work up to the root.
Context-free grammars

Consider a conditional statement defined by a rule such as:
If S1 and S2 are statements and E is an expression, then
"if E then S1 else S2" is a statement.
This is written as the production
stmt → if expr then stmt else stmt
A context-free grammar consists of terminals, non-terminals, a start symbol and productions.
Context-free grammars

1. Terminals are the basic symbols from which strings are formed. The word "token" is a synonym for "terminal" when we are talking about grammars for programming languages.
2. Non-terminals are syntactic variables that denote sets of strings that help define the language generated by the grammar. They impose a hierarchical structure on the language.
3. In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes is the language denoted by the grammar.
4. The productions of a grammar specify the manner in which terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal followed by an arrow (→) followed by a string of non-terminals and terminals.
Context-free grammars

Example: the grammar with the following productions:
expr → expr op expr
expr → (expr)
expr → -expr
expr → id
op → +
op → -
op → *
op → /
In this grammar, the terminal symbols are id, +, -, *, /, ( and ).
The non-terminal symbols are expr and op, and expr is the start symbol.
The same grammar can be written compactly as:
E → EAE | (E) | -E | id
A → + | - | * | /
where E and A are the non-terminals, while id, +, -, *, /, ( and ) are the terminals.
Derivation and Parse trees

• Consider the following grammar:
E ==> E+E | E*E | (E) | -E | id
• E ==> -E is read as "E derives -E".
• We can take a single E and repeatedly apply productions in any order to obtain a sequence of replacements.
• For example, E ==> -E ==> -(E) ==> -(id)
• We call such a sequence of replacements a derivation of -(id) from E.
Derivation and Parse trees

• Given a grammar G with start symbol S, we can use the ==>+ relation (derives in one or more steps) to define L(G), the language generated by G.
• A string of terminals w is in L(G) if and only if S ==>+ w; the string w is called a sentence of G.
• If S ==>* α, where α may contain non-terminals, then α is a sentential form of G.
Derivation and Parse trees

• Parse trees:
• A parse tree may be viewed as a graphical representation for a derivation.
• Each interior node of a parse tree is labeled by a non-terminal.
• The leaves are labeled by non-terminals or terminals, read from left to right.
• For example, the parse tree for -(id+id) follows the derivation
E ==> -E
  ==> -(E)
  ==> -(E + E)
  ==> -(id + E)
  ==> -(id + id)

[Parse tree for -(id+id): E at the root with children - and E; that E has children (, E, ); the inner E has children E, +, E, where each child E derives id.]
Derivation and Parse trees

• Example: id+id*id has two derivations:
E ==> E+E ==> id+E ==> id+E*E ==> id+id*E ==> id+id*id
E ==> E*E ==> E+E*E ==> id+E*E ==> id+id*E ==> id+id*id

[Two parse trees for id+id*id: (a) rooted at E ==> E+E, with the second E expanded to E*E; (b) rooted at E ==> E*E, with the first E expanded to E+E. In both, each remaining E derives id.]
Ambiguity

• A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
• Equivalently, an ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for the same sentence.
• Carefully writing the grammar can eliminate ambiguity.
Elimination of Left Recursion

• Definition: A grammar is left-recursive if it has a non-terminal A such that there is a derivation A ==>+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed.
• A left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions
A → βA'
A' → αA' | ε
Elimination of Left Recursion

Algorithm:
Input: grammar G with no cycles or ε-productions.
Output: an equivalent grammar with no left recursion.
Method: apply the algorithm to G. Note that the resulting non-left-recursive grammar may have ε-productions.
1. Arrange the non-terminals in some order A1, A2, …, An.
2. for i := 1 to n do begin
       for j := 1 to i - 1 do begin
           replace each production of the form Ai ==> Aj γ by the productions
           Ai ==> δ1 γ | δ2 γ | … | δk γ,
           where Aj ==> δ1 | δ2 | … | δk are all the current Aj-productions
       end;
       eliminate the immediate left recursion among the Ai-productions
   end
Elimination of Left Recursion

No matter how many A-productions there are, we can eliminate immediate left recursion from them.
• First, we group the A-productions as
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
• Then, we replace the A-productions by
A → β1 A' | β2 A' | … | βn A'
A' → α1 A' | α2 A' | … | αm A' | ε
Elimination of Left Recursion

Example: consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
Eliminating the immediate left recursion from the productions for E and then for T, we obtain
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Left Factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing.
The basic idea applies when it is not clear which of two alternative productions to use to expand a non-terminal A.
For example, if A ==> αβ1 | αβ2 are two A-productions and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ1 or to αβ2.
We may defer the decision by expanding A to αA'. Then, after seeing the input derived from α, we expand A' to β1 or to β2:
A ==> αA'
A' ==> β1 | β2
Left Factoring

Algorithm:
Input: grammar G.
Output: an equivalent left-factored grammar.
Method: for each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, i.e., there is a non-trivial common prefix, replace all the A-productions
A ==> αβ1 | αβ2 | … | αβn | γ,
where γ represents all alternatives that do not begin with α, by
A ==> αA' | γ
A' ==> β1 | β2 | … | βn
Here A' is a new non-terminal. Repeatedly apply this transformation until no two alternatives for a non-terminal have a common prefix.
Left Factoring

Example: consider the following grammar:
S → iEtS | iEtSeS | a
E → b
Left-factored, it becomes:
S → iEtSS' | a
S' → eS | ε
E → b
Parsing methods

The syntax analysis phase of a compiler verifies that the sequence of tokens extracted by the scanner represents a valid sentence in the grammar of the programming language.
There are two major parsing approaches: top-down and bottom-up.
In top-down parsing, you start with the start symbol and apply the productions until you arrive at the desired string.
In bottom-up parsing, you start with the string and reduce it to the start symbol by applying the productions backwards.
Parsing methods

Consider the following grammar:
S → AB
A → aA | ε
B → b | bB
Here is a top-down parse of aaab:
S
AB      S → AB
aAB     A → aA
aaAB    A → aA
aaaAB   A → aA
aaaεB   A → ε
aaab    B → b
The top-down parse produces a leftmost derivation of the sentence.
Parsing methods

Consider the following grammar:
S → AB
A → aA | ε
B → b | bB
A bottom-up parse works in reverse; it prints out a rightmost derivation of the sentence:
aaab
aaaεb   (insert ε)
aaaAb   A → ε
aaAb    A → aA
aAb     A → aA
Ab      A → aA
AB      B → b
S       S → AB
Top-Down Parsing

Top-down parsing is an attempt to find the leftmost derivation for an input string.
Recursive-descent parsing: we execute a set of recursive procedures to process the input.
A procedure is associated with each non-terminal of the grammar.
As we parse the input string, we call the procedures that correspond to the left-side non-terminal of the productions.
Consider the following grammar:
S → cAd
A → ab | a
Input string w = cad
Top-Down Parsing

Constructing the parse tree:
Initially, the input pointer points to c. We use the first production for S to expand the tree, as shown in Fig. a.
The leftmost leaf matches the first input symbol, so we advance the input pointer to a, take the second leaf, and expand A using the first alternative, as shown in Fig. b.
There is a match for a, so we advance the pointer to the third symbol, d. Since b does not match d, we report failure, go back to A, and reset the pointer to position 2.
We then expand A with the other alternative, as shown in Fig. c. Since this produces a parse tree for w, we halt and announce successful completion of parsing.

[Fig. a: S with children c, A, d. Fig. b: the same tree with A expanded to a, b. Fig. c: after backtracking, A expanded to a, which matches the input cad.]
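The parse just described can be sketched as a tiny recursive-descent recognizer for S → cAd, A → ab | a (a simplification: A tries its alternatives in order, which is exactly the backtracking this grammar needs):

```python
def parse_A(s, i):
    """Try A's alternatives in order; return the new position or None."""
    if s[i:i+2] == "ab":     # first alternative: A → ab
        return i + 2
    if s[i:i+1] == "a":      # backtrack and try the second: A → a
        return i + 1
    return None

def parse_S(s):
    """S → cAd: match c, call the procedure for A, then match d."""
    if s[:1] != "c":
        return False
    j = parse_A(s, 1)
    # succeed only if A matched and the remaining input is exactly "d"
    return j is not None and s[j:] == "d"

print(parse_S("cad"))    # A → ab fails at position 1, so A → a is used
print(parse_S("cabd"))   # A → ab succeeds directly
print(parse_S("cd"))     # no alternative for A matches
```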
Top-Down Parsing

Predictive parser: a recursive-descent parser that needs no backtracking.
Transition diagrams for predictive parsers:
To construct the transition diagram, first eliminate left recursion and then left factor the grammar. Then for each non-terminal A do the following:
1. create an initial and a final state
2. for each production A → X1X2…Xn, create a path from the initial to the final state, with edges labeled X1, X2, …, Xn.
A predictive parser based on transition diagrams attempts to match terminal symbols against the input, and makes a potentially recursive procedure call whenever it has to follow an edge labeled by a non-terminal.
Top-Down Parsing

Nonrecursive Predictive Parsing:
It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls.

[Figure: model of a nonrecursive predictive parser; an input buffer (e.g., a + b $), a stack (X Y Z $), a parsing table M, and the predictive parsing program, which produces the output.]
Top-Down Parsing: Nonrecursive Predictive Parsing

Input buffer:
contains the string to be parsed, followed by $, a symbol used to indicate the end of the input string.
Stack:
contains a sequence of grammar symbols with $ on the bottom. Initially, it contains the start symbol of the grammar on top of $.
Parsing table:
a 2-D array M[A, a], where A is a non-terminal and a is a terminal or the symbol $.
The parser is controlled by a program as follows.
Top-Down Parsing: Nonrecursive Predictive Parsing

The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols determine the action. There are three possibilities:
If X = a = $, the parser halts and announces successful completion.
If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
If M[X, a] = error, the parser calls an error recovery routine.
Top-Down Parsing: Nonrecursive Predictive Parsing

Input: a string w and a parsing table M for grammar G.
Output: if w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: initially, the stack holds $S with S on top, and w$ is in the input buffer.
set ip to point to the first symbol of w$.
repeat
let X be the top stack symbol and a the symbol pointed to by ip.
if X is a terminal or $ then
if X=a then
pop X from the stack and advance ip
else error()
else
if M[X,a]=X->Y1Y2...Yk then begin
pop X from the stack;
push Yk,Yk-1...Y1 onto the stack, with Y1 on top;
output the production X-> Y1Y2...Yk
end
else error()
until X=$
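The algorithm above can be sketched as a table-driven parser for the left-recursion-free expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id (the dictionary encoding of the table M is our own):

```python
# Parsing table M: (nonterminal, lookahead) → production body.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Return the productions used (a leftmost derivation) or raise."""
    stack = ["$", "E"]             # start symbol on top of $
    tokens = tokens + ["$"]
    ip, output = 0, []
    while True:
        X, a = stack[-1], tokens[ip]
        if X == "$":               # X = a = $: accept
            if a == "$":
                return output
            raise SyntaxError(a)
        if X not in NONTERMINALS:  # terminal on top: must match the input
            if X != a:
                raise SyntaxError(a)
            stack.pop()
            ip += 1
        else:
            body = TABLE.get((X, a))
            if body is None:       # M[X, a] = error
                raise SyntaxError(a)
            output.append(f"{X} -> {' '.join(body) or 'ε'}")
            stack.pop()
            stack.extend(reversed(body))   # push Yk…Y1, with Y1 on top

print(parse(["id", "+", "id", "*", "id"]))
```

On id+id*id this emits eleven productions, starting with E -> T E'.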
Top-Down Parsing: Nonrecursive Predictive Parsing

Nonterminal   id        +           *            (         )        $
E             E → TE'                            E → TE'
E'                      E' → +TE'                          E' → ε   E' → ε
T             T → FT'                            T → FT'
T'                      T' → ε      T' → *FT'              T' → ε   T' → ε
F             F → id                             F → (E)

Parsing table M
Top-Down Parsing: Nonrecursive Predictive Parsing

The construction of a predictive parser is aided by two functions associated with a grammar G: FIRST and FOLLOW.
If α is any string of grammar symbols, let FIRST(α) be the set of terminals that begin the strings derived from α. If α ==>* ε, then ε is also in FIRST(α).
FOLLOW(A), for a non-terminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form, that is, the set of terminals a such that there exists a derivation of the form S ==>* αAaβ for some α and β. If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).
Top-Down Parsing: Nonrecursive Predictive Parsing

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1…Yi-1 ==>* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X). In particular, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X); but if Y1 ==>* ε, then we also add FIRST(Y2), and so on.
Top-Down Parsing: Nonrecursive Predictive Parsing

To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
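Both computations can be sketched as fixed-point iterations over the expression grammar (a sketch; ε-productions are encoded as empty bodies, a convention of our own):

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],   # [] is the ε-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"

def first_of(seq, FIRST):
    """FIRST of a symbol string: rule 3 applied left to right."""
    out = set()
    for X in seq:
        f = FIRST.get(X, {X})       # a terminal's FIRST is itself
        out |= f - {"ε"}
        if "ε" not in f:
            return out
    out.add("ε")                    # every symbol derived ε
    return out

FIRST = {A: set() for A in GRAMMAR}
changed = True
while changed:                      # iterate until no FIRST set grows
    changed = False
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body, FIRST)
            if not f <= FIRST[A]:
                FIRST[A] |= f
                changed = True

FOLLOW = {A: set() for A in GRAMMAR}
FOLLOW[START].add("$")              # rule 1
changed = True
while changed:                      # iterate until no FOLLOW set grows
    changed = False
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            for i, B in enumerate(body):
                if B not in GRAMMAR:
                    continue        # terminals have no FOLLOW
                f = first_of(body[i + 1:], FIRST)
                add = (f - {"ε"}) | (FOLLOW[A] if "ε" in f else set())
                if not add <= FOLLOW[B]:
                    FOLLOW[B] |= add
                    changed = True

print(FIRST["E"], FOLLOW["F"])
```

The results match the parsing table above: FIRST(E) = {(, id} and FOLLOW(F) = {+, *, ), $}.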
Top-Down Parsing: Nonrecursive Predictive Parsing
Algorithm: Construction of predictive parsing table
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M be error.
Top-Down Parsing: LL(1) grammars

A grammar whose parsing table has no multiply-defined entries is said to be LL(1).
The first "L" stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions.
Properties:
No ambiguous or left-recursive grammar can be LL(1).
A grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G, the following conditions hold:
For no terminal a do both α and β derive strings beginning with a.
At most one of α and β can derive the empty string.
If β ==>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
Top-Down Parsing: LL(1) grammars

Disadvantages:
The main difficulty in using predictive parsing is in writing a grammar for the source language.
Although left recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes.
To alleviate some of this difficulty, a common organization for a parser in a compiler is to use a predictive parser for control constructs and operator precedence for expressions.
Bottom-Up Parsing

Bottom-up parsing attempts to construct a parse tree for an input string beginning at the leaves and working towards the root.
The process reduces a string to the start symbol of the grammar. At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production.
Consider the grammar:
S → aABe
A → Abc | b
B → d
The string w = abbcde reduces as follows:
abbcde
aAbcde
aAde
aABe
S
These reductions trace out a rightmost derivation in reverse.
Bottom-Up Parsing: Shift-Reduce parser

Consider the following grammar: E ==> E+E | E*E | (E) | id, and the input string id1+id2*id3.

Right-sentential form   Handle   Reducing production
id1+id2*id3             id1      E → id
E+id2*id3               id2      E → id
E+E*id3                 id3      E → id
E+E*E                   E*E      E → E*E
E+E                     E+E      E → E+E
E

Reductions made by a shift-reduce parser
Bottom-Up Parsing

Stack implementation of shift-reduce parsing:
A stack is used to hold grammar symbols, and an input buffer holds the string w to be parsed. Initially, the stack is empty and the whole string is in the input:
Stack   Input
$       w$
The parser operates by shifting zero or more input symbols onto the stack until a handle β is on top of the stack. The parser then reduces β to the left side of the appropriate production. It repeats this cycle until the stack contains the start symbol and the input is empty:
Stack   Input
$S      $
A handle of a string is a substring that matches the right side of a production, and whose reduction to the non-terminal on the left side of the production represents one step of the reduction process.
Bottom-Up Parsing:
Stack implementation of shift-reduce parsing:There are four possible actions a shift-reduce parser can make:
Shift: the next symbol is shifted onto the top of the stack.
Reduce: the parser knows the right end of the handle is at the top of the stack. It must then locate the left end of the handle within the stack and decide with what nonterminal to replace the handle.
Accept: the parser announces successful completion of parsing
Error: the parser discovers that a syntax error has occurred and calls an error recovery routine.
Bottom-Up Parsing:
Stack Input Action
$ id1+id2*id3$ Shift
$id1 +id2*id3$ Reduce by E →id
$E +id2*id3$ Shift
$E+ id2*id3$ Shift
$E+id2 *id3$ Reduce by E →id
$E+E *id3$ Shift
$E+E* id3$ Shift
$E+E*id3 $ Reduce by E →id
$E+E*E $ Reduce by E →E*E
$E+E $ Reduce by E →E+E
$E $ Accept
Bottom-Up Parsing: Operator-Precedence parsing

Operator grammar: a grammar with the property that no production right side is ε or has two adjacent non-terminals.
Consider the following grammar for expressions:
E → EAE | (E) | -E | id
A → + | - | * | / | ↑
It is not an operator grammar, because of the adjacent non-terminals in EAE. If we substitute for A each of its alternatives, we obtain the operator grammar
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
Bottom-Up Parsing: Operator-Precedence parsing

We define three disjoint precedence relations, <·, =· and ·>, between certain pairs of terminals:

Relation   Meaning
a <· b     a "yields precedence to" b
a =· b     a "has the same precedence as" b
a ·> b     a "takes precedence over" b

      id    +     *     $
id          ·>    ·>    ·>
+     <·    ·>    <·    ·>
*     <·    ·>    ·>    ·>
$     <·    <·    <·
Bottom-Up Parsing: Operator-Precedence parsing

Consider the string id+id*id. The handle can be found by the following process:
Scan the string from the left until the first ·> is encountered.
Then scan backwards until a <· is encountered.
The handle contains everything to the right of that <· and to the left of the first ·>.
With the relations inserted, id+id*id becomes
$ <· id ·> + <· id ·> * <· id ·> $
so each id is a handle and is reduced to E, giving $E+E*E$. Stripping the non-terminals and inserting relations between the remaining terminals gives
$ <· + <· * ·> $
so E*E is the next handle and is reduced, leaving $E+E$, i.e. $ <· + ·> $; finally E+E is reduced to E.
Bottom-Up Parsing: Operator-Precedence parsing
Algorithm:
Set ip to point to the first symbol of w$
Repeat forever:
    If $ is on top of the stack and ip points to $ then
        return
    Else begin
        Let a be the topmost terminal symbol on the stack
        and let b be the symbol pointed to by ip
        If a <. b or a =. b then begin
            Push b onto the stack
            Advance ip to the next input symbol
        end
        Else if a .> b then
            Repeat
                Pop the stack
            Until the top stack terminal is related by <. to the terminal most recently popped
        Else error()
    end
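The driver above can be sketched in code. This is a minimal, illustrative implementation (the names `REL`, `topmost_terminal` and `op_prec_parse` are mine) using the id/+/*/$ relation table from the earlier slide; popped handles are replaced by a single nonterminal E.

```python
# Sketch of the operator-precedence driver, using the id/+/*/$ table.
REL = {
    ('id', '+'): '>', ('id', '*'): '>', ('id', '$'): '>',
    ('+', 'id'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '$'): '>',
    ('*', 'id'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '$'): '>',
    ('$', 'id'): '<', ('$', '+'): '<', ('$', '*'): '<',
}

def topmost_terminal(stack):
    return next(s for s in reversed(stack) if s != 'E')

def op_prec_parse(tokens):
    stack, buf, i, reductions = ['$'], tokens + ['$'], 0, []
    while True:
        a, b = topmost_terminal(stack), buf[i]
        if a == '$' and b == '$':
            return reductions                     # accept
        rel = REL.get((a, b))
        if rel in ('<', '='):
            stack.append(b)                       # shift
            i += 1
        elif rel == '>':
            handle = []                           # pop the handle
            while True:
                sym = stack.pop()
                handle.append(sym)
                if sym != 'E' and REL.get((topmost_terminal(stack), sym)) == '<':
                    break
            if stack and stack[-1] == 'E':        # leading operand of the handle
                handle.append(stack.pop())
            stack.append('E')
            reductions.append('E -> ' + ''.join(reversed(handle)))
        else:
            raise SyntaxError('no relation between %r and %r' % (a, b))
```

Running it on id+id*id yields the same reduction sequence as the worked example: three E → id reductions, then E → E*E, then E → E+E.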
Bottom-Up Parsing: Operator-Precedence parsing
1. If operator θ1 has higher precedence than operator θ2, make θ1 .> θ2 and θ2 <. θ1. E.g., if * has higher precedence than +, make * .> + and + <. *.
2. If θ1 and θ2 are operators of equal precedence, then make θ1 .> θ2 and θ2 .> θ1 if the operators are left-associative, or make θ1 <. θ2 and θ2 <. θ1 if the operators are right-associative. E.g., if + and - are left-associative, then make + .> +, + .> -, - .> - and - .> +.
3. Make θ <. id, id .> θ, θ <. (, ( <. θ, ) .> θ, θ .> ), θ .> $ and $ <. θ for all operators θ. Also let:
( =. )    $ <. (    $ <. id    ( <. (    id .> $    ) .> $    ( <. id    id .> )    ) .> )
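Rules 1 and 2 above can be expressed directly as a small helper. This is a sketch of my own (names `PREC_LEVEL`, `RIGHT_ASSOC`, `relation` are not from the text); `'^'` stands in for the ↑ operator, and the parenthesis cases of rule 3 are omitted for brevity.

```python
# Sketch of rules 1-3: deriving <. and .> from precedence levels and
# associativity ('^' stands in for ↑; parentheses omitted for brevity).
PREC_LEVEL = {'+': 1, '-': 1, '*': 2, '/': 2, '^': 3}
RIGHT_ASSOC = {'^'}

def relation(a, b):
    ops = PREC_LEVEL
    if a in ops and b in ops:
        if ops[a] > ops[b]:
            return '>'                        # a .> b (rule 1)
        if ops[a] < ops[b]:
            return '<'                        # a <. b (rule 1)
        # equal precedence: direction depends on associativity (rule 2)
        return '<' if a in RIGHT_ASSOC else '>'
    if a in ops and b == 'id':  return '<'    # θ <. id
    if a == 'id' and b in ops:  return '>'    # id .> θ
    if a in ops and b == '$':   return '>'    # θ .> $
    if a == '$' and b in ops:   return '<'    # $ <. θ
    if a == '$' and b == 'id':  return '<'    # $ <. id
    if a == 'id' and b == '$':  return '>'    # id .> $
    return None                               # no relation defined
```

For example, `relation('*', '+')` is `'>'` and `relation('^', '^')` is `'<'`, since ↑ is right-associative.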
Bottom-Up Parsing: Operator-Precedence parsing
Consider the foll grammar:
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
Assume:
↑ is of highest precedence and right-associative,
* and / are of next highest precedence and left-associative, and
+ and - are of lowest precedence and left-associative.

      +    -    *    /    ↑    id   (    )    $
+     .>   .>   <.   <.   <.   <.   <.   .>   .>
-     .>   .>   <.   <.   <.   <.   <.   .>   .>
*     .>   .>   .>   .>   <.   <.   <.   .>   .>
/     .>   .>   .>   .>   <.   <.   <.   .>   .>
↑     .>   .>   .>   .>   <.   <.   <.   .>   .>
id    .>   .>   .>   .>   .>             .>   .>
(     <.   <.   <.   <.   <.   <.   <.   =.
)     .>   .>   .>   .>   .>             .>   .>
$     <.   <.   <.   <.   <.   <.   <.
Bottom-Up Parsing: LR Parsers
It is used to parse a large class of context-free grammars.
LR(k) parsing: L is for left-to-right scanning of the input, R for constructing a rightmost derivation in reverse, and k for the number of lookahead input symbols.
Characteristics:
LR parsers can be constructed to recognize all programming language constructs.
It is a general nonbacktracking shift-reduce parsing method.
It can detect a syntactic error as soon as it is possible to do so.
Drawback:
It is a lot of work to construct an LR parser by hand; hence a specialized tool, an LR parser generator, is required.
Bottom-Up Parsing: LR Parsers
There are 3 techniques for constructing an LR parsing table for a grammar:
Simple LR (SLR): the easiest to implement but the least powerful.
Canonical LR: the most powerful and the most expensive.
LookAhead LR (LALR): intermediate in power; will work on most programming language grammars.
Bottom-Up Parsing: LR Parsers
LR parsing algorithm:
It consists of an input, an output, a stack, a driver program and a parsing table that has two parts (action and goto).
The parsing program reads characters from an input buffer one at a time.
The program uses a stack to store a string of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top.
Each Xi is a grammar symbol and each si is a symbol called a state.
Each state symbol summarizes the information contained in the stack below it. The combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table and determine the shift-reduce parsing decision.
Bottom-Up Parsing: LR Parsers
The parsing table consists of 2 parts: a parsing action function action and a goto function goto.
The program behaves as follows:
It determines sm, the state currently on top of the stack, and ai, the current input symbol.
It consults action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of four values:
Shift s, where s is a state
Reduce by a grammar production A → β
Accept, and
Error
The function goto takes a state and a grammar symbol and produces a state.
Bottom-Up Parsing: LR Parsers
The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move. The parser shifts both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol.
If action[sm, ai] = reduce A → β, the parser executes a reduce move. The parser pops 2r symbols off the stack (where r is the length of β), then pushes both A and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.
If action[sm, ai] = accept, parsing is completed.
If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.
Bottom-Up Parsing: LR Parsing algorithm
Input: input string w and an LR parsing table with functions action and goto for G
Output: if w is in L(G), a bottom-up parse for w, otherwise an error indication
Method: initially, the parser has s0 (initial state) on its stack and w$ in the input buffer.
Example:
(1) E → E+T   (2) E → T   (3) T → T*F   (4) T → F   (5) F → (E)   (6) F → id
Bottom-Up Parsing: LR Parsers
State         action                         goto
       id    +    *    (    )    $        E    T    F
  0    s5              s4                 1    2    3
  1          s6                  acc
  2          r2   s7        r2   r2
  3          r4   r4        r4   r4
  4    s5              s4                 8    2    3
  5          r6   r6        r6   r6
  6    s5              s4                      9    3
  7    s5              s4                           10
  8          s6             s11
  9          r1   s7        r1   r1
 10          r3   r3        r3   r3
 11          r5   r5        r5   r5

Parsing table
1. si means shift and stack state i
2. rj means reduce by the production numbered j
3. acc means accept
4. blank means error
Bottom-Up Parsing: LR Parsers
     STACK              INPUT        ACTION
(1)  0                  id*id+id$    Shift
(2)  0 id 5             *id+id$      Reduce by F → id
(3)  0 F 3              *id+id$      Reduce by T → F
(4)  0 T 2              *id+id$      Shift
(5)  0 T 2 * 7          id+id$       Shift
(6)  0 T 2 * 7 id 5     +id$         Reduce by F → id
(7)  0 T 2 * 7 F 10     +id$         Reduce by T → T*F
(8)  0 T 2              +id$         Reduce by E → T
(9)  0 E 1              +id$         Shift
(10) 0 E 1 + 6          id$          Shift
(11) 0 E 1 + 6 id 5     $            Reduce by F → id
(12) 0 E 1 + 6 F 3      $            Reduce by T → F
(13) 0 E 1 + 6 T 9      $            Reduce by E → E+T
(14) 0 E 1              $            Accept

Moves of LR parser on id*id+id
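The LR driver can be sketched directly from the table. In this illustrative implementation (the names `ACTION`, `GOTO`, `PRODS` and `lr_parse` are mine) only states are kept on the stack, since the grammar symbols are not needed to drive the parse; the output is the list of production numbers applied, i.e. the rightmost derivation in reverse.

```python
# Sketch of the LR parsing driver for the expression grammar, using
# the SLR table above.  PRODS maps production number -> (lhs, |rhs|).
PRODS = {1: ('E', 3), 2: ('E', 1), 3: ('T', 3),
         4: ('T', 1), 5: ('F', 3), 6: ('F', 1)}
ACTION = {
    (0, 'id'): ('s', 5), (0, '('): ('s', 4),
    (1, '+'): ('s', 6), (1, '$'): ('acc', None),
    (2, '+'): ('r', 2), (2, '*'): ('s', 7), (2, ')'): ('r', 2), (2, '$'): ('r', 2),
    (3, '+'): ('r', 4), (3, '*'): ('r', 4), (3, ')'): ('r', 4), (3, '$'): ('r', 4),
    (4, 'id'): ('s', 5), (4, '('): ('s', 4),
    (5, '+'): ('r', 6), (5, '*'): ('r', 6), (5, ')'): ('r', 6), (5, '$'): ('r', 6),
    (6, 'id'): ('s', 5), (6, '('): ('s', 4),
    (7, 'id'): ('s', 5), (7, '('): ('s', 4),
    (8, '+'): ('s', 6), (8, ')'): ('s', 11),
    (9, '+'): ('r', 1), (9, '*'): ('s', 7), (9, ')'): ('r', 1), (9, '$'): ('r', 1),
    (10, '+'): ('r', 3), (10, '*'): ('r', 3), (10, ')'): ('r', 3), (10, '$'): ('r', 3),
    (11, '+'): ('r', 5), (11, '*'): ('r', 5), (11, ')'): ('r', 5), (11, '$'): ('r', 5),
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (0, 'F'): 3, (4, 'E'): 8, (4, 'T'): 2,
        (4, 'F'): 3, (6, 'T'): 9, (6, 'F'): 3, (7, 'F'): 10}

def lr_parse(tokens):
    stack = [0]                       # state stack (symbols omitted)
    buf = tokens + ['$']
    i, output = 0, []
    while True:
        s, a = stack[-1], buf[i]
        act = ACTION.get((s, a))
        if act is None:
            raise SyntaxError('unexpected %r in state %d' % (a, s))
        kind, arg = act
        if kind == 's':
            stack.append(arg)         # shift: push the next state
            i += 1
        elif kind == 'r':
            lhs, rlen = PRODS[arg]    # reduce: pop |rhs| states, take goto
            del stack[-rlen:]
            stack.append(GOTO[(stack[-1], lhs)])
            output.append(arg)        # emit the production number
        else:
            return output             # accept
```

On id*id+id it emits productions 6, 4, 6, 3, 2, 6, 4, 1, matching the fourteen moves traced above.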
Code Optimization
It aims at improving the execution efficiency of a program. This is achieved in two ways:
Redundancies in a program are eliminated.
Computations in a program are rearranged or rewritten to make it execute efficiently.

Source Program → Front End → Intermediate Representation (IR) → Optimization Phase → IR → Back End → Target Program
Code Optimization techniques
Compile time evaluation:
Performing certain actions specified in the program during compilation itself, thereby reducing the execution time of the program.
When all operands in an operation are constants, the operation can be performed at compilation time. This is known as constant folding.
E.g., the assignment a = 3.14/2
can be replaced by a = 1.57, thereby eliminating the division operation.
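Constant folding over an expression tree can be sketched as a short recursive pass. This is a hypothetical helper (the tuple representation and the name `fold` are mine, not from the text).

```python
# Sketch of constant folding over expression trees.  An expression is
# either a leaf (a number or a variable name) or a tuple (op, l, r).
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def fold(expr):
    if not isinstance(expr, tuple):
        return expr                      # leaf: constant or variable
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # all operands constant: evaluate now
    return (op, left, right)             # otherwise keep the operation
```

`fold(('/', 3.14, 2))` evaluates the division at "compile time", giving 1.57, while subtrees involving variables are left intact.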
Code Optimization techniques
Elimination of common subexpressions:
Common subexpressions are occurrences of expressions yielding the same value.
We can avoid recomputing the expression if we can use the previously computed value.
Example:

Before:              After:
a = b+c              t = b+c
...                  a = t
x = b+c + 5.2        ...
                     x = t + 5.2
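A local version of this transformation can be sketched over straight-line three-address code. This is an illustrative helper of my own (the statement shape and the name `eliminate_cse` are not from the text); for simplicity it assumes no operand is redefined between the two occurrences of the expression.

```python
# Sketch: local common-subexpression elimination over straight-line
# three-address code.  Each statement is (dest, op, arg1, arg2).
# Assumes no operand is redefined between occurrences.
def eliminate_cse(stmts):
    available = {}                    # (op, arg1, arg2) -> name holding it
    out = []
    for dest, op, a1, a2 in stmts:
        key = (op, a1, a2)
        if key in available:
            # reuse the previously computed value instead of recomputing
            out.append((dest, 'copy', available[key], None))
        else:
            available[key] = dest
            out.append((dest, op, a1, a2))
    return out
```

Here the first destination plays the role of the temporary t: the second computation of b+c becomes a copy from a.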
Code Optimization techniques
Dead Code Elimination:
Code which can be omitted from a program without affecting its results is called dead code.
Dead code is detected by checking whether the value assigned in an assignment statement is used anywhere in the program.

Frequency Reduction:
Execution time of a program can be reduced by moving code from a part of a program which is executed very frequently to another part of the program which is executed fewer times.

Before:
for i=1 to 100 do
begin
  z = i;
  x = 25*a;
  y = x+z;
end

After:
x = 25*a;
for i=1 to 100 do
begin
  z = i;
  y = x+z;
end
Code Optimization techniques
Strength Reduction:
Replaces the occurrence of a time-consuming operation by an occurrence of a faster operation, e.g., replacement of a multiplication by an addition.

Before:
for i=1 to 10 do
begin
  ---
  k = i*5;
  ---
end

After:
itemp = 5;
for i=1 to 10 do
begin
  ---
  k = itemp;
  ---
  itemp = itemp+5;
end
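The two loops above can be rendered as executable code to confirm they compute the same values. This is a sketch in Python rather than the text's pseudo-language; the function names are mine.

```python
# Sketch of the strength-reduction example above.
# Before: k = i*5 recomputed with a multiplication on every iteration.
def before():
    ks = []
    for i in range(1, 11):
        ks.append(i * 5)          # multiplication inside the loop
    return ks

# After: the multiplication is replaced by a running addition.
def after():
    ks, itemp = [], 5
    for i in range(1, 11):
        ks.append(itemp)          # k = itemp
        itemp += 5                # itemp = itemp + 5
    return ks
```

Both versions produce the sequence 5, 10, …, 50; the second does so with only additions in the loop body.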
Code Optimization techniques
Local and global optimization:
Local optimization: optimizing transformations are applied over small segments of a program consisting of a few statements.
Global optimization: optimizing transformations are applied over a program unit, i.e., over a function or a procedure.
YACC
It is a parser generator. It stands for "yet another compiler-compiler".
Used to generate LALR parsers using the YACC parser generator provided on Unix.

Yacc specification (translate.y) → Yacc compiler → y.tab.c
y.tab.c → C compiler → a.out
Input → a.out → output
YACC
A YACC program has 3 parts:

declarations
%%
translation rules
%%
supporting C functions

Declarations part: there are two optional sections.
In the first section, we put ordinary C declarations delimited by %{ and %}. It also contains the declarations of grammar tokens.
E.g., the statement %token DIGIT declares DIGIT to be a token.
YACC
The translation rules part:
Enclosed between %% and %%.
Each rule consists of a grammar production and the associated semantic action. For e.g.,

<left side> : <alt 1> {semantic action 1}
            | <alt 2> {semantic action 2}
            ...
            | <alt n> {semantic action n}
            ;

In a YACC production, a quoted single character is taken to be a terminal symbol, and unquoted strings of letters and digits not declared to be tokens are taken to be nonterminals.
YACC
A YACC semantic action is a sequence of C statements.
The semantic action is performed whenever we reduce by the associated production.
E.g., for the 2 productions E → E+T | T:

expr : expr '+' term {$$ = $1 + $3;}
     | term
     ;

$$ refers to the attribute value associated with the nonterminal on the left.
$i refers to the value associated with the ith grammar symbol on the right.
Supporting C-routines part: a lexical analyzer by the name yylex() must be provided. Error recovery routines may be added.
Syntax Directed Translation
There are 2 notations for associating semantic rules with productions: syntax-directed definitions and translation schemes.
Conceptually, with both syntax-directed definitions and translation schemes, we parse the input token stream, build the parse tree, and then traverse the tree as needed to evaluate the semantic rules at the parse tree nodes:

input string → parse tree → dependency graph → evaluation order for semantic rules
Syntax Directed Definition
A syntax-directed definition is a generalization of a CFG in which each grammar symbol has an associated set of attributes, partitioned into 2 subsets called the synthesized and inherited attributes of that grammar symbol.
The value of an attribute at a parse tree node is defined by a semantic rule associated with the production used at that node.
The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node.
The value of an inherited attribute at a node is computed from the values of attributes at the siblings and parent of that node.
Syntax Directed Definition
Semantic rules set up dependencies between attributes that will be represented by a graph.
From the dependency graph, we derive an evaluation order for the semantic rules.
Evaluation of the semantic rules defines the values of the attributes at the nodes in parse tree for the input string.
A parse tree showing the values of attributes at each node is called an annotated parse tree.
The process of computing the attribute values at the nodes is called annotating or decorating the parse tree.
Syntax Directed Definition
Form of a Syntax Directed Definition
In a syntax directed definition, each grammar production A → α has associated with it a set of semantic rules of the form
b := f(c1, c2, …, ck), where f is a function, and either
1. b is a synthesized attribute of A and c1, c2, …, ck are attributes belonging to the grammar symbols of the production, or
2. b is an inherited attribute of one of the grammar symbols on the right side of the production, and c1, c2, …, ck are attributes belonging to the grammar symbols of the production.
In either case, we say that the attribute b depends on attributes c1, c2, …, ck. An attribute grammar is a syntax directed definition in which the functions in semantic rules cannot have side effects.
Syntax Directed Definition
A syntax directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition.
Example:
Production       Semantic rules
L → E n          print(E.val)
E → E1 + T       E.val = E1.val + T.val
E → T            E.val = T.val
T → T1 * F       T.val = T1.val * F.val
T → F            T.val = F.val
F → (E)          F.val = E.val
F → digit        F.val = digit.lexval
Syntax Directed Definition
Synthesized attributes: a syntax directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition.

Annotated parse tree for 3*5+4n:

                    L
                  /   \
           E.val=19    n
          /    |    \
    E.val=15   +   T.val=4
        |             |
    T.val=15       F.val=4
    /    |    \       |
T.val=3  *  F.val=5  digit.lexval=4
    |          |
F.val=3   digit.lexval=5
    |
digit.lexval=3
A parse tree is annotated by evaluating the semantic rules for the attributes at each node bottom up, from the leaves to the root.
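This bottom-up evaluation can be sketched as a short recursive function. It is an illustrative helper of my own (the tuple representation and the name `val` are not from the text), mirroring the semantic rules of the desk-calculator definition above.

```python
# Sketch: bottom-up evaluation of the synthesized attribute `val`.
# A (sub)tree is either an int leaf (F -> digit) or (op, left, right).
def val(node):
    if isinstance(node, int):
        return node                       # F.val = digit.lexval
    op, left, right = node
    if op == '+':
        return val(left) + val(right)     # E.val = E1.val + T.val
    if op == '*':
        return val(left) * val(right)     # T.val = T1.val * F.val
    raise ValueError('unknown operator: %r' % op)
```

For the tree of 3*5+4, `val(('+', ('*', 3, 5), 4))` gives 19, matching the annotated parse tree.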
Syntax Directed Definition
Inherited Attributes:
They are useful for expressing the dependence of a programming language construct on the context in which it appears.
For e.g., to keep track of whether an identifier appears on the left or right side of an assignment in order to decide whether the address or the value of the identifier is needed.
Production       Semantic rules
D → T L          L.in = T.type
T → int          T.type = integer
T → real         T.type = real
L → L1 , id      L1.in = L.in; addtype(id.entry, L.in)
L → id           addtype(id.entry, L.in)
Syntax Directed Definition
Inherited Attributes:
Parse tree for the sentence: real id1, id2, id3

                 D
               /   \
    T.type=real     L.in=real
         |         /    |    \
       real   L.in=real  ,   id3
              /   |   \
       L.in=real  ,   id2
            |
           id1
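The downward flow of L.in can be sketched in code. This is an illustrative helper of my own (the node shapes and the name `decl` are not from the text): T.type is computed first, then passed unchanged down the L chain, and addtype records each identifier's type in a symbol table.

```python
# Sketch: the inherited attribute L.in flowing through D -> T L.
# L is ('id', name) for L -> id, or ('list', L1, name) for L -> L1 , id.
def decl(t_type, l_node, symtab):
    if l_node[0] == 'id':
        symtab[l_node[1]] = t_type       # addtype(id.entry, L.in)
    else:
        _, l1, name = l_node
        decl(t_type, l1, symtab)         # L1.in = L.in
        symtab[name] = t_type            # addtype(id.entry, L.in)
    return symtab
```

For `real id1, id2, id3`, all three identifiers end up recorded with type real, exactly as in the parse tree above.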
Syntax trees
An (abstract) syntax tree is a condensed form of a parse tree useful for representing language constructs.
In the syntax tree, operands and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree.
Example: syntax tree for 3*5+4

    +
   / \
  *   4
 / \
3   5