Chapter 2 A Simple One-Pass Compiler (orca.st.usm.edu/~seyfarth/csc415/chapter02.pdf)
A Simple One-Pass Compiler
● Language syntax: context-free grammar
– Backus-Naur Form (BNF)
● Language semantics: informal descriptions
● Grammar used in syntax-directed translation
● Infix expressions will be converted to postfix
– 8+4*3-2 becomes 843*+2-
– postfix is easy to evaluate
● Later, programming constructs will be added
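To illustrate why postfix is easy to evaluate, here is a minimal sketch of a stack-based evaluator. The function name eval_postfix, the fixed stack size and the restriction to single-digit operands are assumptions made for this example; the slides do not give this code.

```c
#include <ctype.h>

/* Evaluate a postfix string of single-digit operands and + - * / operators.
   Assumes well-formed input with at most 64 pending operands. */
int eval_postfix(const char *s)
{
    int stack[64];
    int top = 0;                        /* number of items on the stack */
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';    /* operand: push its value */
        } else if (*s == '+' || *s == '-' || *s == '*' || *s == '/') {
            int right = stack[--top];   /* operator: pop two operands */
            int left  = stack[--top];
            switch (*s) {
            case '+': stack[top++] = left + right; break;
            case '-': stack[top++] = left - right; break;
            case '*': stack[top++] = left * right; break;
            case '/': stack[top++] = left / right; break;
            }
        }
    }
    return stack[top - 1];              /* the result is the only item left */
}
```

Calling eval_postfix("843*+2-") pushes 8, 4 and 3, multiplies to get 12, adds to get 20, then subtracts 2, giving 18, the same value as 8+4*3-2.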
Initial Lexical Analysis
● We start with expressions formed from single digit numbers and arithmetic operators
● So, the lexical analyzer can simply read a character and return it. (ignoring white space)
● Later we will extend the lexical analysis to deal with multi-digit numbers, identifiers and keywords.
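A character-at-a-time lexical analyzer of this kind might look like the sketch below. Reading from a string through a cursor (rather than from standard input) and the name next_token are assumptions chosen to keep the example self-contained.

```c
#include <ctype.h>

/* Return the next non-blank character of the input and advance the cursor.
   Returns 0 at the end of the string. */
int next_token(const char **cursor)
{
    while (isspace((unsigned char)**cursor))
        (*cursor)++;                    /* ignore white space */
    if (**cursor == '\0')
        return 0;                       /* end of input */
    return (unsigned char)*(*cursor)++; /* the character itself is the token */
}
```

With the cursor on " 8 + 4", successive calls return '8', '+', '4' and then 0.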
Context Free Grammar
● Defines the hierarchical structure of a programming language
● stmt -> while ( expr ) stmt
● Means a statement can be the keyword while, followed by an expression in parentheses and a statement after the parentheses
● In this case we assume that expr is also defined by the grammar.
Grammar
● A context-free grammar consists of 4 parts:
● A set of tokens (aka terminals)
● A set of nonterminals (more or less, variables)
● A set of productions where each production is
– a nonterminal on the left of the arrow for the production
– a sequence of terminals and nonterminals to the right of the arrow
● A designated start nonterminal
Example Grammar 1
● list -> list + digit
● list -> list - digit
● list -> digit
● digit -> 0|1|2|3|4|5|6|7|8|9
● The last line is shorthand for 10 productions:
– digit -> 0   digit -> 1   ...
● Terminals are 0-9 and + and -
● Nonterminals are list and digit
Example Grammar 1 (2)
● You use a grammar to derive a string by starting with the start symbol and repeatedly replacing a nonterminal with a RHS for it.
– list => list + digit => list + 8 => list - digit + 8 => digit - digit + 8 => 4 - digit + 8 => 4 - 3 + 8
● This is referred to as a derivation
Parse Trees
● The root is labeled with the start nonterminal
● Leaves are labeled by terminals or ε
● Interior nodes are labeled by nonterminals
● The children of a node are labeled by the RHS of a production for the nonterminal
Parse Trees (3)
● The leaves of a parse tree (left to right) form the “yield” of the tree.
● This is a string or sentence generated by the grammar.
● This string is derived from the start symbol.
Ambiguity
● A grammar is ambiguous if some string can be generated by 2 or more parse trees.
● The grammar below is ambiguous
● list -> list + list
● list -> list - list
● list -> 0|1|2|...|9
Associativity
● We prefer left associativity for + and -
– 6-1-1 == (6-1)-1
● In C, assignment is right associative
– a=b=c
– list -> var = list | var
– var -> a | b | c
● Parse trees for left associative grammars tend to expand on the left.
Left vs Right Associativity
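[Figure: two parse trees. The left-associative tree for 6-1-1 (nodes list and digit) grows down to the left; the right-associative tree for a=b=c (nodes list and var) grows down to the right.]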
Operator Precedence
● If we use all 4 basic math operations, we need operator precedence
– 8-4*2 == 8-(4*2)
● We need more nonterminals and productions
● An expression is a sum/difference of terms
● A term is a product/quotient of factors
● A factor is a digit or an expression in parentheses
Grammar with Precedence
● expr -> expr + term | expr - term | term
● term -> term * factor | term / factor | factor
● factor -> digit | ( expr )
● digit -> 0|1|2|...|9
Why is this grammar ambiguous?
● stmt -> id = expr
| if expr then stmt
| if expr then stmt else stmt
| while expr do stmt
| { stmt_list }
● stmt_list -> stmt_list ; stmt | ε
Why is the grammar ambiguous?
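● if 1 then (if 2 then a = 2 else a = 1): the else is paired with the inner if
● if 1 then (if 2 then a = 2) else a = 1: the else is paired with the outer if
● The parentheses only show the grouping; both readings have the same token sequence.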
● Which if “owns” the else?
Syntax-Directed Translation
● A compiler must keep track of a variety of values for program entities
– The starting address for an else clause
– The type of an expression
– The size of an array
● We refer to these as attributes and associate them with terminals and nonterminals.
● A syntax-directed definition adds attribute rules (semantic rules) to productions.
Postfix Notation
● If E is a variable or constant, PF(E) = E
● If op is a binary operator, PF(E1 op E2) = PF(E1) PF(E2) op
● If E is of the form ( E1 ), then PF(E) = PF(E1)
● Postfix uses no parentheses
● PF(8-1-1) = 81-1-
● PF(2+3*4-5*2) = 234*+52*-
● PF((2+3)*(4-5)*2) = 23+45-*2*
Synthesized Attributes
● Synthesized attributes at an internal node of a parse tree are determined from attributes of its children. (bottom up)
● The alternative is “inherited” attributes.
● Attributes are specified using a “dot” notation like members of a struct or class.
– expr.t
Syntax-Directed Definition
● expr -> expr1 + term    expr.t = expr1.t term.t '+'
● expr -> expr1 - term    expr.t = expr1.t term.t '-'
● expr -> term            expr.t = term.t
● term -> 0               term.t = '0'
● term -> 1               term.t = '1'
● ...
Depth-First Parse Tree Traversal
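void visit ( node *n )
{
    node *m;
    /* visit the children of n from left to right;
       next_sibling is assumed to return the next child or NULL */
    for ( m = first_child(n); m; m = next_sibling(m) ) {
        visit ( m );
    }
    /* determine attributes of n */
}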
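The traversal can be made concrete for the postfix attribute t. The sketch below is one possible realization, not the slides' code: the struct node layout (a label, a child count and up to three children) and the 32-byte attribute buffer are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

struct node {
    char label;               /* terminal character, or 0 for a nonterminal */
    int nchildren;            /* 0 for a leaf */
    struct node *child[3];
    char t[32];               /* synthesized attribute: postfix translation */
};

void visit(struct node *n)
{
    for (int i = 0; i < n->nchildren; i++)
        visit(n->child[i]);                 /* children first, left to right */

    /* determine attributes of n from those of its children */
    if (n->nchildren == 0) {
        n->t[0] = n->label;                 /* terminal: t is the symbol itself */
        n->t[1] = '\0';
    } else if (n->nchildren == 1) {
        strcpy(n->t, n->child[0]->t);       /* expr -> term, term -> digit */
    } else {
        /* expr -> expr1 op term : expr.t = expr1.t term.t op */
        snprintf(n->t, sizeof n->t, "%s%s%c",
                 n->child[0]->t, n->child[2]->t, n->child[1]->label);
    }
}
```

For the parse tree of 8-4, the root has the children expr, '-' and term, and after visit(&root) the attribute root.t holds "84-".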
Translation Scheme
● A context-free grammar with programming language statements embedded in RHSs.
● The programming statements are called semantic actions.
● Similar to a syntax-directed definition, but the order of execution/evaluation is explicit
● This format is used by yacc and bison. (also more or less the same in lex and flex)
Example Translation Scheme
● expr -> expr + term { print('+'); }
● expr -> expr - term { print('-'); }
● expr -> term
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Augmented Parse Tree
[Figure: parse tree (nodes expr and term) over the digits 8, 4 and 3, augmented with the print actions {print('8');} {print('4');} {print('3');} and the two operator print actions as extra leaves.]
Translation Scheme
● The execution of the print statements could be done via a depth-first traversal.
● Alternatively, if parsing occurs in the same pattern (and order), the tree could be skipped.
● Note that the semantic actions can be more general purpose actions: symbol table actions, error messages, line counting, ...
Parsing
● A parser converts a string of tokens into a parse tree. (perhaps the tree is not explicit)
● Only certain grammars yield efficient parsers.
– Arbitrary grammars might take O(n^3) time
– Programming language grammars take O(n) time
● Top-down parser: parse tree constructed starting with the root
– Can be easily hand-generated
● Bottom-up: construction starts at the leaves
– Handles a larger class of grammars
Top-Down Parsing
● Start with start symbol as the root of the tree
● Repeat the steps below
– Find a node, n, labeled with a nonterminal, A
– Select a production for A and construct children of n for the RHS symbols of the production
● For nice grammars the parsing will proceed from left to right through the input string.
● The challenge is selecting a proper production.
● We will consider the current token from the input to help select a production. (lookahead token)
Example Grammar
● type -> simple
| id
| array [ simple ] of type
● simple -> integer
| char
| num .. num
Picking a Production
● The 3 productions for type start with id, array or whatever starts simple.
● We can examine the first symbol and determine which of the 3 productions to use:
– id: type -> id
– array: type -> array [ simple ] of type
– others: type -> simple
● Likewise the 3 productions for simple can be selected by inspecting the lookahead symbol.
Parsing Using Lookahead
● Start with parse tree with start symbol at root
● Lookahead symbol == array
● Expand the tree by applying third production
[Figure: parse tree consisting of the root type; input: array [ 1 .. 10 ] of integer]
Parsing Using Lookahead
● Match and consume the token array
● Advance to left bracket and match it
● Advance to 1 and expand simple
[Figure: root type with children array [ simple ] of type; input: array [ 1 .. 10 ] of integer]
Parsing Using Lookahead
● With lookahead 1 (a num), the correct production is selected and added to the tree
● We can finish the production for simple
[Figure: root type with children array [ simple ] of type, and simple expanded to num .. num; input: array [ 1 .. 10 ] of integer]
Parsing Using Lookahead
● We advance past ] and of to reach integer
● Now we can select the proper production to apply based on the lookahead
[Figure: root type with children array [ simple ] of type, simple expanded to num .. num, and the lookahead at integer; input: array [ 1 .. 10 ] of integer]
Predictive Parsing
● Recursive descent parsing is done using a function for each nonterminal.
● Predictive parsing is a type of recursive descent where the lookahead symbol is used to “predict” the correct production to apply.
● Parse tree is implicitly defined by the pattern of recursive function calls
void match ( int token )
{
    if ( look == token ) look = next_token();
    else error();
}
Functions for Predictive Parser
● The match function verifies that the current token is what we expect.
● It advances to the next token if it is correct.
void type()
{
    switch ( look ) {
    case INTEGER:                   // These 3 start the
    case CHAR:                      // 3 productions for
    case NUM:                       // simple
        simple();
        break;
    case ID:                        // type -> id
        match(ID);
        break;
    case ARRAY:                     // type -> array ...
        match(ARRAY); match('[');   // match 2 tokens
        simple();                   // expand simple
        match(']'); match(OF);      // match 2 tokens
        type();                     // expand type
        break;
    default:
        error();
    }
}

void simple()
{
    switch ( look ) {
    case INTEGER:
        match(INTEGER);
        break;
    case CHAR:
        match(CHAR);
        break;
    case NUM:
        match(NUM);                 // Not quite as simple
        match(DOTDOT);              // as the first 2 simple
        match(NUM);                 // productions
        break;
    default:
        error();
    }
}
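The functions above rely on look, next_token and error, which the slides leave undefined. The sketch below supplies hypothetical versions so the parser can be run on a token array; the token values, the DONE terminator, the ok flag and the parse_type wrapper are all assumptions for this example.

```c
#include <stdbool.h>

/* Token codes; the names follow the slides, the values are assumed. */
enum { DONE = 0, INTEGER = 256, CHAR, NUM, ID, ARRAY, OF, DOTDOT };

static const int *input;    /* remaining tokens, terminated by DONE */
static int look;            /* lookahead token */
static bool ok;             /* cleared on the first syntax error */

static int next_token(void) { return *input == DONE ? DONE : *input++; }
static void error(void) { ok = false; }

static void match(int token)
{
    if (look == token) look = next_token();
    else error();
}

static void simple(void)
{
    switch (look) {
    case INTEGER: match(INTEGER); break;
    case CHAR:    match(CHAR);    break;
    case NUM:     match(NUM); match(DOTDOT); match(NUM); break;
    default:      error();
    }
}

static void type(void)
{
    switch (look) {
    case INTEGER: case CHAR: case NUM:      /* type -> simple */
        simple();
        break;
    case ID:                                /* type -> id */
        match(ID);
        break;
    case ARRAY:                             /* type -> array [ simple ] of type */
        match(ARRAY); match('['); simple(); match(']'); match(OF); type();
        break;
    default:
        error();
    }
}

/* Returns true if the whole token sequence derives from type. */
bool parse_type(const int *tokens)
{
    input = tokens;
    ok = true;
    look = next_token();
    type();
    return ok && look == DONE;
}
```

For the running example, the token sequence ARRAY '[' NUM DOTDOT NUM ']' OF INTEGER DONE is accepted, while dropping DOTDOT NUM makes match(DOTDOT) fail and parse_type returns false. Recording the error in a flag instead of aborting is a design choice made only to keep the sketch testable.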
First Sets
● Prediction of the proper production requires knowing which tokens can be first in strings generated from a particular production.
● We define First sets for RHS of productions.
● Let A -> α be a production
● If α = ε or α can generate ε, then ε is in First(α).
● First(α) also includes all terminals which can be the first terminal in a string derived from α.
First Sets (2)
● First(simple) = { integer, char, num }
● First(id) = { id }
● First(array [ simple ] of type) = { array }
● We can choose one production over another if their First sets are disjoint.
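The disjointness test can be sketched with First sets stored as bit masks, one bit per terminal. The bit assignment and the names disjoint and predictable are assumptions introduced for this illustration.

```c
#include <stdbool.h>

/* One bit per terminal; the assignment is arbitrary. */
enum {
    T_INTEGER = 1 << 0, T_CHAR = 1 << 1, T_NUM = 1 << 2,
    T_ID = 1 << 3, T_ARRAY = 1 << 4
};

/* First sets of the three right-hand sides of type. */
enum {
    FIRST_SIMPLE = T_INTEGER | T_CHAR | T_NUM,
    FIRST_ID     = T_ID,
    FIRST_ARRAY  = T_ARRAY
};

/* Two sets are disjoint when they share no bit. */
bool disjoint(unsigned a, unsigned b) { return (a & b) == 0; }

/* True if the lookahead alone can select among the n alternatives,
   that is, if their First sets are pairwise disjoint. */
bool predictable(const unsigned first[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (!disjoint(first[i], first[j]))
                return false;
    return true;
}
```

The three sets for type are pairwise disjoint, so predictable reports true for them; two alternatives that can both start with num would make it report false.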
Using ε-Productions
● opt_stmts -> stmt_list | ε
● In the code for opt_stmts, if the lookahead symbol is not in First(stmt_list) we can use opt_stmts -> ε
● It may be that the lookahead symbol is legal after opt_stmts or not.
● If the lookahead symbol is illegal, it will result in an error elsewhere.
Predictive Parser
● Write a function for each nonterminal
● Select which production to use for a nonterminal by inspecting the lookahead symbol to determine which First set it is in.
● If First sets for competing productions are not disjoint, this plan won't work.
● Implement code for a production by calling functions for nonterminals of the RHS and matching terminals of the RHS.
Predictive Syntax-Directed Translator
● Extend the code for the predictive parser.
● Copy the actions from the translation scheme into the parser in the same position as in the translation scheme.
● The action will happen at the intended time.
● The code for the parser/translator could be automated using a tool which reads the translation scheme and writes C++ code.
Left Recursion
● Left recursion could cause infinite looping in a recursive-descent parser.
● expr -> expr + term
● The problem is that expr is the first on the RHS.
● Applying that production would not change the lookahead symbol and would allow it to be selected again.
● Of course the alternative expr -> term would have a conflicting First set...
Eliminating Left Recursion
● expr -> expr + term | term
● Compare this to
● expr -> term rest
● rest -> + term rest | ε
● Now we have right recursion and recursive-descent works.
● But we generate parse trees which are better for right associative operators.
A Translator for Simple Expressions
● We are extending the translator to include the 4 basic math operations, multi-digit numbers and identifiers.
● There will be a symbol table which will hold minimal information.
● The translator will accept a list of expressions with each expression terminated with a semicolon.
● We start with a left-recursive grammar which we convert to right-recursive.
Abstract and Concrete Syntax
● A parse tree can be called a concrete syntax tree.
● By contrast an abstract syntax tree leaves out grammar symbols, showing only operators and operands.
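[Figure: abstract syntax tree with the operators + and - as interior nodes and the operands 8, 4 and 2 as leaves]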
Simple Infix-to-postfix Specification
● expr -> expr + term { print('+'); }
● expr -> expr - term { print('-'); }
● expr -> term
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Problem with the Specification
● The specification is left-recursive
● We need to convert to right-recursion
● expr -> term rest
● rest -> + expr | - expr | ε
● term -> 0 | 1 | ... | 9
● If we use this grammar, we get the same language, but we must be careful about the actions.
Problem with the Specification (2)
● Consider 2 choices for actions for -
● rest -> - expr { rest.t = '-' expr.t; }
● rest -> - expr { rest.t = expr.t '-'; }
● The first pattern translates 8-4 into 8-4.
● The second translates 8-4 into 84-, but it also translates 8-4+2 into 842+- (the correct postfix is 84-2+)
● We need help.
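The effect of the second pattern can be checked with a small recursive-descent sketch in which the operator is emitted only after the whole expr to its right has been translated. The function names and the string-cursor interface are assumptions; the same pattern is applied to + as well as -. Tracing it by hand on 8-4+2 gives 842+-, which evaluates as 8-(4+2) instead of (8-4)+2.

```c
#include <ctype.h>

static const char *in;   /* input cursor */
static char *out;        /* output cursor */

static void expr(void);

/* term -> 0 | 1 | ... | 9 : copy the digit to the output */
static void term(void)
{
    if (isdigit((unsigned char)*in))
        *out++ = *in++;
}

/* rest -> + expr { emit '+' } | - expr { emit '-' } | epsilon */
static void rest(void)
{
    if (*in == '+' || *in == '-') {
        char op = *in++;
        expr();          /* everything to the right is translated first */
        *out++ = op;     /* the operator is emitted only afterwards */
    }
}

static void expr(void) { term(); rest(); }

/* postfix must have room for strlen(infix) + 1 characters. */
void translate_naive(const char *infix, char *postfix)
{
    in = infix;
    out = postfix;
    expr();
    *out = '\0';
}
```

The sketch assumes well-formed input made of single digits, + and -.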
Eliminating Left-Recursion in a Translation Scheme
● The solution is to “drag” the actions around during the conversion, treating each as 1 grammar symbol.
● In general we convert A -> A α | A β | γ into
● A -> γ R
● R -> α R | β R | ε
● Actions can be part of the α, β and γ
Repaired Grammar
● expr -> term rest
● rest -> + term { print('+'); } rest
● rest -> - term { print('-'); } rest
● rest -> ε
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Translation of 8-4+2
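[Figure: parse tree for 8-4+2 under the repaired grammar (nodes expr, term and rest, ending in ε), with the actions as leaves; in left-to-right order they are print('8') print('4') print('-') print('2') print('+')]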
void expr()
{
    term(); rest();
}

void rest()
{
    switch ( lookahead ) {
    case '+':
        match('+'); term(); print('+'); rest();
        break;
    case '-':
        match('-'); term(); print('-'); rest();   // print '-', not '+'
        break;
    }
}

void term()
{
    if ( isdigit(lookahead) ) {
        print(lookahead); match(lookahead);
    } else error();
}
Eliminating Tail Recursion
● tail recursion: a recursive call made just before returning from a recursive function; might as well use a loop
void rest()
{
    while ( 1 ) {
        switch ( lookahead ) {
        case '+':
            match('+'); term(); print('+');
            break;
        case '-':
            match('-'); term(); print('-');   // print '-', not '+'
            break;
        default:
            return;
        }
    }
}
Merging Code
● expr is called once and transfers control to rest
– might as well merge
● The + and - cases are almost identical
– can merge using lookahead variable
Streamlined expr code
void expr()
{
    int t;                      // saved operator
    term();
    while ( 1 ) {
        switch ( lookahead ) {
        case '+': case '-':
            t = lookahead;
            match(t); term(); print(t);
            break;
        default:
            return;
        }
    }
}
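The streamlined expr can be exercised with the hypothetical scaffolding below. Reading from a string, writing to a caller-supplied buffer, the failed flag and the wrapper infix_to_postfix are assumptions added so the sketch is self-contained; the slides read from standard input and print directly.

```c
#include <ctype.h>
#include <stdbool.h>

static const char *src;     /* input cursor */
static char *dst;           /* output cursor */
static int lookahead;
static bool failed;

static int next_char(void) { return *src ? (unsigned char)*src++ : 0; }
static void error(void) { failed = true; }
static void print(int c) { *dst++ = (char)c; }

static void match(int t)
{
    if (lookahead == t) lookahead = next_char();
    else error();
}

static void term(void)
{
    if (isdigit(lookahead)) { print(lookahead); match(lookahead); }
    else error();
}

static void expr(void)
{
    int t;
    term();
    while (1) {
        switch (lookahead) {
        case '+': case '-':
            t = lookahead;              /* remember the operator */
            match(t); term(); print(t); /* operand first, then operator */
            break;
        default:
            return;
        }
    }
}

/* Translate single-digit infix to postfix; false on a syntax error.
   postfix must have room for strlen(infix) + 1 characters. */
bool infix_to_postfix(const char *infix, char *postfix)
{
    src = infix;
    dst = postfix;
    failed = false;
    lookahead = next_char();
    expr();
    *dst = '\0';
    return !failed && lookahead == 0;
}
```

On the input 8-4+2 this produces 84-2+, the left-associative translation; an input such as 8- is reported as an error because term finds no digit after the operator.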
Lexical Analysis
● White space and comment removal
– Easy in the scanner
– Difficult in the parser
● Constant = sequence of digits
– scanner passes num to the parser
– the value of the num is an attribute
● 25 + 15 - 12
● <num,25> <+,> <num,15> <-,> <num,12>
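A scanner that produces this token stream might be sketched as follows. The name lexan, the global tokenval, the string cursor and the numeric values chosen for NUM and DONE are assumptions for illustration.

```c
#include <ctype.h>

#define NUM  256    /* token code for numbers (value assumed) */
#define DONE 260    /* end-of-input token (value assumed) */

int tokenval;       /* attribute of the most recent NUM token */

/* Return the next token from the string at *cursor and advance past it. */
int lexan(const char **cursor)
{
    const char *p = *cursor;
    while (isspace((unsigned char)*p))
        p++;                                    /* strip white space */
    if (*p == '\0') {
        *cursor = p;
        return DONE;
    }
    if (isdigit((unsigned char)*p)) {
        tokenval = 0;
        while (isdigit((unsigned char)*p))      /* accumulate the value */
            tokenval = tokenval * 10 + (*p++ - '0');
        *cursor = p;
        return NUM;
    }
    *cursor = p + 1;
    return (unsigned char)*p;                   /* single-character token */
}
```

On "25 + 15 - 12" successive calls return NUM with tokenval 25, then '+', NUM with 15, '-', NUM with 12, and finally DONE.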
Identifiers
● Identifier: letters and digits starting with a letter
● Might also be a keyword, but not now.
● Easy to code by starting a while loop when the next character is a letter and continuing until the next character is neither a letter nor a digit.
● Then we need to return the character to the input stream to be read by another section of code.
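The loop described above can be sketched over a string cursor. With a cursor, "returning the character to the input" simply means not advancing past it; with stdio input one would typically call ungetc instead. The function name and the buffer interface are assumptions for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy the identifier at *cursor into buf (at most size-1 characters) and
   leave *cursor on the first character that is not part of it.
   Returns the number of characters stored, or 0 if no identifier starts here. */
size_t scan_identifier(const char **cursor, char *buf, size_t size)
{
    const char *p = *cursor;
    size_t n = 0;
    if (isalpha((unsigned char)*p)) {           /* must start with a letter */
        while (isalnum((unsigned char)*p)) {    /* then letters and digits */
            if (n + 1 < size)
                buf[n++] = *p;
            p++;
        }
    }
    if (size > 0)
        buf[n] = '\0';
    *cursor = p;        /* the terminating character stays unread */
    return n;
}
```

Scanning "count1+2" stores count1 and leaves the cursor on the +, ready for the next section of the scanner.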
Interfacing to the Lexical Analyzer
● Lexical analyzer reads characters from stdin
● Parser gets tokens/attributes from the lexical analyzer
● The simplest arrangement is for the lexical analyzer to have a function to call to get the next token.
Symbol Table
● Generally a symbol table supports insertion of identifiers with attributes.
● A symbol table also allows searching for an identifier by name.
● Efficiency usually dictates using some form of hash table or tree. (STL map: red-black tree)
● For Chapter 2 the symbol table is an array of tuples of char pointers and ints (struct entry).
Symbol Table (2)
● A symbol table is a good place to handle keywords like “div” and “mod”.
● The translator inserts “div” with the #defined constant DIV (and “mod” with MOD) in the table.
● DIV and MOD are ints greater than 255 to avoid confusion with single char tokens.
● The lexical analyzer uses lookup to search for a string. If in the table, it uses the token type from the table. Otherwise it inserts it as an ID.
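A minimal sketch of this arrangement, using an array of struct entry with a linear search, is shown below. The table size, the token values, the convention that index 0 means "not found", and the helper classify are assumptions for illustration.

```c
#include <string.h>

#define DIV 257         /* token values are assumed; all exceed 255 */
#define MOD 258
#define ID  259
#define SYMMAX 100

struct entry {
    const char *lexptr; /* the lexeme */
    int token;          /* DIV, MOD or ID */
};

static struct entry symtable[SYMMAX];
static int lastentry = 0;           /* last used slot; slot 0 stays unused */

/* Return the index of s in the table, or 0 if it is absent. */
int lookup(const char *s)
{
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Add s with the given token; returns its index, or 0 if the table is full. */
int insert(const char *s, int tok)
{
    if (lastentry + 1 >= SYMMAX)
        return 0;
    lastentry++;
    symtable[lastentry].lexptr = s;  /* stores the pointer, does not copy */
    symtable[lastentry].token = tok;
    return lastentry;
}

/* Preload the keywords so they are found before being treated as IDs. */
void init(void)
{
    insert("div", DIV);
    insert("mod", MOD);
}

/* What the scanner does with a scanned word: reuse its entry or add an ID. */
int classify(const char *word)
{
    int p = lookup(word);
    if (p == 0)
        p = insert(word, ID);
    return p == 0 ? ID : symtable[p].token;
}
```

After init(), classify("div") yields DIV, while an unknown word such as count is inserted and classified as ID. Because insert keeps only the pointer, the caller must keep the string alive.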
Abstract Stack Machine
● An abstract stack machine is a possible form of intermediate code for a compiler.
● An ASM has data memory, instruction memory, a data stack and a CPU.
● The CPU has instructions to move data from data memory to the stack and vice versa.
● It also has instructions to perform operations on the top items of the stack.
● Lastly the CPU has flow-control instructions.
ASM Arithmetic Instructions
● Using an ASM is like interpreting postfix
● PF(2+3*4) = 234*+
● ASM instructions would be
– push 2
– push 3
– push 4
– multiply
– add
● There would be a full collection of operators for ints and doubles.
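The arithmetic part of such a machine can be sketched as a small interpreter loop. The opcode names, the struct instr layout and the fixed stack size are assumptions; only integer operations are shown.

```c
enum opcode { PUSH, ADD, SUBTRACT, MULTIPLY, DIVIDE };

struct instr {
    enum opcode op;
    int arg;            /* used by PUSH only */
};

/* Run n instructions and return the value left on top of the stack.
   Assumes a well-formed program whose stack depth stays below 64. */
int run(const struct instr *code, int n)
{
    int stack[64];
    int top = 0;
    for (int pc = 0; pc < n; pc++) {
        int left, right;
        if (code[pc].op == PUSH) {
            stack[top++] = code[pc].arg;
            continue;
        }
        right = stack[--top];           /* operate on the top two items */
        left  = stack[--top];
        switch (code[pc].op) {
        case ADD:      stack[top++] = left + right; break;
        case SUBTRACT: stack[top++] = left - right; break;
        case MULTIPLY: stack[top++] = left * right; break;
        case DIVIDE:   stack[top++] = left / right; break;
        default:       break;
        }
    }
    return stack[top - 1];
}
```

Running the five instructions push 2, push 3, push 4, multiply, add leaves 14 on the stack, matching 2+3*4.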
L-values and R-values
● An identifier is used in 2 common ways in a programming language
– On the left side of an assignment (l-value)
– As part of an expression (r-value)
● When used for the target of an assignment the computer needs the address of the variable.
● When used in an expression the computer needs the value of the variable.
● An ASM needs rvalue and lvalue instructions.
lvalue and rvalue
● To push a variable's value onto the stack
– rvalue a // a's address is used to get a
● To push a variable's address onto the stack
– lvalue a // a's address is pushed
● To compute c=a+b
– lvalue c
– rvalue a
– rvalue b
– add
– store // := in the book
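The difference between the two instructions can be seen in a small interpreter sketch. Here a variable's address is simply an index into a data array; the opcode names, the array sizes and the function name run_with_memory are assumptions for illustration.

```c
enum opcode { PUSH, RVALUE, LVALUE, ADD, STORE };

struct instr {
    enum opcode op;
    int arg;        /* constant for PUSH, data address for RVALUE and LVALUE */
};

int data[16];       /* data memory; a variable is identified by its index */

/* Execute n instructions; assumes a well-formed program. */
void run_with_memory(const struct instr *code, int n)
{
    int stack[64];
    int top = 0;
    for (int pc = 0; pc < n; pc++) {
        int a = code[pc].arg;
        switch (code[pc].op) {
        case PUSH:   stack[top++] = a;       break;
        case RVALUE: stack[top++] = data[a]; break;   /* push the contents */
        case LVALUE: stack[top++] = a;       break;   /* push the address */
        case ADD:
            top--;
            stack[top - 1] += stack[top];
            break;
        case STORE:     /* value on top, target address just below it */
            data[stack[top - 2]] = stack[top - 1];
            top -= 2;
            break;
        }
    }
}
```

With a, b and c at addresses 0, 1 and 2, the sequence lvalue 2, rvalue 0, rvalue 1, add, store leaves a+b in data[2].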
ASM Control Flow
● label x
– Set label named x
● goto x
– Branch to the label named x
● gofalse x
– Goto x if the top of the stack is 0 (also pops it)
● gotrue x
– Goto x if the top of the stack is not 0 (also pops)
● halt
ASM Code for if Statement
Source:
if expr then stmt
Target:
code for expr
gofalse out
code for stmt
label out
Translation scheme:
stmt -> if expr { out = newlabel(); emit('gofalse',out); } then stmt { emit('label',out); }
void stmt ()
{
    int out;                            // label for the end of the if
    if ( lookahead == ID ) {
        emit("lvalue",tokenval);
        match(ID);
        match('=');
        expr();
    } else if ( lookahead == IF ) {
        match(IF);
        expr();
        out = newlabel();
        emit("gofalse",out);
        match(THEN);
        stmt();
        emit("label",out);
    } else error();
}
Infix to Postfix Translator Specification
start -> list EOF
list -> expr ; list
      | ε
expr -> expr + term { print('+') }
      | expr - term { print('-') }
      | term
term -> term * factor { print('*') }
      | term / factor { print('/') }
      | term DIV factor { print('DIV') }
      | term MOD factor { print('MOD') }
      | factor
factor -> ( expr )
        | id { print(id.lexeme) }
        | num { print(num.value) }
Translation Scheme with no Left Recursion
start -> list EOF
list -> expr ; list
      | ε
expr -> term moreterms
moreterms -> + term { print('+') } moreterms
           | - term { print('-') } moreterms
           | ε
term -> factor morefactors
morefactors -> * factor { print('*') } morefactors
             | / factor { print('/') } morefactors
             | DIV factor { print('DIV') } morefactors
             | MOD factor { print('MOD') } morefactors
             | ε
factor -> ( expr )
        | id { print(id.lexeme) }
        | num { print(num.value) }
Tokens
● Tokens are identified by an integer and some of them have an integer attribute value.
● Many tokens like '+' are simply themselves
● NUM, DIV, MOD, ID, and DONE are #defined as numbers starting with 256 to be distinct.
● The integer attribute for NUM is the sequence of digits converted to an integer.
● The integer attribute for ID is the index into the symbol table for that ID.