Chapter 2 A Simple One-Pass Compilerorca.st.usm.edu/~seyfarth/csc415/chapter02.pdf · A Simple...

74
Chapter 2 A Simple One-Pass Compiler

Transcript of Chapter 2 A Simple One-Pass Compilerorca.st.usm.edu/~seyfarth/csc415/chapter02.pdf · A Simple...

Chapter 2A Simple One-Pass Compiler

A Simple One-Pass Compiler

● Language syntax: context-free grammarBackus-Naur Form – BNF

● Language semantics: informal descriptions● Grammar used in syntax-directed translation● Infix expressions will be converted to postfix

– 8+4*3-2 becomes 843*+2-– postfix is easy to evaluate

● Later, programming constructs will be added

Initial Lexical Analysis

● We start with expressions formed from single digit numbers and arithmetic operators

● So, the lexical analyzer can simply read a character and return it. (ignoring white space)

● Later we will extend the lexical analysis to deal with multi-digit numbers, identifiers and keywords.

Context Free Grammar

● Defines the hierarchical structure of a programming language

● stmt ­> while ( expr ) stmt● Means a statement can be the keyword while,

followed by an expression in parentheses and a statement after the parentheses

● In this case we assume that expr is also defined by the grammer.

Grammar

● A context-free grammar consists of 4 parts:● A set of tokens (aka terminals)● A set of nonterminals (mol variables)● A set of productions where each production is

– a nonterminal on the left of the arrow for the production

– a sequence of terminals and nonterminals to the right of the arrow

● A designated start nonterminal

Example Grammar 1

● list -> list + digit● list -> list – digit● list -> digit● digit -> 0|1|2|3|4|5|6|7|8|9● The last line is shorthand for 10 productions:

– digit -> 0 digit -> 1 ...● Terminals are 0-9 and + and -● Nonterminals are list and digit

Example Grammar 1 (2)

● You use a grammar to derive a string by starting with the start symbol and repeatedly replacing a nonterminal with a RHS for it.list => list + digit     => list + 8     => list – digit + 8     => digit – digit + 8     => 4 – digit + 8     => 4 – 3 + 8

● This is referred to as a derivation

Parse Trees

● The root is labeled with the start nonterminal● Leaves are labeled by terminals or ε● Interior nodes are labeled by nonterminals● The children of a node are labeled by the RHS

of a production for the nonterminal

Parse Tree Example

list

+

– 8

4

3

list

list

digit

digit

digit

Parse Trees (3)

● The leaves of a parse tree (left to right) form the “ yield” o f the tree.

● This is a string or sentence generated by the grammar.

● This string is derived from the start symbol.

Ambiguity

● A grammar is ambiguous if some string can be generated by 2 or more parse trees.

● The grammar below is ambiguous● list ­> list + list

● list ­> list – list

● list ­> 0|1|2|...|9

Two Parse Trees for 6-1-1

list

6

6 11

1

1

­

­

­

­ list

list

listlist

listlist

list

listlist

Associativity

● We prefer left associativity for + and -6­1­1 == (6­1)­1

● In C, assignment is right associative– a=b=c

– list ­> var = list | var

– var ­> a | b | c

● Parse trees for left associative grammars tend to expand on the left.

Left vs Right Associativity

list

a

6 c

b

1

1

=

=

­

­ var

list

digit

digitlist

list

var

varlist

digit

list

Operator Precedence

● If we use all 4 basic math operations, we need operator precedence8­4*2 == 8­(4*2)

● We need more nonterminals and productions● An expression is a sum/difference of terms● A term is a product/quotient of factors● A factor is a digit or an expression in

parentheses

Grammar with Precedence

● expr ­> expr + term     |  expr ­ term     |  term

● term ­> term * factor     |  term / factor     |  factor

● factor ­> digit       |  ( expr )

● digit ­> 0|1|2|...|9

Why is this grammar ambiguous?

● stmt ­> id = expr      | if expr then stmt      | if expr then stmt else stmt      | while expr do stmt      | { stmt_list }

● stmt_list ­> stmt_list ; stmt           | ε

Why is the grammar ambiguous?

● if 1 then   if 2 then a = 2 else a = 1

● if 1 then   if 2 then a = 2else a = 1

● Which if “owns” the else?

Syntax-Directed Translation

● A compiler must keep track of a variety of values for program entities– The starting address for an else clause– The type of an expression– The size of an array

● We refer to these as attributes and associate them with terminals and nonterminals.

● A syntax-directed definition adds attribute rules (semantic rules) to productions.

Postfix Notation

● If E is a variable or constant, PF(E) = E● If op is a binary operator, PF(E1 op E2) =

PF(E1) PF(E2) op● If E is of the form ( E1 ), then PF(E) = E1● Postfix uses no parentheses● PF(8-1-1) = 81-1-

PF(2+3*4-5*2) = 234*+52*-PF((2+3)*(4-5)*2) = 23+45-*2*

Synthesized Attributes

● Synthesized attributes at an internal node of a parse tree are determined from attributes of its children. (bottom up)

● The alternative is “i nherited” a ttributes.● Attributes are specified using a “d ot” n otation

like members of a struct or class.expr.t

Syntax-Directed Definition● expr ­> expr1 + term  expr.t = expr1.t term.t +

● expr ­> expr1 ­ term  expr.t = expr1.t term.t ­

● expr ­> term          expr.t = term.t

● term ­> 0             term.t = '0'

● term ­> 1             term.t = '1'

● ...

Attribute Synthesis

expr.t

expr.t

expr.t

term.t

term.t

term.t

8

=3 4

­

+

=4

=8

= 83+

=83+4­

3

=8

Depth-First Parse Tree Traversal

void visit ( node *n ){for ( m = first_child(n); m; m++ ) {visit ( m );

}determine attributes of n;

}

Translation Scheme

● A context-free grammar with programming language statements embedded in RHSs.

● The programming statements are called semantic actions.

● Similar to a syntax-directed definition, but the order of execution/evaluation is explicit

● This format is used by yacc and bison. (also more or less the same in lex and flex)

Example Translation Scheme

● expr ­> expr + term  { print('+');}● expr ­> expr ­ term  { print('­');}

● expr ­> term

● term ­> 0            { print('0');}

● term ­> 1            { print('1');}

● ...

● term ­> 9            { print('9');}

Augmented Parse Tree

expr

expr

term

term

term

8

4

­

+

3

expr

{print('8');}

{print('+');}

{print('­');}

{print('4');}

{print('3');}

Translation Scheme

● The execution of the print statements could be done via a depth-first traversal.

● Alternatively, if parsing occurs in the same pattern (and order), the tree could be skipped.

● Note that the semantic actions can be more general purpose actions: symbol table actions, error messages, line counting, ...

Parsing

● A parser converts a string of tokens into a parse tree. (perhaps the tree is not explicit)

● Only certain grammars yield efficient parsers.– Arbitrary grammars might take O(n3) time– Programming language grammars take O(n) time

● Top-down parser: parse tree constructed starting with the root– Can be easily hand-generated

● Bottom-up: construction starts at the leaves– Handles a larger class of grammars

Top-Down Parsing● Start with start symbol as the root of the tree● Repeat the steps below

– Find a node, n, labeled with a nonterminal, A– Select a production for A and construct children of

n for the RHS symbols of the production● For nice grammars the parsing will proceed

from left to right through the input string.● The challenge is selecting a proper production.● We will consider the current token from the

input to help select a production. (lookahead token)

Example Grammar

● type ­> simple

●       | id

●       | array [ simple ] of type

● simple ­> integer

●         | char

●         | num .. num

Picking a Production

● The 3 productions for type start with id, array or whatever starts simple.

● We can examine the first symbol and determine which of the 3 productions to use:– id:  type ­> id

– array: type ­> array [ simple ] of type

– others: type ­> simple

● Likewise the 3 productions for simple can be selected by inspecting the lookahead symbol.

Parsing Using Lookahead

● Start with parse tree with start symbol at root● Lookahead symbol == array● Expand the tree by applying third production

type

array [ 1 .. 10 ] of integer

Parsing Using Lookahead

● Match and consume the token array● Advance to left bracket and match it

● Advance to 1 and expand simple

type

array [ 1 .. 10 ] of integer

array [ simple ] of type

Parsing Using Lookahead

● With lookahead 1 (a num), the correct production is selected and added to the tree

● We can finish the production for simple

type

array [ 1 .. 10 ] of integer

array [ simple ] of type

num .. num

Parsing Using Lookahead

● We advance past ] and of to reach integer● Now we can select the proper production to

apply based on the lookahead

type

array [ 1 .. 10 ] of integer

array [ simple ] of type

num .. num

Predictive Parsing

● Recursive descent parsing is done using a function for each nonterminal.

● Predictive parsing is a type of recursive descent where the lookahead symbol is used to “ predict” the correct production to apply.

● Parse tree is implicitly defined by the pattern of recursive function calls

void match ( int token ){

if ( look == token ) look = next_token();else error();

Functions for Predictive Parser

● The match function verifies that the current token is what we expect.

● It advances to the next token if it correct.

void type(){

switch ( look ) {case INTEGER:                // These 3 start thecase CHAR:                   // 3 productions forcase NUM:                    // simple

simple();                 break;

case ID:                     // type ­> idmatch(ID);break;

case ARRAY:                  // type ­> array ...match(ARRAY); match('['); // match 2 tokenssimple();                 // expand simplematch(']'); match(OF);    // match 2 tokenstype();                   // expand typebreak;

default:error();

}}

void simple(){

switch ( look ) {case INTEGER:

match(INTEGER);break;

case CHAR:match(CHAR);break;

case NUM:match(NUM);     // Not quite as simplematch(DOTDOT);  // as the first 2 simplematch(NUM);     // productionsbreak;

default:error();

}}

First Sets

● Prediction of the proper production requires knowing which tokens can be first in strings generated from a particular production.

● We define First sets for RHS of productions.● Let A ­> α be a production● If α = or ε α can generate , then is in ε ε

First(α).● First(α) also includes all terminals which can be

the first terminal is a string derived from .α

First Sets (2)

● First(simple) = { integer, char, num }● First(id) = { id }● First(array [ simple ] of type) = { array }

● We can choose one production over another if their First sets are disjoint.

Using -Productionsε

● opt_stmts -> stmt_list | ε● In the code for opt_stmts, if the lookahead

symbol is not in First(opt_stmts) we can use opt_stmts -> ε

● It may be than the lookahead symbol is legal after opt_stmts or not.

● If the lookahead symbol is illegal, it will result in an error elsewhere.

Predictive Parser

● Write a function for each nonterminal● Select which production to use for a

nonterminal by inspecting the lookahead symbol to determine which First set it is in.

● If First sets for competing productions are not disjoint, this plan won't work.

● Implement code for a production by calling functions for nonterminals of the RHS and matching terminals of the RHS.

Predictive Syntax-Directed Translator

● Extend the code for the predictive parser.● Copy the actions from the translation scheme

into the parser in the same position as in the translation scheme.

● The action will happen at the intended time.● The code for the parser/translator could be

automated using a tool which reads the translation scheme and writes C++ code.

Left Recursion

● Left recursion could cause infinite looping in a recursive-descent parser.

● expr -> expr + term● The problem is that expr is the first on the RHS.● Applying that production would not change the

lookahead symbol and would allow it to be selected again.

● Of course the alternative expr -> term would have a conflicting First set...

Eliminating Left Recursion

● expr -> expr + term | term● Compare this to● expr -> term rest● rest -> + term rest | ε● Now we have right recursion and recursive-

descent works.● But we generate parse trees which are better

for right associative operators.

A Translator for Simple Expressions

● We are extending the translator to include the 4 basic math operations, multi-digit numbers and identifiers.

● There will be a symbol table which will hold minimal information.

● The translator will accept a list of expressions with each expression terminated with a semicolon.

● We start with a left-recursive grammar which we convert to right-recursive.

Abstract and Concrete Syntax

● A parse tree can be called a concrete syntax tree.

● By contrast an abstract syntax tree leaves out grammar symbols, showing only operators and operands.

+

-

4

2

8

Simple Infix-to-postfix Specification● expr -> expr + term { print('+'); }

● expr -> expr – term { print('-'); }

● expr -> term● term -> 0 { print('0'); }

● term -> 1 { print('1'); }

● ...● term -> 9 { print('9'); }

Problem with the Specification

● The specification is left-recursive● We need to convert to right-recursion● expr -> term rest● rest -> + expr | - expr | ε● term -> 0 | 1 | ... | 9● If we use this grammar, we get the same

language, but we must be careful about the actions.

Problem with the Specification (2)

● Consider 2 choices for actions for -● rest -> - expr { rest.t = '-' expr.t; }

● rest -> - expr { rest.t = expr.t '-'; }

● The first pattern translates 8-4 into 8-4.● The second translates 8-4 into 84-, but it also

translates 8-4+2 into 842-+● We need help.

Eliminating Left-Recursion in a Translation Scheme

● The solution is to “ drag” the actions around during the conversion, treating each as 1 grammar symbol.

● In general we convert A -> A | A | α β γinto

● A -> Rγ● R -> R | R | α β ε● Actions can be part of the , and α β γ

Repaired Grammar

● expr -> term rest● rest -> + term { print('+'); } rest● rest -> - term { print('-'); } rest● rest -> ε● term -> 0 { print('0'); }● term -> 1 { print('1'); }● ...● term -> 9 { print('9'); }

Translation of 8-4+2

expr

rest

print('8') print('-')

print('+')print('4')

print('2') ε

rest

restterm

term

term

8

4

2

-

+

void expr(){ term(); rest();}void rest(){ switch ( lookahead ) { case '+': match('+'); term(); print('+'); rest(); break; case '-': match('-'); term(); print('+'); rest(); break; }}void term(){ if ( isdigit(lookahead) ) { print(lookahead); match(lookahead); } else error();}

Eliminating Tail Recursion● tail recursion: recursive call just be returning

from a recursive funtion – might as well use a loopvoid rest()

{ while ( 1 ) { switch ( lookahead ) { case '+': match('+'); term(); print('+'); break; case '-': match('-'); term(); print('+'); break; default: return; } }}

Merging Code

● expr is called once and transfers control to restmight as well merge

● The + and – cases are almost identicalcan merge using lookahead variable

Streamlined expr code

void expr(){ term(); while ( 1 ) { switch ( lookahead ) { case '+': case '-': t = lookahead; match(t); term(); print(t); break; default: return; } }}

Lexical Analysis

● White space and comment removal– Easy in the scanner– Difficult in the parser

● Constant = sequence of digits– scanner passes num to the parser– the value of the num is an attribute

● 25 + 15 – 12● <num,25> <+,> <num,15> <-,> <num,12>

Identifiers

● Identifier: letters and digits starting with a letter● Might also be a keyword, but not now.● Easy to code by starting a while loop when the

next character is a letter and continuing until the next character is not a letter nor a digit.

● Then we need to return the character to the input stream to be read by another section of code.

Interfacing to the Lexical Analyzer

● Lexical analyzer reads characters from stdin● Parser gets tokens/attributes from the lexical

analyzer● The simplest arrangement is for the lexical

analyzer to have a function to call to get the next token.

Symbol Table

● Generally a symbol table supports insertion of identifiers with attributes.

● A symbol table also allows searching for an identifier by name.

● Efficiency usually dictates using some form of hash table or tree. (STL map: red-black tree)

● For Chapter 2 the symbol table is an array of tuples of char pointers and ints (struct entry).

Symbol Table (2)

● A symbol table is a good place to handle keywords like “ div” and “ mod” .

● The translator inserts “d iv” with the #defined constant DIV (and “ mod” with MOD) in the table.

● DIV and MOD are ints greater than 255 to avoid confusion with single char tokens.

● The lexical analyzer uses lookup to search for a string. If in the table, it uses the token type from the table. Otherwise it inserts it as an ID.

Abstract Stack Machine

● An abstract stack machine is a possible form of intermediate code for a compiler.

● An ASM has data memory, instruction memory, a data stack and a CPU.

● The CPU has instructions to move data from data memory to the stack and vice versa.

● It also has instructions to perform operations on the top items of the stack.

● Lastly the CPU has flow-control instructions.

ASM Arithmetic Instructions

● Using an ASM is like interpreting postfix● PF(2+3*4) = 234*+● ASM instructions would bepush 2push 3push 4multiplyadd

● There would be a full collection of operators for ints and doubles.

L-values and R-values

● An identifier is used in 2 common ways in a programming language– On the left side of an assignment (l-value)– As part of an expression (r-value)

● When used for the target of an assignment the computer needs the address of the variable.

● When used in an expression the computer needs to value of the variable.

● An ASM needs rvalue and lvalue instructions.

lvalue and rvalue● To push a variable's value onto the stack

– rvalue a // a's address is used to get a

● To push a variable's address onto the stack– lvalue a // a's address is pushed

● To compute c=a+b– lvalue c– rvalue a– rvalue b– add– store // := in the book

ASM Control Flow

● label xSet label named x

● goto xBranch to the label named x

● gofalse xGoto x if the top of the stack is 0 (also pops it)

● gotrue x● Goto x is the top of the stack is not 0 (also

pops)● halt

ASM Code for if Statement

Source:if expr then stmt

Target:code for exprgofalse outcode for stmtlabel out

Translation scheme:stmt -> if expr { out = newlabel(); emit('gofalse',out); } then stmt { emit('label',out); }

void stmt (){

if ( lookahead == ID ) {emit('lvalue',tokenval);match(ID);match('=');expr();

} else if ( lookahead == IF ) {match(IF);expr();out = newlabel();emit('gofalse',out);match(THEN);stmt();emit('label',out);

} else error();}

Infix to Postfix Translator Specification

start -> list EOFlist -> expr ; list | εexpr -> expr + term { print('+') } | expr – term { print('-') } | termterm -> term * factor { print('*') } | term / factor { print('/') } | term DIV factor { print('DIV') } | term MOD factor { print('MOD') } | factorfactor -> ( expr ) | id { print(id.lexeme) } | num { print(num.value) }

Translation Scheme with no Left Recursion

start -> list EOFlist -> expr ; list | εexpr -> term moretermsmoreterms -> + term { print('+') } moreterms | – term { print('-') } moreterms | εterm -> factor morefactorsmorefactors -> * factor { print('*') } morefactors | / factor { print('/') } morefactors | DIV factor { print('DIV') } morefactors | MOD factor { print('MOD') } morefactors | εfactor -> ( expr ) | id { print(id.lexeme) } | num { print(num.value) }

Tokens

● Tokens are identified by an integer and some of them have an integer attribute value.

● Many tokens like '+' are simply themselves● NUM, DIV, MOD, ID, and DONE are #defined

as numbers starting with 256 to be distinct.● The integer attribute for NUM is the sequence

of digits converted to an integer.● The integer attribute for ID is the index into the

symbol table for that ID.