CSE P501 – Compiler Construction

CSE P501 – Compiler Construction Scanner Regex Automata Hand-Written Scanner Grammars & BNF Next Spring 2014 Jim Hogg - UW - CSE P501 B-1


Page 1: CSE P501 –  Compiler Construction

CSE P501 – Compiler Construction

Scanner

Regex

Automata

Hand-Written Scanner

Grammars & BNF

Next

Spring 2014 Jim Hogg - UW - CSE P501 B-1

Page 2: CSE P501 –  Compiler Construction

[Figure: compiler pipeline, Source to Target]
  Front End:    Scan (chars to tokens), Parse (tokens to AST), Semantics (AST to IR)
  ‘Middle End’: Optimize (IR to IR)
  Back End:     Select Instructions, Allocate Registers, Emit (IR between stages, Machine Code out)
  Highlighted stage: Scanner

AST = Abstract Syntax Tree
IR = Intermediate Representation

Page 3: CSE P501 –  Compiler Construction

Automatic or Hand-Written?

Use a scanner-generator - JFlex
  regex define tokens (.jflex)  ->  JFlex  ->  Scanner (.java)

OR

Write a scanner, in Java, by hand
  Easy and enlightening
  Will see an outline of how, later

Page 4: CSE P501 –  Compiler Construction

Reminder: a token is . . .

class C {
  public int fac(int n) {   // factorial
    int nn;
    if (n < 1)
      nn = 1;
    else
      nn = n * (this.fac(n-1));
    return nn;
  }
}

class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}

Key for Char Stream:

  ◊ newline (\n)
  ∙ space

CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE

Page 5: CSE P501 –  Compiler Construction

A Token in your Java scanner

class Token {
  public int kind;        // eg: LPAREN, ID, ILIT
  public int line;        // for debugging/diagnostics
  public int column;      // for debugging/diagnostics
  public String lexeme;   // eg: "x", "Total", "(", "42"
  public int value;       // attribute of ILIT
}

Obviously this Token is wasteful of memory:
• lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc
• value is only required for ILIT

But, there's only 1 token alive at any instant during parsing, so no point refining into 3 leaner variants!

Page 6: CSE P501 –  Compiler Construction

Typical Tokens

Operators & Punctuation
  Single chars: + - * = / ( ] ; :
  Double chars: :: <= == !=

Keywords
  if while for goto return switch void …

Identifiers
  A single ID token kind, parameterized by lexeme

Integer constants
  A single ILIT token kind, parameterized by int value

See jflex-1.5.0\examples\java\java.flex for real example

Page 7: CSE P501 –  Compiler Construction

Token Spotting

if(a<=3)++grades[1];       // what are the tokens? (no spaces)

public int fac(int n) {    // what are the tokens? (need spaces?)

Counter-example: fixed-format FORTRAN:

DO 50 I = 1,99     // DO loop
DO 50 I = 1.2      // assignment: DO50I = 1.2

Page 8: CSE P501 –  Compiler Construction

Principle of Longest Match

Scanner should pick the longest possible string to make up the next token (“greedy” algorithm)

Example

return idx <= iffy;

should be scanned into 5 tokens:

RETURN ID:idx LEQ ID:iffy SEMI

• <= is one token, not two
• iffy is an ID, not IF followed by ID:fy
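The longest-match rule can be sketched in a few lines of Java. This is only an illustration: the class name LongestMatchDemo, the tiny keyword set, and the token spellings (RETURN, LEQ, LESS, SEMI, BAD) are assumptions made for this example, not part of the course scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class LongestMatchDemo {
    // Hypothetical keyword set, just large enough for the example
    private static final Set<String> KEYWORDS = Set.of("return", "if");

    // Scan 'src' into token names, always taking the longest lexeme available
    static List<String> scan(String src) {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < src.length()) {
            char c = src.charAt(p);
            if (Character.isWhitespace(c)) {
                p++;
            } else if (Character.isLetter(c) || c == '_') {
                int begin = p;
                // keep consuming: "iffy" must not stop after "if"
                while (p < src.length()
                        && (Character.isLetterOrDigit(src.charAt(p)) || src.charAt(p) == '_')) {
                    p++;
                }
                String lex = src.substring(begin, p);
                out.add(KEYWORDS.contains(lex) ? lex.toUpperCase(Locale.ROOT) : "ID:" + lex);
            } else if (c == '<') {
                // prefer the two-char operator "<=" over the one-char "<"
                if (p + 1 < src.length() && src.charAt(p + 1) == '=') {
                    out.add("LEQ");
                    p += 2;
                } else {
                    out.add("LESS");
                    p++;
                }
            } else if (c == ';') {
                out.add("SEMI");
                p++;
            } else {
                out.add("BAD:" + c);
                p++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("return idx <= iffy;"));
        // prints [RETURN, ID:idx, LEQ, ID:iffy, SEMI]
    }
}
```

The keyword check happens only after the whole identifier-like lexeme has been consumed, which is what keeps iffy from being split into IF and ID:fy.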

Page 9: CSE P501 –  Compiler Construction

Regex

The lexical structure (token syntax) of most programming languages can be specified using Regular Expressions
  “REs” in Cooper&Torczon
  “regex” is more common

Tokens can be recognized by a deterministic finite automaton (DFA)
  DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex

Page 10: CSE P501 –  Compiler Construction

Regex Cheat Sheet

Pattern   Matches?
a         a
a*        zero or more a’s
a+        one or more a’s
a?        zero or one a
a|b       a or b
ab        a followed by b

Precedence: * (highest), concatenation, | (lowest)

Parentheses can be used to group regexes as needed

Notice meta-characters, in red

Escaped characters: \* \+ \? \| \. \t \n

Pattern   Matches?
[c-f]     one of c or d or e or f
[^0-3]    any one character except 0-3
.         any character, except newline

Page 11: CSE P501 –  Compiler Construction

Regex Examples

regex                      Meaning?
[abc]+
[abc]*                     (Kleene closure)
[0-9]+
[1-9][0-9]*
[a-zA-Z_][a-zA-Z0-9_]*
(0|1)* 0
(a|b)*aa(a|b)*

Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/
Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/
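One way to experiment with the example patterns is Java's own java.util.regex package. The sketch below is illustrative only: the class name RegexExamples and the sample strings are assumptions, and (0|1)* 0 is written without the space, as (0|1)*0. Java's regex engine is a backtracking matcher rather than a generated DFA, but for these patterns it accepts the same strings.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // True if the WHOLE string s is in the language of the regex
    static boolean inLanguage(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // each row: regex, candidate string
        String[][] cases = {
            {"[abc]+", "cab"},                    // one or more of a, b, c
            {"[abc]+", ""},                       // empty string is rejected by +
            {"[abc]*", ""},                       // but accepted by *
            {"[0-9]+", "007"},                    // leading zeros allowed
            {"[1-9][0-9]*", "007"},               // leading zero not allowed
            {"[1-9][0-9]*", "42"},
            {"[a-zA-Z_][a-zA-Z0-9_]*", "_tmp1"},  // identifier
            {"[a-zA-Z_][a-zA-Z0-9_]*", "1abc"},   // cannot start with a digit
            {"(0|1)*0", "1010"},                  // binary string ending in 0
            {"(0|1)*0", "101"},
            {"(a|b)*aa(a|b)*", "baab"},           // contains "aa"
            {"(a|b)*aa(a|b)*", "abab"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + "  \"" + c[1] + "\"  " + inLanguage(c[0], c[1]));
        }
    }
}
```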

Page 12: CSE P501 –  Compiler Construction

regex

Defined over some alphabet Σ
  For programming languages, alphabet is ASCII or Unicode

If re is a regular expression, L(re) is the language (set of strings) generated by re

Page 13: CSE P501 –  Compiler Construction

regex macros

Possible syntax for numeric constants

Digit  = [0-9]
Digits = Digit+
Number = Digits ( . Digits )? ( [eE] (+ | -)? Digits )?

How would you describe this set in English?

What are some examples of legal constants (strings) generated by Number?

Tools like JFlex accept these convenient macros
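The macro idea can be imitated in plain Java by building the pattern from named string constants. This is a sketch under assumptions: the class name NumberMacro is invented here, and the `.` and `+` inside Number are treated as literal characters, so they are escaped or placed in a character class in Java's regex syntax.

```java
import java.util.regex.Pattern;

class NumberMacro {
    // Each constant plays the role of one macro from the slide
    static final String DIGIT = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    // Digits ( . Digits )? ( [eE] (+ | -)? Digits )?
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?" + "([eE][+-]?" + DIGITS + ")?";

    private static final Pattern NUMBER_PATTERN = Pattern.compile(NUMBER);

    // True if the whole string is a legal constant according to Number
    static boolean isNumber(String s) {
        return NUMBER_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "6.02e23", "1E-9", ".5", "1.", "1e"}) {
            System.out.println(s + " -> " + isNumber(s));
        }
        // "42", "3.14", "6.02e23" and "1E-9" are accepted;
        // ".5", "1." and "1e" are rejected because Digits needs at least one digit
    }
}
```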

Page 14: CSE P501 –  Compiler Construction

Automata

Finite automata (state machines) can be used to recognize strings generated by regular expressions

Can build automaton by-hand or automagically
  Will not build by-hand in this course
  Will use the JFlex tool: given a set of regex, it generates an automaton recognizer (a Java class)

Page 15: CSE P501 –  Compiler Construction

Finite Automata Terminology

Phrase                                Abbreviation
Finite Automaton                      FA
Deterministic Finite Automaton        DFA
Non-deterministic Finite Automaton    NFA
Finite-State Automaton                FSA = {DFA, NFA}

Page 16: CSE P501 –  Compiler Construction

DFA for “cat”

regex = cat

[Figure: four states in a row, joined by arcs labelled c, a, t; the first is the Start State, the last is the Accepting State (double circles)]

Page 17: CSE P501 –  Compiler Construction

DFA for ILIT

regex = [0-9][0-9]* = [0-9]+

[Figure: two-state DFA, states labelled 1 and 2: a 0-9 arc from the start state to the accepting state, and a 0-9 self-loop on the accepting state]

We have labelled the states

Page 18: CSE P501 –  Compiler Construction

DFA for ID

regex = [a-zA-Z_][a-zA-Z0-9_]*

[Figure: state 0 (start) goes to state 1 (accepting) on a-z, A-Z or _; state 1 loops to itself on a-z, A-Z, _ or 0-9]

Page 19: CSE P501 –  Compiler Construction

DFAs work like this . . .


1. scan the input text string, character-by-character

2. following the arc/edge corresponding to the character just read

3. if there is no arc for the character just read, then, either:

a. if you are in an accepting state: you're done. Success!

b. if you are not in an accepting state: you're done. Failure!
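The three steps above can be sketched directly in Java. The example below hard-codes the two-state DFA for ID (regex [a-zA-Z_][a-zA-Z0-9_]*); the class name DfaDemo, the ERROR constant and the choice to return the lexeme length are assumptions made for illustration.

```java
class DfaDemo {
    static final int ERROR = -1;   // "no arc for this character"

    // Transition function of the ID DFA: state 0 = start, state 1 = accepting
    static int next(int state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        boolean digit = c >= '0' && c <= '9';
        if (state == 0 && letter) return 1;
        if (state == 1 && (letter || digit)) return 1;
        return ERROR;
    }

    // Runs the DFA from the start of 'input'.
    // Returns the length of the recognized lexeme, or -1 on failure.
    static int run(String input) {
        int state = 0;
        int p = 0;
        while (p < input.length()) {              // 1. scan character-by-character
            int target = next(state, input.charAt(p));
            if (target == ERROR) break;           // 3. no arc: stop
            state = target;                       // 2. follow the arc
            p++;
        }
        return state == 1 ? p : -1;               // 3a. accepting: success; 3b. otherwise failure
    }

    public static void main(String[] args) {
        System.out.println(run("fac(int n);"));   // 3  (stops at "(" in state 1)
        System.out.println(run("23;"));           // -1 (no arc for "2" from state 0)
        System.out.println(run("today"));         // 5  (end of string in state 1)
    }
}
```

Error arcs are not stored anywhere: any character without an arc simply falls through to ERROR.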

Page 20: CSE P501 –  Compiler Construction

DFAs work like this - examples

[Figure: DFA for alphaid: state 0 (start) goes to state 1 (accepting) on a-z; state 1 loops to itself on a-z]

1. Scan "fac(int n);" for the regex, alphaid = [a-z]+ (lower-case alphas)
   We hit "(" and are already in state 1. Success

2. Scan "23;" for regex alphaid
   There is no arc for "2". We are still in state 0. Failure

3. Scan "today" for regex alphaid
   We hit end-of-string and are already in state 1. Success

Note: no need to add arcs to the DFA for all error cases - they are implicit

Page 21: CSE P501 –  Compiler Construction

Thompson’s Construction: Combining DFAs

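[Figure: DFA for: a and DFA for: b, each a single labelled arc from a start state to an accepting state]

[Figure: NFA for: ab. The DFA for a is joined to the DFA for b by an ε edge]

[Figure: NFA for a|b. The DFAs for a and for b sit side by side, connected by four ε edges]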

Page 22: CSE P501 –  Compiler Construction

Combining DFAs, cont’d

[Figure: DFA for: a and DFA for: b]

[Figure: NFA for: a*. The DFA for a, surrounded by four ε edges]

Page 23: CSE P501 –  Compiler Construction

Exercise

Draw the NFA for: b(at|ag) | bug

[Figure: NFA with paths spelling b a t, b a g and b u g]


Page 25: CSE P501 –  Compiler Construction

NFA for a(b|c)*

[Figure: NFA for a(b|c)*, with arcs labelled a, b and c]

To recognize "acb" successfully, we need to:

• guess the future correctly
• backtrack and retry if we fail to recognize
• somehow execute all possible paths

None of these is attractive! Can we construct an equivalent DFA?

Page 26: CSE P501 –  Compiler Construction

Finite State Automaton (FSA)

A finite set of states
  One marked as initial state
  One or more marked as final states
  States sometimes labeled or numbered

A set of transitions from state to state
  Each labeled with symbol from Σ, or ε

Operate by reading input symbols (usually characters)
  Transition can be taken if labeled with current symbol
  ε-transition can be taken at any time (free bus ride)

Accept when final state reached & no more input
  Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input

Reject if no transition possible, or no more input and not in final state (DFA)

Page 27: CSE P501 –  Compiler Construction

DFA vs NFA

Deterministic Finite Automata (DFA)
  No choice of which transition to take
  In particular, no ε transitions
  No guessing

Non-deterministic Finite Automata (NFA)
  Choice of transition in at least one case
  Accepts if some way to reach final state on given input
  Reject if no possible way to final state
  How to implement in software?
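One possible answer to "How to implement in software?" is to follow all paths at once by tracking the set of states the NFA could be in. The sketch below does this for an NFA for a(b|c)*. The state numbering 0 to 9 and the ε edges are assumptions, chosen to be consistent with the state sets in the NFA-to-DFA table later in these slides; the class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NfaSim {
    // ε edges of the assumed NFA for a(b|c)*: state -> states reachable by one ε edge
    static final Map<Integer, List<Integer>> EPS = Map.of(
        1, List.of(2), 2, List.of(3, 9), 3, List.of(4, 6),
        5, List.of(8), 7, List.of(8), 8, List.of(3, 9));
    static final int START = 0;
    static final int ACCEPT = 9;

    // Labelled edges: 0 -a-> 1, 4 -b-> 5, 6 -c-> 7; returns -1 if there is no arc
    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 1;
        if (state == 4 && c == 'b') return 5;
        if (state == 6 && c == 'c') return 7;
        return -1;
    }

    // All states reachable from 'seed' using zero or more ε edges
    static Set<Integer> closure(Set<Integer> seed) {
        Set<Integer> result = new TreeSet<>(seed);
        Deque<Integer> work = new ArrayDeque<>(seed);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int t : EPS.getOrDefault(q, List.of())) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // Accept if SOME path through the NFA ends in the accepting state
    static boolean accepts(String input) {
        Set<Integer> current = closure(Set.of(START));
        for (char c : input.toCharArray()) {
            Set<Integer> moved = new TreeSet<>();
            for (int q : current) {
                int t = step(q, c);
                if (t >= 0) moved.add(t);
            }
            current = closure(moved);
        }
        return current.contains(ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(accepts("acb"));   // true
        System.out.println(accepts("a"));     // true
        System.out.println(accepts("ba"));    // false
    }
}
```

This avoids guessing and backtracking, at the cost of manipulating a set of states for every input character.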

Page 28: CSE P501 –  Compiler Construction


DFAs in Scanners

We really want DFA for speed: no backtracking, no guessing, no foretelling the future

Conversion from regex to NFA is easy, right?

But how to turn an NFA into an equivalent DFA?

Turns out to be obvious (once seen) and easy

Page 29: CSE P501 –  Compiler Construction

NFA to DFA

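[Figure: NFA for a(b|c)*, states 0 to 9, with labelled arcs a (from state 0), b (from state 4) and c (from state 6); the other arcs are ε edges]

Starting with the above NFA, we want to 'collapse' epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with:

[Figure: DFA with states 0 and 1: state 0 goes to state 1 on a; state 1 loops to itself on b and on c]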

Page 30: CSE P501 –  Compiler Construction

NFA to DFA

[Figure: NFA for a(b|c)*, states 0 to 9]

• Begin in the Start state
• Foreach labelled arc leaving that state, what set of states can I reach, along labelled arc, or along ε transitions?

Page 31: CSE P501 –  Compiler Construction

NFA to DFA

[Figure: NFA for a(b|c)*, states named n0 to n9]

NFA State             a                    b                    c
d0 = n0               d1 = {1,2,3,4,6,9}   none                 none
d1 = {1,2,3,4,6,9}    none                 d2 = {3,4,5,6,8,9}   d3 = {3,4,6,7,8,9}
d2 = {3,4,5,6,8,9}    none                 d2 = {3,4,5,6,8,9}   d3 = {3,4,6,7,8,9}
d3 = {3,4,6,7,8,9}    none                 d2 = {3,4,5,6,8,9}   d3 = {3,4,6,7,8,9}

Page 32: CSE P501 –  Compiler Construction

NFA to DFA

DFA for a(b|c)*

[Figure: d0 goes to d1 on a; each of d1, d2 and d3 goes to d2 on b and to d3 on c]

NFA State   a    b    c
d0          d1   -    -
d1          -    d2   d3
d2          -    d2   d3
d3          -    d2   d3

Page 33: CSE P501 –  Compiler Construction

NFA to DFA - Even Better

DFA for a(b|c)*

[Figure: d0 goes to d1 on a; d1 loops to itself on b and on c]

• Can reduce number of states further, to yield above result
• If interested, see books for details
• State minimization is not examined in P501

Page 34: CSE P501 –  Compiler Construction

From NFA to DFA

Subset construction (equivalence class)
  Construct DFA from NFA, where each DFA state represents a set of NFA states

Key idea
  State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input

Algorithm: example of a fixed-point computation

If NFA has n states, DFA has at most 2^n states
  => DFA is finite, can construct in finite # steps
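A minimal sketch of the subset construction, run on the a(b|c)* example, is given below. The NFA edges are assumptions reconstructed to agree with the state sets in the slides' table (d1 = {1,2,3,4,6,9} and so on); the class name and the map-of-maps DFA representation are choices made for this illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SubsetConstruction {
    // Assumed NFA for a(b|c)*: ε edges, then labelled edges {from, symbol, to}
    static final Map<Integer, List<Integer>> EPS = Map.of(
        1, List.of(2), 2, List.of(3, 9), 3, List.of(4, 6),
        5, List.of(8), 7, List.of(8), 8, List.of(3, 9));
    static final int[][] EDGES = { {0, 'a', 1}, {4, 'b', 5}, {6, 'c', 7} };

    // ε-closure: everything reachable from 'seed' along ε edges only
    static TreeSet<Integer> closure(Collection<Integer> seed) {
        TreeSet<Integer> result = new TreeSet<>(seed);
        Deque<Integer> work = new ArrayDeque<>(seed);
        while (!work.isEmpty()) {
            for (int t : EPS.getOrDefault(work.pop(), List.of())) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // NFA states reachable from 'states' on symbol c, followed by ε-closure
    static TreeSet<Integer> move(Set<Integer> states, char c) {
        List<Integer> targets = new ArrayList<>();
        for (int[] e : EDGES) {
            if (states.contains(e[0]) && e[1] == c) targets.add(e[2]);
        }
        return closure(targets);
    }

    // Worklist loop: stops when no new DFA state (set of NFA states) appears
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.add(closure(List.of(0)));
        while (!work.isEmpty()) {
            Set<Integer> d = work.remove();
            if (dfa.containsKey(d)) continue;       // already processed
            Map<Character, Set<Integer>> row = new TreeMap<>();
            dfa.put(d, row);
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = move(d, c);
                if (!target.isEmpty()) {
                    row.put(c, target);
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        // One line per DFA state: the set of NFA states, then its transitions
        build("abc").forEach((state, row) -> System.out.println(state + " : " + row));
    }
}
```

For this NFA the loop discovers four DFA states, matching d0 to d3 in the table; it terminates because there are only finitely many subsets of NFA states to discover.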

Page 35: CSE P501 –  Compiler Construction

Build DFA for: b(at|ag) | bug from its NFA

[Figure: NFA for b(at|ag) | bug, states 0 to 12]

NFA State        a    b           g    t    u
d0 = 0           -    {1,2,5,9}   -    -    -
d1 = {1,2,5,9}   ?    ?           ?    ?    ?
?                ?    ?           ?    ?    ?

Page 36: CSE P501 –  Compiler Construction

Build DFA for: b(at|ag) | bug from its NFA

[Figure: NFA for b(at|ag) | bug, states 0 to 12]

NFA State        a            b                g              t             u
d0 = {0}         -            d1 = {1,2,5,9}   -              -             -
d1 = {1,2,5,9}   d2 = {3,6}   -                -              -             d3 = {10}
d2 = {3,6}       -            -                d4 = {7}       d5 = {4,12}   -
d3 = {10}        -            -                d6 = {11,12}   -             -
TBD              ?            ?                ?              ?             ?

Page 37: CSE P501 –  Compiler Construction

Hand-Written Scanner

Idea: show a hand-written DFA for some typical tokens
  Then use to construct hand-written scanner

Setting: Parser calls scanner whenever it wants next token
  JFlex provides next_token
  Scanner stores current position in input

For illustration only. Course project will use JFlex scanner-generator

Note - most commercial compilers use hand-written scanners - generally faster

Page 38: CSE P501 –  Compiler Construction

Scanner DFA Example – Part 1

[Figure: arcs leaving start state 0]
  whitespace or comments    0 -> 0 (loop)
  end of input              0 -> 1    Accept EOF
  (                         0 -> 2    Accept LPAREN
  )                         0 -> 3    Accept RPAREN
  ;                         0 -> 4    Accept SEMI

Page 39: CSE P501 –  Compiler Construction

Scanner DFA Example – Part 2

[Figure: more arcs from start state 0]
  !          0 -> 5
  =          5 -> 6     Accept NEQ
  [other]    5 -> 7     Accept NOT
  <          0 -> 8
  =          8 -> 9     Accept LEQ
  [other]    8 -> 10    Accept LESS

Page 40: CSE P501 –  Compiler Construction

Scanner DFA Example – Part 3

[Figure: more arcs from start state 0]
  [0-9]      0 -> 11
  [0-9]      11 -> 11 (loop)
  [other]    11 -> 12    Accept ILIT

Page 41: CSE P501 –  Compiler Construction

Scanner DFA Example – Part 4

[Figure: more arcs from start state 0]
  [a-zA-Z]        0 -> 13
  [a-zA-Z0-9_]    13 -> 13 (loop)
  [other]         13 -> 14    Accept ID or keyword

Strategies for handling identifiers vs keywords
  Hand-written scanner: look up identifier-like things in table of keywords
  Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords
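The table-lookup strategy for a hand-written scanner might look like the sketch below. The token-kind constants and their numeric values are hypothetical (in the course project they would come from the generated sym.java), as is the class name KeywordLookup.

```java
import java.util.Map;

class KeywordLookup {
    // Hypothetical token kinds, for illustration only
    static final int ID = 0, IF = 1, WHILE = 2, RETURN = 3;

    // Table of keywords: lexeme -> token kind
    static final Map<String, Integer> KEYWORDS =
        Map.of("if", IF, "while", WHILE, "return", RETURN);

    // Called after the DFA has accepted an identifier-like lexeme:
    // a keyword if it is in the table, otherwise an ordinary ID
    static int classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, ID);
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));     // 1 (IF)
        System.out.println(classify("iffy"));   // 0 (ID)
    }
}
```

Because the lookup happens on the complete lexeme, the DFA itself needs no keyword-specific states.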

Page 42: CSE P501 –  Compiler Construction

Scanner – class, ctor, skipWhite

public class Scanner {
  private String prog;   // the MiniJava program to be scanned
  private int p;         // index in 'prog' of current char

  public Scanner(String prog) {
    this.prog = prog;
    p = 0;
  }

  // Advance p past whitespace; stop at end of input rather than running off the string
  private void skipWhite() {
    while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) {
      p++;
    }
  }

Page 43: CSE P501 –  Compiler Construction

Scanner- id

  private Token id() {
    int pBegin = p;   // remember begin index of id

    // consume letters, digits and '_'; stop at end of input
    while (p < prog.length()) {
      char c = prog.charAt(p);
      if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
      p++;
    }
    return new Token(ID, prog.substring(pBegin, p));
  }

Page 44: CSE P501 –  Compiler Construction

Scanner - iLit

  private Token iLit() {
    int pBegin = p;   // remember begin index of lexeme
    int val = 0;      // value accumulated so far

    // step thru chars of number; only digits contribute to val
    while (p < prog.length() && Character.isDigit(prog.charAt(p))) {
      val = 10 * val + Character.getNumericValue(prog.charAt(p));
      p++;
    }
    String lex = prog.substring(pBegin, p);
    return new Token(ILIT, lex, val);
  }

Page 45: CSE P501 –  Compiler Construction

Scanner - nextToken

  public Token nextToken() {
    skipWhite();                             // returns at prog[p], or at end of input
    if (p >= prog.length()) return new Token(EOF, "");

    char c = prog.charAt(p);                 // current char in 'prog'
    char n = (p + 1 < prog.length())         // next char in 'prog', if there is one
               ? prog.charAt(p + 1) : '\0';

    switch (c) {
      case '>':
        if (n == '=') { p++; p++; return new Token(GEQ, ">="); }
        else          { p++;      return new Token(GT, ">"); }
      // . . .
      case '+': p++; return new Token(PLUS, "+");
      // . . .
    } // end of switch

Page 46: CSE P501 –  Compiler Construction

Scanner – nextToken, cont’d

    if (Character.isDigit(c)) {
      return this.iLit();
    } else if (Character.isAlphabetic(c)) {
      return this.id();
    } else {
      p++;   // skip the offending char so the next call makes progress
      return new Token(BAD, "");
    }
  } // end of nextToken

} // end of class Scanner

An entire hand-written scanner for MiniJava takes ~100 lines of Java

Page 47: CSE P501 –  Compiler Construction

Grammars & BNF

Since the 60s, the syntax of every significant programming language has been specified by a formal grammar

First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax

Borrowed from the linguistics community (Noam Chomsky)

Page 48: CSE P501 –  Compiler Construction

Grammar for a Tiny Language

program    → statement | program statement
statement  → assignStmt | ifStmt
assignStmt → id = expr ;
ifStmt     → if ( expr ) statement
expr       → id | ilit | expr + expr
id         → a | b | c | i | j | k | n | x | y | z
ilit       → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note: often see ::= used instead of →

Page 49: CSE P501 –  Compiler Construction

Example Derivation

a = 1 ; if ( a + 1 ) b = 2 ;

program    ::= statement | program statement
statement  ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt     ::= if ( expr ) statement
expr       ::= id | ilit | expr + expr
id         ::= a | b | c | i | j | k | n | x | y | z
ilit       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

P    → S | P S
S    → A | I
A    → id = E ;
I    → if ( E ) S
E    → id | ilit | E + E
id   → [a-z]
ilit → [0-9]

Page 50: CSE P501 –  Compiler Construction

Parse Tree - First Few Steps

a = 1 ; if ( a + 1 ) b = 2 ;

P
├── P
│   └── S
│       └── A
│           ├── id
│           ├── =
│           ├── E
│           │   └── ilit
│           └── ;
└── S

Page 51: CSE P501 –  Compiler Construction

Parse Tree - Complete

a = 1 ; if ( a + 1 ) b = 2 ;

P
├── P
│   └── S
│       └── A
│           ├── id
│           ├── =
│           ├── E
│           │   └── ilit
│           └── ;
└── S
    └── I
        ├── if
        ├── (
        ├── E
        │   ├── E
        │   │   └── id
        │   ├── +
        │   └── E
        │       └── ilit
        ├── )
        └── S
            └── A
                ├── id
                ├── =
                ├── E
                │   └── ilit
                └── ;

Page 52: CSE P501 –  Compiler Construction

Alternative Notations

There are several syntax notations for productions in common use; all mean the same thing

ifStmt ::= if ( expr ) statement

ifStmt → if ( expr ) statement

<ifStmt> ::= if ( <expr> ) <statement>

Page 53: CSE P501 –  Compiler Construction

Formal Languages & Automata Theory

Alphabet: a finite set of symbols ( eg: [a-zA-Z0-9_] )
String: a finite, possibly empty sequence of symbols from an alphabet
Language: a set, often infinite, of strings
Finite specifications of (possibly infinite) languages
  Grammar – a generator; a system for producing all strings in the language (and no other strings)

A particular language may be specified by many different grammars

A grammar specifies only one language

Page 54: CSE P501 –  Compiler Construction

Productions

The rules of a grammar are called productions

Rules contain
  Nonterminal symbols: grammar variables (program, statement, id, etc)
  Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, ), … )

Meaning of: nonterminal → <sequence of terminals and non-terminals>
  In a derivation, an instance of non-terminal can be replaced by the sequence of terminals and non-terminals on its RHS

Often, there are two or more productions for one nonterminal – use any in different parts of derivation

Page 55: CSE P501 –  Compiler Construction


Two ways to Parse

Parse: re-construct the derivation (syntactic structure) of a program

More prosaically: fill the gap between top and bottom of page with a parse tree:

Start at top; build tree downwards, sweeping left-to-right. This is called a "top-down" parse. What we just did for the "Tiny Language" example

Start at bottom; build little trees that join upwards. Called a "bottom-up" parse. What CUP does for us.
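As a rough sketch of the top-down approach, here is a recursive-descent recognizer for the Tiny Language. Several things are assumptions made for this illustration: tokens are separated by whitespace, the left-recursive rules (program, expr) are rewritten as loops because a top-down parser cannot use them directly, and the class TinyParser only answers yes or no rather than building a tree.

```java
class TinyParser {
    private final String[] toks;   // whitespace-separated tokens
    private int p = 0;             // index of the current token

    TinyParser(String src) {
        toks = src.trim().split("\\s+");
    }

    private String peek() {
        return p < toks.length ? toks[p] : "<eof>";
    }

    private void expect(String t) {
        if (!peek().equals(t)) {
            throw new IllegalStateException("expected " + t + " but saw " + peek());
        }
        p++;
    }

    // program -> statement statement ...   (left recursion rewritten as a loop)
    void program() {
        statement();
        while (p < toks.length) statement();
    }

    // statement -> ifStmt | assignStmt
    void statement() {
        if (peek().equals("if")) {          // ifStmt -> if ( expr ) statement
            expect("if"); expect("("); expr(); expect(")"); statement();
        } else {                            // assignStmt -> id = expr ;
            id(); expect("="); expr(); expect(";");
        }
    }

    // expr -> operand + operand + ...      (left recursion rewritten as a loop)
    void expr() {
        operand();
        while (peek().equals("+")) { p++; operand(); }
    }

    // operand -> ilit | id
    void operand() {
        if (peek().matches("[0-9]")) p++; else id();
    }

    // id -> a | b | c | i | j | k | n | x | y | z
    void id() {
        if (!peek().matches("[abcijknxyz]")) {
            throw new IllegalStateException("expected id but saw " + peek());
        }
        p++;
    }

    static boolean parses(String src) {
        try {
            new TinyParser(src).program();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("a = 1 ; if ( a + 1 ) b = 2 ;"));   // true
        System.out.println(parses("a = ;"));                          // false
    }
}
```

Each method corresponds to one nonterminal, and the chain of calls traces the parse tree from the top down, sweeping the input left-to-right.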

Page 56: CSE P501 –  Compiler Construction

Why Separate Scanner and Parser?

In principle, a single recognizer could work directly from a concrete, character-by-character grammar

In practice this is rarely done: compilers almost always scan chars to tokens, because:

Simplicity & Separation of Concerns
  Scanner hides details from parser (comments, whitespace, input files, etc)
  Parser becomes easier to build; has simpler input - stream-of-tokens

Efficiency
  Scanner can use simpler, fast design
  But still often consumes a surprising amount of the compiler’s total execution time - it touches every char in source program

Page 57: CSE P501 –  Compiler Construction

Project Notes

For MiniJava project
  Use JFlex scanner-generator tool
  Use CUP parser-generator tool
  The two work together

CUP generates a file of token kinds into sym.java (SEMI = 28, LT = 18, etc)

JFlex needs these definitions. To bootstrap this process, inspect the MiniJava grammar and devise your own set of token kinds

See MiniJava page at: http://www.cambridge.org/resources/052182060X/

Page 58: CSE P501 –  Compiler Construction

Next

Homework: paper exercises on regex and FAs

Next week: first part of the compiler assignment – the scanner

Send partner info to Nat if you want project space

Next topic: parsing
  Will do LR parsing first, for the project (CUP)
  Cooper&Torczon chapter 3