010 Lexical Analysis

216
Lexical Analysis

Transcript of 010 Lexical Analysis

Page 1: 010 Lexical Analysis

Lexical Analysis

Page 2: 010 Lexical Analysis

Announcements

● Programming Assignment 1 Out● Due Friday, July 1 at 11:59 PM.

● Check the website for readings:● Decaf Specification● Lexical Analysis● Intro to flex● Programming Assignment 1

Page 3: 010 Lexical Analysis

Where We Are

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

SourceCode

MachineCode

Page 4: 010 Lexical Analysis

while (ip < z) ++ip;

Page 5: 010 Lexical Analysis

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

Page 6: 010 Lexical Analysis

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

Page 7: 010 Lexical Analysis

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

While

++

Ident

<

Ident Ident

ip z ip

Page 8: 010 Lexical Analysis

do[for] = new 0;

Page 9: 010 Lexical Analysis

do[for] = new 0;

d o [ f ] n e w 0 ;=o r

Page 10: 010 Lexical Analysis

do[for] = new 0;

T_Do [ T_For T_New T_IntConst

0

d o [ f ] n e w 0 ;=o r

] =

Page 11: 010 Lexical Analysis

do[for] = new 0;

T_Do [ T_For T_New T_IntConst

0

d o [ f ] n e w 0 ;=o r

] =

Page 12: 010 Lexical Analysis

Why do Lexical Analysis?

● Dramatically simplify parsing.● Eliminate whitespace.● Eliminate comments.● Convert data early.

● Separate out logic to read source files.● Potentially an issue on multiple platforms.● Can optimize reading code independently of parser.● An extreme example...

Page 13: 010 Lexical Analysis

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

Page 14: 010 Lexical Analysis

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

Page 15: 010 Lexical Analysis

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

c:\dev>g++ helloworld.pnghelloworld.png: file not recognized: File format not recognizedcollect2: ld returned 1 exit status

Page 16: 010 Lexical Analysis

Goals of Lexical Analysis

● Convert from physical description of a program into sequence of of tokens.

● Each token is associated with a lexeme.● Each token may have optional attributes.● The token stream will be used in the parser to

recover the program structure.

Page 17: 010 Lexical Analysis

Challenges in Lexical Analysis

● How to partition the program into lexemes?● How to label each lexeme correctly?

Page 18: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO 5 I = 1.25

Thanks to Prof. Alex Aiken

Page 19: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

Thanks to Prof. Alex Aiken

Page 20: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

● Can be difficult to tell when to partition input.

Thanks to Prof. Alex Aiken

Page 21: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Nested template declarations

vector<vector<int>> myVector

Thanks to Prof. Alex Aiken

Page 22: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Nested template declarations

vector < vector < int >> myVector

Thanks to Prof. Alex Aiken

Page 23: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Nested template declarations

(vector < (vector < (int >> myVector)))

Thanks to Prof. Alex Aiken

Page 24: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Nested template declarations

(vector < (vector < (int >> myVector)))

● Again, can be difficult to determine where to split.

Thanks to Prof. Alex Aiken

Page 25: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Types nested in templates

Page 26: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Types nested in templates

template <typename T>

void MyFunction(T val) {

T::iterator itr;

}

Page 27: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Types nested in templates

template <typename T>

void MyFunction(T val) {

typename T::iterator itr;

}

Page 28: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● C++: Types nested in templates

template <typename T>

void MyFunction(T val) {

typename T::iterator itr;

}

● Can be hard to label tokens correctly.

Page 29: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● PL/1: Keywords can be used as identifiers.

Thanks to Prof. Alex Aiken

Page 30: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

Thanks to Prof. Alex Aiken

Page 31: 010 Lexical Analysis

Not-so-Great Moments in Scanning

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

● Can be difficult to determine how to label lexemes.

Thanks to Prof. Alex Aiken

Page 32: 010 Lexical Analysis

Defining a Lexical Analysis

● Define a set of tokens.● Define the set of lexemes associated with each

token.● Define an algorithm for resolving conflicts that

arise in the lexemes.

Page 33: 010 Lexical Analysis

Choosing Tokens

Page 34: 010 Lexical Analysis

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) { cout << k << endl;}

Page 35: 010 Lexical Analysis

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) { cout << k << endl;}

T_For {T_IntConstant }T_Int <<T_Identifier ;= <( [) ]++

Page 36: 010 Lexical Analysis

Choosing Good Tokens

● Very much dependent on the language.● Typically:

● Give keywords their own tokens.● Give different punctuation symbols their own

tokens.● Discard irrelevant information (whitespace,

comments)

Page 37: 010 Lexical Analysis

Defining Sets of Strings

Page 38: 010 Lexical Analysis

String Terminology

● An alphabet Σ is a set of characters.● A string over Σ is a finite sequence of elements

from Σ.● Example:

● Σ = {☺, ☼ }● Valid strings include ☺, ☺☺, ☺☼, ☺☺☼, etc.

● The empty string of no characters is denoted ε.

Page 39: 010 Lexical Analysis

Languages

● A language is a set of strings.● Example: Even numbers

● Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}● L = {0, 2, 4, 6, 8, 10, 12, 14, … }

● Example: C variable names● Σ = ASCII characters● L = {a, b, c, …, A, B, C, …, _, aa, ab, … }

● Example: L = {static}

Page 40: 010 Lexical Analysis

Regular Languages

● A subset of all languages that can be defined by regular expressions.

● Small subset of all languages, but surprisingly powerful in practice:● Capture a useful class of languages.● Can be implemented efficiently (more on that later).

Page 41: 010 Lexical Analysis

Regular Expressions

● Any character is a regular expression matching itself.● ε is a regular expression matching the empty string.

● If R1 and R

2 are regular expressions, then

● R1R

2 is a regular expression matching the concatenation of

the languages.

● R1 | R

2 is a regular expression matching the disjunction of the

languages.

● R1* is a regular expression matching the Kleene closure of the

language.● (R) is a regular expression matching R.

Page 42: 010 Lexical Analysis

Regular Expressions In Action

● Example: Even Numbers

(+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)

Page 43: 010 Lexical Analysis

Regular Expressions In Action

● Example: Even Numbers

(+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)

● Notational simplification: Definitions

Sign = + | -

OptSign = Sign | ε

Digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

EvenDigit = 0 | 2 | 4 | 6 | 8

EvenNumber = OptSign Digit* EvenDigit

Page 44: 010 Lexical Analysis

Regular Expressions In Action

● Example: Even Numbers

(+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)

● Notational simplification: Multi-Way Disjunction

Sign = + | -

OptSign = Sign | ε

Digit = [0123456789]

EvenDigit = [02468]

EvenNumber = OptSign Digit* EvenDigit

Page 45: 010 Lexical Analysis

Regular Expressions In Action

● Example: Even Numbers

(+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)

● Notational simplification: Ranges

Sign = + | -

OptSign = Sign | ε

Digit = [0-9]

EvenDigit = [02468]

EvenNumber = OptSign Digit* EvenDigit

Page 46: 010 Lexical Analysis

Regular Expressions In Action

● Example: Even Numbers

(+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)

● Notational simplification: Zero-or-One

Sign = + | -

OptSign = Sign?

Digit = [0-9]

EvenDigit = [02468]

EvenNumber = OptSign Digit* EvenDigit

Page 47: 010 Lexical Analysis

Implementing Regular Expressions

● Regular expressions can be implemented using finite automata.

● There are two kinds of finite automata:● NFAs (nondeterministic finite automata), which

we'll see in a second, and● DFAs (deterministic finite automata), which we'll

see later.

● Automata are best explained by example...

Page 48: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

Page 49: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 50: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 51: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 52: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 53: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 54: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 55: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 56: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 57: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o l

Page 58: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

Page 59: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 60: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 61: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 62: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 63: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 64: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 65: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 66: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 67: 010 Lexical Analysis

A Simple NFA

r o f l

l

o l

start

l o g

Page 68: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

Σ

Page 69: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 70: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 71: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 72: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 73: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 74: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 75: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 76: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 77: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 78: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 79: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 80: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 81: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 82: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 83: 010 Lexical Analysis

A More Complex NFA

start

ε

ε

ε

a

s

s s

t e

s k a t e s

Σ

Page 84: 010 Lexical Analysis

Simulating an NFA

● Keep track of a set of states, initially the start state and everything reachable by ε-moves.

● For each character in the input:● Maintain a set of next states, initially empty.● For each current state:

– Follow all transitions labeled with the current letter.

– Add these states to the set of new states.

● Add every state reachable by an ε-move to the set of next states.

● Accept if at least one state in the set of states is accepting.

● Complexity: O(|S||Q|2) for strings of length |S| and |Q| states.

Page 85: 010 Lexical Analysis

From Regular Expressions to NFAs

● There is a (beautiful!) procedure from converting a regular expression to an NFA.

● Associate each regular expression with an NFA with the following properties:● There is exactly one accepting state.● There are no transitions out of the accepting state.● There are no transitions into the starting state.

start

Page 86: 010 Lexical Analysis

Base Cases

εstart

Automaton for ε

astart

Automaton for single character a

Page 87: 010 Lexical Analysis

Construction for R1R

2

Page 88: 010 Lexical Analysis

R1

Construction for R1R

2

R2

start start

Page 89: 010 Lexical Analysis

R1

Construction for R1R

2

R2

start

Page 90: 010 Lexical Analysis

R1

Construction for R1R

2

R2

start ε

Page 91: 010 Lexical Analysis

R1

Construction for R1R

2

R2

start ε

Page 92: 010 Lexical Analysis

Construction for R1 | R

2

Page 93: 010 Lexical Analysis

Construction for R1 | R

2

R2

start

R1

start

Page 94: 010 Lexical Analysis

Construction for R1 | R

2

R2

start

R1

start

start

Page 95: 010 Lexical Analysis

Construction for R1 | R

2

R2

R1start

ε

ε

Page 96: 010 Lexical Analysis

Construction for R1 | R

2

R2

R1start

ε

ε

Page 97: 010 Lexical Analysis

Construction for R1 | R

2

R2

R1start

ε

ε

ε

ε

Page 98: 010 Lexical Analysis

Construction for R1 | R

2

R2

R1start

ε

ε

ε

ε

Page 99: 010 Lexical Analysis

Construction for R*

Page 100: 010 Lexical Analysis

Construction for R*

R

start

Page 101: 010 Lexical Analysis

Construction for R*

R

start start

Page 102: 010 Lexical Analysis

Construction for R*

R

start start

ε

Page 103: 010 Lexical Analysis

Construction for R*

R

start

ε

ε ε

Page 104: 010 Lexical Analysis

Construction for R*

R

start

ε

ε ε

ε

Page 105: 010 Lexical Analysis

Construction for R*

R

start

ε

ε ε

ε

Page 106: 010 Lexical Analysis

Overall Result

● Any regular expression of length n can be converted into an NFA with O(n) states.

● Can determine whether a string matches a regular expression in O(|S|n2), where |S| is the length of the string.

● We'll see how to make this O(|S|) later.

Page 107: 010 Lexical Analysis

The Story so Far

● We have a way of describing sets of strings for each token using regular expressions.

● We have an implementation of regular expressions using NFAs.

● How do we cut apart the input text?

Page 108: 010 Lexical Analysis

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 109: 010 Lexical Analysis

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

Page 110: 010 Lexical Analysis

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o trf o trf o trf o trf o tr

f o trf o trf o trf o tr

Page 111: 010 Lexical Analysis

Conflict Resolution

● Assume all tokens are specified as regular expressions.

● Algorithm: Left-to-right scan.● Tiebreaking rule one: Maximal munch.

● Always match the longest possible prefix of the remaining text.

Page 112: 010 Lexical Analysis

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o trf o trf o trf o trf o tr

f o trf o trf o trf o tr

Page 113: 010 Lexical Analysis

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o tr

Page 114: 010 Lexical Analysis

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximum munch?

● Idea:● Convert expressions to NFAs.● Run all NFAs in parallel, keeping track of the last

match.● When all automata get stuck, report the last match

and restart the search at that point.

Page 115: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

Page 116: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

Page 117: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 118: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 119: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 120: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 121: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 122: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 123: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 124: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 125: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 126: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 127: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 128: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 129: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 130: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 131: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 132: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 133: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 134: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 135: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 136: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 137: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 138: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 139: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 140: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 141: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 142: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 143: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 144: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 145: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 146: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 147: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 148: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 149: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 150: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 151: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 152: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 153: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 154: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 155: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 156: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 157: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 158: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 159: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 160: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 161: 010 Lexical Analysis

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 162: 010 Lexical Analysis

A Minor Simplification

Page 163: 010 Lexical Analysis

A Minor Simplification

start d o

start d o u b l e

start Σ

Page 164: 010 Lexical Analysis

A Minor Simplification

start d o

start d o u b l e

start Σ

Page 165: 010 Lexical Analysis

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 166: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 167: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

Page 168: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu eld o bu el

Page 169: 010 Lexical Analysis

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”

● Simple priority system: which rule was defined first?

Page 170: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu eld o bu el

Page 171: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu el

Page 172: 010 Lexical Analysis

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu elWhy isn't this a

problem?

Page 173: 010 Lexical Analysis

One Last Detail...

● We know what to do if multiple rules match.● What if nothing matches?● Trick: Add a “catch-all” rule that matches any

character and reports an error.

Page 174: 010 Lexical Analysis

Summary of Conflict Resolution

● Construct an automaton for each regular expression.

● Merge them into one automaton by adding a new start state.

● Scan the input, keeping track of the last known match.

● Break ties by choosing higher-precedence matches.

● Have a catch-all rule to handle errors.

Page 175: 010 Lexical Analysis

Speeding up the Scanner

Page 176: 010 Lexical Analysis

DFAs

● The automata we've seen so far have all been NFAs.

● A DFA is like an NFA, but with tighter restrictions:● Every state must have exactly one transition

defined for every letter.● ε-moves are not allowed.

Page 177: 010 Lexical Analysis

A Sample DFA

Page 178: 010 Lexical Analysis

A Sample DFA

start

0 0

1

0

1

0

1

1

Page 179: 010 Lexical Analysis

A Sample DFA

D

start

0 0

1

0

1

0

1

A

C

B1

Page 180: 010 Lexical Analysis

A

A

B

B

C

C

D

D

A Sample DFA

D

start

0 0

1

0

1

0

1

A

C

B1

A

B

C

D

0 1

Page 181: 010 Lexical Analysis

Simulating a DFA

● Begin in the start state.● For each character of the input:

● Follow the indicated transition to the next state.

● Assuming transitions are in a table, time to scan a string of length S is O(|S|).

Page 182: 010 Lexical Analysis

Speeding up Matching

● In the worst-case, an NFA takes O(|S||Q|2) time to match a string.

● DFAs, on the other hand, take only O(|S|).● There is another (beautiful!) algorithm to

convert NFAs to DFAs.

Lexical Specification

RegularExpressions NFA DFA

Table-DrivenDFA

Page 183: 010 Lexical Analysis

Subset Construction

● NFAs can be in many states at once, while DFAs can only be in a single state at a time.

● Key idea: Make the DFA simulate the NFA.● Have the states of the DFA correspond to the

sets of states of the NFA.● Transitions between states of DFA correspond

to transitions between sets of states in the NFA.

Page 184: 010 Lexical Analysis

Formally...

● Given an NFA N = (Σ, Q, δ, s, F), define the DFA D = (Σ', Q', δ', s', F') as follows:● Σ' = Σ● Q' = 2Q

● δ'(S, a) = {ECLOSE(s') | s ∈ S and δ(s, a) = s'}● s' = ECLOSE(s)● F' = { S | s ∃ ∈ S. s ∈ F }

● Here, ECLOSE(s) is the set of states reachable from s via ε-moves.

Page 185: 010 Lexical Analysis

From NFA to DFA

Page 186: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

Page 187: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

Page 188: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 189: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 190: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 191: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 192: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 193: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 194: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 195: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 196: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 197: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 198: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6

12

Page 199: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6

12

Page 200: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

12

Page 201: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

Σ

Σ – o Σ – u Σ – b Σ – l Σ – e

Σ

Σ

12

Page 202: 010 Lexical Analysis

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

Σ

Σ – o Σ – u Σ – b Σ – l Σ – e

Σ

Σ

12

Page 203: 010 Lexical Analysis

Modified Subset Construction

● Instead of marking whether a state is accepting, remember which token type it matches.

● Break ties with priorities.● When using DFA as a scanner, consider the

DFA “stuck” if it enters the state corresponding to the empty set.

Page 204: 010 Lexical Analysis

Performance Concerns

● The NFA-to-DFA construction can introduce exponentially many states.

● Time/memory tradeoff:● Low-memory NFA has higher scan time.● High-memory DFA has lower scan time.

● Could use a hybrid approach by simplifying NFA before generating code.

Page 205: 010 Lexical Analysis

Real-World Scanning: Python

Page 206: 010 Lexical Analysis

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

While

++

Ident

<

Ident Ident

ip z ip

Page 207: 010 Lexical Analysis

Python Blocks

● Scoping handled by whitespace:if w == z:

a = b

c = d

else:

e = f

g = h

● What does that mean for the scanner?

Page 208: 010 Lexical Analysis

Whitespace Tokens

● Special tokens inserted to indicate changes in levels of indentation.

● NEWLINE marks the end of a line.● INDENT indicates an increase in indentation.● DEDENT indicates a decrease in indentation.● Note that INDENT and DEDENT encode

change in indentation, not the total amount of indentation.

Page 209: 010 Lexical Analysis

if w == z: a = b c = delse: e = fg = h

Scanning Python

Page 210: 010 Lexical Analysis

if w == z: a = b c = delse: e = fg = h

if ident

w

== ident

z

: NEWLINE

INDENT ident

a

= ident

b

NEWLINE

ident

c

= ident

d

NEWLINE

DEDENT else :

INDENT ident

e

= ident

f

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Scanning Python

NEWLINE

Page 211: 010 Lexical Analysis

if w == z: { a = b; c = d;} else { e = f;}g = h;

if ident

w

== ident

z

: NEWLINE

INDENT ident

a

= ident

b

NEWLINE

ident

c

= ident

d

NEWLINE

DEDENT else :

INDENT ident

e

= ident

f

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Scanning Python

NEWLINE

Page 212: 010 Lexical Analysis

if w == z: { a = b; c = d;} else { e = f;}g = h;

if ident

w

== ident

z

:

{ ident

a

= ident

b

;

ident

c

= ident

d

;

} else :

ident

e

= ident

f

;

} ident

g

= ident

h

;

Scanning Python

{

Page 213: 010 Lexical Analysis

Where to INDENT/DEDENT?

● Scanner maintains a stack of line indentations keeping track of all indented contexts so far.

● Initially, this stack contains 0, since initially the contents of the file aren't indented.

● On a newline:● See how much whitespace is at the start of the line.● If this value exceeds the top of the stack:

– Push the value onto the stack.

– Emit an INDENT token.

● Otherwise, while the value is less than the top of the stack:– Pop the stack.

– Emit a DEDENT token.

Source: http://docs.python.org/reference/lexical_analysis.html

Page 214: 010 Lexical Analysis

Interesting Observation

● Normally, more text on a line translates into more tokens.

● With DEDENT, less text on a line often means more tokens:

if cond1: if cond2: if cond3: if cond4: if cond5: statement1statement2

Page 215: 010 Lexical Analysis

Summary

● Lexical analysis splits input text into tokens holding a lexeme and an attribute.

● Lexemes are sets of strings often defined with regular expressions.

● Regular expressions can be converted to NFAs and from there to DFAs.

● Maximal-munch using an automaton allows for fast scanning.

● Not all tokens come directly from the source code.

Page 216: 010 Lexical Analysis

Next Time

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

SourceCode

MachineCode

(Plus a little bit here)