Compiler Construction Lent Term 2021 Lecture 2 : Lexical ...

Post on 02-May-2022

2 views 0 download

Transcript of Compiler Construction Lent Term 2021 Lecture 2 : Lexical ...

1

Compiler Construction

Lent Term 2021

Lecture 2 : Lexical analysis

Timothy G. Griffin

tgg22@cam.ac.uk

Computer Laboratory

University of Cambridge

• Recall regular expressions

• Recall Finite Automata

• Recall NFA to DFA transformation

• What is the “lexing problem”?

• How DFAs are used to solve the lexing

problem?

1

What problem are we solving?

if m = 0 then 1 else if m = 1 then 1 else fib (m - 1) + fib (m -2)

Translate a sequence of characters

into a sequence of tokens

type token =

| INT of int| IDENT of string | LPAREN | RPAREN

| ADD | SUB | EQUAL | IF | THEN | ELSE

| …

IF, IDENT “m”, EQUAL, INT 0, THEN, INT 1, ELSE, IF,

IDENT “m”, EQUAL, INT 1, THEN, INT 1, ELSE, IDENT “fib”,

LPAREN, IDENT “m”, SUB, INT 1, RPAREN, ADD,

IDENT “fib”, LPAREN, IDENT “m”, SUB, INT 2, RPAREN

implemented with some data type

2

)(||||| * aeeeeeae

)()(

)()(

}{)(

)}(),(|{)(

)()()(

}{)(

}{)(

{})(

)(

0

*

1

0

22112121

2121

*

n

n

nn

eMeM

eeMeM

eM

eMweMwwweeM

eMeMeeM

aaM

M

M

eM

alphabet over sexpressionRegular e

3

Regular Expression (RE) Examples

},,,

,,,,,{

))(( *

aaaabbbbabbbaabb

ababbaaabbbaabbaabbabb

abbbaM

},,

,,,

,,,{

))(( *

M

4

Review of Finite Automata (FA)

),,,,( 0 FqQM

states :Q alphabet :

statestart : Q0 q states final :QF

(DFA)FA ticdeterminisfor

Qa)(q,,, aQq

(NFA)FA nisticnondetermifor

Qa)(q,}),{(, aQq

5

NFA Example

)cbbcaabM(a ** **

acceptingNFA An

c

c

a

b

a

b

start

6

A bit of notation

},|{)(

and ),( if

0

322131

qqFqwML

qqqaqqq

qq

w

waw

ε

For deterministic FA.

For nondeterministic FA.

},|{)(

and),(q if

and),( if

0

321231

321231

qqFqwML

qqaqqq

qqqqqq

qq

w

waw

ww

ε

7

Review of RE -> NFA

e startq finalq

A regular

expression.

A nondeterministic

FA accepting M(e) with

a single final state.

The construction is done by induction on

the structure of e.

)(eN

8

Review of RE -> NFA

0q 1q)(N

0q 1q)(N

0q 1q)(aNa

9

Review of RE -> NFA

)( 21 eeN

)( 1eN

)( 2eN

10

Review of RE -> NFA

)( 21eeN

)( 2eN)( 1eN

11

Review of RE -> NFA

)( *eN

)(eN

12

))(( *abbbaN

a

b

ab b

0 1

2 3

6

54

10 9 8

7

start

13

Review of NFA -> DFA

),,,,( 0 FqQM

),,,,( ''

0

''' FqQM

}|{

}{

})|),('({),(

}',|'{)(

}|{

'

0

'

0

'

'

FSQSF

qclosureq

SqaqqclosureaS

qqSqQqSclosure

QSSQ

14

?)( compute wedo How Sclosure

resultreturn

stackon u push

result :result then

result if

each for

stack theoff pop

empty not stack while

:result

stack a onto of elements allpush

:)(

{u}

u

)(q,u

q

S

S

Sclosure

Look familiar?

It’s just a version of

transitive closure!

15

)))((( *abbbaNDFA

10,7,6

,5,4,2,1

7,4

,2,1,0

8,7,6

,4,3,2,1

9,7,6

5,4,2,1b

a

ba

startb

b

b

aa

a

7,6

5,4,2,1

16

17

Traditional Regular Language Problem

?)( is , and Given eLwwe

Solution : construct NFA from e, then DFA, then run

the DFA on w.

But is this a solution to the “lexing problem?

No!

17

Something closer to the “lexing problem”

.

The expressions are ordered by priority. Why?

Is “if” a variable or a keyword? Need priority to

resolve ambiguity (so “if” matched keyword RE

before identifier RE.

We need to do a longest match. Why?

Is “ifif” a variable or two “if” keywords?

w

)w(i),w,(i),w,(i nn,21 ...21

keee ,, 21 and

find

Given

so that

n1 www=w ...2 )L(ewjijii

and what else?

and

18

19

Define Tokens with Regular Expressions (Finite

Automata)

Keyword: if

1i

2f

3

1i

2f

3

0

-{f}

-{i}

This FA is really shorthand for:

“dead state” 19

20

Define Tokens with Regular Expressions (Finite

Automata)

Keyword:

if1

i2

f3 KEY(IF)

Keyword:

then1

t2

h3

KEY(then)

5

e

n4

Regular Expression Finite Automata Token

Identifier:

[a-zA-Z][a-zA-Z0-9]*1 2

[a-zA-Z]

[a-zA-Z0-9]

ID(s)

20

21

Define Tokens with Regular Expressions (Finite

Automata)

Regular Expression Finite Automata Token

number:

[0-9][0-9]*1 2

[0-9]

[0-9]

NUM(n)

real:

([0-9]+ ‘.’ [0-9]*)

| ([0-9]* ‘.’ [0-9]+)

1

3

[0-9] NUM(n) 2

[0-9]

[0-9]

.

4

.

[0-9]5

[0-9]21

22

No Tokens for “White-Space”

White-space with one line

comments starting with %

1

3

%2

[A-za-z0-9’ ‘]

4

\n

\t

\n‘ ‘

22

Constructing a Lexer

an ordered list of regular expressions

Highest priority first, lowest last

INPUT: keee ,, 21

keeee 21for NFA

23

priority.highest

of with theassociated

state finaleach DFA with

ie

Constructing a Lexer

(1) Keyword : then

(2) Ident : [a-z][a-z]*

(2) White-space: ‘ ‘

24

1t

2:IDh

3:ID

5:THEN

e

n

4:ID

7:W

‘ ‘

6:ID

State 5 could accept

either an ID or

the keyword “then”.

The priority rules

eliminates this

ambiguity and

associates state 5

with the keyword.

25

What about longest match?

1t

2:IDh

3:ID

5:THEN

e

n

4:ID

7:W

‘ ‘

6:ID

|then thenx$ 1 0

t|hen thenx$ 2 2

th|en thenx$ 3 3

the|n thenx$ 4 4

then| thenx$ 5 5

then |thenx$ 0 5 EMIT KEY(THEN)

then| thenx$ 1 0 RESET

then |thenx$ 7 7

then t|henx$ 0 7 EMIT WHITE(‘ ‘)

then |thenx$ 1 0 RESET

then t|henx$ 2 2

then th|enx$ 3 3

then the|nx$ 4 4

then then|x$ 5 5

then thenx|$ 6 6

then thenx$| 0 6 EMIT ID(thenx)

Start in initial state,

Repeat:

(1) read input until dead state is

reached. Emit token associated

with last accepting state.

(2) reset state to start state

| = current position, $ = EOF

Input

current state

last accepting state