LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.


LEXICAL ANALYSIS

Phung Hua Nguyen

University of Technology

2006

Faculty of IT - HCMUT Lexical Analysis 2

Outline

• Introduction to Lexical Analysis
• Token specification
  – Language
  – Regular Expressions (REs)
• Token recognition
  – REs → NFA (Thompson’s construction, Algorithm 3.3)
  – NFA → DFA (subset construction, Algorithm 3.2)
  – DFA → minimal DFA (Algorithm 3.6)
• Programming


Introduction

• Read the input characters

• Produce as output a sequence of tokens

• Eliminate white space and comments

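[Figure: the source program enters the lexical analyzer; the parser sends "get next token" to the lexical analyzer and receives a token in return; both the lexical analyzer and the parser are connected to the symbol table.]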


Why separate lexical analysis from parsing?

• Simplify design

• Improve compiler efficiency

• Enhance compiler portability


Tokens, Patterns, Lexemes

Token      Sample lexemes           Informal description of pattern
const      const                    const
if         if                       if
relation   <, <=, ==, !=, >, >=     < or <= or == or != or > or >=
id         pi, count, x2            letter followed by letters or digits
num        3.14, 25, 6.02E3         any numeric constant
literal    "core dumped"            any characters between " and " except "


Alphabet, Strings and Languages

• Alphabet ∑: any finite set of symbols
  – The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ,…}
  – The binary alphabet {0,1}
  – The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
  – Length |s| of a string s: the number of symbols in s
  – The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑; its two special cases:
  – ∅: the empty set
  – {ε}: the set containing only the empty string


Examples of Languages

• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ,…}
  – Vietnamese language
• ∑ = {0,1}
  – A string is an instruction
  – The set of Pentium instructions
• ∑ = the ASCII set
  – A string is a program
  – The set of C programs


Terms (Fig.3.7)

Term: Definition

prefix of s: a string obtained by removing zero or more trailing symbols of s; e.g. ban is a prefix of banana

suffix of s: a string formed by deleting zero or more leading symbols of s; e.g. na is a suffix of banana

substring of s: a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana

proper prefix, suffix or substring of s: any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x


String operations

• String concatenation
  – If x and y are strings, xy is the string formed by appending y to x.
    E.g.: x = hom, y = nay ⇒ xy = homnay
  – ε is the identity: εy = y; xε = x
• String exponentiation
  – s^0 = ε
  – s^i = s^(i-1)s for i > 0
    E.g. s = 01: s^0 = ε, s^2 = 0101, s^3 = 010101


Language Operations (Fig 3.8)

Term: Definition

union, L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }

concatenation, LM: LM = { st | s ∈ L and t ∈ M }

Kleene closure, L*: L* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε}; zero or more concatenations of L

positive closure, L+: L+ = L ∪ LL ∪ LLL ∪ …; one or more concatenations of L


Examples

• L = {A,B,…,Z,a,b,…,z}
• D = {0,1,…,9}

Example      Language
L ∪ D        letters and digits
LD           strings consisting of a letter followed by a digit
L^4          all four-letter strings
L*           all strings of letters, including ε
L(L ∪ D)*    all strings of letters and digits beginning with a letter
D+           all strings of one or more digits
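
The language operations can be tried out on small finite languages. The sketch below is only an illustration: the class name `LanguageOps` and its method names are made up for this example, and the closures L* and L+ are left out because they are infinite whenever L contains a nonempty string.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class LanguageOps {
    // Union: every string that is in l or in m.
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>(l);
        result.addAll(m);
        return result;
    }

    // Concatenation: every s from l followed by every t from m.
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>();
        for (String s : l) {
            for (String t : m) {
                result.add(s + t);
            }
        }
        return result;
    }

    // Power: l^0 = {ε} and l^i = l^(i-1) l.
    static Set<String> power(Set<String> l, int i) {
        Set<String> result = new TreeSet<>();
        result.add("");                      // the empty string ε
        for (int k = 0; k < i; k++) {
            result = concat(result, l);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two-symbol stand-ins for the letter set L and the digit set D.
        Set<String> l = new TreeSet<>(Arrays.asList("a", "b"));
        Set<String> d = new TreeSet<>(Arrays.asList("0", "1"));
        System.out.println(union(l, d));     // [0, 1, a, b]
        System.out.println(concat(l, d));    // [a0, a1, b0, b1]
        System.out.println(power(l, 2));     // [aa, ab, ba, bb]
    }
}
```

Note that `concat` pairs every s ∈ L with every t ∈ M, which is why the definition of LM requires both memberships at once.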


Regular Expressions (REs) over Alphabet ∑

• Inductive base:
  1. ε is a RE, denoting the regular language (RL) {ε}
  2. each a ∈ ∑ is a RE, denoting the RL {a}
• Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then
  3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
  4. (r)(s) is a RE, denoting the RL L(r)L(s)
  5. (r)* is a RE, denoting the RL (L(r))*
  6. (r) is a RE, denoting the RL L(r)


Precedence and Associativity

• Precedence:
  – “*” has the highest precedence
  – “concatenation” has the second highest precedence
  – “|” has the lowest precedence
• Associativity:
  – all are left-associative

E.g.: (a)|((b)*(c)) ≡ a|b*c

Unnecessary parentheses can be removed


Example

• ∑ = {a, b}

1. a|b denotes {a,b}

2. (a|b)(a|b) denotes {aa,ab,ba,bb}

3. a* denotes {ε,a,aa,aaa,aaaa,…}

4. (a|b)* denotes ?

5. a|a*b denotes ?


Notational Shorthands

• One or more instances +: r+ = rr*
  – denotes the language (L(r))+
  – has the same precedence and associativity as *
• Zero or one instance ?: r? = r|ε
  – denotes the language L(r) ∪ {ε}
• Character classes
  – [abc] denotes a|b|c
  – [A-Z] denotes A|B|…|Z
  – [a-zA-Z_][a-zA-Z0-9_]* denotes ?
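
These shorthands are also available in practical regex libraries. As a rough illustration, the sketch below feeds the last character-class expression above to Java's `java.util.regex`; the class name `Shorthands`, the helper methods and the `NUM` pattern are assumptions made for this example and do not come from the slides.

```java
import java.util.regex.Pattern;

class Shorthands {
    // The character-class expression from the slide: one letter or underscore,
    // followed by zero or more letters, digits or underscores.
    static final Pattern ID = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // Hypothetical numeric pattern showing + and ?: one or more digits,
    // optionally followed by a fraction.
    static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?");

    // matches() succeeds only if the whole string belongs to the language.
    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("x2"));     // true
        System.out.println(isId("2x"));     // false
        System.out.println(isNum("3.14"));  // true
        System.out.println(isNum("3."));    // false
    }
}
```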


Overview

[Figure: RE → NFA (Algorithm 3.3), NFA → DFA (Algorithm 3.2), DFA → mDFA (Algorithm 3.6); one further arrow is labelled 3.5.]


Nondeterministic finite automata

• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
  – a finite set of states S
  – a set of input symbols ∑
  – a transition function move: S × (∑ ∪ {ε}) → 2^S, mapping a state and an input symbol (or ε) to a set of states
  – a start state s0
  – a finite set of final or accepting states F


Transition graph

[Figure: notation used in transition graphs for a state, a transition (from A to B on input a), a start state, and a final state.]


Transition table

State    Input symbol
         a        b
0        {0,1}    {0}
1        -        {2}
2        -        {3}


Acceptance

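• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.

[Figure: a two-state automaton with states A and B and edges labelled 0 and 1.]

Input 01010: A –0→ B –1→ A –0→ B –1→ A –0→ B

Input 01011: A –0→ B –1→ A –0→ B –1→ A –1→ ? (error: no transition)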
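
Acceptance can be checked mechanically by tracking every state the NFA could be in after each input symbol; the string is accepted if an accepting state is among them at the end. The sketch below does this for the NFA in the transition table shown earlier (states 0 to 3 over {a, b}). The table does not say which states are the start and accepting states, so the sketch assumes start state 0 and accepting state 3; the class name `NfaSim` is also made up. This NFA has no ε-transitions, so no ε-closure step is needed.

```java
import java.util.HashSet;
import java.util.Set;

class NfaSim {
    // TABLE[state][symbol] lists the successor states; symbol 0 is 'a', 1 is 'b'.
    static final int[][][] TABLE = {
        { {0, 1}, {0} },   // state 0
        { {},     {2} },   // state 1
        { {},     {3} },   // state 2
        { {},     {}  },   // state 3: no outgoing edges listed in the table
    };
    static final int START = 0;    // assumption: not stated in the table
    static final int ACCEPT = 3;   // assumption: not stated in the table

    // True iff some path labelled with input leads from START to ACCEPT.
    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(START);
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') {
                return false;              // symbol outside the alphabet
            }
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                for (int t : TABLE[s][c - 'a']) {
                    next.add(t);
                }
            }
            current = next;
        }
        return current.contains(ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabb"));  // true
        System.out.println(accepts("ab"));    // false
    }
}
```

Under these assumptions the automaton accepts exactly the strings over {a, b} that end in abb.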


Deterministic finite automata

• A deterministic finite automaton (DFA) is a special case of NFA in which

1. no state has an ε-transition, and

2. for each state s and input symbol a, there is at most one edge labeled a leaving s.


Thompson’s construction of NFA from REs

• guided by the syntactic structure of the RE r

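• For ε: [Figure: start state i with an edge labelled ε to accepting state f]

• For a in ∑: [Figure: start state i with an edge labelled a to accepting state f]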


Thompson’s construction (cont’d)

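• Suppose N(s) and N(t) are NFA’s for REs s and t
  – For s|t: [Figure: N(s) and N(t) placed between a start state i and an accepting state f]
  – For st: [Figure: N(s) followed by N(t), from start state i to accepting state f]
  – For s*: [Figure: a single NFA placed between a start state i and an accepting state f]
  – For (s), use N(s) itself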


Subset construction

Operation: Description

ε-closure(s): set of NFA states reachable from state s on ε-transitions alone

ε-closure(T): set of NFA states reachable from some state s in T on ε-transitions alone

move(T,a): set of NFA states to which there is a transition on input a from some state s in T

• s : an NFA state
• T : a set of NFA states


Subset construction (cont’d)

Let s0 be the start state of the NFA;
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;


DFA

• Let (∑, S, T, F, s0) be the original NFA. The DFA is:
  – The alphabet: ∑
  – The states: all states in Dstates
  – The transitions: DTran
  – The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
  – The start state: ε-closure(s0)
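
The subset construction can be sketched in Java as follows. This is an illustration under assumptions rather than a reference implementation: the NFA is encoded as an array in which column 0 holds ε-moves and columns 1 and 2 hold moves on a and b, and the names `SubsetConstruction`, `epsClosure`, `move` and `build` are chosen for this example. The sample NFA is the one from the transition table shown earlier (no ε-moves, start state 0 assumed).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    // NFA[state][symbol] = successor states; symbol 0 is ε, 1 is a, 2 is b.
    static final int[][][] NFA = {
        { {}, {0, 1}, {0} },
        { {}, {},     {2} },
        { {}, {},     {3} },
        { {}, {},     {}  },
    };

    // ε-closure(T): all states reachable from T on ε-transitions alone.
    static Set<Integer> epsClosure(int[][][] delta, Set<Integer> t) {
        Set<Integer> closure = new TreeSet<>(t);
        Deque<Integer> stack = new ArrayDeque<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : delta[s][0]) {
                if (closure.add(u)) {
                    stack.push(u);
                }
            }
        }
        return closure;
    }

    // move(T, a): states reachable from some state in T on input symbol a.
    static Set<Integer> move(int[][][] delta, Set<Integer> t, int symbol) {
        Set<Integer> result = new TreeSet<>();
        for (int s : t) {
            for (int u : delta[s][symbol]) {
                result.add(u);
            }
        }
        return result;
    }

    // Returns DTran: each DFA state (a set of NFA states) mapped to its
    // successors on symbols 1..nSymbols. The key set plays the role of Dstates.
    static Map<Set<Integer>, List<Set<Integer>>> build(int[][][] delta, int start, int nSymbols) {
        Map<Set<Integer>, List<Set<Integer>>> dtran = new LinkedHashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        Set<Integer> first = epsClosure(delta, Collections.singleton(start));
        dtran.put(first, new ArrayList<>());
        unmarked.add(first);
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.remove();          // mark T
            for (int a = 1; a <= nSymbols; a++) {
                Set<Integer> u = epsClosure(delta, move(delta, t, a));
                if (!dtran.containsKey(u)) {             // U is a new DFA state
                    dtran.put(u, new ArrayList<>());
                    unmarked.add(u);
                }
                dtran.get(t).add(u);                     // DTran[T, a] := U
            }
        }
        return dtran;
    }

    public static void main(String[] args) {
        for (Map.Entry<Set<Integer>, List<Set<Integer>>> e : build(NFA, 0, 2).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For the sample NFA this yields four DFA states: {0}, {0,1}, {0,2} and {0,3}. If some move(T, a) were empty, the empty set would appear as a dead DFA state, which matches the pseudocode.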


Minimise a DFA

Initially, create two groups of states (each group becomes one state of the minimal DFA):
  1. one is the set of all final states: F
  2. the other is the set of all non-final states: S - F

while (more splits are possible) {
    Let G = {s1,…, sn} be a group and c be any char in ∑
    Let t1,…, tn be the successor states to s1,…, sn under c
    if (t1,…, tn don't all belong to the same group) {
        Split G into new groups so that si and sj remain in the same group iff ti and tj are in the same group
    }
}


Example

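[Figure: DFA over {a, b} with states A, B, C, D, E; E is the only final state.]

Step 1: {A,B,C,D} {E}
  For a, {B,B,B,B}
  For b, {C,D,C,E}
  Split {A,B,C} {D} {E}

Step 2:
  For b, {C,D,C}
  Split {A,C} {B} {D} {E}

Step 3:
  For a, {B,B}
  For b, {C,C}
  Terminate

[Figure: the minimised DFA with states A, B, D, E; C has been merged into A.]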
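
The refinement steps in this example can be reproduced with a short program. The sketch below is a variant of the splitting loop, not a transcription of it: on each pass it regroups all states at once by the groups of their successors on every symbol, which reaches the same final partition. States A to E are numbered 0 to 4. The transitions of A to D are read off the steps above; the transitions of E (to B on a, to C on b) are an assumption, since the steps do not show them. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MinimiseDfa {
    // DELTA[state] = { successor on a, successor on b }, with A..E as 0..4.
    static final int[][] DELTA = {
        {1, 2},   // A
        {1, 3},   // B
        {1, 2},   // C
        {1, 4},   // D
        {1, 2},   // E (assumed transitions)
    };
    static final boolean[] FINAL = {false, false, false, false, true};

    // Returns group[s] = index of the group that state s ends up in.
    static int[] minimise(int[][] delta, boolean[] isFinal) {
        int n = delta.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) {
            group[s] = isFinal[s] ? 1 : 0;        // initial split: S - F and F
        }
        int count = 0;
        while (true) {
            // Two states stay together iff they are in the same group now
            // and their successors on every symbol are in the same groups.
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int t : delta[s]) {
                    signature.add(group[t]);
                }
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                next[s] = id;
            }
            if (ids.size() == count) {
                return next;                      // no group was split: done
            }
            count = ids.size();
            group = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(minimise(DELTA, FINAL)));  // [0, 1, 0, 2, 3]
    }
}
```

States A and C receive the same group number, matching the final partition {A,C} {B} {D} {E}.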


Input Buffering

[Figure: a two-half input buffer holding "begin…", read by the scanner; eof marks the end of the input.]

if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++


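Input Buffering

[Figure: a two-half input buffer holding "begin…", read by the scanner, with an eof sentinel at the end of each half and a third eof marking the end of the input.]

forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}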
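
A minimal Java sketch of the sentinel scheme follows. It rests on several assumptions: the character '\0' stands in for eof and never occurs in the source text, the class name `SentinelBuffer` and its methods are invented for this example, and lexemes that span a reload are not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SentinelBuffer {
    static final char EOF = '\0';   // sentinel; assumed absent from the input
    private final int n;            // size of each half
    private final char[] buf;       // layout: [half 1][EOF][half 2][EOF]
    private final Reader in;
    private int forward = -1;

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        buf[n] = EOF;
        buf[2 * n + 1] = EOF;
        load(0);
    }

    // Fill one half with up to n characters; if the input runs out,
    // place EOF right after the last character read.
    private void load(int offset) throws IOException {
        int got = 0;
        while (got < n) {
            int r = in.read(buf, offset + got, n - got);
            if (r < 0) {
                break;
            }
            got += r;
        }
        if (got < n) {
            buf[offset + got] = EOF;
        }
    }

    // Returns the next character, or -1 at the end of the input.
    int next() throws IOException {
        forward++;
        if (buf[forward] == EOF) {
            if (forward == n) {                  // end of first half
                load(n + 1);
                forward++;
            } else if (forward == 2 * n + 1) {   // end of second half
                load(0);
                forward = 0;
            }
            if (buf[forward] == EOF) {           // eof inside a half: terminate
                forward--;
                return -1;
            }
        }
        return buf[forward];
    }

    // Convenience helper for trying the buffer out on a string.
    static String readAll(String text, int n) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), n);
            StringBuilder out = new StringBuilder();
            for (int c = b.next(); c != -1; c = b.next()) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("begin := x", 4));  // begin := x
    }
}
```

The benefit of the sentinels is that the common case costs a single comparison per character (is it eof?) instead of two end-of-half tests.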


Transition Diagrams

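relop → <= | < | <>
  state 0: on < go to state 1
  state 1: on = go to state 2; on > go to state 3; on other go to state 4
  state 2: return(relop,LE)
  state 3: return(relop,NE)
  state 4: return(relop,LT)

id → letter(letter|digit)*
  state 5: on letter go to state 6
  state 6: on letter or digit stay in state 6; on other go to state 7
  state 7: return(id,lexeme)

A transition diagram is a DFA in which no edge leaves a final state.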


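Implementation

    Token nexttoken() {
      while (true) {
        switch (state) {
          case 0:                          // start of the relop diagram
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);          // not a relop: try the next diagram
            break;
          case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
          case 2:
            retract(0);
            return new Token(relop, "<=");
          case 3:                          // state 3 of the diagram: return(relop,NE)
            retract(0);
            return new Token(relop, "<>");
          case 4:
            retract(1);                    // give back the lookahead character
            return new Token(relop, "<");
          case 5:                          // start of the id diagram
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
          case 6:
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;  // stay in state 6
            else state = 7;
            break;
          case 7:
            retract(1);                    // give back the lookahead character
            return new Token(id, getLexeme());
        }
      }
    }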


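Implementation (cont’d)

    int fail(int current_state) {
      forward = beginning;             // rewind to the start of the lexeme
      switch (current_state) {
        case 0: return 5;              // relop failed: try the id diagram
        case 5: error();               // no diagram matched
      }
      return -1;                       // placeholder; only reached after error()
    }

    void retract(int flag) {
      if (flag == 1)
        move forward back              // un-read the lookahead character
      get lexeme from beginning to forward
      move forward onward
      beginning = forward
      state = 0
    }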

b│e│g│i│n│:│=│ │ │…
