CS308 Compiler Principles
Lexical Analyzer
Li Jiang
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Outline
• Content:
• Basic concepts: pattern, lexeme, and token.
• Operations on languages, and regular expression
• Recognition of tokens
• Finite automata, including NFA and DFA
• Conversion from regular expression to NFA and DFA
• Optimization of lexical analyzer
Lexical Analyzer
• Lexical Analyzer reads the source program character by character to produce tokens.
– strips out comments and whitespaces
– returns a token when the parser asks for one
– correlates error messages with the source program
Token
• A token is a pair of a token name and an optional attribute value.
– Token name specifies the pattern of the token
– Attribute stores the lexeme of the token
• Tokens
– Keyword: “begin”, “if”, “else”, …
– Identifier: string of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Punctuation symbol: “,”, “;”, “(”, “)”, …
• Regular expressions are widely used to specify patterns of the tokens.
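A minimal sketch of how such patterns can drive a tokenizer, assuming Python's `re` module and the four token classes listed above. The names `TOKEN_SPEC`, `KEYWORDS`, and `tokenize` are illustrative choices, not part of the slides.

```python
import re

KEYWORDS = {"begin", "if", "else"}

# One named group per token class (hypothetical names for illustration).
TOKEN_SPEC = [
    ("integer", r"[0-9]+"),                  # a non-empty string of digits
    ("id", r"[A-Za-z][A-Za-z0-9]*"),         # letters or digits, starting with a letter
    ("punct", r"[,;()]"),                    # punctuation symbols
    ("skip", r"\s+"),                        # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, attribute) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        name, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if name == "skip":
            continue
        if name == "punct" or (name == "id" and lexeme in KEYWORDS):
            tokens.append((lexeme, None))    # the token name alone is enough
        else:
            tokens.append((name, lexeme))    # the attribute stores the lexeme
    return tokens


print(tokenize("if (x1) y; else 42"))
```

Keywords are matched by the identifier pattern and then looked up in a table, which is one common design; a separate pattern per keyword would also work.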
Attributes of Token
• Information for subsequent compiler phases about the particular lexeme
– Token name influences parsing decisions
– Attribute value influences translation of tokens after the parse
• Attributes of identifier
– Lexeme, type, location
– Stored in symbol table
• Tricky problem
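– DO 5 I = 1.25 vs. DO 5 I = 1,25 (Fortran ignores blanks, so the first is an assignment to the variable DO5I and the second begins a DO loop; the lexer cannot tell which until it reaches the dot or the comma)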
Token Example
Input Buffering
• Why does a compiler need buffers?
• Buffer Pairs: alternately reload
• Two pointers
– lexemeBegin
– forward
• Sentinels: a mark for buffer end
• Problem case: length of lexeme + lookahead distance > buffer size (the start of the lexeme is overwritten before the lexeme is recognized)
Lookahead with Sentinels
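The buffer-pair scheme with sentinels can be sketched as follows. This is an illustration under stated assumptions, not the slide's own code: the class name `BufferPair`, the half size `n`, and the use of `"\0"` as the sentinel are all hypothetical, and only the `forward` pointer is modeled (`lexemeBegin` is omitted).

```python
import io

EOF = "\0"  # sentinel character; assumed never to occur in the source text


class BufferPair:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [0..n-1] first half, [n] sentinel,
        #         [n+1..2n] second half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start):
        """Fill the half beginning at index start; mark a short read with EOF."""
        data = self.stream.read(self.n)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        if len(data) < self.n:
            self.buf[start + len(data)] = EOF  # eof inside a half: end of input

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                        # common case: one test per character
        if self.forward == self.n + 1:       # sentinel of the first half
            self._load(self.n + 1)           # reload the second half
            return self.next_char()
        if self.forward == 2 * self.n + 2:   # sentinel of the second half
            self._load(0)                    # reload the first half
            self.forward = 0
            return self.next_char()
        self.forward -= 1                    # real end of input: stay on the marker
        return None


def read_all(text, n=4):
    """Drain a BufferPair over text and return everything it produced."""
    bp = BufferPair(io.StringIO(text), n)
    out = []
    while (ch := bp.next_char()) is not None:
        out.append(ch)
    return "".join(out)


print(read_all("abcdefghij"))
```

The point of the sentinel is visible in `next_char`: the ordinary path performs a single comparison, and the buffer-boundary checks run only when the sentinel is actually seen.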
Terminology of Languages
• Alphabet: a finite set of symbols
– ASCII
– Unicode
• String: a finite sequence of symbols on an alphabet
– ε is the empty string
– |s| is the length of string s
– Concatenation: xy represents x followed by y
– Exponentiation: $s^n = s\,s \cdots s$ (n times), $s^0 = \varepsilon$
• Language: a set of strings over some fixed alphabet
– ∅, the empty set, is a language
– The set of well-formed C programs is a language
Operations on Languages
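• Union: $L_1 \cup L_2 = \{\, s \mid s \in L_1 \text{ or } s \in L_2 \,\}$
• Concatenation: $L_1 L_2 = \{\, s_1 s_2 \mid s_1 \in L_1 \text{ and } s_2 \in L_2 \,\}$
• (Kleene) Closure: $L^* = \bigcup_{i=0}^{\infty} L^i$
• Positive Closure: $L^+ = \bigcup_{i=1}^{\infty} L^i$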
Example
• $L_1 = \{a,b,c,d\}$, $L_2 = \{1,2\}$
• $L_1 \cup L_2 = \{a,b,c,d,1,2\}$
• $L_1 L_2 = \{a1,a2,b1,b2,c1,c2,d1,d2\}$
• $L_1^*$ = all strings using letters a,b,c,d including the empty string
• $L_1^+$ = all strings using letters a,b,c,d without the empty string
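The operations in this example can be tried out on finite sets of Python strings. The sketch below is illustrative only: the function names are hypothetical, and because the closures are infinite sets, `closure_up_to` truncates the union at a chosen bound `n`.

```python
def union(l1, l2):
    """Strings that are in l1 or in l2."""
    return l1 | l2


def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}


def power(lang, i):
    """L^i: lang concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result


def closure_up_to(lang, n, positive=False):
    """Finite approximation of L* (or L+): the union of L^i for i up to n."""
    out = set()
    for i in range(1 if positive else 0, n + 1):
        out |= power(lang, i)
    return out


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(len(closure_up_to(L1, 2)), len(closure_up_to(L1, 2, positive=True)))
```

With the bound 2, the closure holds 1 + 4 + 16 strings and the positive closure holds the same strings minus the empty one.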
Regular Expressions
• A regular expression is a representation of a language, built from operators applied to the symbols of some alphabet.
• A regular expression is built up from smaller regular expressions (using defining rules).
• Each regular expression r denotes a language L(r).
• A language denoted by a regular expression is called a regular set.
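Regular Expressions (Rules)
Regular expressions over alphabet Σ

| Reg. Expr | Language it denotes |
|---|---|
| ε | L(ε) = {ε} |
| a ∈ Σ | L(a) = {a} |
| (r1) \| (r2) | L(r1) ∪ L(r2) |
| (r1) (r2) | L(r1) L(r2) |
| (r)* | (L(r))* |
| (r) | L(r) |

Extension

| Reg. Expr | Language it denotes | Name |
|---|---|---|
| (r)+ = (r)(r)* | (L(r))+ | positive closure |
| (r)? = (r) \| ε | L(r) ∪ {ε} | zero or one instance |
| [a1-an] | L(a1\|a2\|…\|an) | character class |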
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules:
– * highest
– concatenation second highest
– | lowest
• (a(b)*)|(c) = ab*|c
• Example:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε,0,00,000,0000,....}
– (0|1)* => all strings with 0 and 1, including the empty string
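These examples can be checked mechanically. The sketch below assumes Python's `re` module, whose `*`, concatenation, and `|` follow the same precedence order; the helper name `language_up_to` is hypothetical, and it enumerates strings only up to a chosen length since the languages may be infinite.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """All strings over alphabet of length <= max_len that fully match pattern."""
    regex = re.compile(pattern)
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                matches.append(s)
    return matches


print(language_up_to("0|1", "01", 3))
print(language_up_to("(0|1)(0|1)", "01", 3))
print(language_up_to("0*", "01", 3))
print(language_up_to("ab*|c", "abc", 2))   # parsed as (a(b)*)|(c)
```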
Lex regular expression
Regular Definitions
• We can give names to regular expressions, and use these names as symbols to define other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where each di is a new, distinct symbol and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}, that is, the alphabet plus the previously defined symbols.
Regular Definitions Example
• Example: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
– If we try to write the regular expression representing identifiers without using regular definitions, that regular expression will be complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
Q: unsigned numbers (integer or floating point)
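Regular definitions map naturally onto named string fragments. The sketch below assumes Python's `re` module; the definition chosen for unsigned numbers (digits, an optional fraction, an optional exponent) is one common textbook answer to the question above and is an assumption, since the slide leaves it open.

```python
import re

# Each name is defined only in terms of the alphabet and earlier names.
letter = r"[A-Za-z]"
digit = r"[0-9]"
id_ = rf"{letter}({letter}|{digit})*"

# Assumed definition of unsigned numbers (not given on the slide).
digits = rf"{digit}+"
optional_fraction = rf"(\.{digits})?"
optional_exponent = rf"(E[+-]?{digits})?"
number = rf"{digits}{optional_fraction}{optional_exponent}"


def is_id(s):
    """True if the whole string s matches the id definition."""
    return re.fullmatch(id_, s) is not None


def is_number(s):
    """True if the whole string s matches the number definition."""
    return re.fullmatch(number, s) is not None


for text in ["x1", "1x", "42", "3.14", "6.02E23", "1."]:
    print(text, is_id(text), is_number(text))
```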
Quiz
1. All strings of lowercase letters that contain the five vowels in order.
2. All strings of lowercase letters in which the letters are in ascending lexicographic order.
3. Comments, consisting of a string surrounded by /* and */, without an intervening */, unless it is inside double-quotes (“). [HOMEWORK]
Recognition of token
[Figure: a grammar and the regular definitions for its tokens]
• Regular definitions express the pattern
• Goal: find a prefix that is a lexeme matching the pattern
Transition Diagram
• State: represents a condition that could occur during scanning
– start/initial state
– accepting/final state: lexeme found
– intermediate state
• Edge: directed from one state to another, labeled with one symbol or a set of symbols
Transition Diagram for relop
Transition diagram for relop → < | > | <= | >= | = | <>
Among the lexemes that match the pattern for relop, which ones can we be looking at once the first input symbol has been seen?
Transition-Diagram-Based Lexical Analyzer
Implementation of relop transition diagram
• A variable holds the number of the current state
• Switch statement or multi-way branch: determines the next state by reading and examining the next input character
• In each state: find the edge, take the action
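A sketch of such an implementation in Python, using an if/elif chain in place of a switch statement. The state numbers (0, 1, 6) and the attribute names (LT, LE, EQ, NE, GT, GE) are assumptions made for illustration; the original diagram is not reproduced here.

```python
def get_relop(text, pos=0):
    """Recognize a relop starting at text[pos].

    Returns ((token_name, attribute), next_pos), or None if no relop starts there.
    """
    state = 0                      # holds the number of the current state
    forward = pos
    while True:
        c = text[forward] if forward < len(text) else ""   # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward + 1
            elif c == ">":
                state = 6
            else:
                return None        # not a relop
        elif state == 1:           # seen "<"
            if c == "=":
                return ("relop", "LE"), forward + 1
            if c == ">":
                return ("relop", "NE"), forward + 1
            return ("relop", "LT"), forward   # retract: c is not part of the lexeme
        elif state == 6:           # seen ">"
            if c == "=":
                return ("relop", "GE"), forward + 1
            return ("relop", "GT"), forward   # retract
        forward += 1


print(get_relop("<= 3"))
print(get_relop("<x"))
```

The two "retract" returns correspond to accepting states reached on a lookahead character that must be given back to the input.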
Transition Diagram for Others
A transition diagram for id's
A transition diagram for unsigned numbers
What about the transition diagram of letter/digit?
Finite Automata
• A finite automaton is a recognizer that takes a string, and answers “yes” if the string matches a pattern of a specified language, and “no” otherwise.
• Two kinds:
– Nondeterministic finite automaton (NFA)
  • no restriction on the labels of their edges
– Deterministic finite automaton (DFA)
  • for each state and each input symbol, exactly one edge labeled with that symbol leaves the state
• NFAs and DFAs have the same capability: both recognize exactly the regular languages
• We may use an NFA or a DFA as a lexical analyzer
Nondeterministic Finite Automaton (NFA)
• An NFA consists of:
– S: a finite set of states
– Σ: a set of input symbols (alphabet)
– A transition function: maps state-symbol pairs to sets of states
– s0: a start (initial) state
– F: a set of accepting states (final states)
• An NFA can be represented by a transition graph
• It accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
• Remarks
– The same symbol can label edges from one state to several different states
– An edge may be labeled by ε, the empty string
NFA Example (1)
The language recognized by this NFA is (a|b)*ab
NFA Example (2)
NFA accepting aa* |bb*
Implementing an NFA
S ← ε-closure({s0}) { set of all states accessible from s0 by ε-transitions }
c ← nextchar()
while (c != eof) do
begin
    S ← ε-closure(move(S,c)) { move(S,c): set of all states accessible from a state in S by a transition on c }
    c ← nextchar()
end
if (S ∩ F != ∅) then { if S contains an accepting state }
    return “yes”
else
    return “no”
• The set-of-states update is the same computation used by the subset construction.
• Backtracking may be needed to identify the longest match.
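The simulation can be sketched in Python as follows. The transition table is an assumed NFA for aa*|bb* (start state 0, accepting states 2 and 4) written down for illustration; the dictionary encoding and the function names are hypothetical.

```python
EPS = ""  # label used for ε-transitions

# Assumed NFA for aa*|bb*: state 0 branches by ε into an "a" part and a "b" part.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START = 0
ACCEPTING = {2, 4}


def eps_closure(states):
    """All states reachable from states using ε-transitions only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, c):
    """All states reachable from a state in states by one transition on c."""
    out = set()
    for s in states:
        out |= NFA.get((s, c), set())
    return out


def nfa_accepts(x):
    """Track the set S of possible states while reading x."""
    current = eps_closure({START})
    for c in x:
        current = eps_closure(move(current, c))
    return bool(current & ACCEPTING)   # S ∩ F is non-empty


print(nfa_accepts("aaa"), nfa_accepts("b"), nfa_accepts("ab"))
```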
Exercise 3
• For the NFA in the following figure, indicate all the paths labeled aabb. Does the NFA accept aabb?
• Give the transition table.
Paths labeled aabb (((3)) marks the accepting state; the first path ends there, so the NFA accepts aabb):
- (0) -a-> (1) -a-> (2) -b-> (2) -b-> ((3))
- (0) -a-> (1) -a-> (2) -b-> (2) -b-> (2)
- (0) -a-> (0) -a-> (0) -b-> (0) -b-> (0)
- (0) -a-> (0) -a-> (1) -b-> (1) -b-> (1)
- (0) -a-> (1) -a-> (1) -b-> (1) -b-> (1)
- (0) -a-> (1) -a-> (2) -b-> (2) -ε-> (0) -b-> (0)
- (0) -a-> (1) -a-> (2) -ε-> (0) -b-> (0) -b-> (0)
Deterministic Finite Automaton (DFA)
• A Deterministic Finite Automaton (DFA) is a special form of an NFA.
– No state has an ε-transition
– For each symbol a and state s, there is at most one edge labeled a leaving s.
The language recognized by this DFA is (a|b)*ab
[Figure: transition graph of the DFA]
Practice
• Draw the transition diagram for recognizing the following regular expression: a(a|b)*a
[Figure: a nondeterministic diagram (states 1, 2, 3; edge labels a, a, a|b) and a deterministic diagram (states 1, 2, 3; edge labels a, a, b, b, a)]
Implementing a DFA
s ← s0 { start from the initial state }
c ← nextchar { get the next character from the input string }
while (c != eof) do { do until the end of the string }
begin
    s ← move(s,c) { transition function }
    c ← nextchar
end
if (s in F) then { if s is an accepting state }
    return “yes”
else
    return “no”
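The loop above translates almost line for line into Python. The transition table below is a hypothetical DFA for a(a|b)*a with states 1, 2, 3 (start 1, accepting 3), written for illustration; a missing table entry is treated as a dead state, which matches the "at most one edge" definition.

```python
# Hypothetical DFA for a(a|b)*a: state 2 means "saw the leading a",
# state 3 means "the last character read was a closing a".
DFA = {
    (1, "a"): 2,
    (2, "a"): 3,
    (2, "b"): 2,
    (3, "a"): 3,
    (3, "b"): 2,
}
START = 1
ACCEPTING = {3}


def dfa_accepts(x):
    """Run the DFA on x, keeping track of a single current state."""
    s = START
    for c in x:
        s = DFA.get((s, c))        # the transition function move(s, c)
        if s is None:
            return False           # no edge on c: the string is rejected
    return s in ACCEPTING


print(dfa_accepts("aa"), dfa_accepts("abba"), dfa_accepts("ab"))
```

Only one state is tracked per input character, which is why the DFA loop is faster than maintaining a set of NFA states.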
NFA vs. DFA
| | Compactness | Readability | Speed |
|---|---|---|---|
| NFA | Good | Good | Slow |
| DFA | Bad | Bad | Fast |

• DFAs are widely used to build lexical analyzers.
• Maintaining a set of states is more complex than keeping track of a single state.
[Figure: an NFA and a DFA, both recognizing (a|b)*ab]
Pop Quiz
1) What are the languages represented by the two FAs?
[Figure: (a) an FA with states 1 to 9 and edges labeled 0 and 1; (b) an FA with states 1 to 5 and edges labeled a]
Solution (a): strings of 0s and 1s with length 4, except 0110 (fixed pattern)
Solution (b): a(aaaaa)* (closure)
Regular Expression → NFA
• McNaughton-Yamada-Thompson (MYT) construction
– Simple and systematic (recursive up the parse tree for the regular expression)
– Construction starts from the simplest parts (alphabet symbols).
– For a complex regular expression, subexpressions are combined to create its NFA.
– Guarantees the resulting NFA will have exactly one final state, and one start state.
MYT Construction
• Basic rules: for subexpressions with no operators
– For expression ε: start → (i) -ε-> ((f))
– For a symbol a in the alphabet: start → (i) -a-> ((f))
MYT Construction Cont’d
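• Inductive rules: for constructing larger NFAs from the NFAs of subexpressions (let N(r1) and N(r2) denote the NFAs for regular expressions r1 and r2, respectively)
– For regular expression r1 | r2: a new start state i has ε-transitions to the start states of N(r1) and N(r2), and the accepting states of N(r1) and N(r2) have ε-transitions to a new accepting state f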
MYT Construction Cont’d
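– For regular expression r1r2: the start state of N(r1) becomes the start state i, the accepting state of N(r2) becomes the accepting state f, and the accepting state of N(r1) is merged with the start state of N(r2)
– For regular expression r*: a new start state i has ε-transitions to the start state of N(r) and to a new accepting state f; the accepting state of N(r) has ε-transitions back to the start state of N(r) and to f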
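The basic and inductive rules can be sketched as a recursive function over a small regular-expression tree. Everything here is illustrative: the tuple encoding of the tree (`"sym"`, `"eps"`, `"or"`, `"cat"`, `"star"`) and the function names are hypothetical, and for concatenation this sketch links the two sub-NFAs with an ε-edge instead of merging their states, a simplification that accepts the same language.

```python
from itertools import count

EPS = ""  # label used for ε-transitions


def add_edge(delta, src, label, dst):
    delta.setdefault((src, label), set()).add(dst)


def thompson(node, delta, fresh):
    """Add the NFA for node to delta and return its (start, accept) states."""
    kind = node[0]
    if kind in ("eps", "sym"):                       # basic rules
        i, f = next(fresh), next(fresh)
        add_edge(delta, i, EPS if kind == "eps" else node[1], f)
        return i, f
    if kind == "or":                                 # r1 | r2
        i1, f1 = thompson(node[1], delta, fresh)
        i2, f2 = thompson(node[2], delta, fresh)
        i, f = next(fresh), next(fresh)
        for target in (i1, i2):
            add_edge(delta, i, EPS, target)
        for source in (f1, f2):
            add_edge(delta, source, EPS, f)
        return i, f
    if kind == "cat":                                # r1 r2
        i1, f1 = thompson(node[1], delta, fresh)
        i2, f2 = thompson(node[2], delta, fresh)
        add_edge(delta, f1, EPS, i2)                 # ε-link instead of merging (assumption)
        return i1, f2
    if kind == "star":                               # r*
        i1, f1 = thompson(node[1], delta, fresh)
        i, f = next(fresh), next(fresh)
        for source in (i, f1):
            add_edge(delta, source, EPS, i1)
            add_edge(delta, source, EPS, f)
        return i, f
    raise ValueError(f"unknown node kind: {kind}")


def accepts(tree, x):
    """Build the NFA for tree, then simulate it on x with sets of states."""
    delta = {}
    start, accept = thompson(tree, delta, count())

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in delta.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for c in x:
        moved = set()
        for s in current:
            moved |= delta.get((s, c), set())
        current = closure(moved)
    return accept in current


# (a|b)*a as a tree
TREE = ("cat", ("star", ("or", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
print(accepts(TREE, "abba"), accepts(TREE, "ab"))
```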
Example: (a|b)*a
[Figure: NFAs built step by step for a, b, (a|b), (a|b)*, and (a|b)*a]
Properties of the Constructed NFA
1. N(r) has at most twice as many states as there are operators and operands in r.
– This bound follows from the fact that each step of the algorithm creates at most two new states.
2. N(r) has one start state and one accepting state. The accepting state has no outgoing transitions, and the start state has no incoming transitions.
3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ ∪ {ε} or two outgoing transitions, both on ε.
Conversion of an NFA to a DFA
• Approach: Subset Construction
– each state of the constructed DFA corresponds to a set / combination of NFA states
• Details
① Create transition table Dtran for the DFA
② Insert ε-closure(s0) into Dstates as the initial state
③ Pick an unvisited state T in Dstates
④ For each symbol a, create state ε-closure(move(T, a)), and add it to Dstates and Dtran
⑤ Repeat steps ③ and ④ until all states in Dstates are visited
The Subset Construction
Simulate in parallel all possible moves the NFA can make on the input a
NFA to DFA Example
NFA for (a|b)*abb
A = ε-closure({0}) = {0,1,2,4,7}; A into DS as an unmarked state
mark A
ε-closure(move(A,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B; B into DS
ε-closure(move(A,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C; C into DS
transfunc[A,a] ← B, transfunc[A,b] ← C
mark B
ε-closure(move(B,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
ε-closure(move(B,b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D
transfunc[B,a] ← B, transfunc[B,b] ← D
mark C
ε-closure(move(C,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
ε-closure(move(C,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
transfunc[C,a] ← B, transfunc[C,b] ← C
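The same computation can be run to completion in Python. The NFA table below is an assumption: an 11-state NFA for (a|b)*abb (start 0, accepting 10) chosen so that its ε-closures agree with the sets listed above. The function names and the A, B, C, ... naming helper are hypothetical.

```python
EPS = ""  # label used for ε-transitions

# Assumed NFA for (a|b)*abb, consistent with the closures in the worked example.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
ALPHABET = "ab"


def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, a):
    return {t for s in states for t in NFA.get((s, a), ())}


def subset_construction(start=0):
    """Return (Dstates in discovery order, Dtran) for the NFA above."""
    first = eps_closure({start})
    dstates = [first]
    dtran = {}
    unmarked = [first]
    while unmarked:
        T = unmarked.pop(0)                  # pick an unmarked state and mark it
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if not U:
                continue                     # no move on a: leave the entry empty
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


def named_table(dstates, dtran):
    """Rename the state sets A, B, C, ... in discovery order."""
    name = {s: chr(ord("A") + i) for i, s in enumerate(dstates)}
    return {(name[T], a): name[U] for (T, a), U in dtran.items()}


DSTATES, DTRAN = subset_construction()
for (state, symbol), target in sorted(named_table(DSTATES, DTRAN).items()):
    print(f"transfunc[{state},{symbol}] = {target}")
```

Run on this NFA, the loop also processes the states reached after C, so the resulting table has five DFA states in total.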
NFA to DFA Example
[Figure: NFA for (a|b)*abb, transition table for the DFA, and the equivalent DFA]
Quiz 1
Suppose we have two tokens: (1) the keyword if, and (2) identifiers, which are strings of letters other than if. Show:
1. The NFA for these tokens, and
2. The DFA for these tokens
[Figure: the NFA and the DFA]
Regular Expression → DFA
• First, augment the given regular expression by concatenating a special symbol #
r → (r)# augmented regular expression
• Second, create a syntax tree for the augmented regular expression.
– All leaves are alphabet symbols (plus # and the empty string)
– All inner nodes are operators
• Third, number each alphabet symbol (plus #) (position numbers)
Regular Expression → DFA Cont’d
(a|b)*a → (a|b)*a# augmented regular expression
[Figure: syntax tree of (a|b)*a#, with leaves a, b, a, # at positions 1, 2, 3, 4]
• each symbol is at a leaf
• each symbol is numbered (positions)
• inner nodes are operators
followpos
Then we define the function followpos for the positions (positions assigned to leaves).
followpos(i) -- the set of positions which can follow the position i in the strings generated by the augmented regular expression.
Example: (a|b)*a# with positions a = 1, b = 2, a = 3, # = 4
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
followpos() is defined only for leaves, not for inner nodes.
firstpos, lastpos, nullable
• To compute followpos, we need three more functions defined for the nodes (not just for leaves) of the syntax tree.
– firstpos(n) -- the set of the positions of the first symbols of strings generated by the sub-expression rooted at n.
– lastpos(n) -- the set of the positions of the last symbols of strings generated by the sub-expression rooted at n.
– nullable(n) -- true if the empty string is a member of the strings generated by the sub-expression rooted at n; false otherwise
Usage of the Functions
(a|b)*a → (a|b)*a# augmented regular expression
[Figure: syntax tree of (a|b)*a#, with two inner nodes marked n and m]
nullable(n) = false
nullable(m) = true
firstpos(n) = {1, 2, 3}
lastpos(n) = {3}
![Page 55: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/55.jpg)
Computing nullable, firstpos, lastpos

| Node n | nullable(n) | firstpos(n) | lastpos(n) |
|---|---|---|---|
| leaf labeled ε | true | ∅ | ∅ |
| leaf labeled with position i | false | {i} | {i} |
| or-node with children c1, c2 | nullable(c1) or nullable(c2) | firstpos(c1) ∪ firstpos(c2) | lastpos(c1) ∪ lastpos(c2) |
| cat-node with children c1, c2 | nullable(c1) and nullable(c2) | if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1) | if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2) |
| star-node with child c1 | true | firstpos(c1) | lastpos(c1) |

Straightforward recursion on the height of the tree.
![Page 56: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/56.jpg)
Thinking
Extend the above table to include two more operations: (a) ? and (b) +

| Node n | nullable(n) | firstpos(n) | lastpos(n) |
|---|---|---|---|
| ?-node with child c1 | true | firstpos(c1) | lastpos(c1) |
| +-node with child c1 | nullable(c1) | firstpos(c1) | lastpos(c1) |
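One way to justify these two rows is to rewrite the new operators with the ones already in the table. The derivation below is a sketch, not part of the slides; it assumes the usual identities c1? = c1 | ε and c1+ = c1 c1*, and treats the repeated c1 in c1+ as carrying the same positions.

```latex
\begin{align*}
\mathrm{nullable}(c_1?)  &= \mathrm{nullable}(c_1) \lor \mathrm{nullable}(\varepsilon) = \mathrm{true} \\
\mathrm{firstpos}(c_1?)  &= \mathrm{firstpos}(c_1) \cup \emptyset = \mathrm{firstpos}(c_1) \\
\mathrm{lastpos}(c_1?)   &= \mathrm{lastpos}(c_1) \cup \emptyset = \mathrm{lastpos}(c_1) \\[4pt]
\mathrm{nullable}(c_1^{+}) &= \mathrm{nullable}(c_1) \land \mathrm{nullable}(c_1^{*}) = \mathrm{nullable}(c_1) \\
\mathrm{firstpos}(c_1^{+}) &= \mathrm{firstpos}(c_1) \cup \mathrm{firstpos}(c_1^{*}) = \mathrm{firstpos}(c_1)
  \quad \text{(the union is only taken when } c_1 \text{ is nullable, and adds nothing)} \\
\mathrm{lastpos}(c_1^{+})  &= \mathrm{lastpos}(c_1) \cup \mathrm{lastpos}(c_1^{*}) = \mathrm{lastpos}(c_1)
\end{align*}
```

Under the same reading, a +-node would also need the star-node treatment when followpos is computed, since repetition lets every position in firstpos(c1) follow every position in lastpos(c1).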
![Page 57: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/57.jpg)
How to evaluate followpos
• Two rules define the function followpos:
1. If n is a concatenation-node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
• Once firstpos and lastpos have been computed for each node, followpos of each position can be computed in one depth-first traversal of the syntax tree.
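The two rules can be folded into the same post-order pass that computes nullable, firstpos and lastpos. The following is a minimal sketch, not the course's reference implementation: the tuple encoding of tree nodes and the names `analyze` and `EXAMPLE_TREE` are assumptions made for illustration.

```python
# Hypothetical tuple encoding of syntax-tree nodes (not from the slides):
#   ("eps",)                      leaf labeled with the empty string
#   ("leaf", symbol, position)    leaf carrying a position number
#   ("or", c1, c2), ("cat", c1, c2), ("star", c1)


def analyze(node, followpos):
    """Post-order pass: return (nullable, firstpos, lastpos) and fill followpos."""
    kind = node[0]
    if kind == "eps":
        return True, frozenset(), frozenset()
    if kind == "leaf":
        followpos.setdefault(node[2], set())  # every position gets an entry
        pos = frozenset([node[2]])
        return False, pos, pos
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        # Rule 1: firstpos(c2) follows every position in lastpos(c1).
        for i in l1:
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        # Rule 2: firstpos(n) follows every position in lastpos(n).
        for i in l1:
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind!r}")


# Syntax tree of (a|b)*a# with positions a=1, b=2, a=3, #=4.
EXAMPLE_TREE = (
    "cat",
    ("cat", ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))), ("leaf", "a", 3)),
    ("leaf", "#", 4),
)

followpos = {}
root_info = analyze(EXAMPLE_TREE, followpos)
print(root_info)
print(followpos)
```

Children are analyzed before their parent, so by the time a rule fires every position it touches already has an entry in `followpos`.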
![Page 58: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/58.jpg)
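Example -- (a|b)*a#
[Figure: syntax tree of (a|b)*a# with each node annotated with firstpos (red) and lastpos (blue).]
– leaf a (position 1): firstpos = {1}, lastpos = {1}
– leaf b (position 2): firstpos = {2}, lastpos = {2}
– or-node a|b: firstpos = {1,2}, lastpos = {1,2}
– star-node (a|b)*: firstpos = {1,2}, lastpos = {1,2}
– leaf a (position 3): firstpos = {3}, lastpos = {3}
– cat-node (a|b)*a: firstpos = {1,2,3}, lastpos = {3}
– leaf # (position 4): firstpos = {4}, lastpos = {4}
– root cat-node (a|b)*a#: firstpos = {1,2,3}, lastpos = {4}
Then we can calculate followpos:
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
• After we calculate the follow positions, we are ready to create the DFA for the regular expression.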
![Page 59: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/59.jpg)
Algorithm (RE → DFA)
1. Create the syntax tree of (r)#.
2. Calculate nullable, firstpos, lastpos, followpos.
3. Put firstpos(root) into the states of the DFA as an unmarked state.
4. while (there is an unmarked state S in the states of the DFA) do
– mark S
– for each input symbol a do
• let s1,...,sn be the positions in S whose symbol is a
• S’ ← followpos(s1) ∪ ... ∪ followpos(sn)
• Dtran[S,a] ← S’
• if (S’ is not in the states of the DFA)
– put S’ into the states of the DFA as an unmarked state.
• The start state of the DFA is firstpos(root).
• The accepting states of the DFA are all states containing the position of #.
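The marking loop can be sketched in a few lines. This is an illustrative version under stated assumptions, not a definitive implementation: the function name `build_dfa`, its parameters, and the choice to leave out transitions whose target set is empty (instead of adding an explicit dead state) are all assumptions.

```python
from collections import deque


def build_dfa(first_root, followpos, symbol_at, end_marker="#"):
    """Build DFA states and Dtran from followpos.

    first_root: firstpos(root); followpos: position -> set of positions;
    symbol_at: position -> symbol. All names here are hypothetical.
    """
    start = frozenset(first_root)
    states = [start]            # discovery order: states[0] is S1, states[1] is S2, ...
    dtran = {}
    unmarked = deque([start])
    alphabet = sorted(set(symbol_at.values()) - {end_marker})
    while unmarked:
        state = unmarked.popleft()          # taking S off the queue "marks" it
        for a in alphabet:
            # Union of followpos(p) over the positions p in S labeled with a.
            target = frozenset().union(
                *(followpos[p] for p in state if symbol_at[p] == a)
            )
            if not target:
                continue                    # assumption: no explicit dead state
            dtran[(state, a)] = target
            if target not in states:
                states.append(target)
                unmarked.append(target)
    accepting = [s for s in states if any(symbol_at[p] == end_marker for p in s)]
    return start, states, dtran, accepting


# (a|b)*a# with positions a=1, b=2, a=3, #=4.
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}

start, states, dtran, accepting = build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT)
for (src, sym), dst in dtran.items():
    print(sorted(src), sym, "->", sorted(dst))
```

Representing each DFA state as a frozenset of positions makes the "is S’ already a state?" check a plain membership test.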
![Page 60: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/60.jpg)
Example -- (a|b)*a#
Positions: a = 1, b = 2, a = 3, # = 4
followpos(1)={1,2,3} followpos(2)={1,2,3} followpos(3)={4} followpos(4)={}
S1 = firstpos(root) = {1,2,3}
mark S1
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, so Dtran[S1,a] = S2
b: followpos(2) = {1,2,3} = S1, so Dtran[S1,b] = S1
mark S2
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, so Dtran[S2,a] = S2
b: followpos(2) = {1,2,3} = S1, so Dtran[S2,b] = S1
start state: S1
accepting states: {S2}
[Figure: resulting DFA with transitions S1 --a--> S2, S1 --b--> S1, S2 --a--> S2, S2 --b--> S1.]
![Page 61: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/61.jpg)
Example -- (a|ε)bc*#
Positions: a = 1, b = 2, c = 3, # = 4
followpos(1)={2} followpos(2)={3,4} followpos(3)={3,4} followpos(4)={}
S1 = firstpos(root) = {1,2}
mark S1
a: followpos(1) = {2} = S2, so Dtran[S1,a] = S2
b: followpos(2) = {3,4} = S3, so Dtran[S1,b] = S3
mark S2
b: followpos(2) = {3,4} = S3, so Dtran[S2,b] = S3
mark S3
c: followpos(3) = {3,4} = S3, so Dtran[S3,c] = S3
start state: S1
accepting states: {S3}
[Figure: resulting DFA with transitions S1 --a--> S2, S1 --b--> S3, S2 --b--> S3, S3 --c--> S3.]
![Page 62: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/62.jpg)
Minimizing Number of DFA States
• For any regular language, there is always a unique minimum-state DFA (up to renaming of states), which can be constructed from any DFA of the language.
• Algorithm:
– Partition the set of states into two groups:
• G1 : set of accepting states
• G2 : set of non-accepting states
– For each new group G:
• partition G into subgroups such that states s1 and s2 are in the same subgroup iff, for all input symbols a, s1 and s2 have transitions on a to states in the same group.
– Repeat until no group can be split any further.
– Start state of the minimized DFA is the group containing the start state of the original DFA.
– Accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.
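The refinement loop can be sketched as follows. This is a simple illustrative version of the partitioning idea (not Hopcroft's faster algorithm); the function name `minimize` and the dictionary encoding of Dtran are assumptions. Missing transitions are treated as going to a shared "no state" target.

```python
def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of states as a list of sets.

    dtran maps (state, symbol) -> state; entries may be missing.
    """
    all_states = set(states)
    accepting = set(accepting) & all_states
    # Initial partition: accepting vs. non-accepting (empty groups dropped).
    partition = [g for g in (accepting, all_states - accepting) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # Which group does each symbol lead to? (None if no transition.)
                signature = tuple(
                    group_of.get(dtran.get((s, a))) for a in alphabet
                )
                buckets.setdefault(signature, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):   # nothing was split: done
            return refined
        partition = refined


# Four-state example with groups {1,2,3} and {4}. The outgoing transitions of
# state 4 are not listed in the slide text, so they are left out here; state 4
# is alone in its group, so this does not affect the resulting partition.
DTRAN = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 3,
}
groups = minimize([1, 2, 3, 4], ["a", "b"], DTRAN, accepting=[4])
print(sorted(sorted(g) for g in groups))
```

Each surviving group becomes one state of the minimized DFA; its transitions can be read off from any representative member.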
![Page 63: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/63.jpg)
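Minimizing DFA – Example (1)
[Figure: DFA with states 1, 2 and 3 and transitions on a and b; state 2 is the accepting state.]
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned because
Dtran[1,a]=2 Dtran[1,b]=3
Dtran[3,a]=2 Dtran[3,b]=3
So the minimized DFA (with the minimum number of states) is
[Figure: minimized DFA with two states, 1 and 2, and transitions on a and b.]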
![Page 64: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/64.jpg)
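Minimizing DFA – Example (2)
[Figure: DFA with states 1, 2, 3 and 4 and transitions on a and b.]
Groups: {1,2,3} {4}
Transitions of the states in {1,2,3}:
on a: 1->2, 2->2, 3->4
on b: 1->3, 2->3, 3->3
On a, states 1 and 2 stay inside {1,2,3} while state 3 moves to {4}, so {1,2,3} splits into {1,2} and {3}.
No more partitioning is possible.
Minimized DFA
[Figure: minimized DFA with three states, 1, 2 and 3, and transitions on a and b.]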
![Page 65: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/65.jpg)
Architecture of A Lexical Analyzer
![Page 66: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/66.jpg)
An NFA for Lex program
• Create an NFA for each regular expression
• Combine all the NFAs into one
• Introduce a new start state
• Connect it with ε-transitions to the start states of the NFAs
![Page 67: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/67.jpg)
Pattern Matching with NFA
① The lexical analyzer reads the input and computes the set of states it is in after each symbol.
② Eventually, it reaches a point with no next state.
③ It looks backwards through the sequence of state sets until it finds a set that includes one or more accepting states.
④ If that set includes several accepting states, it picks the one associated with the earliest pattern in the Lex program.
⑤ It performs the action associated with that pattern.
![Page 68: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/68.jpg)
Pattern Matching with NFA -- Example
Input: aaba
Report pattern: a*b+
![Page 69: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/69.jpg)
Pattern Matching with DFA
① Convert the NFA for all the patterns into an equivalent DFA. For each DFA state that contains more than one accepting NFA state, choose the pattern defined earliest as the output of that DFA state.
② Simulate the DFA until there is no next state.
③ Trace back to the most recent accepting DFA state, and perform the associated action.
Input: abba
State sequence: 0137 → 247 → 58 → 68
Report pattern abb
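The simulate-then-trace-back procedure can be sketched by remembering the last accepting state seen while scanning. The transition table below is an assumption: the state names 0137, 247, 58 and 68 come from the slide, but states 7 and 8, the individual edges, and the pattern list (a, abb, a*b+, in that priority order) are reconstructed from the standard textbook example and are not confirmed by the slide text.

```python
# Assumed DFA for the patterns a, abb, a*b+ (listed in that order).
DTRAN = {
    ("0137", "a"): "247", ("0137", "b"): "8",
    ("247", "a"): "7",    ("247", "b"): "58",
    ("7", "a"): "7",      ("7", "b"): "8",
    ("8", "b"): "8",
    ("58", "b"): "68",
    ("68", "b"): "8",
}

# Output pattern of each accepting DFA state. State 68 contains accepting NFA
# states for both abb and a*b+; abb is listed first, so it wins.
ACCEPT = {"247": "a", "8": "a*b+", "58": "a*b+", "68": "abb"}


def longest_match(text, start="0137"):
    """Return (pattern, lexeme) for the longest matching prefix, or None."""
    state = start
    last_accept = None
    for i, ch in enumerate(text):
        state = DTRAN.get((state, ch))
        if state is None:            # no next state: stop simulating
            break
        if state in ACCEPT:          # remember the most recent accepting state
            last_accept = (ACCEPT[state], text[: i + 1])
    return last_accept


print(longest_match("abba"))
print(longest_match("aaba"))
```

Recording the last accepting state during the forward scan gives the same result as walking back through the state sequence afterwards, without having to store the whole sequence.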
![Page 70: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/70.jpg)
Summary
• How lexical analyzers work
– Convert REs to NFA
– Convert NFA to DFA
– Minimize DFA
– Use the minimized DFA to recognize tokens in the input
– Resolve ambiguity with pattern priorities and the longest-match rule
![Page 71: CS416 Compiler Design - SJTUjiangli/teaching/CS308/CS308-slides02.pdf• Lexical Analyzer reads the source program character by character to produce tokens. ... Compiler Principles](https://reader031.fdocuments.in/reader031/viewer/2022013104/5aab17c67f8b9a8f498b73fc/html5/thumbnails/71.jpg)
Homework
• Check the web page!!!