![Page 1: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/1.jpg)
Lexical Analysis III: Recognizing Tokens
Lecture 4
CS 4318/5331
Apan Qasem
Texas State University
Spring 2015
![Page 2: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/2.jpg)
Announcements
• Assg 1 due this Friday at 11:59 PM
• Test instances on github
• No lecture at RRC this week
![Page 3: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/3.jpg)
Lexical Analysis
int main() {
  int i
  for (i = 0; i < MAX; i++)
    printf("Hello World");
}
Scanner
<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <ID,i>
<KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;>
<ID,i> <OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,"> <STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>
What do we do if we encounter a missing semi-colon?
Nothing!
![Page 5: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/5.jpg)
Lexical Analysis
int main() {
  int i;
  for (i = 0; i < MAX; i++)
    abcprintf("Hello World");
}
Scanner
<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <ID,i>
<SEP,;> <KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;> <ID,i>
<OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,abcprintf> <OP,(> <OP,">
<STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>
What do we do if we encounter an undefined function name?
Nothing!
![Page 7: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/7.jpg)
Lexical Analysis
intmain(){inti;for(i=0;i<MAX;i++)printf("Hello World");}
Scanner
<ID,intmain> <OP,(> <OP,)> <OP,{> <ID,inti> <SEP,;>
<KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;> <ID,i>
<OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,">
<STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>
Legal C program? No
Passes scanner? Yes
![Page 8: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/8.jpg)
Lexical Analysis
int main() {
  int %$*&i;
  for (i = 0; i < MAX; i++)
    printf("Hello World");
}
Scanner
<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <OP,%>
<ID,$> <OP,*> <OP,&> <ID,i> <SEP,;> <KEYWORD,for> <OP,(> <ID,i>
<OP,=> <CONST,0> <SEP,;> <ID,i> <OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,"> <STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>
What’s an illegal C program at the scanner phase?
Very few! C/C++ has become too large!
![Page 9: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/9.jpg)
Breaking Down Lexical Analysis Further …
1. Specify patterns for tokens
• Look at the language description and identify the types of tokens needed for the language
  • usually trivial
• Use regular expressions to specify a pattern for each token
  • patterns for some tokens are trivial
2. Recognize patterns in the input stream and generate tokens for the parser
![Page 10: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/10.jpg)
Recognizing Tokens
• We can specify the regular expression while for the while keyword in C
• How do we recognize it if we see it in the input stream?
  • Essentially a pattern-matching algorithm
![Page 11: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/11.jpg)
Code for Recognizing while
if (nextchar() == 'w')
  if (nextchar() == 'h')
    if (nextchar() == 'i')
      if (nextchar() == 'l')
        if (nextchar() == 'e')
          return KEYWORD_WHILE;
        else
          // do something
      else
        // do something
    else
      // do something
  else
    // do something
else
  // do something
This approach works for more complex REs as well
while (nextchar() == 'a' || …)
Need to decide what to do for strings like when
Need to account for strings like whileabc
Need to account for strings like abcwhile
Can we generate this code automatically?
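One way to handle the problem strings listed above (when, whileabc, abcwhile) is to consume the whole identifier-like run first and only then decide whether it is exactly while. The sketch below assumes this longest-match policy. The names scan_word, TOK_ID, and TOK_ERROR are hypothetical and not from the lecture, and the lexeme is read from a string rather than through nextchar().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token { TOK_ERROR, TOK_ID, KEYWORD_WHILE };

/* True for characters that may continue an identifier. */
static int is_ident_char(int c) {
    return isalnum(c) || c == '_';
}

/* Classify the word starting at s. *len receives the number of
 * characters consumed. Hypothetical helper for illustration only. */
enum token scan_word(const char *s, size_t *len) {
    size_t n = 0;
    assert(s != NULL && len != NULL);

    /* A word must start with a letter or underscore. */
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_')) {
        *len = 0;
        return TOK_ERROR;
    }

    /* Longest match: consume the entire identifier-like run. */
    while (is_ident_char((unsigned char)s[n]))
        n++;
    *len = n;

    /* Only an exact five-character match is the keyword. */
    if (n == 5 && s[0] == 'w' && s[1] == 'h' && s[2] == 'i' &&
        s[3] == 'l' && s[4] == 'e')
        return KEYWORD_WHILE;
    return TOK_ID;
}
```

With this policy, when, whileabc, and abcwhile all come back as identifiers, and only a standalone while is reported as the keyword.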
![Page 12: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/12.jpg)
Code for Recognizing while
if (nextchar() == 'w')
  if (nextchar() == 'h')
    if (nextchar() == 'i')
      if (nextchar() == 'l')
        if (nextchar() == 'e')
          return KEYWORD_WHILE;
        else
          // do something
      else
        // do something
    else
      // do something
  else
    // do something
else
  // do something
Each ‘if clause’ represents a state
The state is determined solely based on what we have seen so far in the input stream
No need to go back and rescan input
At each state we make a decision to move to a new state based on the next input symbol
This is exactly the idea behind (deterministic) finite state machines
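The observation that each if clause is a state can be made explicit by replacing the nesting with a single state variable. This is a minimal sketch, assuming the whole input string must be exactly while. The function name matches_while and the state numbering are illustrative choices, not the lecture's.

```c
#include <assert.h>
#include <stddef.h>

/* States 0..5 mirror the five nested ifs; S_ERR is the error state. */
enum { S_ERR = -1, S_ACCEPT = 5 };

/* Returns 1 if s is exactly "while", 0 otherwise. */
int matches_while(const char *s) {
    static const char pattern[] = "while";
    int state = 0;                 /* start state: nothing seen yet */
    assert(s != NULL);

    for (size_t i = 0; s[i] != '\0'; i++) {
        /* The next state depends only on the current state and the
         * next input symbol; the input is never rescanned. */
        if (state >= 0 && state < S_ACCEPT && s[i] == pattern[state]) {
            state++;
        } else {
            state = S_ERR;
            break;
        }
    }
    return state == S_ACCEPT;
}
```

The loop touches each character once, which is the "no need to go back and rescan" property noted above.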
![Page 13: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/13.jpg)
Recognizing Tokens
General idea
• Consume a character from the input stream
• Based on the value of the character, move to a new state
• If the character just consumed
  • produces a valid token and there are no more characters to consume, then DONE
  • leads to a valid token, move to a valid state
  • produces an invalid token, go to the error state and finish
• Repeat
The above recognizes one token
![Page 14: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/14.jpg)
Recognizing Tokens
• Need to construct a recognizer based on regular expressions
• A recognizer for a regular expression is a machine that recognizes the language described by the RE
• Given an input string constructed from the alphabet, the recognizer will
  • Say “yes” if the string is in the language (ACCEPT)
  • Say “no” if the string is not in the language (REJECT)
• Implications
  • Must produce a yes or no answer on every input
  • Cannot say yes when the string is not in the language (false positives)
![Page 15: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/15.jpg)
RE and DFA
For every RE there is a recognizer that recognizes the corresponding RL
If you build it … it will be recognizable!
The recognizers are called deterministic finite automata (DFAs)
Kleene’s Theorem (1952)
![Page 16: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/16.jpg)
Deterministic Finite Automata
Formal mathematical construct
• Abstract state machines that can recognize regular languages
• A set of states with transitions defined on each input symbol on every state
• Formal definition in Text (Section 2.2.1)
• Convenient to reason about DFAs using state transition diagrams
![Page 17: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/17.jpg)
DFA Diagram
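[Diagram: s0 --i--> s1 --n--> s2 --t--> s3, with an error state E]
• s0: initial state
• i, n, t: input symbols labeling the transitions
• s3: final state
• E: error state
Error states are sometimes implicit
Only one initial state
Can have multiple final states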
![Page 18: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/18.jpg)
Acceptance Criteria for DFAs
• A DFA accepts a string if and only if the DFA ends up in a final state after consuming all input symbols
• Implications
  • A DFA built to recognize int will reject intmain
  • A DFA built to recognize intmain will reject int
Easy fix if we want the machine to recognize int AND intmain
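The slide does not spell out the easy fix. One plausible reading is a single DFA along the path i-n-t-m-a-i-n with two final states, one after int and one after intmain. The sketch below assumes that reading; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* DFA with states 0..7 along the path "intmain".
 * States 3 (after "int") and 7 (after "intmain") are final. */
int accepts_int_or_intmain(const char *s) {
    static const char path[] = "intmain"; /* state k --path[k]--> state k+1 */
    int state = 0;                        /* state 0 is the initial state */
    assert(s != NULL);

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (state < 7 && s[i] == path[state])
            state++;
        else
            return 0;                     /* implicit error state */
    }
    /* Accept only if all input is consumed and we ended in a final state. */
    return state == 3 || state == 7;
}
```

Strings such as in or intm consume all their input but stop in a non-final state, so they are rejected under the acceptance criterion above.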
![Page 19: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/19.jpg)
DFA Example : if
[Diagram: s0 --i--> s1 --f--> s2, where s2 is the final state]
![Page 20: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/20.jpg)
DFA Example: int | if
[Diagram: s0 --i--> s1; s1 --f--> s3; s1 --n--> s2 --t--> s4; s3 and s4 are final states]
![Page 21: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/21.jpg)
DFA for if | int
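[Diagram: s0 --i--> s1 --f--> s3 and s0 --i--> s2 --n--> s4 --t--> s5; s3 and s5 are final states]
Non-determinism: two transitions on i leave s0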
![Page 22: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/22.jpg)
DFA Example : Integers
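Σ = {0-9}
Digit : 0|1|2|3|…|9
Integer : 0 | (1|2|3|…|9)(Digit)*
[Diagram: s0 --0--> s2; s0 --1-9--> s1; s1 --0-9--> s1; s1 and s2 are final states; every other input leads to the error state E]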
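A DFA like this maps directly onto a transition table indexed by state and character class. The sketch below is one possible encoding, assuming s1 is the state reached on a nonzero digit and s2 the state reached on a single 0. The state numbering is one reading of the diagram, and the names classify, delta, and is_integer are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum state { S0, S1, S2, SE, NUM_STATES };          /* SE is the error state */
enum cls { C_ZERO, C_NONZERO, C_OTHER, NUM_CLASSES };

/* Map an input character to its character class. */
static enum cls classify(char c) {
    if (c == '0') return C_ZERO;
    if (c >= '1' && c <= '9') return C_NONZERO;
    return C_OTHER;
}

/* delta[state][class]: a transition is defined for every state on
 * every input class, as a DFA requires. */
static const enum state delta[NUM_STATES][NUM_CLASSES] = {
    /* S0 */ { S2, S1, SE },   /* 0 -> s2, 1-9 -> s1 */
    /* S1 */ { S1, S1, SE },   /* any digit loops on s1 */
    /* S2 */ { SE, SE, SE },   /* nothing may follow a leading 0 */
    /* SE */ { SE, SE, SE },   /* the error state is a trap */
};

/* Returns 1 if s matches 0 | (1-9)(Digit)*, 0 otherwise. */
int is_integer(const char *s) {
    enum state st = S0;
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++)
        st = delta[st][classify(s[i])];
    return st == S1 || st == S2;
}
```

Changing the recognized language only means changing the table; the driver loop stays the same, which is what makes automatic scanner generation practical.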
![Page 23: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/23.jpg)
REs and DFAs
Every RL has a DFA that recognizes it and every DFA has a corresponding RL
There are algorithms that allow us to convert an RE to a DFA and vice versa
We can automate scanning!
To convert REs to DFAs we first need to look at non-deterministic finite automata (NFAs)
![Page 24: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/24.jpg)
Non-determinism
DFAs do not allow non-determinism
• Must have a transition defined on every state on every possible input symbol
• Cannot move to a new state without consuming an input symbol
• Cannot have multiple transitions on the same input symbol
![Page 25: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/25.jpg)
NFA
• DFAs with ε-transitions
• To run NFAs, start at the initial state and guess the right transition at each step
  • Always guess correctly
  • If some sequence of correct guesses leads to a final state then accept
Sounds dubious
But works!
![Page 26: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/26.jpg)
NFA for if | int
[Diagram: s0 --i--> s1 --f--> s3 and s0 --i--> s2 --n--> s4 --t--> s5; s3 and s5 are final states]
NFA, multiple transitions on i in state s0
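"Always guess correctly" can be implemented without guessing by tracking every state the NFA could be in at once. The sketch below does this for the if | int NFA with a bitmask, assuming the wiring s0 -i-> s1 -f-> s3 and s0 -i-> s2 -n-> s4 -t-> s5 with s3 and s5 final. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(k) (1u << (k))   /* bit k set: the NFA may be in state s_k */

/* Apply every transition on c from every state in the current set. */
static unsigned nfa_step(unsigned set, char c) {
    unsigned next = 0;
    if ((set & BIT(0)) && c == 'i') next |= BIT(1) | BIT(2); /* both moves on i */
    if ((set & BIT(1)) && c == 'f') next |= BIT(3);
    if ((set & BIT(2)) && c == 'n') next |= BIT(4);
    if ((set & BIT(4)) && c == 't') next |= BIT(5);
    return next;
}

/* Returns 1 if some sequence of choices ends in a final state. */
int nfa_accepts(const char *s) {
    unsigned set = BIT(0);                 /* start in {s0} */
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0' && set != 0; i++)
        set = nfa_step(set, s[i]);
    return (set & (BIT(3) | BIT(5))) != 0; /* s3 or s5 reached */
}
```

After reading i the set is {s1, s2}, so both branches stay alive until the next character rules one out.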
![Page 27: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/27.jpg)
NFA and DFA
• Although NFAs allow non-determinism it has been shown that NFAs and DFAs are equivalent!
Scott and Rabin (1959)
• DFAs are just specialized forms of NFAs
• NFAs and DFAs both recognize the same set of languages
  • Can simulate a DFA with an NFA
  • Can construct corresponding DFAs for any NFA
• Implication
  • For every RE there is also an NFA
Relatively easy to construct an NFA from an RE
![Page 28: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/28.jpg)
RE to NFA : Empty String
1. ε is a regular expression that denotes {ε}, the set that contains the empty string
[Diagram: s0 --ε--> s1]
![Page 29: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/29.jpg)
RE to NFA : Symbol
2. For each a ∈ Σ, a is a regular expression denoting {a}, the set containing the string a.
[Diagram: s0 --a--> s1]
![Page 30: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/30.jpg)
RE to NFA : Union
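3. r | s is an RE denoting L(r) ∪ L(s)
e.g., RE = a | b, L(RE) = {a, b}
[Diagram: NFA for a: s0 --a--> s1; NFA for b: s0 --b--> s1]
[Diagram: NFA for a | b: s0 --ε--> s1 --a--> s3 --ε--> s5 and s0 --ε--> s2 --b--> s4 --ε--> s5]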
![Page 31: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/31.jpg)
RE to NFA : Concatenation
4. rs is an RE denoting L(r)L(s)
e.g., RE = ab, L(RE) = {ab}
[Diagram: NFA for a: s0 --a--> s1; NFA for b: s0 --b--> s1]
[Diagram: NFA for ab: s0 --a--> s1 --ε--> s2 --b--> s3]
![Page 32: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/32.jpg)
RE to NFA : Closure
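5. r* is an RE denoting L(r)*
e.g., RE = a*, L(RE) = {ε, a, aa, aaa, aaaa, …}
[Diagram: NFA for a: s0 --a--> s1]
[Diagram: NFA for a*: s0 --ε--> s1 --a--> s2 --ε--> s3, with ε-transitions from s2 back to s1 and from s0 directly to s3]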
![Page 33: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/33.jpg)
RE to NFA
• The algorithm for converting REs to NFAs is known as Thompson’s construction
  • Repeated application of the five conversion rules!
  • Named after Ken Thompson (1968)
![Page 34: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/34.jpg)
Example : NFA for a(b|c)*
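Work inside parentheses: b|c
[Diagram: NFA for b: s0 --b--> s1; NFA for c: s0 --c--> s1; a new start state s0 and a new final state s5 are added for the union]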
![Page 35: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/35.jpg)
Example : NFA for a(b|c)*
Work inside parentheses: b|c
[Diagram: s0 --ε--> s1 --b--> s3 --ε--> s5 and s0 --ε--> s2 --c--> s4 --ε--> s5]
Adjust final states
Rename states
![Page 36: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/36.jpg)
Example : NFA for a (b|c)*
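Step 3: * (closure)
(b | c)*
[Diagram: the NFA for b|c (s1 --b--> s3, s2 --c--> s4, with start s0 and final s5) wrapped in a new start state s0 and a new final state s5, before renaming]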
![Page 37: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/37.jpg)
Example : NFA for a (b|c)*
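Step 3: * (closure)
(b | c)*
[Diagram after renaming: s0 --ε--> s1; s1 --ε--> s2 --b--> s4 --ε--> s6; s1 --ε--> s3 --c--> s5 --ε--> s6; s6 --ε--> s7; ε-transitions from s6 back to s1 and from s0 directly to s7]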
![Page 38: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/38.jpg)
Example : NFA for a (b|c)*
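Step 4: concatenation
[Diagram: s0 --a--> s1 --ε--> s2 --ε--> s3; s3 --ε--> s4 --b--> s5 --ε--> s8; s3 --ε--> s6 --c--> s7 --ε--> s8; s8 --ε--> s9; ε-transitions from s8 back to s3 and from s2 directly to s9; s9 is the final state]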
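To check that the finished NFA really recognizes a(b|c)*, it can be simulated by alternating symbol moves with ε-closure. The sketch below hard-codes a ten-state NFA, assuming the standard Thompson wiring for the state names on this slide (the ε edges are listed in the eps table). The table and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(k) (1u << (k))   /* bit k set: the NFA may be in state s_k */
enum { NSTATES = 10, FINAL = 9 };

/* eps[k]: states reachable from s_k by one ε-transition (assumed wiring). */
static const unsigned eps[NSTATES] = {
    [1] = BIT(2),
    [2] = BIT(3) | BIT(9),   /* enter the loop, or skip it entirely */
    [3] = BIT(4) | BIT(6),   /* choose the b branch or the c branch */
    [5] = BIT(8),
    [7] = BIT(8),
    [8] = BIT(3) | BIT(9),   /* repeat the loop, or leave it */
};

/* Add every state reachable through ε-transitions until nothing changes. */
static unsigned eps_closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int k = 0; k < NSTATES; k++)
            if (set & BIT(k))
                set |= eps[k];
    } while (set != prev);
    return set;
}

/* Follow the labeled transitions on c from every state in the set. */
static unsigned nfa_move(unsigned set, char c) {
    unsigned next = 0;
    if ((set & BIT(0)) && c == 'a') next |= BIT(1);
    if ((set & BIT(4)) && c == 'b') next |= BIT(5);
    if ((set & BIT(6)) && c == 'c') next |= BIT(7);
    return next;
}

/* Returns 1 if s is in L(a(b|c)*), 0 otherwise. */
int accepts_a_bc_star(const char *s) {
    unsigned set = eps_closure(BIT(0));
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0' && set != 0; i++)
        set = eps_closure(nfa_move(set, s[i]));
    return (set & BIT(FINAL)) != 0;
}
```

Each distinct set of states this simulation can reach corresponds to one state of an equivalent DFA, which is the idea the subset construction builds on.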
![Page 39: Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.](https://reader036.fdocuments.in/reader036/viewer/2022062407/56649d355503460f94a0c8ac/html5/thumbnails/39.jpg)
Cycle of Construction
RE → NFA: Thompson’s Construction
NFA → DFA: Subset Construction
DFA → Minimized DFA: Hopcroft’s Algorithm
Minimized DFA → Code