Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt...
Compilation 0368-3133
Lecture 2: Lexical Analysis
Syntax Analysis (1): CFLs, CFGs, PDAs
Noam Rinetzky
What is a Compiler?
[Figure: source text (txt, in the source language) → Compiler → executable code (exe, in the target language)]
Compiler vs. Interpreter
Toy compiler
The Real Anatomy of a Compiler
[Figure: the phases of a compiler, from source text (txt) to executable code (exe):
Process text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Sem. Analysis → Annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → Symbolic Instructions (SI) → Target code optimization → SI → Machine code generation → MI → Write executable output.
Highlighted for this lecture: Lexical Analysis and Syntax Analysis.]
Lexical Analysis
Lexical analyzers are also known as "scanners".
Lexical Analysis: from Text to Tokens
[Figure: compiler phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.]
Source text (txt): x = b*b - 4*a*c
Token stream: <ID,"x"> <EQ> <ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">
What does Lexical Analysis do?
• Scans the input
• Partitions the text into a stream of tokens
  – Numbers
  – Identifiers
  – Keywords
  – Punctuation
• A token is a "word" in the source language that is "meaningful" to the syntactic analysis
• Tokens are usually represented as (kind, value)
• Tokens are defined using regular expressions
What does Lexical Analysis do? (cont.)
• Language: fully parenthesized expressions
• Example input: ( ( 23 + 7 ) * 19 )
• Grammar (the Expr rule makes the language context-free; the remaining rules describe regular languages):
Expr → Num | LP Expr Op Expr RP
Num → Dig | Dig Num
Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
LP → ‘(’
RP → ‘)’
Op → ‘+’ | ‘*’
• Token stream produced by the lexical analyzer for ( ( 23 + 7 ) * 19 ): LP LP Num Op Num RP Op Num RP
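As a rough illustration of how the regular part of this grammar (Num, LP, RP, Op) can be recognized character by character, here is a minimal hand-written sketch in C. The names `Tok`, `TokKind`, `next_tok`, and `tokenize`, as well as the extra `TOK_END` and `TOK_BAD` kinds, are assumptions of this sketch and do not come from the lecture.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds named after the grammar's terminals; TOK_END and TOK_BAD
   are additions of this sketch. */
typedef enum { TOK_LP, TOK_RP, TOK_OP, TOK_NUM, TOK_END, TOK_BAD } TokKind;

typedef struct {
    TokKind kind;
    long value;   /* numeric value for TOK_NUM, operator character for TOK_OP */
} Tok;

/* Reads one token starting at *p and advances *p past it. */
Tok next_tok(const char **p) {
    Tok t = { TOK_END, 0 };
    while (**p == ' ') (*p)++;                 /* blanks only separate tokens */
    char c = **p;
    if (c == '\0') return t;
    if (isdigit((unsigned char)c)) {           /* Num -> Dig | Dig Num */
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)**p))
            t.value = t.value * 10 + (*(*p)++ - '0');
        return t;
    }
    (*p)++;                                    /* all other tokens are one character */
    if (c == '(') t.kind = TOK_LP;
    else if (c == ')') t.kind = TOK_RP;
    else if (c == '+' || c == '*') { t.kind = TOK_OP; t.value = c; }
    else t.kind = TOK_BAD;
    return t;
}

/* Fills out[] with at most max tokens (TOK_END is not stored); returns the count. */
size_t tokenize(const char *s, Tok *out, size_t max) {
    size_t n = 0;
    for (;;) {
        Tok t = next_tok(&s);
        if (t.kind == TOK_END || n == max) return n;
        out[n++] = t;
    }
}
```

Calling `tokenize("( ( 23 + 7 ) * 19 )", toks, 16)` yields nine tokens in the order LP LP Num Op Num RP Op Num RP; checking that the parentheses are balanced is left to the parser.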
From scanning to parsing
Program text: ((23 + 7) * x)
Lexical Analyzer → token stream: LP LP Num OP Num RP OP Id RP
Parser (grammar: Expr → ... | Id, Id → ‘a’ | ... | ‘z’) → either a syntax error or, for valid input, an Abstract Syntax Tree
[Figure: Abstract Syntax Tree with root Op(*), whose children are Op(+) (with children Num(23) and Num(7)) and Id(x)]
Some basic terminology
• Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.)
• Pattern - a rule specifying a set of strings. Example: "an identifier is a string that starts with a letter and continues with letters and digits"
  – (Usually) a regular expression
• Token - a pair of (pattern, attributes)
Example
void match0(char *s) /* find a zero */
{
  if (!strncmp(s, "0.0", 3))
    return 0. ;
}

Token stream:
VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE
EOF
Regular languages
• Formal languages
  – Σ = finite set of letters (the alphabet)
  – Word = sequence of letters
  – Language = set of words
• Regular languages are defined equivalently by
  – Regular expressions
  – Finite-state automata
Example: Integers w/o Leading Zeros
Digit = 1|2|…|9
Digit0 = 0|Digit
Pos = Digit Digit0*
Integer = 0 | Pos | -Pos
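The four definitions can be checked by hand with one small C function per named expression. This is only a sketch to make the definitions concrete; the function names are assumptions of the example, and a generated scanner would use an automaton instead.

```c
#include <stdbool.h>

/* Digit = 1|2|...|9 */
static bool is_digit(char c)  { return c >= '1' && c <= '9'; }

/* Digit0 = 0|Digit */
static bool is_digit0(char c) { return c == '0' || is_digit(c); }

/* Pos = Digit Digit0*  (must match the whole string s) */
static bool is_pos(const char *s) {
    if (!is_digit(*s)) return false;
    for (s++; *s != '\0'; s++)
        if (!is_digit0(*s)) return false;
    return true;
}

/* Integer = 0 | Pos | -Pos */
bool is_integer(const char *s) {
    if (s[0] == '0' && s[1] == '\0') return true;
    if (*s == '-') s++;
    return is_pos(s);
}
```

With these definitions `is_integer` accepts "0", "42", and "-307", and rejects "007" and "-0", which is exactly the "no leading zeros" requirement.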
Challenge: Ambiguity
• If = if
• Id = Letter (Letter | Digit)*
• "if" is a valid identifier… what should it be?
• "iffy" is also a valid identifier
• Solution
  – Longest matching token
  – Break ties using order of definitions…
• Keywords should appear before identifiers
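The two disambiguation rules can be sketched directly for the If and Id patterns. The code below is a hand-written illustration under assumed names (`Rule`, `match_if`, `match_id`, `classify`); it tries every rule, keeps the longest match, and lets the earlier rule win a tie.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { RULE_IF, RULE_ID, RULE_NONE } Rule;   /* listed in priority order */

/* Length of the longest prefix of s matching If = if (0 if none). */
static size_t match_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Length of the longest prefix of s matching Id = Letter (Letter | Digit)*. */
static size_t match_id(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

/* Picks the rule with the longest match; on a tie the earlier rule wins. */
Rule classify(const char *s, size_t *len) {
    size_t lens[2] = { match_if(s), match_id(s) };
    Rule best = RULE_NONE;
    *len = 0;
    for (int r = 0; r < 2; r++) {
        if (lens[r] > *len) {      /* strictly longer only: ties keep the earlier rule */
            *len = lens[r];
            best = (Rule)r;
        }
    }
    return best;
}
```

For the input "if" both rules match two characters, so the keyword rule wins because it is listed first; for "iffy" the identifier rule matches four characters and wins on length.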
Building a Scanner – Take I
• Input: String
• Output: Sequence of tokens
Token nextToken() {
  int c;   /* int rather than char, so that EOF can be represented */
loop:
  c = getchar();
  switch (c) {
    case ' ': goto loop;
    case ';': return SemiColumn;
    case '+':
      c = getchar();
      switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default: ungetc(c, stdin); return Plus;
      }
    case '<': …
    case 'w': …
  }
}
There must be a better way!
A better way
• Automatically generate a scanner
• Define tokens using regular expressions
• Use finite-state automata for detection
Reg-exp vs. automata
• Regular expressions are declarative
  – Good for humans
  – Not "executable"
• Automata are operative
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans
Overview
• Define tokens using regular expressions
• Construct a nondeterministic finite-state automaton (NFA) from the regular expressions
• Determinize the NFA into a deterministic finite-state automaton (DFA)
• The DFA can be used directly to identify tokens
Automata theory: a bird's-eye view
Deterministic Automata (DFA)
• M = (Σ, Q, δ, q0, F)
  – Σ - alphabet
  – Q - finite set of states
  – q0 ∈ Q - initial state
  – F ⊆ Q - final states
  – δ : Q × Σ → Q - transition function
• For a word w, M reaches some state x
  – M accepts w if x ∈ F
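The five components of M map almost one-to-one onto a table-driven implementation. The sketch below uses a made-up automaton (an assumption of this example, not one from the slides) over Σ = {a, b, c} that accepts exactly the words ending in the letter a.

```c
#include <stdbool.h>

enum { NUM_STATES = 2, NUM_LETTERS = 3 };   /* Q = {0, 1}, Σ = {a, b, c} */

/* delta[q][letter]: the transition function δ : Q × Σ → Q (total). */
static const int delta[NUM_STATES][NUM_LETTERS] = {
    /*            a  b  c */
    /* q = 0 */ { 1, 0, 0 },
    /* q = 1 */ { 1, 0, 0 },
};
static const int q0 = 0;                                   /* initial state */
static const bool is_final[NUM_STATES] = { false, true };  /* F = {1} */

/* Runs M on w and reports whether the state reached is in F.
   A character outside Σ means w is not a word over Σ, so it is rejected. */
bool dfa_accepts(const char *w) {
    int q = q0;
    for (; *w != '\0'; w++) {
        if (*w < 'a' || *w > 'c') return false;
        q = delta[q][*w - 'a'];
    }
    return is_final[q];
}
```

Because δ is a total function, the run never gets stuck: every word over Σ ends in exactly one state x, and the answer is simply whether x ∈ F.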
DFA in pictures
• An automaton is defined by states and transitions
[Figure: DFA diagram showing a start state, an accepting state, and transitions labeled a; b,c; a,b; c; a,b,c]
Accepting Words
• Words are read left-to-right
• Missing transition = non-acceptance
  – "Stuck state"
[Figure sequence: an automaton with transitions labeled a, b, c reading the word cba one letter at a time]
Rejecting Words
• Words are read left-to-right
• Missing transition means non-acceptance
[Figure sequence: the same automaton reading the word cbb]
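The "stuck state" convention is easy to express with a partial transition table. The automaton below is a hypothetical chain that accepts only the word cba; it is an assumption of this sketch and is not meant to reproduce the diagram on the slides.

```c
#include <stdbool.h>

enum { STUCK = -1 };

/* Partial transition table over Σ = {a, b, c}; STUCK marks a missing transition. */
static const int step[4][3] = {
    /*            a      b      c     */
    /* 0 */ { STUCK, STUCK, 1     },
    /* 1 */ { STUCK, 2,     STUCK },
    /* 2 */ { 3,     STUCK, STUCK },
    /* 3 */ { STUCK, STUCK, STUCK },   /* accepting state */
};

bool chain_accepts(const char *w) {
    int q = 0;
    for (; *w != '\0'; w++) {
        if (*w < 'a' || *w > 'c') return false;
        q = step[q][*w - 'a'];
        if (q == STUCK) return false;    /* missing transition = non-acceptance */
    }
    return q == 3;
}
```

On cba the run visits states 0, 1, 2, 3 and accepts. On cbb the third letter finds no transition out of state 2, so the word is rejected without reading further.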
Non-deterministic Automata (NFA)
• M = (Σ, Q, δ, q0, F)
  – Σ - alphabet
  – Q - finite set of states
  – q0 ∈ Q - initial state
  – F ⊆ Q - final states
  – δ : Q × (Σ ∪ {ε}) → 2^Q - transition function
    • DFA: δ : Q × Σ → Q
• For a word w, M can reach a number of states X
  – M accepts w if X ∩ F ≠ {}
    • Possible: X = {}
• Possible ε-transitions
NFA
• Allow multiple transitions from a given state labeled by the same letter
[Figure: NFA with a start state and transitions labeled a, a, b, c, c, b]
Accepting words
• Maintain set of states
• Accept word if reached an accepting state
[Figure sequence: the NFA reading the word cba]
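"Maintain set of states" can be implemented compactly by storing the set as a bitmask. The NFA in this sketch is a hypothetical one (not the automaton from the figures): state 0 loops on every letter and also guesses, on b, that the word is about to end in ba.

```c
#include <stdbool.h>

/* A set of NFA states is a bitmask: bit q is set when state q is in the set. */
typedef unsigned StateSet;

enum { N_STATES = 3 };

/* nfa_delta[q][letter] is the SET of successors of q (columns a, b, c). */
static const StateSet nfa_delta[N_STATES][3] = {
    /* 0 */ { 1u << 0, (1u << 0) | (1u << 1), 1u << 0 },  /* two moves on b */
    /* 1 */ { 1u << 2, 0,                     0       },
    /* 2 */ { 0,       0,                     0       },
};
static const StateSet nfa_final = 1u << 2;   /* F = {2} */

/* One step: union of the successors of every state in the current set. */
static StateSet nfa_step(StateSet cur, int letter) {
    StateSet next = 0;
    for (int q = 0; q < N_STATES; q++)
        if (cur & (1u << q)) next |= nfa_delta[q][letter];
    return next;
}

bool nfa_accepts(const char *w) {
    StateSet cur = 1u << 0;                      /* X = {q0} */
    for (; *w != '\0'; w++) {
        if (*w < 'a' || *w > 'c') return false;
        cur = nfa_step(cur, *w - 'a');           /* X may become {} */
    }
    return (cur & nfa_final) != 0;               /* accept iff X ∩ F ≠ {} */
}
```

Reading cba, the set of states evolves as {0}, {0}, {0,1}, {0,2}; the final set contains the accepting state 2, so the word is accepted.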
NFA+ε automata
• ε transitions can "fire" without reading the input
[Figure: NFA with one ε transition and transitions labeled a, b, c]
NFA+ε run example
• Now the ε transition can non-deterministically take place
• Word accepted
[Figure sequence: the NFA+ε reading the word cba]
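Simulating ε transitions only requires closing the current set of states under ε moves before and after each letter. The automaton below is a hypothetical example, assumed for this sketch: an ε transition from state 1 to state 2 makes the letter b optional, so both cba and ca are accepted.

```c
#include <stdbool.h>

typedef unsigned StateSet;          /* bit q set <=> state q is in the set */
enum { N = 4 };

/* Letter transitions (columns a, b, c) and ε transitions of the example. */
static const StateSet on_letter[N][3] = {
    /* 0 */ { 0,       0,       1u << 1 },   /* 0 -c-> 1 */
    /* 1 */ { 0,       1u << 2, 0       },   /* 1 -b-> 2 */
    /* 2 */ { 1u << 3, 0,       0       },   /* 2 -a-> 3 */
    /* 3 */ { 0,       0,       0       },
};
static const StateSet on_eps[N] = { 0, 1u << 2, 0, 0 };   /* 1 -ε-> 2 */
static const StateSet final_states = 1u << 3;

/* Adds every state reachable through ε transitions alone. */
static StateSet eps_closure(StateSet s) {
    StateSet prev;
    do {
        prev = s;
        for (int q = 0; q < N; q++)
            if (s & (1u << q)) s |= on_eps[q];
    } while (s != prev);
    return s;
}

bool eps_nfa_accepts(const char *w) {
    StateSet cur = eps_closure(1u << 0);
    for (; *w != '\0'; w++) {
        if (*w < 'a' || *w > 'c') return false;
        StateSet next = 0;
        for (int q = 0; q < N; q++)
            if (cur & (1u << q)) next |= on_letter[q][*w - 'a'];
        cur = eps_closure(next);     /* ε transitions fire without reading input */
    }
    return (cur & final_states) != 0;
}
```

After reading c the state set is {1, 2} rather than {1}: the ε transition has already been taken into account, which is why the following letter may be either b or a.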
From regular expressions to NFA
• Step 1: assign expression names and obtain pure regular expressions R1…Rm
• Step 2: construct an NFA Mi for each regular expression Ri
• Step 3: combine all Mi into a single NFA
• Ambiguity resolution: prefer longest accepting word
From reg. exp. to automata
• Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
• Proof: by induction on the structure of the regular expression
Basic constructs
R = ε
R = φ (the empty language)
R = a
[Figure: one small automaton per base case; the automaton for R = a has a single transition labeled a]
Composition
R = R1 | R2
[Figure: M1 and M2 side by side, connected by four ε transitions]
R = R1R2
[Figure: M1 followed by M2, connected by three ε transitions]
Repetition
R = R1*
[Figure: M1 with four ε transitions around it]
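The inductive construction can be sketched as one C function per regular-expression operator, each returning a fragment with a single start and a single accepting state. This follows the usual textbook (Thompson-style) wiring, which may differ in detail from the figures: concatenation here uses a single ε transition. All names (`Frag`, `re_letter`, `re_alt`, `re_cat`, `re_star`, `frag_accepts`) are assumptions of the sketch, and there is no overflow check on `MAX_STATES`.

```c
#include <stdbool.h>
#include <string.h>

enum { MAX_STATES = 64, NONE = -1 };

/* Every state has either one letter edge or up to two ε edges. */
static int letter_of[MAX_STATES], letter_to[MAX_STATES];
static int eps1[MAX_STATES], eps2[MAX_STATES];
static int n_states = 0;

typedef struct { int start, accept; } Frag;   /* automaton Mi for a sub-expression */

static int new_state(void) {
    int s = n_states++;
    letter_of[s] = letter_to[s] = eps1[s] = eps2[s] = NONE;
    return s;
}
static Frag new_frag(void) {
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    return f;
}
static void add_eps(int from, int to) {
    if (eps1[from] == NONE) eps1[from] = to; else eps2[from] = to;
}

Frag re_letter(char a) {                  /* R = a */
    Frag f = new_frag();
    letter_of[f.start] = (unsigned char)a;
    letter_to[f.start] = f.accept;
    return f;
}
Frag re_alt(Frag m1, Frag m2) {           /* R = R1 | R2 */
    Frag f = new_frag();
    add_eps(f.start, m1.start);   add_eps(f.start, m2.start);
    add_eps(m1.accept, f.accept); add_eps(m2.accept, f.accept);
    return f;
}
Frag re_cat(Frag m1, Frag m2) {           /* R = R1 R2 */
    Frag f;
    add_eps(m1.accept, m2.start);
    f.start = m1.start;
    f.accept = m2.accept;
    return f;
}
Frag re_star(Frag m1) {                   /* R = R1* */
    Frag f = new_frag();
    add_eps(f.start, m1.start);   add_eps(f.start, f.accept);
    add_eps(m1.accept, m1.start); add_eps(m1.accept, f.accept);
    return f;
}

/* ε-closure of a state set stored as a bool array. */
static void closure(bool *set) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int s = 0; s < n_states; s++) {
            int e[2];
            if (!set[s]) continue;
            e[0] = eps1[s]; e[1] = eps2[s];
            for (int i = 0; i < 2; i++)
                if (e[i] != NONE && !set[e[i]]) { set[e[i]] = true; changed = true; }
        }
    }
}

/* Simulates the constructed NFA+ε on w by maintaining the set of current states. */
bool frag_accepts(Frag m, const char *w) {
    bool cur[MAX_STATES] = { false }, next[MAX_STATES];
    cur[m.start] = true;
    closure(cur);
    for (; *w != '\0'; w++) {
        memset(next, 0, sizeof next);
        for (int s = 0; s < n_states; s++)
            if (cur[s] && letter_of[s] == (unsigned char)*w) next[letter_to[s]] = true;
        closure(next);
        memcpy(cur, next, sizeof cur);
    }
    return cur[m.accept];
}
```

For example, a*b+ can be built as `re_cat(re_star(re_letter('a')), re_cat(re_letter('b'), re_star(re_letter('b'))))`, writing b+ as b b*; the resulting fragment accepts b, aab, and abb and rejects a and aba.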
Naïve approach
• Try each automaton separately
• Given a word w:
  – Try M1(w)
  – Try M2(w)
  – …
  – Try Mn(w)
• Requires resetting after every attempt
Actually, we combine automata
[Figure: combined NFA for the patterns a, abb, a*b+, abab. A new start state 0 has ε transitions to states 1, 3, 7 and 9.
  a:    1 -a→ 2 (accepting)
  abb:  3 -a→ 4 -b→ 5 -b→ 6 (accepting)
  a*b+: 7 -a→ 7, 7 -b→ 8, 8 -b→ 8 (8 accepting)
  abab: 9 -a→ 10 -b→ 11 -a→ 12 -b→ 13 (accepting)]
Corresponding DFA
[Figure: DFA for the combined NFA; each DFA state is a set of NFA states.
  Start state: {0 1 3 7 9}
  {0 1 3 7 9} -a→ {2 4 7 10} (accepts a)
  {0 1 3 7 9} -b→ {8} (accepts a*b+)
  {2 4 7 10} -a→ {7}
  {2 4 7 10} -b→ {5 8 11} (accepts a*b+)
  {7} -a→ {7}
  {7} -b→ {8}
  {8} -b→ {8}
  {5 8 11} -a→ {12}
  {5 8 11} -b→ {6 8} (accepts abb)
  {6 8} -b→ {8}
  {12} -b→ {13} (accepts abab)]
Scanning with DFA
• Run until stuck
  – Remember last accepting state
• Go back to accepting state
• Return token
Ambiguity resolution
• Longest word
• Tie-breaker based on order of rules when words have same length
Examples
• abaa: gets stuck after aba in state 12, backs up to state (5 8 11); pattern is a*b+, token is ab
  Tokens: <a*b+, ab> <a,a> <a,a>
• abba: stops after second b in (6 8); token is abb because it comes first in the spec
  Tokens: <abb, abb> <a,a>
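The "run until stuck, then back up to the last accepting state" procedure can be tried out on the DFA for a, abb, a*b+, abab. In the sketch below the DFA is written out by hand as a table, with each row commented by the set of NFA states it stands for; the numbering of DFA states 0 to 7 and the function name `longest_token` are assumptions of this example.

```c
#include <stddef.h>

enum { DEAD = -1, N_DFA = 8 };

/* DFA for the patterns a, abb, a*b+, abab; columns are the letters a and b. */
static const int dfa[N_DFA][2] = {
    /* 0 {0 1 3 7 9} */ { 1,    2    },
    /* 1 {2 4 7 10}  */ { 3,    4    },
    /* 2 {8}         */ { DEAD, 2    },
    /* 3 {7}         */ { 3,    2    },
    /* 4 {5 8 11}    */ { 6,    5    },
    /* 5 {6 8}       */ { DEAD, 2    },
    /* 6 {12}        */ { DEAD, 7    },
    /* 7 {13}        */ { DEAD, DEAD },
};

/* Pattern reported in each accepting state; NULL for non-accepting states.
   State 5 = {6 8} reports abb because abb precedes a*b+ in the specification. */
static const char *const pattern[N_DFA] = {
    NULL, "a", "a*b+", NULL, "a*b+", "abb", NULL, "abab"
};

/* Finds the longest token at the start of s. Returns its length and stores
   the pattern name in *pat; returns 0 (and NULL) when no token matches. */
size_t longest_token(const char *s, const char **pat) {
    int state = 0;
    size_t pos = 0, last_len = 0;
    *pat = NULL;
    while (s[pos] == 'a' || s[pos] == 'b') {
        state = dfa[state][s[pos] - 'a'];
        if (state == DEAD) break;             /* stuck */
        pos++;
        if (pattern[state] != NULL) {         /* remember last accepting state */
            last_len = pos;
            *pat = pattern[state];
        }
    }
    return last_len;                          /* the caller resumes right after the token */
}
```

Calling `longest_token` repeatedly on abaa, each time advancing by the returned length, produces <a*b+, ab>, <a, a>, <a, a>; on abba the first call returns length 3 with pattern abb.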
Summary of Construction
• Describe tokens as regular expressions
  – Decide attributes (values) to save for each token
• Regular expressions are turned into a DFA
  – Also records which attributes (values) to keep
• The lexical analyzer simulates the run of an automaton with the given transition table on any input string
A Few Remarks
• Turning an NFA into a DFA can be expensive
  – Exponential in the worst case
  – In practice, works fine
• The construction is done once per language
  – At compiler construction time
  – Not at compilation time
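For a concrete feel of the determinization cost, here is a minimal worklist version of the subset construction, applied to the combined NFA for a, abb, a*b+, abab. It is a sketch under simplifying assumptions: state sets are 16-bit masks, only the start state has ε transitions (so a single precomputed ε-closure suffices), and there is no overflow check on `MAX_DFA`. The names are illustrative.

```c
#include <stdint.h>

typedef uint16_t StateSet;                  /* bit q set <=> NFA state q in the set */
enum { N_NFA = 14, MAX_DFA = 64 };
#define BIT(q) ((StateSet)(1u << (q)))

/* Combined NFA for a, abb, a*b+, abab: letter moves, columns a and b. */
static const StateSet move_on[N_NFA][2] = {
    /* 0  */ { 0,       0       },
    /* 1  */ { BIT(2),  0       },
    /* 2  */ { 0,       0       },
    /* 3  */ { BIT(4),  0       },
    /* 4  */ { 0,       BIT(5)  },
    /* 5  */ { 0,       BIT(6)  },
    /* 6  */ { 0,       0       },
    /* 7  */ { BIT(7),  BIT(8)  },
    /* 8  */ { 0,       BIT(8)  },
    /* 9  */ { BIT(10), 0       },
    /* 10 */ { 0,       BIT(11) },
    /* 11 */ { BIT(12), 0       },
    /* 12 */ { 0,       BIT(13) },
    /* 13 */ { 0,       0       },
};
/* ε-closure of the start state 0. */
static const StateSet start_set = BIT(0) | BIT(1) | BIT(3) | BIT(7) | BIT(9);

StateSet dfa_states[MAX_DFA];   /* discovered DFA states (each a set of NFA states) */
int dfa_next[MAX_DFA][2];       /* DFA transition table; -1 = no transition */

static StateSet move(StateSet from, int letter) {
    StateSet to = 0;
    for (int q = 0; q < N_NFA; q++)
        if (from & BIT(q)) to |= move_on[q][letter];
    return to;
}

/* Worklist subset construction; returns the number of DFA states found. */
int determinize(void) {
    int count = 0;
    dfa_states[count++] = start_set;
    for (int i = 0; i < count; i++) {            /* dfa_states doubles as the worklist */
        for (int letter = 0; letter < 2; letter++) {
            StateSet to = move(dfa_states[i], letter);
            int j = 0;
            dfa_next[i][letter] = -1;
            if (to == 0) continue;               /* empty set: stuck */
            while (j < count && dfa_states[j] != to) j++;
            if (j == count) dfa_states[count++] = to;   /* new DFA state */
            dfa_next[i][letter] = j;
        }
    }
    return count;
}
```

On this NFA with 14 states only 8 of the 2^14 possible subsets are reachable, which illustrates why the exponential worst case rarely shows up in practice.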
Implementation
Implementation by Example
if                                  { return IF; }
[a-z][a-z0-9]*                      { return ID; }
[0-9]+                              { return NUM; }
[0-9]"."[0-9]+|[0-9]*"."[0-9]+      { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
.                                   { error(); }

Example lexemes:
IF: if
ID: xy, i, zs98
NUM: 3, 32, 032
REAL: 0.55, 33.1
Comment and white space: --comm\n, \n, \t, " "
[Figure: DFA for the specification above, with numbered states and accepting states labeled IF, ID, NUM, REAL, error and w.s. (white space)]
int edges[][256] = {
                /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */   {0, …, 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
/* state 1 */   {13, …, 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, …, 13, 13},
/* state 2 */   {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, …, 0, 0},
/* state 3 */   {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 4 */   {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 5 */   {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0},
/* state 6 */   {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 7 */
/* state … */   ...
/* state 13 */  {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}
};
Pseudo Code for Scanner
char* input = … ;
Token nextToken() {
  lastFinal = 0;
  currentState = 1;
  inputPositionAtLastFinal = input;
  currentPosition = input;
  while (!isDead(currentState)) {
    nextState = edges[currentState][(unsigned char)*currentPosition];
    if (isFinal(nextState)) {
      lastFinal = nextState;
      inputPositionAtLastFinal = currentPosition;
    }
    currentState = nextState;
    currentPosition++;   /* advance */
  }
  input = inputPositionAtLastFinal + 1;   /* resume right after the last accepted character */
  return action[lastFinal];
}
Example
Input: "if  --not-a-com" (2 blanks between "if" and "--")
final state input
0 1 if -‐-‐not-‐a-‐com
2 2 if -‐-‐not-‐a-‐com
3 3 if -‐-‐not-‐a-‐com
3 0 if -‐-‐not-‐a-‐com
return IF
68
ID
IF
ID error REAL
NUM REAL
error w.s. error w.s.
01
2 3
9 10 11 12
![Page 69: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/69.jpg)
| Last final | Current state | Input |
|---|---|---|
| 0 | 1 | --not-a-com |
| 12 | 12 | --not-a-com |
| 12 | 12 | --not-a-com |
| 12 | 0 | --not-a-com |

found whitespace
69
![Page 70: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/70.jpg)
| Last final | Current state | Input |
|---|---|---|
| 0 | 1 | --not-a-com |
| 9 | 9 | --not-a-com |
| 9 | 10 | --not-a-com |
| 9 | 10 | --not-a-com |
| 9 | 10 | --not-a-com |
| 9 | 0 | --not-a-com |

error
![Page 71: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/71.jpg)
| Last final | Current state | Input |
|---|---|---|
| 0 | 1 | -not-a-com |
| 9 | 9 | -not-a-com |
| 9 | 0 | -not-a-com |
| 9 | 0 | -not-a-com |
| 9 | 0 | -not-a-com |

error
![Page 72: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/72.jpg)
Concluding remarks
• Efficient scanner
• Minimization
• Error handling
• Automatic creation of lexical analyzers
72
![Page 73: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/73.jpg)
Efficient Scanners
• Efficient state representation
• Input buffering
• Using switch and gotos instead of tables
73
![Page 74: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/74.jpg)
Minimization
• Create a nondeterministic finite automaton (NFA) from every regular expression
• Merge all the automata using epsilon moves (like the | construction)
• Construct a deterministic finite automaton (DFA)
  – State priority
• Minimize the automaton
  – Separate accepting states by token kinds
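The last step can be sketched in Python as a partition refinement (Moore's algorithm is used here as one standard choice; the slides do not say which algorithm is meant). The only scanner-specific twist is the initial partition: states are grouped by token kind rather than simply into accepting and non-accepting, so an ID state is never merged with a NUM state. The function name `minimize`, the five-state DFA, and the character classes `"l"` (letter) and `"d"` (digit) are assumptions made for the example.

```python
def minimize(states, alphabet, delta, token_of):
    """Return a dict mapping each DFA state to the id of its equivalence block.

    delta must be total: delta[(state, symbol)] exists for every pair.
    token_of maps accepting states to a token kind; other states are absent.
    """
    # Initial partition: one block per token kind (None = non-accepting).
    kinds = sorted({token_of.get(s) for s in states}, key=str)
    block = {s: kinds.index(token_of.get(s)) for s in states}
    while True:
        # Two states stay together only if they are in the same block and
        # every symbol leads them into the same block.
        signature = {
            s: (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
            for s in states
        }
        ids = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(ids) == len(set(block.values())):
            return refined  # no block was split: the partition is stable
        block = refined


# Hypothetical DFA for ID = letter (letter | digit)* and NUM = digit+.
# States 1 and 2 are both ID states and behave identically; 4 is the dead state.
STATES = [0, 1, 2, 3, 4]
ALPHABET = ["l", "d"]
DELTA = {
    (0, "l"): 1, (0, "d"): 3,
    (1, "l"): 1, (1, "d"): 2,
    (2, "l"): 1, (2, "d"): 2,
    (3, "l"): 4, (3, "d"): 3,
    (4, "l"): 4, (4, "d"): 4,
}
TOKEN_OF = {1: "ID", 2: "ID", 3: "NUM"}

blocks = minimize(STATES, ALPHABET, DELTA, TOKEN_OF)
```

Here states 1 and 2 end up in the same block, while the NUM state, the start state and the dead state each keep a block of their own.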
74
![Page 75: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/75.jpg)
Example

    if                  { return IF; }
    [a-z][a-z0-9]*      { return ID; }
    [0-9]+              { return NUM; }

Modern Compiler Implementation in ML, Andrew Appel, (c) 1998, Figures 2.7, 2.8
[Figure: automaton for the three rules, with accepting states labeled IF, ID, NUM, and error]
![Page 76: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/76.jpg)
Example

    if                  { return IF; }
    [a-z][a-z0-9]*      { return ID; }
    [0-9]+              { return NUM; }

Modern Compiler Implementation in ML, Andrew Appel, (c) 1998, Figures 2.7, 2.8
[Figure: two automata for the three rules, with accepting states labeled IF, ID, NUM, and error]
![Page 77: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/77.jpg)
Example

    if                  { return IF; }
    [a-z][a-z0-9]*      { return ID; }
    [0-9]+              { return NUM; }

Modern Compiler Implementation in ML, Andrew Appel, (c) 1998, Figures 2.7, 2.8
[Figure: automata for the three rules at successive construction steps, with accepting states labeled IF, ID, NUM, and error]
![Page 78: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/78.jpg)
Example

    if                  { return IF; }
    [a-z][a-z0-9]*      { return ID; }
    [0-9]+              { return NUM; }

Modern Compiler Implementation in ML, Andrew Appel, (c) 1998, Figures 2.7, 2.8
[Figure: automata for the three rules, with accepting states labeled IF, ID, NUM, and error]
![Page 79: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/79.jpg)
Error Handling
• Many errors cannot be identified at this stage
• Example: “fi (a==f(x))”. Should “fi” be “if”? Or is it a routine name?
  – We will discover this later in the analysis
  – At this point, we just create an identifier token
• Sometimes the lexeme does not match any pattern
  – Easiest: eliminate letters until the beginning of a legitimate lexeme
  – Alternatives: eliminate/add/replace one letter, swap the order of two adjacent letters, etc.
• Goal: allow the compilation to continue
• Problem: errors that spread all over
79
![Page 80: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/80.jpg)
Automatically generated scanners
• Use of Program-Generating Tools
  – Specification → Part of compiler
  – Compiler-Compiler
[Figure: regular expressions → JFlex → scanner; input program → scanner → stream of tokens]
![Page 81: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/81.jpg)
Use of Program-Generating Tools
• Input: regular expressions and actions
  – Action = Java code
• Output: a scanner program that
  – Produces a stream of tokens
  – Invokes actions when a pattern is matched
[Figure: regular expressions → JFlex → scanner; input program → scanner → stream of tokens]
![Page 82: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/82.jpg)
Line Counting Example
• Create a program that counts the number of lines in a given input text file
82
![Page 83: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/83.jpg)
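Creating a Scanner using Flex

    %{
    /* C declarations copied verbatim into the generated scanner */
    #include <stdio.h>
    int num_lines = 0;
    %}
    %option noyywrap
    %%
    \n      ++num_lines;
    .       ;
    %%
    int main(void) {
        yylex();
        printf("# of lines = %d\n", num_lines);
        return 0;
    }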
83
![Page 84: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/84.jpg)
Creating a Scanner using Flex
[Figure: automaton for the line counter with states initial, newline, and other; one transition on \n and one on any other character (^\n)]
84
![Page 85: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/85.jpg)
JFlex Spec File
• User code: copied directly to the Java file
  – Possible source of javac errors down the road
%%
• JFlex directives: macros, state names
  – Macros, e.g. DIGIT= [0-9] and LETTER= [a-zA-Z]
  – State names, e.g. YYINITIAL
%%
• Lexical analysis rules:
  – Optional state, regular expression, action
  – How to break input to tokens
  – Action when token matched
  – Example regular expression: {LETTER}({LETTER}|{DIGIT})*
85
![Page 86: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/86.jpg)
Creating a Scanner using JFlex

    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int lineCounter = 0;
    %}
    %eofval{
      System.out.println("line number=" + lineCounter);
      return new Symbol(sym.EOF);
    %eofval}
    NEWLINE=\n
    %%
    {NEWLINE}     { lineCounter++; }
    [^{NEWLINE}]  { }
86
![Page 87: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/87.jpg)
Catching errors
• What if input doesn’t match any token definition?
• Trick: Add a “catch-all” rule that matches any character and reports an error
  – Add after all other rules
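Why the catch-all rule must come last can be seen in a small Python sketch that mimics how generated scanners choose a rule: the longest match wins, and among matches of equal length the rule listed first wins. This is an illustration built on Python's `re` module, not JFlex itself; the rule names and the `tokenize` helper are assumptions.

```python
import re

# Ordered rule list; the catch-all ERROR rule is deliberately last.
RULES = [
    ("WHILE", r"while"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUM", r"[0-9]+"),
    ("PLUSPLUS", r"\+\+"),
    ("PLUS", r"\+"),
    ("WS", r"[ \t\n]+"),
    ("ERROR", r"."),  # matches any single character
]
COMPILED = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in RULES]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name != "WS":  # whitespace is recognized but not reported
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # at least 1, because ERROR always matches
    return tokens
```

Because ERROR only ever matches one character, it can win only when no other rule matches at that position; if it were listed first it would beat every other single-character rule such as PLUS.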
87
![Page 88: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/88.jpg)
A JFlex specification of C Scanner

    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int lineCounter = 0;
    %}
    Letter= [a-zA-Z_]
    Digit= [0-9]
    %%
    "\t"      { }
    "\n"      { lineCounter++; }
    ";"       { return new Symbol(sym.SemiColumn); }
    "++"      { return new Symbol(sym.PlusPlus); }
    "+="      { return new Symbol(sym.PlusEq); }
    "+"       { return new Symbol(sym.Plus); }
    "while"   { return new Symbol(sym.While); }
    {Letter}({Letter}|{Digit})*
              { return new Symbol(sym.Id, yytext()); }
    "<="      { return new Symbol(sym.LessOrEqual); }
    "<"       { return new Symbol(sym.LessThan); }
88
![Page 89: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/89.jpg)
Missing
• Creating a lexical analyzer by hand
• Table compression
• Symbol Tables
• Nested Comments
• Handling Macros
89
![Page 90: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/90.jpg)
Lexical Analysis: What
• Input: program text (file)
• Output: sequence of tokens
90
![Page 91: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/91.jpg)
Lexical Analysis: How
• Define tokens using regular expressions
• Construct a nondeterministic finite-state automaton (NFA) from regular expression
• Determinize the NFA into a deterministic finite-state automaton (DFA)
• DFA can be directly used to identify tokens
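The determinization step can be sketched in Python with the standard subset construction. The NFA below is a hypothetical one over the tiny alphabet {i, f, x}, combining the keyword `if` with identifiers made of those letters; the encoding (nested dicts, `""` for ε) and the names `eps_closure`, `determinize`, `run` and `token_of` are assumptions made for the example.

```python
from collections import deque

EPS = ""  # label used for epsilon moves


def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def determinize(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = eps_closure(nfa, {start})
    dfa, todo = {}, deque([start_set])
    while todo:
        current = todo.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for a in alphabet:
            moved = {t for s in current for t in nfa.get(s, {}).get(a, ())}
            target = eps_closure(nfa, moved)
            dfa[current][a] = target
            if target not in dfa:
                todo.append(target)
    return start_set, dfa


# Hypothetical NFA: 0 -eps-> 1 and 4; 1 -i-> 2 -f-> 3 (IF); 4 -any-> 5 -any-> 5 (ID).
NFA = {
    0: {EPS: {1, 4}},
    1: {"i": {2}},
    2: {"f": {3}},
    4: {"i": {5}, "f": {5}, "x": {5}},
    5: {"i": {5}, "f": {5}, "x": {5}},
}
ACCEPT = {3: "IF", 5: "ID"}  # listed in priority order: IF beats ID


def token_of(dfa_state):
    """Token of a DFA state: the highest-priority accepting NFA state in it."""
    for nfa_state, token in ACCEPT.items():
        if nfa_state in dfa_state:
            return token
    return None


def run(dfa, start_set, text):
    state = start_set
    for ch in text:
        state = dfa[state][ch]
    return state


START, DFA = determinize(NFA, 0, "ifx")
```

After reading `if` the DFA is in the state {3, 5}, which contains accepting states for both IF and ID; the priority order resolves the conflict in favor of IF.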
![Page 92: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/92.jpg)
Lexical Analysis: Why
• Read input file
• Identify language keywords and standard identifiers
• Handle include files and macros
• Count line numbers
• Remove whitespace
• Report illegal symbols
• [Produce symbol table]
92
![Page 93: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/93.jpg)
Syntax Analysis (1)
Context Free Languages
Context Free Grammars
Pushdown Automata
93
![Page 94: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/94.jpg)
The Real Anatomy of a Compiler
[Figure: compiler pipeline. Source text (txt) → Process text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Sem. Analysis → Annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → Symbolic Instructions (SI) → Target code optimization → SI → Machine code generation → MI → Write executable output → Executable code (exe). Lexical Analysis and Syntax Analysis are highlighted.]
![Page 95: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/95.jpg)
Frontend: Scanning & Parsing
• program text: ((23 + 7) * x)
• Lexical Analyzer → token stream: LP LP Num OP Num RP OP Id RP
• Parser, using the grammar: E → ... | Id, Id → ‘a’ | ... | ‘z’
  – valid: Abstract Syntax Tree Op(*)( Op(+)( Num(23), Num(7) ), Id(x) )
  – otherwise: syntax error
95
![Page 96: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/96.jpg)
From scanning to parsing
• program text: ((23 + 7) * x)
• Lexical Analyzer → token stream: LP LP Num OP Num RP OP Id RP
• Parser, using the grammar: E → ... | Id, Id → ‘a’ | ... | ‘z’
  – valid: Abstract Syntax Tree Op(*)( Op(+)( Num(23), Num(7) ), Id(x) )
  – otherwise: syntax error
96
![Page 97: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/97.jpg)
Parsing
• Construct a structured representation of the input
• Challenges
  – How do you describe the programming language?
  – How do you check validity of an input?
    • Is a sequence of tokens a valid program in the language?
  – How do you construct the structured representation?
  – Where do you report an error?
97
![Page 98: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/98.jpg)
Some founda(ons
98
![Page 99: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/99.jpg)
Context free languages (CFLs)
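• $L_{01} = \{\, 0^n 1^n \mid n > 0 \,\}$

• $L_{\text{palindrome}} = \{\, p p' \mid p \in \Sigma^*,\ p' = \mathrm{reverse}(p) \,\}$

• $L_{\text{palindrome}\#} = \{\, p \# p' \mid p \in \Sigma^*,\ p' = \mathrm{reverse}(p),\ \# \notin \Sigma \,\}$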
99
![Page 100: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/100.jpg)
Context free grammars (CFG)
G = (V, T, P, S)
• V – nonterminals (syntactic variables)
• T – terminals (tokens)
• P – derivation rules
  – Each rule of the form V → (T ∪ V)*
• S – start symbol
100
![Page 101: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/101.jpg)
What can CFGs do?
• Recognize CFLs
• S → 0T1
• T → 0T1 | ε
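To see which strings this grammar derives, here is a minimal Python sketch that expands the leftmost nonterminal of every sentential form and collects the terminal strings up to a length bound. The dictionary encoding and the name `generate` are assumptions made for the example; the pruning relies on this grammar keeping at most one nonterminal per sentential form, so it is not a general-purpose enumerator.

```python
# Each nonterminal maps to its alternatives; "" stands for epsilon.
GRAMMAR = {"S": ["0T1"], "T": ["0T1", ""]}


def generate(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    results, todo, seen = set(), ["S"], {"S"}
    while todo:
        form = todo.pop()
        # Position of the leftmost nonterminal, if any.
        index = next((i for i, c in enumerate(form) if c in GRAMMAR), None)
        if index is None:
            results.add(form)  # only terminals left: a derived string
            continue
        for rhs in GRAMMAR[form[index]]:
            new = form[:index] + rhs + form[index + 1:]
            terminals = sum(c not in GRAMMAR for c in new)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                todo.append(new)
    return sorted(results, key=len)
```

For example, `generate(6)` returns `["01", "0011", "000111"]`, the members of {0ⁿ1ⁿ | n > 0} of length at most 6.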
101
![Page 102: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/102.jpg)
Recognizing CFLs

• Context Free Grammars (CFG)

• Nondeterministic push down automata (PDA)

• CFGs and nondeterministic PDAs have the same language-defining power
102
![Page 103: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/103.jpg)
Pushdown Automata (PDA)
• Nondeterministic PDAs define all CFLs
• Deterministic PDAs model parsers
  – Most programming languages have a deterministic PDA
  – Efficient implementation
103
![Page 104: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/104.jpg)
Intuition: PDA
• An ε-NFA with the additional power to manipulate one stack
[Figure: a control unit (ε-NFA) attached to a stack holding X, Y, IF, $, with a marker for the top of the stack]
![Page 105: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/105.jpg)
Intuition: PDA
• Think of an ε-NFA with the additional power that it can manipulate a stack
• PDA moves are determined by:
  – The current state (of its “ε-NFA”)
  – The current input symbol (or ε)
  – The current symbol on top of its stack
105
![Page 106: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/106.jpg)
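Intuition: PDA
[Figure: a control unit (ε-NFA) reading the input “if (oops) then stat:= blah else abort” at the current position, attached to a stack holding X, Y, IF, $, with a marker for the top of the stack]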
106
![Page 107: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/107.jpg)
Intuition: PDA
• Moves:
  – Change state
  – Replace the top symbol by 0…n symbols
    • 0 symbols = “pop” (“reduce”)
    • more than 0 symbols = sequence of “pushes” (“shift”)
• Nondeterministic choice of next move
107
![Page 108: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/108.jpg)
PDA Formalism
• PDA = (Q, Σ, Γ, δ, q0, $, F):
  – Q: finite set of states
  – Σ: input symbols alphabet
  – Γ: stack symbols alphabet
  – δ: transition function
  – q0: start state
  – $: start symbol
  – F: set of final states
108
![Page 109: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/109.jpg)
The Transition Function

• δ(q, a, X) = { (p1, σ1), … , (pn, σn) }
  – Input: triplet
    • A state q ∊ Q
    • An input symbol a ∊ Σ or ε
    • A stack symbol X ∊ Γ
  – Output: set of 0 … n actions of the form (p, σ)
    • A state p ∊ Q
    • σ: a sequence X1⋯Xk ∊ Γ* of stack symbols
109
![Page 110: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/110.jpg)
Actions of the PDA

• Say (p, σ) ∊ δ(q, a, X)
  – If the PDA is in state q, X is the top stack symbol, and a is at the front of the input
  – Then it can
    • Change the state to p.
    • Remove a from the front of the input (but a may be ε).
    • Replace X on the top of the stack by σ.
110
![Page 111: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/111.jpg)
Example: Deterministic PDA

• Design a PDA to accept $\{\, 0^n 1^n \mid n \geq 1 \,\}$.
• The states:
  – q = we have not seen a 1 so far (start state)
  – p = we have seen at least one 1 and no 0s since
  – f = final state; accept.
111
![Page 112: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/112.jpg)
Example: Stack Symbols

• $ = start symbol.
  – Also marks the bottom of the stack,
  – Indicates when we have counted the same number of 1’s as 0’s.
• X = “counter”
  – Used to count the number of 0s we saw
112
![Page 113: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/113.jpg)
Example: Transitions

• δ(q, 0, $) = {(q, X$)}.
• δ(q, 0, X) = {(q, XX)}.
  – These two rules cause one X to be pushed onto the stack for each 0 read from the input.
• δ(q, 1, X) = {(p, ε)}.
  – When we see a 1, go to state p and pop one X.
• δ(p, 1, X) = {(p, ε)}.
  – Pop one X per 1.
• δ(p, ε, $) = {(f, $)}.
  – Accept at bottom.
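The five transitions above can be run directly. The following Python sketch encodes them in a dictionary and steps the PDA deterministically, recording each configuration (state, unread input, stack with the top first). The encoding and the name `run` are assumptions made for the example; since this PDA is deterministic, at most one move applies in every configuration.

```python
# (state, input symbol or "" for epsilon, top of stack) -> (new state, pushed string)
DELTA = {
    ("q", "0", "$"): ("q", "X$"),
    ("q", "0", "X"): ("q", "XX"),
    ("q", "1", "X"): ("p", ""),
    ("p", "1", "X"): ("p", ""),
    ("p", "", "$"): ("f", "$"),
}


def run(word):
    """Return (accepted, trace); the stack is a string with its top at index 0."""
    state, rest, stack = "q", word, "$"
    trace = [(state, rest, stack)]
    while True:
        top = stack[:1]
        if rest and (state, rest[0], top) in DELTA:
            state, push = DELTA[(state, rest[0], top)]
            rest = rest[1:]  # consume one input symbol
        elif (state, "", top) in DELTA:
            state, push = DELTA[(state, "", top)]  # epsilon move
        else:
            break  # no move applies
        stack = push + stack[1:]  # replace the top symbol by the pushed string
        trace.append((state, rest, stack))
    return state == "f" and rest == "", trace
```

Running `run("000111")` accepts and yields eight configurations, from `("q", "000111", "$")` through `("q", "111", "XXX$")` and `("p", "", "$")` to `("f", "", "$")`.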
![Page 114: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/114.jpg)
Actions of the Example PDA
• State: q
• Stack (top first): $
• Input: 0 0 0 1 1 1 (unread: 0 0 0 1 1 1)
114
![Page 115: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/115.jpg)
Actions of the Example PDA
• State: q
• Stack (top first): X $
• Input: 0 0 0 1 1 1 (unread: 0 0 1 1 1)
115
![Page 116: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/116.jpg)
Actions of the Example PDA
• State: q
• Stack (top first): X X $
• Input: 0 0 0 1 1 1 (unread: 0 1 1 1)
116
![Page 117: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/117.jpg)
Actions of the Example PDA
• State: q
• Stack (top first): X X X $
• Input: 0 0 0 1 1 1 (unread: 1 1 1)
117
![Page 118: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/118.jpg)
Actions of the Example PDA
• State: p
• Stack (top first): X X $
• Input: 0 0 0 1 1 1 (unread: 1 1)
118
![Page 119: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/119.jpg)
Actions of the Example PDA
• State: p
• Stack (top first): X $
• Input: 0 0 0 1 1 1 (unread: 1)
119
![Page 120: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/120.jpg)
Actions of the Example PDA
• State: p
• Stack (top first): $
• Input: 0 0 0 1 1 1 (all input read)
120
![Page 121: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/121.jpg)
Actions of the Example PDA
• State: f
• Stack (top first): $
• Input: 0 0 0 1 1 1 (all input read; accept)
121
![Page 122: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/122.jpg)
Example: Nondeterministic PDA

• A PDA that accepts palindromes
  – L = {pp’ ∊ Σ* | p’ = reverse(p)}
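One possible PDA for this language, sketched in Python under the assumption Σ = {0, 1}: in state q it pushes every symbol it reads, at any point it may guess with an ε-move that the middle has been reached and switch to state p, and in p it pops one symbol per matching input symbol. The transition function `delta` and the search in `accepts` are hypothetical and only illustrate the idea; the nondeterministic guess is simulated by exploring every possible move.

```python
def delta(state, a, top):
    """Hypothetical transition function; a == "" denotes an epsilon move."""
    if state == "q":
        if a == "":
            return [("p", top)]  # guess: the middle of the word is here
        return [("q", a + top)]  # push the symbol just read
    if state == "p":
        if a == "" and top == "$":
            return [("f", "$")]  # stack is back at the bottom: accept
        if a != "" and a == top:
            return [("p", "")]  # input matches the top of the stack: pop it
    return []


def accepts(word):
    """Explore all configurations (state, unread input, stack with top first)."""
    todo = [("q", word, "$")]
    while todo:
        state, rest, stack = todo.pop()
        if state == "f" and rest == "":
            return True
        if not stack:
            continue
        top, below = stack[0], stack[1:]
        for new_state, push in delta(state, "", top):
            todo.append((new_state, rest, push + below))
        if rest:
            for new_state, push in delta(state, rest[0], top):
                todo.append((new_state, rest[1:], push + below))
    return False
```

The search terminates here because every ε-move leads from q to p or from p to f and never back. A deterministic machine has no way to know where the middle is without a marker, which is why the variant with # in the middle is the easier language.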
122
![Page 123: Compilaon - TAUmaon/teaching/2014-2015/compilation/compilatio… · code exe Source text txt Lexical Analysis Sem . Analysis Process text input characters Syntax Analysis tokens AST](https://reader034.fdocuments.in/reader034/viewer/2022050410/5f870c397f0ee66e7217ad25/html5/thumbnails/123.jpg)
123