Lexical Analysis Mooly Sagiv msagiv@post.tau.ac.il Schrierber 317 03-640-7606 Wed 10:00-12:00...

Post on 31-Dec-2015


Lexical Analysis

Mooly Sagiv
msagiv@post.tau.ac.il

Schrierber 317
03-640-7606

Wed 10:00-12:00
html://www.math.tau.ac.il/~msagiv/courses/wcc.html

Textbook: Modern Compiler Implementation in C, Chapter 2

A motivating example

• Create a program that counts the number of lines in a given input file

A motivating example: solution

    int num_lines = 0;
%%
\n      ++num_lines;
.       ;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}
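For comparison, the same job can be written by hand in plain C. This is only a sketch of what the two rules express, not the code flex generates; `count_lines` is a hypothetical helper name, and it scans an in-memory string rather than a file to keep the example short.

```c
/* Count newline characters in a NUL-terminated buffer.
 * Mirrors the two flex rules: '\n' bumps the counter,
 * every other character is ignored. */
static int count_lines(const char *text)
{
    int num_lines = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n')
            ++num_lines;
    }
    return num_lines;
}
```

The flex version is shorter mainly because the loop and the character dispatch are generated for us.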

Subjects

• Roles of lexical analysis

• The straightforward solution: a manual scanner for C

• Regular Expressions

• Finite automata

• From regular languages into finite automata

• Flex

Basic Compiler Phases

[Figure: compiler pipeline. Source program (string) → lexical analysis → Tokens → syntax analysis → Abstract syntax tree → semantic analysis → Translate → Intermediate representation → Instruction selection → Assembly → Register Allocation → Fin. Assembly. Side labels name the underlying techniques: Finite automata, Pushdown automata, Memory organization, graph algorithms, Dynamic programming.]

Example

• Input string

a := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)

• Tokens

id(“a”) assign num(5) + num(3) ; id(“b”) assign

print(id(“a”) , id(“a”) - num(1)), num(10) * id(“a”)) ; print(id(“b”))
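One possible way to represent such tokens in C is a tagged union: a kind plus an attribute that depends on the kind. The type and function names below (`Token`, `TokenKind`, `make_id`, `make_num`, `token_to_string`) are illustrative assumptions, not taken from the course code, and only a few token kinds are covered.

```c
#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_SEMI } TokenKind;

typedef struct {
    TokenKind kind;
    union {
        char name[32];   /* valid when kind == TOK_ID  */
        int  value;      /* valid when kind == TOK_NUM */
    } attr;
} Token;

static Token make_id(const char *name)
{
    Token t;
    t.kind = TOK_ID;
    /* snprintf truncates long names and always NUL-terminates */
    snprintf(t.attr.name, sizeof t.attr.name, "%s", name);
    return t;
}

static Token make_num(int value)
{
    Token t;
    t.kind = TOK_NUM;
    t.attr.value = value;
    return t;
}

/* Render a token in the notation used above, e.g. id("a") or num(5). */
static void token_to_string(const Token *t, char *buf, size_t n)
{
    switch (t->kind) {
    case TOK_ID:     snprintf(buf, n, "id(\"%s\")", t->attr.name); break;
    case TOK_NUM:    snprintf(buf, n, "num(%d)", t->attr.value);   break;
    case TOK_ASSIGN: snprintf(buf, n, "assign");                   break;
    case TOK_PLUS:   snprintf(buf, n, "+");                        break;
    case TOK_SEMI:   snprintf(buf, n, ";");                        break;
    }
}
```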

• Functionality

– input

• program text (file)

– output

• sequence of tokens

– Read input file

– Identify language keywords and standard identifiers

– Handle include files and macros

– Count line numbers

– Remove whitespace

– Report illegal symbols

– Produce symbol table

Lexical Analysis (Scanning)

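A simplified scanner for C

Token nextToken()
{
    int c;                          /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;            /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();              /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); /* push the lookahead back */
                  return Plus;
        }
    case '<':
    case 'w':
        /* further cases elided on the slide */ ;
    }
}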

Automatic Generation of Lexical Analysis

• The matching of input strings can be performed by a finite automaton

• Examples:

– An automaton for while

– An automaton for C identifier

– An automaton for C comment

• The program for the automaton is automatically generated from regular expressions
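To make the idea of "an automaton as a program" concrete, here is a hand-written sketch of the automaton for a C identifier, `[a-zA-Z_][a-zA-Z_0-9]*`. The state names and the function `is_c_identifier` are assumptions made for this illustration; a generated scanner would normally use tables instead. Note that it also accepts keywords such as `while`, since telling those apart is a separate step.

```c
#include <ctype.h>

/* States of a small DFA for C identifiers. */
enum { ST_START, ST_IN_ID, ST_DEAD };

/* Return 1 if the whole string s is a C identifier, 0 otherwise. */
static int is_c_identifier(const char *s)
{
    int state = ST_START;
    for (; *s != '\0' && state != ST_DEAD; ++s) {
        unsigned char c = (unsigned char)*s;
        switch (state) {
        case ST_START:   /* first character: letter or underscore */
            state = (isalpha(c) || c == '_') ? ST_IN_ID : ST_DEAD;
            break;
        case ST_IN_ID:   /* later characters: letter, digit or underscore */
            state = (isalnum(c) || c == '_') ? ST_IN_ID : ST_DEAD;
            break;
        }
    }
    return state == ST_IN_ID;   /* ST_IN_ID is the only accepting state */
}
```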

Flex

• Input

– regular expressions and actions (C code)

• Output

– A scanner program that reads the input and applies actions when an input regular expression is matched

[Figure: regular expressions → flex → scanner; input program → scanner → tokens]

Regular Expression Notations

a           An ordinary character stands for itself
M|N         M or N
MN          M followed by N
M*          Zero or more times of M
M+          One or more times of M
M?          Zero or one occurrence of M
[a-zA-Z]    Character set alternation (single character)
.           Any (single) character but newline
"a.+"       Quotation
\           Convert an operator into text
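Most of this notation can be tried out without flex, because POSIX extended regular expressions share `|`, concatenation, `*`, `+`, `?`, `[...]`, `.` and `\`. The sketch below uses the standard `regcomp`/`regexec` calls from `<regex.h>`; `full_match` is a hypothetical helper, and the flex-specific quotation form `"a.+"` is not part of POSIX syntax, so it is not shown.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s matches the POSIX extended regular expression `pattern`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * Callers anchor the pattern with ^ and $ to require a whole-string match. */
static int full_match(const char *pattern, const char *s)
{
    regex_t re;
    int matched;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

For instance, `full_match("^ab?c$", "ac")` exercises `M?`, and `full_match("^a\\.b$", "axb")` shows `\` turning the operator `.` into plain text.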

Ambiguity Resolving

• Find the longest matching token

• Between two tokens with the same length use the one declared first
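The two rules can be sketched in a few lines of C. Each rule is modelled here as a function returning the length of the longest prefix it matches (0 for no match); the rule set (`if`, identifier, number) and all names are assumptions chosen for illustration, and a real scanner would run one combined automaton rather than one matcher per rule.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*Matcher)(const char *s);

static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)   /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        ++n;
    return n;
}

static size_t match_num(const char *s)  /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        ++n;
    return n;
}

/* Rules in declaration order; a lower index means "declared first". */
enum { RULE_IF, RULE_ID, RULE_NUM, RULE_COUNT, RULE_NONE = -1 };
static const Matcher rules[RULE_COUNT] = { match_if, match_id, match_num };

/* Choose the rule for the token at the start of s and store its length. */
static int pick_rule(const char *s, size_t *len)
{
    int best = RULE_NONE;
    *len = 0;
    for (int r = 0; r < RULE_COUNT; ++r) {
        size_t n = rules[r](s);
        if (n > *len) {   /* strictly longer wins; a tie keeps the earlier rule */
            *len = n;
            best = r;
        }
    }
    return best;
}
```

With this rule order, the input `if` is reported as the keyword (same length, declared first), while `if8` is an identifier (longer match).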

A Flex specification of C Scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]       { ; }
[\n]        { line_count++; }
";"         { return SemiColumn; }
"++"        { return PlusPlus; }
"+="        { return PlusEqual; }
"+"         { return Plus; }
"while"     { return While; }
{Letter}({Letter}|{Digit})*  { return Id; }
"<="        { return LessOrEqual; }
"<"         { return LessThen; }

Running Example

if                                { return IF; }
[a-z][a-z0-9]*                    { return ID; }
[0-9]+                            { return NUM; }
[0-9]"."[0-9]*|[0-9]*"."[0-9]+    { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)        { ; }
.                                 { error(); }

int edges[][256] = {
/*               ...,  0, 1, 2, 3, ..., -, e, f, g, h, i, j, ...        */
/* state 0  */ { 0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0 },
/* state 1  */ { 13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13 },
/* state 2  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0 },
/* state 3  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 4  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 5  */ { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 6  */ { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 7  */
...
/* state 13 */ { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 }
};

Pseudo Code for Scanner

Token nextToken()
{
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][currentPosition];
        if (isFinal(nextState)) {
            lastFinal = nextState;
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
        advance currentPosition;
    }
    input = inputPositionAtLastFinal;
    return action[lastFinal];
}
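The pseudo code can be turned into a small runnable sketch. The version below assumes a much smaller rule set than the running example (only `if`, `[a-z][a-z0-9]*` and `[0-9]+`), uses hypothetical state and token names, and assumes an ASCII-like character set where `'a'..'z'` are contiguous. Unlike the pseudo code, it records the position just past the last accepted character, so that `*input` ends up pointing at the start of the next token.

```c
#include <string.h>

enum { DEAD = 0, START = 1, S_I = 2, S_IF = 3, S_ID = 4, S_NUM = 5, NSTATES = 6 };
typedef enum { T_NONE, T_IF, T_ID, T_NUM } Token;

static unsigned char edges[NSTATES][256];

/* Token reported by each state; T_NONE marks a non-final state. */
static const Token action[NSTATES] = { T_NONE, T_NONE, T_ID, T_IF, T_ID, T_NUM };

/* Fill the transition table; every entry not set here leads to DEAD. */
static void build_edges(void)
{
    memset(edges, DEAD, sizeof edges);
    for (int c = 'a'; c <= 'z'; ++c)
        edges[START][c] = edges[S_I][c] = edges[S_IF][c] = edges[S_ID][c] = S_ID;
    for (int c = '0'; c <= '9'; ++c) {
        edges[START][c] = edges[S_NUM][c] = S_NUM;
        edges[S_I][c] = edges[S_IF][c] = edges[S_ID][c] = S_ID;
    }
    edges[START]['i'] = S_I;    /* "i" alone is still an identifier ... */
    edges[S_I]['f'] = S_IF;     /* ... but "if" is the keyword          */
}

/* Scan the longest token at *input and advance *input past it.
 * Returns T_NONE and leaves *input unchanged if no rule matches. */
static Token next_token(const char **input)
{
    int state = START;
    Token last_final = T_NONE;
    const char *pos = *input;
    const char *pos_at_last_final = *input;

    while (*pos != '\0') {
        state = edges[state][(unsigned char)*pos];
        if (state == DEAD)
            break;
        ++pos;
        if (action[state] != T_NONE) {   /* remember the last final state seen */
            last_final = action[state];
            pos_at_last_final = pos;
        }
    }
    *input = pos_at_last_final;
    return last_final;
}
```

After calling `build_edges()` once, scanning `"if8 x"` passes through the final states for ID, IF and ID again before dying on the blank, so the longest match `if8` is reported as an identifier.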

Example

Input: “if --not-a-com”

Efficient Scanners

• Efficient state representation

• Input buffering

• Using switch and goto instead of tables
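As an illustration of the last point, a DFA can be coded directly with one label per state, so the current state lives in the program counter instead of a variable and a table. The sketch below classifies a whole string as NUM (`[0-9]+`) or REAL (assumed here to be `[0-9]+"."[0-9]*|[0-9]*"."[0-9]+`); the names `Kind` and `classify_number` are made up for this example.

```c
typedef enum { K_NONE, K_NUM, K_REAL } Kind;

/* Classify the whole string s; each label below is one DFA state. */
static Kind classify_number(const char *s)
{
    /* start state: nothing read yet */
    if (*s >= '0' && *s <= '9') { ++s; goto in_int; }
    if (*s == '.')              { ++s; goto after_lone_dot; }
    return K_NONE;

in_int:          /* read [0-9]+ so far: accepting, NUM */
    if (*s >= '0' && *s <= '9') { ++s; goto in_int; }
    if (*s == '.')              { ++s; goto in_frac; }
    return *s == '\0' ? K_NUM : K_NONE;

after_lone_dot:  /* read "." with no digits before it: a digit must follow */
    if (*s >= '0' && *s <= '9') { ++s; goto in_frac; }
    return K_NONE;

in_frac:         /* read a dot with digits on at least one side: accepting, REAL */
    if (*s >= '0' && *s <= '9') { ++s; goto in_frac; }
    return *s == '\0' ? K_REAL : K_NONE;
}
```

The trade-off is the usual one: no table lookups per character, at the cost of code size that grows with the number of states.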

Constructing Automaton from Specification

• Create a non-deterministic automaton (NDFA) from every regular expression

• Merge all the automata using epsilon moves (like the | construction)

• Construct a deterministic finite automaton (DFA)

• Minimize the automaton starting with separate accepting states
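The first three steps can be sketched compactly if sets of NDFA states are stored as bit masks: each distinct mask is one DFA state, computed here on the fly rather than tabulated. The example automaton merges two assumed rules, `if` (rule 0) and `[a-z]+` (rule 1), and every name in the code is hypothetical. Minimization is not shown.

```c
#include <stdint.h>
#include <string.h>

#define MAX_NFA 32

typedef struct {
    uint32_t eps[MAX_NFA];         /* epsilon successors, as a bit set      */
    uint32_t delta[MAX_NFA][256];  /* successors on each byte, as a bit set */
    int      rule[MAX_NFA];        /* rule accepted in this state, or -1    */
    int      nstates;
} NFA;

/* Add everything reachable through epsilon moves until nothing changes. */
static uint32_t eclosure(const NFA *n, uint32_t set)
{
    uint32_t prev;
    do {
        prev = set;
        for (int s = 0; s < n->nstates; ++s)
            if (set & (UINT32_C(1) << s))
                set |= n->eps[s];
    } while (set != prev);
    return set;
}

/* One DFA transition: the set of NDFA states reached from `set` on byte c. */
static uint32_t dfa_step(const NFA *n, uint32_t set, unsigned char c)
{
    uint32_t next = 0;
    for (int s = 0; s < n->nstates; ++s)
        if (set & (UINT32_C(1) << s))
            next |= n->delta[s][c];
    return eclosure(n, next);
}

/* A DFA state accepts the earliest-declared rule among its NDFA states. */
static int dfa_accept(const NFA *n, uint32_t set)
{
    int best = -1;
    for (int s = 0; s < n->nstates; ++s)
        if ((set & (UINT32_C(1) << s)) && n->rule[s] >= 0
            && (best < 0 || n->rule[s] < best))
            best = n->rule[s];
    return best;
}

/* State 0 branches by epsilon moves into "if" (1-2-3) and [a-z]+ (4-5). */
static void build_example(NFA *n)
{
    memset(n, 0, sizeof *n);
    n->nstates = 6;
    for (int s = 0; s < n->nstates; ++s)
        n->rule[s] = -1;
    n->eps[0] = (UINT32_C(1) << 1) | (UINT32_C(1) << 4);
    n->delta[1]['i'] = UINT32_C(1) << 2;
    n->delta[2]['f'] = UINT32_C(1) << 3;
    n->rule[3] = 0;
    for (int c = 'a'; c <= 'z'; ++c)
        n->delta[4][c] = n->delta[5][c] = UINT32_C(1) << 5;
    n->rule[5] = 1;
}

/* Run the implied DFA over all of s; return the accepted rule or -1. */
static int run(const NFA *n, const char *s)
{
    uint32_t set = eclosure(n, UINT32_C(1));   /* closure of NDFA state 0 */
    for (; *s != '\0'; ++s)
        set = dfa_step(n, set, (unsigned char)*s);
    return dfa_accept(n, set);
}
```

After reading `if`, the DFA state is the set {3, 5}, which contains accepting states of both rules; keeping the accepting states of different rules separate is what lets the scanner still report rule 0 here.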

NDFA Construction

if                                { return IF; }
[a-z][a-z0-9]*                    { return ID; }
[0-9]+                            { return NUM; }
[0-9]"."[0-9]*|[0-9]*"."[0-9]+    { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)        { ; }
.                                 { error(); }

DFA Construction

Minimization

%{
/* C declarations */
#include "tokens.h"    /* Mapping of tokens into integers */
#include "errormsg.h"  /* Shared by all the phases */
union {int ival; string sval; double fval;} yylval;
int charPos=1;
#define ADJ (EM_tokPos=charPos, charPos+=yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                { ADJ; return IF; }
[a-z][a-z0-9]*    { ADJ; yylval.sval=String(yytext); return ID; }
{digits}          { ADJ; yylval.ival=atoi(yytext); return NUM; }
({digits}\.{digits}?)|({digits}?\.{digits})  { ADJ; yylval.fval=atof(yytext); return REAL; }
(\-\-[a-z]*\n)|([\n\t]|" ")*  { ADJ; }
.                 { ADJ; EM_error("illegal character"); }

Start States

• Regular expressions may be more complicated than automata

– C comments

• Solutions

– Conversion of automata into regular expressions

– Start States

%start s1 s2
%%
<INITIAL>r1  { action0; BEGIN s1; }
<s1>r1       { action1; BEGIN s2; }
<s2>r2       { action2; BEGIN INITIAL; }

Realistic Example

%start Comment
%%
<INITIAL>"/*"  { BEGIN Comment; }
<INITIAL>r1    { Usual actions; }
<INITIAL>r2    { Usual actions; }
...
<INITIAL>rk    { Usual actions; }
<Comment>"*/"  { BEGIN INITIAL; }
<Comment>.|\n  ;
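The effect of the two start states can be imitated by hand with a single mode variable. The sketch below is an assumption-laden stand-in, not what flex generates: `strip_comments` is a hypothetical function that copies its input while dropping C comments, the copy step stands in for the "usual actions", and string literals containing comment delimiters are not handled.

```c
typedef enum { INITIAL, COMMENT } StartState;

/* Copy src to dst without C-style comments.
 * dst needs room for strlen(src) + 1 bytes.
 * Returns the final start state; COMMENT means an unterminated comment. */
static StartState strip_comments(const char *src, char *dst)
{
    StartState state = INITIAL;
    while (*src != '\0') {
        if (state == INITIAL && src[0] == '/' && src[1] == '*') {
            state = COMMENT;     /* comment opener seen: like BEGIN Comment */
            src += 2;
        } else if (state == COMMENT && src[0] == '*' && src[1] == '/') {
            state = INITIAL;     /* comment closer seen: like BEGIN INITIAL */
            src += 2;
        } else if (state == COMMENT) {
            ++src;               /* inside a comment: discard the character */
        } else {
            *dst++ = *src++;     /* outside a comment: the "usual actions"  */
        }
    }
    *dst = '\0';
    return state;
}
```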

Summary

• For most programming languages lexical analyzers can be easily constructed

• Exceptions:

– Fortran

– PL/1

• Flex is a useful tool beyond compilers