Chapter 3. Lexical Analysis (1)

Page 1: Chapter 3. Lexical Analysis (1)

Chapter 3.

Lexical Analysis (1)

Page 2: Chapter 3. Lexical Analysis (1)

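Interaction of lexical analyzer with parser.

[Figure: source program → lexical analyzer → parser. The parser sends "get next token" to the lexical analyzer, which returns a token; both the lexical analyzer and the parser access the symbol table.]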

Page 3: Chapter 3. Lexical Analysis (1)

Lexical Analysis

Issues (reasons for separating lexical analysis from parsing)
– Simpler design is preferred
– Compiler efficiency is improved
– Compiler portability is improved

Terms
– Tokens: terminal symbols in a grammar
– Patterns: rules describing the strings of a token
– Lexemes: character sequences in the source program matched by the pattern for a token

Page 4: Chapter 3. Lexical Analysis (1)

TOKEN     SAMPLE LEXEMES        INFORMAL DESCRIPTION OF PATTERN
const     const                 const
if        if                    if
relation  <, <=, =, <>, >, >=   < or <= or = or <> or >= or >
id        pi, count, D2         letter followed by letters and digits
num       3.1416, 0, 6.02E23    any numeric constant
literal   "core dumped"         any characters between " and " except "

Examples of tokens.

Page 5: Chapter 3. Lexical Analysis (1)

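Difficulties in implementing lexical analyzers

FORTRAN
– No delimiter is used: blanks are not significant
– DO 5 I=1.25   DO 5 I=1,25   DO 5 I= 1 25
– DO 5 I = 1.25 assigns 1.25 to the variable DO5I, whereas DO 5 I = 1,25 begins a DO loop; the two cannot be told apart until the period or comma is seen

PL/I
– Keywords are not reserved
– IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;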

Page 6: Chapter 3. Lexical Analysis (1)

Attributes for tokens

A lexical analyzer collects information about tokens into their associated attributes.

Example: E = M * C ** 2
• <id, pointer to symbol-table entry for E>
• <assign_op, >
• <id, pointer to symbol-table entry for M>
• <mult_op, >
• <id, pointer to symbol-table entry for C>
• <exp_op, >
• <num, integer value 2> (generally stored in a constant table)
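The token stream above can be reproduced with a small hand-written scanner. The following is a minimal sketch, not the chapter's implementation: the names Token, TokenKind, symtab, install_name and next_token are hypothetical, the symbol table is just an array of names, and the attribute of an id is an index into that array rather than a real pointer.

```c
#include <ctype.h>
#include <string.h>

typedef enum { T_ID, T_NUM, T_ASSIGN_OP, T_MULT_OP, T_EXP_OP, T_EOF, T_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    int attr;   /* T_ID: index into symtab; T_NUM: integer value; otherwise unused */
} Token;

#define MAX_SYMS 16
#define MAX_NAME 32
static char symtab[MAX_SYMS][MAX_NAME];   /* hypothetical symbol table: names only */
static int nsyms = 0;

/* Return the index of the name in the symbol table, adding it if absent (-1 if full). */
static int install_name(const char *name, size_t len) {
    for (int i = 0; i < nsyms; i++)
        if (strlen(symtab[i]) == len && strncmp(symtab[i], name, len) == 0)
            return i;
    if (nsyms == MAX_SYMS || len >= MAX_NAME)
        return -1;
    memcpy(symtab[nsyms], name, len);
    symtab[nsyms][len] = '\0';
    return nsyms++;
}

/* Scan one token starting at *p and advance *p past its lexeme. */
Token next_token(const char **p) {
    Token t = { T_ERROR, 0 };
    while (**p == ' ' || **p == '\t')
        (*p)++;                                   /* skip white space */
    const char *start = *p;
    if (*start == '\0') {
        t.kind = T_EOF;
        return t;
    }
    if (isalpha((unsigned char)*start)) {         /* id: letter (letter|digit)* */
        while (isalnum((unsigned char)**p))
            (*p)++;
        int idx = install_name(start, (size_t)(*p - start));
        if (idx >= 0) {
            t.kind = T_ID;
            t.attr = idx;
        }
        return t;
    }
    if (isdigit((unsigned char)*start)) {         /* num: integers only in this sketch */
        t.kind = T_NUM;
        while (isdigit((unsigned char)**p)) {
            t.attr = t.attr * 10 + (**p - '0');
            (*p)++;
        }
        return t;
    }
    (*p)++;                                       /* single-character lexeme so far */
    if (*start == '=') {
        t.kind = T_ASSIGN_OP;
    } else if (*start == '*') {
        if (**p == '*') {                         /* ** is the exponentiation operator */
            (*p)++;
            t.kind = T_EXP_OP;
        } else {
            t.kind = T_MULT_OP;
        }
    }
    return t;                                     /* anything else stays T_ERROR */
}
```

Calling next_token repeatedly on "E = M * C ** 2" yields id, assign_op, id, mult_op, id, exp_op, num, in that order; the operator tokens carry no attribute, matching the empty second components in the list.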

Page 7: Chapter 3. Lexical Analysis (1)

Lexical Errors

Rules for error recovery
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct character
– Transposing two adjacent characters

Minimum-distance error correction

Example
– Detectable: 2as3, 2#31, …
– Undetectable: fi(a == f(x)) …
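One way to make "minimum distance" concrete is to count the fewest applications of the four recovery rules that turn one string into another. The sketch below uses the optimal-string-alignment variant of edit distance, which is an assumption on my part; the slide does not name a specific algorithm. The function name edit_distance and the bound MAX_WORD are hypothetical.

```c
#include <string.h>

#define MAX_WORD 64   /* assumed upper bound on lexeme length for this sketch */

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fewest single-character deletions, insertions, replacements and adjacent
   transpositions that turn a into b (no substring is edited more than once).
   Returns -1 if either string is longer than MAX_WORD. */
int edit_distance(const char *a, const char *b) {
    size_t n = strlen(a), m = strlen(b);
    if (n > MAX_WORD || m > MAX_WORD)
        return -1;
    static int d[MAX_WORD + 1][MAX_WORD + 1];
    for (size_t i = 0; i <= n; i++) d[i][0] = (int)i;   /* delete all of a's prefix */
    for (size_t j = 0; j <= m; j++) d[0][j] = (int)j;   /* insert all of b's prefix */
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i][j] = min3(d[i - 1][j] + 1,          /* delete a[i-1] */
                           d[i][j - 1] + 1,          /* insert b[j-1] */
                           d[i - 1][j - 1] + cost);  /* replace or keep */
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
                && d[i - 2][j - 2] + 1 < d[i][j])
                d[i][j] = d[i - 2][j - 2] + 1;       /* transpose two adjacent characters */
        }
    }
    return d[n][m];
}
```

Under this measure fi is at distance 1 from if (one transposition), which is why a recovery scheme could repair it if the error were detected at all; the lexical analyzer itself cannot detect it, because fi is a valid identifier.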

Page 8: Chapter 3. Lexical Analysis (1)

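Input Buffering

A single buffer can cause serious difficulty
– A word may straddle the boundary between two buffer loads
– Declare (arg1, …, argn): array or function? (the answer depends on text that follows the closing parenthesis)

Buffer pairs
– A good solution
– With sentinels, there is no need to test for both the end of the buffer and the end of the file on every advance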

Page 9: Chapter 3. Lexical Analysis (1)

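Sentinels at end of each buffer half.

[Figure: a buffer made of two halves, each ending in a sentinel eof. The first half holds … E = M * followed by eof; the second half holds C * * 2 eof … followed by eof. The eof directly after 2 marks the real end of the input. Pointers: lexeme_beginning, forward.]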
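A minimal sketch of advancing the forward pointer over a buffer pair with sentinels follows. Several things here are assumptions made for illustration: the input comes from an in-memory string instead of a file, the sentinel is the NUL character (so the source text must not contain NUL), the half size HALF is tiny so that reloads actually happen, and lexeme_beginning is left out. The names Scanner, load_half, scanner_init and next_char are hypothetical.

```c
#include <string.h>

#define HALF 4            /* assumed tiny half size so that reloads are easy to observe */
#define SENTINEL '\0'     /* assumed: the source text never contains NUL */

typedef struct {
    const char *src;          /* unread input (stands in for a file) */
    char buf[2 * HALF + 2];   /* two halves, each followed by one sentinel slot */
    char *forward;            /* scanning pointer into buf */
} Scanner;

/* Copy up to HALF characters into the given half and terminate it with a sentinel.
   A sentinel before the end of the half therefore means end of input. */
static void load_half(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > HALF)
        n = HALF;
    memcpy(half, s->src, n);
    s->src += n;
    half[n] = SENTINEL;
}

void scanner_init(Scanner *s, const char *text) {
    s->src = text;
    load_half(s, s->buf);
    s->forward = s->buf;
}

/* Return the next input character, or -1 at end of input. */
int next_char(Scanner *s) {
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {                 /* common case: a single test per character */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + HALF) {   /* sentinel at end of first half */
            load_half(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == s->buf + 2 * HALF + 1) {   /* end of second half */
            load_half(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;                       /* sentinel inside a half: real end of input */
        }
    }
}
```

The point of the design is visible in next_char: the end-of-half and end-of-input cases are examined only after the single sentinel comparison has already failed, so ordinary characters cost one test each.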

Page 10: Chapter 3. Lexical Analysis (1)

Specification of Tokens

Strings and languages
– Alphabet or character class: a finite set of symbols
– String (also called sentence or word): a finite sequence of symbols drawn from the alphabet
– |s|: length of a string s
– ε: the empty string; ∅: the empty set (∅ contains no strings, so ∅ ≠ {ε})
– If x and y are strings:
  • xy: concatenation; εx = xε = x

Operations on languages

Page 11: Chapter 3. Lexical Analysis (1)

Terms for parts of a string.

TERM: DEFINITION

prefix of s: A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.

suffix of s: A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.

substring of s: A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.

proper prefix, suffix, or substring of s: Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.

subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
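The definitions above translate almost directly into predicates on C strings. This is a small sketch for checking the banana examples; the function names are hypothetical and the strings are assumed to be ordinary NUL-terminated byte strings.

```c
#include <stdbool.h>
#include <string.h>

/* x is a prefix of s: x equals the first strlen(x) characters of s. */
bool is_prefix(const char *x, const char *s) {
    return strncmp(x, s, strlen(x)) == 0;
}

/* x is a suffix of s: x equals the last strlen(x) characters of s. */
bool is_suffix(const char *x, const char *s) {
    size_t nx = strlen(x), ns = strlen(s);
    return nx <= ns && strcmp(s + (ns - nx), x) == 0;
}

/* x is a substring of s: x occurs contiguously somewhere in s. */
bool is_substring(const char *x, const char *s) {
    return strstr(s, x) != NULL;
}

/* x is a subsequence of s: the symbols of x occur in s in order,
   not necessarily contiguously. */
bool is_subsequence(const char *x, const char *s) {
    while (*x != '\0' && *s != '\0') {
        if (*x == *s)
            x++;
        s++;
    }
    return *x == '\0';
}

/* Extra condition for "proper": x is nonempty and differs from s. */
bool is_proper(const char *x, const char *s) {
    return *x != '\0' && strcmp(x, s) != 0;
}
```

For example, is_prefix("ban", "banana") and is_subsequence("baaa", "banana") both hold, while is_prefix("nan", "banana") does not, even though nan is a substring.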

Page 12: Chapter 3. Lexical Analysis (1)

Definitions of operations on languages.

OPERATION: DEFINITION

union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }

concatenation of L and M, written LM: LM = { st | s is in L and t is in M }

Kleene closure of L, written L*: L* denotes "zero or more concatenations of" L.

positive closure of L, written L+: L+ denotes "one or more concatenations of" L.
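For finite languages the concatenation LM can be enumerated directly from its definition. The sketch below is only an illustration under stated assumptions: languages are given as arrays of C strings, results longer than MAX_LEN - 1 characters are truncated, and duplicates are not removed (a true set would drop them). The name concat_languages is hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 32   /* assumed capacity of one result string, including the NUL */

/* Write every string st with s in L and t in M into out, in order.
   Returns the number of strings written (at most cap). */
size_t concat_languages(const char *L[], size_t nl,
                        const char *M[], size_t nm,
                        char out[][MAX_LEN], size_t cap) {
    size_t k = 0;
    for (size_t i = 0; i < nl; i++) {
        for (size_t j = 0; j < nm; j++) {
            if (k == cap)
                return k;                          /* output array is full */
            snprintf(out[k], MAX_LEN, "%s%s", L[i], M[j]);
            k++;
        }
    }
    return k;
}
```

With L = {a, b} and M = {c, d} this produces ac, ad, bc, bd. The closures L* and L+ are infinite whenever L contains a nonempty string, so they cannot be enumerated this way.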

Page 13: Chapter 3. Lexical Analysis (1)

Regular Expressions

1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.

2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,

a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).

b) (r)(s) is a regular expression denoting L(r)L(s).

c) (r)* is a regular expression denoting (L(r))*.

d) (r) is a regular expression denoting L(r).

Page 14: Chapter 3. Lexical Analysis (1)

Examples on operations in regular expressions

Alphabet Σ = {a, b}
– a | b denotes {a, b}
– (a|b)(c|d) denotes {ac, ad, bc, bd} (this example also uses the symbols c and d)
– a* denotes {ε, a, aa, aaa, …}
– (a|b)* = (a*|b*)*
– aa* = a+, ε|a+ = a*
– (a|b) = (b|a)

Page 15: Chapter 3. Lexical Analysis (1)

Algebraic properties of regular expressions.

AXIOM                            DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
(rs)t = r(st)                    concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr   concatenation distributes over |
εr = r, rε = r                   ε is the identity element for concatenation
r* = (r|ε)*                      relation between * and ε
r** = r*                         * is idempotent

Page 16: Chapter 3. Lexical Analysis (1)

Regular Definitions

Regular definition
– d1 → r1
  d2 → r2
  …
  dn → rn

• Example
• letter → A | B | … | Z | a | b | … | z
• digit → 0 | 1 | … | 9
• id → letter ( letter | digit )*

Page 17: Chapter 3. Lexical Analysis (1)

Unsigned numbers

Pascal
digit → 0 | 1 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent

Page 18: Chapter 3. Lexical Analysis (1)

Notational Shorthands (1/2)

1. One or more instances. The unary postfix operator + means "one or more instances of." If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a's. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.

2. Zero or one instance. The unary postfix operator ? means "zero or one instance of." The notation r? is a shorthand for r|ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as shown on the next page.

Page 19: Chapter 3. Lexical Analysis (1)

Notational Shorthands (2/2)

3. Character classes. The notation [abc] where a, b, and c are alphabet symbols denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression

[A-Za-z][A-Za-z0-9]*

Regular definition for num using the + and ? operators:

digit → 0 | 1 | ··· | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
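The shorthand forms of num and of identifiers can be tried out with the POSIX regular-expression functions regcomp and regexec from <regex.h>. This is a sketch that assumes a POSIX system (these functions are not part of standard C); the helper full_match and the constants NUM_RE and ID_RE are hypothetical names, and the patterns are my transcription of the definitions into POSIX extended syntax.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* num: digit+ ( . digit+ )? ( E ( + | - )? digit+ )? */
static const char *const NUM_RE = "[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?";
/* id: [A-Za-z][A-Za-z0-9]* */
static const char *const ID_RE = "[A-Za-z][A-Za-z0-9]*";

/* True if the whole string s matches the POSIX extended regular expression re. */
bool full_match(const char *re, const char *s) {
    char anchored[256];
    regex_t compiled;
    /* Anchor the pattern so that it must cover all of s, not just a part of it. */
    if (snprintf(anchored, sizeof anchored, "^(%s)$", re) >= (int)sizeof anchored)
        return false;                      /* pattern too long for this sketch */
    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                      /* pattern did not compile */
    bool ok = regexec(&compiled, s, 0, NULL, 0) == 0;
    regfree(&compiled);
    return ok;
}
```

With these patterns, 3.1416, 0 and 6.02E23 are accepted as num, while a string such as "1." is rejected because optional_fraction requires at least one digit after the point.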

Page 20: Chapter 3. Lexical Analysis (1)

Nonregular set

{ w c w⁻¹ | w is a string of a's and b's }, where w⁻¹ is w reversed

A context-free grammar is required to describe this set; no regular expression denotes it.

Page 21: Chapter 3. Lexical Analysis (1)

Regular-expression patterns for tokens.

REGULAR EXPRESSION   TOKEN   ATTRIBUTE-VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE

Page 22: Chapter 3. Lexical Analysis (1)

Transition diagram

Finite-state automata: states and edges

Several examples are shown. See the next page; Fig. 3.14 is drawn from the preceding example.

Page 23: Chapter 3. Lexical Analysis (1)

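[Figure: start → state 9; state 9 goes to state 10 on a letter; state 10 loops to itself on a letter or digit; state 10 goes to accepting state 11 on any other character. State 11 is marked * (retract the input one character) and carries the action return(gettoken(), install_id()).]

Transition diagram for identifiers and keywords.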
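A transition diagram like this one is commonly coded as a loop around a switch on the current state. The sketch below simulates only states 9, 10 and 11; it is not the chapter's code. The name scan_id is hypothetical, and the action at state 11 is reduced to returning the lexeme length, so gettoken() and install_id() are not modeled.

```c
#include <ctype.h>
#include <stddef.h>

/* Simulate states 9, 10 and 11 on the start of input.
   Returns the length of the identifier or keyword lexeme found there,
   or 0 if the diagram fails (a full scanner would then try another diagram). */
size_t scan_id(const char *input) {
    int state = 9;
    size_t forward = 0;
    for (;;) {
        unsigned char c = (unsigned char)input[forward++];   /* read and advance */
        switch (state) {
        case 9:
            if (isalpha(c))
                state = 10;
            else
                return 0;          /* no edge from state 9 on this character */
            break;
        case 10:
            if (!isalnum(c))
                state = 11;        /* "other" edge; letters and digits stay in 10 */
            break;
        }
        if (state == 11) {
            forward--;             /* the * on state 11: retract one character */
            return forward;
        }
    }
}
```

On the input "count1 = 0" the function returns 6: the blank after count1 drives the automaton to state 11 and is then given back to the input, so it is available for the next token.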

Page 24: Chapter 3. Lexical Analysis (1)

Implementation with Lex

Regular definition → finite automata, transition diagram

Output as a C program: lexical analysis, pattern matching, …

Page 25: Chapter 3. Lexical Analysis (1)

Creating a lexical analyzer with Lex.

[Figure: three steps.
Lex source program lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens]

Page 26: Chapter 3. Lexical Analysis (1)

Lex program for the tokens of Fig. 3.10. (1/2)

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

Page 27: Chapter 3. Lexical Analysis (1)

Lex program for the tokens of Fig. 3.10. (2/2)

%%
{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = install_id(); return(ID); }
{number}  { yylval = install_num(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }
%%

install_id() {
    /* procedure to install the lexeme, whose first character is pointed to by
       yytext and whose length is yyleng, into the symbol table and return a
       pointer thereto */
}

install_num() {
    /* similar procedure to install a lexeme that is a number */
}

Page 28: Chapter 3. Lexical Analysis (1)

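Lookahead operator

In Lex, r1/r2 matches r1 only when it is followed by r2; the text matched by r2 is not consumed.

DO 5 I = 1.25 versus DO 5 I = 1,25
– DO/({letter}|{digit})*=({letter}|{digit})*,
– DO/{id}*={digit}*,

IF(I,J)=3 versus IF(condition) statement
– IF/\(.*\){letter}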
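The first DO pattern can be mimicked by hand to see what the lookahead is checking. This is a sketch under assumptions: the statement has already had its blanks removed and is in upper case, and the names skip_alnum and starts_with_do_keyword are hypothetical. It is not how Lex implements the / operator; it only tests the same condition.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Skip a run of letters and digits; return a pointer to the first character after it. */
static const char *skip_alnum(const char *p) {
    while (isalnum((unsigned char)*p))
        p++;
    return p;
}

/* True if stmt begins with the keyword DO, that is, DO is followed by
   (letter|digit)* = (letter|digit)* and then a comma. */
bool starts_with_do_keyword(const char *stmt) {
    if (strncmp(stmt, "DO", 2) != 0)
        return false;
    const char *p = skip_alnum(stmt + 2);   /* label and loop variable, e.g. 5I */
    if (*p != '=')
        return false;
    p = skip_alnum(p + 1);                  /* initial value, e.g. 1 */
    return *p == ',';                       /* a comma means a DO loop */
}
```

For DO5I=1,25 the check succeeds and only DO is taken as the keyword; for DO5I=1.25 it fails at the period, so the scanner is free to treat DO5I as an identifier.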