Chapter 2_Lexical Analysis


Transcript of Chapter 2_Lexical Analysis

  • 8/17/2019 Chapter 2_Lexical Analysis

    1/69

    Compiler Design (Comp382)

    Chapter 2: Lexical Analysis/Scanner

    By Esubalew Alemneh


Session 1: Lexical Analyzer

    The Role of the Lexical Analyzer

    What is a token?

    Tokens, Patterns and Lexemes

    Token Attributes

    Lexical Errors

    Input Buffering

    Specification of Tokens

    Recognition of Tokens


Lexical Analyzer

    Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer, or scanner.

    A lexer often exists as a single function which is called by a parser (syntax analyzer) or another function.

    The tasks done by a scanner are:

    Grouping input characters into tokens

    Stripping out comments, whitespace, and other separators

    Correlating error messages with the source program

    Macro expansion

    Passing tokens to the parser


The Role of a Lexical Analyzer

    [Figure: the lexical analyzer reads characters from the source program and, on each "get next token" request from the parser, passes back a token and its attribute value; both components consult the symbol table (e.g. for id entries).]


Lexical Analyzer …

    Sometimes lexical analyzers are divided into a cascade of two phases: scanning and lexical analysis. Advantages of this separation are:

    Simpler design - separating the two analyses

    High cohesion / low coupling - improved maintainability

    Compiler efficiency

    Liberty to apply specialized techniques that serve only lexical tasks, not the whole of parsing

    Speedup of reading input characters using specialized buffering techniques

    Compiler portability (e.g. Linux to Windows)

    Input device peculiarities are restricted to the lexical analyzer

    Enables integration of third-party lexers (lexical analysis tools)


Tokens correspond to sets of strings:

    Identifier: strings of letters or digits, starting with a letter

    Integer: a non-empty string of digits, e.g. 123 (but not 123.45)

    Keyword: "else" or "if" or "begin" or …

    Whitespace: a non-empty sequence of blanks, newlines, and tabs

    Symbols: +, -, *, /, =, ->, …

    Char (string) literals: "Hello", 'c'

    Comments and whitespace

    Tokens are the terminal symbols in a grammar.

    What is a token?


What do we want to do? Example:

    if (i == j)
        z = 0;
    else
        z = 1;

    The input is just a string of characters:

    \t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;

    Goal: partition the input string into substrings, where the substrings are tokens.

    Tokens: if, (, i, ==, j, ), z, =, 0, ;, else, 1, …

    What is a token? … Example


Lexeme: the actual sequence of characters that matches a pattern and has a given token class.

    Examples: Name, Data, x, 345, 2, 0, 629, …

    Token: a classification for a common set of strings.

    Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, …

    One token for all identifiers.

    Pattern: the rule that characterizes the set of strings for a token.

    Examples: integer: [0-9]+

    identifier: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*

    Tokens, Patterns and Lexemes


    Tokens, Patterns and Lexemes: Example


More than one lexeme can match a pattern, so the lexical analyzer must provide information about the particular lexeme that matched.

    For example, the pattern for token number matches 0, 1, 934, …, but the code generator must know which lexeme was found in the source program.

    Thus, the lexical analyzer returns to the parser a token name and an attribute value.

    For each lexeme the following kind of output is produced: (token-name, attribute-value)

    The attribute-value points to an entry in the symbol table for this token.

    Token Attributes


E = M * C ** 2

    is written as the sequence of pairs:

    <id, pointer to symbol-table entry for E> <assign_op> <id, pointer to entry for M> <mult_op> <id, pointer to entry for C> <exp_op> <number, integer value 2>

    Example of Attribute Values


The lexical analyzer can't detect all errors; it needs the other components.

    In what situations do errors occur?

    When none of the patterns for tokens matches any prefix of the remaining input. However, look at: fi(a==f(x))…

    This generates no lexical error in C; subsequent phases of the compiler generate the errors.

    Possible error recovery actions:

    Deleting or inserting input characters

    Replacing or transposing characters

    Or, skipping over to the next separator to ignore the problem

    Lexical Errors


Processing the large number of characters in a large source program consumes time, due to the overhead required to process a single input character.

    How to speed up the reading of the source program?

    Specialized buffering techniques have been developed:

    A two-buffer scheme that handles large lookaheads safely

    Sentinels, an improvement which saves the time spent checking buffer ends

    Buffer pair: buffer size N, where N = the size of a disk block (e.g. 4096 bytes)

    Read N characters into a buffer with one system call, i.e. not one call per character

    If fewer than N characters are read, we put eof to mark the end of the source file

    Input Buffering


Two pointers to the input are maintained:

    lexemeBegin - marks the beginning of the current lexeme

    forward - scans ahead until a pattern match is found

    Initially both pointers point to the first character of the next lexeme.

    The forward pointer scans; when a lexeme is found, forward is one position past its last character, so forward must be retracted one position to its left.

    After processing the lexeme, both pointers are set to the character immediately after the lexeme.

    Buffer Pair


In the preceding scheme, for each character read we make two tests:

    one to check whether we have moved off the end of one of the buffers, and

    one to determine what character was read.

    We can combine the two tests if we extend each buffer to hold a sentinel character (e.g. eof) at the end.

    Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.

    Sentinels


    Sentinels…Algorithm for advancing forward


There are two issues in lexical analysis:

    How to specify tokens? And how to recognize the tokens given a token specification (i.e. how to implement the nexttoken( ) routine)?

    How to specify tokens: tokens are specified by regular expressions.

    Regular expressions represent patterns of strings of characters. Although they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

    The set of strings generated by a regular expression r is denoted L(r).

    Specifying Tokens


Alphabet: a finite set of symbols, e.g. {a, b, c}

    A string/sentence/word over an alphabet is a finite sequence of symbols drawn from that alphabet.

    Operations:

    Concatenation of strings x and y: xy; εx = xε = x

    Exponentiation: s^0 = ε, s^i = s^(i-1) s for i > 0

    Do you remember the words related to parts of a string?

    prefix vs suffix

    proper prefix vs proper suffix

    substring

    subsequence

    Strings and Alphabets


A language is any countable set of strings over some fixed alphabet.

    Alphabet                   Language
    {0,1}                      {0, 10, 100, 1000, 10000, …}
    {0,1}                      {0, 1, 100, 000, 111, …}
    {a,b,c}                    {abc, aabbcc, aaabbbccc, …}
    {A…Z}                      {TEE, FORE, BALL, …}
    {A…Z}                      {FOR, WHILE, GOTO, …}
    {A…Z, a…z, 0…9, +, -, …}   {All legal Pascal programs}
                               {All grammatically correct English sentences}

    Special languages: Φ - the empty language; {ε} - the language containing only the empty string ε.

    Languages


Operations on Languages

    L = {A, B, C, D}    D = {1, 2, 3}

    L ∪ D = {A, B, C, D, 1, 2, 3}

    LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}

    L2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}

    L4 = L2 L2 = ??

    L* = {all possible strings over L, including ε}

    L+ = L* - {ε}

    L (L ∪ D) = ??

    L (L ∪ D)* = ??


Formal definition of a regular expression: given an alphabet Σ,

    (1) ε is a regular expression that denotes {ε}, the set that contains the empty string.

    (2) For each a ∈ Σ, a is a regular expression denoting {a}, the set containing the string a.

    (3) If r and s are regular expressions, then

    r | s is a regular expression denoting L(r) ∪ L(s)

    rs is a regular expression denoting L(r) L(s)

    (r)* is a regular expression denoting (L(r))*

    A regular expression is defined together with the language it denotes.

    Regular expressions are used to denote regular languages.

    Regular Expressions


    Regular Expressions: Example


Regular Expressions: Algebraic Properties

    All strings over the English alphabet that start with "tab" or end with "bat":

    tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat

    All strings over the English alphabet in which 1, 2, 3 appear in ascending order:

    {A,…,Z}* 1 {A,…,Z}* 2 {A,…,Z}* 3 {A,…,Z}*


Regular definitions give names to regular expressions, in order to construct more complicated regular expressions. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

    d1 → r1
    d2 → r2
    d3 → r3
    …

    where:

    Each di is a new symbol, not in Σ and not the same as any other of the d's

    Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}

    Regular Definition


Example

    C identifiers are strings of letters, digits, and underscores. The regular definition for the language of C identifiers:

    letter → A | B | C | … | Z | a | b | … | z | _
    digit → 0 | 1 | 2 | … | 9
    id → letter ( letter | digit )*

    Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:

    digit → 0 | 1 | 2 | … | 9
    digits → digit digit*
    optionalFraction → . digits | ε
    optionalExponent → ( E ( + | - | ε ) digits ) | ε
    number → digits optionalFraction optionalExponent

    Regular Definition: Example


Shorthand notations

    "+" (one or more): r* = r+ | ε and r+ = r r*

    "?" (zero or one): r? = r | ε

    [range] (set range of characters): [A-Z] = A | B | C | … | Z

    Example: C variable name: [A-Za-z][A-Za-z0-9]*

    Write regular definitions for phone numbers of the form (510) 643-1481:

    number → '(' area ')' exchange '-' phone
    phone → digit4
    exchange → digit3
    area → digit3
    digit → 0 | 1 | … | 9

    (where digitk denotes k repetitions of digit)

    Email addresses (exercise)

    Regular Definition…


Basically, regular grammars are used to represent regular languages. Regular grammars are either left- or right-linear:

    Right regular grammars: rules of the forms A → ε, A → a, A → aB, where A, B are variables and a is a terminal

    Left regular grammars: rules of the forms A → ε, A → a, A → Ba, where A, B are variables and a is a terminal

    Example: S → aS | bA,  A → cA | ε

    This grammar produces the language generated by the regular expression a*bc*.

    Regular Grammars


Recognition of tokens is the second issue in lexical analysis: how to recognize the tokens given a token specification.

    Given the grammar of branching statements, the terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens.

    The patterns for the given tokens are: [pattern table from the slide not reproduced in this transcript]

    Recognition of tokens


The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws defined by:

    ws → ( blank | tab | newline )+

    Recognition of tokens…


Recognition of tokens…

    The table shows which token name is returned to the parser and what attribute value is returned for each lexeme or family of lexemes.


Transition diagrams

    As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called "transition diagrams".

    Example: transition diagram for relop.

    A * is used to indicate that we must retract the input one position (go back).

    Recognition of tokens…


Token nexttoken( ) {
      while (1) {
        switch (state) {
        case 0:
          c = nextchar();
          if (c is white space) state = 0;    /* stay in state 0 */
          else if (c == '<') state = 3;       /* start of a relop */
          else if (isletter(c)) state = 10;   /* start of an identifier */
          else state = fail();
          break;
        case 10: ….
        case 11:
          retract(1); insert(id);
          return;
        }
      }
    }

    Recognition of tokens…

    Mapping transition diagrams into C code


Recognition of Reserved Words and Identifiers

    Recognizing keywords and identifiers presents a problem: keywords like if or then are reserved, but they look like identifiers.

    There are two ways we can handle this:

    Install the reserved words in the symbol table initially.

    Create separate transition diagrams for each keyword, e.g. the transition diagram for the reserved word then.

    Recognition of tokens…

    If we find an identifier, installID is called to place it in the symbol table.


Session 2

    Finite Automata

    Nondeterministic Finite Automata

    Deterministic Finite Automata

    From Regular Expressions to Automata

    Conversion of NFA to DFA

    State Minimization

    Design of a Lexical-Analyzer Generator

    Implementing Scanners


At the heart of the transition diagram is the formalism known as finite automata.

    A finite automaton is a finite-state transition diagram that can be used to model the recognition of a token type specified by a regular expression.

    Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.

    A finite automaton can be a:

    Non-deterministic finite automaton (NFA)

    Deterministic finite automaton (DFA)

    DFAs and NFAs are capable of recognizing the same languages.

    Finite Automata


    Nondeterministic Finite Automata


An NFA can be diagrammatically represented by a labeled directed graph called a transition graph.

    The set of states = {A, B, C, D}

    Input symbols = {0, 1}

    Start state is A, accepting state is D

    NFA: Transition Graphs and Transition Tables

    The transition graph shown in the preceding slide can be represented by the transition table:

    State    0        1
    A        {A,B}    {A,C}
    B        {D}      --
    C        --       {D}
    D        {D}      {D}


Acceptance of Input Strings by Automata

    An NFA accepts an input string str iff there is some path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out str.

    The language recognized by an NFA is the set of strings it accepts.

    Example: the language recognized by this NFA is aa* | bb*

    Nondeterministic Finite Automata…


A deterministic finite automaton (DFA) is a special case of an NFA where:

    There are no moves on input ε, and

    For each state q and input symbol a, there is exactly one edge out of q labeled a.

    Example: the language recognized by this DFA is also (b*aa*(aa*b)+)

    Deterministic Finite Automata


Let us assume that the end of a string is marked with a special symbol (say eos). The algorithm for recognition is as follows (an efficient implementation):

    q = q0                  { start from the initial state }
    c = nextchar            { get the next character from the input string }
    while (c != eos) do     { repeat until the end of the string }
    begin
      q = move(q, c)        { transition function }
      c = nextchar
    end
    if (q in F) then        { if q is an accepting state }
      return "yes"
    else
      return "no"

    Implementing a DFA


Q = ε-closure({q0})    { the set of all states accessible from q0 by ε-transitions }
    c = nextchar
    while (c != eos) do
    begin
      Q = ε-closure(move(Q, c))    { all states accessible from a state in Q by a transition on c, plus their ε-closures }
      c = nextchar
    end
    if (Q ∩ F != Φ) then    { if Q contains an accepting state }
      return "yes"
    else
      return "no"

    This algorithm is not efficient.

    Implementing an NFA


ε-closure(Q): the set of states reachable from some state q in Q on ε-transitions alone.

    move(Q, c): the set of states to which there is a transition on input symbol c from some state q in Q.

    Computing the ε-closure(Q)


Given the NFA below, find ε-closure(s), i.e. all states that can be reached through ε-transitions only.

    Computing the ε-closure(Q) … Example


For each kind of regular expression, there is a corresponding NFA; i.e. any regular expression can be converted to an NFA that defines the same language.

    The algorithm is syntax-directed, in the sense that it works recursively up the parse tree for the regular expression. For each sub-expression the algorithm constructs an NFA with a single accepting state.

    Algorithm: the McNaughton-Yamada-Thompson algorithm to convert a regular expression to an NFA.

    INPUT: A regular expression r over alphabet Σ.

    OUTPUT: An NFA N accepting L(r).

    From Regular Expressions to NFA


Method: begin by parsing r into its constituent sub-expressions. The rules for constructing an NFA consist of basis rules for handling sub-expressions with no operators, and inductive rules for constructing larger NFAs from the NFAs for the immediate sub-expressions of a given expression.

    For expression ε, construct the NFA shown on the slide.

    For any symbol a in Σ, construct the NFA shown on the slide.

    NFA for the union of two regular expressions, e.g. a|b.

    From Regular Expressions to NFA…


NFA for r1r2

    NFA for r*

    Example: (a|b)*

    From Regular Expressions to NFA…


Example: constructing an NFA for the regular expression r = (a|b)*abb

    Step 1: construct NFAs for a and b

    Step 2: construct a | b

    Step 3: construct (a|b)*

    Step 4: concatenate it with a, then b, then b

    From Regular Expressions to NFA…


Why?

    A DFA is difficult to construct directly from REs.

    An NFA is difficult to represent in a computer program and inefficient to simulate.

    So: RE → NFA → DFA. Conversion algorithm: subset construction.

    The idea is that each DFA state corresponds to a set of NFA states: after reading input a1 a2 … an, the DFA is in the state that represents the subset Q of the states of the NFA that are reachable from the start state.

    INPUT: An NFA N.

    OUTPUT: A DFA D accepting the same language as N.

    ALGORITHM: …

    Conversion of NFA to DFA


Dstates := { ε-closure(q0) }    // initially all states are unmarked

    while (there is an unmarked state Q in Dstates) do
    begin
      mark Q;
      for (each input symbol a) do
      begin
        U := ε-closure( move(Q, a) );
        if (U is not in Dstates) then
          add U as an unmarked state to Dstates;
        Dtran[Q, a] := U;
      end
    end

    Subset Construction Algorithm



The start state A of the equivalent DFA is ε-closure(0): A = {0, 1, 2, 4, 7}, since these are exactly the states reachable from state 0 via a path all of whose edges are labeled ε. Note that a path can have zero edges, so state 0 is reachable from itself by an ε-labeled path.

    Subset Construction Algorithm… Example: NFA to DFA


The input alphabet is {a, b}. Thus, our first step is to mark A and compute Dtran[A, a] = ε-closure(move(A, a)) and Dtran[A, b] = ε-closure(move(A, b)).

    Among the states 0, 1, 2, 4, and 7, only 2 and 7 have transitions on a, to 3 and 8, respectively. Thus move(A, a) = {3, 8}. Also, ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}, so we call this set B and let Dtran[A, a] = B.

    Next we compute Dtran[A, b]: among the states in A, only 4 has a transition on b, and it goes to 5.

    Subset Construction Algorithm…


So C = ε-closure({5}) = {1, 2, 4, 5, 6, 7}, and Dtran[A, b] = C.

    If we continue this process with the unmarked sets B and C, we eventually reach a point where all the states of the DFA are marked.

    Subset Construction Algorithm…


NFA to DFA conversion: Example 2

    [NFA: states 0-3; start state 0; state 0 has transitions on a to {0, 1} and on b to {0}; 1 -b-> 2; 2 -b-> 3; accepting state 3]

    ε-closure(0) = {0}

    move(0, a) = {0,1}        move(0, b) = {0}
    move({0,1}, a) = {0,1}    move({0,1}, b) = {0,2}
    move({0,2}, a) = {0,1}    move({0,2}, b) = {0,3}

    New states: A = {0}, B = {0,1}, C = {0,2}, D = {0,3}

          a    b
    A     B    A
    B     B    C
    C     B    D
    D     B    A


[Resulting DFA: start state A; accepting state D; A -a-> B, A -b-> A, B -a-> B, B -b-> C, C -a-> B, C -b-> D, D -a-> B, D -b-> A]

    NFA to DFA conversion: Example 2…


NFA to DFA conversion: Example 3

    Due to ε-transitions, we must compute ε-closure(S).

    Example: ε-closure(0) = {0, 3}

    [ε-NFA diagram not fully recoverable from the transcript]

    Dstates := { ε-closure(1) } = { {1,2} }
    U := ε-closure(move({1,2}, a)) = {3,4,5}; add {3,4,5} to Dstates
    U := ε-closure(move({1,2}, b)) = {}
    ε-closure(move({3,4,5}, a)) = {5}
    ε-closure(move({3,4,5}, b)) = {4,5}
    ε-closure(move({4,5}, a)) = {5}
    ε-closure(move({4,5}, b)) = {5}


[Resulting DFA: start state A]

    State         a    b
    A = {1,2}     B    --
    B = {3,4,5}   D    C
    C = {4,5}     D    D
    D = {5}       --   --

    NFA to DFA conversion: Example 3…


NFA to DFA conversion: Exercise

    Convert the ε-NFA shown below into a DFA, using subset construction.


Putting the Pieces Together

    Regular expression R → [RE → NFA conversion] → NFA → [NFA → DFA conversion] → DFA → [simulation on input string w] → "Yes" if w ∈ L(R), "No" if w ∉ L(R)


If we implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as possible, since each state requires entries in the table that describes the lexical analyzer.

    There is always a unique minimum-state DFA for any regular language. Moreover, this minimum-state DFA can be constructed from any DFA for the same language by grouping sets of equivalent states.

    An algorithm to minimize the number of states of a DFA:

    INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting states F.

    OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.

    State minimization


State minimization…

    Example: the input alphabet is {a, b}, with the DFA shown below.


1. Initially the partition consists of two groups: the non-final states {A, B, C, D} and the final states {E}.

    2. To construct Πnew, group {E} cannot be split.

    3. Group {A, B, C, D} can be split into {A, B, C}{D}, so Πnew for this round is {A, B, C}{D}{E}.


[Minimized DFA: four states {A,C}, {B}, {D}, {E}; accepting state E]

    Minimized DFA…

In the next round, we split {A, B, C} into {A, C}{B}, since A and C each go to a member of {A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the second round, Πnew = {A, C}{B}{D}{E}.

    For the third round, we cannot split the one remaining group with more than one state, since A and C each go to the same state (and therefore to the same group) on each input. Πfinal = {A, C}{B}{D}{E}. The minimum-state version of the given DFA has four states.


Minimized DFA: Exercise

    Minimize the states of the following DFA (first convert the NFA to a DFA).

    [NFA diagram with states A, B, C not recoverable from the transcript]


The lexical analyzer has to recognize the longest possible string.

    Ex: identifier newval: n, ne, new, newv, newva, newval

    What is the end of a token? Is there any character which marks the end of a token? It is normally not defined.

    If the number of characters in a token is fixed, there is no problem: + and - are fine, but < may be the start of < or <= or <> (in Pascal).

    The end of an identifier: the characters that cannot appear in an identifier can mark the end of the token.

    We may need a lookahead. In Prolog: p :- X is 1.    p :- X is 1.5.

    The dot followed by a whitespace character can mark the end of a number; but if that is not the case, the dot must be treated as part of the number.

    Some Other Issues in Lexical Analyzer


We have learnt the formalism for lexical analysis: regular expressions (REs).

    How do we actually get the lexical analyzer?

    Solution 1: implement using a tool, such as Lex (for C), Flex (for C++), or JLex (for Java).

    The programmer specifies the interesting tokens using REs.

    The tool generates the source code from the given REs.

    Solution 2: write the code from scratch.

    This is also the shape of the code that the tool generates: the code is table-driven and is based on a finite state automaton.

    Implementing the Scanner


Big difference from your previous coding experience:

    Writing REs instead of the code itself

    Writing actions associated with each RE

    Internal structure of Lex: the final states of the DFA are associated with actions.

    The details about Lex will be discussed in laboratory sessions.

    Lex: a Tool for Lexical Analysis

    Assignment 1 (10%)

    Assignment 1 (10%)


1. Implement a DFA and an NFA. The program should accept states, alphabet, transition function, start state, and set of final states. Then the program should check whether a string provided by a user is accepted or rejected. You can use either Java or C++ for the implementation.

    2. Convert the following NFA to a DFA.

    3. Minimize the number of states of the DFA you obtained as an answer for question 2.


Reading Assignment

    Constructing a DFA directly from a regular expression