14823_02. Chapter 3 - Lexical Analysis



    Chapter 3

    Lexical Analysis


Outline

Role of lexical analyzer

Specification of tokens

Recognition of tokens

Lexical analyzer generator

    Finite automata

    Design of lexical analyzer generator


The role of lexical analyzer

(Diagram: the lexical analyzer reads the source program and returns a token each time the parser calls getNextToken; both components consult the symbol table, and the parser's output continues on to semantic analysis.)
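A minimal sketch in C of the interface described above (the type and function bodies are illustrative, not taken from the slides): the parser repeatedly calls getNextToken(), and each token carries a name plus an optional attribute value.

#include <stdio.h>

/* Illustrative token names; a real lexer would list all of them. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_RELOP, TOK_EOF } TokenName;

typedef struct {
    TokenName name;       /* token name, e.g. TOK_ID */
    int       attribute;  /* optional value, e.g. a symbol-table index */
} Token;

/* The lexical analyzer: reads source characters and returns the next token.
   A real implementation would also consult and update the symbol table. */
Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };   /* stub: immediately reports end of input */
    return t;
}

/* The parser drives the lexer by demanding tokens one at a time. */
int main(void) {
    Token t = getNextToken();
    while (t.name != TOK_EOF) {
        printf("token %d, attribute %d\n", t.name, t.attribute);
        t = getNextToken();
    }
    return 0;
}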


Why separate lexical analysis and parsing?

1. Simplicity of design

    2. Improving compiler efficiency

    3. Enhancing compiler portability


Tokens, Patterns and Lexemes

A token is a pair of a token name and an optional token value

A pattern is a description of the form that the lexemes of a token may take

A lexeme is a sequence of characters in the source program that matches the pattern for a token


Example

Token        Informal description                       Sample lexemes
if           Characters i, f                            if
else         Characters e, l, s, e                      else
comparison   < or > or <= or >= or == or !=             <=, !=
id           Letter followed by letters and digits      pi, score, D2
number       Any numeric constant                       3.14159, 0, 6.02e23
literal      Anything but ", surrounded by "s           "core dumped"


Attributes for tokens

E = M * C ** 2
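For this statement the lexical analyzer would typically emit the token/attribute pairs below (the usual textbook rendering; the exact attribute representation is implementation dependent):

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>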


Lexical errors

Some errors are beyond the power of the lexical analyzer to recognize:

fi (a == f(x))

However, it may be able to recognize errors like:

d = 2r

Such errors are recognized when no pattern for tokens matches a character sequence


Error recovery

Panic mode: successive characters are ignored until we reach a well-formed token

Delete one character from the remaining input

Insert a missing character into the remaining input

Replace a character by another character

Transpose two adjacent characters


Input buffering

Sometimes the lexical analyzer needs to look ahead some symbols to decide what token to return

In C: we need to look beyond -, = or < to decide what token to return

In Fortran: DO 5 I = 1.25

We need to introduce a two-buffer scheme to handle large look-aheads safely

    E = M * C * * 2 eof
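A minimal sketch of the two-buffer scheme in C (the buffer size N and all names are illustrative): two halves are reloaded alternately, and the extra slot at the end of each half holds the sentinel used on the next slide.

#include <stdio.h>

#define N 4096                      /* size of each buffer half */
#define SENTINEL '\0'               /* stand-in for the eof marker */

static char buffer[2 * (N + 1)];    /* two halves, each with room for a sentinel */
static char *lexemeBegin = buffer;  /* start of the current lexeme */
static char *forward     = buffer;  /* scanning pointer that may look ahead */

/* Reload one half from the source and terminate it with the sentinel,
   so the scanner notices the boundary without testing two pointers. */
static void reload(FILE *src, char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

int main(void) {
    reload(stdin, buffer);          /* fill the first half before scanning */
    (void)lexemeBegin; (void)forward;
    return 0;
}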


Sentinels

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters
}

E = M eof   * C * * 2 eof   eof    (each buffer half ends with a sentinel eof)


Specification of tokens

In the theory of compilation, regular expressions are used to formalize the specification of tokens

Regular expressions are a means for specifying regular languages

Example:

letter_ (letter_ | digit)*

Each regular expression is a pattern specifying the form of strings
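As a concrete illustration, a hand-written recognizer for this identifier pattern might look like the following sketch in C (the function name is illustrative):

#include <ctype.h>
#include <stdio.h>

/* Does s match letter_ (letter_ | digit)* ? */
static int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                              /* must start with a letter or '_' */
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                          /* rest: letters, digits, or '_' */
    return 1;
}

int main(void) {
    printf("%d %d\n", is_identifier("count_1"), is_identifier("2bad"));  /* prints 1 0 */
    return 0;
}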


Regular expressions

ε is a regular expression, L(ε) = {ε}

If a is a symbol in Σ then a is a regular expression, L(a) = {a}

(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)

(r)(s) is a regular expression denoting the language L(r)L(s)

(r)* is a regular expression denoting (L(r))*

(r) is a regular expression denoting L(r)

Example: (a | b)*abb denotes the set of all strings of a's and b's ending in abb


Regular definitions

d1 -> r1
d2 -> r2
...
dn -> rn

Example:

letter_ -> A | B | ... | Z | a | b | ... | z | _

digit -> 0 | 1 | ... | 9

    id -> letter_ (letter_ | digit)*


Extensions

One or more instances: (r)+

Zero or one instance: r?

Character classes: [abc]

Example:

letter_ -> [A-Za-z_]

digit -> [0-9]

id -> letter_ (letter_ | digit)*
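The same character-class notation is understood by POSIX extended regular expressions, so the id pattern can also be checked directly with regcomp/regexec; a minimal sketch in C (error handling kept to a minimum):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* ^ and $ anchor the whole string; [A-Za-z_][A-Za-z_0-9]* is the id pattern */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z_0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    printf("%d %d\n",
           regexec(&re, "letter_1", 0, NULL, 0) == 0,    /* 1: matches  */
           regexec(&re, "9lives",   0, NULL, 0) == 0);   /* 0: no match */
    regfree(&re);
    return 0;
}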


Recognition of tokens

The starting point is the language grammar, to understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε

    expr -> term relop term

    | term

    term -> id

    | number


Recognition of tokens (cont.)

The next step is to formalize the patterns:

digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>

We also need to handle whitespace:

    ws -> (blank | tab | newline)+


Transition diagrams

Transition diagram for relop


Transition diagrams (cont.)

Transition diagram for reserved words and identifiers


Transition diagrams (cont.)

Transition diagram for unsigned numbers


Transition diagrams (cont.)

Transition diagram for whitespace


Transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
            if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1:
            /* ... cases for the remaining states ... */
        case 8: retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}


Lexical Analyzer Generator - Lex

Lex source program lex.l --> Lex compiler --> lex.yy.c

lex.yy.c --> C compiler --> a.out

Input stream --> a.out --> Sequence of tokens


Structure of Lex programs

declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form:  pattern  { action }


Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}

%%

int installID() {/* function to install the lexeme, whose first
    character is pointed to by yytext, and whose length is yyleng,
    into the symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
    constants into a separate table */
}
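With the flex implementation of Lex, a specification like this would typically be turned into a scanner with something like "flex lex.l" followed by "cc lex.yy.c -lfl"; the -lfl library supplies a default main and yywrap, though the details vary between Lex versions.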


Finite Automata

Regular expressions = specification
Finite automata = implementation

A finite automaton consists of:
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions: state --(input)--> state


Finite Automata

Transition: s1 --a--> s2

Is read: in state s1, on input a, go to state s2

If at end of input: if in an accepting state => accept, otherwise => reject

    If no transition possible => reject


Finite Automata State Graphs

A state

The start state

An accepting state

A transition labeled a


A Simple Example

A finite automaton that accepts only the string "1"

A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state

(Diagram: a start state with a single transition labeled 1 into an accepting state.)
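A direct C rendering of this tiny automaton (a sketch; the state encoding and function name are illustrative) shows how the pieces of the formal definition map to code: a current state, one transition per input character, and an acceptance check at the end of the input.

#include <stdio.h>

/* DFA that accepts exactly the string "1".
   States: 0 = start, 1 = accepting, 2 = dead state (no way back). */
static int accepts_only_1(const char *input) {
    int state = 0;                      /* start state */
    for (; *input; input++) {
        if (state == 0 && *input == '1')
            state = 1;                  /* the single labeled transition */
        else
            state = 2;                  /* no transition possible: reject via dead state */
    }
    return state == 1;                  /* accept iff we end in the accepting state */
}

int main(void) {
    printf("%d %d %d\n",
           accepts_only_1("1"), accepts_only_1("11"), accepts_only_1(""));  /* 1 0 0 */
    return 0;
}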


Another Simple Example

A finite automaton accepting any number of 1s followed by a single 0

Alphabet: {0,1}

Check that "1110" is accepted but "110..." is not

(Diagram: a start state with a self-loop labeled 1 and a transition labeled 0 into an accepting state.)


And Another Example

Alphabet: {0,1}

What language does this recognize?

(Transition diagram omitted.)


And Another Example

Alphabet still {0,1}

The operation of the automaton is not completely defined by the input

On input 11 the automaton could be in either state

(Diagram: a state with two different transitions on input 1.)


Epsilon Moves

Another kind of transition: ε-moves

The machine can move from state A to state B without reading input

A --ε--> B


Nondeterministic Automata

Deterministic Finite Automata (DFA)
One transition per input per state
No ε-moves

Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input in a given state
Can have ε-moves

Finite automata have finite memory
Need only to encode the current state


Execution of Finite Automata

A DFA can take only one path through the state graph
Completely determined by the input

NFAs can choose
Whether to make ε-moves
Which of multiple transitions for a single input to take


Acceptance of NFAs

An NFA can get into multiple states

(Diagram omitted.)

Input: 1 0 1

Rule: an NFA accepts if it can get into a final state


NFA vs. DFA (1)

NFAs and DFAs recognize the same set of languages (regular languages)

DFAs are easier to implement
There are no choices to consider


NFA vs. DFA (2)

For a given language the NFA can be simpler than the DFA

(NFA and DFA diagrams omitted.)

    DFA can be exponentially larger than NFA


Regular Expressions to Finite Automata

High-level sketch:

Lexical specification --> Regular expressions --> NFA --> DFA --> Table-driven implementation of DFA


Regular Expressions to NFA (1)

For each kind of rexp, define an NFA

Notation: NFA for rexp A

For ε: a start state with an ε-transition to an accepting state

For input a: a start state with a transition labeled a to an accepting state


Regular Expressions to NFA (2)

For AB: chain the NFA for A into the NFA for B (A's accepting state feeds B's start state)

For A | B: a new start state with ε-transitions into the NFAs for A and B, whose accepting states lead to a new accepting state


Regular Expressions to NFA (3)

For A*: a new start state that is also accepting, with an ε-move into the NFA for A and an ε-move from A's accepting state back out, so that A can be traversed zero or more times
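One common way to render Thompson's construction in code gives every operator a fresh start and accepting state glued together with ε-edges; the following C sketch (all names illustrative, at most two outgoing edges per state) builds the NFA for the expression used in the next example.

#include <stdlib.h>
#include <stdio.h>

typedef struct State {
    int symbol;                 /* 0 means the out-edges are epsilon-moves */
    struct State *out1, *out2;  /* at most two outgoing edges per state */
} State;

typedef struct { State *start, *accept; } Frag;  /* accept has no out-edges yet */

static State *new_state(int symbol, State *o1, State *o2) {
    State *s = malloc(sizeof *s);
    s->symbol = symbol; s->out1 = o1; s->out2 = o2;
    return s;
}

/* NFA for a single input symbol a: start --a--> accept */
static Frag nfa_sym(int a) {
    Frag f;
    f.accept = new_state(0, NULL, NULL);
    f.start  = new_state(a, f.accept, NULL);
    return f;
}

/* NFA for AB: A's accepting state gets an epsilon-edge into B's start */
static Frag nfa_cat(Frag a, Frag b) {
    a.accept->out1 = b.start;
    Frag f = { a.start, b.accept };
    return f;
}

/* NFA for A|B: new start with epsilon-edges into A and B; both accepting
   states get epsilon-edges into a new accepting state */
static Frag nfa_alt(Frag a, Frag b) {
    Frag f;
    f.accept = new_state(0, NULL, NULL);
    a.accept->out1 = f.accept;
    b.accept->out1 = f.accept;
    f.start = new_state(0, a.start, b.start);
    return f;
}

/* NFA for A*: new start/accept pair; A can be skipped or repeated */
static Frag nfa_star(Frag a) {
    Frag f;
    f.accept = new_state(0, NULL, NULL);
    f.start  = new_state(0, a.start, f.accept);  /* enter A, or skip it entirely */
    a.accept->out1 = a.start;                    /* loop back for another pass */
    a.accept->out2 = f.accept;                   /* or leave the loop */
    return f;
}

int main(void) {
    /* Build the NFA for (1|0)*1, the regular expression of the next slide. */
    Frag re = nfa_cat(nfa_star(nfa_alt(nfa_sym('1'), nfa_sym('0'))), nfa_sym('1'));
    printf("built NFA: start=%p accept=%p\n", (void *)re.start, (void *)re.accept);
    return 0;
}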


Example of RegExp -> NFA conversion

Consider the regular expression (1 | 0)*1

The NFA is:

(NFA diagram with states A through J omitted; the 1-labeled edges are C->E and I->J, the 0-labeled edge is D->F, and the remaining edges are ε-moves.)


Next

Lexical specification --> Regular expressions --> NFA --> DFA --> Table-driven implementation of DFA


NFA to DFA: The Trick

Simulate the NFA

Each state of the resulting DFA = a non-empty subset of the states of the NFA

Start state = the set of NFA states reachable through ε-moves from the NFA start state

Add a transition S --a--> S' to the DFA iff S' is the set of NFA states reachable from the states in S after seeing the input a, considering ε-moves as well
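A small sketch of the two operations this rule needs, written in C over a bitset encoding of NFA states (the encoding and names are illustrative): the ε-closure of a set of states, and the move on one input symbol.

#include <stdio.h>

#define MAXSTATES 32   /* enough for a small NFA; one bit per state */

typedef struct {
    int nstates;
    unsigned eps[MAXSTATES];   /* eps[s] = bitset of epsilon-successors of s */
    int on[MAXSTATES][2];      /* on[s][a] = successor of s on symbol a, or -1 */
} NFA;

/* All NFA states reachable from the states in set via epsilon-moves only. */
static unsigned eps_closure(const NFA *n, unsigned set) {
    unsigned result = set, old;
    do {
        old = result;
        for (int s = 0; s < n->nstates; s++)
            if (result & (1u << s))
                result |= n->eps[s];
    } while (result != old);          /* repeat until no new state is added */
    return result;
}

/* All NFA states reachable from the states in set on one input symbol a. */
static unsigned move(const NFA *n, unsigned set, int a) {
    unsigned result = 0;
    for (int s = 0; s < n->nstates; s++)
        if ((set & (1u << s)) && n->on[s][a] >= 0)
            result |= 1u << n->on[s][a];
    return result;
}

int main(void) {
    /* Tiny NFA: state 0 --eps--> 1, state 1 --symbol 1--> 2. */
    NFA n = { 3,
              { 1u << 1, 0, 0 },
              { { -1, -1 }, { -1, 2 }, { -1, -1 } } };
    unsigned start = eps_closure(&n, 1u << 0);            /* DFA start state = {0,1} */
    unsigned next  = eps_closure(&n, move(&n, start, 1)); /* on symbol 1: {2} */
    printf("start=%#x next=%#x\n", start, next);          /* start=0x3 next=0x4 */
    return 0;
}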


NFA -> DFA Example

For the NFA of (1 | 0)*1 with states A through J, the subset construction gives a DFA with three states:

ABCDHI (start state)
FGABCDHI
EJGABCDHI (accepting, since it contains J)

Transitions:
ABCDHI:    0 -> FGABCDHI,   1 -> EJGABCDHI
FGABCDHI:  0 -> FGABCDHI,   1 -> EJGABCDHI
EJGABCDHI: 0 -> FGABCDHI,   1 -> EJGABCDHI


NFA to DFA. Remark

An NFA may be in many states at any time

How many different states?

If there are N states, the NFA must be in some subset of those N states

How many non-empty subsets are there?
2^N - 1, i.e. finitely many, but exponentially many


Implementation

A DFA can be implemented by a 2D table T
One dimension is states
The other dimension is input symbols

For every transition Si --a--> Sk define T[i,a] = k

DFA execution:
If in state Si and the input is a, read T[i,a] = k and go to state Sk
Very efficient


Table Implementation of a DFA

For a DFA with states S, T, U over the alphabet {0, 1}:

        0    1
S       T    U
T       T    U
U       T    U
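A table-driven run over exactly this table might look like the following C sketch (the encoding is illustrative): states S, T, U become 0, 1, 2, and table[i][a] plays the role of T[i,a].

#include <stdio.h>

enum { S, T, U };   /* states encoded as 0, 1, 2 */

static const int table[3][2] = {
    /*        0  1 */
    /* S */ { T, U },
    /* T */ { T, U },
    /* U */ { T, U },
};

/* Run the DFA on a string of '0'/'1' characters and return the final state. */
static int run(const char *input) {
    int state = S;
    for (; *input; input++)
        state = table[state][*input - '0'];   /* T[i,a] = k: go to state k */
    return state;
}

int main(void) {
    printf("final state: %d\n", run("0101"));  /* S ->0 T ->1 U ->0 T ->1 U, prints 2 */
    return 0;
}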


Implementation (Cont.)

NFA -> DFA conversion is at the heart of tools such as flex or jflex

But, DFAs can be huge

In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations